Review

Text Mining Applications in the Construction Industry: Current Status, Research Gaps, and Prospects

1
School of Mechanics and Civil Engineering, China University of Mining and Technology, Xuzhou 221008, China
2
Tianjin Jingang Construction Co., Ltd., Tianjin 300456, China
*
Author to whom correspondence should be addressed.
Sustainability 2022, 14(24), 16846; https://doi.org/10.3390/su142416846
Submission received: 17 November 2022 / Revised: 10 December 2022 / Accepted: 13 December 2022 / Published: 15 December 2022
(This article belongs to the Special Issue Smart City Construction and Urban Resilience)

Abstract

With the advent of the Industry 4.0 era, information technology has been widely developed and applied in the construction engineering field. Text mining (TM) techniques can extract interesting and important data hidden in plain text, potentially allowing problems in the construction field to be addressed. Although TM techniques have been used in the construction field for many years, there is a lack of recent reviews focused on their development and application from a literature analysis perspective; therefore, we conducted a review with the aim of filling this gap. We used a combination of bibliometric and manual literature analyses to systematically review the TM-based literature related to the construction field from 1997 to 2022. Specifically, publication analysis, collaboration analysis, co-citation analysis, and keyword analysis were conducted on 185 articles collected from the SCOPUS database. Based on a read-through of the 185 papers, the current research topics in TM were manually determined and sorted, including tasks and methods, application areas, and core methods and algorithms. The presented results provide a comprehensive understanding of the current state of TM techniques, thereby contributing to their further development in the construction industry.

1. Introduction

Text mining (TM) refers to the process of extracting interesting and non-trivial information hidden in plain text, in order to obtain useful insights [1,2]. It has been applied to a wide range of industries, including biography, medicine, and manufacturing [3,4,5]. Studies in these areas have shown that TM techniques can offer valuable information for decision making and improving industrial productivity.
Because construction projects and their organizations are one-off in nature, practitioners and academics tend to rely on experience-based information to make decisions. Experience-based methods, such as brainstorming, the Delphi method, questionnaires, interviews, literature studies, and their combinations, are primarily used to collect data and information [6,7,8]. This may lead to biased judgements, and the results are limited by the number and knowledge of the experts involved. Moreover, as the construction industry develops, the density of information is increasing. A large share of this information (over 80%) is stored in daily project management documents or standardized texts, which makes traditional methods of information retrieval and management difficult to use [9,10]. Text documents contain a great deal of potentially valuable information, but it is difficult for individuals or researchers to transform this information into knowledge. Therefore, in the context of Industry 4.0, there is a need to introduce emerging information technologies for the mining and study of text-based information in the construction field.
TM has recently been applied in the construction industry with the aim of extracting useful information from unstructured text documents that was not previously known and is not easily revealed. However, studies related to TM technology in the construction industry are still in the early stages, and the research is very fragmented. There have been some review studies related to TM techniques based on qualitative analysis [11], but few studies have examined TM technology in the construction industry from a bibliometric and visualization perspective. Most existing reviews have focused on a single aspect of the construction field (e.g., construction management or safety management), and therefore do not cover the overall status of text mining applications in the construction industry. To address these limitations, in this paper, we use a combination of bibliometrics and manual analysis to analyze the research on text mining technology in the construction field. The aim is to provide a more comprehensive introduction to current research hotspots and future development trends, to help readers understand the current situation of text mining in the construction field, and to provide a reference that helps experts, scholars, and engineering practitioners grasp the research themes and trends of text mining technology in the construction field.
This paper presents a systematic and comprehensive review of the application of text mining techniques in the field of construction from 1997 to 2022, including publication analysis, collaboration analysis, co-citation analysis, keyword analysis, and topic analysis, as well as detailing some commonly used algorithmic models. Section 2 describes the data collection process and methodology; Section 3 provides a systematic bibliometric analysis of the data set using software; Section 4 introduces some specific methodological models for text analysis; Section 5 summarizes the specific themes of text mining applications in the construction field; Section 6 discusses future research directions for text mining in construction; and, finally, Section 7 summarizes the paper.

2. Data and Methods

For this paper, we used SCOPUS—a foreign language database—to obtain the literature data set by defining query keywords. After manual reading, we eliminated the papers that did not match the topic, and used two software programs, VOSviewer (version 1.6.18) and CiteSpace (version 5.6.R5), to conduct systematic analyses.

2.1. Research Framework and Methodology

The research framework is shown in Figure 1.
  • For data collection, we used SCOPUS, a representative journal search engine.
  • In the data cleaning step, we aimed to filter out unrelated articles and papers, as well as normalize synonymous keywords to improve the performance of the bibliographic analysis.
  • In the bibliographic analysis, we used a word co-occurrence network (CCN) for analysis and visualization, aiming to reveal the trends of publication, most productive authors, co-authorships, and collaboration between organizations.
  • In the theme and topic status analysis, we combined human analysis and a knowledge map to discover the current research topics related to TM in the construction industry, considering the perspectives of applications of TM, tasks of TM, and TM techniques (algorithms). In this step, we also determined the research gaps and potential future research directions.
Science mapping is an information representation technique for finding and visualizing useful information; for example, by presenting it in a knowledge map [12]. VOSviewer (Centre for Science and Technology Studies, Leiden University, Leiden, The Netherlands) and CiteSpace (Chaomei Chen, Drexel University, Philadelphia, PA, USA) are the most widely used software for bibliometric analysis [13]. For this study, CiteSpace was used to depict co-authorship and co-citation networks, as well as the keyword timeline. Meanwhile, VOSviewer was used for clustering of the text data (including title and abstract).
In summary, the use of these three different approaches—bibliometric analysis, content analysis, and a quantitative systematic literature review—appropriately complemented each other and allowed us to uncover the theoretical foundations and structure of TM research, as well as to identify key research themes, uncover research gaps, and outline areas for future research.

2.2. Data Collection

The literature data were downloaded from SCOPUS, which is the world’s largest and most-cited database of its kind, involving approximately 22,700 international academic journals and conferences published by approximately 5000 publishers worldwide [14]. It offers intensive coverage of a wider range of journals compared to Web of Science. As such, SCOPUS was considered to be more suitable for our study, in terms of bibliometric analysis [15].

2.2.1. Process of Data Collection (Search Strategy)

Search queries for data collection were determined through expert consultation.
The definition of appropriate search words is important for data collection. As TM research is still in its infancy, several synonyms may be in use for the same concept. To capture all the synonyms, the keywords of the papers obtained from each search round can be used to supplement the initial search words. The process of selecting search words is depicted in Figure 2.

2.2.2. Initial Search Query

In order to explore the relevant literature, we used the synonyms of TM, tasks of TM, and the application domain to define the search words. According to [16], text mining, also referred to as Text Data Mining, is roughly equivalent to Text Analytics or Knowledge-Discovery in Text (KDT). Thus, the synonyms of text mining were set as query #1.
Text mining describes a range of technologies for analyzing and processing text data. To cover a wider range of publications, the tasks of TM were also considered as search words. Typically, text mining tasks include text categorization, text clustering, information extraction, information retrieval, and sentiment analysis [9,16]. “Information extraction” and “information retrieval” may refer to a broader processing area, such as geological information extraction from point clouds [17,18], or information retrieval platforms (e.g., Google) [19]; therefore, we used object constraints to specifically target the processing of text or document(s). Thus, the tasks of text mining and object constraints were set as query #2.
In addition, considering that TM relates to multi-disciplinary research, such as building and construction, computer science, and management, we chose to use search words (i.e., construction industry and construction project) to focus on the domain literature, instead of setting a specific subject area. Thus, domain constraints were set as query #3.
Moreover, the range of data types collected was limited to articles and conference papers, as query #4.
Based on the above analysis, the initial search query was designed as the following combination of queries: ((#1 OR #2) AND #3) AND #4.
The time span was not limited, as we wanted to reveal the emerging and developing trends of TM techniques. The query in SCOPUS was defined as follows, with a total of 144 articles and papers meeting the selection criteria:
#1: TITLE-ABS-KEY (“text mining” OR “text-mining” OR “text data mining” OR “text analytics” OR “knowledge-discovery in text”)
#2: TITLE-ABS-KEY ((“text categorization” OR “text clustering” OR “information extraction” OR “information retrieval” OR “sentiment analysis”) AND (“text” OR “document*”))
#3: TITLE-ABS-KEY (“construction industry” OR “construction project”)
#4: (LIMIT-TO (DOCTYPE, “ar”) OR LIMIT-TO (DOCTYPE, “cp”))
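As a rough illustration, the four sub-queries can be assembled into one search string programmatically. The sketch below assumes the intended grouping is ((#1 OR #2) AND #3) AND #4; the helper name `combine` is hypothetical, and straight quotes replace the typographic ones for readability.

```python
# Sub-queries copied from the text (straight quotes used instead of curly ones).
Q1 = ('TITLE-ABS-KEY ("text mining" OR "text-mining" OR "text data mining" '
      'OR "text analytics" OR "knowledge-discovery in text")')
Q2 = ('TITLE-ABS-KEY (("text categorization" OR "text clustering" '
      'OR "information extraction" OR "information retrieval" '
      'OR "sentiment analysis") AND ("text" OR "document*"))')
Q3 = 'TITLE-ABS-KEY ("construction industry" OR "construction project")'
Q4 = '(LIMIT-TO (DOCTYPE, "ar") OR LIMIT-TO (DOCTYPE, "cp"))'


def combine(q1, q2, q3, q4):
    """Hypothetical helper: join the parts as ((#1 OR #2) AND #3) AND #4.

    The explicit parentheses are an assumption about the intended grouping.
    """
    return f"(({q1} OR {q2}) AND {q3}) AND {q4}"


initial_query = combine(Q1, Q2, Q3, Q4)
print(initial_query)
```

Writing the parentheses explicitly avoids relying on the operator precedence of the search engine.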

2.2.3. Final Search Query

SCOPUS provides the high-occurrence keywords along with the search results. We reviewed these keywords to check whether they contained any new synonyms of the search words. We performed three rounds, until no further synonyms occurred. The new words that were added in each round are presented in Table 1. Thus, the final search query was defined as described below, and 125 more (a total of 269) articles and papers were obtained.
TITLE-ABS-KEY ((“text mining” OR “text-mining” OR “text data mining” OR “text analy*” OR “knowledge-discovery in text” OR “text categorization” OR “text clustering” OR “document classification” OR “document clustering” OR “information extracti*” OR “information retriev*” OR “sentiment analysis” OR “text processing”) AND (“text” OR “document*”) AND (“construction industr*” OR “construction project*” OR “building industry” OR “construction sectors” OR “construction sites” OR “construction engineering” OR “construction management”)) AND (LIMIT-TO (DOCTYPE, “ar”) OR LIMIT-TO (DOCTYPE, “cp”)).
The various types of constraints and keywords are provided in Table 2.
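The round-by-round expansion of search words can be sketched as a simple fixed-point loop. This is only an illustration under stated assumptions: `keywords_for` stands in for running a search and reading off the high-occurrence keywords, `is_synonym` stands in for the manual review, and the toy data are made up.

```python
def expand_terms(seed_terms, keywords_for, is_synonym, max_rounds=10):
    """Grow the search vocabulary until a round yields no new synonyms."""
    terms = set(seed_terms)
    history = []  # terms added in each round
    for _ in range(max_rounds):
        candidates = keywords_for(terms)  # keywords of the current result set
        new_terms = {k for k in candidates if k not in terms and is_synonym(k)}
        if not new_terms:
            break  # no further synonyms: stop searching
        terms |= new_terms
        history.append(sorted(new_terms))
    return terms, history


# Toy stand-ins (hypothetical): which keywords each search word brings up.
TOY_KEYWORDS = {
    "text mining": {"text analysis", "construction industry"},
    "text analysis": {"text processing"},
    "text processing": set(),
}
ACCEPTED = {"text analysis", "text processing"}  # mimics the manual judgement


def toy_keywords_for(terms):
    found = set()
    for term in terms:
        found |= TOY_KEYWORDS.get(term, set())
    return found


terms, history = expand_terms({"text mining"}, toy_keywords_for,
                              ACCEPTED.__contains__)
print(history)  # one list of added terms per round
```

The loop stops as soon as a round adds nothing, which mirrors the stopping rule described in the text.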

2.2.4. Data Screening and Collation

For better analysis performance, human inspection was conducted to filter out unrelated articles and papers, as shown in Figure 3. We read the abstracts of the collected articles and papers, and filtered out 87 unqualified ones. Finally, we obtained a data set consisting of 185 articles and papers closely related to TM in the construction industry (Figure 3), after excluding the following:
  • 22 papers with the search word “information retrieval”, which used “information retrieval systems” to refer to information or document management systems. These papers were published before 2009;
  • 24 papers with improperly indexed keywords. Indexed keywords are chosen not by the authors but by content suppliers, and are standardized based on publicly available vocabularies in SCOPUS;
  • 17 papers with the search word “information extraction” that dealt with GIS data or 3D object recognition, which is not closely related to text information;
  • 7 papers with the search words “text analysis” and “information model” that used non-text mining techniques (e.g., grounded theory) to examine the text;
  • 13 review papers with the search word “text mining” or its synonyms. Although we only chose journal articles and conference papers in the search query, review papers were still collected. These papers used a TM technique to analyze research trends and knowledge gaps in a certain area, which is not within the scope of our paper;
  • 3 proceedings prefaces, which aim to introduce a collection of conference papers. They mentioned TM techniques only when summarizing the overall research status, which did not benefit the analysis.
The final data set comprised 123 journal articles and 62 conference papers, published between 1997 and 2022. All documents were downloaded on 8 September 2022. The list of the articles is provided in the Supplementary Materials (available online at https://github.com/Nina-cumt/Paper-list-of-TM/tree/master, accessed on 12 December 2022).

2.3. Data Cleaning: Thesaurus Construction

Some problems must be addressed when using bibliographic analysis software for keyword analysis.
(1) Improper text segmentation needs to be standardized. Bibliographic analysis software, such as VOSviewer, uses nouns or noun phrases as terms to represent a document. However, some terms have particular meanings in the construction domain. For instance, the term “fall” means “fall from height”, a type of safety accident; likewise, the term “classical vector space model approach” denotes “vector space model”. Thus, these domain terms need to be replaced with a consistent expression. Otherwise, the co-occurrence network may be redundant and fail to depict the significant terms.
(2) Meaningless terms need to be deleted. General terms that are widely used in publications, such as lack, gap, understanding, and so on, need to be filtered out to highlight the meaningful terms. Terms such as construction industry and text mining are also considered meaningless in this study, as they occur throughout the data set.
(3) Synonymous terms need to be normalized; for example, similar terms such as model, modeling, and modelling should be normalized. In addition, abbreviations need to be replaced by full phrases, such as IE (information extraction), CNN (convolutional neural network), and so on.
Thus, a thesaurus was used to cope with the above problems, in order to highlight the meaningful keywords (available online at https://github.com/Nina-cumt/Paper-list-of-TM, accessed on 12 December 2022).
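The three cleaning rules can be illustrated with a small dictionary-based sketch. The mapping below contains only the examples named in the text; it is not the authors' actual thesaurus file, and the function name `normalize_terms` is hypothetical.

```python
# Replacement rules taken from the examples in the text (not the full thesaurus).
THESAURUS = {
    "fall": "fall from height",
    "classical vector space model approach": "vector space model",
    "modeling": "model",
    "modelling": "model",
    "ie": "information extraction",
    "cnn": "convolutional neural network",
}
# Terms treated as meaningless in this study.
STOP_TERMS = {"lack", "gap", "understanding", "construction industry", "text mining"}


def normalize_terms(terms):
    """Lower-case, map to the preferred label, drop stop terms and duplicates."""
    cleaned = []
    for term in terms:
        key = term.strip().lower()
        key = THESAURUS.get(key, key)  # replace by the preferred expression
        if key in STOP_TERMS:
            continue  # rule (2): delete meaningless terms
        if key not in cleaned:
            cleaned.append(key)  # keep first occurrence only
    return cleaned


print(normalize_terms(["Fall", "CNN", "text mining", "modelling", "model", "gap"]))
# → ['fall from height', 'convolutional neural network', 'model']
```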

2.4. Text Visualization: Word Co-Occurrence Network

A word co-occurrence network (CCN) was used to represent the relationships between co-authors, co-citations, and co-keywords. In the CCN, the size of a node represents the weight (usually the full count of the number of appearances) of a word; the larger the node, the heavier the weight. An edge between two nodes indicates that the two words have appeared together (i.e., they co-occur); the thicker the edge, the more frequently they co-occur. The distance between two nodes reflects the strength of the relation between them.
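A minimal sketch of how node and edge weights of such a network could be counted is given below. It assumes each document is represented by a list of terms and counts a term at most once per document; the sample documents are invented, and layout (node distance) is left to the visualization software.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(documents):
    """Return (node_weight, edge_weight) counters for a co-occurrence network."""
    node_weight = Counter()
    edge_weight = Counter()
    for terms in documents:
        unique = sorted(set(terms))  # one count per document, stable pair order
        node_weight.update(unique)
        edge_weight.update(combinations(unique, 2))  # every co-occurring pair
    return node_weight, edge_weight


# Hypothetical keyword lists of three documents.
docs = [
    ["text mining", "BIM", "safety"],
    ["text mining", "safety"],
    ["BIM", "knowledge management"],
]
nodes, edges = build_cooccurrence(docs)
print(nodes["text mining"])            # → 2
print(edges[("safety", "text mining")])  # → 2
```

The same counting applies whether the items are keywords, authors, institutions, or cited references; only the input lists change.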

3. Bibliographic Analysis

Next, we analyzed the annual trends; the journal distribution and the most highly cited publications; the most productive players and their collaboration relationships, in terms of author, institute, and country level; and the most influential journals.

3.1. Publication Analysis

This section focuses on the analysis of annual trends in the number of publications issued, the sources of bibliographic journals, and information relating to papers with high citations in the data set.

3.1.1. Annual Trend Analysis

Figure 4 shows the annual trends in related publications. The number of related publications grew slowly between 1997 and 2009, remaining in the single digits. However, the volume of papers published has increased significantly since 2010. This may be related to the fact that, at this point, researchers began to take a strong interest in text extraction from construction documents. In addition, there has been increased awareness of the importance of knowledge storage, knowledge exploitation, and the development of effective mining tools for knowledge acquisition. Although there were individual decreases in 2011, 2012, and 2014, they did not affect the overall trend. With the development of computational technology, the demand for information and intelligence in the construction industry has also grown. In particular, after 2018, the number of papers presented a clear upward trend: the number of papers published in 2019 alone was equivalent to the number published in 2017 and 2018 combined, and the highest number of papers (25) was published in 2021. This trend also indicates that the application of text mining in the construction industry is a new and growing research field.

3.1.2. Journal Distribution Analysis

This section mainly focuses on journal articles (123 publications), excluding conference papers. The articles were distributed across 58 journals, of which 43 journals published only one article and seven journals published two articles; only eight journals published three or more articles, representing only about 14% of the total number of journals. The scattered distribution stems from the multi-disciplinary nature of TM applications. The top four relevant journals are listed in Table 3. A total of 54 articles, accounting for 44% of the journal articles, were concentrated in four journals: Automation in Construction, Journal of Computing in Civil Engineering, Advanced Engineering Informatics, and Journal of Construction Engineering and Management. Automation in Construction was the most frequent publication venue, with 22 articles in total. This also reveals a problem: except for a few high-profile journals, most journals had only a single-digit publication volume; thus, the areas of application and the depth of text mining are still worth exploring.

3.1.3. Highly Cited Publications

To highlight the most influential papers in the data set, we selected the top 10 papers with the most citations. Table 4 lists these highly cited papers, in terms of citations, title, authors, publication year, and journals.
The most highly cited article, entitled “Semantic NLP-Based Information Extraction from Construction Regulatory Documents for Automated Compliance Checking”, was written by Zhang, J. and published in 2016, and has 151 citations. This study presented a semantic, rule-based NLP method for automated information extraction (IE) from building regulation documents. The method makes use of a set of pattern matching-based IE rules and conflict resolution (CR) rules. The patterns of the IE and CR rules make use of a range of syntactic (syntax-related) and semantic (meaning/context-related) text properties. To reduce the number of necessary patterns, phrase tagging based on phrase structure grammar (PSG) is proposed, along with the separation and ordering of semantic information elements. Ontologies (concepts and relations) are employed to help identify semantic text features. The proposed IE algorithm was tested on the quantitative requirements of the 2009 International Building Code, yielding precision and recall values of 0.969 and 0.944, respectively. The majority of the highly cited papers were published between 2014 and 2019, with a small number of excellent papers from 2002 and 2003 also receiving numerous citations.
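For readers unfamiliar with the two evaluation measures mentioned above, the following sketch shows how precision and recall are commonly computed for an extraction task. The item identifiers are hypothetical and the numbers are not those of the cited study.

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / gold."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)  # items that are both found and correct
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: four requirements in the gold standard,
# four items extracted, three of them correct.
gold = {"r1", "r2", "r3", "r4"}
extracted = {"r1", "r2", "r3", "x9"}
p, r = precision_recall(extracted, gold)
print(p, r)  # → 0.75 0.75
```

Precision penalizes spurious extractions, whereas recall penalizes missed ones, so the two are usually reported together.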

3.2. Collaboration Analysis

A co-occurrence network was developed to represent the collaboration between authors, institutes, and countries. The large nodes represent the influential entities (i.e., authors, institutes, and countries), and the edges between nodes indicate collaborative relationships between entities.

3.2.1. Author Collaboration Analysis

The author co-authorship analysis is shown in Figure 5. Different colors indicate different research fields of the authors, and the lines describe publications jointly written by two researchers. The size of a node represents the number of articles published by the author. The distance between nodes indicates the relevance of co-authorship; the closer the distance between two author nodes, the stronger their relevance. The largest node in the figure is that of Zhang, J., revealing that this author has published the largest number of papers. Due to mutual exchanges and cooperation between scholars, several author sub-networks formed in the figure. The notable ones are the sub-network headed by Zhang, J. and Xue, X. (green nodes), the sub-network represented by Chi, S. (orange nodes), the sub-network represented by Wang, J. (blue nodes), and the sub-network formed by Jiang, S. and Zhang, H. (red nodes). There were few connections between the highly productive authors, indicating that communication and cooperation among scholars in text mining research are insufficient and that academic exchanges and cooperation need to be strengthened.
The top ten productive authors are listed in Table 5. Zhang, J. had the largest number of articles, followed by Kandil, A. and Rezgui, Y.; the number of articles published by these authors accounted for 25% of the total number of articles. Note that an article may have several authors, in which case it is counted for each of them.

3.2.2. Institute Collaboration Analysis

Figure 6 lists the top ten influential institutes. It can be seen, from this figure, that National Taiwan University, the University of Illinois at Urbana-Champaign, and Purdue University had a high number of publications. The number of articles published by these three institutions accounted for 10.3% of the total number of articles retrieved. China accounted for two of the top five institutions, in terms of publication volume. On one hand, this shows that China has many problems to be solved in the application of TM technology in the construction industry, such as the degree of automation of text mining and the learning ability of machine learning classifiers. On the other hand, it also indicates that China has invested more researchers in this field, and has accordingly developed rapidly.
Figure 7 shows the collaborative relationships between the considered institutions. When two institutions appear in the same publication, they are counted as having collaborated once. Thicker lines denote closer institutional collaboration, and larger nodes indicate that the institution has produced more works on the topic. As can be seen from the figure, National Taiwan University, the University of Illinois at Urbana-Champaign, Purdue University, Deakin University, Curtin University, and Dalian University of Technology were the most influential institutions in terms of publications related to text mining, but there was almost no connection between these influential institutions, indicating that the cooperative relationships between the major institutions are very weak. Therefore, strengthening cooperation and exchanges between institutions may promote the publication of related articles and the diversified development of TM in the construction industry.

3.2.3. International Collaboration Analysis

Cross-country authorship analysis is an important way to reflect the degree of communication between countries in a given field. The country co-authorship graph is depicted in Figure 8. The thickness of the connection between two nodes represents the level of cooperation between the countries: the thicker the connection, the more cooperation between them. The research was mainly concentrated in the United States, China, the United Kingdom, Australia, and South Korea, showing that these countries have more research results and strong research capacity. In addition, there was cooperation between different countries. The link strength represents the cooperative relationship between nodes; the greater the value of the link strength, the stronger the cooperative relationship. In the figure, the link strength between China and Australia was the highest, reaching 7. The link strength between the United States and the United Kingdom, as well as between the United States and Australia, was 2. Meanwhile, the link strength between the United Kingdom and China was 1, indicating that the cooperation between these prolific countries was not close. This may be related to the development level of the construction industry in different countries and/or the differences in construction technology between countries.

3.3. Co-Citation Analysis

When two items (e.g., references or journals) are cited together in the reference list of the same publication, they have a co-citation relationship. Therefore, reference co-citation analysis and journal co-citation analysis were conducted.

3.3.1. Reference Co-Citation Analysis

Reference co-citation analysis serves to identify whether a discipline has an inward- or outward-looking approach, links the flow of new ideas, and detects the structure and evolution path of a specific field [13]. When two publications are cited by one publication at the same time, it is regarded as a co-citation. Figure 9 shows the reference co-citation network evolving with time. The size of a node represents the number of times the document has been cited, and the presence of an edge between two nodes indicates that the two documents were cited together. Colors represent the times when the publications were cited, changing from purple to yellow from the earliest to the most recent years. The time span was from 1997 to 2021, in increments of one year.
It can be seen from the figure that, in the purple area (spanning from 1997 to 2000), the network connections are intricate, which means that these articles have been frequently cited and have similar research topics. The orange period, from 2000 to 2008, also presents a relatively tightly structured network. The analysis of the literature indicated that the main focus in this period was on the classification and retrieval of information. For example, the study by Caldas, C.H. [20] is representative of the research focused on improving methods for information organization and access in construction project document management. On the basis of information retrieval science, Rezgui, Y. [21] employed user profiling and document summarization techniques. For professionals in the construction sector, the conceptualization of domains through ontologies offers a practical answer for information and knowledge management. The yellow period spans from 2009 to 2020, and the more frequently cited authors were Zhang, F., Zou, Y., Williams, T.P., and Baeza-Yates, R., each cited five times. In 2014, Williams, T.P. published “Predicting construction cost overruns using text mining, numerical data and ensemble classifiers” in Automation in Construction. This study [22] showed how to combine text describing construction projects with numerical data and use data mining classification algorithms to determine the level of cost overruns. Prediction may be an important turning point regarding the application of text mining in the construction industry. In addition, Zhang, F. [23] applied text mining and natural language processing techniques to analyze construction accident reports, in order to prevent similar accidents. Building on case-based reasoning, Zou, Y. [24] created a database of construction accidents and a system for automatically retrieving similar cases for the risk management of building projects.
This demonstrates that the application field of text mining in the construction industry has been gradually expanding.
It should be noted that, in the reference co-citation network graph, the importance of a node does not necessarily reflect a large number of citations; rather, it shows that the article explored research topics related to the construction industry. The importance of nodes in the network is measured by betweenness centrality, which is calculated as the share of the shortest paths between pairs of nodes in the network that pass through a given node. In addition, the top ten cited references are listed in Table 6. These publications mainly focused on information retrieval, building quality compliance inspections for regulations, and the use of natural language processing, text classification, and other methods to analyze accident injury reports to prevent future accidents.
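The betweenness centrality measure described above can be sketched with Brandes' algorithm for an unweighted, undirected graph. This is a generic textbook implementation, not the routine used inside CiteSpace, and it returns unnormalized scores (the sum, over node pairs, of the fraction of shortest paths passing through each node); the four-node example graph is invented.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness for an unweighted, undirected graph.

    adjacency maps each node to a list of its neighbours.
    """
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that also counts shortest paths from source.
        order = []
        predecessors = {v: [] for v in adjacency}
        n_paths = {v: 0 for v in adjacency}
        distance = {v: -1 for v in adjacency}
        n_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        dependency = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # In an undirected graph every pair is visited from both endpoints.
    return {v: s / 2.0 for v, s in score.items()}


# Toy path graph A - B - C - D: only the two inner nodes lie between others.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
bc = betweenness_centrality(path)
print(bc)  # → {'A': 0.0, 'B': 2.0, 'C': 2.0, 'D': 0.0}
```

Dividing each score by the number of node pairs that exclude the node would give the normalized value often reported by bibliometric tools.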

3.3.2. Journal Co-Citation Analysis

Journal co-citation analysis enabled us to reveal the characteristics and subject structure of the cited journals. Figure 10 displays the journal co-citation network. The figure shows the structural distribution of journals cited three or more times, including 179 nodes and 2183 connections. The larger the node, the more times the journal had been co-cited, and the closer the distance between two nodes, the closer the connection between them. Clusters with different topics are distinguished by different colors, and eight clusters were obtained. For example, the green cluster contains 28 nodes (i.e., 28 journals). The red cluster mainly represents journals related to construction information, visualization, and computers. The green cluster shows journals related to building energy, environment, and sustainability. The purple cluster indicates journals related to construction accidents, safety, and risk, including Safety Science, Risk Analysis, and so on. Automation in Construction, Journal of Construction Engineering and Management, and Journal of Computing in Civil Engineering were the three largest nodes, and the distance between these nodes was very small. These three journals ranked clearly ahead of the rest, with 147, 66, and 64 citations and connection strengths of 3028, 1636, and 901, respectively.

3.4. Keyword Analysis

The keyword co-occurrence network and density map were used to reflect the research hotspots of TM in the construction industry.
Keywords that appeared two times or more were used for analysis. We used a thesaurus to standardize the different descriptions of keywords. Of the 1241 keywords, 287 items met the threshold. The keyword co-occurrence network is displayed in Figure 11. Nodes with the same color belong to the same cluster. Nine clusters were generated, based on the association strength of keywords, which are detailed in Table 7.
The links reflect the relationships between keywords. Taking the node “text mining” as an example, strong links with “BIM” (5), “natural language processing” (5), “data mining” (4), “safety” (4), “knowledge management” (3), “machine learning” (2), and “web crawling” (2) can be observed. The link between “text mining” and “data mining” indicates their close relationship; to some extent, the two terms are used interchangeably. However, text mining focuses on textual information (unstructured data), while data mining focuses on structured data. The links between “text mining” and “BIM”, “safety”, and “knowledge management” (KM) indicate that text mining has mainly been used to provide solutions related to these areas. The relationships between “text mining” and “natural language processing”, “machine learning”, and “web crawling” reflect the fact that new computational technologies have been utilized to carry out automatic mining.
Table 8 shows the top 10 keywords with high total link strength and occurrence. The total link strength of a node is the sum of link strengths of the node with respect to all other nodes. The link strength between two nodes refers to the frequency of co-occurrence, which can be used as a quantitative index to depict the relationship between two nodes. These keywords are considered as the core keywords for the application of text mining in the construction industry.
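The definitions of link strength and total link strength given above can be made concrete with a minimal sketch. It assumes that the link strength between two keywords is the number of papers in which both appear; the thesaurus entries and the four keyword lists are hypothetical and are not taken from the 185 reviewed articles.

```python
from collections import Counter
from itertools import combinations

# Hypothetical thesaurus that merges different spellings of the same keyword.
THESAURUS = {
    "nlp": "natural language processing",
    "building information modelling": "BIM",
}


def normalize(keyword):
    """Map a keyword to its standardized form, if the thesaurus lists one."""
    return THESAURUS.get(keyword.lower(), keyword)


def cooccurrence_links(keyword_lists):
    """Link strength: number of papers in which two keywords co-occur."""
    links = Counter()
    for keywords in keyword_lists:
        unique = sorted({normalize(k) for k in keywords})
        for a, b in combinations(unique, 2):
            links[(a, b)] += 1
    return links


def total_link_strength(links):
    """Sum of the link strengths of each keyword with all other keywords."""
    totals = Counter()
    for (a, b), strength in links.items():
        totals[a] += strength
        totals[b] += strength
    return totals


# Hypothetical author keywords of four papers.
papers = [
    ["text mining", "BIM", "safety"],
    ["text mining", "building information modelling"],
    ["text mining", "NLP"],
    ["safety", "natural language processing"],
]
links = cooccurrence_links(papers)
totals = total_link_strength(links)
```

Here, “text mining” and “BIM” co-occur in two papers, so their link strength is 2, and “text mining” has the largest total link strength because it is linked to every other keyword.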

3.5. Keyword Timeline Analysis

Next, we generated a timeline of the keywords, as shown in Figure 12. There were almost no related applications of text mining before 2000. From 2000 to 2006, research focused on information retrieval, information acquisition, and so on, while also promoting the development of knowledge management in the construction industry, thereby laying a foundation for follow-up research. From 2007 to 2012, with the emergence of ontology and semantics, mining tools progressed from simple information retrieval to the use of natural language processing to handle complex construction texts, which paved the way for broader text mining. From 2013 to 2020, new breakthroughs were made in text mining, mainly in the following three aspects: the support of big data technologies (e.g., machine learning, deep learning, and artificial intelligence); the use of information theory, building information modelling, classification algorithms, information extraction, and other tools to improve the information acquisition and retrieval functions; and the continuous development of research in the field of construction safety risk management. These developments indicate that the use of text mining to process large amounts of construction safety risk information, in order to prevent accidents and enhance safety, is a topic that deserves the attention of the construction industry.

4. Text-Based Analysis Methods

Text analysis techniques typically involve the following phases when used in construction-related text mining research: corpus acquisition, pre-processing, text representation, and model training. Here, the corpus refers to paper or electronic text (e.g., Word, PDF, Excel) and image files, as described in Table 8. The preceding phases culminate in model training, in which an algorithm learns to extract the necessary information from the corpus data.

4.1. Text Pre-Processing

Text pre-processing, including text cleaning, error correction, formatting, word segmentation, part-of-speech tagging, and stop-word filtering, is the preliminary work performed on the original text in order to convert it into a machine-readable form. For word segmentation and part-of-speech tagging, mature open-source NLP tools are currently available; see Table 9. The ICTCLAS Chinese word segmentation system has been employed in the field of construction engineering to perform word segmentation and part-of-speech tagging on construction quality acceptance requirements and on documents in the area of coal mine safety [25]. LTP has been utilized for urban rail transit construction safety risk management [26]. Xue and Zhang [27] have pointed out that generic lexicons are limited and that the performance of open-source pre-processing tools may be degraded when dealing with domain-specific documents. Therefore, future studies will also require the manual building of dictionaries and ontologies that are relevant to the construction domain [28].
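As a rough illustration of the cleaning, tokenization, and filtering steps described above, the following sketch processes a single English sentence. It rests on several assumptions: whitespace tokenization only works for languages such as English (Chinese text would require a segmenter such as ICTCLAS or LTP), the stop-word list is a tiny hypothetical one, tagging is omitted, and the accident report sentence is invented.

```python
import re

# Tiny hypothetical stop-word list; real studies use much larger,
# often domain-specific lists.
STOP_WORDS = {
    "the", "a", "an", "of", "to", "and", "in", "on",
    "was", "were", "is", "by", "at", "from",
}


def preprocess(text):
    """Clean a sentence, split it into tokens, and drop stop words."""
    text = text.lower()                        # formatting: normalize case
    text = re.sub(r"[^a-z0-9\s-]", " ", text)  # cleaning: strip punctuation
    tokens = text.split()                      # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


# Invented accident report sentence.
report = "The worker fell from the scaffold, and was injured at the 3rd floor."
tokens = preprocess(report)
# tokens == ["worker", "fell", "scaffold", "injured", "3rd", "floor"]
```

A domain-specific dictionary, as called for in [28], would slot in at the tokenization step, for example to keep multi-word terms such as “tower crane” together.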

4.2. Text Representation

Text representation (i.e., text feature generation) converts text into machine-readable numerical data structures, such as vectors or matrices. Table 10 provides a brief overview of current feature generation methods. Traditional NLP techniques extract features from text data by analyzing the syntactic structure. A literature search revealed that the vector space model (VSM) is relatively simple and dominates feature generation methods. While term frequency (TF) has been historically popular as a metric for identifying key features, TF-IDF, which builds on the inverse document frequency proposed by Sparck Jones [29], has become the main method for determining feature weights in documents. With the development of computer technology, representation methods based on neural networks began to appear, including Word2Vec, ELMo, and BERT.
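To show how TF-IDF weights turn tokenized documents into vectors, the sketch below uses one common variant, w(t, d) = tf(t, d) × log(N / df(t)), where tf is the relative frequency of term t in document d, N is the number of documents, and df(t) is the number of documents containing t. The reviewed studies may use other variants (e.g., with smoothing), and the three-document corpus is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized accident descriptions.
corpus = [
    ["worker", "fell", "scaffold"],
    ["worker", "struck", "crane"],
    ["worker", "fell", "ladder"],
]
weights = tf_idf(corpus)
```

In this corpus, “worker” occurs in every document and therefore receives a weight of zero, whereas “scaffold”, which occurs in only one document, receives the highest weight in the first document. This is the behavior that makes TF-IDF more discriminative than raw term frequency.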

4.3. Model Training

The last phase in text mining is model training, which uses the previously created features to carry out various tasks such as document classification, incident analysis, and compliance evaluation. Several algorithmic models that appeared frequently in the literature analysis are listed in Table 11. The majority of earlier studies employed conventional machine learning techniques, such as SVM, KNN, and CRF, with SVM models generally outperforming the others. Convolutional neural networks (CNN), recurrent neural networks (RNN), bidirectional long short-term memory (Bi-LSTM), and other neural network architectures have received significant attention in recent years. The BERT model was proposed by Google in 2018, and since then, self-attention mechanisms have been used in the construction field. In the years to come, the number of publications on self-attention mechanism-based approaches for construction text analysis is anticipated to rise [30].

5. Current Theme and Topic Analysis

The results of the keyword analysis provided in Section 3.4 were slightly modified, based on a thorough reading of the 185 articles. The selected articles were grouped into four application directions: Document Management (DM), Automated Compliance Checking (ACC), Safety Management (SM), and Risk Management (RM). In order to better structure the analysis of the selected articles, these four main categories were further sub-divided according to text mining tasks, as detailed in Table 12.

5.1. Document Management

The main research objectives of DM can be divided into the following three areas: knowledge extraction, knowledge retrieval, and document classification/clustering.
In document knowledge extraction, Al Qady and Kandil [39] have used NLP techniques to parse contract documents into noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). By identifying subject and object triads <subj, VP, obj>, they extracted contextually relevant semantic knowledge to improve functions such as document classification and retrieval, with an F-measure score of 90%. Ren, R. et al. [81] have proposed a semantic rule-based information extraction method to automatically extract construction execution steps from construction procedure documents, reducing the workload of manually collecting information from such documents while achieving an accuracy of 97.08% and a recall of 93.23%.
Due to the availability of data and algorithms, document retrieval was the main topic of research, particularly in the early years. NLP was progressively applied to numerous applications up until 2015. Document knowledge retrieval is often accomplished by comparing the similarity of two representation vectors. Using TF-IDF and cosine similarity, Li and Ramani [82] have created an ontology-based design document query system that outperformed keyword-based search methods. By utilizing a Bayesian classifier to retrieve feature documents through similarity matching, Yu and Hsu [42] have developed a technique to reduce the dimensionality of VSM, enabling automatic and quick retrieval of CAD documents from 2094 Chinese annotated CAD drawings gathered from two actual building projects.
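Retrieval by comparing representation vectors can be sketched as follows. To keep the example self-contained, it uses raw term counts instead of TF-IDF weights, and it is not the ontology-based system of Li and Ramani [82] or the method of Yu and Hsu [42]; the document identifiers and contents are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_tokens, documents):
    """Return (similarity, doc_id) pairs, most similar document first."""
    query_vec = Counter(query_tokens)
    scored = [
        (cosine_similarity(query_vec, Counter(tokens)), doc_id)
        for doc_id, tokens in documents.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical tokenized project documents.
documents = {
    "spec_concrete": ["concrete", "curing", "temperature", "requirement"],
    "spec_steel": ["steel", "beam", "welding", "requirement"],
    "report_fall": ["worker", "fell", "scaffold"],
}
ranking = rank_documents(["concrete", "curing"], documents)
```

The query shares two terms with `spec_concrete` and none with the other documents, so `spec_concrete` is ranked first. Replacing the raw counts with TF-IDF weights changes only how the vectors are built, not the ranking logic.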
In document classification/clustering, the general process consists of pre-processing, text representation, and classification modelling. In the early literature, Caldas and Soibelman [33] have implemented an automated document classification system that can automatically classify construction project documents according to project components, with an average classification accuracy of 92.05% for the three levels. More recently, Hassan and Le [83] have classified contract text into requirement and non-requirement statements using Word2Vec and SVM, in order to shorten reading times and enhance comprehension of the contract scope. As a supplement to analytical tools such as CiteSpace, some researchers [84] have recently used LDA topic modelling to automatically assign one or more topics to documents, in order to achieve document tagging, which is used to analyze historical documents and extract clustered subject terms.

5.2. Automated Compliance-Checking

Automated compliance-checking using NLP techniques is another hot topic in text mining applications. Automated compliance-checking requires understanding and extracting constraints from various building regulation documents, followed by converting them into a formal format that allows for checking/reasoning. Zhang, J. and El-Gohary have made significant contributions to this field. In 2015, Zhang, J. and El-Gohary [53] proposed the extraction of rules based on pattern matching and conflict resolution rules. The same year, they [51] proposed a bottom-up conversion method based on semantic mapping and conflict resolution rules, in order to extract constraints and convert them into first-order logic using Prolog syntax. Building on previous research, Zhang and El-Gohary [85] have extracted regulatory concepts and Industry Foundation Classes (IFC) concepts from compliance documents. They then identified the relationships between each pair of regulatory and IFC concepts to create extended IFC schemas. At present, NLP-based compliance checks are mainly used to assess architectural designs [54] and work process dependencies [86].
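The general idea of extracting a constraint from regulatory text and converting it into a checkable form can be illustrated with a toy sketch. It is not the method of Zhang and El-Gohary; it assumes a single hand-written regular expression for quantitative requirements of the form “X shall be at least/at most N unit”, and the regulatory sentence and design values are invented.

```python
import operator
import re

# Hypothetical mapping from comparative phrases to comparison functions.
COMPARATORS = {
    "at least": operator.ge,
    "not less than": operator.ge,
    "at most": operator.le,
    "not more than": operator.le,
}

# One hand-written pattern for simple quantitative requirements.
PATTERN = re.compile(
    r"(?P<subject>[a-z ]+?) shall be "
    r"(?P<comparator>at least|not less than|at most|not more than) "
    r"(?P<value>\d+(?:\.\d+)?) (?P<unit>[a-z]+)",
    re.IGNORECASE,
)


def extract_rule(sentence):
    """Turn a requirement sentence into a structured rule, or None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "subject": match.group("subject").strip().lower(),
        "comparator": match.group("comparator").lower(),
        "value": float(match.group("value")),
        "unit": match.group("unit").lower(),
    }


def complies(rule, design_value):
    """Check a design value (in the rule's unit) against the rule."""
    return COMPARATORS[rule["comparator"]](design_value, rule["value"])


# Invented regulatory sentence.
rule = extract_rule("The corridor width shall be at least 1.2 m.")
ok_design = complies(rule, 1.5)   # True: 1.5 m meets the minimum
bad_design = complies(rule, 1.0)  # False: 1.0 m is too narrow
```

Real building codes contain exceptions, cross-references, and conflicting clauses, which is why the studies above rely on semantic mapping and conflict resolution rules rather than on a single pattern.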

5.3. Safety Management

The scope of safety management includes scheduling, cost, construction process, and so on. Rupasinghe et al. [76] have used support vector machines (SVM), linear regression (LR), k-nearest neighbors (KNN), decision trees (DT), naive Bayes (NB), and ensemble models to analyze construction accident reports and classify the causes of accidents. Tixier et al. [63] have developed an NLP program based on hand-crafted rules to automatically extract attributes and outcomes from injury reports with an F1 score of 96%. Chi et al. have combined TF-IDF, principal component analysis (PCA), and SVM to classify accident categories from documents.

5.4. Risk Management

Risk management is broadly defined as the measurement, assessment, and development of contingency strategies for all aspects of the construction production process. Current research in the field of text mining related to risk management focuses on risk factor identification and analysis, as well as risk prediction. Siu et al. [73] have applied NLP software to identify 16 new risk categories for engineering contracts from unstructured text descriptions of NEC projects in Hong Kong, and used decision trees to analyze risk ratings. Kim and Kim [74] have identified factors related to building fire accidents from news articles, and then analyzed the main factors causing fire accidents in different seasons by using principal component analysis (PCA). Li et al. [87] have developed four main safety accident data sets, where the documents were represented by doc2vec vectors. As new incident reports emerged, the most similar data sets were selected based on doc2vec similarity, in order to share key factors that predict injury levels. The data sets were then trained to recommend deep learning models, based on their meta-features (e.g., proportion of category factors), in order to maximize prediction performance. Xu, N et al. [88] have proposed an information entropy-weighted term frequency (TF-H) for term importance assessment regarding the case of a Chinese metro construction project, extracting 37 safety risk factors from 221 metro construction accident reports.

6. Research Gaps and Future Studies

One may monitor the future development of a specialty field by tracking research frontiers that have been built on in the early stages of the field. This can also be achieved by monitoring emerging trends and patterns, in terms of the major dimensions in the latent semantic space spanned by each year’s publications connected to this particular cluster. On the basis of recent research, we have identified a number of shortcomings in the current research field, such as the lack of close institutional collaboration and the low quality of samples. In response to these problems, this section also provides future research directions for text-based research in the construction industry.

6.1. Domain Customization Text Mining Technology

Although NLP-based text mining applications have yielded good results in the general-purpose domain, comparable results are not guaranteed in specialized professional domains.
Corpora in the construction domain are primarily composed of accident reports or standard specifications, whereas the corpora in the medical domain are composed primarily of diagnostic reports and related books. In both cases, the document structures differ considerably from those of the general-purpose domain. Other sectors, including the medical and legal industries, have worked to develop specialized NLP tools and approaches in an effort to overcome these kinds of domain-specific constraints [89,90]. With the growing trend of electronic information technology and the help of domain experts, text mining can be applied to a wider range of fields.

6.2. Information Openness

The availability and growth of data sources are crucial for the development of text-based mining studies. In addition to existing electronic books and documents, the internet is one of the most significant sources of information, offering a wealth of knowledge. The majority of researchers opt to use Python-based approaches to obtain data from the internet; however, data collected in this way may contain irrelevant web content and markup, which can lessen the usefulness of the trained model if left unprocessed. In addition, both technical and non-technical considerations are required to prevent the indiscriminate collection of web data, which can lead to copyright or web server issues.
Safety- and regulation-related documents are the key data sources from the body of literature on text-based mining research. Regulatory documents issued by official bodies and accident reports published by public organizations are more authoritative and prescriptive, despite the fact that there are numerous stories and evaluations of construction safety accidents available through web-based media. Open data publication and information sharing are currently popular trends across the globe. Some countries have developed their own open data platforms, such as Data.gov in the USA (https://www.data.gov, accessed on 12 December 2022), Find Open Data in the U.K. (https://data.gov.uk, accessed on 12 December 2022), and the European Data Portal (http://www.europeandataportal.eu, accessed on 12 December 2022). Such open public data are expected to be of benefit to the construction industry.

6.3. Cutting-Edge NLP Technology

The NLP techniques used for text mining have progressed from rule- and dictionary-based approaches to traditional machine learning-based approaches (e.g., HMM, ME, and SVM) to modern deep learning-based approaches (e.g., CNN and Bi-LSTM). A representative language model of recent years is BERT, proposed by Google in 2018. No single algorithm can handle all tasks with high performance, due to the complexity of data in the construction domain; however, researchers are currently working on more advanced computer models to make the construction industry more intelligent and digital. A combination of multiple algorithms (e.g., BiLSTM–CNN–CRF, BERT–BiLSTM–CRF) is considered more suitable for solving the problem at hand.

6.4. Building Engineering Expert System

Due to the intricate nature of the construction process and the distinctiveness of the project environment, it is challenging to completely integrate data-driven text mining research in the construction industry. The strength of construction industry experts lies not in creating cutting-edge algorithms—despite the fact that it was mentioned above that novel NLP techniques should be used more in the construction field—but rather, it lies in enhancing existing NLP and TM techniques through construction industry-specific domain knowledge, in order to broaden the application domain for TM. Academics have previously concentrated on creating useful text analysis tools or figuring out risk elements to examine in safety incident analysis. This can be advanced in the context of building automation by fusing already known knowledge with recently obtained textual information. For example, by building a knowledge classification system for coal mine construction safety management based on familiarity with pertinent mine standards and specifications, this knowledge can be used to create intelligent knowledge question-and-answer systems [91], or to create design solutions or emergency measures by combining other information (e.g., project context and safety-related knowledge). Automated expert system development will be aided by these applications.

6.5. Building a High-Quality Database

Text mining in the construction field requires a large volume of data for model training; furthermore, the higher the quality of the data format, the better the effect of model training. There are many kinds of existing engineering documents, such as accident reports, contract texts, construction documents, and so on. The document format cannot be fully unified, especially various tables, which are complicated to handle. Incorrect annotation can seriously mislead the machine learning process, and the acquisition and processing of data upfront is a major difficulty. For image and surveillance data, the current databases are limited to specific tasks, such as construction machinery inspection [92]. A deeper understanding of integrated data, based on text, images, and sensors, can bring potential advances to the construction industry, perhaps creating new knowledge and business opportunities. In addition, semi-supervised or unsupervised learning approaches, which require smaller training samples, provide a promising means to accommodate the above challenges.

6.6. Strengthening Partnerships

In text mining research in construction, several authors, including Zhang, J., Kandil, A., Jiang, S., and so on, had a higher volume of publications than others did; however, it is interesting to note that not all of the most-cited literature was authored by authors with a higher volume of publications. Some authors, such as Tixier, A J P. and Caldas, C H., had a low number of relevant publications; however, the article “Automated content analysis for construction safety: a natural language processing system to extract precursors and outcomes from unstructured injury reports” by Tixier, A J P. had a citation count of 142. Obviously, not all of the most prolific authors have established their cooperative networks. Furthermore, the findings indicated that the most important research and developments in the field of text mining take place mainly in the U.S., China, and the U.K., but the collaborative relationships between these countries are not as close, and the partnerships between the most influential institutions are very weak. This suggests that research on text mining in the field of construction is still fragmented, with insufficient partnerships between different institutions, countries, and authors. The next step is to pay attention to strengthening the exchange of research results, as well as jointly exploring richer theoretical research and more diverse application directions in the field.

6.7. The Breadth of Application Needs to Be Improved

The bibliographic analysis in Section 3 revealed that the number of articles published related to text mining has been increasing annually. Over the past two decades, a number of high-output and focused journals were identified, such as Automation in Construction, Journal of Computing in Civil Engineering, Advanced Engineering Informatics, and Journal of Construction Engineering and Management, indicating that these journals have had a significant impact on research in text mining. However, of the more than 50 journals involved in the literature data set, 43 journals had published only one article, and the research content of some of the small-volume journals was inadequate and did not address industry-related issues prominently enough, indicating that the breadth and depth of text mining in specific applications still needs to be improved.

6.8. Information Summary Generation

At present, the application of text mining in the construction field is mostly focused on document knowledge extraction and retrieval, accident cause analysis, compliance review, etc. The object text is also mostly engineering documents, and less attention is paid to engineering news information on the internet, such as accident news reports. News reports on engineering accidents are generally continuous, and different news portals have different focuses on the description of the event, so users often need to spend time browsing multiple portals to get all the information about the news event. The introduction of text mining technology can automatically crawl the news information of the portal; the event data are then extracted and processed with pre-trained models such as BERT [93] and eventually pushed to the user.

7. Conclusions

We presented a systematic analysis of TM applications in the construction industry. Our intention was twofold: first, we wished to demonstrate the potential depth of a systematic review by applying synthetic approaches. In addition to the application of literature analysis software, including CiteSpace and VOSviewer, we enriched the procedure for producing a systematic review regarding a knowledge domain by incorporating evolutionary models for a scientific specialty. The enhanced systematic procedure introduced in this article is applicable to the analysis of other domains of interest. Researchers can utilize these visual analytic tools to perform timely surveys of the literature as frequently as they wish, in order to retrieve relevant publications more effectively.
Our second goal was to reveal the research status and research frontiers of TM applied in the construction industry. Our timely survey demonstrated the highest-impact articles and the evolution of the relevant literature over time. The published articles were found to have mainly focused on the use of ontology, NLP technology, and the use of sentences and semantics to extract relevant construction industry information, as well as the use of text mining technology to classify project documents.
If the authors of the two major research centers strengthen their cooperation, rapid development of text mining technology in the construction industry may be facilitated. A wider range of applications of existing techniques will, in turn, widen our horizon and deepen our understanding of the challenges that need to be overcome in order to advance the state-of-the-art of text mining technology.

Limitations

The scope of the data was limited by the retrieval source (i.e., SCOPUS) and the composite query used. More in-depth analyses of each specialty would be more revealing, incorporating additional methods such as context analysis and studies of other aspects of scholarly publications. Patents and research grants are other types of data sources that may be considered, but for this particular review, we limited our investigation to the scientific literature indexed by SCOPUS.

Supplementary Materials

The following supporting information can be downloaded at: https://github.com/Nina-cumt/Paper-list-of-TM/tree/master.

Author Contributions

Conceptualization, N.X. and X.Z.; Data curation, N.X. and X.Z.; Formal analysis, N.X. and C.G.; Investigation, N.X. and C.G.; Methodology, X.Z. and C.G.; Project administration, N.X.; Software, N.X. and X.Z.; Supervision, N.X. and Y.H.; Validation, B.X. and F.W.; Visualization, C.G. and Y.H.; Writing—original draft, N.X. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by National Natural Science Foundation of China (grant number 71901206) and the social science fund of Jiangsu Province (22GLB023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

The authors would like to acknowledge funding from the National Natural Science Foundation of China and the social science fund of Jiangsu Province. Additionally, the authors would also like to acknowledge Bai Xiao and Fei Wei from Tianjin Jingang Construction Co., Ltd.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cheng, C.W.; Leu, S.S.; Cheng, Y.M.; Wu, T.C.; Lin, C.C. Applying data mining techniques to explore factors contributing to occupational injuries in Taiwan’s construction industry. Accid. Anal. Prev. 2012, 48, 214–222.
  2. Miner, G.D.; Elder, J.; Nisbet, R.A. Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications; Academic Press: Cambridge, MA, USA, 2012.
  3. Cohen, A.M.; Hersh, W.R. A survey of current work in biomedical text mining. Brief. Bioinform. 2005, 6, 57–71.
  4. Van Driel, M.A.; Bruggeman, J.; Vriend, G.; Brunner, H.G.; Leunissen, J.A. A text-mining analysis of the human phenome. Eur. J. Hum. Genet. 2006, 14, 535–542.
  5. Ghose, A.; Ipeirotis, P.G. Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics. IEEE Trans. Knowl. Data Eng. 2011, 23, 1498–1512.
  6. Qazi, A.; Quigley, J.; Dickson, A.; Kirytopoulos, K. Project Complexity and Risk Management (ProCRiM): Towards modelling project complexity driven risk paths in construction projects. Int. J. Proj. Manag. 2016, 34, 1183–1198.
  7. Soliman, E. Risk Identification for Building Maintenance Projects. Int. J. Constr. Manag. 2018, 10, 37–54.
  8. Tembo-Silungwe, C.K.; Khatleli, N. Identification of Enablers and Constraints of Risk Allocation Using Structuration Theory in the Construction Industry. J. Constr. Eng. Manag. 2018, 144, 116722000.
  9. Ghosh, S.; Roy, S.; Bandyopadhyay, S.K. A tutorial review on Text Mining Algorithms. Int. J. Adv. Res. Comput. Commun. Eng. 2012, 1, 16207659.
  10. Ur-Rahman, N.; Harding, J.A. Textual data mining for industrial knowledge management and text classification: A business oriented approach. Expert Syst. Appl. 2012, 39, 4729–4739.
  11. Allahyari, M.; Pouriyeh, S.; Assefi, M.; Safaei, S.; Trippe, E.D.; Gutierrez, J.B.; Kochut, K. A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. arXiv 2017, arXiv:1707.02919.
  12. Li, X.; Wu, P.; Shen, G.; Wang, X.Y.; Teng, Y. Mapping the knowledge domains of Building Information Modeling (BIM): A bibliometric approach. Autom. Constr. 2017, 84, 195–206.
  13. Liao, H.C.; Tang, M.; Luo, L.; Li, C.Y.; Chiclana, F.; Zeng, X.J. A Bibliometric Analysis and Visualization of Medical Big Data Research. Sustainability 2018, 10, 166.
  14. Lee, D.; Kim, H.; Sim, J.; Lee, D.; Cho, H.; Hong, D. Trends in 3D Printing Technology for Construction Automation Using Text Mining. Int. J. Precis. Eng. Manuf. 2019, 20, 871–882.
  15. Cheng, M.M.; Edwards, D.; Darcy, S.; Redfern, K. A Tri-Method Approach to a Review of Adventure Tourism Literature: Bibliometric Analysis, Content Analysis, and a Quantitative Systematic Literature Review. J. Hosp. Tour. Res. 2018, 42, 997–1020.
  16. Sathya, S.; Rajendran, N. A Review on Text Mining Techniques. Int. J. Comput. Sci. Eng. 2015, 3, 274–284.
  17. Czerniawski, T.; Nahangi, M.; Walbridge, S.; Haas, C. Automated Removal of Planar Clutter from 3D Point Clouds for Improving Industrial Object Recognition. Available online: https://www.iaarc.org/publications/fulltext/ISARC2016-Paper067.pdf (accessed on 12 December 2022).
  18. Yu, J.; Gan, Z.; Zhong, L.; Deng, L. Research and Practice of UAV Remote Sensing in the Monitoring and Management of Construction Projects in Riparian Areas. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2018, XLII-3, 2161–2165.
  19. Fenais, A.; Ariaratnam, S.T.; Ayer, S.K.; Smilovsky, N. Integrating Geographic Information Systems and Augmented Reality for Mapping Underground Utilities. Infrastruct.-Base 2019, 4, 60. [Google Scholar] [CrossRef] [Green Version]
  20. Caldas, C.H.; Soibelman, L.; Han, J. Automated classification of construction project documents. J. Comput. Civ. Eng. 2002, 16, 234–243. [Google Scholar] [CrossRef]
  21. Rezgui, Y. Text-based domain ontology building using Tf-Idf and metric clusters techniques. Knowl. Eng. Rev. 2007, 22, 379–403. [Google Scholar] [CrossRef]
  22. Williams, T.P.; Gong, J. Predicting construction cost overruns using text mining, numerical data and ensemble classifiers. Autom. Constr. 2014, 43, 23–29. [Google Scholar] [CrossRef]
  23. Zhang, F. A hybrid structured deep neural network with Word2Vec for construction accident causes classification. Int. J. Constr. Manag. 2019, 22, 1120–1140. [Google Scholar] [CrossRef] [Green Version]
  24. Zou, Y.; Kiviniemi, A.; Jones, S.W. Retrieving similar cases for construction project risk management using Natural Language Processing techniques. Autom. Constr. 2017, 80, 66–76. [Google Scholar] [CrossRef]
  25. Hu, H.M. Construction Quality Acceptance Knowledge Modeling and Extraction; Huazhong University of Science and Technology: Wuhan, China, 2014. [Google Scholar]
  26. Wang, Y. Event Ontology in Coal Mining Safety Field and Its Application in Query Expansion; Beijing University of Technology: Beijing, China, 2015. [Google Scholar]
  27. Xue, X.R.; Zhang, J.S. Building Codes Part-of-Speech Tagging Performance Improvement by Error-Driven Transformational Rules. J. Comput. Civ. Eng. 2020, 34, 2723. [Google Scholar] [CrossRef]
  28. Zhou, P.; El-Gohary, N. Ontology-Based Multilabel Text Classification of Construction Regulatory Documents. J. Comput. Civ. Eng. 2016, 30, 530. [Google Scholar] [CrossRef]
  29. Sparck Jones, K. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 1972, 28, 11–21. [Google Scholar] [CrossRef]
  30. Fang, W.; Luo, H.; Xu, S.; Love, P.E.D.; Lu, Z.; Ye, C. Automated text classification of near-misses from safety reports: An improved deep learning approach. Adv. Eng. Inform. 2020, 44, 101060. [Google Scholar] [CrossRef]
  31. Hammad, M.M. Managing project documents using virtual Web centers. In Proceedings of the Canadian Society for Civil Engineering-30th Annual Conference: 2002 Challenges Ahead, Montreal, QC, Canada, 1 January 2002; pp. 691–699. [Google Scholar]
  32. Caldas, C.H.; Soibelman, L.; Songer, A.D.; Miles, J.C. Implementing automated methods for document classification in construction management information systems. In Proceedings of the International Workshop on Information Technology in Civil Engineering: Computing in Civil Engineering, Washington, DC, USA, 1 January 2002; pp. 194–210. [Google Scholar]
  33. Caldas, C.H.; Soibelman, L. Automating hierarchical document classification for construction management information systems. Autom. Constr. 2003, 12, 395–406. [Google Scholar] [CrossRef]
  34. Demian, P.; Fruchter, R. Measuring relevance in support of design reuse from archives of building product models. J. Comput. Civ. Eng. 2005, 19, 119–136. [Google Scholar] [CrossRef]
  35. Lee, T.S.; Lee, D.W.; Jee, S.B.; Tommelein, I.D. Development of Knowledge Document Management System (KDMS) for sharing construction technical documents. In Proceedings of the Construction Research Congress 2005: Broadening Perspectives-Proceedings of the Congress, San Diego, CA, USA, 1 January 2005; pp. 1183–1191. [Google Scholar]
  36. Rezgui, Y. Ontology-centered knowledge management using information retrieval techniques. J. Comput. Civ. Eng. 2006, 20, 261–270. [Google Scholar] [CrossRef]
  37. Tserng, H.P.; Chang, C.H. Developing a project knowledge management framework for tunnel construction: Lessons learned in Taiwan. Can. J. Civ. Eng. 2008, 35, 333–348. [Google Scholar] [CrossRef]
  38. Nefti, S.; Oussalah, M.; Rezgui, Y. A modified fuzzy clustering for documents retrieval: Application to document categorization. J. Oper. Res. Soc. 2009, 60, 384–394. [Google Scholar] [CrossRef]
  39. Al Qady, M.; Kandil, A. Concept relation extraction from construction documents using natural language processing. J. Constr. Eng. M 2010, 136, 294–302. [Google Scholar] [CrossRef]
  40. Al Qady, M.; Kandil, A. Document discourse for managing construction project documents. J. Comput. Civ. Eng. 2013, 27, 466–475. [Google Scholar] [CrossRef]
  41. Jiang, S.; Zhang, H.; Dalian, J.Z. Research on BIM-based construction domain text information management. J. Netw. 2013, 8, 1455–1464. [Google Scholar] [CrossRef]
  42. Yu, W.; Hsu, J. Content-based text mining technique for retrieval of CAD documents. Autom. Constr. 2013, 31, 65–74. [Google Scholar] [CrossRef]
  43. Williams, T.P.; Gong, J. Construction project cost prediction using text and data mining. In Proceedings of the 14th International Conference on Civil, Structural and Environmental Engineering Computing, CC, Sardinia, Italy, 3–6 September 2013; p. 102. [Google Scholar]
  44. Chi, N.W.; Lin, K.Y.; Hsieh, S.H. On effective text classification for supporting job hazard analysis. In Proceedings of the 2013 ASCE International Workshop on Computing in Civil Engineering, IWCCE 2013, Los Angeles, CA, USA, 1 January 2013; pp. 613–620. [Google Scholar]
  45. Williams, T.P.; Katsanis, C.J.; Bedard, C. Using text mining to predict construction project cost overruns. In Proceedings of the Annual Conference of the Canadian Society for Civil Engineering 2013: Know-How-Savoir-Faire, CSCE 2013, Moncton, NB, Canada, 1 January 2013; pp. 1255–1262. [Google Scholar]
  46. Al Qady, M.; Kandil, A. Automatic classification of project documents on the basis of text content. J. Comput. Civ. Eng. 2015, 29, 63. [Google Scholar] [CrossRef]
  47. Chi, N.W.; Lin, K.Y.; El-Gohary, N.; Hsieh, S.H. Evaluating the strength of text classification categories for supporting construction field inspection. Autom. Constr. 2016, 64, 78–88. [Google Scholar] [CrossRef]
  48. Hou, X.L.; Zeng, Y.; Cheng, C.B.; Zhang, H. Application of text mining in preprocessing of illness representation information of construction project. In Proceedings of the 5th International Symposium on Project Management, ISPM 2017, Wuhan, China, 1 January 2017; Aussino Academic Publishing House: Sydney, Australia, 2017; pp. 991–997. [Google Scholar]
  49. Moon, S.; Shin, Y.; Hwang, B.G.; Chi, S. Document Management System Using Text Mining for Information Acquisition of International Construction. KSCE J. Civ. Eng. 2018, 22, 4791–4798. [Google Scholar] [CrossRef]
  50. Hassan, F.U.; Le, T. Computer-assisted separation of design-build contract requirements to support subcontract drafting. Autom. Constr. 2021, 122, 103479. [Google Scholar] [CrossRef]
  51. Zhang, J.; El-Gohary, N.M. Automated information transformation for automated regulatory compliance checking in construction. J. Comput. Civ. Eng. 2015, 29, B4015001. [Google Scholar] [CrossRef] [Green Version]
  52. Zhou, P.; El-Gohary, N. Domain-specific hierarchical text classification for supporting automated environmental compliance checking. J. Comput. Civ. Eng. 2016, 30, 2. [Google Scholar] [CrossRef]
  53. Zhang, J.; El-Gohary, N.M. Semantic NLP-Based Information Extraction from Construction Regulatory Documents for Automated Compliance Checking. J. Comput. Civ. Eng. 2016, 30, 04015014. [Google Scholar] [CrossRef] [Green Version]
  54. Zhang, J.; El-Gohary, N.M. Integrating semantic NLP and logic reasoning into a unified system for fully-automated code checking. Autom. Constr. 2017, 73, 45–57. [Google Scholar] [CrossRef]
  55. Xue, X.; Zhang, J. Part-of-speech tagging of building codes empowered by deep learning and transformational rules. Adv. Eng. Inform. 2021, 47, 1235. [Google Scholar] [CrossRef]
  56. Moon, S.; Lee, G.; Chi, S. Automated system for construction specification review using natural language processing. Adv. Eng. Inform. 2022, 51, 2. [Google Scholar] [CrossRef]
  57. Lipscomb, H.J.; Glazner, J.; Bondy, J.; Lezotte, D.; Guarini, K. Analysis of text from injury reports improves understanding of construction falls. J. Occup. Env. Med. 2004, 46, 1166–1173. [Google Scholar] [CrossRef] [PubMed]
  58. Zhu, Y.; Emre Bayraktar, M.; Chen, S.C. Application of metadata modeling to dispute review report management. J. Civ. Eng. Manag. 2010, 16, 491–498. [Google Scholar] [CrossRef] [Green Version]
  59. Elghamrawy, T.; Boukamp, F. Managing construction information using RFID-based semantic contexts. Autom. Constr. 2010, 19, 1056–1066. [Google Scholar] [CrossRef]
  60. Fan, H.; Li, H. Retrieving similar cases for alternative dispute resolution in construction accidents using text mining techniques. Autom. Constr. 2013, 34, 85–91. [Google Scholar] [CrossRef]
  61. Chi, N.W.; Lin, K.Y.; Hsieh, S.H. Using ontology-based text classification to assist Job Hazard Analysis. Adv. Eng. Inf. 2014, 28, 381–394. [Google Scholar] [CrossRef]
  62. Zhao, D.; McCoy, A.P.; Kleiner, B.M.; Smith-Jackson, T.L. Control measures of electrical hazards: An analysis of construction industry. Saf. Sci. 2015, 77, 143–151. [Google Scholar] [CrossRef]
  63. Tixier, A.; Hallowell, M.R.; Rajagopalan, B.; Bowman, D. Application of machine learning to construction injury prediction. Autom. Constr. 2016, 69, 102–114. [Google Scholar] [CrossRef] [Green Version]
  64. Goh, Y.M.; Ubeynarayana, C.U. Construction accident narrative classification: An evaluation of text mining techniques. Accid. Anal. Prev. 2017, 108, 122–130. [Google Scholar] [CrossRef] [PubMed]
  65. Mahfouz, T.; Kandil, A.; Davlyatov, S. Identification of latent legal knowledge in differing site condition (DSC) litigations. Autom. Constr. 2018, 94, 104–111. [Google Scholar] [CrossRef]
  66. Zhang, F.; Fleyeh, H.; Wang, X.; Lu, M. Construction site accident analysis using text mining and natural language processing techniques. Autom. Constr. 2019, 99, 238–248. [Google Scholar] [CrossRef]
  67. Baker, H.; Hallowell, M.R.; Tixier, A.J.P. Automatically learning construction injury precursors from text. Autom. Constr. 2020, 118, 103145. [Google Scholar] [CrossRef]
  68. Cheng, M.Y.; Kusoemo, D.; Gosno, R.A. Text mining-based construction site accident classification using hybrid supervised machine learning. Autom. Constr. 2020, 118, 103265. [Google Scholar] [CrossRef]
  69. Yu, W.D.; Chang, H.K.; Lai, C.H. A knowledge management-based engineering design system for highway design projects. Int. J. Appl. Sci. Eng. 2021, 18, 1–13. [Google Scholar]
  70. Goldberg, D.M. Characterizing accident narratives with word embeddings: Improving accuracy, richness, and generalizability. J. Saf. Res. 2022, 80, 441–455. [Google Scholar] [CrossRef]
  71. Jiang, S.; Zhang, J.; Zhang, H. Ontology-based semantic retrieval for risk management of construction project. J. Netw. 2013, 8, 1212–1220. [Google Scholar] [CrossRef] [Green Version]
  72. Lee, J.; Yi, J.S. Predicting project’s uncertainty risk in the bidding process by integrating unstructured text data and structured numerical data using text mining. Appl. Sci. 2017, 7, 1141. [Google Scholar] [CrossRef] [Green Version]
  73. Siu, M.F.F.; Leung, W.Y.J.; Chan, W.M.D. A data-driven approach to identify-quantify-analyse construction risk for Hong Kong NEC projects. J. Civ. Eng. Manag. 2018, 24, 592–606. [Google Scholar] [CrossRef]
  74. Kim, J.S.; Kim, B.S. Analysis of Fire-Accident Factors Using Big-Data Analysis Method for Construction Areas. KSCE J. Civ. Eng. 2018, 22, 1535–1543. [Google Scholar] [CrossRef]
  75. Li, J.; Wang, J.; Xu, N.; Hu, Y.; Cui, C. Importance degree research of safety risk management processes of urban rail transit based on text mining method. Information 2018, 9, 26. [Google Scholar] [CrossRef]
  76. Rupasinghe, N.K.A.H.; Panuwatwanich, K. Understanding construction site safety hazards through open data: Text mining approach. ASEAN Eng. J. 2021, 11, 160–178. [Google Scholar] [CrossRef]
  77. Faraji, A.; Rashidi, M.; Perera, S. Text Mining Risk Assessment-Based Model to Conduct Uncertainty Analysis of the General Conditions of Contract in Housing Construction Projects: Case Study of the NSW GC21. J. Arch. Eng. 2021, 27, 04021025. [Google Scholar] [CrossRef]
  78. Choi, S.J.; Choi, S.W.; Kim, J.H.; Lee, E.B. AI and text-mining applications for analyzing contractor’s risk in invitation to bid (ITB) and contracts for engineering procurement and construction (EPC) projects. Energies 2021, 14, 4632. [Google Scholar] [CrossRef]
  79. Luo, X.; Liu, Q.; Qiu, Z. A Correlation Analysis of Construction Site Fall Accidents Based on Text Mining. Front. Built Environ. 2021, 7. [Google Scholar] [CrossRef]
  80. Chen, S.; Xi, J.; Chen, Y.; Zhao, J. Association Mining of Near Misses in Hydropower Engineering Construction Based on Convolutional Neural Network Text Classification. Comput. Intell. Neurosc. 2022, 2022, 1–16. [Google Scholar] [CrossRef]
  81. Ren, R.; Zhang, J. Semantic Rule-Based Construction Procedural Information Extraction to Guide Jobsite Sensing and Monitoring. J. Comput. Civ. Eng. 2021, 35, 20. [Google Scholar] [CrossRef]
  82. Li, Z.; Ramani, K. Ontology-based design information extraction and retrieval. Artif. Intell. Eng. Des. Anal. Manuf. 2007, 21, 137–154. [Google Scholar] [CrossRef] [Green Version]
  83. Hassan, F.U.; Le, T. Automated Requirements Identification from Construction Contract Documents Using Natural Language Processing. J. Leg. Aff. Disput. Res. 2020, 12, 2. [Google Scholar] [CrossRef]
  84. Bilge, E.Ç.; Yaman, H. Research trends analysis using text mining in construction management: 2000–2020. Eng. Constr. Archit. Manag. 2021, 29, 3210–3233. [Google Scholar] [CrossRef]
  85. Zhang, J.S.; El-Gohary, N.M. Extending Building Information Models Semiautomatically Using Semantic Natural Language Processing Techniques. J. Comput. Civ. Eng. 2016, 30, 44. [Google Scholar] [CrossRef]
  86. Zhong, B.; Xing, X.; Luo, H.; Zhou, Q.; Li, H.; Rose, T.; Fang, W. Deep learning-based extraction of construction procedural constraints from construction regulations. Adv. Eng. Inform. 2020, 43, 101003. [Google Scholar] [CrossRef]
  87. Li, X.; Zhu, R.C.; Ye, H.; Jiang, C.X.; Benslimane, A. MetaInjury: Meta-learning framework for reusing the risk knowledge of different construction accidents. Saf. Sci. 2021, 140, 105315. [Google Scholar] [CrossRef]
  88. Xu, N.; Ma, L.; Liu, Q.; Wang, L.; Deng, Y. An improved text mining approach to extract safety risk factors from construction accident reports. Saf. Sci. 2021, 138, 105216. [Google Scholar] [CrossRef]
  89. Luque, C.; Luna, J.M.; Luque, M.; Ventura, S. An advanced review on text mining in medicine. Wires Data Min. Knowl. 2019, 9, 1302. [Google Scholar] [CrossRef]
  90. Xu, X.; Cai, H.B. Ontology and rule-based natural language processing approach for interpreting textual regulations on underground utility infrastructure. Adv. Eng. Inf. 2021, 48, 101288. [Google Scholar] [CrossRef]
  91. Zhao, L.L. Research on Question and Answer Research of Coal Mine Construction Safety Management Based on Knowledge Graph; China University of Mining and Technology: Beijing, China, 2022. [Google Scholar]
  92. Bo, X.; Chung, K.S. Development of an Image Data Set of Construction Machines for Deep Learning Object Detection. J. Comput. Civ. Eng. 2021, 35, 2. [Google Scholar]
  93. Liu, Y. Fine-tune BERT for Extractive Summarization. arXiv 2019, arXiv:1903.10318. [Google Scholar]
Figure 1. Research framework.
Figure 2. Process of selecting search words.
Figure 3. Process of data collection.
Figure 4. Annual trends of related publications.
Figure 5. Co-author analysis.
Figure 6. The top ten influential institutes.
Figure 7. Institute co-authorship network.
Figure 8. Country co-authorship network.
Figure 9. Reference co-citation network evolving with time.
Figure 10. Journal co-citation network.
Figure 11. Keyword co-occurrence network.
Figure 12. Timeline view of keywords.
Table 1. Supplementary search words.
| Category | Initial Search Words | Newly Added Words (Second Round) | Newly Added Words (Third Round) |
| --- | --- | --- | --- |
| Synonyms of text mining | text mining, text data mining, text analytics, knowledge-discovery in text | text-mining, text processing, text analysis | |
| Tasks of text mining | text categorization, text clustering, information extraction, information retrieval, sentiment analysis | document classification, document clustering, information extracting, information retrieving | |
| Object constraints | text, document(s) | | |
| Domain constraints | construction industry, construction project | construction industries, construction projects, building industry, construction sectors, construction sites | construction engineering, construction management |
Table 2. Keywords for literature search.
| Binding | Scope of Search | Keywords |
| --- | --- | --- |
| #1 | synonyms of text mining | “text mining” OR “text-mining” OR “text data mining” OR “text analy*” OR “knowledge-discovery in text” OR “text categorization” OR “text clustering” OR “document classification” OR “document clustering” OR “information extracti*” OR “information retriev*” OR “sentiment analysis” OR “text processing” |
| #2 | tasks of text mining and object constraints | “text” OR “document*” |
| #3 | domain constraints | “construction industr*” OR “construction project*” OR “building industry” OR “construction sectors” OR “construction sites” OR “construction engineering” OR “construction management” |
| #4 | type of literature | LIMIT-TO (DOCTYPE, “ar”) OR LIMIT-TO (DOCTYPE, “cp”) |
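As an illustration of how the bindings in Table 2 could be combined into a single search string, the following minimal Python sketch joins the terms of each binding with OR and then joins the bindings with AND. The AND combination, the function names, and the omission of any field restriction or document-type filter (#4) are assumptions made for illustration; the table itself does not state how the bindings were combined.

```python
# Term groups copied from Table 2 (#1 to #3). The helper names below are
# hypothetical, and combining the groups with AND is an assumption.
TEXT_MINING_TERMS = [
    "text mining", "text-mining", "text data mining", "text analy*",
    "knowledge-discovery in text", "text categorization", "text clustering",
    "document classification", "document clustering", "information extracti*",
    "information retriev*", "sentiment analysis", "text processing",
]
OBJECT_TERMS = ["text", "document*"]
DOMAIN_TERMS = [
    "construction industr*", "construction project*", "building industry",
    "construction sectors", "construction sites", "construction engineering",
    "construction management",
]


def or_group(terms):
    """Quote each term, join the terms with OR, and wrap them in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(*groups):
    """Join the parenthesized OR-groups with AND (assumed combination rule)."""
    return " AND ".join(or_group(group) for group in groups)


query = build_query(TEXT_MINING_TERMS, OBJECT_TERMS, DOMAIN_TERMS)
print(query)
```

Keeping each binding as a separate list makes it straightforward to add the second-round and third-round words from Table 1 without editing the query string by hand.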
Table 3. Top four most relevant journals.
| Published Journals | Number of Articles |
| --- | --- |
| Automation in Construction | 22 |
| Journal of Computing in Civil Engineering | 14 |
| Journal of Construction Engineering and Management | 9 |
| Advanced Engineering Informatics | 9 |
| Total | 54 |
Table 4. The top 10 highly cited publications.
| Citations | Title | Authors | Publication Year | Journal |
| --- | --- | --- | --- | --- |
| 151 | Semantic NLP-Based Information Extraction from Construction Regulatory Documents for Automated Compliance Checking | Zhang, J. | 2016 | Journal of Computing in Civil Engineering |
| 142 | Automated content analysis for construction safety: A natural language processing system to extract precursors and outcomes from unstructured injury reports | Tixier, A.J.P. | 2016 | Automation in Construction |
| 142 | Automating hierarchical document classification for construction management information systems | Caldas, C.H. | 2003 | Automation in Construction |
| 125 | Automated classification of construction project documents | Caldas, C.H. | 2002 | Journal of Computing in Civil Engineering |
| 124 | Construction site accident analysis using text mining and natural language processing techniques | Zhang, F. | 2019 | Automation in Construction |
| 96 | Construction accident narrative classification: An evaluation of text mining techniques | Goh, Y.M. | 2017 | Accident Analysis and Prevention |
| 96 | Textual data mining for industrial knowledge management and text classification: A business oriented approach | Ur-Rahman, N. | 2012 | Expert Systems with Applications |
| 92 | Predicting construction cost overruns using text mining, numerical data and ensemble classifiers | Williams, T.P. | 2014 | Automation in Construction |
| 91 | Integrating semantic NLP and logic reasoning into a unified system for fully-automated code checking | Zhang, J. | 2017 | Automation in Construction |
| 74 | Automated information transformation for automated regulatory compliance checking in construction | Zhang, J. | 2015 | Journal of Computing in Civil Engineering |
Table 5. Most productive authors.
| Author’s Name | Author’s Institute | Author’s Country | Number of Publications |
| --- | --- | --- | --- |
| Zhang, J. | University of Illinois at Urbana-Champaign | United States | 7 |
| Kandil, A. | Purdue University | United States | 6 |
| Rezgui, Y. | Cardiff University | United Kingdom | 5 |
| Jiang, S. | Dalian University of Technology | China | 5 |
| Al Qady, M. | Purdue University | United States | 4 |
| Lin, K.Y. | National Taiwan University | China | 4 |
| El-Gohary, N.M. | University of Illinois at Urbana-Champaign | United States | 4 |
| Hsieh, S.H. | National Taiwan University | China | 4 |
| Soibelman, L. | University of Illinois at Urbana-Champaign | United States | 4 |
Table 6. The top ten highly co-cited references.
| Frequency | Year | Author | Title of Publication | Journal |
| --- | --- | --- | --- | --- |
| 5 | 2019 | Zhang, F. | Construction site accident analysis using text mining and natural language processing techniques | Automation in Construction |
| 5 | 2017 | Zou, Y. | Retrieving similar cases for construction project risk management using natural language processing techniques | Automation in Construction |
| 5 | 2014 | Williams, T.P. | Predicting construction cost overruns using text mining, numerical data and ensemble classifiers | Automation in Construction |
| 5 | 1999 | Baeza-Yates, R. | Modern Information Retrieval | USA: Addison Wesley |
| 4 | 1999 | Eastman, C.M. | Automated code compliance checking for building envelope design | Journal of Computing in Civil Engineering |
| 3 | 2013 | Fan, H. | Retrieving similar cases for alternative dispute resolution in construction accidents using text mining techniques | Automation in Construction |
| 3 | 2017 | Goh, Y.M. | Construction accident narrative classification: an evaluation of text mining techniques | Accident Analysis and Prevention |
| 3 | 2002 | Caldas, C.H. | Automated classification of construction project documents | Journal of Computing in Civil Engineering |
| 3 | 2012 | Zhong, B.T. | Ontology-based semantic modeling of regulation constraint for automated construction quality compliance checking | Automation in Construction |
| 3 | 2014 | Chi, N.W. | Using ontology-based text classification to assist Job Hazard Analysis | Advanced Engineering Informatics |
Table 7. Clustering of frequent keywords.
| Cluster No. | Description of Cluster | Frequent Keywords |
| --- | --- | --- |
| 1 | NLP-based | automated compliance checking, BIM, information extraction, machine learning, natural language processing |
| 2 | Document-based | contract, databases, documentation, information management, sustainability |
| 3 | Big data processing | data mining, deep learning, optical character recognition, text classification, web crawling |
| 4 | Unstructured data processing | social network analysis, text mining, twitter, unstructured data |
| 5 | AI-based | artificial intelligence, information retrieval, knowledge-based systems |
| 6 | Ontology and text representation | heterogeneous data, ontology, text representation |
| 7 | Knowledge discovery | k-nearest neighbor, knowledge management, knowledge map |
| 8 | Risk management | pre-bid clarification, risk management |
| 9 | Accident prevention | accidents, safety |
Table 8. The top 10 keywords from the perspective of occurrence and total link strength.
| Rank | Keywords | Total Link Strength | Occurrence |
| --- | --- | --- | --- |
| 1 | natural language processing | 34 | 14 |
| 2 | text mining | 34 | 28 |
| 3 | text classification | 30 | 12 |
| 4 | knowledge management | 30 | 12 |
| 5 | semantics | 23 | 8 |
| 6 | information retrieval | 23 | 11 |
| 7 | safety | 22 | 8 |
| 8 | ontology | 21 | 8 |
| 9 | information management | 16 | 13 |
| 10 | BIM | 12 | 9 |
Table 9. NLP open-source tools and implemented functions.
| Country | Tool Name | Function |
| --- | --- | --- |
| Foreign | Apache OpenNLP | A Java-based machine learning toolkit that also supports maximum entropy (ME) machine learning |
| Foreign | NLTK | Lexical tagging, chunking, sentiment analysis, classification, semantic reasoning, and so on |
| Domestic | FudanNLP | Chinese word segmentation, lexical tagging, named entity recognition, text classification, news clustering, and other information retrieval functions |
| Domestic | Pkuseg | Chinese word segmentation (high accuracy) |
| Domestic | LTP | Chinese word segmentation, lexical tagging, named entity recognition, dependency syntactic analysis, semantic role annotation, and so on |
| Domestic | ICTCLAS | Chinese word segmentation, lexical tagging, text clustering, sentiment analysis, summary entities, code-switching, and so on |
Table 10. Feature generation methods and introduction.
| Method | Description | Feature |
| --- | --- | --- |
| BOW/VSM | A discrete representation of text, where text objects are numerically represented as discrete features | The model is easy to understand and simple to implement, but it considers only the words present in the text, not their order, and leaves a semantic gap |
| TF/TF-IDF | If a word or phrase appears frequently in one text and rarely in others, it is considered to discriminate well between categories and is suitable for classification | Supports frequency-based information retrieval and text classification, but does not consider the semantic information of words |
| LSA/LDA | This approach focuses on the text vocabulary, counting the relationships between text topic words to achieve a textual representation | A continuous textual representation: text objects are modelled numerically based on their context, taking the connections between words into account |
| Pre-trained language models | A neural network model that obtains semantic similarity by considering surrounding terms and embedding them in a high-dimensional vector space (e.g., Word2Vec, ELMo, Transformer, ERNIE, and BERT) | A continuous representation based on neural networks, in which the model automatically learns lexical, syntactic, and semantic features, and effectively combines text representation learning with subsequent natural language processing tasks |
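The TF-IDF row of Table 10 can be made concrete with a small worked example. The sketch below is a minimal pure-Python illustration, assuming the unsmoothed weighting tf(t, d) × log(N / df(t)); the toy accident-style sentences and the function name `tf_idf` are hypothetical, and library implementations typically apply different smoothing and normalization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: relative term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    ["worker", "fell", "from", "scaffold"],
    ["worker", "struck", "by", "crane"],
    ["worker", "fell", "from", "ladder"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus, "worker" occurs in every document and therefore receives a weight of zero, whereas "scaffold" occurs in only one document and receives the highest weight in the first document, which matches the category-differentiation idea described in the table.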
Table 11. Training model and introduction.
| Models | Advantages | Disadvantages |
| --- | --- | --- |
| SVM | Performs well on small-sample classification; the model is not very computationally intensive, and generalization accuracy is relatively high | Sensitive to parameter tuning and kernel function selection; takes up more memory and running time in storage and computation; deficient in large-scale sample training |
| CRF | Seen as the dominant model for named entity recognition; internal and contextual feature information can be used in the annotation process | Slow convergence and long training time |
| CNN | Automatic feature extraction by convolution kernels | Cannot capture long-distance dependencies; ignores the relationship between the local and the whole |
| RNN | Specializes in sequences with timing information and is able to capture the hidden relationships between sequence units | Long input sequences are prone to vanishing or exploding gradients |
| Bi-LSTM | Network structure with memory; retains information across the full sentence | Less suited to parallel computing than CNN |
| Self-Attention | Learns the internal structure of a sentence by calculating dependencies directly, regardless of the distance between words, and is relatively simple to implement | Not listed |
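The self-attention row of Table 11 states that dependencies are calculated directly, regardless of the distance between words. The following minimal sketch illustrates this with scaled dot-product self-attention in pure Python. It is a simplification under stated assumptions: the learned query, key, and value projections and multiple heads are omitted (the input vectors serve as queries, keys, and values), and the two-dimensional token vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights).

    Returns the attended vectors and the attention weight matrix.
    """
    dim = len(x[0])
    outputs, all_weights = [], []
    for query in x:
        # Similarity of this token to every token, at any distance.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in x
        ]
        attn = softmax(scores)
        all_weights.append(attn)
        # Weighted average of all token vectors.
        outputs.append([
            sum(a * value[i] for a, value in zip(attn, x))
            for i in range(dim)
        ])
    return outputs, all_weights


# Hypothetical token vectors: tokens 0 and 2 are identical, token 1 differs.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, attn_weights = self_attention(tokens)
print(attn_weights[0])
```

Token 0 assigns the same weight to itself and to token 2, two positions away, and a lower weight to the adjacent token 1, so the weights depend on content similarity rather than on distance.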
Table 12. Statistical results of articles by category and task.
| Construction Domains | Task | Data Source | Reference |
| --- | --- | --- | --- |
| Document Management | Knowledge extraction; Knowledge search; Classification/clustering | Construction project documents, contract documents, engineering design plans, construction process documents, post-project review documents, tender documents, statements of work | [10,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50] |
| Automated Compliance Checking | Compliance checking | Building/construction specification documents | [51,52,53,54,55,56] |
| Security Management | Accident analysis | Accident reports, summary reports of disaster investigation | [22,57,58,59,60,61,62,63,64,65,66,67,68,69,70] |
| Risk Management | Factor analysis; Risk forecast | Contractual texts, work reports | [23,30,71,72,73,74,75,76,77,78,79,80] |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Xu, N.; Zhou, X.; Guo, C.; Xiao, B.; Wei, F.; Hu, Y. Text Mining Applications in the Construction Industry: Current Status, Research Gaps, and Prospects. Sustainability 2022, 14, 16846. https://doi.org/10.3390/su142416846
