Exploring Natural Language Processing in Construction and Integration with Building Information Modeling: A Scientometric Analysis

The European Union (EU) aims to increase the efficiency and productivity of the construction industry. The EU suggests pairing Building Information Modeling (BIM) with other digitalization technologies to seize the full potential of the digital transition. Meanwhile, industrial applications of Natural Language Processing (NLP) have emerged, and the growth of NLP is affecting the construction industry. However, the potential of NLP, and of a combined NLP and BIM approach, is still unexplored. The study addresses this gap by applying a scientometric analysis to explore the state of the art of NLP in the AECO sector and the combined applications of NLP and BIM. Science mapping is used to analyze 254 bibliographic records from the Scopus database, examining the structure and dynamics of the domain by drawing a picture of the body of knowledge. NLP in AECO, and its pairing with the BIM domain and applications, are investigated by representing the conceptual, intellectual, and social structure. The highest number of NLP applications in AECO are in the fields of Project, Safety, and Risk Management. Attempts at combining NLP and BIM mainly concern Automated Compliance Checking and semantic BIM enrichment. Artificial intelligence, learning algorithms, and ontologies emerge as the most widespread and promising technological drivers.


European Union Digitalization Strategy
The digitalization of the European market is one of the main objectives set by the European Union (EU). The digitalization of industrial sectors and production processes aims to maximize the efficiency and potential growth of the digital economy in the European common market. The EU digitalization strategy regards the Architecture, Engineering, Construction, and Owner-operated (AECO) industry, in particular, as one of the pillars of the EU economy [1]. However, the AECO sector is adopting Digital Technologies and embracing Digital Innovation slowly compared to other industrial sectors (e.g., manufacturing and telecommunication) [2]. To seize the full potential of digitalization of the construction sector, the EU Commission recommends combining Building Information Modelling (BIM) with other digitalization technologies [1]. In the last decade, BIM has become widespread in the AECO industry [3,4]. BIM refers to the "use of a shared digital representation of a built asset to facilitate design, construction and operation processes to form a reliable basis for decisions" [5].

Document-Based and Model-Based Approaches
The construction process deals with several different and complex forms of information that are exchanged and modified by the actors involved, and much of it is captured, exchanged, and delivered using documents [6]. In AECO projects "Documents are interfaces, used to access and navigate through collections of information" [7]; thus, the sector can be defined as a document-centric industry [8,9]. As a consequence, a huge amount of unstructured data and information is produced and shared via natural language [6,7,10], in the form of documents and reports that require specific techniques to be processed and digitally managed [11]. On the other hand, the adoption of BIM methodology tends to shift the sector toward a model-based approach, which is focused on the development and exchange of digital artifacts and models. Despite the widespread use of BIM approaches, the AECO information flow is still mainly based on the production and exchange of documents [8,12,13]. Human natural language, written or spoken, is pervasive and the most communicative way to define and share knowledge. However, natural language is unstructured per se and difficult to manage digitally [14]. Unstructured sources of information, such as text documents, are still essential components of design and construction projects [8]. The adoption of BIM in the AECO industry is, in fact, not a sufficient condition to leverage the value of BIM data and information [15].

Seizing the Full Potential of Digitalization: Pairing BIM with NLP Technology
As stated above, to seize the full potential of digitalization of the construction sector, it is necessary to combine BIM with other digitalization technologies [1]. Since the construction industry is an information-intensive sector, based on the transmission of textual documents [6,8], Natural Language Processing (NLP) can be applied to overcome the document-based nature of the sector. NLP is an interdisciplinary field that aims to process natural human languages using computers [16]. Also known as computational linguistics, it combines computer science and linguistics and is a sub-field of Artificial Intelligence (AI). It is defined as the scientific and engineering discipline concerned with understanding written and spoken language from a computational perspective. It aims to represent human language through a formal and machine-readable language [17]. Information expressed in a formal and machine-readable form can be processed, queried, and retrieved by computers, similarly to how alpha-numerical parameters and information are managed via BIM methods and tools. Consequently, the application of NLP in the construction industry has the potential to enhance and optimize the information flow, thus supporting an effective and efficient management of construction projects [18].

Goal Setting and Article Structure
The proposed study investigates the knowledge domain of NLP studies and applications in the AECO domain, including the identification and analysis of possible links and integration between BIM and NLP methods, through science mapping and data visualization techniques. Science mapping makes it possible to depict a picture of the body of knowledge.
The ultimate goal of NLP is the design of systems able to mimic human-like abilities in dialogue and in acquiring knowledge from human language and text [20]. In general, NLP techniques can be applied to convert unstructured sources of data into machine-readable and processable data and information. In this way, computers can be used to explore and manipulate natural language text or speech [24,25].

NLP History and Evolution: From Rule-Based to Pre-Trained Models
The first revolution in traditional linguistic concepts coincided with the publication of the book "Syntactic Structures" by Noam Chomsky in 1957 [26]. In it, Chomsky theorized that, to allow a machine to understand natural language, the structure of the sentence itself must be changed. To this end, Chomsky proposed a language for the translation of natural language sentences into machine language [26]. In 1964, ELIZA, the first rudimentary chatbot in history, was born. ELIZA was designed to imitate the responses of a Rogerian psychotherapist [27]. However, after twelve years of research in the field, the results obtained through NLP were not comparable in quality and cost-effectiveness to those produced manually by humans. At the end of the 1960s, research on artificial intelligence applied to natural language processing was abandoned for at least 10 years, until the early 1980s. The new phase was characterized by the use of new concepts and the abandonment of previous theories: there was a shift from the so-called rule-based approach to an approach based on statistical models and text corpora, supported by increasing computational power and the rise of machine learning algorithms. The 1980-1990 decade is known as the period of the statistical NLP revolution [28]. At the beginning of the 2000s, the first neural language model based on Recurrent Neural Networks (RNN) was proposed [29]. An artificial neural network (ANN) is a nonlinear model that mimics the neural structure of the human brain in a biologically inspired way [30]. The model is capable of learning to perform different tasks. An artificial neural network is based on artificial neurons (processing elements) and is organized into three interconnected parts: an input layer, a hidden part composed of more than one layer, and an output layer [31].
Deep learning NLP frameworks have been shown to perform better than most previous state-of-the-art approaches in several NLP tasks [32]. The deep learning NLP approach relies on Convolutional Neural Networks (CNNs) and Recurrent or Recursive Neural Networks (RNNs).
In summary, the evolution of NLP from 1970 to 2010 can be broken down into three main phases:
• Rule-based systems: systems based on complex sets of manually written rules. Pros: the system has a high level of interpretability. Cons: it is neither accurate nor flexible. A rule-based system is too deterministic to manage noisy and ambiguous text data, since human language is per se error-prone and incomplete.
• Statistical inference systems: systems based on statistical models. Pros: statistical NLP affords rapid prototyping, and the model is semi-automatically constructed from linguistically annotated resources; for that reason, these systems are cheaper than rule-based systems [33]. Cons: statistical systems are robust, which means that an output is always produced regardless of the quality of the input; consequently, these systems require a more careful analysis of the quality of the input [34].
• Deep learning approach: systems based on deep learning algorithms and neural networks. Pros: they can efficiently manage sparse and unstructured learning data, respecting the complexity, articulation, and multidimensionality of human language; furthermore, they can solve most non-trivial NLP problems. Cons: the models have low explainability, since there is no way to investigate and explain the structure of the network after training. This phenomenon is called the black-box effect [35]. Moreover, one of the biggest issues of the deep learning approach is the shortage of training data, since these models require a huge amount of data to be trained [36].

Latest Developments: Contextual Pre-Trained Models, the Transformers Mechanism
A subset of language models, namely the pre-trained models, were developed to overcome the shortage of training data typical of the deep learning approach. In addition, language modeling is believed to be one of the main challenges in several NLP tasks. Natural language modeling is effectively addressed by such pre-trained models [37]. In fact, pre-trained models are general purpose language models trained using online text corpora (e.g., Wikipedia): such a technique is defined as pre-training [38]. Pre-Trained Models on large corpora can learn universal language representations, avoiding training a new model from scratch. General pre-trained models can then be fine-tuned for specific NLP tasks: this technique is called transfer-learning [39].
Pre-trained language model representations can be context-free or contextual, and contextual representations can be unidirectional or bidirectional. Context-free models do not take into account the words near a given word. On the other hand, contextual models generate a representation of each term considering the other words in the sentence by relating the meaning of a word with the entire sentence. The importance of bidirectional pre-training for language representations has been widely demonstrated [36]. In late 2018, Google released BERT (Bidirectional Encoder Representations from Transformers), a new technique for contextual pre-training. The BERT algorithm is based on Transformers, a type of neural network architecture optimized for processing texts that learns contextual relations between words. BERT and, in general, Transformers-based models, are currently the state of the art for several NLP tasks, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks [36]. Transformers-based pre-trained language models can represent the characteristics of word usage, such as syntax and how words are used in various contexts [40]. A list of the main Transformers-based pre-trained language models is provided as follows:
• BERT (Bidirectional Encoder Representations from Transformers);
• ULMFiT (Universal Language Model Fine-Tuning);
• OpenAI's GPT-2 and GPT-3 (Generative Pre-Trained Transformer).

Methodology
The research methodology is structured into the following phases: (I) science mapping methods and tools selection; (II) query methods and criteria; (III) data cleaning; (IV) scientometric analysis; (V) analysis and discussion of the results (Figure 1).

Science Mapping Methods and Tools Selection
The study proposes a scientometric literature review based on data visualization. Science mapping methods and tools are applied to analyze the current scientific literature on NLP in the AECO field and on combined NLP and BIM applications. The purpose of science mapping is the analysis and visual description of a scientific knowledge domain. In order to represent a specific knowledge domain, a collection of intellectual contributions should be gathered and analyzed [41]. Significant patterns and trends in the scientific literature and bibliographic data can be uncovered by science mapping. Scientometric methods include: longitudinal and cross-temporal trends, keyword co-occurrence analysis, co-citation and co-authorship analysis [42], document co-citation analysis [43], and other analyses. Visualization techniques include network visualization [44], and visualizations of temporal and geo-localization structures [45]. Metrics and indicators of research impact are also considered [46].

An analysis of the main science mapping tools has been conducted. Each tool has its own limitations and strengths; therefore, an analysis and a comparison among tools are necessary. Studies that compare science mapping tools have already been performed [47,48]. Specifically, Moral-Muñoz et al. provide a complete overview and comparison of the features of the main science mapping tools, as summarized in Table 1. BiblioShiny [49], VosViewer, and Gephi are identified as the most suitable tools for the scientometric analysis. BiblioShiny incorporates all the analyses that the other tools perform separately. Furthermore, it produces multiple visualizations and graphs directly from the web-based interface. The interface menu follows the science mapping analysis workflow and, thus, coding skills are not needed. VosViewer [50] has fewer features than BiblioShiny; however, it produces informative visualizations of network relationships. Therefore, the scientometric analysis is performed using both BiblioShiny and VosViewer. Gephi is used to calculate specific metrics such as the degree centrality value, which measures the relative influence of a keyword upon the other keywords.
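To make the degree centrality metric concrete, the following is a minimal sketch in Python. It assumes the common normalized definition (number of distinct neighbours divided by the number of other nodes); Gephi's own output may be scaled differently, and the keyword links below are hypothetical, not taken from the data set.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return normalized degree centrality for an undirected keyword graph.

    edges: iterable of (keyword_a, keyword_b) pairs.
    The score of a node is its number of distinct neighbours divided by (n - 1),
    where n is the number of nodes in the graph.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical keyword links, for illustration only.
example_edges = [
    ("NLP", "BIM"),
    ("NLP", "Ontology"),
    ("NLP", "Project management"),
    ("BIM", "Ontology"),
]
centrality = degree_centrality(example_edges)
```

In this toy graph the keyword linked to every other keyword obtains the maximum score, which is the sense in which the metric expresses relative influence.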

Query Methods and Criteria
The research was conducted at the end of December 2020, following a systematic literature review method. Specific criteria were established before the search phase: the search was restricted to full-text articles in English published and stored in the Scopus Database (DB) only. The most reputable scientific DBs available are Web of Science (WoS Core Collection) and Scopus. Both DBs are recognized as the most complete and reliable data sources in several scientific fields [51–53]. The two DBs show overlaps in publications and bibliometric data. However, Scopus has a larger coverage of scientific production than WoS. Moreover, Scopus has a faster indexing process than WoS [54]. For these reasons, recent publications can be retrieved in Scopus, improving the scientometric analysis with more up-to-date data. As stated, bibliometric data from the two DBs are strongly related. In addition, Scopus allows individual researchers to be distinguished more accurately through citation count and h-index [55]. It has also been demonstrated that there are no significant differences in the bibliometric analysis results obtained from the two DBs [56]. For the reasons given above, the proposed scientometric analysis is based on bibliometric data gathered from Scopus only. Consequently, the study does not merge data from WoS, Scopus, or other databases. This choice does not affect the validity of the scientometric investigation, as explained above. Moreover, many previous scientometric studies have been based on Scopus, and Scopus has been recognized as a better choice than Web of Science for interdisciplinary research topics such as NLP in AECO and NLP and BIM [57,58].
A list of keywords to query the DB has been defined, which allowed the selection of a sample of publications and the related bibliometric metadata corresponding to the boundaries of the knowledge domain of NLP in AECO, and BIM and NLP combined applications, as detailed in Figure 2. Keywords have been selected from previous related scientometric studies [59]. Boolean operators and wild cards are used to compose a keyword string to query the Scopus DB. Wild cards are shortcut characters (i.e., the asterisk *) which allow the inclusion of spelling variations and derivatives of the keywords without having to type each one individually. The string used to collect the data from the DB is as follows:
• ("Civil engineering" OR "Construction engineering" OR "Architectural engineering" OR "Construction industry" OR "Construction management" OR "Construction sector" OR "BIM" OR "Building information model*") AND ("Natural Language Processing" OR "NLP" OR "Text mining" OR "Computational linguistic" OR "Information retrieval" OR "Text analy*").
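The structure of the query string above (two OR-groups joined by AND) can be reproduced programmatically, which helps keep the keyword lists editable. The sketch below only assembles the Boolean expression; the helper name `or_group` is an illustrative assumption, and no Scopus field restriction is added because the text does not state one.

```python
def or_group(terms):
    """Quote each term, join the terms with OR, and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


# Keywords copied from the query string reported in the text.
domain_terms = [
    "Civil engineering", "Construction engineering", "Architectural engineering",
    "Construction industry", "Construction management", "Construction sector",
    "BIM", "Building information model*",
]
nlp_terms = [
    "Natural Language Processing", "NLP", "Text mining",
    "Computational linguistic", "Information retrieval", "Text analy*",
]

# The AECO/BIM group and the NLP group must both be matched.
query = or_group(domain_terms) + " AND " + or_group(nlp_terms)
```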

The first query string represents the AECO field and the BIM subtopic. The keyword list has been defined based on previous review studies on BIM and AECO topics [53,60–62]. The second query string represents the NLP topic. The most common synonyms of NLP are used to collect an adequate number of publications. Keywords are again selected based on previous studies on the topic of NLP [63]. The first set of articles has been filtered by subject area, excluding the knowledge fields not related to AECO. A set of 254 publications has been identified, and all the bibliographic data necessary for the analysis have been downloaded from the Scopus DB.

Data Cleaning
Before running the analyses, similar keywords and synonyms have been normalized by merging different variants of the same keyword (Table 2). The lexical variants and synonyms of BIM, NLP, and construction sector topics have been merged into a single term to clean the dataset from noisy data. On the other hand, keywords not belonging to those topics have not been modified to preserve the heterogeneity of the sample and to better represent the complexity of knowledge related to the main topics.
The data cleaning activity has been performed by two of the co-authors of the paper to improve the normalization of synonyms and lexical variants of the keywords. The data set collected and cleaned has been analyzed through BiblioShiny to provide the main descriptive information about the data sample (Table 3).
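The normalization step described above can be pictured as a lookup from lexical variants to one canonical term, leaving all other keywords untouched. The following is a minimal sketch under that assumption; the mapping entries are hypothetical examples, whereas the merges actually applied in the study are those reported in Table 2.

```python
# Hypothetical variant-to-canonical mapping, for illustration only.
CANONICAL = {
    "building information modeling": "BIM",
    "building information modelling": "BIM",
    "building information model": "BIM",
    "bim": "BIM",
    "natural language processing": "NLP",
    "natural language processing systems": "NLP",
    "nlp": "NLP",
}


def normalize_keywords(keywords):
    """Map known variants to a canonical term and drop resulting duplicates.

    Keywords without an entry in CANONICAL are kept as written, which
    preserves the heterogeneity of the remaining vocabulary.
    """
    seen = set()
    result = []
    for keyword in keywords:
        key = " ".join(keyword.lower().split())  # case- and whitespace-insensitive lookup
        term = CANONICAL.get(key, keyword.strip())
        if term not in seen:
            seen.add(term)
            result.append(term)
    return result


record = ["Building Information Modelling", "BIM", "Natural language processing", "Ontology"]
cleaned = normalize_keywords(record)
```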

Results and Discussion
The results and discussion section is divided into sub-paragraphs, each containing a brief description of the scientometric task performed, the results obtained, and the related discussion.

First Application and Annual Scientific Production Trend
The first NLP application in the AECO field appeared in 1989 with an article titled "Knowledge Processing for Construction Management Data Base," published in the Journal of Construction Engineering and Management. It should be noted that the 1980-1990 decade is known as the period of the statistical NLP revolution [28]. In the wake of the statistical revolution, NLP statistical systems, which were cheaper and more flexible than the previous rule-based systems [33], were developed and tested in industry during the decade. The authors, Logcher et al., aimed to design a database query system to help construction managers retrieve useful information to support the decision-making process [64]. In their system architecture, the authors proposed a language analyzer (or natural language processor) to facilitate information retrieval and access by allowing the user to query the database in near-natural language. The natural language processor can be considered the first rudimentary application of NLP systems in the construction industry.
Temporal data show that the research topic has been around for 31 years, with an average Annual Growth Rate of 4.71%. Figure 3 shows the temporal trend of the research topic from 1989 to 2020. The research production on the topic is characterized by several fluctuations. However, the graph shows an upward trend throughout the years, with a sizable increase in research production in more recent years. A first increase in scientific production can be seen around 1997, a second around 2005, and a third around 2011, with a clear reduction in the number of publications in 2014 and a subsequent steady and gradual increase of interest in the research community from 2015 onward. The concurrent need for AECO to manage unstructured data, in order to obtain useful insights to support the decision-making process, and the recent applications of NLP for knowledge acquisition and information retrieval, can be a factor in the rising interest in the topics, as also stated by Bilal et al. 2016 [65].
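The text does not state how the Annual Growth Rate is computed. A common choice is a compound annual growth rate, and the sketch below illustrates that formula under this assumption. The publication counts in the example are hypothetical and are not the values behind the 4.71% reported above.

```python
def annual_growth_rate(first_count, last_count, n_years):
    """Compound annual growth rate, in percent, over n_years yearly steps.

    Assumption: growth is compounded once per year between the first and
    the last yearly publication count.
    """
    if first_count <= 0 or n_years <= 0:
        raise ValueError("first_count and n_years must be positive")
    return ((last_count / first_count) ** (1 / n_years) - 1) * 100


# Hypothetical counts: 1 paper in the first year, 4 papers 31 years later.
rate = annual_growth_rate(1, 4, 31)
```

A compound rate depends only on the first and last yearly counts, so it does not reflect the intermediate fluctuations visible in the temporal trend.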

Average Citation per Year Trend
In the collected data set, one or more articles published in 2015 gather the highest average number of total citations per year, as shown in Figure 4. The trend of average citations per year does not seem to match the trend of scientific production, showing a positive fluctuation in 2015 and a steady decrease towards 2020. The misalignment between the two trends can be caused by the high degree of innovativeness of the NLP theme in the construction sector, which is investigated by a limited number of research groups. Moreover, the analysis of the size and degree of collaboration between researchers, reported in detail in Section 4.5.3, shows the presence of small research groups with a small network of relationships. Small size and a limited number of collaborations could be the causes of the low impact on the scientific community in terms of citations.
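As an illustration of the indicator discussed above, the sketch below computes, for each publication year, the mean citations per article divided by the number of years the articles have been citable. This definition is an assumption, since the text does not give the exact formula used by the tool, and the records are hypothetical.

```python
from collections import defaultdict


def mean_citations_per_year(records, reference_year):
    """Average total citations per article, per citable year, by publication year.

    records: iterable of (publication_year, total_citations) pairs.
    Assumption: citable years = reference_year - publication_year, at least 1.
    """
    totals = defaultdict(lambda: [0, 0])  # year -> [citation sum, article count]
    for year, citations in records:
        totals[year][0] += citations
        totals[year][1] += 1
    result = {}
    for year, (citation_sum, n_articles) in sorted(totals.items()):
        citable_years = max(reference_year - year, 1)
        result[year] = (citation_sum / n_articles) / citable_years
    return result


# Hypothetical (publication year, total citations) pairs.
sample = [(2015, 100), (2015, 50), (2019, 6)]
trend = mean_citations_per_year(sample, reference_year=2020)
```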

Conceptual Structure Analysis: Key Research Patterns, Affinity, and Links
The term "conceptual structure" refers to the graphical representation of relations among concepts (keywords or words) in a sample of publications [66]. The conceptual structure of a set of documents can be investigated using a network visualization (e.g., co-words network, co-occurrence keywords network). Network visualization helps to understand the topics covered by a research field, defining the most important and recent topics, the so-called research front [67]. Plotting metadata related to the publication period similarly allows studying the evolution and the changes of a subject over such a period.
A similar approach to network analysis is factorial analysis. Factorial analysis is a data reduction technique which helps to identify subfields of the major topics. It relies on dimension reduction algorithms (e.g., Correspondence Analysis (CA), Multiple Correspondence Analysis (MCA), Principal Component Analysis (PCA)) [68]. The factorial analysis approach reduces the dimensionality of the data, that is, the number of attributes/variables represented in a dataset, so that the dataset can be represented in a lower-dimensional space.
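To illustrate what representing a dataset in a lower-dimensional space means, the sketch below applies PCA, one of the algorithms named above, to a small keyword-by-document matrix and keeps two components. It is a pure-Python illustration using power iteration with deflation, not the procedure implemented by the tools used in the study, and the matrix is hypothetical.

```python
import math


def _mat_vec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def _leading_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    d = len(matrix)
    vec = [float(i + 1) for i in range(d)]  # arbitrary non-uniform start vector
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = _mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # no variance left in this matrix
            return 0.0, vec
        vec = [x / norm for x in nxt]
    eigenvalue = sum(a * b for a, b in zip(vec, _mat_vec(matrix, vec)))
    return eigenvalue, vec


def pca_2d(rows):
    """Project rows onto their two leading principal components.

    Returns the 2D coordinates of each row and the variance along each component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the d variables.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    val1, vec1 = _leading_eigenpair(cov)
    # Deflation: remove the variance already explained by the first component.
    deflated = [[cov[i][j] - val1 * vec1[i] * vec1[j] for j in range(d)]
                for i in range(d)]
    val2, vec2 = _leading_eigenpair(deflated)
    coords = [(sum(c[j] * vec1[j] for j in range(d)),
               sum(c[j] * vec2[j] for j in range(d))) for c in centered]
    return coords, (val1, val2)


# Hypothetical matrix: one row per keyword, one column per document (1 = keyword present).
keyword_by_document = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
coords, variances = pca_2d(keyword_by_document)
```

Each keyword is thus described by two coordinates instead of four, and keywords with similar document profiles end up close to each other on the resulting plane.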
This study adopts a mixed approach to investigate the bibliometric data sample; the methodology adopted is summarized in Figure 5. The analysis starts providing conceptual networks (co-occurrence keyword and temporal overlay networks), after which networks are dimensionally reduced using factorial analysis and the related bi-dimensional matrixes are plotted. The x and y axes of the bi-dimensional graph are functions of the centrality and density of the network graphs themselves. The adopted mixed approach allows representing the several subfields and the thematic evolution of the main topics. Buildings 2021, 11, x FOR PEER REVIEW 12 of 35

Co-Occurrence Keywords Network Maps
To perform a scientometric analysis and visualize data via science mapping, VosViewer was chosen. VosViewer is used to analyze bibliometric network data; in particular, the study investigates the co-occurrence relations between authors' keywords [69]. Co-occurrence is an above-chance frequency of occurrence of two terms from a text corpus alongside each other in a certain order. Co-occurrence in the linguistic sense can be interpreted as an indicator of semantic proximity among topics [70]. Semantic proximity itself can be visualized in a co-occurrence map to uncover main research interests and topics, as well as their relationships. The keyword network represents the investigated knowledge domain and how the different keywords are interconnected [71]. The analysis performed in VosViewer was set as follows:
• Analysis type: co-occurrence; the relatedness of items (keywords) is determined based on the number of documents in which they occur together;
• Unit of analysis: authors' keywords;
• Counting method: full counting, meaning that each co-occurrence link has the same weight;
• Threshold: the minimum number of occurrences of a keyword is 6; of the 1936 initial keywords, 74 meet the threshold and are graphically visualized.
A keywords co-occurrence network was produced (Figure 6). The circles represent the keywords, divided into four major clusters (red, blue, yellow, and green) and a minor cluster (purple), and the lines represent the relations among keyword nodes. As stated, lexical variants and synonyms were merged during the data cleaning activity and generic keywords were omitted (i.e., Buildings, Research, User interfaces, Computer software, Documentation, Managers, Expert systems, Visualization, Websites, Engineering research, Design/methodology/approach). The network is composed of 74 nodes divided into five clusters connected via 1340 relation links.
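The co-occurrence settings described above can be illustrated with a minimal Python sketch. It is not the VosViewer implementation: the function name `cooccurrence_network`, the toy records, and the threshold of 2 are assumptions chosen for illustration (the study uses a threshold of 6 on 1936 keywords). Each document in which two retained keywords appear together adds one unit to the weight of their link.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(records, min_occurrences):
    """Return (kept keywords, link weights) for a keyword co-occurrence network.

    records: list of keyword lists, one list per document (hypothetical input).
    min_occurrences: minimum number of documents a keyword must appear in.
    """
    # Count the number of documents in which each keyword occurs.
    occurrences = Counter()
    for keywords in records:
        occurrences.update(set(keywords))

    # Keep only the keywords that meet the occurrence threshold.
    kept = {k for k, n in occurrences.items() if n >= min_occurrences}

    # Each document adds 1 to the link between every pair of kept keywords.
    links = Counter()
    for keywords in records:
        present = sorted(set(keywords) & kept)
        for a, b in combinations(present, 2):
            links[(a, b)] += 1
    return kept, links


# Hypothetical toy data, not taken from the analyzed Scopus records.
records = [
    ["bim", "nlp", "ontology"],
    ["bim", "ontology", "ifc"],
    ["nlp", "safety"],
    ["bim", "nlp"],
]
kept, links = cooccurrence_network(records, min_occurrences=2)
for pair, weight in sorted(links.items()):
    print(pair, weight)
```

In this toy case, "ifc" and "safety" occur only once and are dropped, which mirrors how the threshold reduces the initial keyword set to the nodes that are visualized.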
The co-occurrence network shows the presence of five clusters. The most influential, the red cluster, represents the main applications of NLP and BIM in the AECO industry. The fields with the most applications are Project management and Construction management, the latter closely related to the Information management field. The use of Information Technology (IT) is also highlighted, as well as tools and methods for the implementation of IT in the construction field, such as Database Systems, Computer Simulation, Data Processing, and Virtual Reality.
The blue cluster represents the BIM-related field. The BIM bubble is strongly connected with the Architectural Design theme. The design phase seems to be the phase with the largest number of independent BIM and NLP applications. The keyword Ontology also belongs to the blue cluster. Ontology is a Semantic Web format, and it can be considered as the common and shared vocabulary by which knowledge can be represented [72]. Ontologies seem to be the most promising way to solve the interoperability issue among heterogeneous BIM authoring software applications by making information systems universally accessible and achieving semantic interoperability [73].
The potential of ontology to bring the BIM approach to the semantic web, thus enhancing the interoperability and supporting the collaboration among actors, is widely recognized [74,75]. Several studies have been conducted in this direction with applications in the built environment field, such as: scheduling, cost management and estimation [76,77], smart homes and intelligent environment [78], BIM-based approach [79], construction knowledge management [80], project collaboration and information exchange [81], facility management [82], property management [83], building design [84,85], construction code compliance and conformance checking [86], and building energy efficiency [87].
The Ontology bubble, being a method to enhance collaboration and information sharing, is connected to the IFC term. IFC (Industry Foundation Classes) is an open data model and a digital description of the built asset industry. IFC aims to standardize Building Information Model (BIM) data that are exchanged and shared among software applications used by the several actors of a design, construction, and facility management process [88]. In light of this, the knowledge management and interoperability keywords also belong to the blue cluster.
The yellow cluster represents the semantic technologies topic; NLP can be described as a semantic technology itself. The main application fields are visualized, such as Automated Compliance Checking (ACC) and the semantic enrichment of the BIM approach, i.e., the quality, accessibility, and interpretation of the information stored in BIM models [89].
The green cluster shows the main fields of application of NLP systems in the construction industry, such as risk management [90,91] and risk assessment [92], and safety management and safety engineering for accident prevention [93][94][95]. The main tools and methods to perform NLP tasks are also visualized in the graph, the most prominent of which are artificial intelligence, data and text mining, and learning algorithms with their variants (machine learning and deep learning). As stated in Section 1.2, deep learning algorithms achieve the best performance in several NLP tasks [32] and, for that reason, are widely used and thus highlighted in the graph.
The green Natural Language Processing cluster is close to the blue BIM topic and connected to the yellow cluster of semantic technologies. The three topics (BIM, Semantic, and NLP) seem to be strongly linked and interconnected. This closeness can be explained by the ability of NLP systems to process natural language, which is itself semantic information, and translate it into a machine-understandable format such as ontologies, which are widely investigated, with various applications, to support interoperability between BIM systems, with a focus on semantic interoperability. From this perspective, NLP, which is a semantic technology, and BIM enriched with semantic information can both be considered drivers leading the industry towards the digital transition by bringing the sector into the Semantic Web [96]. The Semantic Web is, in fact, a machine-processable approach supporting universal information exchange understandable by both machines and humans working in cooperation [97]. As investigated by Pauwels et al., scientific research shows a clear tendency to investigate and use Semantic Web technologies to solve the interoperability issue of AECO, supporting the digital transition of the industry [98].
In summary, BIM, NLP, the Semantic topic, and their intersections are all part of the transition process towards the implementation of the Semantic Web, which aims to fully digitalize the AECO sector (Figure 7).

Co-Occurrence Keywords Temporal Overlay Network Maps
VosViewer also allows overlaying temporal meta-data regarding publication years of the articles related to the keywords displayed in the graph. A temporal overlay data map is provided in Figure 8.
The cluster representing the main fields of application of NLP and BIM in the construction sector is the most dated, with keywords dating back to the beginning of 2000. The very first attempts to apply information technology in the construction sector date back to 1998. NLP, BIM, and Semantic topics clusters gather the most recent keywords with an average publication year of 2015. The timespan of the keywords of the green, blue, and yellow clusters covers a range of 10 years from 2009 to 2019. Table 4 shows the average publication years of the four main clusters considering each keyword's publication years.


Centrality Node Measurement
The centrality of a node, which corresponds to a keyword, represents the importance of the topic in the research domain analyzed. In other words, centrality allows ranking the keywords with a simple and direct approach [99]. In this study, centrality is measured by computing the Degree Centrality (DC), which represents the number of links that a keyword has with the other keywords of the network, giving a measure of the influence of a keyword upon the others [100]. The main research interests have been ranked based upon the DC: the influence and importance of a keyword within the network graph is proportional to its DC value. An additional centrality metric, the Betweenness Centrality (BC), was calculated for the cases where two nodes had the same DC value. BC captures how often a node lies between others, i.e., the number of times a node acts as a bridge along the shortest path between two other nodes; the highest values identify the most influential nodes [101].
Gephi software was used to calculate the DC of each node. The calculated values of DC and BC of the first 25 keywords are shown in Figure 9 and Table 5.
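As an illustration of the two metrics, the following Python sketch computes DC and an unnormalized BC on a small undirected graph and ranks nodes by DC, using BC only to break ties. It does not reproduce Gephi's implementation: the function names, the toy keyword graph, and the choice of Brandes' algorithm for BC are assumptions made here for illustration.

```python
from collections import deque


def degree_centrality(adj):
    """Number of links each node has with the other nodes."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def betweenness_centrality(adj):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every undirected pair was visited from both endpoints.
    return {v: value / 2 for v, value in bc.items()}


# Hypothetical toy keyword graph, not the 74-node network of the study.
adj = {
    "bim": {"ontology", "ifc", "nlp"},
    "ontology": {"bim", "ifc"},
    "ifc": {"bim", "ontology"},
    "nlp": {"bim", "safety"},
    "safety": {"nlp"},
}
dc = degree_centrality(adj)
bc = betweenness_centrality(adj)

# Rank by DC; BC is used only when two nodes share the same DC value.
ranking = sorted(adj, key=lambda k: (-dc[k], -bc[k], k))
print(ranking)
```

In the toy graph, "ontology", "ifc", and "nlp" share the same DC, and BC places "nlp" first among them because it is the only bridge to "safety".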


Keywords Evolution (1989-2020)
A graph showing the trend of keywords over time (from 1989 to 2020) in the investigated body of knowledge is provided in Figure 10.

Factorial Approach and Thematic Map: From Network Graph to Bivariate Map
As stated in Section 4.2, factorial analysis allows reducing the dimensionality of data and representing them in a lower-dimensional space, in this case a 2D graph. The methodology applied to reduce the dimensionality is Correspondence Analysis (CA). Keywords are plotted as points with coordinates in a two-dimensional space: the more similarly the keywords are distributed in the data set, the closer they are plotted in the bivariate map. In summary, keywords are grouped into the same cluster if they are discussed together in a large proportion of articles; conversely, keywords are distant when only a small fraction of papers uses the terms together. The origin of the chart represents the center of the research field analyzed, namely the large shared topics [102].
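For reference, the dimensionality reduction can be written in the standard textbook form of simple Correspondence Analysis. This is a sketch only: the notation is introduced here and is not taken from the study, and the exact variant applied by the software is not stated in the text. Let $N$ be a documents-by-keywords contingency table with grand total $n$, $\mathbf{1}$ a vector of ones, and $D_r$, $D_c$ the diagonal matrices of the row and column masses $r$ and $c$.

```latex
\begin{aligned}
P &= \tfrac{1}{n}\,N, \qquad r = P\,\mathbf{1}, \qquad c = P^{\top}\mathbf{1}, \\
S &= D_r^{-1/2}\,\bigl(P - r\,c^{\top}\bigr)\,D_c^{-1/2} = U\,\Sigma\,V^{\top}, \\
F &= D_r^{-1/2}\,U\,\Sigma, \qquad G = D_c^{-1/2}\,V\,\Sigma .
\end{aligned}
```

Under these assumptions, the first two columns of $G$ give the coordinates of the keywords in the bivariate map, and keywords with similar distribution profiles across documents receive nearby coordinates.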

Correspondence Analysis and Clustering: Map of Words
The factorial bi-dimensional map (Figure 11) shows three main clusters. The blue cluster is identified by the Information Technology keyword and 9 secondary terms, including Construction and Project management. The green cluster is identified by the Construction safety topic and gathers 11 keywords, including the NLP term; the application of NLP to safety and risk management is one of the most investigated, alongside the Automated Compliance Checking task, and some relevant papers of the cluster are listed in Table 6. The green cluster also contains the Deep learning, Machine learning, and Artificial intelligence keywords, which depict approaches and tools employed for NLP tasks. The red cluster is identified by the Building design term and is composed of seven keywords, including BIM, ifc, and ontology.
Table 6. Relevant papers about NLP application for Safety and Risk mitigation, and Automated Compliance Checking task.

Topic | Brief Description and Main Goal | Reference
Risk management | NLP based system to analyze the uncertainty of the bidding documents: predicting risks during the bidding process of construction projects. | [103]
Automated Compliance Checking | Semantic machine learning-based text classification algorithm for classifying clauses and sub-clauses: enhancing Automated Compliance Checking (ACC). | [104]
Automated Compliance Checking | NLP and deep learning-based approach, converting human-readable building regulations to computer-readable format: supporting Automated Rule Checking activity. | [105]
Construction safety | NLP techniques performed on construction accident report databases: improving efficiency and performance of risk mitigation Case Base Reasoning (CBR) method. | [90]
Construction safety | Text mining and NLP to analyze construction site accident: preventing reoccurrence of similar accidents enhancing scientific risk control plans. | [106]

Keywords clusters, corresponding to the factorial analysis reduction map (Figure 11), are marked in the co-occurrence network. Cluster A (green), cluster B (red), and cluster C (blue) of the factorial analysis reduction map intermingle closely, indicating their close relation in terms of research themes. Cluster B, the red cluster in the factorial map (BIM, ontology, and ifc keywords), can be considered a bridge theme, being the connection between the NLP and Semantic green A cluster, and the Information Technology and Construction management blue C cluster, as shown in Figure 12.

Thematic Map Analysis
To analyze the temporal evolution of topics, a thematic map is provided (Figure 13). BiblioShiny uses a clustering algorithm to group keywords into investigated topics. Each topic is plotted on a thematic or strategic map [107]. The graph has two dimensions: the Callon Centrality on the x-axis and the Callon Density on the y-axis [108]. Centrality can be interpreted as the importance of the theme in the entire knowledge domain, and Density as the maturity level of the themes themselves. According to the quadrant, it is possible to define four types of themes [108][109][110]: motor themes (high centrality and high density), highly developed but isolated themes (upper left quadrant), emerging or declining themes (lower left quadrant), and themes that are important for the research field but not yet developed (high centrality and low density). Furthermore, the size of the bubbles representing the investigated topics is proportional to the importance of each topic relative to the others.
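The quadrant reading of the thematic map can be sketched in a few lines of Python. This is an illustration only: the function name `classify_theme`, the theme names and values, and the use of the median as the cut between low and high values are assumptions, since the text does not state how BiblioShiny places the quadrant boundaries.

```python
from statistics import median


def classify_theme(centrality, density, centrality_cut, density_cut):
    """Map a (centrality, density) pair to one of the four theme types."""
    central = centrality >= centrality_cut
    dense = density >= density_cut
    if central and dense:
        return "motor theme"  # upper right quadrant
    if dense:
        return "highly developed but isolated theme"  # upper left quadrant
    if central:
        return "important but not yet developed theme"  # lower right quadrant
    return "emerging or declining theme"  # lower left quadrant


# Hypothetical (centrality, density) values, not results of the study.
themes = {
    "theme_a": (8.0, 9.0),
    "theme_b": (2.0, 7.5),
    "theme_c": (1.5, 2.0),
    "theme_d": (7.0, 1.0),
}

# Assumption: the median of each axis separates "low" from "high".
c_cut = median(c for c, _ in themes.values())
d_cut = median(d for _, d in themes.values())

labels = {
    name: classify_theme(c, d, c_cut, d_cut) for name, (c, d) in themes.items()
}
for name, label in labels.items():
    print(name, "->", label)
```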
To investigate the evolution of the topics (trajectory over time), the timespan (1989-2020) is divided into four time-slices according to the annual scientific production trend analysis (Figure 3): 1989-2014, 2014-2017, 2017-2019, and 2019-2020. The time-slices are chosen to focus on the most recent developments (the second, third, and fourth time-slices). The knowledge domain is investigated starting from the point of reduction in the number of publications (2014) and the subsequent steady and gradual increase in interest up to 2020. To better analyze the obtained results, generic terms, such as construction and construction industry, are excluded from the thematic map.
The first time-slice (1989-2014) does not report the theme related to NLP. The BIM theme falls in the upper left quadrant, identified as a highly developed but isolated theme, very specialized with few connections with other topics. The theme of information management begins to be considered as a fundamental aspect for the research; however, it is not yet fully investigated and developed in the period, and the same is true for information sharing, which is characterized as an emerging topic.
In the second time-slice (2014-2017), the BIM topic moves to the lower left quadrant, being identified as a declining theme, leaving room for topics such as Augmented Reality (AR) and artificial neural networks (ANN) in the upper left quadrant of the highly developed and specialized topics. In the quadrant of themes important for the research field but not yet developed, the NLP topic and the field of construction safety appear. The scope related to risk management is defined as a core element of the structure of the research field in the period 2014-2017.
In the third time-slice (2017-2019), NLP is identified as an emerging theme, while two new themes appear, related to the use of deep-learning algorithms and to collaborative information sharing techniques. In this three-year period, the theme of information retrieval is identified as a motor theme for the structure of the research field.
In the last time-slice (2019-2020), NLP is identified as a motor theme with high density and high centrality values, which means that it is a well-developed and core element of the structure of the research field. Two new topics related to big data analytics and computer vision appear as highly developed, although isolated, themes.

Source Ranking and Impact: The Bradford's Law
To identify the most relevant journals and conferences in the analyzed knowledge domain, a count of articles per source is provided in Table 7. The analysis of academic journals and conference articles can be useful for researchers and scholars to find the most active and up-to-date sources, authors, and research groups.
A further analysis of the source impact is carried out based on Bradford's Law using the BiblioShiny online tool. Bradford's Law describes how information is scattered in a field, based on the distribution of citations [111]. Literally, Bradford's Law states: "if the journals are arranged in descending order [of] the number of articles they carried on the subject, then successive zones of periodicals containing the same number of articles on the subject form the simple geometric series $1 : n_s : n_s^2 : n_s^3$". Bradford's Law divides all citations of a subject equally into three zones; the first zone is called the "core zone" and it gathers the highest number of citations with the smallest number of journals. The second zone requires more journals to obtain the same number of citations, and the third zone more than the second one. Bradford describes a "decrease in productivity" in the transition from core zone 1 to zone 3 [112]. Bradford's Law has influenced the methodology of creating collections, supporting the organization and management of bibliographic works and academic documentation [113]. From this perspective, Bradford's Law can be used to identify the most highly cited journals for a field or subject, helping to categorize core journals in the field, as shown in Figure 14.
The core zone, Zone 1, is composed of 86 articles gathered in three sources (two journals and one conference proceeding): Automation in Construction, Journal of Computing in Civil Engineering, and Proceedings of Congress on Computing in Civil Engineering. Zone 2, the middle zone, is composed of eighty-six articles grouped in eleven sources, and Zone 3, the minor zone, gathers eighty-two articles in fifty sources.
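The zoning logic can be sketched as follows. This is a minimal Python illustration, not the BiblioShiny procedure: the function name `bradford_zones`, the source labels, the article counts, and the rule used to cut the ranked list (cumulative share of articles reached before each source) are assumptions.

```python
def bradford_zones(articles_per_source, n_zones=3):
    """Split sources into zones holding roughly the same number of articles."""
    # Rank sources by decreasing productivity (ties broken by name).
    ranked = sorted(articles_per_source.items(), key=lambda kv: (-kv[1], kv[0]))
    total = sum(articles_per_source.values())
    zones = [[] for _ in range(n_zones)]
    cumulative = 0
    for source, count in ranked:
        # Assumed rule: the zone is set by the share of articles already
        # covered by the more productive sources.
        zone = min(cumulative * n_zones // total, n_zones - 1)
        zones[zone].append(source)
        cumulative += count
    return zones


# Hypothetical article counts per source, not the data of Table 7.
counts = {"A": 12, "B": 6, "C": 5, "D": 3, "E": 2, "F": 2}
counts.update({name: 1 for name in "GHIJKL"})

zones = bradford_zones(counts)
sources_per_zone = [len(zone) for zone in zones]
articles_per_zone = [sum(counts[s] for s in zone) for zone in zones]
print(sources_per_zone, articles_per_zone)
```

In the toy data, each zone holds a similar number of articles while the number of sources per zone grows quickly, which is the "decrease in productivity" pattern the law describes.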

Source Impacts: H-Index, G-Index, and M-Index
To find the most impactful source, H-index, G-index, and M-index are also calculated and compared in Table 8:
• H-index, or Hirsch-index, is the number h of items (i.e., articles) published by an author or journal, each of which has been cited in other papers at least h times [114];
• G-index, introduced in 2006, is "an improvement of H-index to measure the global citation performance of a set of articles. If this set is ranked in decreasing order of the number of citations that they received, the G-index is the (unique) largest number such that the top g articles received (together) at least g² citations" [115];
• M-index is equal to H-index/n, where n is the number of years since the first published paper of the source [114].
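The three definitions translate directly into code. The following Python sketch is illustrative: the function names and the citation counts are hypothetical and do not correspond to the values in Table 8.

```python
def h_index(citations):
    """Largest h such that h items have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)


def g_index(citations):
    """Largest g such that the top g items have together at least g^2 citations."""
    ranked = sorted(citations, reverse=True)
    g, running = 0, 0
    for rank, c in enumerate(ranked, start=1):
        running += c
        if running >= rank * rank:
            g = rank
    return g


def m_index(citations, years_since_first_paper):
    """H-index divided by the number of years since the first published paper."""
    return h_index(citations) / years_since_first_paper


# Hypothetical citation counts for the articles of one source.
citations = [10, 8, 5, 4, 3, 0]
h = h_index(citations)
g = g_index(citations)
m = m_index(citations, years_since_first_paper=8)
print(h, g, m)
```

Here four articles have at least four citations each (h = 4), while the top five articles total 30 citations, which is at least 25 (g = 5); the G-index therefore rewards the highly cited items that the H-index ignores.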
As already shown in core zone 1 of the Bradford's Law plot, Automation in Construction and Journal of Computing in Civil Engineering are the most impactful journals considering all the three indexes, H-index, G-index, and M-index. They are followed by the Journal of Construction Engineering and Management and by the Proceedings of Congress on Computing in Civil Engineering, the latter also having been identified in the core zone 1 of the Bradford's Law plot.

Source Evolution and Dynamics
Once the sources with the greatest impact on the scientific community with respect to NLP and BIM topics in the construction industry had been identified, a graph of the trend of the top five sources in terms of impact was produced to investigate their evolution over the period 1989-2020 (Figure 15).
The frequency of publications per author can be described using Lotka's Law. Lotka's Law states: "as the number of published articles increases, authors producing many publications become less frequent" [116,117]. Figure 16 shows that only 19 authors are relevant and have an impact on the knowledge domain. The chart allows identifying the significant authors in the analyzed topic. The scientific production of the most relevant authors is analyzed in the following section.
In the proposed graph (Figure 17), the scientific production of the core authors is plotted. The lines represent each author's scientific production timeline, the bubble size is proportional to the number of documents published in a certain year, and the bubble color intensity is proportional to the number of citations per year.
The most active period in terms of publications and citations ranges from 2003 to 2019. Before that period, Professor L. Y. Liu from the University of Illinois Urbana-Champaign published an article in 1993 [118]. The latest publications belong to Zhang J. [119], Issa R. R. A. [120], Hallowell M. R., Tixier A. J.-P. [121], and Li H. [122].

Authors Collaboration: Co-Authorship Network
To investigate and visualize the relationships among authors, i.e., the so-called social structure of the research field [123], a co-authorship network is provided (Figure 18).
The co-authorship network shows the existence of 10 main research groups. Only 4 out of 10 groups are composed of more than two people. The network shows a social structure composed of small research groups with few relationships between them. Six researchers compose the largest group, while the remaining groups vary from a minimum of two to a maximum of four members.

Countries Scientific Production and Collaboration Intensity
A geographical map representing the provenance of the scientific production and the collaboration among countries is provided (Figure 19). The countries with the highest scientific production in the field are the United States of America (179 articles), followed by China (44 articles), the United Kingdom (41 articles), Canada (28 articles), and Australia (24 articles).
To visualize and investigate the collaboration among researchers from different countries, a collaboration bar chart is provided (Figure 20). The bar chart indicates, for each country, the number of documents in which at least one co-author is from a different country than the corresponding author. The chart shows a low degree of collaboration among researchers from different countries. Considering the top 10 countries by scientific production, only 20 articles out of 137 have at least one co-author from a country other than that of the corresponding author. Australia is the only country with a higher number of publications by authors from multiple countries.
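The collaboration measure described above can be sketched in a few lines of Python: a document counts as a multi-country publication when at least one co-author's country differs from that of the corresponding author. The record layout and the sample records are hypothetical. Only the figures 20 and 137 come from the text.

```python
def multi_country_share(papers):
    """Count multi-country publications.

    papers: list of (corresponding_author_country, [co_author_countries]).
    Returns (multi_country_count, total, share).
    """
    multi = sum(
        1 for corresponding, others in papers
        if any(country != corresponding for country in others)
    )
    total = len(papers)
    share = (multi / total) if total else 0.0
    return multi, total, share


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    sample = [
        ("USA", ["USA", "USA"]),
        ("USA", ["China"]),
        ("Australia", ["United Kingdom", "Australia"]),
        ("Canada", []),
    ]
    print(multi_country_share(sample))

    # Share implied by the figures reported in the text (20 out of 137).
    print(f"{20 / 137:.1%}")  # prints 14.6%
```

With the reported figures, roughly one article in seven involves authors from more than one country.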

Most Relevant Affiliations and Institutions
Affiliations are listed according to the number of published articles (Table 9). Seven out of the ten most scientifically productive institutes are American, two are Canadian, and only one is Taiwanese.

Conclusions
Information sharing, storing, and management procedures in AECO are highly based on document production and exchange. Text documents, i.e., unstructured sources of information, are still essential for the construction process [8]. On the other hand, the adoption of BIM in the AECO industry tends to shift the sector toward a model-based approach. Despite the widespread use of BIM approaches, AECO information flow is still mainly based on document production and exchange [8,12,13]. For that reason, adopting BIM is insufficient to leverage the whole value of unstructured data and information [15]. The study identifies Natural Language Processing as a possible approach to process unstructured text information, helping to overcome the document-based nature of the sector and to seize the full potential of digitalization in the construction sector [1].
The proposed study aims to investigate the knowledge domain of NLP technologies and applications in AECO, including the identification and analysis of possible links and integration between BIM and NLP methods, drawing a picture of the body of knowledge. Scientometric and data visualization approaches are applied to explore: Conceptual (main themes and trends, in Sections 4.1-4.3), Intellectual (influence of articles, sources, and authors, in Sections 4.4 and 4.5), and Social structure (interaction among countries, affiliations, and researchers in Section 4.6) of the selected knowledge domain.
The research methodology is structured into five main phases: (I) science mapping methods and tools selection; (II) query methods and criteria; (III) data cleaning; (IV) scientometric analysis; (V) analysis and discussion of the results. Each science mapping tool has its own limitations and strengths. To select the best set of tools, a comparative analysis is conducted. BiblioShiny [49], VOSviewer, and Gephi are identified as the most suitable science mapping tools. The bibliometric data are gathered from Scopus DB only. Scopus has a larger coverage of scientific production and a faster indexing process than Web of Science [54]. A keyword string, composed of keywords used by previous studies on NLP, BIM, and AECO topics [63], has been defined to query the Scopus DB and to download the bibliometric meta-data corresponding to the boundaries of the investigated knowledge domain. A sample of 254 publications and the related bibliographic data is downloaded from Scopus DB.
Temporal trends analysis results underline an increasing interest of the scientific community in the NLP topic in the AECO sector. The increasing volume of, and the consequent need for AECO to manage, unstructured data to support the decision-making process, and the recent advancements of NLP, can be factors for the rising interest in the topics, as also stated by Bilal et al. 2016 [65]. A misalignment between the trend of average citations per year and the scientific production trend is discovered, likely caused by the high degree of innovativeness regarding the NLP theme in the construction sector investigated by a limited number of research groups. A small size and limited number of research groups investigating the theme paired with a low degree of collaboration between researchers, as reported in Sections 4.5 and 4.6, could be the causes of the low impact on the scientific community in terms of citations.
Network visualization (i.e., co-words network, co-occurrence keywords network) is performed to investigate the conceptual structure of the data sample, defining the most important and recent topics [67]. Meta-data related to the publication year are plotted to study the evolution and the changes of a subject over a period. The co-occurrence keywords network (Section 4.2) shows a close relationship among BIM, Semantic, and NLP topics, which can be explained by the capability of NLP systems to process natural language, which is itself semantic information, and translate it into a machine-understandable format, such as ontologies. Ontologies seem to be well-explored and promising digital artifacts to support the interoperability between BIM systems with a focus on semantic interoperability. There is a clear tendency of the scientific community towards investigating and using Semantic Web technologies to solve the interoperability issue of the AECO industry [98]. NLP, BIM, the Semantic topic, and their intersections can all be considered part of the transition process towards the implementation of the Semantic Web in AECO processes aiming to fully digitalize the sector.
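The idea behind a co-occurrence keywords network can be sketched as follows: every pair of keywords appearing in the same record adds one to the weight of the edge between them. This is a minimal illustration using only the Python standard library. The keyword lists are hypothetical, and the science mapping tools used in the study apply their own normalization and clustering on top of such counts.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(records):
    """Count how often each pair of keywords appears in the same record.

    records: iterable of keyword lists, one list per publication.
    Keywords are lower-cased and de-duplicated within a record, and each
    pair is stored in alphabetical order so (a, b) and (b, a) coincide.
    """
    counts = Counter()
    for keywords in records:
        unique = sorted({k.strip().lower() for k in keywords})
        counts.update(combinations(unique, 2))
    return counts


if __name__ == "__main__":
    # Hypothetical keyword lists, for illustration only.
    sample = [
        ["BIM", "NLP", "Ontology"],
        ["NLP", "BIM"],
        ["Ontology", "Semantic Web"],
    ]
    for pair, weight in keyword_cooccurrence(sample).most_common():
        print(pair, weight)
```

Pairs with high weights, such as BIM and NLP in this toy sample, would appear as strongly connected nodes in the network.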
A factorial analysis is applied to reduce the dimensionality of network graphs, representing them in a two-dimensional space. The factorial analysis reduction map shows the role of the cluster composed of the BIM, ontology, and ifc keywords as a bridge theme connecting the NLP and Semantic cluster and the Information Technology and Construction management cluster. A thematic map is provided to analyze the temporal evolution of topics; the map shows the evolution of the NLP topic from the quadrant of "important but not really developed themes" to the lower-left quadrant, where it is an emerging topic in the 2017-2019 time slice. In the last time slice, 2019-2020, NLP is identified as a motor theme with high density and high centrality values; big data analytics and computer vision appear as highly developed and isolated themes. The analysis of the conceptual structure also allows the main NLP technological drivers to be identified: Artificial intelligence, Text mining, and Learning algorithms; their variants (Machine learning and Deep learning) emerge as the most widespread and promising technological drivers. The application of NLP seems to be pervasive in several AECO fields. Project, Safety, and Risk Management are the fields with the highest number of NLP applications. Regarding the combined applications of NLP and BIM, the Automatic Compliance Checking field has the highest number of articles. This is likely because regulatory documents are highly standardized and structured, which makes them feasible to process with NLP systems and translate into machine-readable language. NLP-based systems to convert regulatory information represent an active field of research. Information Retrieval from BIM models and Information Enrichment of BIM objects are further active fields of investigation. No articles seem to be related to the preliminary design or requirement definition phases, which represent possible research areas not yet covered by academia.
As stated, data about the provenance of corresponding authors and co-authors show a low degree of collaboration among researchers from different countries: only 20 articles out of 137 have at least one co-author from a country other than that of the corresponding author. The most relevant and impactful journals and conferences are also identified through a source impact analysis.
In conclusion, the evolution of the research about NLP and BIM systems suggests an effort from the research community to support the sector in the transition from a document-centric to a fully information-based approach. Semantic information, by its nature, is closely related to natural language that can be managed and processed through NLP systems. Thus, the combined use of NLP and BIM systems can have a positive impact on the digitalization of the AECO sector. NLP tools and techniques can become a connection between the world of documents and the world of digital entities, such as BIM models, ontologies, or knowledge graphs (KG). NLP services built on the latest transformer-based pre-trained language models (e.g., BERT or GPT-3) will enable the processing of text documents and the return of digitalized and queryable information and entities in a semi-automatic way. Consequently, the separation of the informative sources, i.e., the document-based and the BIM model-based sources, which is demonstrated to be counterproductive, will be averted. NLP, BIM, and Semantic technologies and their intersections can all be considered drivers for the digital transition of the design and construction process. The latest research focuses on modelling and visualizing semantic information and knowledge [124]. The recent semantic and knowledge modelling approaches in the AECO sector mainly aim to find a methodology to model and store semantic data in a structured way [125], and to maintain the interrelation among numerical and semantic data during the whole progress of the construction project, thus preserving the traceability of data properties' progression [125,126]. The semantic modelling approach ultimately aims to overcome the document-centric approach based on unstructured data, in order to reduce the fragmentation typical of the traditional information management method [127,128].
The performed bibliometric analysis confirms the industry's growing interest in the integration of BIM, NLP, and Semantic technologies, aiming to overcome the above-mentioned limitations of the current document-centric processes of the AECO sector.
The findings of the analysis are to be considered in light of some limitations. The main limitations of the proposed approach are the following: (I) the research findings do not fully reflect the entire NLP and BIM knowledge domain in the AECO industry, because the Scopus DB query is circumscribed by the selected keyword string and, e.g., non-English articles are omitted from the analysis (18 out of 272); (II) the study is a static picture of the body of knowledge in a specific period. Regarding the second limitation, it is worth noting that applying the same bibliometric approach in the future will allow further investigation of the dynamic nature and the evolution of the NLP and BIM topic.

Institutional Review Board Statement: Ethical review and approval were waived for this study because they were not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.