ACS: Construction Data Auto-Correction System—Taiwan Public Construction Data Example

Abstract: This study aims to develop an automatic correction system for public construction data. The unstructured nature of construction data presents challenges for its management: differing user habits, time-consuming system operation, and long training periods leave data management systems full of inconsistently formatted or even incorrect data. Processing construction data into a machine-readable format is both time-consuming and labor-intensive. Therefore, this study used Taiwan's public construction data as an example case to develop a natural language processing (NLP) and machine-learning-based text classification system, coined the automatic correction system (ACS). The developed system is designed to automatically correct public construction data while improving the efficiency of manual data correction. The ACS has two main features: a data correction function that converts unstructured data into structured data, and a recommendation function that provides users with a recommendation list for manual correction. For implementation, the developed system was used to correct the data in the public construction cost estimation system (PCCES) in Taiwan. We expect the ACS to improve the accuracy of the data in the public construction database and thereby increase practitioners' efficiency in executing projects. The results show that the system corrected 18,511 data points with an accuracy of 76%. Additionally, the system was validated to reduce system operation time by 51.69%.


Construction Data Management
Data management is considered one of the most vital tasks in the construction industry [1]. The unstructured nature (e.g., plain text) of construction data presents challenges for its management. Processing construction data into a machine-readable format is time-consuming and labor-intensive, as it requires a great deal of paperwork to structure data from various sources. Many popular tools and standards have been developed by experts to help people manage construction data: management tools such as Microsoft Project and Primavera are prevalent, and the MasterFormat and UniFormat standards are common in the U.S. and Canada. However, engineers still need to follow the rules and formats provided by these standards and tools to manually transform unstructured data into structured data.
Even though many tools and standards are available, the quality of processed data is often insufficient because of the varied backgrounds of the personnel involved. Engineers from different fields may have their own interpretations of the standards, causing inconsistent data formats or even incorrect data. For instance, the Taiwanese government provides its employees and contractors with a tool for managing public construction projects using a coding system similar to MasterFormat. However, because of differing interpretations of the coding standard, the average data accuracy across 7592 public construction projects is only 48%. The PCCES code covers work items and resource items. Resource items include materials, human resources, and machines; work items are composed of materials, machinery, labor, and miscellaneous items. The codes for work items and resource items can therefore be applied in different ways. Public works projects are numerous and vary widely in scale, ranging from airports, dams, and tunnels to planting sidewalk trees, paving sidewalk bricks, and dredging trenches. Each project involves many upstream and downstream manufacturers, office staff, and government officials.
Because of these different roles and divisions of labor, perspectives on project management vary. Projects of various sizes, workers from different backgrounds, and different roles within a project all lead people to understand the PCCES coding system differently when using it. Over time, these differences in understanding have filled the database with non-compliant data, presenting an ongoing challenge for its management.

Benefits and Challenges of the PCCES
The PCCES occupies a very important position in the field of public construction works in Taiwan. Public construction projects are required by law to use the PCCES, so these projects follow uniform, well-defined specifications. Having collected and integrated data from a large number of projects, the PCCES offers several benefits. For example, according to the PCC announcement [3], the PCCES can improve the efficiency and credibility of fund review and comparison, avoid each agency building duplicate system software and hardware, save public funds, and reduce opportunities for restricting competition and bidding during project tendering.
Although the PCCES does bring these benefits and helps users considerably, several studies have pointed out problems with the system. For instance, users are not accustomed to the standardized work items and resource codes in the PCCES [5]. Users need lengthy education and training to become familiar with and operate the PCCES [6]. Many practical applications are still not covered by the PCCES [7,8]. Operating the PCCES is time-consuming, requires substantial manpower, and has a high error rate [9]. Moreover, in producing PCCES documents, architects and construction companies often disagree on measurement and calculation methods, and some of those methods do not conform to the PCCES coding standard [10].

Research Objectives
This study focuses on developing a machine-learning-based system to automatically correct a public construction database containing PCCES data in Taiwan. The data in this public database are difficult to use because the database is filled with messy data. In practice, even though people use many tools to manage construction data, the data remain chaotic, and their quality depends on how familiar each user is with the tools. No existing classification method can consider the meaning of construction specifications in a construction classification system's database and then automatically correct erroneous data. Therefore, this study aims to develop a machine-learning-based system to improve the performance of people who work on construction files and need to use construction classification systems. The developed method should achieve the following goals: (1) Develop an auto-correct function to automatically correct data from public construction databases and unstructured construction project data. Users could apply this function to data obtained from the government's open data repository. (2) Develop a recommender function that helps users perform their jobs efficiently without prior experience with construction classification systems.

Literature Review
In order to develop a system that can automatically structure construction data, this study first reviews the characteristics and challenges of construction data management (Section 2.1). Then, novel methods for unstructured data processing proposed by other studies are discussed in Section 2.2. Lastly, state-of-the-art machine-learning-based approaches to construction data processing are reviewed in Section 2.3.

Challenges in Construction Data Management
Processing unstructured data is considered one of the most critical challenges in construction data management. Project owners, architects, contractors, and suppliers communicate and coordinate through various documents, and it is common for documents to use different naming conventions for the same object [11]. Different naming conventions increase the time and cost of processing data, as personnel need to organize the raw material for further communication. Some practitioners attempt to use a construction classification system (CCS), such as MasterFormat [12], UniFormat [13], or OmniClass [12], to lower the extra cost of data processing. A CCS typically uses layered coding to categorize various materials and adds narrative descriptions. Such a system allows all team members to store data efficiently and to use the system's retrieval codes to reference specific parts of any document, reducing cost [14]. However, even users working on the same project may use different CCSs according to their roles, and making every user familiar with each other's CCS is not feasible: the learning cost is too high. Because not all team members are familiar with these specifications, inputted data may remain unstructured.
The field of construction engineering also faces the problem of dealing with unstructured data [15][16][17]. Unstructured data are data not stored in a structured format in a database; an alternative definition is data that do not follow a predefined data model. This makes the data irregular, ambiguous, and difficult to interpret using traditional computer programs [18]. A construction project generates many documents, including images and text. The text-based unstructured materials include contract templates, construction specifications, quality documents, and material management documents [17]. Only about 20% of the available data are structured and stored in relational databases, while approximately 80% are unstructured text stored in various forms of documents [19]. Traditionally, unstructured data are converted to structured data by manual work. However, acquiring knowledge from unstructured data is usually painful and expensive [20]. Therefore, several studies have sought simpler and cheaper means of retrieving useful information from unstructured data.

Unstructured Data Processing
Unstructured data have been considered a critical challenge in data management for decades. Several studies have been conducted on transferring the unstructured data into useful information in many fields. For instance, Kharrazi et al. used natural language processing (NLP) to solve the problem of unstructured electronic health records [21]. Luo et al. used the knowledge from data specialists and computer data modules to extract structured data from unstructured medical records [22]. In the business field, Farhadloo et al. attempted to discover the relative importance of each service or unique product using the Bayesian method for a customer review system [23]. These investigators used an online analytical processing (OLAP) system to analyze unstructured data from multiple perspectives, including text mining (TM), information retrieval (IR), and information extraction (IE), in an attempt to extract business intelligence from unstructured data.
In the field of construction engineering, researchers have also tried to solve the problem of unstructured data. In the study of [16], a view-based method was used, with metadata models to convert documents to structured data. Alsubaey et al. presented a Naïve Bayes text mining approach to identify early warnings of failure from meeting records [24]. Kim and Chi developed a system based on natural language processing (NLP) to extract hidden knowledge from construction accident cases [25]. Even though studies have been conducted that address unstructured data, none of these results can solve the problem that the CCS faces.
The issue for the CCS lies in its material codes and in the difficulty of providing accurate descriptions of materials. A CCS defines a coding system for specifications and uses specific terms to describe them. In practice, however, not everyone masters the coding system or becomes familiar with these terms. People use unprofessional terms in their construction files, and this works in daily practice because the terms are still readable. The result is construction files whose codes and descriptions are invalid under the CCS: unstructured data with coding errors but intelligible descriptions.
Machine learning is a popular solution to convert unstructured data into structured data. Machine learning is flourishing due to the improvement of hardware computing power, the reduction of data storage costs, and the innovation of various algorithms.
With the advancement of machine learning, various algorithms enable models to learn from data, making computers able to handle more and more tasks. After training on a large number of examples, a model can extract information from samples and learn to classify data or make predictions, enabling computers to complete specific tasks across a variety of applications. Machine learning can thus gradually replace human effort in specific tasks, such as autonomous vehicles [26], voice recognition [27], weather prediction [28], face recognition [29], lie detection [7], and image processing [30].

Machine-Learning-Based Methods For Construction Data Processing
A typical construction project may have thousands of outstanding issues. An artificial intelligence program that helps humans systematize these problems and their accompanying data would greatly improve work efficiency. Artificial intelligence can accomplish many such tasks. For images, some algorithms classify pictures directly [31], while other studies generate text captions for images, turning them into text data [32]. For text, Wu et al. used natural language analysis to automatically extract keyword lists from pathological examination reports [33], and Nandhakumar et al. used word and sentence features with conditional random field (CRF) models to extract the important parts of medical reports [34]. The methods described above are all attempts to structure data.

System Overview
This study proposes an automatic correction system (ACS) that can correct the data automatically or provide a recommended list to users for manual correction. The proposed system includes three primary modules: a data processing module, a search processing module, and a mapping processing module. Figure 2 shows the system overview of the ACS. The data processing module processes the raw data collected from different resources and then stores them in the result database. The main job of the data processing module is to train a word embedding model and establish a result data set. The search processing module processes the target data input by the user and then puts them into the word embedding model to find data with higher similarity. The data processing module and the search processing module both use the same text processor [35]. The following sections will describe each module sequentially.

Data Processing Module
The data processing module processes the raw data and then uses the processed data to train a word embedding model. This study used the corrected blank valuations and the CCS system manual as the raw data. The data processing module uses the text processor to normalize the data that were manually extracted from the raw data, and the processed data are then inserted into a corpus. Subsequently, the correct code, correct name, correct name's segmentation, and original name's segmentation are extracted from the corpus as the result data and stored in the result database. At the end of this module, the corpus is taken as the input for training the word embedding model. Figure 3 shows the workflow of the data processing module.

Raw Data Collection
For raw data collection, system manuals and blank valuations related to a specific CCS were collected. The ACS is intended to correct a public database associated with one specific CCS, and its performance depends on the quality of the collected data. Descriptive system documents were collected from the institutions that manage the CCS, and blank valuations using the CCS were obtained from a private company's real cases. The descriptive documents contain at least specifications, descriptions, and coding systems. The correct codes and descriptions were extracted into two data fields of the "Manual Extracted Data," termed "Correct Code" and "Correct Name." The inaccurate description was also extracted as a data field, termed "Original Name."

Text Processor
Text preprocessing benefits the subsequent classification results and reduces computational complexity. Figure 4 shows the workflow of the ACS's text processor. The text processor handles the following tasks: stop word removal, lowercase conversion, normalization, and tokenization. After a user enters data, such as a sentence, the text processor removes stop words [36], converts uppercase and lowercase letters to a consistent case, normalizes the data obtained thus far, and finally tokenizes the data into segments. An alias dictionary is utilized for tokenization; the dictionary includes synonyms, which improve the quality of the statements [35].
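The preprocessing steps above can be sketched in Python. The alias dictionary, stop-word list, and tokenization rule below are simplified stand-ins for the real dictionary described in [35], not the actual ACS implementation:

```python
import re

# Hypothetical alias dictionary mapping informal terms/units to canonical forms.
ALIASES = {"3000psi": "210 kgf/cm2", "ready mixed": "ready-mixed"}
# Hypothetical stop-word list (the ACS uses a construction-specific one [36]).
STOP_WORDS = {"the", "a", "of"}

def preprocess(text: str) -> list[str]:
    """Sketch of the ACS text processor: lowercase, normalize, tokenize."""
    text = text.lower()                       # case folding
    for alias, canonical in ALIASES.items():  # alias/unit normalization
        text = text.replace(alias, canonical)
    tokens = re.split(r"[/,\s]+", text)       # naive tokenization
    return [t for t in tokens if t and t not in STOP_WORDS]

print(preprocess("Building Concrete / Ready Mixed Concrete / 3000psi"))
```

Running this yields normalized tokens in which "3000psi" has been unified to its metric equivalent and the material name is consistently hyphenated.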

Search Processing Module
For the search processing module, the input is the target data the user keys in, and the output is the data with the highest similarity obtained from the word embedding model. After the data processing module trains the word embedding model, the search processing module uses the text processor to handle the input target data and convert them into segmentations. Subsequently, the segmentation is fed to the word embedding model to find similar data, which form the module's output. Figure 5 shows the workflow of the search processing module.

Word Embedding
Word embedding is one of the most popular representations of document vocabulary [37]. It can capture the context of a word in a document, its semantic and syntactic similarity, and its relationships with other words. Roughly speaking, word embeddings are vector representations of particular words. A program cannot directly use the text contained in an electronic file, so the text must be converted into a format the computer can handle; word vector representation is one such conversion. A series of processing steps is performed on the text until it becomes sentences or words, and each sentence or word is assigned an independent code, which is a vector.
There are many ways to perform word embedding, such as the hashing vectorizer [38], count vectorizer [39], and term frequency-inverse document frequency (TF-IDF) vectorizer [40]. The method used in this system was a shallow neural network: by learning from a large amount of text, words are transformed into vectors in a vector space, and the distribution of vectors in that space is used to calculate similarity and find words with similar meanings. Words with similar meanings thus have similar representations. This representation is considered a fundamental breakthrough in machine learning for natural language processing [41]. Here, the characteristics of word embedding technologies are taken as the core application of the system. Word embedding methods include dimensionality reduction of word co-occurrence matrices [42][43][44], probability models [45], and explicit representation of the context in which words are located [46]. In the ACS, we used the continuous bag of words (CBOW) model [47] as the language model to obtain the vector matrix. Figure 6 shows the structure of the CBOW model. In the CBOW model, the surrounding context words are used to predict the word in the middle. The input layer is a one-hot encoding [48] whose size equals the vocabulary size N; each element corresponds to a word in the vocabulary, with 0 meaning no input and 1 meaning the word is input. The hidden layer is the input layer multiplied by the weight matrix. Since the input is a one-hot vector, the hidden layer is the superposition of the rows of the weight matrix whose indices correspond to input elements equal to 1. The input layer therefore acts as a lookup table into the rows of the weight matrix.
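The one-hot lookup described above can be verified numerically: multiplying a one-hot vector by the weight matrix simply selects the corresponding row. A minimal NumPy sketch (the sizes are arbitrary toy values, not the ACS's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 3                       # vocabulary size N, embedding dimension
W = rng.normal(size=(V, D))       # input-to-hidden weight matrix

one_hot = np.zeros(V)
one_hot[2] = 1.0                  # one-hot encoding of the word with index 2

hidden = one_hot @ W              # hidden layer = input layer x weight matrix
print(np.allclose(hidden, W[2]))  # True: the product is exactly row 2 of W
```

This is why the input layer behaves like a lookup table: no general matrix multiplication is needed at inference time, only a row selection.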

Similarity Calculation
The ACS uses text similarity to correct the data. For example, some data are classified as wrong because their descriptions are not sufficiently accurate even though they are semantically close to correct ones. The system can replace such incorrect data with the most similar data found by the model.
The ACS processes the text the user entered and inserts the result, the segmentation of the text, into the word embedding model. The model assigns the result a vector that represents it in the model's vector space. This study then uses the cosine similarity [49][50][51] to calculate the similarity of vectors. The ACS takes the results in the result database, inputs them into the model to obtain vectors, and compares those vectors with the vector obtained from the user's entry. After the vectors are input to the model, the ACS calculates the normalized dot product, i.e., the cosine of the angle between them. Given two data a and b represented as two vectors V_a and V_b, the cosine similarity is calculated as

cosine(V_a, V_b) = (V_a · V_b) / (||V_a|| ||V_b||)    (1)

where V_a holds the frequency of each word in the user statement after segmentation and V_b holds the frequency of each word in the corpus statement after segmentation. The value cosine(V_a, V_b) represents the similarity of the two vectors: the closer the angle is to zero degrees, the more similar the vectors are.
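Equation (1) can be implemented directly as the dot product divided by the product of the vector norms; this sketch applies it to two toy term-frequency vectors:

```python
import math

def cosine(v_a, v_b):
    """Cosine similarity of two term-frequency vectors, per Equation (1)."""
    dot = sum(x * y for x, y in zip(v_a, v_b))
    norms = math.sqrt(sum(x * x for x in v_a)) * math.sqrt(sum(y * y for y in v_b))
    return dot / norms

print(cosine([1, 2, 0], [2, 4, 0]))  # parallel vectors: similarity ~ 1.0
print(cosine([1, 0], [0, 1]))        # orthogonal vectors: similarity 0.0
```

A value near 1.0 (angle near zero degrees) indicates near-identical statements, while 0.0 indicates no shared direction in the vector space.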

Mapping Process Module
The mapping process module finds the correct code from the result data set based on the similar data produced by the search processing module. After the first two modules finish, the system has a result data set and the similar data. The result database contains the data processed by the text processor in the data processing module, and these data are compared against the target data processed by the search processing module. The mapping process module picks items from the similar data one by one as keywords to query the result database. If it finds data whose segmentation matches the keyword exactly, it takes the correct code from those data as the output. Figure 7 shows the workflow of the mapping process module.

Implementation
This research used the PCCES data as the training data to implement the developed system. The following subsections will describe the training data and the implementation of the ACS's three modules.

Training Data
In this study, two types of data were collected for training the word embedding model: the PCCES manual and manually corrected actual project data. The latter was used because these data are closer to real work scenarios.

PCCES System Manuals
The PCCES manuals contain many specification codes and instructions. Anyone who wants to use the PCCES to estimate a project budget needs to know how to use these manuals. Users must be trained to identify which objects they want to evaluate and where to place them, such as human resources, machines, materials, methods, and the environment, and then select the correct manual chapter. Because the specification code and description contain different layers, users pick objects layer by layer to assemble a code and specification. Finally, combining these picked codes and specifications completes the data processing with 100% accuracy.
One-hundred percent correct information was thus obtained from the manual. The PCCES manuals contain 18 chapters covering tender documents and contract items such as general requirements and fieldwork. This study selected only the concrete-related chapters for implementation.
Each chapter contains multiple sections, and each section has a code designed for the name, specification, and unit of a material. Permutations were used to generate all possible data: using the grid shown in Table 1, one code was chosen from each column and the choices were combined into a specification code with a description. For example, the code "03330/4/2/0/0/2" means "Building Concrete/Ready-Mixed Underwater Concrete/140 kgf/cm²/M²." This exhaustive method was used to list all specification codes, which formed one of the training data sets for the word embedding model.
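The exhaustive permutation step can be sketched with `itertools.product`. The column options below are a small hypothetical subset reconstructed from the example code "03330/4/2/0/0/2"; the real Table 1 grid has many more rows per column:

```python
from itertools import product

# Hypothetical (code, label) options per grid column; illustrative only.
chapter   = [("03330", "Building Concrete")]
kinds     = [("4", "Ready-Mixed Underwater Concrete"), ("6", "Ready-Mixed, under 20F")]
strengths = [("2", "140 kgf/cm2"), ("6", "350 kgf/cm2")]
units     = [("2", "M2"), ("7", "Set")]

rows = []
for (c, cn), (k, kn), (s, sn), (u, un) in product(chapter, kinds, strengths, units):
    code = f"{c}/{k}/{s}/0/0/{u}"   # two fixed "0" positions as in the example
    desc = f"{cn}/{kn}/{sn}/{un}"
    rows.append((code, desc))

print(len(rows))  # 1 * 2 * 2 * 2 = 8 combinations
```

Every combination of one choice per column is emitted, so the full grid yields the complete list of valid specification codes with descriptions.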

Blank Valuations
Real project documents were obtained from two companies undertaking public works, Knowledge Analysis Space Exploration, Inc., Taipei, Taiwan and United Geotech, Inc., Taipei, Taiwan; these documents formed another training data set for the word embedding model. Each construction project contained a large number of documents and data files, among which the price valuation files were the ones desired. A price valuation file contains the names, terms, and aliases of objects or materials in the construction files. In practice, a price valuation file may contain vague information that people can understand but the PCCES cannot; the system cannot process such information, and humans must correct it first. These corrected files were collected because they contain two types of mutually referencing information: the original vague information and the manually structured data. Table 2 lists an example of the valuations collected.

System Implementation
The ACS can be divided into four parts: the text processor, the database, the model training, and the searching and mapping. In this section, the implementation of these four parts is introduced.

Text Processing
The text processor used in the ACS was developed and used in our previous work [35]. In the previous work, a proper term dictionary was already constructed to improve the quality of data segmentation. This dictionary was used to develop a text processor, which could be used to handle any data imported into the system. Since the PCCES is written in Chinese, we implemented specific Chinese text processors, especially for Taiwan's construction field.
We constructed a dictionary of terms in Taiwan's construction field to solve these issues. This dictionary was used in the text processor to normalize the data. The text processor unified inconsistent units in Chinese, removed and replaced Chinese symbols and stop words, and replaced unit symbols with the corresponding letters or numbers. The results are shown in Table 3. For example, if the input text was "Building Concrete and Ready-Mixed Underwater Concrete/140 kgf/cm²/m²," the output would be "Building Concrete/Ready-Mixed Underwater Concrete/140 kgf/cm²/M²."

Table 3. The characteristics that this study processed in the text processor.

Database
Specific information was extracted from the raw data and stored in a database for use in training the word embedding model. The training data comprised the PCCES manuals and the blank valuations; however, these raw data could not be used to train the word embedding model directly and first needed preprocessing. One-hundred percent correct data had already been generated from the PCCES manuals, alongside the original and corrected blank valuations. The 100% accurate data had two data types: specification code and specification description. The fixed blank valuations had four data types: correct specification code, accurate specification description, invalid specification code, and invalid specification description. These four data types became four fields of a data table. The correct and incorrect specification descriptions were used to derive two further fields, termed "correct description segmentation" and "original description segmentation," which store the output of the text preprocessing module after processing the specification text.
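The six-field result table described above could be laid out as follows. The table and column names are illustrative stand-ins, not the actual ACS schema, and the sample row pairs a vague description ("2000psi", roughly 140 kgf/cm²) with its corrected form:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # in-memory stand-in for the result database
db.execute("""
    CREATE TABLE result (
        correct_code          TEXT,  -- correct specification code
        correct_description   TEXT,  -- accurate specification description
        original_code         TEXT,  -- invalid specification code
        original_description  TEXT,  -- invalid specification description
        correct_segmentation  TEXT,  -- text-processor output, correct side
        original_segmentation TEXT   -- text-processor output, original side
    )""")
db.execute("INSERT INTO result VALUES (?, ?, ?, ?, ?, ?)", (
    "03330/4/2/0/0/2",
    "Building Concrete/Ready-Mixed Underwater Concrete/140 kgf/cm2/M2",
    "03330/9/9/9/9/9",
    "underwater concrete 2000psi",
    "building concrete ready-mixed underwater 140kgf m2",
    "underwater concrete 2000psi",
))
count = db.execute("SELECT COUNT(*) FROM result").fetchone()[0]
print(count)  # 1
```

Storing both the correct and original segmentations in the same row is what lets the vague and the structured descriptions "point to each other" during training and mapping.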

Model Training
In this study, data segmentation and the CBOW were used to train the model. The segmentation was stored in the previously constructed database. The language model used was CBOW, developed by Tomas Mikolov [47]. Because this research applies to a narrow domain, CBOW was sufficient for its goal; compared with models such as BERT [52] and GPT-2 [53], CBOW has a lower cost in terms of hardware requirements, data volume, and training time, so this study chose CBOW. The CBOW language model is implemented in the gensim package for Python, which this study used to train the model.
The database already contained the correct description segmentation and original description segmentation. The segmentations were combined into one large list of 18,513 data rows; this list was the input of the CBOW, and the output was the word embedding model. The listing method is shown in Table 6.

Table 6. The list of segmentation that this study used to train the word embedding model.

Searching
The goal of the search function was to find the 10 most similar data points and return them to the user. After the user entered a term, a sentence, or messy text, the system used the text processor to process the input, producing a word segmentation of the same quality as that of the training data. The correct training-data segmentations were then extracted from the database, and the word embedding model computed the similarity between the user's input segmentation and each training-data segmentation, one by one, to find the 10 most similar data points.

Mapping
The comparison function retrieved the correct code and description from the database according to the 10 most similar results. The search function returned these 10 results to the comparison function as word segmentations, which could not be used directly; the comparison function was needed to restore each word segmentation to its original text narrative, which is why the segmentations were stored in the database. SQL statements were composed into database query commands that queried the 10 similar segmentations separately, found the correct code and correct description for each, and displayed them on the user interface for the user.
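The restore-and-lookup step might be sketched like this, reusing the example code from the Data Source section; the table and column names are illustrative assumptions, not the actual PCCES schema.

```python
import sqlite3

# Hypothetical mapping table; names are illustrative, not the PCCES schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE corrections (
    segmentation TEXT, correct_code TEXT, correct_description TEXT)""")
conn.execute(
    "INSERT INTO corrections VALUES (?, ?, ?)",
    ("ready-mixed concrete 280",
     "0333024003",
     "Building Concrete, Ready-Mixed Concrete, 280 kgf/cm2, M3"))

def lookup(segmentation):
    """Map a segmentation back to its correct code and description."""
    return conn.execute(
        "SELECT correct_code, correct_description FROM corrections "
        "WHERE segmentation = ?", (segmentation,)).fetchone()

code, description = lookup("ready-mixed concrete 280")
```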

Validation
To validate that the ACS could structuralize and correct the existing data in the PCCES, this research conducted a system evaluation of the performance of the ACS. Additionally, a user test was conducted to test whether the recommendation function of the ACS could help related personnel correct the data more efficiently. Two tests were designed to verify the usability of the ACS: one for the system evaluation and one for the user test. Whether the auto-correction function was complete and feasible was tested first, followed by the recommendation function.

System Evaluation
A system evaluation test was conducted to validate the accuracy of the developed auto-correction function of the ACS. This study used the PCCES as an example and used real data from two companies in Taiwan as the data sources. The following subsections will describe the data source, classification, and the results of the test.

Data Source
The real data were obtained from two companies in Taiwan: Knowledge Analysis Space Exploration, Inc. and United Geotech, Inc. Furthermore, we generated 10,906 pieces of 100% correct data from the PCCES manuals. Totals of 5847 and 1382 raw data points were obtained from Knowledge Analysis Space Exploration, Inc. and United Geotech, Inc., respectively. These data were desirable because they came from actual work projects: regardless of whether the codes or descriptions were correct, the projects were completed successfully, which means the semantics of the data are accurate or close to the intended object.
For the 100% correct data, as per the codes and specifications listed in Table 1, six layers were included in the table. Using the PCCES manuals, one code and one description were picked from each layer, depending on whether the description matched the target material, and these components were then combined. This yielded data that perfectly fit the PCCES manuals. When new material was used in a construction project, the user needed to insert it into the project documentation. For example, for a premixed concrete with no additives and a strength of 4000 psi, Table 1 could be used to generate the code and description: the code would be "0333024003," and the description would be "Building Concrete, Ready-Mixed Concrete, 280 kgf/cm², M3."

Classification
The corrected data were obtained after the raw data were processed by the developed system and were used to calculate the accuracy of the auto-correction function, thereby validating the performance of the ACS. Because the raw data and the correct data corresponded to each other one-to-one, the correct answers were known, so we could determine whether the system modified each data point correctly.
Four rules were used to determine whether the automatically corrected data were correct and thereby evaluate the accuracy of the auto-correction function: (1) if the raw data, the manually corrected data, and the system-corrected data were all the same, the result was correct; (2) if the manually corrected data and the system-corrected data were the same, the result was correct; (3) if the raw data and the system-corrected data were the same, the result was correct; and (4) any data not covered by the first three rules were checked manually.
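The four rules can be expressed as a small decision function; the function itself is our sketch, not the authors' code, and a return value of None marks rule (4), the manual-check bucket.

```python
def judge(raw, manual, system):
    """Apply evaluation rules 1-3; None means rule 4 (manual check)."""
    if raw == manual == system:
        return True   # rule 1: all three agree
    if manual == system:
        return True   # rule 2: manual and system corrections agree
    if raw == system:
        return True   # rule 3: system output matches the original data
    return None       # rule 4: needs manual confirmation

# e.g. raw, manually corrected, and system-corrected codes for one item
result = judge("0333024003", "0333024003", "0333024003")  # rule 1 -> True
```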

Results and Discussion for System Evaluation
In this section, the test results of the auto-correction function are presented. The four rules above were used to judge whether the corrected data were correct. After completing the comparisons, 7392 and 1532 data points were matched by the first and second rules, respectively, and 4268 data points were matched by the third rule. A further 1025 data points were confirmed manually. In total, 14,217 correct data points were obtained after the system processed the 18,551 input data points, for an accuracy of 76.64%. The results are shown in Table 7. In the auto-correction function, the system takes the single most similar result from the 10 most similar data points and uses it to correct the erroneous data. In the failure cases, we observed that the correct answer was sometimes included among the 10 most similar data points but was not ranked first. Other failures occurred because the items did not exist in the training data (e.g., tremie pipe and sprayed concrete), so those cases were not applicable to this system.
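The reported counts are internally consistent, as a quick check shows:

```python
# Counts matched by rules 1-3 plus manually confirmed cases (rule 4).
rule_counts = [7392, 1532, 4268, 1025]
correct = sum(rule_counts)           # 14217 correct data points
accuracy = correct / 18551           # ~0.7664, i.e. 76.64%
```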

User Test
Besides the auto-correction function, the ACS also provides a recommendation function to help users correct the data manually. To test the usability of the recommendation function, this study designed a test and invited eight subjects, comprising actual practitioners and students from the Department of Civil and Construction Engineering, to use the ACS. The subjects were asked to conduct nine tasks using both the PCCES and the ACS. The operation time of each task was recorded for further analysis.

Background of the Subjects
For the user test, we invited eight subjects: practitioners with many years of industry experience who could use the PCCES proficiently at work, and students with no work experience and no experience using the PCCES. This design tested whether people from the same professional field but with different familiarity with the PCCES would produce different test results. The subjects comprised five civil engineering students without any experience using the PCCES and three civil engineers with 1, 4, and 8 years of experience, respectively. The details of these users are shown in Table 8.

Testing Scenario
In practice, users often work with email and Microsoft Excel, and the resulting electronic documents contain various kinds of information. Users need to retrieve useful information from these documents and then manually look up the corresponding entries in the manual, finding the correct code and specification description in the PCCES that meet the work specification.
In this test scenario, we simulated the user's process of using the PCCES to create data according to the descriptions of materials, manpower, and equipment in an Excel file. Ten raw materials were extracted from actual work project data to simulate a real work situation (Table 9). The eight testers were asked to use the PCCES and the ACS to find the correct codes and specifications, and the time each tester took to find a code was recorded; these statistics were then used to evaluate the benefits of the ACS for its users. In this section, the results of the recommendation-function test are presented. For the recommendation function, we mainly compared whether the ACS could help users work more efficiently than the PCCES, so the operating time was examined.
For the recommendation function, the times taken by the users to process 10 raw data points with each of the two systems were recorded, and the processing times for each topic were averaged, as shown in Tables 10 and 11. Table 10 shows that after repeated operation of the PCCES, every user's operating time converged to a minimum that could not be lowered further. In contrast, as shown in Table 11, regardless of how proficient a user was with the PCCES, operating times in the ACS were consistently low. The average operating times for each question in the two systems are shown in Figure 8. Even when users were allowed to repeat the operation 10 times in the PCCES, its average working time remained much higher than that of the ACS. In this testing, we found that with continuous operation of a system, the user's operating time decreases as familiarity increases until it reaches a bottleneck that prevents further reduction. The decline in operating time was significant for the PCCES, whereas for the ACS the decrease was far less pronounced and operating times were almost constant. This shows that the recommendation function of the ACS largely eliminates the proficiency factor. Compared with the PCCES, using the recommendation function of the ACS helped users save 54% of the operation time.

Discussions
The main contribution of this research is the development of a system that can automatically correct public construction-engineering databases in Taiwan that are full of chaotic data. Two main functions are implemented in the ACS: the auto-correction function and the recommendation function. The auto-correction function requires no manual intervention during processing, which reduces the user's workload. The recommendation function allows users to input different keywords or similar words; it processes the input text and automatically finds the most similar standard data to present to the user. The language model was trained on the 18,551 collected data points and used to calculate the similarity between user input text and the correct data; the resulting matches were then used by the auto-correction function or presented to the user as a recommendation list. The auto-correction function corrected the 18,551 data points at an accuracy of 76%. Ten of the 18,551 data points were used for the user-operation test. Compared with the PCCES, using the recommendation function helped users save 54% of the operation time.

Contributions
The existing public database in Taiwan contains a large amount of messy data, to the point that the Taiwanese government has had to stipulate the accuracy rate of the data by law. We attempted to use the auto-correction function of the ACS to solve this problem of messy data.
(1) The ACS can automatically correct and structuralize untrusted construction data. Many studies have attempted to structuralize semi-structured data. For instance, Woo et al. used a text clustering method to clean large-scale medical report data [54]. Soto et al. proposed the ViTA-SSD system, which allows the user to explore insightful patterns in semi-structured data through a visualized analysis method [55]. However, these methods still require substantial human interaction for checking, correcting, and exploring results. In contrast, the ACS uses a machine-learning-based method that can automatically correct and structuralize construction data, improving quality and output efficiency without human intervention during processing, which reduces the workload of users. Furthermore, automating operations to achieve maximum benefit is a current industrial target.
(2) The ACS provides a recommendation function that can improve the working efficiency and accuracy of related personnel. Because not everyone is proficient in the PCCES coding specifications, the produced data are often non-standard. Even for users skilled in the PCCES, its operation steps impose a bottleneck beyond which the operation time cannot be reduced. The recommendation function of the ACS attempts to solve both problems: by providing a more user-friendly experience and removing the need for familiarity with the coding standards, it improves both the user's operating time and the data accuracy.
(3) The ACS successfully reduces the threshold for operating the data management system. The test of the recommendation function showed that, regardless of whether users had used the PCCES before, their proficiency increased with the number of operations, but their operation time stopped decreasing past a certain level; the PCCES coding scheme and operation interface impose a floor on operating time even for skilled users. The recommendation function of the ACS lets the user enter loose keywords or similar words, such as material names, aliases, specifications, or other obscure data, and then provides results that meet the PCCES specifications for the user to select from. The benefit of the recommendation function is that it removes proficiency as a variable: regardless of whether users have a professional background or are familiar with the coding standards, they can rely on keywords to find the correct result.
(4) The system we developed is not limited to the PCCES; it can be applied to different CCS systems. Since the system is based on the CBOW language model, the data collected determine the application direction of the system. In this study, we trained the model on data related to the PCCES and used it to classify and correct erroneous PCCES data. In other words, if the training data were replaced with those of another CCS system, this system could be utilized there as well.

Limitations
(1) The data collected limited the scope of the ACS within the PCCES. In machine learning, data are the foundation of everything; only with more data can the application scope and accuracy be expanded. In this study, we implemented only one section of the PCCES coding specification, to which most of the collected data belong, as it contains some of the most commonly used codes. Furthermore, the specification descriptions in this section include descriptions of concrete strength. Improving this section would benefit the coding accuracy rate of the PCCES as a whole.
(2) The accuracy of the ACS is not yet sufficient. In this study, we used 18,551 PCCES data points to train the CBOW model, applied it to a small range of the PCCES, and obtained a 76.64% correct rate. Although this is much higher than the 37.47% announced by the PCC, it does not meet expectations, and the application range of the ACS within the PCCES, which depends on the data we collected, is not wide. If a higher accuracy rate cannot be achieved within the current range, the average accuracy will decrease further when the ACS is extended to ranges that are more difficult to correct automatically. For a system based on machine learning, the amount of data collected affects the trained model and hence its accuracy; if we can obtain more data, we may be able to improve the accuracy rate.

Conclusions
This research proposed a data correction system, the ACS, for automatically correcting public construction data. The system provides auto-correction and recommendation features to improve human working performance and reduce the threshold for operating the data management system. The ACS, a text classification system, was developed using language models based on natural language processing and machine learning to correct public databases in the construction field in Taiwan; it is also intended to help users produce correct data more efficiently. By using a machine-learning-based language model to analyze text semantics, the ACS can improve the efficiency of actual users and the accuracy of data in construction projects. In the automatic correction test, a 76% accuracy rate was obtained after correcting 18,551 data points. A user test was also conducted on the recommendation function: eight participants were given a task containing 10 real data points, along with a questionnaire, and were observed solving the problem in the two systems. On average, 51.95% of the processing time per data point was saved. The test results showed that users were more efficient with the ACS than with the original system and could accurately produce materials that meet the specifications. The results show that the ACS can effectively save CCS operation time and thus reduce the threshold for operating the data management system by providing a recommendation function. The proposed method can be used not only in the PCCES but can also be deployed to other CCS systems.