A Method to Enable Automatic Extraction of Cost and Quantity Data from Hierarchical Construction Information Documents to Enable Rapid Digital Comparison and Analysis
Abstract
1. Introduction
1.1. Related Work
1.2. The Novelty of the Current Method
- An end-to-end method: many approaches managed to successfully solve a part of the data mining process, but very few encompass the processes of data extraction, data wrangling, and data preprocessing to make assets from different projects directly comparable.
- Strong validation: the suggested method was assessed with a large number of assets coming from real historical projects, presenting reliable and robust results.
- Different approach: the suggested approach relied on existing technologies from the fields of data mining and machine learning, assembled in a new way to target a different purpose. This combination yields a single method that covers the whole process.
2. The Method
2.1. Materials
- Anaconda Navigator version 2.2 for creating an environment.
- The IDE (integrated development environment) Jupyter Notebook version 6.4.5.
- Python language version 3.7.
- Different open-source libraries, notably “pandas” for generating the data structures and “scikit-learn” for providing the machine learning capabilities.
2.2. Input Data
- Costs for the first project were presented as a single PDF file with an elemental breakdown of the work for each of the 4 buildings included in the project, with a total of 2217 items grouped into 88 elemental bills.
- Costs for the second project were presented as 17 separate trade-based Excel work packages, with a total of 1553 items.
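Bringing separately delivered work packages into a single table can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `combine_work_packages`, the columns `project` and `work_package`, and the sample rows are all assumptions made for the example.

```python
from typing import Dict

import pandas as pd


def combine_work_packages(packages: Dict[str, pd.DataFrame], project: str) -> pd.DataFrame:
    """Stack per-trade item tables into one table, recording the origin of each row."""
    frames = []
    for package_name, items in packages.items():
        tagged = items.copy()
        tagged["project"] = project            # hypothetical column name
        tagged["work_package"] = package_name  # hypothetical column name
        frames.append(tagged)
    # ignore_index=True gives one sequential integer index across all packages.
    combined = pd.concat(frames, ignore_index=True)
    combined.index.name = "ID"
    return combined


# Two tiny made-up work packages standing in for the separate Excel files.
packages = {
    "Groundworks": pd.DataFrame(
        {"description": ["Site preparation", "Excavating"], "quantity": [10, 20]}
    ),
    "Masonry": pd.DataFrame({"description": ["External wall"], "quantity": [100]}),
}
combined = combine_work_packages(packages, project="Project 2")
print(combined)
```

In practice each frame could come from `pd.read_excel`, which requires an Excel engine such as openpyxl to be installed. Extracting the items of the PDF-based project would need a separate step that is not shown here.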
2.3. Understanding All Processes of the Method
- ID: An integer that increased sequentially and uniquely identified each asset registered in the dataset.
- Bill attribute: A string-type attribute that identified the number of the bill where the asset was located, together with a short description of it, for example, “Bill 123 Mechanical and plumbing”.
- Bill description: Another string-type attribute which contained redundant information, including only a short description of the bill. It was later used for categorization purposes.
- Category: It was a categorical attribute containing a string that uniquely identified the higher level of the category for the SMM7 standard [17] that the asset belonged to.
- Subcategory: Another categorical attribute that identified the second layer of the category for the SMM7 standard, including a more specific categorization. For example, for the category “D groundwork”, we found the subcategory “D20: excavating and filling”.
- Description 1, 2 and 3: As additional information, each row contained three different descriptions, where the first description contained the most generic information and the last one was the most specific. The information that the descriptions contained varied considerably. For example, they could state a dimensional constraint, such as “maximum depth not exceeding 1.50 m”, or they could specify the type of work that was carried out, such as “Site preparation”.
- Quantity: An integer number that specified the number of items needed.
- Unit: A string which described the unit of measure, such as meters, item, or square meters. For example, if the quantity of an item was 100 and the unit of measure indicated square meters, the dataset indicated that 100 square meters of that specific asset was needed on a specific project.
- Rate: A floating-point number giving the price that was charged for each unit of measure. For example, it might have stated that for each square meter of a constructed wall, the client would be charged 157.57 GBP.
- Total cost: The number obtained by multiplying the rate and the quantity. Following the previous examples, if the rate for each square meter of a wall was 157.57 GBP and the quantity was 100, the total cost would be 15,757 GBP.
- Letter: The BoQs used as input files contained a letter that uniquely identified each asset located in the same categories and subcategories.
- Page number: As a helpful piece of information, the processed data structure included the page number where the original item was registered in the input file. In this way, the accountant surveyor could double-check the correctness of the attributes more quickly.
- Trade-based category name: One of the projects also contained a trade-based classification of all their assets. Hence, this string attribute worked as a classification attribute, identifying the categories that it belonged to.
- Trade-based category number: The amount of the total cost that was allocated to that specific trade-based category. In cases where the asset only belonged to one category, this number was the same as the total cost attribute.
- Second trade-based category name: Since SMM7 is not a trade-based standard, there were a few cases where the same asset in SMM7 belonged to two categories with a trade-based approach. Hence, this attribute was blank in most of the cases, and it would specify the second category that the asset belonged to in case of conflict.
- Second trade-based category number: In those cases where the asset belonged to more than one trade-based category, this number indicated the cost that was allocated to the second category. For example, a fictitious asset classified in the SMM7 class “Masonry” with a total cost of 10,000 GBP would, under the trade-based standard, be split into 4000 GBP for “Substructure” and 6000 GBP for “External walls”.
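Two consistency rules follow from the attribute definitions above: the total cost should equal the rate multiplied by the quantity, and the two trade-based amounts should add up to the total cost. The sketch below checks both with pandas. It is illustrative only: the column names and the function `check_items` are hypothetical, units are assumed to be stored as strings and rates as floating-point numbers, and the rows are made up (the first two echo the wall and masonry examples, with an invented rate and quantity for the masonry row).

```python
import numpy as np
import pandas as pd

# Illustrative rows; column names are hypothetical shorthand for the attributes above.
items = pd.DataFrame(
    {
        "quantity": [100, 50, 10],
        "unit": ["m2", "m2", "m3"],
        "rate": [157.57, 200.00, 3.00],
        "total_cost": [15757.00, 10000.00, 35.00],  # third total is deliberately wrong
        "trade_name_1": ["External walls", "Substructure", "Substructure"],
        "trade_amount_1": [15757.00, 4000.00, 35.00],
        "trade_name_2": ["-", "External walls", "-"],
        "trade_amount_2": [0.00, 6000.00, 0.00],
    }
)


def check_items(items: pd.DataFrame, tolerance: float = 0.01) -> pd.DataFrame:
    """Flag rows whose totals disagree with rate * quantity or with the trade split."""
    expected_total = items["rate"] * items["quantity"]
    # Compare with an absolute tolerance because the costs are floating-point values.
    total_ok = np.isclose(items["total_cost"], expected_total, rtol=0, atol=tolerance)
    split_sum = items["trade_amount_1"] + items["trade_amount_2"]
    split_ok = np.isclose(items["total_cost"], split_sum, rtol=0, atol=tolerance)
    return pd.DataFrame({"total_ok": total_ok, "split_ok": split_ok}, index=items.index)


report = check_items(items)
print(report)
```

Rows that fail either check could then be traced back to the input file through the page number attribute for manual review.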
2.4. The Limitations of the Study
- Representative data: The success of the method might depend on the diversity and representativeness of the historical project data used. If the subset of projects does not cover a wide range of project types, sizes, and complexities, the method’s accuracy and applicability to real-world scenarios could be limited.
- Expert knowledge dependency: The method relies on a combination of data science techniques and experts’ knowledge for data classification and standardization. Although some cost classifications leave no room for discussion, in specific cases this dependency could lead to bias or inconsistencies if the experts’ knowledge is not fully comprehensive or if different experts interpret an item differently.
3. Results
4. Conclusions
Future Work
- Firstly, research should investigate the challenges and barriers that construction companies might face when adopting and implementing the proposed method. This could include factors such as initial setup, integration with existing workflows, and overcoming resistance to change.
- Secondly, although the method was tested with a large set of assets, its validation could be extended to more historical projects of different kinds and to input files with different data structures. This would give the method more flexibility, credibility, and applicability.
- Thirdly, a comparative study between the proposed automated method and traditional manual methods of cost data extraction and organization could be conducted. This would help demonstrate the efficiency gains and accuracy improvements offered by the new approach.
- Finally, the present method could be explored and aligned with different existing industry standards for cost estimation and classification, enhancing its compatibility and encouraging its adoption by other companies.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Yan, H.; Yang, N.; Peng, Y.; Ren, Y. Data mining in the construction industry: Present status, opportunities, and future trends. Autom. Constr. 2020, 119, 103331.
- Symonds, B.; Barnes, P.; Robinson, H. New Approaches and Rules of Measurement for Cost Estimating and Planning. In Design Economics for the Built Environment: Impact of Sustainability on Project Evaluation; John Wiley & Sons: Hoboken, NJ, USA, 2015; pp. 31–46.
- Fisher, D.; Miertschin, S.; Pollock, D.R., Jr. Benchmarking in Construction Industry. J. Manag. Eng. 1995, 11, 50–57.
- Zou, Y.; Kiviniemi, A.; Jones, S.W. Retrieving similar cases for construction project risk management using Natural Language Processing techniques. Autom. Constr. 2017, 80, 66–76.
- Desai, V.S. Improved Decision Tree Methodology for the Attributes of Unknown or Uncertain Characteristics-Construction Project Prospective. Int. J. Appl. Manag. Technol. 2008, 6, 201.
- Zhong, Y. Research on Construction Engineering Project Management Optimization Based on C4.5 Improved Algorithm. IOP Conf. Ser. Mater. Sci. Eng. 2019, 688, 055036.
- Sebastiani, F. Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 2002, 34, 1–47.
- Soibelman, L.; Wu, J.; Caldas, C.; Brilakis, I.; Lin, K.Y. Management and analysis of unstructured construction data types. Adv. Eng. Inform. 2008, 22, 15–27.
- Moreno, V.; Génova, G.; Parra, E.; Fraga, A. Application of machine learning techniques to the flexible assessment and improvement of requirements quality. Softw. Qual. J. 2020, 28, 1645–1674.
- Ahn, S.J.; Han, S.U.; Al-Hussein, M. Improvement of transportation cost estimation for prefabricated construction using geo-fence-based large-scale GPS data feature extraction and support vector regression. Adv. Eng. Inform. 2020, 43, 101012.
- Akanbi, T.; Zhang, J. Design information extraction from construction specifications to support cost estimation. Autom. Constr. 2021, 131, 103835.
- Norman, E.S.; Brotherton, S.A.; Fried, R.T. Work Breakdown Structures: The Foundation for Project Management Excellence; John Wiley & Sons: Hoboken, NJ, USA, 2008.
- Ilmi, A.A.; Supriadi LS, R.; Latief, Y.; Muslim, F. Development of dictionary and checklist based on Work Breakdown Structure (WBS) at seaport project construction for cost estimation planning. IOP Conf. Ser. Mater. Sci. Eng. 2020, 930, 012007.
- Stoy, C.; Dreier, F.; Schalcher, H.-R. Construction duration of residential building projects in Germany. Eng. Constr. Archit. Manag. 2007, 14, 52–64.
- Hong, H.; Tsangaratos, P.; Ilia, I.; Liu, J.; Zhu, A.-X.; Chen, W. Application of fuzzy weight of evidence and data mining techniques in construction of flood susceptibility map of Poyang County, China. Sci. Total Environ. 2017, 625, 575–588.
- Murray, G.P. Rules and Techniques for Measurement of Services. Meas. Build. Serv. 1997, 9–18.
- Keily, P.; McNamara, P.H. SMM7 Explained and Illustrated; RICS Books: Coventry, UK, 2003.
Aim | Approach | Reference |
---|---|---|
To identify similar construction projects for risk management. | A combination of NLP (Natural Language Processing) and machine learning with a case-based reasoning approach. | [4] |
To enhance the classification of attributes in construction projects. | A combination of data analysis and machine learning to identify the main factors that drive these classifications and provide reliable predictions. | [5] |
The optimization of risks applied to construction projects. | A two-step method is suggested based on the generation of the optimization attributes and the implementation of the algorithm C4.5. | [6] |
Automatic text categorization of a project’s assets. | A system that harnesses the benefits of NLP and machine learning for making an automatic text categorization. | [7] |
To analyze the variability and the types of data structures used in construction projects. | A method that combines data extraction, data mining, and analysis to assess the variability of structures among different projects. | [8] |
To identify the non-flood areas in Poyang County, China. | To carry out different processes of data extraction and analysis that materialized in the identification of the flood risk areas. | [9] |
To review and assess the current state of data mining in construction projects. | A systematic review of the historical application of data mining in construction projects through the years. | [1] |
To decrease the transportation costs of prefabricated construction pieces. | The approach extracted and processed geospatial data to feed a support vector machine for regression. | [10] |
To automate the process of data extraction to support cost estimation. | A method composed of three processes: extracting design information, matching the specified material to items in the database, and retrieving the price information of those materials. | [11] |
To form a dictionary based on the WBS standard [12] to support cost estimation. | To carry out different surveys based on experts’ opinions to develop the dictionary. | [13] |
To assess the main factors of the duration of construction projects. | A data analysis was performed to assess the main factors that had an influence in determining the length of construction projects. | [14] |
ID | Bill Description | Category | Subcategory | Description Level 1 | Description Level 2 |
---|---|---|---|---|---|
0 | Groundworks and substruct. | C demolition/… | C90 alterations… | Various loc. on site | Existing perimeter fencing and disp… |
1 | Groundworks and substruct. | C demolition/… | C90 alterations… | Various loc. on site | Remove existing timber fencing int… |
2 | Groundworks and substruct. | D groundwork | D20 excavating… | Site preparation | Site preparation |
3 | Groundworks and substruct. | D groundwork | D20 excavating… | excavating | To reduce levels |
4 | Groundworks and substruct. | D groundwork | D20 excavating… | excavating | Basements and the like |
Row | Description 3 | Quantity | Unit | Rate | Total Cost | Letter | Page Num. |
---|---|---|---|---|---|---|---|
0 | Complete; provisional | 113 | m | 2258 | 255,154 | a | 1 |
1 | Complete; provisional | 154 | m | 2258 | 347,732 | b | 1 |
2 | Brushes, scrub, undergrowth, hedges, trees and … | 3328 | m2 | 237 | 765,036 | a | 1 |
3 | Maximum depth not exceeding 2.00 m | 1140 | m3 | 339 | 38,646 | b | 1 |
4 | Maximum depth not exceeding 1.00 m | 242 | m3 | 339 | 82,038 | c | 1 |
Row | Trade-Based Category Name | Trade-Based Category Number | Trade-Based Cat. Name 2 | Trade-Based Cat. Number 2 |
---|---|---|---|---|
0 | Site works | 255,154 | - | 0 |
1 | Site works | 347,732 | - | 0 |
2 | Substructure | 765,036 | - | 0 |
3 | Substructure | 38,646 | - | 0 |
4 | Substructure | 82,038 | - | 0 |
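Once every asset carries up to two trade-based categories with their amounts, costs can be compared across projects by reshaping the two name and amount pairs into one long table and summing per category. The following is a sketch under stated assumptions rather than the authors' code: the column names, the function `cost_by_trade`, the project labels, and all amounts are invented for illustration, and “-” is assumed to mark an unused second category as in the table above.

```python
import pandas as pd

# Made-up amounts for two hypothetical projects; "-" marks an unused second category.
items = pd.DataFrame(
    {
        "project": ["A", "A", "B"],
        "trade_name_1": ["Site works", "Substructure", "Substructure"],
        "trade_amount_1": [1000.0, 2500.0, 4000.0],
        "trade_name_2": ["-", "-", "External walls"],
        "trade_amount_2": [0.0, 0.0, 6000.0],
    }
)


def cost_by_trade(items: pd.DataFrame) -> pd.DataFrame:
    """Return total cost per trade-based category (rows) and project (columns)."""
    parts = []
    for suffix in ("1", "2"):
        part = items[["project", f"trade_name_{suffix}", f"trade_amount_{suffix}"]]
        part = part.rename(
            columns={f"trade_name_{suffix}": "trade", f"trade_amount_{suffix}": "amount"}
        )
        parts.append(part)
    # One row per (asset, category) allocation, dropping unused second categories.
    long = pd.concat(parts, ignore_index=True)
    long = long[long["trade"] != "-"]
    totals = long.groupby(["trade", "project"])["amount"].sum()
    # Categories absent from a project are reported as 0.0 rather than missing.
    return totals.unstack("project", fill_value=0.0)


comparison = cost_by_trade(items)
print(comparison)
```

Summing in long format means an asset split between two categories contributes to both, so the grand total still equals the sum of the individual total costs.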
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Adanza Dopazo, D.; Mahdjoubi, L.; Gething, B. A Method to Enable Automatic Extraction of Cost and Quantity Data from Hierarchical Construction Information Documents to Enable Rapid Digital Comparison and Analysis. Buildings 2023, 13, 2286. https://doi.org/10.3390/buildings13092286