Tackling the Data Sourcing Problem in Construction Procurement Using File-Scraping Algorithms †

Abstract: The Architecture, Engineering and Construction (AEC) sector has a lower adoption rate of machine learning (ML) tools than other industries with similar characteristics. A significant contributing factor to this lower adoption rate is the limited availability of data, as ML techniques rely on large datasets to train algorithms effectively. However, the construction process generates substantial data that provide a detailed characterisation of a project. In this regard, this paper presents a data-scraping algorithm that systematically searches construction procurement repositories to develop an ML-ready dataset for training ML and natural language processing (NLP) algorithms focused on construction's procurement phase. This tool automatically scrapes procurement repositories, developing a procurement file dataset comprising bills of quantities (BoQs) and project specifications.


Introduction
In recent years, there has been increased interest in implementing ML and NLP tools in the AEC sector [1,2]. Still, the adoption rate of these tools is low compared to other industries with similar characteristics [3].
A significant contributing factor to this lower adoption rate is the limited availability of data, as ML techniques rely on large datasets to train algorithms effectively [4,5]. This difficulty in sourcing abundant data in the construction sector represents one of the main challenges for ML developers within the AEC industry [6,7]. Nonetheless, this paradigm clashes with the inherent workings of the construction process since it generates a significant amount of data that offer comprehensive characterisation of a project [4,8].
In the specific case of Portuguese Construction Procurement, public construction projects are mandatorily submitted to online, open-source repositories [9,10]. However, the consultation and extraction of procurement files is decentralised and not automated, making data agglomeration difficult and time-consuming [11]. Previous studies have tackled this difficulty by scraping procurement data in these repositories to a tabular dataset to be used in ML applications [11,12]. Thus, if the necessary diligence is ensured, the procurement phase represents a great opportunity for data aggregation.
In light of this, this paper presents a data-scraping algorithm capable of extracting data from construction procurement repositories. This tool automatically scrapes procurement repositories, developing a procurement file dataset comprising BoQs and project specifications. Future studies will use the gathered data to develop an ML-ready dataset for training ML and NLP algorithms focused on construction's procurement phase.
The remainder of this document is organised into three sections: Section 2 presents the methods and codes developed to scrape open-source data to a semi-structured format; Section 3 describes the gathered data and briefly highlights the framework where scraped data will be used; and Section 4 presents our conclusions and final remarks.

Methods
Following previous work [11,12], a reengineered version of the PPPData algorithm was developed. As highlighted in Figure 1, this new algorithm focused on scraping procurement files from the open-source online repository Portal Base [9] using the Selenium [13] and Chrome Driver [14] Python libraries.
Eng. Proc. 2023, 53, 34
Although bulk download is possible using this algorithm, a month-by-month method was used for scraping, in which a search query stating the month and year to be scraped was passed to the algorithm. This method allowed for easier database organisation in later phases of data processing. Next, the algorithm opened Google Chrome and loaded the Portal Base results page for that specific month. The Portal Base platform organises its information in a table, where each line is a contract. Each contract has a detailed page from which procurement files can be downloaded. The algorithm looped through all the tables on each page and the lines in each table to open the detailed contract page and download the procurement files.
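The month-by-month strategy and the nested page, table and line traversal can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `month_range` and `scrape_month`, and the representation of result pages as nested lists, are assumptions made for illustration, and the browser automation itself is left out.

```python
from datetime import date
from typing import Iterator, Tuple


def month_range(start: date, end: date) -> Iterator[Tuple[int, int]]:
    """Yield (year, month) pairs from start to end, both inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def scrape_month(year: int, month: int, pages) -> list:
    """Collect one record per contract line for a given month.

    `pages` is a hypothetical iterable of result pages; each page is a
    list of tables and each table is a list of contract identifiers.
    """
    records = []
    for page in pages:
        for table in page:
            for contract in table:
                # A real scraper would open the detailed contract page here
                # and download its procurement files.
                records.append(
                    {"year": year, "month": month, "contract": contract}
                )
    return records
```

Issuing one query per `(year, month)` pair keeps each batch of downloads tied to a single period, which is what makes the later year and month folder organisation straightforward.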
The procurement files could be hosted on four different online platforms: (1) Acingov [15]; (2) Saphetygov [16]; (3) Vortalgov [17]; (4) Anogov [18]. For the first two platforms, a simple request to the platform API using the reference located in the procurement files' download link was sufficient to obtain a compressed folder with the procurement files of that contract.
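A hedged sketch of this step, using only the Python standard library, is given below. The query-parameter name `ref`, the example URL and the assumption that the compressed folder is a ZIP archive are all placeholders, since the platforms' actual API formats are not described here; the HTTP request itself is omitted.

```python
import io
import zipfile
from pathlib import Path
from urllib.parse import parse_qs, urlparse


def extract_reference(download_link: str, param: str = "ref") -> str:
    """Return the contract reference carried in a download link.

    The parameter name `ref` is a placeholder assumption.
    """
    values = parse_qs(urlparse(download_link).query).get(param)
    if not values:
        raise ValueError(f"no '{param}' parameter in {download_link!r}")
    return values[0]


def unpack_archive(payload: bytes, destination: Path) -> list:
    """Extract a ZIP payload into destination and return the member names."""
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(destination)
        return archive.namelist()
```

In use, the bytes returned by the platform request would be passed to `unpack_archive` together with the folder reserved for that contract.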
In the case of Vortal, a new web page was opened from Vortal's website. This page had all the information associated with the contract in question, including the procurement files, in a table. The algorithm had to loop through all the lines in the table and individually download each file; the files were then grouped into a single folder.
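The line-by-line traversal of a file table can be sketched as follows. The authors' tool drives a live browser with Selenium; this example instead parses a static HTML string with the standard library, purely to illustrate collecting one download link per table line. The markup and the names `TableLinkCollector` and `file_links` are hypothetical.

```python
from html.parser import HTMLParser


class TableLinkCollector(HTMLParser):
    """Collect the href of every link found inside a table row."""

    def __init__(self) -> None:
        super().__init__()
        self._row_depth = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row_depth += 1
        elif tag == "a" and self._row_depth > 0:
            # Only links located inside a row count as file links.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "tr" and self._row_depth > 0:
            self._row_depth -= 1


def file_links(page_html: str) -> list:
    """Return the file links of a contract page in table order."""
    parser = TableLinkCollector()
    parser.feed(page_html)
    return parser.links
```

Each returned link would then be downloaded individually and saved into the same contract folder.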
A similar process had to be carried out for the Anogov-based contracts. However, the algorithm had to open new rows in the table of files by clicking a hidden button, only visible if the mouse hovered over a symbol in the table. Each file was downloaded through a request to the Anogov API using the reference in the hidden row. Finally, all the files were grouped into a folder.
At the time of writing, all available procurement files from January 2020 to April 2023 have been gathered in a raw dataset comprising 8612 folders from as many public venture construction contracts.
All code used in this paper's methodology can be accessed on GitHub using the following link: https://github.com/LuisJSousa/ScrapeProcurementFiles (accessed on 20 November 2023).

The Data
As previously mentioned, the algorithm successfully scraped over 8500 folders of files from as many contracts, each containing text-based documents in Microsoft Excel, Microsoft Word and PDF formats. These files represent various procurement documents, including BoQs, project specifications and other legally required files essential for procurement processes.
The existing dataset is in a raw format. Its structure follows a hierarchical order, with folders organised by year and month, further divided by the name or number of each contract. All the documents associated with each specific contract are stored within these final folders.
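The hierarchy described above could be reproduced and inspected along the following lines. This is a sketch under assumptions: the zero-padded `YYYY/MM` folder names, the sanitising rule for contract names and the helper names `contract_dir` and `count_by_extension` are illustrative choices, not the dataset's documented layout.

```python
import re
from collections import Counter
from pathlib import Path


def contract_dir(root: Path, year: int, month: int, contract: str) -> Path:
    """Return root/YYYY/MM/<contract>, with unsafe characters replaced."""
    safe_name = re.sub(r"[^\w.-]+", "_", contract).strip("_")
    return root / f"{year:04d}" / f"{month:02d}" / safe_name


def count_by_extension(root: Path) -> Counter:
    """Count all files under root by lowercase file extension."""
    return Counter(p.suffix.lower() for p in root.rglob("*") if p.is_file())
```

A count by extension gives a quick view of how many Excel, Word and PDF documents the raw dataset holds before any classification is attempted.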
In future endeavours, multiple rounds of data treatment will be necessary to classify the different types of documents into standardised groups, making them suitable for machine learning applications.
The primary objective of these future efforts will be to create a substantial dataset of BoQs, which will be instrumental in automating the generation of these documents for budget proposal purposes, as established in the framework shown in Figure 2, presented in [19].
The framework involves data aggregation using the web-scraping algorithm presented in this paper. Subsequently, a "master" BoQ is selected, preferably the one most frequently used by enterprises selected to participate in this study. In case this "master" BoQ is not available, an arbitrated BoQ will be chosen.
Next, different algorithms will be developed, employing various architectures and Python libraries focused on ML and NLP, and trained on the scraped data to classify BoQ tasks. This training phase is followed by a testing phase in which the accuracy of the algorithms in classifying BoQ tasks will be evaluated. Moreover, the efficiency of the algorithms will be compared with the manual classification typically performed by technicians.
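Since the classification architectures are left to future work, the following is only a toy baseline that shows the shape of the classify-then-evaluate loop. It is not an ML model and not the authors' method. It assigns each task to the "master" BoQ item with the largest word overlap and measures agreement with manual labels; the item labels and descriptions are invented for illustration.

```python
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Return a bag of lowercase word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def classify(task: str, master_items: dict) -> str:
    """Return the master-BoQ label sharing the most tokens with the task.

    Ties are resolved in favour of the first label in `master_items`.
    """
    task_tokens = tokens(task)

    def overlap(label: str) -> int:
        return sum((task_tokens & tokens(master_items[label])).values())

    return max(master_items, key=overlap)


def accuracy(predicted: list, manual: list) -> float:
    """Return the share of predictions that match the manual labels."""
    if not manual or len(predicted) != len(manual):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p == m for p, m in zip(predicted, manual)) / len(manual)
```

Any trained classifier could replace `classify` while `accuracy` stays unchanged, which keeps the comparison with manual classification consistent across architectures.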
BoQs used during budgeting will be uploaded to the database to enable continuous learning, thereby increasing the volume of historical data that contribute to the algorithm's classification capabilities. This iterative learning process will enhance the tool's performance and effectiveness over time.
In this sense, the scraping algorithm developed in this communication is a crucial step in achieving future goals, because establishing a well-organised and extensive dataset will significantly enhance the potential for accurate and efficient automation of these processes.

Conclusions
The implementation of ML and NLP applications in the AEC sector is still in its early stages compared to other industries with similar characteristics. A major obstacle to progress in this area is sourcing relevant and reliable data. However, the construction industry itself generates vast amounts of data during its operations, presenting a unique opportunity to tackle this issue and enabling the use of ML tools.
For the specific case of the Portuguese AEC sector, procurement files are mandatorily submitted to online repositories, presenting a significant opportunity for data agglomeration.
In this regard, this research proposes a solution to the data-sourcing problem through the use of data-scraping algorithms. By employing an automated approach, the presented algorithm can extract information from online open-source repositories containing procurement files suitable for ML applications in the AEC domain.
Notably, the algorithm was capable of scraping more than 8500 file folders from as many public procurement contracts. This led to the creation of a large and diverse raw dataset comprising procurement documents such as BoQs and project specifications, laying the groundwork for future advancements in ML and NLP within the construction industry. Future studies will focus on processing and organising the gathered data to create a well-structured dataset. This critical step will pave the way for the development of complex ML applications aimed at automating the creation of BoQs for procurement purposes. The final goal is to transition the budget-making process from a laborious classification task to a more efficient verification-based approach. By streamlining the BoQ generation process, this technology can accelerate budget proposal development in the construction sector, saving time and resources while improving accuracy and efficiency.

Supplementary Materials:
The presentation materials can be downloaded at: https://www.mdpi.com/xxx.
