A Data-Driven Approach to Assess the Risk of Encountering Hazardous Materials in the Building Stock Based on Environmental Inventories

Abstract: The presence of hazardous materials hinders the circular economy in construction and demolition waste management. However, traditional environmental investigations are costly and time-consuming, and thus lead to limited adoption. To deal with these challenges, the study investigated the possibility of employing registered records as input data to achieve in situ hazardous building materials management at a large scale. Through characterizing the eligible building groups in question, the risk of unexpected cost and delay due to acute abatement could be mitigated. Merging the national building registers and the environmental inventory from renovated and demolished buildings in the City of Gothenburg, a training dataset was created for data validation and statistical operations. Four types of inventories were evaluated to identify the building groups with adequate data size and data quality. The observations' representativeness was described by plotting the distribution of building features between the Gothenburg dataset and the training dataset. Evaluating the missing data and the positive detection rates affirmed that reports and protocols could locate hazardous materials in the building stock. The asbestos and polychlorinated biphenyl (PCB)-containing materials with high positive detection rates were highlighted and discussed. Moreover, the potential inventory types and building groups for future machine learning prediction were delineated through the cross-validation matrix. The novel study contributes to the method development for assessing the risk of residual hazardous materials in buildings.


Introduction
Although a series of bans on the use of hazardous materials in construction has been imposed since the 1970s, an appreciable quantity of contaminated materials remains in the existing building stock [1]. The frequent presence of asbestos-containing materials [2,3] and components containing polychlorinated biphenyls (PCBs) [4] is the result of mass production and adoption from the 1920s to the 1990s [5]. In addition to the concerns for human health and the environment, demolition and renovation projects become more expensive and take longer if hazardous materials are encountered unexpectedly. The decontamination and abatement costs account for a noticeable amount, arising from waste disposal and the preparation of working precautions [6].
Advanced data mining and statistical learning have been accessible to emerging research fields in recent years. Information about the building stock has also been made more available, mainly through governmental open database initiatives in several different areas, i.e., investigation records, project economy, and so on [6]. These two developments dramatically improve the estimation capability for locating hazardous building materials in demolition and renovation projects. Furthermore, by coupling building stock data and hazardous product registers, predictive detection of in situ hazardous building materials is made possible [7].
The study explores and develops a data-driven approach to assess the detection of hazardous materials in the building stock. The importance of effective hazardous materials management is recognized through updated legal requirements and extended criteria for a healthy living environment in building certification [8]. By increasing quality control and locating potential in situ hazardous materials, a step toward the circular economy for construction and demolition waste (C&DW) can be realized [9]. For instance, mandatory pre-demolition audits (also called environmental audits or waste audits) are enforced in Austria, Bulgaria, the Czech Republic, Finland, France, Hungary, Luxembourg, The Netherlands, Romania, Spain, and Sweden, whereas optional environmental investigations have been applied to a limited extent in Belgium, Denmark, Germany, Ireland, Italy, Slovakia, and the United Kingdom for 5-10 years [10]. Advantages of pre-demolition audits include improved identification of hazardous construction and demolition waste, as well as the promotion of resource circularity and efficient use of mixed wastes [11]. According to the breakdown of construction and demolition waste generation, Sweden presents a higher percentage of hazardous wastes (13%) than the average among European Union (EU) countries (2.5%) owing to its sound separation systems as well as a long tradition of environmental legislation [9]. The use of asbestos and PCB in building materials has been prohibited since the middle of the 1970s in Sweden. Several other EU countries have also made considerable progress in establishing waste management systems and databases. For example, pre-demolition audits are mandatory in Flanders for non-residential buildings above a certain scale [12]. With an increasing number of emerging databases and extensive documentation, the goal of tracing in situ hazardous building materials by applying data mining to registered records could be attained [13].
Built upon statistics, machine learning, and pattern recognition techniques, data mining enables automatic or semi-automatic exploration of large amounts of data to discover patterns or rules [14]. These benefits, together with the availability of building data, show the potential of data-driven built environment management. However, previous literature sheds light on significant challenges for practical implementation, including the time-consuming pre-processing needed to obtain complete digital datasets [15]. Furthermore, limited information regarding the extent of the previous adoption is available, leading to a struggle in designing precautionary abatement policy and decontamination plans [6]. Considering these barriers and knowledge gaps, several researchers have attempted to detect asbestos-containing materials by developing new methodologies. Mecharnia et al. [7] explored the prediction of the presence of asbestos in buildings by employing temporal descriptions of materials in an ontology-based approach. Similarly, Franzblau et al. [6] applied statistics to inspection reports and online demolition databases to quantify the amounts and abatement costs of asbestos-containing materials in abandoned residential dwellings. Govorko et al. [2,16] developed a mobile application to investigate the types and the conditions of asbestos-containing materials in residential settings.
To realize the Construction 2020 strategy [17] and the Communication on Resource Efficiency Opportunities in the Building Sector [18], the protocols and guidelines for waste audits before demolition and renovation for buildings were established by the EU Commission [19]. The emergence of relevant tools and complementary legislation is expected to improve current practice in waste identification, source separation, and collection [11]. Although Sweden has introduced obligatory environmental audits since 1995, several practical predicaments exist for using the data, i.e., the hardcopies of environmental inventories are kept by several different authorities, a harmonized protocol at the national level is lacking, and no digital query-based database of in situ building materials is available yet [11]. This article attempts to address these challenges by developing a generalizable approach and extending the investigation scope to multiple hazardous materials in the building stock. The empirical study aims to quantify the risk of residual hazardous materials in the existing buildings and investigate the data quality and quantity of the environmental audits for advanced analysis. Through a case study in the City of Gothenburg, the potential of using environmental inventories to assess the extent of in situ hazardous materials has been explored. The first part of the research involves data assembling and validation, followed by cross-comparison and descriptive statistics of the training dataset. The study results can offer valuable insights into the frequent occurrence of hazardous materials in demolished and renovated buildings and specify building groups where occurrences are more likely. Furthermore, the pilot work lays a good foundation for the subsequent machine learning modeling to predict the presence of hazardous materials.
Sustainability 2021, 13, 7836
To achieve the research objectives, three research questions are formulated as follows:
RQ1: What is the potential for employing environmental inventories as input data to assess the presence of hazardous materials in the building stock?
RQ2: How representative is the training dataset in relation to the Gothenburg building stock?
RQ3: How can the risk of encountering hazardous materials in the building stock be assessed?

Study Design
Given that no digital pre-demolition audit dataset exists in Sweden, nor can building material records be found in the national building registers, the study proposed an innovative data coupling method that adds environmental inventories from the field study to the building information database from authorities. A similar data coupling approach has been performed by Wilk et al. [20] and Krówczyńska et al. [21] to study the spatial distribution of asbestos-cement roofing. To assess the potential of using environmental inventories for hazardous materials identification, a training dataset was created from pre-demolition audits of demolished and renovated buildings in the City of Gothenburg constructed earlier than 1982 and from national building registers at a regional scale. The developed training dataset, which compiles the registered records of environmental investigations from the past decade, can be regarded as a pioneering study of sustainable building material management. Data validation for the acquired documents becomes a fundamental step for the future machine learning study that leverages the existing data labels for predicting the potential presence of the remaining hazardous materials.
The study design illustrated in Figure 1 followed the procedure of training dataset creation, processing, and analysis. First, register-based data were collected from various databases for quality and quantity control. Then, data processing included data reformatting, merging, and cleaning. Finally, data analysis was performed in four parts: validating data quality and quantity, evaluating data representativeness with the Gothenburg building stock, assessing missing data and the detection records of each building subgroup, and cross-validating investigation data for risk assessment.

Figure 1. A proposed procedure for creating the training dataset, comprising (1) data collection, (2) data processing, and (3) data analysis.

Data Collection
Several data collection tasks were performed sequentially to ensure that the data were operational. First and foremost, environmental inventories and national building registers were collected from different authorities. Pre-demolition audits were gathered from the Archive of the City of Gothenburg for the permit application period from 2010 to 2020. Currently, no query-based database of environmental records that allows free-text search exists. Therefore, the search process was done manually in the building permit register system using the keywords "demolition", "renovation", "reconstruction", "modification", and "alteration" in the document titles. Extensive document screening was executed to identify the projects with environmental audits in their permit decisions. Thereafter, investigation documents were requested and reformatted into a digital dataset.
Meanwhile, national building registers were received from the Swedish Cadastral and Land Registration Authority, where real estate registers from municipalities and the Swedish Tax Agency are kept. Besides, the Energy Performance Certificates were collected from the Swedish National Board of Housing, Building, and Planning. Merging these national datasets was executed with GIS Feature Manipulation Engine from Safe Software to constitute the comprehensive dataset for the research purpose. This comprehensive dataset comprised registered buildings from the three major economic regions in Sweden: Stockholm, Gothenburg, and Malmö regions. The methods for merging national datasets were developed by Johansson et al. [22] and can add additional auxiliary data for analyses. Then, the data extraction was conducted to retrieve the Gothenburg city dataset for the representativeness study concerning building characteristics. Many-to-many relationships were not included in the larger Gothenburg dataset as register data were at the property level, giving a data loss of approximately 10%. The metadata of the available Gothenburg dataset are appended in Appendix A [23]. Built upon the general building information from the national building registers and the detection records from the environmental inventories, the training dataset, a subset of the Gothenburg building stock, was created for further data processing and analysis. Many-to-many relationships were examined manually for the observed buildings in the training dataset.

Data Processing
To ensure coherent documentation of environmental inventories and improve the data readability for coding software, a standard procedure was developed and executed iteratively in creating the training dataset. The process consisted of (1) creating a dataset structure by assembling common variables across environmental inventories with the building as an observation unit; (2) checking data eligibility in terms of construction year and investigation completeness; (3) leveling data quality by clustering inventory types and investigator's experiences, then converting the data to pre-defined data types; (4) extracting relevant building registers from the comprehensive datasets using national real estate index and harmonizing updates across datasets; (5) merging and reformatting building registers and environmental inventory to become a training dataset; and (6) revising and manipulating variables of interest through aggregating multiple records to verify data consistency for construction year, renovation year, area, and so on. The final variables of interest and their metadata in the training dataset are presented in Table 1. Necessary data processing was executed to assure uniform data input from heterogeneous data sources. As the data update varied among different authorities, the registered data for variables of interest were compared with the inventory records to determine the actual investigated part. These registered data were used as proxies for filling missing data from the inventories. To assure data alignment for analysis, revised variables were created by prioritizing the inventory data and complementing them with the registered data. If none of the registers contained the information, the variables were labeled as NA. Moreover, irrelevant observations, such as buildings constructed after the ban of asbestos and PCB in building materials in 1982, as well as the updated registers for reconstructed buildings, were removed. 
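The construction of the revised variables can be illustrated with a minimal sketch in plain Python. It is not the study's code: the field names (`inv_year`, `reg_year`, `rev_year`), the sample records, and the decisions to keep buildings from 1982 itself and to keep records whose year is NA are assumptions made only for illustration.

```python
from typing import Optional

BAN_YEAR = 1982  # cut-off year named in the text; treating 1982 itself as eligible is an assumption


def revise(inventory_value: Optional[int], register_value: Optional[int]) -> Optional[int]:
    """Prefer the inventory record, fall back to the register; None stands for NA."""
    return inventory_value if inventory_value is not None else register_value


# Hypothetical records; the field names are illustrative, not the study's schema.
records = [
    {"id": "A", "inv_year": 1955, "reg_year": 1957},  # both sources: inventory wins
    {"id": "B", "inv_year": None, "reg_year": 1968},  # register used as proxy
    {"id": "C", "inv_year": None, "reg_year": None},  # no source: stays NA
    {"id": "D", "inv_year": 1990, "reg_year": 1990},  # built after the ban
]

for rec in records:
    rec["rev_year"] = revise(rec["inv_year"], rec["reg_year"])

# Drop buildings constructed after the ban; NA years are kept here (an assumption).
clean = [r for r in records if r["rev_year"] is None or r["rev_year"] <= BAN_YEAR]

print([r["id"] for r in clean])
```

The same prioritization can be expressed in Pandas with `Series.combine_first`; plain Python is used here only to keep the sketch dependency-free.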
The clean dataset summed up to 402 observations. The detection results were collected from various environmental investigations and for all types of building usages. Four inventory types were identified based on the document title and the content format: report, protocol, control plan, and demolition plan. Reports contain the most thorough investigation records with test sample results. In comparison, protocols were developed by the municipality with a list of binary questions for the investigated hazardous materials and their amount. Control plans were used for small houses and simple buildings, and hazardous substances were generalized without specifying specific materials. Demolition plans are required documents for demolition permit application, and free text is used to describe the detection of hazardous substances or materials. Considering the various extent of environmental investigations, primary hazardous substances such as asbestos, PCB (polychlorinated biphenyls), CFC (chlorofluorocarbon), and mercury were included in the training dataset. The detection results were documented at two levels in a binary way: hazardous substances and hazardous building materials. Besides, specific building parameters, including construction year, renovation year, detailed usage, area, and the number of the floors, were also noted as data labels. The data quality was controlled through a cross-validation workshop with the research group and a domain expert to affirm the correct interpretation of the inventory documents.
Furthermore, building classes were created by reference to the description of the renovation or demolition permit, primary usage of the building stated in the inventory, as well as building types and building categories from the national building registers. According to the actual investigation area and the past activities, the 402 observations were categorized into ten building classes: single-family house, multifamily house, temporary dwelling, school, office, commercial building, production building, industrial building, warehouse, and other/infrastructure. Determining the building class can facilitate clustering the buildings with similar scale and construction tectonic. The categorization of the inventory types and the building classes is fundamental to structure the data subgroups for comparative analysis.

Data Analysis
Through conducting statistical operations on the 402 observations in the training dataset, data representativeness and risk assessment of encountering hazardous building materials were addressed. Python's built-in library and interactive modules such as Pandas, Matplotlib, and Seaborn were employed for the explorative data analysis. The training dataset's representativeness was evaluated by comparing the building parameters' mean values with the same parameters in the Gothenburg dataset 1929-1982. Furthermore, the underlying correlations between the positive detection rates and different clustering subgroups were ascertained, i.e., inventory type, building class, construction year, and area. The descriptive statistical results provided an overall picture to assess the positive detection rates of residual hazardous materials in the building stock. To finalize the statistical results and set the scene for the future machine learning study, a cross-validation matrix evaluating the data quality and quantity was created.
Based on the cross-validation matrix, the data subgroup for each building class and investigated material was assigned an assessment score. The assessment scores were created following Equation (1). First, the investigation results for each building class were transformed into dummy variables, and the four inventory types were given different weights based on their level of detail. The weights, from high to low in decile points, were assigned to the report, the protocol, the control plan, and the demolition plan, respectively. For each hazardous material in a given building class, the number of observations of each inventory type was multiplied by the corresponding inventory weight, and the results were then summed and divided by the number of observations. After evaluating the observation quality on an individual basis, a data size threshold was introduced to assess missing values for each subgroup. For example, a subgroup with more than 30 observations was scored 1, a subgroup with 15 to 30 observations was scored 0.5, and any smaller subgroup was scored 0. Taking data size into account allowed assessing whether the number of observations was large enough to generate useful statistical results. By cross-validating the individual observation quality and the missing data in each subgroup, we could distinguish whether the detection results were reliable by adopting a data boundary. In the end, the findings were summarized to indicate the data subgroups that were found to be promising for machine learning pre-processing.
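The two ingredients of the assessment score described above can be sketched as follows. This is an illustrative reading of the verbal description, not the authors' formula: the numeric weights are hypothetical (the text gives only their ranking and that they differ in decile points), the boundaries at exactly 15 and 30 observations are assumed to fall in the 0.5 band, and the sketch deliberately does not combine the two scores, since the text does not state how they are combined.

```python
# Hypothetical weights in steps of 0.1; the source states only the ranking of the inventory types.
WEIGHTS = {"report": 1.0, "protocol": 0.9, "control plan": 0.8, "demolition plan": 0.7}


def quality_score(counts: dict) -> float:
    """Weighted mean inventory quality for one building class and one material.

    counts maps inventory type -> number of observations of that type.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no observations, so there is no quality to report
    return sum(WEIGHTS[kind] * n for kind, n in counts.items()) / total


def size_score(n_obs: int) -> float:
    """Data size threshold from the text: more than 30 -> 1, 15 to 30 -> 0.5, otherwise 0."""
    if n_obs > 30:
        return 1.0
    if n_obs >= 15:
        return 0.5
    return 0.0


# Made-up subgroup: 32 observations of one material in one building class.
counts = {"report": 20, "protocol": 10, "control plan": 2, "demolition plan": 0}
n_obs = sum(counts.values())
print(round(quality_score(counts), 3), size_score(n_obs))
```

Keeping the quality and size components separate makes it easy to see whether a subgroup is weak because of coarse inventories or because of too few observations.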

Results and Discussions
The results and discussions are structured in five parts: evaluating data quality and size, data representativeness, statistical operations, cross-validation matrix, and method replicability. Examining data quality and data size facilitated identifying subgroups appropriate for data analysis. By comparing the Gothenburg dataset and the training dataset, the distribution of building parameters was displayed to show data representativeness. Subsequently, the positive detection rates were highlighted with missing values for hazardous substances and materials through clustering with different parameters, such as inventory type, building class, construction year, and area. To summarize the previous analysis results and minimize the possible errors involving heterogeneous data, a cross-validation matrix was created as an indicator based on data quality and quantity for investigated materials in each building class. Finally, the replicability of the method in other contexts and its relation to previous research are briefly discussed at the end of the section.

Evaluating Data Quality and Size
In Figure 2, the inventory types are ranked in descending order according to the comprehensiveness of the investigations and the level of documentation detail. The highly detailed environmental inventories, namely reports and protocols, specified the presence of hazardous substances and the materials containing them in semi-uniform formats and constituted 48.5% and 21.6% of the investigations, respectively. They described whether missing data resulted from investigations that had not been done or from the investigated materials not being in place. Field sampling of hazardous materials was usually executed by hazardous waste experts, lowering the risk of mislabeling caused by purely visual identification. Conversely, information on hazardous materials is occasionally missing in simple investigations such as control plans or demolition plans. These information sources are only useful when estimating contamination at the building level. Data granularity became visible by clustering inventory types and building classes. Building groups with high data granularity came from reports, mainly by hazardous waste experts and contractors, for large-scale and complex buildings. These included schools (13.9%), commercial buildings (8.7%), industrial buildings (6.5%), multifamily houses (6.0%), and offices (5.5%). Owing to the risk of polluting operations, industrial and production buildings in most cases require thorough environmental investigations by legislation. In contrast, pre-demolition audits for single-family houses and temporary buildings mainly consisted of protocols, control plans, and demolition plans performed by contractors or private persons.
An overview of the experience level of the investigators across the environmental inventories helped evaluate the data source quality, as shown in Figure 3. Around 56.0% of the observations in the training dataset were performed by hazardous waste experts who are skilled in doing complicated environmental investigations. These environmental consultants were primarily involved in drafting environmental reports and field sampling for the buildings subject to mandatory pre-demolition audits. Another one-third of the observations came from contractors, such as demolition companies or waste handling companies. They are responsible for the demolition work and permit application, and are thus skilled in preparing protocols, demolition plans, and control plans. The rest, 13.9% of the inventories, were done by private persons. They could be building owners who only perform ocular inspections of a part of the building. Considering the proportion of the observations carried out by hazardous waste experts or contractors, the reliability of the data is high.

Data Representativeness
To test the feasibility of building machine learning prediction models from the training dataset for future studies, data representativeness in terms of similarity of building parameters needs to be controlled for. Representativeness of the training dataset was addressed by comparing the distribution and the building parameters with the entire building stock in the City of Gothenburg. Evaluating the variances between datasets enabled us to identify interest groups based on indicative variables, including construction year, renovation year, area range, and the number of floors. Gothenburg building stock data were retrieved from the national building registers, which contained 157,301 buildings. Of these, 100,635 buildings were built before 1982, when the construction industry's use of asbestos products was banned. The building class of the observations in the comprehensive dataset was classified according to municipality data using 1-99 indexing. Aggregation of the buildings built before 1982 showed that most of Gothenburg's old buildings were residential buildings. The rest were unspecified buildings, school buildings, industrial buildings, production buildings, and commercial buildings. Temporary buildings, offices, and warehouses, for which register categories are lacking, may have been categorized as unspecified buildings.
According to Figure 4, the training dataset (N = 336) represents around 2.2% of the Gothenburg building stock constructed from 1929 to 1982 (N = 14,996). The period was chosen for consistency as the earliest building registers traced back to 1929. The density plots were used to balance the unequal numbers of observations before comparing their distribution. Density normalization scales the bars for the individual dataset, thus the areas sum up to 1 [24]. Then, boxplots were created to illustrate the quartile of both datasets. They are used to display data variation in statistical sampling [25]. Figure 4A showed that more than half of the buildings were built between 1950 and 1970 in both datasets, corresponding to the periods of the two massive construction activities in Sweden, the People's Home [26] and the Million Homes Programme [27] eras. A majority of the renovation activities (≥70%) in both datasets took place during 1990-2005, based on Figure 4B. According to Figure 4C, living area measurements of Gothenburg buildings were between 101 and 1000 m², whereas the area in the training dataset was either for buildings larger than 1500 m² or smaller than 100 m². An interpretation is that buildings with environmental investigations are larger complicated buildings or smaller complementary buildings. Furthermore, the difference between the training dataset and the Gothenburg dataset in the number of floors indicates that low-level buildings were more commonly demolished or renovated for other use purposes, as presented in Figure 4D.
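The effect of density normalization on datasets of unequal size can be shown with a small dependency-free sketch. The bin edges and the construction years below are made up for illustration and are not taken from the study; the normalization itself is the standard one (count divided by the product of sample size and bin width), which is also what the `density=True` option of Matplotlib's `hist` applies.

```python
def density_histogram(values, edges):
    """Return bar heights whose areas (height * bin width) sum to 1.

    Values outside [edges[0], edges[-1]] are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Bins are half-open [lo, hi), except the last one, which includes hi.
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[i + 1]):
                counts[i] += 1
                break
    n = sum(counts)
    if n == 0:
        raise ValueError("no values fall inside the bin edges")
    return [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]


def total_area(heights, edges):
    """Sum of bar areas; equals 1 for a density-normalized histogram."""
    return sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))


# Illustrative bin edges and made-up construction years for a small and a larger sample.
edges = [1929, 1940, 1950, 1960, 1970, 1982]
small_sample = [1935, 1952, 1955, 1961, 1968, 1975]
large_sample = [1930, 1938, 1945, 1951, 1953, 1958, 1962, 1964, 1966, 1969, 1972, 1980]

for sample in (small_sample, large_sample):
    heights = density_histogram(sample, edges)
    print(len(sample), round(total_area(heights, edges), 6))
```

Because both histograms integrate to 1, their shapes can be compared directly even though one sample is twice the size of the other.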
Building parameters for each building class in the Gothenburg dataset and the training dataset were compared to understand the underlying characteristics of the building subgroups, as presented in Table 2. The distribution of the building class was calculated by dividing the number of observations in each subgroup by the total number of observations in the dataset. Buildings in the city center are often mixed residential and commercial buildings with commercial zones on the lower floors. If these two building classes in the training dataset are summed, their share is comparable to that in the Gothenburg dataset. Moreover, school buildings were more frequently renovated with the removal of hazardous materials, resulting in more environmental inventories than other building classes. One reason for the oversampling could be that multiple environmental investigations were carried out for the individual buildings in school complexes, leading to an overrepresented data size. The differences in the mean area and the mean number of floors of the school buildings could also be understood from an aggregation-level perspective, since the Gothenburg registers record properties rather than buildings. Lastly, industrial buildings and production buildings accounted for only a few observations, and those in the training dataset were older than the corresponding Gothenburg subgroups. A plausible explanation is the concern about contaminating activities. In most cases, the City Environment Administration requested environmental audits for industrial buildings before demolition, resulting in more environmental inventories than other building classes.
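The building class distribution described above is a simple share calculation. The following Python sketch shows it under stated assumptions: the class labels and counts are invented for illustration and are not taken from Table 2, and the function name `class_distribution` is hypothetical.

```python
from collections import Counter

def class_distribution(building_classes):
    """Share of each class: subgroup count divided by the total count."""
    counts = Counter(building_classes)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Invented example labels (NOT the study's data).
training = (
    ["school"] * 3
    + ["residential"] * 4
    + ["commercial"] * 2
    + ["industrial"] * 1
)

dist = class_distribution(training)
for cls, share in sorted(dist.items()):
    print(f"{cls:12s} {share:.2f}")

# Summing two classes before comparing them with another dataset,
# as the text does for mixed residential and commercial buildings.
mixed_share = dist["residential"] + dist["commercial"]
print(f"residential + commercial: {mixed_share:.2f}")
```

Running the same function on both datasets yields shares that can be compared side by side regardless of the difference in dataset size.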

Statistical Operations
Table 3 describes the positive detection rates and the amount of missing data for each inventory type. Reviewing the amount of missing data (Appendix B) gives a better understanding of the usefulness of the detection results. The positive detection rate expresses how often hazardous materials were detected; it was calculated by dividing the number of positive results by the total number of observations, excluding the missing data. The results showed that different inventory types had their own strengths in detecting hazardous substances and materials. Large shares of missing data (≥90%) were present in control plans and demolition plans, as these simple inventories lacked information about hazardous materials. They often show only the positive detection of the one construction part sought, which suggests that these environmental inventories were mainly conducted to remove a specific asbestos-containing material. Including simple inventories in a more extensive dataset could therefore skew the dataset. Detailed inventories, on the other hand, contained comprehensive detection records of hazardous materials. However, the detection records of certain building materials were not always available because they were not included in the protocol, such as asbestos-containing switchboards and joints, PCB-containing door closers, and cables. Asbestos showed high positive detection rates in reports (0.84), demolition plans (0.70), and protocols (0.51), especially in pipe insulation, door or window insulation, cement panel boards, and ventilation channels. PCB generally had lower positive detection rates than asbestos in building materials. The results from the reports and the protocols indicated a higher risk of encountering PCB-containing joints/sealants and capacitors in fluorescent lamps/burners than other potential PCB-containing materials.
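As a minimal sketch of the calculation described above, the positive detection rate and the share of missing data can be computed as follows. The encoding of results as `True` (detected), `False` (not detected), and `None` (missing) is an assumption made for illustration, as is the function name `detection_summary`; the example values are invented.

```python
def detection_summary(results):
    """Summarize detection records for one material or substance.

    results: list with True (detected), False (not detected), None (missing).
    The positive detection rate excludes missing records from the denominator.
    """
    missing = sum(1 for r in results if r is None)
    valid = len(results) - missing
    positives = sum(1 for r in results if r is True)
    return {
        "valid": valid,
        "missing_share": missing / len(results) if results else None,
        "positive_rate": positives / valid if valid else None,
    }

# Invented records (NOT the study's data): 2 positive, 1 negative, 2 missing.
summary = detection_summary([True, True, False, None, None])
print(summary)

# A simple inventory where the material was never assessed has no valid
# records, so no rate can be reported.
print(detection_summary([None, None, None]))
```

Reporting the missing share next to the rate matters: a high rate that rests on very few valid records, as in the simple inventories, says little about the building stock.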
Furthermore, CFC and mercury occurred frequently in buildings, with positive detection rates higher than 0.50 in all inventories. CFC-containing materials were primarily located in freezers/fridges. However, the positive detection rates for building insulation and cooling units did not agree between reports and protocols: reports showed a high positive detection rate for refrigeration equipment, whereas protocols showed a higher positive detection rate for building insulation. The positive detection rate of mercury was the highest across inventory types. Mercury-containing materials were primarily found in lighting tubes. Positive detection rates of mercury-containing level monitors or sensors and thermometers were also high in reports, while protocols reported a high positive detection rate for relays or switches. The significant differences in the positive detection rates and the reported frequency across inventory types can be attributed to multiple reasons, including the building features of the observations in each subgroup, the purposes of the environmental audits, and the investigators' experience levels. Hence, further data clustering was performed to analyze positive detection rates and missing data in the subgroups of building classes (Table 4), construction year range (Appendix C), and area range (Appendix D). Based on the conclusion of Figure 2, building classes with high data quality and adequate data size were selected for further data analysis. These included multifamily houses, schools, offices, commercial buildings, and industrial buildings. Table 4 describes the data size and the positive detection rate of hazardous substances and materials for each building class. A threshold value of 30 valid observations, underlined in Table 4, was set to enhance the certainty of the results. Asbestos was identified predominantly in multifamily houses, with a positive detection rate of 0.93.
In multifamily houses, asbestos-containing ventilation channels, joints, and pipe insulation were encountered. For school buildings and commercial buildings, ventilation channels carried a higher risk of containing asbestos. In contrast, asbestos-containing pipe insulation and door or window insulation were common in offices and industrial buildings. PCB positive detection rates were generally lower than those of asbestos across building classes. Office buildings had the highest PCB positive detection rate (0.70), mainly from PCB-containing joints/sealants. Industrial buildings, in contrast, had a notably high PCB detection rate for capacitors in lamps/burners. For the rest of the building classes, the positive detection rates of PCB-containing joints/sealants and capacitors in lamps/burners showed similar patterns.
Concerning the exposure to environmentally hazardous substances, mercury was present more frequently than CFC. The high positive detection rate of mercury was primarily due to the massive adoption of mercury-containing lighting tubes. Mercury-containing thermometers were also commonly used in multifamily houses and office buildings. CFC, on the other hand, was found frequently in commercial buildings. CFC-containing fridges or freezers were the primary reason for high positive detection rates in multifamily houses, school buildings, commercial buildings, and industrial buildings, while CFC-containing cooling units were the main contributor in office buildings. Overall, the patterns of detecting hazardous materials in each building class appear reasonable considering the usage of the buildings and their average construction year. Our results are to some extent in agreement with the experience-based expert knowledge regarding frequent occurrences of hazardous materials in certain building classes. However, it has been challenging to compare the positive detection rate of a specific building material across building classes given the varied data sizes. To determine the generalization potential of the results in relation to the regional building stock, incorporating more valid data of the studied building classes into the subsequent analysis will be essential.
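The subgroup analysis above, including the threshold of 30 valid observations, can be sketched as a group-wise calculation. This is an illustrative Python sketch, not the procedure used to produce Table 4: the record layout, the function name `rates_by_class`, and all example counts are assumptions.

```python
from collections import defaultdict

MIN_VALID = 30  # threshold of valid observations stated in the text

def rates_by_class(records):
    """records: iterable of (building_class, result) pairs, where result is
    True (detected), False (not detected), or None (missing)."""
    grouped = defaultdict(list)
    for building_class, result in records:
        grouped[building_class].append(result)

    table = {}
    for building_class, results in grouped.items():
        valid = [r for r in results if r is not None]
        table[building_class] = {
            "n_valid": len(valid),
            "n_missing": len(results) - len(valid),
            # True counts as 1 and False as 0 when summed.
            "positive_rate": sum(valid) / len(valid) if valid else None,
            "meets_threshold": len(valid) >= MIN_VALID,
        }
    return table

# Invented records (NOT the study's data).
records = (
    [("multifamily house", True)] * 30
    + [("multifamily house", False)] * 10
    + [("multifamily house", None)] * 5
    + [("warehouse", True)] * 3
    + [("warehouse", None)] * 10
)

table = rates_by_class(records)
for building_class, row in table.items():
    print(building_class, row)
```

In this toy example the warehouse group shows a rate of 1.0 from only three valid records, which is exactly the kind of figure the threshold is meant to flag as uncertain.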

Cross-Validation Matrix
The cross-validation matrix was developed as a tool to evaluate the reliability of the results. It helps set a boundary for hypothesis formulation by considering the quality and quantity of the data on hand. Table 5 presents the overall assessment scores for each building class calculated by the cross-validation matrix. The assessment scores for the individual hazardous substances and materials differed significantly among building class subgroups, indicating high variation in data quality and data size. An assessment score of 0 or NA implies that the investigation records had low reference value, either because they came primarily from simple inventories or because the data size was insufficient. Even where the total data size was high, high-quality detection records from reports or protocols were few. The lack of detection records at the hazardous material level led to a high number of missing values, so the overall assessment score ranked low. The data sizes of temporary dwellings, production buildings, warehouses, and others were inadequate, or investigation records for certain building materials were lacking. Hence, their assessment scores were the lowest, and these classes should be excluded from the subsequent machine learning modeling.

Table 5. Overview of the assessment scores for each building class based on data quality and data size (scores in bold are higher than 80).

Based on the cross-validation matrix, a ranking of building class and hazardous material pairs, in descending order, is presented in Table 6. The ranking of the cross-validation results can not only guide further data collection procedures but also show the limitations of the current environmental audits. School buildings reached the highest score in most investigations of hazardous substances and materials, implying that their detection records were reliable. This may be because their inventory data come mainly from reports with extensive investigation records.
For commercial buildings and industrial buildings, the assessment scores at the hazardous substance level and for asbestos-containing tiles or clinkers and mercury-containing lighting tubes were high as well. Yet, the rest of the hazardous materials in these two groups had low scores. While PCB, PCB-containing joints, mercury, and asbestos-containing pipe insulation and tiles or clinkers had high reference values in multifamily houses, only the detection records of mercury and mercury-containing lighting tubes scored high in office buildings. The fact that PCB was better surveyed in multifamily houses than in other building classes could be explained by their conventional construction method of concrete elements and bricks with many sealants. Therefore, their detection results for potential PCB-containing joints or sealants and asbestos-containing tiles or clinkers showed fewer missing values in the investigation records.

Method Replicability
Nowadays, obligatory environmental audits and open databases for building registers, the method's key data inputs, are already available in many EU countries. Examples include the Waste Register database in Estonia, the Integrated Waste Management System, and the Asbestos Database in Poland [10]. Thus, these countries have an established waste management practice and are in a position to adopt the proposed approach to estimate the residual hazardous material stock. As for countries with limited implementation of environmental audits concerning building size and function, as described in the introduction, it will be beneficial to create a harmonized environmental protocol for auditing hazardous materials in an online database [11]. A uniform digital dataset template can reduce the risk of information asymmetry and save processing time for data compiling, making data queries for environmental information much more effective.
Building register data have become an essential source for describing the building stock. However, data uncertainties must be addressed to ensure accurate analytical results [23]. Previous efforts validated the EPC databases using stepwise regression models [23] and a data quality assurance method [28]. Compared with the broad applications of EPC data [28], environmental audit data remain relatively unexplored. Moreover, the large number of uninvestigated building components and undetermined assessment results, as well as the varied extent of environmental investigations between building classes, have not been brought to light. The study systematically examines the quality and content of environmental inventories and, based on that, evaluates their usability for enriching building databases with specific environmental information. Referencing the EU data validation levels [29], a standard procedure for transforming the field data into an organized, reliable dataset was demonstrated. By showing the limitations and the possibilities of environmental audit data, we hope to encourage more research in the application domain of safe construction and waste management.
The metabolism of the residual hazardous materials in the building stock is slow, and the risk of exposure exists in every stage of the building life cycle. Previous literature showed that excessive PCB concentrations in indoor air from volatilization during the building operation phase [30] and airborne asbestos emissions from emergency demolitions [3] make a long-term preventive maintenance strategy indispensable. To facilitate the abatement policy for in situ hazardous materials, extensive studies were conducted at the urban inventory level and the specific building class level. A stocks and flows model for asbestos was developed to determine common types of asbestos-containing materials [31]. Surprisingly, it was found that cement sheeting and water pipes accounted for 90% of the asbestos consumption in Australia. Another study of Australian residential buildings also showed high positive detection rates for asbestos backing boards of electrical meter boxes and asbestos eaves [2]. Overall, asbestos-containing materials were present in 82.3% of the sampled houses. Similarly, asbestos-containing materials were found in around 95% of the abandoned residential homes in Detroit, especially in flooring, roofing, siding, and duct insulation [6]. These findings align with our results, as the positive detection rate of asbestos in multifamily houses was 93% and the high-risk materials were pipe insulation and cement panel board.
The investigations of building-related PCBs showed a similar trend. An extensive survey in Switzerland verified that 48% of the buildings built between 1950 and 1980 contained PCBs [32]. In the same period, 46% of the school buildings in the USA were constructed [4]. The implementation of engineering controls to mitigate PCB diffusion from joint sealants and building caulk in American school buildings indicated a strong need for immediate action [33,34]. In a citywide building sampling study in Toronto, the positive detection rate of PCB-containing sealants was 14% [30]. In particular, a high density of PCBs existed in commercial and electricity-intensive buildings [35]. Compared with brick or glass buildings, concrete buildings had a higher tendency to be contaminated by PCBs [30]. The results from our study agree with these findings: high detection rates of PCB-containing joints or sealants were noticed in concrete-built office and industrial buildings. Overall, the frequent presence of asbestos and PCB-containing materials worldwide reveals a necessity to develop an effective identification approach for general buildings. The method proposed in the study presents a data-driven solution for evaluating the risk of encountering hazardous materials in high detail.
This study confirmed the potential to assess hazardous materials in the regional building stock by using multiple registered records. Combining the environmental information from numerous registered data sources made it possible to systematically estimate the risk of hazardous substance occurrence in the building stock [20]. The high data completeness of the detailed inventories, such as protocols and reports, enables a thorough analysis of hazardous material types in various building classes. Although the final output is highly dependent on data availability and quality, it is a rather cost-effective approach to tracing the existing hazardous materials. In addition, the method is generalizable: it can be replicated in other countries and help prioritize decontamination plans before demolition or extensive retrofit.

Conclusions
This paper studied the feasibility of tracing hazardous materials in the existing building stock based on multi-source registered records. The opportunities and challenges of using environmental audits to guide the risk assessment of hazardous materials were explored. By associating national building registers and environmental inventories, a training dataset was created to verify the experience-based expert knowledge. Around 65% of the training dataset's observations comprised reports and protocols with high data granularity. Asbestos was the most frequently investigated substance, followed by mercury, PCB, and CFC. The extent of environmental investigations varies between building classes, depending on building complexity and ownership. Most of the observations in reports were schools, commercial buildings, industrial buildings, multifamily houses, and office buildings, whereas in protocols, control plans, and demolition plans, single-family houses were most common. By validating data size and quality, building groups appropriate for the statistical operations were determined. Furthermore, comparing the distribution of building parameters between the training dataset and the Gothenburg dataset helps to understand the data representativeness, which bears on the viability of constructing prediction models from the training dataset. The risk of encountering hazardous materials was assessed by evaluating the positive detection rates and the amount of missing data. Through clustering data subgroups such as inventory types, building classes, construction year, and area range, different perspectives for evaluating positive detection rates and missing data were presented. The results indicated high positive detection rates of asbestos-containing materials in multifamily buildings and prevalent PCB-containing materials in industrial buildings. The high numbers of missing values for hazardous materials in single-family houses, production buildings, and warehouses were highlighted to help improve current environmental investigation practices.
The explorative approach of delineating quality environmental data demonstrates a general workflow for studying the management of in situ hazardous building materials. The novel method is cost-effective in identifying general occurrence patterns of hazardous building materials and can be used to complement traditional environmental investigations. The findings from the cross-validation matrix showed that the potential data subgroups for machine learning modelling were school buildings, commercial buildings, multifamily houses, and industrial buildings at the hazardous substance and material levels. Future research is planned to include more observations from the abovementioned building classes to increase the prediction accuracy and to conduct cross-verification when constructing machine learning models. The developed data-driven method and the structure of the training dataset proposed in the study are replicable in countries with access to environmental audit records and general building information.
Author Contributions: K.M. acquired the research fund and was responsible for project management. M.M. initiated the contact with the Gothenburg City Archives for arranging the data collection process, requested building registers from the authorities, helped structure the tables in the method and results and discussions sections, and revised the manuscript. T.J. assisted with merging building registers. P.-Y.W. screened the relevant documents, retrieved the information, compiled them into a dataset, drafted the manuscript and figures, and conducted the manuscript revision. All authors participated in cross-validation workshops, results discussion, and manuscript revision. C.S. conceived the idea of the cross-validation matrix in the results and discussions section. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement: The environmental inventories collected in the study are confidential and regulated by the Gothenburg city archive. The national building registers acquired from different authorities were requested for the specific research purpose. Therefore, they are not available online or in the MDPI Research database.

Acknowledgments:
The work is part of the research project Prediction of hazardous materials in buildings using AI and is supported by RISE Research Institutes of Sweden. Special thanks go to Theresa Salhammar and Anna-Sara Berg at the Gothenburg City Archives for their support in searching for environmental inventories in the building permit documents.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The metadata of the Gothenburg dataset from Swedish EPC, Swedish Real Estate Taxation register, and Municipal cadastral register.