A Biplot-Based PCA Approach to Study the Relations between Indoor and Outdoor Air Pollutants Using Case Study Buildings

The 24 h and 14-day relationships between indoor and outdoor PM 2.5, PM 10, NO 2, relative humidity, and temperature were assessed for an elementary school (site 1), a laboratory (site 2), and a residential unit (site 3) in the city of Gainesville, Florida. The primary aim of this study was to introduce a biplot-based PCA approach to visualize and validate the correlation among indoor and outdoor air quality data. The Spearman coefficients showed a stronger correlation among these target environmental measurements at site 1 and site 2 and a weaker correlation at site 3. The biplot-based PCA regression showed a higher dependency for site 1 and site 2 (p < 0.001) when compared to the correlation values and a lower dependency for site 3. The results displayed a mismatch between the biplot-based PCA and the correlation analysis for site 3. The method used in this paper can be applied in studies that analyze high volumes of multiple building environmental measurements and require optimized visualization.


Introduction
The 2020 Global Health Observatory (GHO) statistics show that indoor and outdoor air pollution is responsible for nearly seven million fatalities every year [1]. Most people around the world (90%) are exposed to both indoor and outdoor air pollutants [2]. The United States Environmental Protection Agency (U.S. EPA) estimated that concentrations of indoor air pollutants are on average two to five times higher than outdoor concentrations [3]. Building leakage, air infiltration, and inadequate ventilation can lead to unhealthy levels of indoor air quality (IAQ) [4]. IAQ deterioration has been represented as the largest risk factor to occupants in DALYs (Disability-Adjusted Life Years) due to Sick Building Syndrome (SBS) and Building-Related Illness (BRI) [5][6][7]. The major air pollutants include Particulate Matter (PM), Radon (Rn), Nitrogen dioxide (NO 2), Lead (Pb), Sulfur dioxide (SO 2), Carbon monoxide (CO), Ozone (O 3), Formaldehyde, and biological pollutants. Children, the elderly, and people with asthma are at higher risk of BRI from fine particulate matter (PM 2.5, PM 10) and gaseous pollutants such as NO 2, O 3, CO, and SO 2 [8,9]. A few studies have examined the associations between concentrations of air pollutants and COVID-19 disease effects [10][11][12][13][14]. Wu et al. observed that the COVID-19 mortality rate rises by 8% for every 1 µg/m 3 increase in particulate matter; statistical evidence has also been presented that every 10 µg/m 3 increase in NO 2 or fine particles causes a 22.41% or 15.35% rise, respectively, in the number of COVID-19 cases [12]. PM 2.5 and NO 2 are often generated by the combustion of gasoline, oil, diesel fuels, wood, and coal. PM 2.5 and PM 10 can also originate from certain indoor sources, such as pollens, dust, pesticides, mold, and human activities, including cooking, welding, smoking, the use of kerosene heaters, and household cleaning [9,15-17].
Further studies are warranted because existing HVAC systems are not capable of addressing all aspects of aerosol infection control, and auxiliary filtration interventions with proper operation are now required [8,18-20]. Therefore, to prevent occupants' health risks from exposure to indoor air pollution, efficient monitoring and study of the relations between indoor and outdoor air quality are necessary. In recent years, the use of semiconductor air quality sensors and IAQ monitoring techniques has surged [8,21-23]. Several field studies of relationships between indoor and outdoor air pollution have been conducted using different analysis methods [8,24,25]. Chamseddine et al. [26] used the Pearson product-moment correlation coefficient method for monitoring indoor and outdoor concentrations of PM 2.5, PM 10, CO, CO 2, and TVOC in hospitals. Gabriel et al. [27] computed both Pearson and Spearman correlation coefficients between indoor and outdoor levels of ultrafine particles and TVOCs in public indoor swimming pools. Zhao et al. [28] concluded that outdoor PM 10 and CO levels affect the IAQ of a residential house based on descriptive statistics with the Analysis of Variance (ANOVA) test. Kim et al. [29] applied the Multivariate Analysis of Variance (MANOVA) method to study the associations among the performance of non-woven fabric filters and indoor and outdoor air quality in commercial offices. Most prior studies have focused on a linear relationship between indoor and outdoor air pollutants inside a single type of building, rather than considering the monotonic relationships across various types of buildings.
Only a few studies have considered Principal Component Analysis (PCA) to reduce the multicollinearity between collected parameters. Madureira et al. [30] monitored concentrations of ultrafine particles, CO 2, VOCs, and CO in public school buildings using multilevel linear regression with PCA to examine the association between IAQ, outdoor air quality, cleaning activities, and building features. Kwon et al. [31] used PCA and partial least squares (PLS) approaches to monitor seasonal variations of PM 2.5, PM 10, and CO 2 inside subway stations. It is recommended to apply PCA-based analysis methods to different types of buildings for proper regulation and, therefore, a better understanding of the relationships between indoor and outdoor air quality [8,32,33]. However, PCA-based results involving multiple environmental measurements are often challenging to visualize, and previous studies have not provided alternative methodologies to fill this gap. A biplot is a type of statistical graph that can be applied to represent the relations between multidimensional parameters from PCA [34,35].
In the present study, the research lab, primary school, and academic office building were monitored to measure and visualize the longitudinal air quality conditions. Three key airborne pollutants (PM 2.5, PM 10, and NO 2) defined by the United States Environmental Protection Agency (US EPA), as well as temperature and humidity data, were collected simultaneously indoors and outdoors with a ten-minute sampling interval. This paper is organized as follows. The next section describes the sampling locations, the measurement methods, and the data analysis techniques. Section 3 presents the results of the data measured from the different buildings. The final section presents the conclusions of this paper, together with highlights and possible future work.

Sampling Sites and Sampling Protocol
The three occupied sites (Table 1) chosen for this experiment were the media center of an elementary school building (Site 1), a lab house (Site 2), and a residential apartment unit (Site 3), all located in the city of Gainesville, Florida, United States. Gainesville is a mid-density city in central Florida with a humid subtropical climate throughout the year. The selected buildings for this study are mechanically conditioned all year, and the windows are rarely opened to meet ANSI/ASHRAE Standard 52.2-2017 [8,36]. The buildings are located in the central region of the city to minimize the microclimate variation. Site-specific parameters are listed in Table 1. In total, the five air quality parameters monitored were PM 2.5, PM 10, NO 2, relative humidity, and temperature. Three major monitoring protocols were followed to reduce measurement uncertainty, including the standardized EPA protocol for characterizing IAQ in large office buildings [37], the [38] for monitoring indoor air quality in schools (regional office for Europe), and the requirements of the Schools Indoor Pollution and Health Observatory Network in Europe project (SINPHONIE) [39]. For all cases, indoor and outdoor air quality was measured continuously (24 h a day) for two weeks at ten-minute sampling intervals. For site 1, data were collected between 8 November 2019 and 22 November 2019 (before COVID-19). For site 2, data were collected between 4 August 2020 and 18 August 2020, while for site 3, air quality was monitored from 8 September 2020 to 22 September 2020. Each indoor monitor system was set up about 3.6 feet above the floor and 4.9 feet from any corner [37,40]. For comparison, outdoor air quality and RHT measurements were conducted simultaneously with the indoor air measurements (Figure 1). For site 1, a weatherproofed sensor was placed 4.9 feet above the surface of the roof. For site 2, the outdoor sensor was placed 4.9 feet above the deck of the laboratory.
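Finally, for site 3, the sensor was placed 4 feet above the balcony of an apartment [38,40,41].

Table 1 (excerpt recovered from the extracted text). Site-specific parameters.
Parameter: Site 1; Site 2; Site 3
No. of windows: n/a; 3; 2
Indoor smoking: Not allowed; Not allowed; Not allowed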

Sensors
Two weatherproof Air Quality Egg (AQE, 2018) monitors manufactured by WickedDevice, LLC [42,43] were used at each site to simultaneously measure the concentrations of indoor and outdoor pollutants as well as relative humidity and temperature (RHT). This particular AQE was chosen because of its commercial availability, factory calibration, easy data access and transmission, and low purchase and operating cost [8,42]. In addition, according to test reports from the US EPA and from the Air Quality Sensor Performance Evaluation Center (AQ-SPEC, SCAQMD), the field test results from both laboratories showed that AQE sensors can provide reliable indoor air quality (IAQ) data with low intra-model variability and 100% data recovery [44][45][46][47]. Each AQE unit is assembled with a particulate matter module (Dual Plantower PMS5003), a CO module (3SP_CO_1000 Package 110-102), a NO 2 module (3SP_NO2_5F P Package 110-507), and an RHT sensor (DHT22). The specifications of each sensor module are shown in Table 2.

Descriptive Statistics and Correlation Analysis
The indoor and outdoor environmental parameters measured by the air quality monitors were subjected to descriptive statistics and correlation analysis using Python (version 3.6.12) and Jupyter Notebooks. A quantile-quantile (Q-Q) plot was used in Python to standardize reference data and test the data distribution [48,49]. The Spearman correlation coefficients were calculated to analyze the monotonic relationship and the inter-dependency between each pair of pollutants [50,51]. The Spearman rank-order correlation coefficient (ρ) can be expressed as [50]:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)}$$

where d i represents the difference between the corresponding ranks, and n is the number of data points. The coefficient value (reported as r) ranges between −1 (highest negative correlation) and 1 (highest positive correlation), and a p-value less than 0.05 was considered statistically significant.
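As a minimal sketch of the rank-difference formula for the Spearman coefficient given above, the following Python snippet computes ρ directly and compares it with `scipy.stats.spearmanr`. The indoor and outdoor readings are hypothetical values chosen for illustration and are not data from this study; the helper name `spearman_rho` is likewise an assumption. The closed form is exact only when the data contain no tied ranks.

```python
import numpy as np
from scipy import stats


def spearman_rho(x, y):
    """Spearman's rho from rank differences (exact only when there are no ties)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    # d_i: difference between the ranks of the paired observations
    d = stats.rankdata(x) - stats.rankdata(y)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))


# Hypothetical paired PM2.5 readings (ug/m3), for illustration only
indoor = np.array([5.1, 6.3, 4.8, 9.7, 7.2, 8.8])
outdoor = np.array([8.0, 9.5, 7.1, 12.4, 9.9, 15.2])

rho_manual = spearman_rho(indoor, outdoor)
rho_scipy, p_value = stats.spearmanr(indoor, outdoor)

print(f"rho (formula) = {rho_manual:.4f}")
print(f"rho (scipy)   = {rho_scipy:.4f}, p = {p_value:.4f}")
```

With real sensor data, ties are common (repeated readings at the sensor resolution), so the library routine, which handles ties through average ranks, is the safer choice.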

Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that accounts for multicollinearity among variables by using a smaller set of uncorrelated variables called principal components (PCs) [35,52,53]. PCA can be used to validate the correlation between the original variables by determining the most significant parameters [54,55]. Each principal component is an orthogonal projection of the original variables, with minimal loss of information. The eigenvectors and eigenvalues of a covariance matrix are the main elements required for PCA to capture the orientations of the new data points and their magnitude [56]. The eigenvalues λ of the covariance matrix are computed from the following expression [57,58]:

$$\det(C - \lambda I) = 0$$

where det is the determinant of the matrix, I is the identity matrix, and C is the covariance matrix. Solving the above equation results in k possible eigenvalues λ. The scores of PCs (eigenvectors) can be expressed as [53,59,60]:

$$Z_{ir} = \alpha_{1r} x_{i1} + \alpha_{2r} x_{i2} + \cdots + \alpha_{kr} x_{ik} = \sum_{j=1}^{k} \alpha_{jr} x_{ij}$$

where Z ir is the score for the ith data point on the rth principal component, α is the component loading, x is the variable, and k is the total number of variables. In this study, we focused specifically on identifying factors that affect the indoor PM 2.5, PM 10, and NO 2 concentrations during the measurement periods. PCA and linear regression were used to validate the correlation analysis results and determine the significant independent variables contributing to the degradation of target pollutants. PCA and biplot-based data visualization were carried out using the Scikit-learn machine learning library and the Yellowbrick visualizer in Python [61].

The highest means and standard deviations of indoor PM 2.5 and PM 10 (13.0 ± 30.2 µg/m 3 ; 15.0 ± 35.3 µg/m 3 ) were observed in site 3. For both sites 1 and 2, the recorded mean indoor PM 2.5 and PM 10 concentrations were lower than the outdoor PM 2.5 and PM 10 concentrations. Conversely, mean PM 2.5, PM 10, and NO 2 concentrations were higher indoors than outdoors in the residential building (site 3). The indoor concentrations of NO 2 at each site ranged from 14.8 to 46.5 ppb, 38.7 to 86.3 ppb, and 30.9 to 69.3 ppb, respectively. In all cases, the mean outdoor NO 2 values were significantly higher than indoors.

Results and Discussion
Time series of indoor and outdoor PM 2.5, PM 10, and NO 2 concentrations measured during the sampling periods are plotted in Figure 2. The majority of indoor PM 2.5 and PM 10 concentrations for all the sites met the minimum requirements of the ASHRAE 62.1-2019 standard, which are 35 µg/m 3 (24 h mean) for PM 2.5 and 50 µg/m 3 (24 h mean) for PM 10 [8,62]. The peaks with unhealthy levels of indoor PM 2.5 at site 3 may be attributed to regular household cooking and occupant activities (lunch or dinner break) [63][64][65]. The overall trends of indoor and outdoor particulate matter (PM 2.5 and PM 10) concentrations were similar for sites 1 and 2. It can be seen from Figure 2d,e that the peaks of indoor and outdoor particulate matter (PM 2.5 and PM 10) values in site 2 (office room) are shifted by a time delay. A potential reason for this trend is city traffic during rush hours, since site 2 has the shortest distance to the nearest busy road among all sites while having sedentary human behavior during working hours [66,67]. The indoor NO 2 time series shows a significantly more stable pattern than the outdoor concentration values for all sites due to a lack of indoor emitting sources [17]. Except for site 2, which is close to a busy road, most of the indoor concentration values for NO 2 lie below the ASHRAE 62.1-2019 standard limits of 53 ppb (1-year mean) and 100 ppb (1-h mean) [8,62].
Figure 3 presents quantile-quantile (Q-Q) plots used to verify and visualize the distributional difference between indoor air quality data and the corresponding outdoor data by plotting their quantiles against each other [48]. The type of data distribution can be determined by characterizing the spatial pattern of the normalized data points. If the two distributions are mostly similar (normally distributed), the points of the quantile plot fall close to the identity line [48,49]. Different types of distributions lead to different deviation ratios. In Figure 3a-c, the site 1 (indoor and outdoor) data are normally distributed with a small deviation from the identity line. In Figure 3d-f, which represents site 2, the quantile plots of indoor and outdoor pollutants show a skewed distribution with low and high degrees of variation from the identity line toward higher concentrations. For site 3, the PM 2.5 and PM 10 data are clustered along the low-to-medium part of the data range, while positive deviations were found in the higher concentration range. Figure 3i shows that the distribution of NO 2 indoor and outdoor concentrations is highly concentrated along the identity line.
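The two-sample Q-Q comparison described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the helper `qq_pairs`, the choice of 99 quantile levels, the z-score normalization, and the synthetic series are all assumptions. Each series has 2016 points, matching 14 days of ten-minute samples.

```python
import numpy as np


def qq_pairs(a, b, n_quantiles=99):
    """Return matched quantiles of two z-score-normalized samples."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    z_a = (a - a.mean()) / a.std(ddof=1)
    z_b = (b - b.mean()) / b.std(ddof=1)
    return np.quantile(z_a, probs), np.quantile(z_b, probs)


# Synthetic stand-ins for sensor series (14 days x 144 ten-minute samples)
rng = np.random.default_rng(0)
n_samples = 14 * 144
indoor = rng.normal(8.0, 2.0, n_samples)            # roughly normal
outdoor = rng.normal(12.0, 3.0, n_samples)          # roughly normal
outdoor_skewed = rng.lognormal(2.0, 0.8, n_samples) # right-skewed, with peaks

# Similar distributions: points stay close to the identity line
q_in, q_out = qq_pairs(indoor, outdoor)
dev_similar = np.max(np.abs(q_in - q_out))

# Dissimilar distributions: points depart from the line at high quantiles
q_in2, q_skew = qq_pairs(indoor, outdoor_skewed)
dev_skewed = np.max(np.abs(q_in2 - q_skew))

print(f"max deviation from identity line (similar): {dev_similar:.3f}")
print(f"max deviation from identity line (skewed):  {dev_skewed:.3f}")
```

Plotting `q_in` against `q_out` as a scatter with the line y = x overlaid gives the kind of panel shown in Figure 3; the maximum deviation is only one possible summary of how far the points stray from that line.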

Correlation Analysis
The Spearman rank-based correlation was used to extract the nonparametric relationship and quantify the association between target parameters by assigning a coefficient value bounded between −1 and 1 [50,51]. According to Figure 4, site 1 and site 2 reveal a stronger correlation between indoor and outdoor measurements compared to site 3. Significant positive correlations were found between indoor NO 2 and indoor relative humidity (R site1 = 0.85, R site2 = 0.99, R site3 = 0.77) at all sites. NO 2 has the propensity to react with water vapor present in building structures, which may lead to an increase in NO 2 concentrations [68,69]. This positive trend can also be found at site 2 and site 3 between indoor NO 2 and outdoor NO 2 (R site2 = 0.87, R site3 = 0.7), while site 1 shows a negligible correlation between them. Both sites 1 and 2 have a high degree of positive correlation between indoor particulate matter (PM 2.5 and PM 10) and the corresponding outdoor values (R site1_pm2.5 = 0.65, R site2_pm2.5 = 0.64, R site1_pm10 = 0.64, R site2_pm10 = 0.56). Many relevant studies have reported similar positive correlation values between indoor and outdoor PM concentrations in public buildings. Site 3 shows a high negative correlation between indoor and outdoor particulate matter (PM 2.5 and PM 10). A similar negative correlation between indoor and outdoor PM within a mechanically ventilated living space was observed in several studies [16,70,71]. This indicates that the indoor PM values are affected more strongly by day-to-day household activities in residential spaces than in educational and office spaces [72][73][74].

Biplot-PCA for Site 1, 2, 3
A PCA-based multivariate linear regression model was employed for parameters such as PM 2.5, PM 10, NO 2, temperature, and humidity under both indoor and outdoor test conditions [32,33]. This model was used to evaluate the results obtained from the correlation analysis [52,56]. The measured values were used to formulate a multidimensional dataset, which was projected onto a biplot. A biplot is a scatter plot that depicts the relationship between observed data and dependent variables in terms of principal components [34,35]. In a PCA-based biplot, points are the projected observations and vectors are the projected variables. However, a biplot cannot be used to estimate the exact coordinates because the vectors have been centered and scaled. The multivariate dataset was dimensionally reduced into 3D and 2D plots in which the above-mentioned parameters were plotted with respect to indoor PM 2.5, PM 10, and NO 2. To draw the biplot, the PCA results must first be interpreted, followed by identifying the number of principal components [35]. Sites 1 and 3 are represented in 3D plots as the sum of the first, second, and third principal components, which result in an aggregate of less than 90%. For site 2, PC1 and PC2 sum to more than 90%, and the data are hence represented as a 2D plot. Figure 5a,b display similar variance across the first, second, and third principal components. This can be attributed to the close dependency of PM 2.5 on PM 10 in site 1. This trend is also observed in site 2 (Figure 5d,e) and site 3 (Figure 5g,h). Site 2 shows the highest NO 2 principal component values at PC1 = 82%. Table 5 lists the PCA-based linear regression coefficient values for all three sites with respect to indoor PM 2.5, PM 10, and NO 2 with a 95% confidence interval. The linear regression results are in agreement with the Spearman inter-parameter correlation. Three levels of statistical significance (0.001, 0.05, and 0.1, in decreasing order of significance) were observed. Indoor PM 2.5 and PM 10 show a strong dependence at site 1; this trend can similarly be seen in the Spearman correlation matrix. NO 2 for site 1 shows negative principal component values, which is similar to the correlation values (r) obtained through Spearman correlation. The first and second principal component values for site 2 for PM 2.5 and PM 10 are almost identical. Likewise, Figure 4 shows that PM 2.5 and PM 10 are strongly correlated. This may be attributed to minimal occupant activity owing to the quarantine protocol, which restricted the maximum number of active occupants to one at a given time. For site 2, NO 2 has first and second principal component values of opposite sign (PC1 = −0.255, p < 0.001, and PC2 = 0.135, p < 0.001). This pattern can be consistently observed from the Spearman correlation as well. Site 3 has weak dependence across all three principal component values, while PC3 displays the least significance amongst all sites. This contrasts with the relationships among indoor NO 2, outdoor NO 2, and relative humidity observed in the correlation heatmap. Site 1 and site 2 show an overall stronger inter-parameter dependence, which is also seen in the PCA-based linear regression analysis. A relatively weaker correlation among parameters is displayed by the residential building (site 3). There is a mismatch between the results derived from the PCA (site 3) and the values of the corresponding correlation coefficients. The proposed approach could contribute to developing efficient solutions to identify and verify the significant variables that affect indoor air pollutant concentrations in buildings, such as window operation, ventilation control, and building material selection.
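The workflow in this section (standardize, run PCA, keep enough components to reach 90% cumulative explained variance, then regress an indoor pollutant on the retained component scores) can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the variable names and the response model are hypothetical, and the Yellowbrick visualizer used in the study is omitted so that the sketch depends only on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic predictors (hypothetical; 14 days x 144 ten-minute samples)
rng = np.random.default_rng(42)
n = 14 * 144
pm25_out = rng.gamma(4.0, 2.5, n)
pm10_out = 1.2 * pm25_out + rng.normal(0.0, 1.0, n)  # tracks PM2.5 closely
no2_out = rng.normal(40.0, 8.0, n)
rh_in = rng.normal(55.0, 5.0, n)
temp_in = rng.normal(23.0, 1.0, n)

names = ["pm25_out", "pm10_out", "no2_out", "rh_in", "temp_in"]
X = np.column_stack([pm25_out, pm10_out, no2_out, rh_in, temp_in])

# Hypothetical response: indoor PM2.5 driven partly by outdoor PM2.5
pm25_in = 0.5 * pm25_out + rng.normal(0.0, 1.0, n)

# Center and scale so every variable contributes on the same footing
X_std = StandardScaler().fit_transform(X)

# Fit all components, then keep the smallest number reaching 90% variance
pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.90) + 1)

# Biplot ingredients: scores are the points, loadings are the arrows
scores = pca.transform(X_std)[:, :n_components]
loadings = pca.components_[:n_components].T  # rows: variables, cols: PCs

# Regression of the indoor pollutant on the retained component scores
model = LinearRegression().fit(scores, pm25_in)
r2 = model.score(scores, pm25_in)

print("cumulative explained variance:", np.round(cum_var, 3))
print("components retained:", n_components)
for name, row in zip(names, loadings):
    print(f"{name:>9}: {np.round(row, 3)}")
print(f"R^2 of regression on PC scores: {r2:.3f}")
```

Two or three retained components map directly onto the 2D and 3D biplots described above: the first columns of `scores` give the point coordinates and the matching columns of `loadings` give the variable vectors. Variables whose loading vectors point in nearly the same direction, such as the two synthetic PM series here, are the ones a biplot shows as closely dependent.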

Conclusions
The purpose of this study was to introduce a biplot-based PCA approach that could serve as a novel method to visualize and validate the relations between indoor and outdoor air quality data. PM 2.5, PM 10, NO 2, and RHT data were collected continuously in three different building types (Supplementary Materials: elementary school, site 1; laboratory, site 2; and residential, site 3) over a span of two weeks. The highest means and standard deviations of indoor PM 2.5 and PM 10 (13.0 ± 30.2 µg/m 3 ; 15.0 ± 35.3 µg/m 3 ) were observed in site 3. For both sites 1 and 2, the recorded mean indoor PM 2.5 and PM 10 concentrations were lower than outdoors. The average indoor NO 2 levels were significantly lower and steadier than outdoors. The Spearman coefficients showed a stronger correlation among these target environmental measurements at sites 1 and 2 and a weaker correlation at site 3. The biplot-based PCA identified three principal components for sites 1 and 3 and two for site 2. The PCA-based linear regression results showed a higher dependency for site 1 and site 2 (p < 0.001) when compared to the Spearman correlation values (r) and a lower dependency for site 3. The results displayed a mismatch between the PCA-based regression and the Spearman correlation for site 3. The method used in this research can be applied in studies that analyze high volumes of multiple building environmental measurements and require optimized visualization. For further studies, it is recommended to include building characteristics, occupant behaviors, and seasonal variations with a larger sample size in order to better understand and analyze the relationships between indoor and outdoor air quality.

Conflicts of Interest:
The authors declare no conflict of interest.