Simulation and Analysis of Indoor Air Quality in Florida Using Time Series Regression (TSR) and Artificial Neural Networks (ANN) Models

Abstract: Exposure to air pollutants has been associated with various acute respiratory diseases and other detrimental health effects. Analyzing and interpreting air pollutant patterns is therefore as important as monitoring them. In the present study, indoor and outdoor PM2.5, PM10, NO2, relative humidity, and temperature were measured simultaneously, 24 h a day for four months, for a laboratory in Gainesville, Florida. The indoor PM2.5, PM10, and NO2 concentrations were predicted using multiple linear regression (MLR), time series regression (TSR), and artificial neural network (ANN) models. The modeling conducted in this study aims to provide a cross-comparison of these models in a symmetric environment. For indoor NO2, the ANN model improved the root-mean-square error by 18.33% and the coefficient of determination by 24.68% in comparison with the MLR model. The ANN model had the best performance and could predict the target air pollutants of the studied building at 10 min intervals with 90% accuracy. The TSR model showed slightly better performance than the MLR model. These results can serve as a reference for studies analyzing indoor air quality in similar building types and climate zones.


Introduction
Indoor air quality (IAQ) deterioration has been linked to an increased health risk of sick building syndrome (SBS) and building-related illness (BRI) [1,2]. People spend most of their daily lives in indoor environments, even though indoor air pollutant concentrations are on average three to five times higher than outdoors [3][4][5]. The target air pollutants selected in this study were particulate matter (PM2.5, PM10) and nitrogen dioxide (NO2). They are the most common indoor air pollutants defined in both the United States Environmental Protection Agency (US EPA) and the World Health Organization (WHO) standards [6][7][8]. Fine particulate matter is mainly generated from incomplete combustion, vehicle emissions, and dust [1,6,9]. The most common source of NO2, a reddish-brown gas, is the combustion of fossil fuel [1,6,10,11]. Effective analysis and prediction of indoor air pollutant patterns are therefore as critical as monitoring them.
In recent years, the use of laser-based semiconductor techniques and low-cost air pollution monitoring techniques has surged [12][13][14]. There has been growing interest in determining the relationships between indoor and outdoor air quality using real-time measured data. Various models have been developed to predict IAQ from environmental parameters. For example, Hassanvand et al. [15] and Chithra et al. [16] developed linear regression models to study and forecast the mass concentrations of indoor and outdoor particulate matter (PM10, PM2.5, and PM1) in a residential house and a school building, respectively. Mohammadyan et al. [17] applied multiple regression models using indoor and outdoor environmental data (fine particles, temperature, and relative humidity). The model was moderately predictive of the measured indoor pollutants. Jafta et al. [18] developed leave-one-out cross-validation-based multivariate models using PM2.5 concentrations, meteorological information, and household human activities as inputs, while Shupler et al. [19] added housing material and season as inputs. Kim et al. [20] used principal component analysis (PCA)-based linear models to predict indoor particulate matter (PM) after dimension reduction. Tong et al. [21] reported that a linear mixed-effects regression model was moderately predictive of indoor PM2.5, and the fitted model indicated that outdoor PM2.5, relative humidity, and indoor PM2.5 were positively associated. Yuchi et al. [22] and Lin et al. [23] reported that random forest-based machine learning predictive models and multiple linear regression models showed similar performance when predicting levels of certain indoor pollutants. This review of the literature shows that most studies predicting indoor air quality have focused on outdoor concentrations and human behaviors.
Most of the measured data are ordered by time, and many observations have shown a potential time delay between the infiltration of outdoor pollutants and the resulting indoor concentrations [24][25][26][27]. However, to our knowledge, no study of the relationships between indoor and outdoor air quality has attempted to include time-delay effects among the variables in regression analysis. In addition, artificial neural networks (ANN) have attracted attention from a few researchers in recent years for indoor air quality forecasting [28][29][30][31][32][33]. Park et al. [25], Challoner et al. [34], and Saad et al. [35] reported that ANN-predicted indoor pollutant concentrations showed better performance than those of other regression models. Conversely, Liu et al. [31] developed nonlinear regression models that predicted IAQ levels with higher performance than other prediction models (partial least squares, backpropagation ANN, and least-squares).
This study performed numerical simulations with monitored indoor and outdoor environmental data, coupled with the multiple linear regression model, the time series regression model, and the ANN-based model, to determine the relationships between indoor and outdoor air quality. Furthermore, these models are expected to predict the indoor air pollutant concentrations in a laboratory building in Florida. The objective of this paper is to perform a cross-comparison study between multiple linear regression (MLR), time series regression (TSR), and non-linear ANN models. The hypothesis of this study is that real-time monitored air quality data are fit better by non-linear models such as ANN than by linear prediction models. The outline of this paper is as follows. In Section 2, the sampling protocols and the TSR and ANN methods are introduced. In Section 3, the results of a cross-comparison of the aforementioned models are presented. The final section presents the conclusions of this article.

Sampling Site and Sampling Protocol
Measurements at the test building were conducted between 17 May 2020 (Sunday) and 18 September 2020 (Friday) in Gainesville, Florida, United States. The building chosen for air quality monitoring is mechanically conditioned throughout the year, and the windows are kept closed to meet the American Society of Heating, Refrigerating, and Air-Conditioning Engineers (ASHRAE) Standard 62.1-2019 [36]. The laboratory building mainly functions as an office space with a 321 sq. ft floor area, served by a single ducted air handling unit that can provide a PM2.5 removal efficiency of 25% [37]. The standardized US EPA protocol for characterizing IAQ in large office buildings was followed to reduce measurement uncertainty [12,38]. Indoor and outdoor air quality was measured continuously for four months, 24 h a day, at 10 min sampling intervals. Two US EPA (Air Quality Sensor Performance Evaluation Center) qualified monitors (Air Quality Egg, Version-2018) were used to measure the concentrations of indoor and outdoor PM2.5, PM10, and NO2, as well as relative humidity and temperature (RHT) [11,39,40]. The basic specifications of the monitor are shown in Table 1. The indoor monitor was placed about 3.6 feet above the floor and 1.5 feet away from an interior wall. Outdoor measurements were taken simultaneously with the indoor ones. A weatherproofed monitor was set up 5 feet above the surface of the deck outside the room (Figure 1).

Multiple Linear Regression Model
The monitored data were extracted and summarized with descriptive statistics (mean, standard deviation, variance, and indoor/outdoor (I/O) ratio) using Anaconda Python and Jupyter Notebooks [41]. Multiple linear regressions were performed to assess and predict the relationship between indoor air pollutants (PM2.5, PM10, and NO2) and the other measured environmental parameters. A multiple linear regression model can be expressed as Equation (1) [22,42,43]:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{p-1} x_{p-1} + \varepsilon \qquad (1)$$
where $y$ is the dependent variable, $\beta_0, \beta_1, \ldots, \beta_{p-1}$ are the regression coefficients, $x_1, x_2, \ldots, x_{p-1}$ are the independent variables, and $\varepsilon$ is the random error term. The performance of the developed MLR models was examined by calculating the coefficient of determination ($R^2$) and the root mean square error (RMSE). The performance indicators were calculated as Equations (2) and (3) [44][45][46]:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (2)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (3)$$
where $i$ indexes the observations, $n$ is the number of observations, $\hat{y}_i$ is the predicted value at each time, $y_i$ is the measured NO2 (PM2.5 or PM10) concentration, and $\bar{y}$ is the average value of the measured NO2 (PM2.5 or PM10) concentrations.
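The MLR fit and the two performance indicators described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the helper names (`fit_mlr`, `predict_mlr`, `r_squared`, `rmse`) are hypothetical, the data are synthetic, and R² is computed in the common residual-sum-of-squares form.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares fit; returns [b0, b1, ..., b_{p-1}]."""
    A = np.column_stack([np.ones(len(X)), X])   # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_mlr(beta, X):
    """Evaluate b0 + b1*x1 + ... for every row of X."""
    return beta[0] + X @ beta[1:]

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y, y_hat):
    """Root mean square error between measured and predicted values."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

# Synthetic stand-in data (assumption: NOT the study's measurements)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
true_beta = np.array([2.0, 0.5, -1.0, 3.0])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0.0, 0.05, size=200)

beta = fit_mlr(X, y)
y_hat = predict_mlr(beta, X)
print("coefficients:", np.round(beta, 3))
print("R2 =", round(r_squared(y, y_hat), 4), "RMSE =", round(rmse(y, y_hat), 4))
```

With real data, `X` would hold the measured environmental parameters and `y` the indoor pollutant concentration being predicted.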

Time Series Regression Model
To predict the dependency between the recorded variables while accounting for time-lag and time-accumulation effects, the TSR models were developed using MATLAB R2020a [47,48]. Autoregressive (AR) approaches were implemented for forecasting, based on the same datasets used in the MLR models [49,50]. In an AR method, each variable is a linear function of its own past values and of other previously recorded values [50][51][52]. The inputs of the model were the indoor and outdoor levels of temperature (°F), relative humidity (%), PM2.5 (µg/m³), PM10 (µg/m³), and NO2 (ppb). The indoor PM2.5, PM10, or NO2 of the laboratory building was considered as the output. The number of input variables was then gradually increased by adding values of the target pollutant from previous time steps (10 min intervals) to find the best-fit model (see Figure 2). To evaluate the performance of the TSR models and avoid biased estimation, we computed the RMSE between simulated and real-time measured data [29,44]. As Figure 2a,b show, the best results were obtained by including the observed values of the past 250 min for indoor PM2.5 and indoor PM10, respectively. Figure 2c shows that the best results for indoor NO2 were obtained by including the observed values of the past 470 min. In the equations obtained for the TSR model [49,50], a, b, c, d, e, f, g, h, i, and j are the coefficients, and k is the number of time intervals (10 min each). T_out(t) and T_in(t) are the outdoor and indoor temperature measured at time t, respectively. H_out(t) and H_in(t) represent the outdoor and indoor relative humidity at time t. PM2.5_out(t), PM10_out(t), PM2.5_in(t), and PM10_in(t) are the measured values of outdoor and indoor PM2.5 and PM10 at time t. NO2_out(t) and NO2_in(t) represent the outdoor and indoor NO2 concentrations (ppb) at time t.
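The lag-selection procedure described above (adding past values of the target pollutant one 10 min step at a time and comparing RMSE) can be sketched in Python as follows. This is an illustrative assumption rather than the MATLAB implementation used in the study: the function names are hypothetical, the model is fitted by ordinary least squares, and the series are synthetic. A 250 min history would correspond to `n_lags = 25` at 10 min sampling.

```python
import numpy as np

def build_lagged_design(exog, target, n_lags):
    """Stack current exogenous inputs with the n_lags previous target values.

    Row t holds [1, exog[t], target[t-1], ..., target[t-n_lags]]; the first
    n_lags samples are dropped because their history is incomplete.
    """
    rows = range(n_lags, len(target))
    lags = np.array([[target[t - k] for k in range(1, n_lags + 1)] for t in rows])
    X = np.column_stack([np.ones(len(lags)), exog[n_lags:], lags])
    return X, target[n_lags:]

def fit_and_rmse(exog, target, n_lags):
    """Least-squares fit of the lagged model; returns the in-sample RMSE."""
    X, y = build_lagged_design(exog, target, n_lags)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sqrt(np.mean((y - X @ coef) ** 2)))

# Synthetic series (assumption): indoor level depends on its two previous
# values and on the current outdoor level.
rng = np.random.default_rng(1)
n = 500
outdoor = rng.uniform(5.0, 30.0, size=n)
indoor = np.zeros(n)
for t in range(2, n):
    indoor[t] = (0.5 * indoor[t - 1] + 0.2 * indoor[t - 2]
                 + 0.1 * outdoor[t] + rng.normal(0.0, 0.2))
exog = outdoor.reshape(-1, 1)

# Increase the number of lagged terms and watch how the RMSE changes
for n_lags in (1, 2, 3, 5):
    print(n_lags, round(fit_and_rmse(exog, indoor, n_lags), 4))
```

In this toy series the RMSE drops noticeably when the second lag is added and then levels off, which mirrors the idea of choosing the history length at which the error stops improving.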

Artificial Neural Networks Model
Artificial neural networks have been widely employed to investigate complex ambient air processes. Many studies have predicted the concentrations of indoor air pollutants with ANN [33,35,46]. The multi-layer perceptron (MLP) network is one of the major types of ANN architecture [53]. Usually, an MLP network consists of an input layer, an output layer, and one or more hidden layers [54]. Each layer is composed of several interconnected nonlinear processing components called neurons or nodes [29,32]. The neurons are connected to other neurons in adjacent layers through adaptive synaptic coefficients [29,54]. In this study, two-layer feed-forward neural network models were developed using the neural network toolbox of MATLAB R2020a [48]. The same dataset used in the MLR and TSR approaches, comprising 18,144 time-series records for each input neuron, was used to train the ANN models. The logistic sigmoid transfer function and a linear activation function were used to avoid fitting problems between the input and output datasets. The input dataset was normalized using min-max normalization. The ANN models were trained with the Levenberg-Marquardt (LM) based back-propagation (BP) algorithm [44,52,55]. The LM algorithm is one of the most effective algorithms for accelerating the convergence rate of ANNs with multilayer perceptron (MLP) architectures while conserving computational resources [35,46]. Previous studies have sought the optimal number of hidden neurons to obtain the best network performance. In this study, the number of hidden neurons was determined to be 10 after several iterations [52,[54][55][56][57]. As shown in Figure 3, the input layer contains nine neurons for the indoor and outdoor air quality parameters. The measured indoor concentrations of PM2.5, PM10, and NO2 were considered as the desired targets of the ANN models. For each training run, the database was randomly divided into three groups: 70% training, 15% validation, and 15% testing [32,54].
The optimum network architecture of the ANN models was developed with 10 hidden neurons [52,56]. The performance of the developed ANN models was examined by calculating the coefficient of determination (R²) and the root mean square error (RMSE) [44,46].
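The data preparation and network shape described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it shows min-max normalization, a random 70/15/15 split, and the forward pass of a network with nine inputs, ten logistic-sigmoid hidden neurons, and one linear output. It does not implement Levenberg-Marquardt training, the weights are random and untrained, the inputs are synthetic, and all function names are hypothetical.

```python
import numpy as np

def min_max_normalize(X):
    """Scale every column to [0, 1]; constant columns map to 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

def split_indices(n, rng, train=0.70, val=0.15):
    """Randomly partition range(n) into training, validation, and test sets."""
    order = rng.permutation(n)
    n_train = int(round(train * n))
    n_val = int(round(val * n))
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

def logistic(z):
    """Logistic sigmoid transfer function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Feed-forward pass: sigmoid hidden layer followed by a linear output."""
    hidden = logistic(X @ W1 + b1)
    return hidden @ W2 + b2

rng = np.random.default_rng(42)
n_samples, n_inputs, n_hidden = 1000, 9, 10

# Synthetic inputs (assumption): stand-ins for the nine measured parameters
X = min_max_normalize(rng.normal(size=(n_samples, n_inputs)))
train_idx, val_idx, test_idx = split_indices(n_samples, rng)

# Untrained random weights, used only to show the shapes involved
W1 = rng.normal(scale=0.5, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
b2 = np.zeros(1)

y_pred = forward(X[train_idx], W1, b1, W2, b2)
print(len(train_idx), len(val_idx), len(test_idx), y_pred.shape)
```

In a full workflow the weights would be fitted on the training indices, the validation indices would be used to decide when to stop training, and the test indices would be held back for the final R² and RMSE.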

Measured Environmental Parameters
The descriptive statistics of the measured indoor and outdoor environmental data are presented in Table 2 and Figure 4. The mean indoor and outdoor temperatures were 70.7 °F and 81.5 °F, respectively, while outdoor relative humidity had a slightly higher mean and a larger standard deviation than indoor relative humidity. The mean concentrations of PM2.5, PM10, and NO2 were significantly lower indoors than outdoors. Indoor PM2.5 ranged from 0 µg/m³ to 38.5 µg/m³, indoor PM10 from 0 µg/m³ to 68 µg/m³, and indoor NO2 from 72.3 ppb to 83.9 ppb. Figure 5 shows the plots of the four-month-long continuous records at 10 min intervals. Most indoor PM2.5 and PM10 data met the 24 h safe levels in the ASHRAE 62.1-2019 standard [36]. The indoor concentrations of both particulate matter fractions followed a diurnal pattern similar to that of the outdoor concentrations. As can be seen from Figure 4c, the indoor NO2 data show a significantly more stable pattern than the outdoor data across the sampling period. Potential reasons for this trend are that most gaseous pollutants had already infiltrated and that indoor NO2 sources were lacking [1,12,58]. However, the majority of indoor NO2 data lie above the annual requirement of the ASHRAE 62.1 standard (53 ppb) [11,36].
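Descriptive statistics of this kind can be computed with the Python standard library, as in the short sketch below. The readings are hypothetical placeholders rather than values from Table 2, the `describe` helper is an illustrative name, and the I/O ratio is taken here as the ratio of the indoor mean to the outdoor mean, which is one common convention and an assumption about how it was defined in this study.

```python
from statistics import mean, stdev, variance

def describe(indoor, outdoor):
    """Summarize paired indoor/outdoor readings of one parameter."""
    return {
        "indoor_mean": mean(indoor),
        "outdoor_mean": mean(outdoor),
        "indoor_sd": stdev(indoor),        # sample standard deviation
        "outdoor_sd": stdev(outdoor),
        "indoor_var": variance(indoor),    # sample variance
        "outdoor_var": variance(outdoor),
        "io_ratio": mean(indoor) / mean(outdoor),
    }

# Hypothetical PM2.5 readings in µg/m3 (assumption, for illustration only)
indoor_pm25 = [4.0, 5.0, 6.0, 5.0]
outdoor_pm25 = [8.0, 10.0, 12.0, 10.0]

summary = describe(indoor_pm25, outdoor_pm25)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

An I/O ratio below 1, as in this toy example, corresponds to lower concentrations indoors than outdoors.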

Correlation between Indoor and Outdoor Data
Simple linear regression and the nonlinear Spearman's rank correlation were used to extract the direct correlation between the indoor data and their corresponding outdoor data. The dataset was also used to investigate the symmetry between the indoor and outdoor time-series concentrations. The input dataset was normalized using standardized normalization. Figure 6a,b show the linear correlation and symmetric distributions between indoor and outdoor particulate matter (PM2.5 and PM10), with very low coefficients of determination (R²_PM2.5 = 0.26; R²_PM10 = 0.24, p < 0.01). However, statistically moderate correlations (R_PM2.5 = 0.53; R_PM10 = 0.52) were found in the Spearman's rank results. As Figure 7 shows, Spearman's coefficient indicates a significant correlation (R_NO2 = 0.87) between indoor and outdoor NO2. Conversely, a very low correlation was found between indoor and outdoor NO2 in the linear regression analysis (Figure 6c). There were statistically significant mismatches between the non-linear Spearman's coefficients and the linear correlation coefficients. The results also indicate that the univariate linear regression method might not accurately predict the indoor concentrations of PM2.5, PM10, and NO2. Since these results are not compelling, multivariate analysis and further validation should be considered for the IAQ prediction model.
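The mismatch between a rank-based and a linear measure of association can be illustrated with a small Python sketch. It uses only the standard library, synthetic data, and hypothetical helper names; it is not the analysis code of this study. Spearman's coefficient is computed as the Pearson correlation of the ranks, with tied values receiving their average rank.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def average_ranks(values):
    """Rank from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average 1-based position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Synthetic example (assumption): monotonic but strongly nonlinear relation
x = [float(v) for v in range(1, 11)]
y = [v ** 4 for v in x]

print("linear R2    :", round(pearson_r(x, y) ** 2, 3))
print("Spearman rho :", round(spearman_rho(x, y), 3))
```

Because `y` increases whenever `x` does, the rank correlation is perfect, while the linear R² is clearly lower; a gap of this kind is what motivates looking beyond univariate linear regression.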

Cross-Comparison of the MLR, TSR, and ANN Models in the Prediction of the Indoor PM2.5 and PM10
The developed TSR and ANN models were used to simulate the 10 min average indoor particulate matter concentrations (PM2.5 and PM10) of a laboratory building near downtown Gainesville, Florida. To better evaluate the prediction performance of each model, their outputs were compared with the results of conventional multiple linear regression (MLR). Figure 8 shows the original and simulated indoor PM2.5 time series from the developed models for the period from 17 May 2020 to 18 September 2020. It can be observed that the MLR, TSR, and ANN models successfully predict the indoor PM2.5 concentrations and their trend. The adjusted p-values, all less than 0.001, were validated using an F-test. For the simulated data, the R² values of the models (MLR, TSR, ANN) are above 0.9 (p < 0.001), and their RMSE values are less than 0.1 (p < 0.001) (Table 3). TSR and MLR had the same R² values, but the RMSE value of TSR was 1.31% lower than that of MLR. Additionally, the R² value of the ANN model was 0.08% higher than that of the MLR model, and the RMSE of ANN was 5.37% lower than that of MLR. Similar performance was observed in the case of indoor PM10 (Figure 9). MLR (R² = 0.9985, p < 0.001), TSR (R² = 0.9986, p < 0.001), and ANN (R² = 0.9995, p < 0.001) all successfully predicted indoor PM10 concentrations. The RMSE of the TSR model was 1.02% lower than that of MLR. The R² value of the ANN model was 0.09% higher than that of MLR, and the RMSE of ANN was 13.66% lower than that of MLR. In summary, the ANN models had the best performance among all the models in predicting indoor particulate matter (PM2.5 and PM10), and the TSR models performed better than the MLR. A random forest-based ANN study of indoor PM2.5 concentrations in a residential environment reported R² and cross-validation R² values of 0.93 and 0.65, respectively [59]. Park et al. [25] also conducted an ANN study in Seoul metropolitan subway stations that reported an average value of 0.62 and a mean RMSE of 33.04 for indoor PM10 concentrations. These studies had different environmental conditions, which might have caused results that differ from those of our study. Moreover, several studies have shown that factors such as building materials, indoor environmental conditions, and human activities may contribute to indoor particulate matter concentrations [57,[60][61][62][63][64].
The predicted results of the developed models for indoor NO2 concentrations are illustrated in Figure 10 and Table 4. The adjusted p-values, all less than 0.001, were validated using an F-test. The validation of each developed model using 15% of the dataset showed that the R² values (p < 0.001) of the models were 0.7230 (MLR), 0.7233 (TSR), and 0.9014 (ANN), while the RMSEs (p < 0.001) were 3.8861, 3.8558, and 3.1737, respectively, indicating that the models successfully predicted indoor NO2 concentrations. The higher RMSE of the NO2 prediction models compared with the particulate matter models can be ascribed to the high spatial variation of NO2 concentrations in the outdoor environment. Again, the ANN model showed the best performance in predicting the indoor NO2 levels compared to the TSR and MLR. The R² value (p < 0.001) of the ANN model was 24.68% higher than that of the MLR model, and the RMSE (p < 0.001) of ANN was 18.33% lower than that of MLR. The TSR model gave slightly better performance than the MLR model. A study in an office room of a commercial building in Dublin showed that an ANN model achieved an R² of 0.899, compared to 0.11 for a linear regression model [34].
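The relative improvements quoted for indoor NO2 can be checked from the validation metrics given in the text with a few lines of Python. The helper names are illustrative. Because the R² and RMSE values are rounded to four decimals, the recomputed percentages are expected to agree with the reported 24.68% and 18.33% only to within rounding.

```python
def pct_increase(new, base):
    """Relative increase of `new` over `base`, in percent."""
    return (new - base) / base * 100.0

def pct_decrease(new, base):
    """Relative decrease of `new` below `base`, in percent."""
    return (base - new) / base * 100.0

# Validation metrics for indoor NO2 as quoted in the text
r2 = {"MLR": 0.7230, "TSR": 0.7233, "ANN": 0.9014}
rmse = {"MLR": 3.8861, "TSR": 3.8558, "ANN": 3.1737}

ann_r2_gain = pct_increase(r2["ANN"], r2["MLR"])
ann_rmse_drop = pct_decrease(rmse["ANN"], rmse["MLR"])
tsr_rmse_drop = pct_decrease(rmse["TSR"], rmse["MLR"])

print(f"ANN vs MLR, R2 increase  : {ann_r2_gain:.2f}%")
print(f"ANN vs MLR, RMSE decrease: {ann_rmse_drop:.2f}%")
print(f"TSR vs MLR, RMSE decrease: {tsr_rmse_drop:.2f}%")
```

The same two helpers apply to the PM2.5 and PM10 comparisons once the corresponding Table 3 values are substituted.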

Conclusions
This study measured time-series concentrations of air pollutants (PM2.5, PM10, and NO2), as well as temperature and relative humidity, both indoors and outdoors at a research lab for a duration of four months in the midst of the COVID-19 pandemic (Supplementary Materials). The real-time measured data were used to develop predictive models to analyze the relationship between pollutant concentrations and their symmetry. The hypothesis was verified using the abovementioned predictive models. In addition, the results showed that there was a complex non-linear relationship between the indoor and outdoor environmental parameters. Indoor air quality (PM2.5, PM10, and NO2) can be accurately estimated by the MLR, TSR, and ANN models. The comparison showed that the ANN model had the best performance and could predict the PM2.5 and PM10 of the studied building at 10 min intervals with 90% accuracy. The TSR model performed similarly to the MLR model when the time delay between indoor and corresponding outdoor air pollutants was considered. For indoor PM2.5 and PM10 prediction, the R² values of the MLR models were above 0.95 (p < 0.01) and the RMSE values were less than 0.1 (p < 0.01). In comparison, the ANN model performed better than the MLR model, with an increase in R² of up to 0.09% and a decrease in RMSE of up to 13.66%. For indoor NO2 prediction, the R² value of the MLR model was above 0.7 (p < 0.01) and the RMSE value was less than 4. In comparison, the ANN model performed better than the MLR model, with an increase in R² of 24.68% (p < 0.01) and a decrease in RMSE of 18.33%. The results of this study indicate that ANN should be preferred over linear regression models in future research involving time-symmetric environmental parameters.
Building managers can refer to this study when making dynamic adjustments to indoor air quality in similar building types and climate zones. However, the accuracy of the prediction may be influenced by other uncertain parameters such as building type and indoor human activities. Building parameters can be included in the ANN model analyses for a better understanding of the relationships between indoor and outdoor air quality.

Conflicts of Interest:
The authors declare no conflict of interest.