Global and Geographically and Temporally Weighted Regression Models for Modeling PM2.5 in Heilongjiang, China from 2015 to 2018

Objective: This study investigated the relationships between PM2.5 and 5 criteria air pollutants (SO2, NO2, PM10, CO, and O3) in Heilongjiang, China, from 2015 to 2018 using global and geographically and temporally weighted regression models. Methods: Ordinary least squares regression (OLS), linear mixed models (LMM), geographically weighted regression (GWR), temporally weighted regression (TWR), and geographically and temporally weighted regression (GTWR) were applied to model the relationships between PM2.5 and 5 air pollutants. Results: The LMM and all GWR-based models (i.e., GWR, TWR, and GTWR) showed great advantages over OLS in terms of higher model R2 and more desirable model residuals, especially TWR and GTWR. The GWR, LMM, TWR, and GTWR improved the model explanation power by 3%, 5%, 12%, and 12%, respectively, from the R2 (0.85) of OLS. TWR yielded slightly better model performance than GTWR and reduced the root mean squared errors (RMSE) and mean absolute error (MAE) of the model residuals by 67% compared with OLS; while GWR only reduced RMSE and MAE by 15% against OLS. LMM performed slightly better than GWR by accounting for both temporal autocorrelation between observations over time and spatial heterogeneity across the 13 cities under study, which provided an alternative for modeling PM2.5. Conclusions: The traditional OLS and GWR are inadequate for describing the non-stationarity of PM2.5. The temporal dependence was more important and significant than spatial heterogeneity in our data. Our study provided evidence of spatial–temporal heterogeneity and possible solutions for modeling the relationships between PM2.5 and 5 criteria air pollutants for Heilongjiang province, China.


Introduction
Air pollutants can be emitted from anthropogenic and natural sources and may be either emitted directly (primary pollutants) or formed in the atmosphere (as secondary pollutants) [1]. They may be transported or formed over long distances and have influences on human health, ecosystems, the built environment, and climate in large areas. It has been confirmed by extensive epidemiological studies that air pollution is closely associated with increased risks of mortality or morbidity for cardiovascular and respiratory diseases [2][3][4][5]. It was reported that air pollution ranked the 7 th killer to the global public health and contributed to 3.2 million and 4.2 million premature deaths Worldwide in 2010 and 2016, respectively [6,7]. The European Environment Agency estimates that about 30% of Europe's urban population is still exposed to air pollution concentrations exceeding the European Union air quality standards [1,8]. With the rapid urbanization and industrialization, China has experienced a serious challenge of preventing and controlling air pollution as well. About 1.2 million premature deaths in 2010 and 25 million disability-adjusted life-years annually in China resulted from ambient or outdoor air pollutions [9]. Ambient particulate matter pollution has become the 4th killer to the health of Chinese people in 2010, after dietary risk, high blood pressure, and smoking [6]. It is of great importance to prevent and control air pollution for constructing ecological civilization in China.
In 2012, the Chinese Ministry of Environmental Protection (MEP) released the revised Ambient Air Quality Standards and defined the 6 criteria air pollutants including the particular matter with aerodynamic diameter of 10 µm or less (PM 10 ), the particular matter with aerodynamic diameter of 2.5 µm or less (PM 2.5 ), ozone (O 3 ), nitrogen dioxide (NO 2 ), and sulfur dioxide (SO 2 ), and carbon monoxide (CO) [10]. To mitigate the conflicts between air pollution and public health, the Chinese State Council issued the Air Pollution Prevention and Control Action Plan (APPCAP) in September 2013 and proposed a goal of reducing PM 2.5 by 10% in major cities by the end of 2017 compared to 2012 levels [11]. The Plan was a milestone of air pollution prevention and control efforts in China and considered the most stringent air pollution control policy in China to date [12]. Although the improvements to air quality in most cities in recent years, decreasing PM 2.5 in the Northern Plain of China is still a tough challenge and requires strong transregional collaboration [13]. Most studies focused on the levels of air pollution in the Beijing-Tianjin-Hebei region, the Yangtze River Delta, and the Pearl River Delta, which are the most developed regions in China in terms of a high level of per capita GDP and high population density (e.g., [14][15][16][17][18]). There is no adequate attention paid to the air pollution problems in northern China compared to above-developed regions, especially in the Heilongjiang Province.
To clarify the pollution level, characteristics, and main sources of particulate matter, the Heilongjiang government began to investigate the sources of atmospheric particulate matter within the province in 2013 [19]. It found that the main pollutants in the cities of Heilongjiang Province were inhalable particulate matter (PM 10 ) and fine particulate matter (PM 2.5 ). In particular, the increase in coal-fired emissions during the heating season (late October, November, December, January, February, and March) significantly increased the concentration of PM 2.5 . The local sources of PM 10 and PM 2.5 in Heilongjiang Province are mainly from coal-fired, followed by motor vehicle exhaust and dust. The contributions of different sources have certain seasonal variations [19]. Although the sources of air pollutants were explored, the spatial and temporal variation characteristics of different air pollutants in Heilongjiang Province were barely investigated.
Space and time are significant determinants of PM 2.5 . Similar to Tobler's (1970) first law of geography that is "everything is related to everything else, but near things are more related than distant things" [20], many geographic processes (such as air pollution) followed similar rules in the temporal domain. Over the last two decades, researchers developed various statistical models to deal with spatial and/or temporal heterogeneities. Geographically weighted regression (GWR) was designed to extend the traditional global model fitted by ordinary least squares (OLS) and could effectively deal with spatial heterogeneity and autocorrelation in model errors [21][22][23]. GWR belongs to local modeling techniques and fits a regression model at each geographic location based on the neighbors within a specific bandwidth and distance-dependent weight function. In recent years, researchers have extended GWR to a temporal dimension for spatiotemporal modeling, which is called geographically and temporally weighted regression (GTWR) [24,25]. GTWR deals with both spatial and temporal nonstationarities simultaneously by constructing a weight matrix based on spatiotemporal distance. GTWR greatly expands the boundary of local modeling techniques, and has been applied in various fields, such as environment, economics, sociology, and so on [24,[26][27][28][29][30]. As a special case of GTWR, GWR ignores the temporal non-stationarity, while TWR (temporally weighted regression) ignores the spatial non-stationarity thus that GTWR integrates GWR and TWR into one uniform framework.
Besides GTWR, linear mixed models (LMM) can be applied to deal with both temporal autocorrelations and spatial heterogeneity problems. LMM can account for the sources of heterogeneity and dependence in the data by using the random-effects and model temporal autocorrelations by using appropriate covariance matrices for the model residuals, thereby aiding in statistical inference [31]. LMM can improve model fitting and performance if a random variation is focused, particularly in the studies of ecological heterogeneity or the heritability of discrete characters [32]. However, LMM is not widely applied for air pollution data but particularly suitable to this study because it can deal with air pollutants as the fixed-effects and cities (spatial) as the random-effects within longitudinal observations (temporal). Therefore, LMM is an appropriate alternative to model PM 2.5 in this study.
Given the need for a better understanding of spatial-temporal heterogeneity between the criteria air pollutants in Heilongjiang Province, spatial-temporal statistical techniques present an important tool to quantitatively describe the amount of air pollution at particular locations at specific times. The objective of this study was to model PM 2.5 with the 5 criteria air pollutants (SO 2 , NO 2 , PM 10 , CO, and O 3 ) in Heilongjiang Province, China, from 2015 to 2018 using global and geographically and temporally weighted regression models. Specifically, this study implemented 5 models, including ordinary least square regression (OLS) that was applied as the benchmark for model comparisons, linear mixed model (LMM) that considered different cities (or regions) as the random-effects and repeated measurements over time, traditional geographically weighted regression (GWR), temporally weighted regression (TWR), and geographically and temporally weighted regression (GTWR). The goodness-of-fit, prediction accuracy, uncertainty accuracy of models, and model residuals were evaluated and compared based on corrected Akaike's information criterion (AICc), adjusted coefficient of determination (R 2 a ), and root mean squared errors (RMSE), mean absolute error (MAE), normal (Z) scores, and Moran's I of the model residuals. Understanding the spatial and temporal heterogeneity of various air pollutants can assist in the PM 2.5 prediction and air pollutants control management in the future.

Study Area and Data
Heilongjiang Province is located in the Northeast of China between 43 • 26 N and 53 • 33 N Latitude, and 121 • 11 E and 135 • 05 E Longitude. It has 13 prefecture-level administrative regions, including 12 prefecture-level cities (i.e., Harbin, Qiqiha'er, Mudanjiang, Jiamusi, Daqing, Jixi, Shuangyashan, Yichun, Qitaihe, Hegang, Suihua, and Heihe) and 1 region (Da Xing'an Mountain) with a total area of 473,000 km 2 ( Figure 1). Generally, the terrain of Heilongjiang is high in the northwest, north, and southeast, low in the northeast and southwest, and is mainly composed of mountains, terraces, plains, and water. The elevations of mountains, terraces and plains are between 300 and 1000 m (accounting for 58% of the total area), between 200 and 350 m (14%), and between 50 and 200 m (28%), respectively. Heilongjiang Province belongs to the continental monsoon climate of cold temperate and temperate zone with a short frost-free period and has large regional differences in climate. The main characteristics are low-temperature and drought in spring, warm and rainy in summer, dry and early-frost in autumn, and long cold in winter. The precipitation is abundantly affected by the southeast monsoon in summer and insufficient controlled by the dry and cold northwest wind in winter, which presents obvious monsoon characteristics.
In this study, 6 criteria air pollutants, including SO 2 , NO 2 , PM 10 , CO, O 3 , and PM 2.5, were obtained from the Environmental Monitoring Station of Heilongjiang Province and had already aggregated into the 12 prefecture-level cities and 1 region based on the 56 environmental monitoring sites in Heilongjiang Province. All the pollutants were recorded by µg/m 3 , except CO (mg/m 3 ). The daily records of the criteria air pollutants were aggregated into weekly measurements of the 13 cities (region) of Heilongjiang Province from 2015 to 2018 (i.e., 210 weeks per city or region, and 2730 records in total). The descriptive statistics of the 6 air pollutants are summarized in Table 1.   Figure 2 presents the matrix of Pearson correlation coefficients (ρ) of the 6 air pollutants. It showed that PM 2.5 was highly correlated (ρ >0.5) with PM 10 , NO 2 , CO, and SO 2 ; whereas O 3 was weakly (−0.25 < ρ < 0) to moderately (−0.5 < ρ < −0.25) correlated with the group of PM 10 , NO 2 and PM 2.5 , and the group of SO 2 and CO, respectively. Figure 3 presents the variability of the 6 criteria air pollutants cross locations (12 prefecture-level cities and 1 region) and periods (210 weeks from 2015 to 2018). There were obvious seasonal trends for all the air pollutants in most cities (region), especially in Harbin, Qiqiha'er, and Suihua. On the other hand, Heihe, Yichun, and Da Xing'an Mountain region showed relatively stable trends for all pollutants, except O 3 compared to other cities. It indicated both spatial and temporal variability visually existing in Heilongjiang Province.

OLS and LMM
Traditional ordinary least square (OLS) regression was used as the benchmark for model comparisons as follows:

OLS and LMM
Traditional ordinary least square (OLS) regression was used as the benchmark for model comparisons as follows: where Y i is the response variable (i.e., PM 2.5 in this study), where i = 1, 2, . . . , n; β 0~β5 are the regression coefficients to be estimated from the data; the predictors X 1~X5 represent SO 2 , NO 2 , PM 10 , CO, and O 3 , respectively; and ε i is the random error term following the normal distribution with zero mean and constant variance, i.e., N(0, σ 2 I), with I denoting an n × n identity matrix. The model coefficient vector β T = [β 0 , β 1 , . . . , β 5 ] of OLS is estimated bŷ where the superscript T denotes the transpose of a matrix, X and Y are the vectors of predictors and response variables, respectively. The OLS regression represents a universal or constant relationship between predictors and response variables across the entire study area. The traditional OLS model was not only a global model but also a fixed-effects model for the predictors (SO 2 , NO 2 , PM 10 , CO, and O 3 ), which were the only levels or factors under consideration for the statistical inference. In contrast, the random-effects were the factors that were randomly selected from an infinite population of the possible levels or factors and could vary if the experiment was implemented for another time thus that the statistical inference was direct towards the entire population of factor levels, not just those "random" levels that were incorporated into the experiment. In this study, the 13 cities (region) were considered as the random-effects because they were one subset of the cities in Heilongjiang Province. Thus, the linear mixed model was used to incorporate both fixed-effects (SO 2 , NO 2 , PM 10 , CO, and O 3 ) and random-effects (13 cities), as well as account for time series observations across the 4 year periods. LMM was an extension of the linear models, and can be expressed as: where Y is an n × 1 column vector of the response variable, X is an n × p matrix of the (p − 1) predictors with the first column of 1 for estimating the intercept coefficient, n is the number of sample observations, p is the number of fixed-effects parameters, β is an p × 1 vector of unknown fixed-effects parameters, Z is a known n × q design matrix for the q random-effects, γ is an q × 1 vector of unknown random-effects parameters, and ε is an n × 1 vector of the random model errors. LMM follows several assumptions: where E(·), Var(·), and Cov(·) denote expectation, variance, and covariance, respectively; "Normal" represents "following a normal distribution". The variance of Y can be estimated by random-effects design matrix Z, and covariance matrices G and R. The estimates of the fixed-effects and random-effects parameters can be expressed by Equations (5) and (6), respectively, whereĜ,R, andV are the reasonable estimates of G, R, and V, respectively, from the data.

GWR Model and Parameter Estimation
GWR model extends the traditional ordinary least square regression from a global to local framework [24], and can be expressed as follows: where Y i is the response variable, (u i , v i ) denotes the coordinates of the location i in space, β 0 (u i ,v i ) and β k (u i ,v i ) represent the intercept and a set of (p − 1) slope parameters at the location i, respectively. X ik represents a set of (p − 1) predictors (k = 1, 2, . . . , p−1) at the ith location, p is the total number of parameters to be estimated, ε i is the error term of location i. Comparing to the "fixed" coefficient estimates of the global OLS model (Equation (1)), GWR captures the spatial heterogeneity using varied parameter estimates over space. GWR follows Tobler's first law of geography [20], and is calibrated using a locally weighted least squares approach, and the estimation of parameters is obtained by the following equation: where W(u i ,v i ) is an n × n weight matrix whose off-diagonal elements are zero and diagonal elements denote the geographical weighting of the neighboring observations for the focal observation i as follows: It is critical to select an appropriate weight matrix for estimating the parameters of GWR. The spatial weights can be estimated by a spatial kernel function, also called a distance-decay function. According to whether the bandwidth is varied, the 2 basic types of spatial kernels are fixed and adaptive kernels, which use fixed bandwidth and a fixed number of nearest neighbors within an adaptive bandwidth, respectively [33]. The commonly used spatial kernel functions include exponential kernel function, Gaussian kernel function, and bi-square kernel function [21]. In this study, the bi-square function was selected because it had the best (smallest) AICc for fitting the GWR model to the data. The adaptive kernel of bi-square function is defined as follows [34]: where d ij is the distance between locations i and j; h i is the bandwidth used to estimate parameters at location i. The optimal bandwidth is usually selected based on a goodness-of-fit criterion such as cross-validation or Akaike Information Criterion (AIC) [25]. The corrected AIC (AICc) approach was applied for the optimal bandwidth selection in this study.

GTWR and TWR Models
GTWR is an extension of GWR with temporal variations and incorporates both spatial and temporal heterogeneity in the data. The spatiotemporal nonstationary in the parameter estimates is captured by constructing the weight matrix based on spatiotemporal distances [24,25]. The GTWR model can be expressed as follows: The parameter β k (u i , v i , t i ) should be estimated for every predictor k and every space-time location i. The estimation of β k (u i , v i , t i ) is very similar to that in GWR (Equation (8)), and can be expressed as follows: Since GTWR considers both space and time, a spatial-temporal weight matrix is constructed based on a spatio-temporal distance. In this study, the Euclidean spatio-temporal distance was applied and determined as follows: where λ and µ are the scale factors in space and time metric system, respectively, which are used to balance the different effects that measure the spatial and temporal distance; t i and t j are the observed times at the locations i and j. In this study, an adaptive Gaussian distance-decay function was applied to construct a weight matrix. The weight matrix was still a diagonal matrix in GTWR and the diagonal elements are determined as follows [24]: where h 2 ST is a parameter of spatio-temporal bandwidth, and h 2 is the spatial distance, is the temporal distance. If no spatial variation exists in the observed data, the parameter λ would be zero (i.e., λ = 0) in Equation (13), that leads to the temporally weighted regression. TWR only considers the temporal variation based on the temporal distance. On the contrary, the parameter µ would be zero (i.e., µ = 0) in Equation (13) and GTWR would reduce to the traditional GWR without considering the temporal variation. Thus, both GWR and TWR are special cases of GTWR. All GWR-based local models (GWR, TWR, and GTWR) were performed using R package Gwmodel [35] under R version 3.5.1 [36].

Model Assessment
The model goodness-of-fit was evaluated by the corrected Akaike's information criterion (AICc) [21], adjusted coefficient of determination (R 2 a ), root mean squared errors (RMSE), mean absolute error (MAE), and Z score of the model residuals. AICc is one of the most commonly used goodness-of-fit criteria for model comparisons and defined as Equation (15). The smaller the AICc is, the better the model performs.
where n is the sample size,σ is the estimated standard deviation of the error term, and tr(S) denotes the trace of hat matrix S (i.e.,Ŷ = SY). The hat matrix is a function of bandwidth and defined as follows in the GWR model: The coefficient of determination (R 2 ) is another basic and widely used good-of-fit (Equation (17)). It represents the percentage of the total variation in the observed Y that is explained by the model. However, R 2 tends to exaggerate the explained percentage since it never decreases by adding more predictor variables. The adjusted coefficient of determination (R 2 a ) overcomes this drawback by dividing RSS and SST by their associated degrees of freedom (see Equation (18)). As multiple predictors were included, R 2 a was adopted.
where RSS is the residual sum of squares, and SST is the sum of squares of total variation of the response variable, n is the sample size, p is the number of coefficients (including all of predictor coefficients and intercept).
The R 2 statistics of LMM is more complicated than the fixed-effects models and attracted many scientists' attention [37,38]. In this study, the conditional R 2 (R 2 c ) was used, representing the variance explained by the entire model (including both fixed and random effects) [37], which can be expressed as follows: where σ 2 f , σ 2 r , and σ 2 ε represent the variance of the fixed-effects, random-effects, and model residuals, respectively.
In addition to AICc and R 2 a , the root mean squared errors (RMSE), mean absolute error (MAE), and Z score of the model residuals are calculated to evaluate model performance by Equations (20−22), respectively: where y i andŷ i are the observed and predicted values of the response variable, and std denotes standard deviation. The Moran's I is commonly used to investigate the spatial dependencies in the model residuals from each regression model [23]. Moran's I was positive when the model residuals tended to be similar, negative when they tended to be dissimilar, and approximately 0 when they arranged randomly and independently over space [39] and can be expressed as follows: where w ij is the diagonal elements of the spatial weight matrix, W is the sum of all w ij , e i denotes the residual at location i and e denotes the mean of residuals. The expected value of Moran's I under the null hypothesis of no spatial autocorrelation (i.e., randomization) is E(I) = −1 n−1 . In this study, the Moran's I of the model residuals were calculated and plotted against the number of nearest neighbors (i.e., bandwidth) to present the relationship between spatial autocorrelation of the residuals and bandwidth. The p-value (two-tailed) calculated for the null hypothesis test of randomization was also plotted against the number of nearest neighbors.

OLS and LMM
The OLS regression was used to model the relationships between PM 2.5 and 5 criteria air pollutants as the benchmark for model comparisons. OLS was global and fixed-effects in nature and represented the average relationship between the response variable and predictors. Table 2 lists the parameter estimates of the OLS model (Equation (1)). All model coefficients were statistically significant and positive for predicting PM 2.5 , except O 3 . The OLS model indicated that PM 2.5 increased as the 4 air pollutants (SO 2 , NO 2 , PM 10 , and CO) increased, while PM 2.5 increased as O 3 decreased. This was evident by the negative correlation between PM 2.5 and O 3 (Figure 2), which was consistent with a previous study [40]. Table 2 also indicated that PM 10 was the most influential factor on PM 2.5 because it had the largest standardized estimate (0.716). PM 2.5 would increase 0.52 µg/m 3 when PM 10 increased 1 µg/m 3 while keeping other predictors as constants. The OLS model fitted the data well according to the R 2 a , i.e., 85% of the total variation of PM 2.5 can be explained by the OLS model. The linear mixed model was applied for the 5 criteria air pollutants (SO 2 , NO 2 , PM 10 , CO, and O 3 ) as the fixed-effects and the 13 cities (or region) as the random-effects. The G matrix (Equation (4)) of the random-effects was estimated by the covariance structure of variance components (VC), while the R matrix (Equation (4)) of the model residuals was estimated by the covariance structure of first-order autoregressive (AR(1)). The covariance structures of VC and AR(1) can be expressed by Equations (24) and (25), 1 r r 2 r 3 r 4 r 5 r 1 r r 2 r 3 r 4 r 2 r 1 r r 2 r 3 r 3 r 2 r 1 r r 2 r 4 r 3 r 2 r 1 r r 5 r 4 r 3 r 2 r 1 where σ 2 1~σ 2 6 represent the variance of β 0 (intercept), β 1 (SO 2 ), β 2 (NO 2 ), β 3 (PM 10 ), β 4 (CO), and β 5 (O 3 ), respectively; σ 2 represents the variance of the model residuals, and r represents the first-order temporal autocorrelation. The parameter estimates of the G and R matrices were listed in Table 3. It seems that the temporal autocorrelations between the weekly observations were relatively high (r = 0.552) and significant.  Table 4 listed the parameter estimates of the fixed-effect (overall) of the LMM model. It indicated that all the coefficient estimates followed the same signs of the OLS estimates. The standard errors of the LMM coefficient estimates were much larger than those of the OLS estimates. This was because LMM treated the air pollutants as the fixed-effects but adjusted each coefficient estimate for each of the 13 cities (the random-effects), while OLS fitted an average model using the air pollutants, thus that underestimated the standard errors of the model coefficients. The LMM model explained 89.8% of the total variation of PM 2.5 in the data. The AICc (18,756.7) of the LMM model was much smaller than that (20,109.26) of the OLS model, indicating a much better model fitting by LMM over OLS.
The 13 cities (or region) were treated as the random-effects in the LMM model thus that the model coefficients can be derived for each of the 13 cities (or region), which were listed in the Supplementary Materials (Table S1) due to the page limit. Figure 4 represents the relationships between predicted PM 2.5 and the most significant predictor (PM 10 ) as an example (while keeping other predictors as constants of means). The solid black line was the overall or fixed-effects model, while the colored dotted lines were the 13 models for the 13 cities (region). All the regression lines had slightly different slopes compared with the slope of the overall model. Among the 13 cities (region), the slope coefficient β 3 (PM 10 ) of Jiamusi ( Figure 4a) and Shuangyashan (Figure 4b) were significantly greater (i.e., steeper slope) than that of the overall model, indicating that increasing the PM 10 level in Jiamusi and Shuangyashan would cause a much higher PM 2.5 than the other cities, especially for the higher concentrations of PM 10  The 13 cities (or region) were treated as the random-effects in the LMM model thus that the model coefficients can be derived for each of the 13 cities (or region), which were listed in the Supplementary Materials (Table S1) due to the page limit. Figure 4 represents the relationships between predicted PM2.5 and the most significant predictor (PM10) as an example (while keeping other predictors as constants of means). The solid black line was the overall or fixed-effects model, while the colored dotted lines were the 13 models for the 13 cities (region). All the regression lines had slightly different slopes compared with the slope of the overall model. Among the 13 cities (region), the slope coefficient β3 (PM10) of Jiamusi ( Figure 4a) and Shuangyashan (Figure 4b) were significantly greater (i.e., steeper slope) than that of the overall model, indicating that increasing the PM10 level in Jiamusi and Shuangyashan would cause a much higher PM2.5 than the other cities, especially for the higher concentrations of PM10. In contrast, Daqing (Figure 4a), Heihe (Figure 4a), and Suihua ( Figure  4b) had significantly smaller β3 (i.e., shallower slope) than that of the overall model, indicating that increasing the PM10 level in Daqing, Heihe, and Suihua would cause a lower increase in PM2.5 than the other cities.

Local Models (GWR, TWR, and GTWR)
The local models, including GWR, TWR, and GTWR, were used to explore spatial and/or temporal heterogeneity in PM2.5 and 5 criteria air pollutants across the study area. The adaptive bisquare kernel function was used as the spatial weighting kernel function in this study. The appropriate bandwidth (number of neighbors) was selected based on the smallest AICc. The

Local Models (GWR, TWR, and GTWR)
The local models, including GWR, TWR, and GTWR, were used to explore spatial and/or temporal heterogeneity in PM 2.5 and 5 criteria air pollutants across the study area. The adaptive bi-square kernel function was used as the spatial weighting kernel function in this study. The appropriate bandwidth (number of neighbors) was selected based on the smallest AICc. The parameters and model fitting information of GWR, TWR, and GTWR are summarized in Table 5. The signs and magnitudes of the all median coefficients of the three GWR-based models (GWR, TWR, and GTWR) were compatible with those of the OLS model. However, the TWR's model coefficients were more similar to those of GTWR, while GWR's model coefficients were slightly different from those of TWR and GTWR.  Table 6 showed the model goodness-of-fit (i.e., R 2 a and AICc), prediction accuracy (i.e., RMSE and MAE), and prediction uncertainty (i.e., the mean and standard deviation (Std) of the Z score of model residuals). The three GWR-based models fitted the data better than the OLS model (R 2 a = 0.846), with the order of TWR (R 2 a = 0.968), GTWR (R 2 a = 0.968), and GWR (R 2 a = 0.884). Both GTWR and TWR also performed better than the LMM model (R 2 c = 0.898), while GWR was slightly worse than LMM (Table 4). In the terms of AICc, TWR (17,944.31) fitted the data best, followed by GTWR (18,016.58), LMM (18,756.70), GWR (19,403.51), and OLS (20,109.26). The results indicated that the temporal autocorrelations in the 210-week time series data played a more important role than the spatial heterogeneity across the 13 cities (region). It was a bit surprising that LMM performed better than GWR in terms of goodness-of-fit. However, the prediction accuracy measures of GWR were slightly better than that of LMM in terms of RMSE (GWR 8.207 vs. LMM 8.482) and MAE (GWR 5.621 v.s. LMM 5.794). It was clear that TWR and GTWR were significantly superior to OLS, LMM and GWR according to all comparison criteria, i.e., they had higher R 2 a , smaller AICc, smaller RMSE, and MAE. TWR and GTWR reduced RMSE and MAE of the model residuals by 67% from the OLS model, while GWR only reduced RMSE and MAE by 15%. Table 6. Comparison of OLS, LMM, and GWR-based models (GWR, TWR, and GTWR). To investigate the spatial dependencies in the model residuals from each model, the Moran's I of the model residuals of OLS, GWR, TWR, and GTWR were calculated and compared. Figure 5 presents the Moran's I calculated using different bandwidths (i.e., number of neighbors) for all 5 models (Figure 5a). The positive Moran's I indicated that the model residuals were clustering in similar model residuals (either positive or negative), which may cause a serious violation of the independence assumption of the model residuals and lead to insufficient estimation of the model coefficients [39]. Therefore, the OLS model either under-predicted or over-predicted PM 2.5 across the study area. The Moran's I of the model residuals for all 5 models approached zero (i.e., randomness) with the increase of bandwidth (number of neighbors) and tended to be stable when the number of neighbors reached about 650. To investigate the spatial dependencies in the model residuals from each model, the Moran's I of the model residuals of OLS, GWR, TWR, and GTWR were calculated and compared. Figure 5 presents the Moran's I calculated using different bandwidths (i.e., number of neighbors) for all 5 models (Figure 5a). The positive Moran's I indicated that the model residuals were clustering in similar model residuals (either positive or negative), which may cause a serious violation of the independence assumption of the model residuals and lead to insufficient estimation of the model coefficients [39]. Therefore, the OLS model either under-predicted or over-predicted PM2.5 across the study area. The Moran's I of the model residuals for all 5 models approached zero (i.e., randomness) with the increase of bandwidth (number of neighbors) and tended to be stable when the number of neighbors reached about 650.

Num of Neighbor
However, the Moran's I of the LMM and GWR-based models were much smaller than those of OLS and approached zero in the opposite direction (negative) to that of OLS (positive), which indicated that the residuals of LMM and GWR-based models were dissimilar to OLS when the bandwidth was small (Figure 5a). Figure 5b focused on the Moran's I of LMM and GWR-based models only to see the differences between the 4 models. The LMM and GWR-based models generated little under-predictions or over-predictions for the patches of local areas and produced a significant reduction of spatial autocorrelation in the model residuals to various degrees by explicitly applying the local information among neighboring locations and/or time. The differences of Moran's coefficients among LMM, GWR, TWR, and GTWR were more obvious with small bandwidths than large bandwidths. The LMM and GWR-based models had almost the same and little spatial autocorrelation when the number of neighbors was greater than 850. TWR performed over GTWR, and then both over LMM and GWR when the number of neighbors was less than 250. The performances of TWR and GTWR were almost the same when the bandwidth was larger than 250. It was consistent with the results of Table 6 that used the optimized bandwidth.  Figure 6 presents the p-value (two-tailed) for the randomization null hypotheses test of the model residuals of OLS, LMM, GWR, TWR, and GTWR calculated using different bandwidths (i.e., number of neighbors). It indicated that the residuals of OLS were not random at any bandwidths since the p-value was much less than 0.05 (reject randomization null hypotheses) across all bandwidths. The residuals of LMM and all GWR-based models presented randomized character across all bandwidth except the residuals of LMM and GWR when the number of neighbors was smaller than 250. The magnitude of randomness gradually increased with the increase of the number of neighbors and followed by order of GWR, LMM, TWR, and GTWR. The TWR and GTWR had a However, the Moran's I of the LMM and GWR-based models were much smaller than those of OLS and approached zero in the opposite direction (negative) to that of OLS (positive), which indicated that the residuals of LMM and GWR-based models were dissimilar to OLS when the bandwidth was small (Figure 5a). Figure 5b focused on the Moran's I of LMM and GWR-based models only to see the differences between the 4 models. The LMM and GWR-based models generated little under-predictions or over-predictions for the patches of local areas and produced a significant reduction of spatial autocorrelation in the model residuals to various degrees by explicitly applying the local information among neighboring locations and/or time. The differences of Moran's coefficients among LMM, GWR, TWR, and GTWR were more obvious with small bandwidths than large bandwidths. The LMM and GWR-based models had almost the same and little spatial autocorrelation when the number of neighbors was greater than 850. TWR performed over GTWR, and then both over LMM and GWR when the number of neighbors was less than 250. The performances of TWR and GTWR were almost the same when the bandwidth was larger than 250. It was consistent with the results of Table 6 that used the optimized bandwidth. Figure 6 presents the p-value (two-tailed) for the randomization null hypotheses test of the model residuals of OLS, LMM, GWR, TWR, and GTWR calculated using different bandwidths (i.e., number of neighbors). It indicated that the residuals of OLS were not random at any bandwidths since the p-value was much less than 0.05 (reject randomization null hypotheses) across all bandwidths. The residuals of LMM and all GWR-based models presented randomized character across all bandwidth except the residuals of LMM and GWR when the number of neighbors was smaller than 250. The magnitude of randomness gradually increased with the increase of the number of neighbors and followed by order of GWR, LMM, TWR, and GTWR. The TWR and GTWR had a similar trend. In general, the p-value of the randomization null hypotheses test followed the same trend of Moran's coefficients since they all reacted to the spatial autocorrelation of the model residuals.

Discussion
Heilongjiang Province is the northernmost province in China, with the characters of high latitude, low temperature, four distinct seasons, and long and chilly winter. The cold winter leads to a very long heating season (from late October to March in the next year) in Heilongjiang. The industrial production, urban heating, emissions from unorganized pollution sources (such as straw burning) around cities, and motor vehicle exhaust emission mainly contributed to the air pollution of Heilongjiang [19]. Thus, effective measures should be implemented, such as the prohibition of open biomass burning, improvement of coal energy efficiency, and full use of clean fuels (nuclear, wind, and solar energy) for municipal heating [41].
In recent years, the Heilongjiang government has attached great importance to the fight of "blue sky defense" and made obvious progress in the prevention and control of air pollution. Fortunately,

Discussion
Heilongjiang Province is the northernmost province in China, with the characters of high latitude, low temperature, four distinct seasons, and long and chilly winter. The cold winter leads to a very long heating season (from late October to March in the next year) in Heilongjiang. The industrial production, urban heating, emissions from unorganized pollution sources (such as straw burning) around cities, and motor vehicle exhaust emission mainly contributed to the air pollution of Heilongjiang [19]. Thus, effective measures should be implemented, such as the prohibition of open biomass burning, improvement of coal energy efficiency, and full use of clean fuels (nuclear, wind, and solar energy) for municipal heating [41].
In recent years, the Heilongjiang government has attached great importance to the fight of "blue sky defense" and made obvious progress in the prevention and control of air pollution. Fortunately, the annual concentration of PM 2.5 from 2015 to 2018 presented a decreasing trend in most cities, except Jixi (Figure 7). The air pollutants are divided into 6 grades (I: Excellent; II: Good; III: Light pollution; IV: Moderate pollution; V: Heavy pollution; VI: Serious pollution) in terms of the Technical Regulation on Ambient Air Quality Index of China currently being tried [42,43]. Only Harbin (1 out of 13) exceeded the national annual PM 2.5 standards (≤35 µg/m 3 ) in all 4 years, but with a declining tendency within grade II. On the contrary, Heihe, Yichun, and Da Xing'an Mountain region were in excellent condition of air in all 4 years, which was consistent with the stable seasonal trend of the 6 air pollutants in Figure 3. The annual PM 2.5 concentration of the other cities oscillated around the grade I limit (35 µg/m 3 ). Some cities exceeded the grade I limit in 2015 and 2016 and then fell within the limit in the following 2 years (like Daqing, Hegang, and Mudanjiang), which indicated progress in the prevention and control of air pollution. However, Jixi, one of the most vital coal origins of the Heilongjiang Province, faced a more serious air pollution problem in 2017 than other years due to the poor air quality in heating seasons resulting from long and icy winters. This situation was alleviated in 2018. The provincial city Harbin had an even bigger air pollution problem than other cities because of the massive urban heating of around 9.5 million population, industry, and increased motor vehicle exhaust emissions every year. The phenomenon of spatial and temporal non-stationarity of PM 2.5 shown in Figure 7 is consistent with that in Figure 3. Recently, researchers have investigated the relationship between PM2.5 and various factors, such as aerosol optical depth (AOD) and normalized difference vegetation index (NDVI) derived from satellite imagery, meteorological factors (like temperature, wind speed, relatively humidity), transportation emission factors, density of industrial plants, land-use, gross domestic product (GDP), and digital elevation model (DEM) and so on [44][45][46][47][48][49][50]. These studies have led to an increasingly comprehensive understanding of PM2.5 and impact factors. Although PM2.5 correlates with many factors, it has the most direct correlation with the criteria air pollutants, especially PM10. It is meaningful to evaluate the spatial-temporal relationships between PM2.5 and criteria air pollutants since it can provide useful information on the PM2.5 concentration without direct PM2.5 monitoring, especially before 2015, when the PM2.5 monitoring network was sparse in Heilongjiang.
Meanwhile, PM2.5 modeling techniques vary according to the different predictors applied and objectives. In general, there are four major categories for modeling PM2.5: (1) Time series analysis and related statistical analysis (e.g., [16,[51][52][53][54][55]); (2) GTWR and its derivative models (e.g., [30,[56][57][58]); (3) machine learning method using plenty of predictors (e.g., [48,49]); (4) comprehensive approach by integrating several above methods (e.g., [46]). Each method has pros and cons. It is vital to choose an Recently, researchers have investigated the relationship between PM 2.5 and various factors, such as aerosol optical depth (AOD) and normalized difference vegetation index (NDVI) derived from satellite imagery, meteorological factors (like temperature, wind speed, relatively humidity), transportation emission factors, density of industrial plants, land-use, gross domestic product (GDP), and digital elevation model (DEM) and so on [44][45][46][47][48][49][50]. These studies have led to an increasingly comprehensive understanding of PM 2.5 and impact factors. Although PM 2.5 correlates with many factors, it has the most direct correlation with the criteria air pollutants, especially PM 10 . It is meaningful to evaluate the spatial-temporal relationships between PM 2.5 and criteria air pollutants since it can provide useful information on the PM 2.5 concentration without direct PM 2.5 monitoring, especially before 2015, when the PM 2.5 monitoring network was sparse in Heilongjiang.
In this study, the spatial and/or temporal heterogeneity of PM 2.5 concentrations of Heilongjiang Province were investigated using LMM, GWR, TWR, and GTWR models based on the criteria air pollutants from 2015 to 2018. The major problem of global models applied to environmental processes is that they assume the processes to be constant across space or time. The spatial or temporal effects (spatial/temporal autocorrelation and heterogeneity) may violate the assumptions of independent observation and/or invariant variance, which biases the estimates of standard errors and results in imprecise coefficient estimates [59,60]. Although LMM is a global method, it could deal with spatial and temporal dependence and heterogeneity using appropriate variances of random-effects and random errors (i.e., G and R matrices), thus obtaining a better model performance than OLS (5% improvement of adjusted R 2 ). However, the local model GWR did not perform better than LMM because it only handled spatial heterogeneity by "borrowing" the data from surrounding locations and ignored temporal information [25]. TWR and GTWR showed advantages over OLS by incorporating temporal heterogeneity (12% improvement of adjusted R 2 ). The obvious seasonal variation ( Figure 3) and decreasing tendency (Figure 7) lead to the high efficiency of incorporating temporal information in local models. However, the GTWR was not significantly superior to TWR in terms of the adjusted R 2 , AICc, RMSE, and MAE of the model residuals, which might result from the inadequate geographical locations (only 13 cities or regions). It was also the reason why the improvement of the GWR (3.8% of adjusted R 2 ) was so limited. It is more efficient to incorporate temporal information than spatial information in this case. The criteria air quality data from all 56 air monitoring stations across the entire Heilongjiang Province could be applied to model the spatial and temporal heterogeneity of PM 2.5 . However, it is a challenge to process a massive dataset with both spatial and temporal information due to the limited calculation capability of the computer. Sometimes, it is a prerequisite to balance between the calculation efficiency and data richness. We sacrificed some spatial information by using the citywide average concentrations since they were the averages of concentrations at all monitoring sites in each city, and also the daily concentrations of air pollutants reported to the public by the government [61]. We also sacrificed some temporal information by aggregating daily data into weekly data for balancing those two. Nevertheless, the TWR model based on 210 weekly data was sufficient to describe the relationship between PM 2.5 and the other 5 standard air pollutants in this study. Thus, it is scientific to apply the same policy for prevention and control of air pollution throughout the entire Heilongjiang Province with special attention paid to temporal changes.
Heilongjiang Province has begun to systematically and repeatedly measure the ambient air quality data (6 air pollutants) since 2015. How to reduce or eliminate the harm of PM 2.5 to public health has become one of the most challenging problems for air pollution prevention in Heilongjiang Province. PM 2.5 is inextricably related to other standard pollutants. Although some researchers have been investigated the spatial-temporal heterogeneity in the PM 10 -PM 2.5 relationship using GWR-based models (e.g., [62]), the relationship between PM 2.5 and criteria air pollutants (e.g., PM 10 , SO 2 , NO 2 , CO, O 3 ) is still unclear and the spatial-temporal heterogeneity in the relationship is still to be studied.

Conclusions
In this study, we investigated the relationships between PM 2.5 and 5 criteria air pollutants (i.e., PM 10 , SO 2 , NO 2 , CO, O 3 ) for 13 cities (or region) of Heilongjiang Province during 2015~2018 using global and graphically and temporally weighted regression (i.e., OLS, LMM, GWR, TWR, and GTWR). The daily data were integrated into weekly data due to excessive computation. The model performance and spatial autocorrelation of model residuals were compared. The results showed that all parameter estimates tended to be positive for predicting PM 2.5 , except O 3 . The LMM and all GWR-based models (i.e., GWR, TWR, and GTWR) showed great advantages over OLS in terms of model fitting, such as producing higher R 2 and more desirable model residuals, especially TWR and GTWR that incorporated temporal variation. The GWR, LMM, and TWR and GTWR improved the explanation of variance in PM 2.5 by 3%, 5%, and 12%, respectively, from the R 2 (0.85) of OLS. TWR yielded slightly better model performance, prediction accuracy and uncertainty accuracy than GTWR (smaller AICc, RMSE, MAE and standard deviation of Z score of model residuals), and also reduced RMSE and MAE of model residuals by 67% comparing to OLS model; while GWR only reduced RMSE and MAE by 14%~15% comparing to OLS. The traditional OLS and GWR were inadequate for describing the nonstationary of PM 2.5 . The LMM slightly performed better than GWR since it considered different locations as a random effect and meanwhile handled the repeated measurements using the R matrix, which provided an alternative solution besides the GWR-based models. Although GWR that incorporates the spatial non-stationarity of PM 2.5 improved the model performance of OLS, it is still far from TWR that incorporates temporal nonstationary of PM 2.5 . GTWR did not bring any improvements to TWR by adding spatial information because of the limited number of locations. The temporal heterogeneity is more obvious than spatial heterogeneity in this case. Thus, the incorporation of temporal information is inevitable and adequate for modeling the relationship between PM 2.5 and the other air pollutants in this study. This work provides evidence of spatial-temporal heterogeneity in the relationship between PM 2.5 and the other standard air pollutants, and also provides possible solutions for modeling PM 2.5 with the other air pollutants for Heilongjiang province. In addition, the localized model coefficients and predictions of the GWR models can provide spatio-temporal "hot spots" of PM 2.5 pollution, which should be useful for assisting the governmental agencies to pin-point the seriousness of air pollution or local emission in order to make better management decisions.