Geographic Graph Network for Robust Inversion of Particulate Matters

: Although remote sensors have been increasingly providing dense data and deriving reanalysis data for inversion of particulate matters, the use of these data is considerably limited by the ground monitoring samples and conventional machine learning models. As regional criteria air pollutants, particulate matters present a strong spatial correlation of long range. Conventional machine learning cannot or can only model such spatial pattern in a limited way. Here, we propose a method of a geographic graph hybrid network to encode a spatial neighborhood feature to make robust estimation of coarse and ﬁne particulate matters (PM 10 and PM 2.5 ). Based on Tobler’s First Law of Geography and graph convolutions, we constructed the architecture of a geographic graph hybrid network, in which full residual deep layers were connected with graph convolutions to reduce over-smoothing, subject to the PM 10 –PM 2.5 relationship constraint. In the site-based independent test in mainland China (2015–2018), our method achieved much better generalization than typical state-of-the-art methods (improvement in R 2 : 8–78%, decrease in RMSE: 14–48%). This study shows that the proposed method can encode the neighborhood information and can make an important contribution to improvement in generalization and extrapolation of geo-features with strong spatial correlation, such as PM 2.5 and PM 10 .


Introduction
As criteria air pollutants, particulate matters have adverse health effects [1] and important influence on climate changes [2][3][4]. For fine particulate matter (PM) with a diameter of 2.5 microns or less (PM 2.5 ), short-term exposure has been shown to be associated with respiratory symptoms, acute and chronic bronchitis, asthma attacks, and restricted activity days, etc. [5,6]. For particulate matter with a diameter of 10 microns or less (PM 10 ), short-term exposure has been reported to be associated with worsening asthma and chronic obstructive pulmonary disease etc. [7][8][9]. Long-term (months to years) exposure to PM 2.5 and PM 10 has been linked to premature death, reduced lung function and lung cancer, etc. [9,10]. On the other hand, as a particulate component of PM, black carbon-mainly from combustion-considerably contributes to climate warming, but as another component of PM, particulate sulfates cool the atmosphere of the Earth [11]. Accurate estimation of PM 2.5 and PM 10 is important for their emission track, prevention and control and health effect evaluation [12,13].
As ground aerosols, PMs are closely associated with aerosol optimal depth (AOD) that can be used as a critical input parameter for concentration estimation. There are a wide range of AOD remote sensing products, from the early Advanced Very High Resolution Radiometer (AVHRR), the Global Ozone Monitoring Experiment (GOME), the Moderate Resolution Imagining Spectroradiometer (MODIS), to the recent Visible Infrared Imaging Radiometer Suite (VIIRS) and Advanced Himawari Imager (AHI). Compared with the One shortcoming of conventional geostatistical and machine learning methods is the lack or the limited ability in modeling neighborhood information. As a recent deep learning method, the graph neural network (GNN) enables strong interaction modeling from the neighborhood through embedding learning of graph nodes. With a theoretical mathematical basis in spectral graph theory [56], GNN can be used to model complex geometric relationships and their interactions. As a powerful form of geometric deep learning, the GNN can well deal with irregular non-Euclidean data with limited labels, and has achieved many successful applications in a variety of domains including recommendation systems [57,58], physical systems [59], combinational optimization [60], computer vision [61], molecule findings [62] and drug discovery [63,64]. Given irregular monitoring data and complex interactions with environmental factors, GNN is an appropriate tool to encode the neighborhood information for PM pollutants. However, the typical graph network has a fixed network structure and its prediction is just limited to those nodes in the existing network. Such a transductive network cannot be used to make predictions to the unseen or new nodes in the graph, which seriously limits the applications of the graph network in many domains [65]. The existing applications of GNN in predicting PM [66,67] showed such a limitation for generalization and extrapolation.
This paper proposes a novel method of geographical graph hybrid network (GGHN) to generalize the neighborhood feature from the surrounding remote sensed data, and other spatiotemporal covariates to improve spatiotemporal inversion of PM 2.5 and PM 10 . Based on Tobler's First Law of Geography and local graph convolutions, we constructed the architecture of a geographic graph-level hybrid network to be a flexible inductive rather than transductive model for any unseen input data. Based on such a geography network, the convolutional kernel was also designed according to Tobler's law to encode a neighborhood feature through powerful embedding learning of the graph network [68]. In addition, full residual layers were concatenated with the graph convolution (GC) outputs to boost the learning and reduce over-smoothing deriving from graph convolutions. This paper showed robustness of the proposed geographic graph hybrid network for inversion of PM 2.5 and PM 10 in mainland China, and the proposed method can also be generalized to other similar geo-features that have strong spatial correlation and involve surrounding massive remote sensing data and other covariates.

Study Area
The study area of mainland China is located approximately between 18 • and 54 • north latitude and 73 • and 135 • east longitude, with a population of about 1.4 billion in 2016 and 9.6 million square kilometers ( Figure 1). The complex climate in the study area is affected by monsoon circulation and topography variability. The average air temperature is about 9.6 • C, the average annual total solar radiation is about 5.6 × 10 3 MJ/m 2 , the average annual precipitation is about 629.9 mm, the average relative humidity is about 68.0%, and the average wind speed is about 1.9 m/s [69][70][71]. The northerly wind blowing from the mainland to the ocean prevails in winter, and the southerly wind blowing from the ocean to the land prevails in summer [72]. Based on the reanalysis data [73], the study area has an average PBLH of about 591.9 m and an average cloud fraction of about 2.8%.
Air pollution is a major environmental issue in mainland China as a result of increasing industrialization and complex climate. PM 10 and PM 2.5 are two typical air pollutants, especially in the winter of mainland China. PM 2.5 mainly comes from combustion of gasoline, oil, diesel fuel or wood, cement production, etc. In addition to PM 2.5 emission sources, PM 10 also comes from dust from construction sites, landfills, agriculture, desert and atmospheric transportation [74], etc. In recent years, rigorous air-pollution controls have been taken to have a great effect in reduction of the PM 2.5 levels in the atmosphere [75].
1 Figure 1. The study region of mainland China with seven geographic regions, and the PM monitoring sites and those selected for the site-based independent testing.

PM Measurement Data
The hourly PM 2.5 and PM 10 measurement (unit: µg/m 3 ) data from 2015 to 2018 were gathered from 1594 monitoring sites of the China Environmental Monitoring Center (CNEMC) (http://www.cnemc.cn, accessed on 10 March 2020). PM 2.5 and PM 10 concentrations were measured via beta attenuation, tapered element oscillating microbalance method (TEOM), or TEOM with a filter dynamics measurement system (FDMS) [76,77]. These TEOM monitors measured PM 2.5 or PM 10 depending on the sampling head installed. For more technical details of the PM monitors, please refer to [76,78].
The raw hourly PM 2.5 and PM 10 measurements were first preprocessed to remove invalid values and outliers caused by instrument malfunction and measurement errors [79]. Then, the daily averages were obtained from the valid hourly data. In total, 1,988,424 daily measurement samples for PM 2.5 and PM 10 were obtained for mainland China. The spatial distribution of these sampling locations with their means are also shown (including the independent testing sites) in Figure 1.

Remotely Sensed Data
The advanced MAIAC AOD remote sensing data of 2015-2018 were collected from the NASA data sharing site (https://lpdaac.usgs.gov/products/mcd19a2v006, accessed on 18 March 2020). The daily data had a spatial resolution of 1 × 1 km 2 . In this study, due to a high correlation (0.51 vs. 0.30) with ground particulate matters, we also used the ground aerosol extinction coefficient (https://doi.org/10.7910/DVN/YDJT3L, accessed on 15 March 2021) [80], which was obtained by conversion from MAIAC AOD using planetary boundary layer height (PBLH) and relative humidity. The gaps of the MAIAC AOD data have been filled using powerful deep learning [81]. The normalized difference vegetation index (NDVI) and the enhanced vegetation index (EVI) 1 km data of 16-day intervals were obtained from NASA (https://modis.gsfc.nasa.gov/data/dataprod/mod13.php, accessed on 1 June 2020).

Geographic Zone
The geographic region datum (Figure 1) [82] was used to encode the region factor through seven binary (0 or 1) variables to include it in the model to account for the zonal variance in PMs.

Reanalysis Data
The coarse-resolution (0.625 • × 0.5 • ) MERRA-2 Global Modeling Initiative data (MERRA-GMI) were obtained from https://portal.nccs.nasa.gov/datashare/merra2_gmi (accessed on 1 September 2020). The dataset was generated through the simulation for the atmospheric composition coupling MERRA2 meteorological variables with the Global Modeling Initiative (GMI)'s stratosphere-troposphere chemical mechanism. The simulation is interactively coupled to the Goddard Chemistry Aerosol Radiation and Transport module, with inclusion of similar emissions for MERRA-2 [83]. Overall, 15 modeled gaseous air pollutants and particulate matter source contributions of MERR2-GMI and 6 MERRA2 parameters were selected given their acceptable correlation (absolute correlation ≥ 0.01). See Appendix A Table A1 for specific variables. In order to match the target spatial resolution (1 × 1 km 2 ), bilinear interpolation [84] was used as the resampling method to convert the coarse-resolution daily reanalysis data to fine-resolution data.

High-Resolution Meteorology and the Other Data
In addition to the reanalysis data, the high-resolution (1 km) surface meteorology data were also obtained from the high-resolution meteorological interpolation dataset of mainland China [85,86]. The full residual deep learning method [55] was used to interpolate the daily 1 × 1 km 2 grids of meteorological variables. In interpolation, the input variables included latitude, longitude, day of year, elevation, and meteorological reanalysis data (see [80] for technical details). The finely resolved dataset had high interpolation accuracy, which exactly matched the target spatial (1 km) and temporal (daily) resolution of this study. These high-resolution meteorology data included daily air pressure (hPa), air temperature ( • C), relative humidity (%), and wind speed (m/s).
We also gathered the other spatial and temporal variables, including the coordinates (x and y) and their derivatives (squares of x and y, and product of x and y), elevation, multiscale time indices (day of year and natural month). The elevation datum of 500 m resolution was obtained from Shuttle Radar Topology Mission (SRTM, https://www2.jpl. nasa.gov/srtm, accessed on 1 March 2019).

Preprocessing
The preprocessing included data cleaning, resampling, temporal alignment and normalization. Data cleaning was to remove noisy input data and improve model training.
Here, the loose outer fence [87] was defined as (mean − 5 × IQR, mean + 5 × IQR) (IQR: interquartile range) to filter out those noise or invalid measurement or covariate data that are not within the normal range. Due to complex meteorology and atmospheric chemical transmission, PM 2.5 and PM 10 may occasionally have extreme concentration values, so the PM measurement data may cover a relatively high value range [88,89]. Therefore, the loose intervals were defined here to remove noisy data [90]. A bilinear interpolation method was used to resample the data at different spatial and/or temporal resolutions into the standard target spatial (1 × 1 km 2 ) and temporal (daily) resolution data. Temporal alignment was used to match the data at different temporal resolutions (e.g., hourly, monthly and yearly) into daily resolution data. For estimation of PM 2.5 and PM 10 , log transformation was conducted to make them normally distributed. For deep learning, standard normalization [91] was conducted for the log-transformed PM variables (PM 2.5 and PM 10 ) and each covariate to make them have a standard normal distribution (mean: 0; standard deviation: 1). The PM normalization parameters were also used for the inverse normalization of the outputs so that they are converted back to the original value range.

Geographic Graph Network
The geographic graph network is defined as a graph-level neural network (G) in a geographic coordinate system, where each node corresponds to a spatial location or spatiotemporal point for the vector dataset, or one cell for the grid dataset: where V represents the set of graph nodes, E represents the set of edges connecting the nodes and C is the geographic coordinate or spatiotemporal coordinate system. Unlike the general graph neural network [92], where the connections between nodes can be determined manually or learned by learning through all the features, we can construct a geographic graph network according to Tobler's First Law of Geography: "everything is related to everything else, but near things are more related than distant things" [93]. The k nearest neighbor (k-NN) method can be used to retrieve k nearest neighbors based on geographic coordinates to construct a geographic graph neural network. As shown in Figure 2a, a target node can be linked with k immediate (one-hop) nearest neighbors to form a small local graph; then, each of the target node's neighbors can also be linked to its nearest neighbors (two hops). Thus, a small local graph of two levels can be formed by the target node and its one-hop and two-hop neighbors. Recursively, a small local graph of three or more levels can be constructed for the target node. Such multilevel interconnected nodes of all the target nodes make up a local geographic graph network.
For construction of a geographic spatiotemporal graph, the spatial coordinates and the time index first need to be normalized to be unitless. Thus, the spatiotemporal distance can be calculated based on normalized coordinates and temporal index, and k-NN can be used to retrieve k nearest nodes based on such spatiotemporal distances. As shown in Figure 2b, the nodes of three temporal slices (T − 1, T and T + 1) are used to retrieve nearest one-hop nodes for the target node in T (red square in Figure 2(II)). Similarly, the two or more hop neighbors for the target node can be retrieved recursively. Such interconnected multilevel neighbors form a small graph for the target node. All the interconnected nodes for all the target nodes make up a local spatiotemporal geographic graph network. Different from GraphSAGE [65], we limit the features used in k-NN to spatial coordinates or normalized spatiotemporal features. Based on Tobler's First Law of Geography, we defined the mean aggregate operator weighted by the reciprocal of spatial or spatiotemporal distance as: where i represents the index of the target node, N (i) denotes the set of the nearest neighbors for i, h k N (i) represents the generalized neighborhood feature of the kth graph convolution for i, h k−1 j denotes the output of the jth neighbor node of the k − 1 graph convolution, d ij is the spatial or spatiotemporal distance between i and j, m d N (i) denotes the function of weighted mean, k = 1, 2, . . . , K,K is the number of graph convolutions (the number of hops).
Then, the update function of the kth convolution layer is defined as: where h k−1 i represents the output of the k − 1th convolution, σ is the activation function (Rectified Linear Unit, ReLU), BN denotes batch normalization, W k l and W k r represent the parameter matrices of h k N (i) and h k−1 i , respectively. The last convolution has the 1-d output that represents the generalized neighborhood feature.
The algorithm of the geographic graph convolution minibatch forward is presented in Algorithm 1. The mean aggregator is almost equivalent to the convolutional messaging and propagation used in the fixed transductive graph convolution [94]. By introducing the weights of the distance reciprocal, linear transformation is conducted for the mean aggregator. This weighted convolutional aggregator is a rough, linear approximation of a localized spectral convolution. Through powerful embedding learning, this convolution is appropriate to capture spatial or spatiotemporal correlation features from the neighborhood data.

Concatenating with Full Residual Layers
In order to reduce the potential overfitting issue by graph convolutions [95], a graph hybrid network architecture is constructed ( Figure 3). Here, the output of graph convolutions is concatenated with the original input as the input of a full residual deep encoder-decoder [55]. The full residual deep network has a strong learning ability through full residual connections between the encoders and the decoders [81,96], and can strengthen the representation learning of features, thus reducing the overfitting in graph neighborhood learning. The full residual deep encoder-decoder has a symmetric network topology and consists of the input layer, encoding layers, a coding layer, decoding layers and the output layer. Each encoding layer has a corresponding decoding layer with the same number of nodes and a residual link is connected between them to enhance backpropagation of the error information through shortcuts in learning. For air pollution modeling, the sensitivity analysis of different network topologies (different layers and nodes for each layer) showed that the network topology with the number of nodes (512, 265, 128, 64,32,16,8,16,32,64,128, 256, 512) had good performance, which was measured by the high performance score in the test dataset.

Parameter Sharing Output Subject to the Relationship Constraint
As part of PM 10 , the concentration of PM 2.5 is always equal to or lower than that of PM 10 . In addition to the emission sources of PM 2.5 , PM 10 also comes from desert and construction dust, agriculture and atmospheric transformation. In order to make the model distinguish PM 2.5 and PM 10 well, we trained a model to predict the concentrations of PM 2.5 and PM 10 at the same time. Through parameter sharing and specification, the trained model has good generalization and the ability to distinguish between PM 2.5 and PM 10 . In addition, the PM 2.5 -PM 10 relationship constraint is encoded in the loss function, so that the model can make reasonable predictions for PM 2.5 and PM 10 . The loss function is defined as: where C PM 2.5 and C PM 2.5 represent the observed concentrations or their transformations (log-transformed and normalized) of PM 2.5 and PM 10 , respectively, RSE is the RSE loss function, f PM 2.5 and f PM 10 is the prediction functions for the transformations of PM 2.5 and PM 10 , respectively, N is the number of training samples, and λ r is the weight (defined as a value between 0 and 1, usually determined through sensitivity analysis) for e r , that is the constraint term of the relationship between PM 2.5 and PM 10 (PM 2.5 ≤ PM 10 ) for the prediction: that can be interpreted as: if f PM 2.5 (x) > f PM 10 (x), e r > 0 will lead to an increase in the loss, which in turn propagates back to change the parameters, thus making the loss smaller during the gradient descent optimization. By encoding (5) in the loss function, the trained model tries to maintain a reasonable relationship (PM 2.5 ≤ PM 10 ) when making predictions.

Evaluation
In order to evaluate the proposed method, regular stratified validation and a sitebased independent test were conducted. For the site-based independent test, about 15% of the monitoring sites were selected through stratified sampling for independent testing and the remaining 85% sites were used for regular training and testing ( Figure 1). Here, the geographic zone datum of mainland China was used as the stratifying factor; the seven geographic regions (zones) were shown in Figure 1. Any samples from the sites of the independent test were not used for model training, but only for the independent testing.
The regional and seasonal indices were used as the combinational stratifying factor for sampling in regular validation. The seasonal index was defined as spring (March, April and May), summer (June, July and August), autumn (September, October and November) and winter (December, January and February). Of all the samples of the 85% monitoring sites, 68% were used for model training and the other 32% were used for regular testing.
The performance metrics included R-squared (R 2 ) and root mean square error (RMSE) between predicted values and observed values. The training, testing and independent testing metrics were reported for PM 2.5 and PM 10 , respectively. Compared with testing in cross-validation, the site-based independent testing can better show the actual generalization or extrapolation accuracy of the trained models. From all the samples, we selected 20 datasets of different training and test samples using bootstrap sampling, and each set of samples was used to train a model. A total of 20 models were trained using 20 sets of samples, and their average performance metrics were summarized.  Table S1 also showed the descriptive statistics of the meteorological covariates of the monitoring sites involved in the modeling. From these daily samples, 283,719 samples were selected based on the stratified regional factor (seven geographic regions in Figure 1) from 230 sites for the site-based independent test. From the remaining samples, 1,159,199 were selected using the combinational stratifying factor of region and season for model training and 545,506 were used for testing in validation.

Selection of Significant Covariates
Correlation analysis was conducted for PM 2.5 /PM 10 and covariates. The absolute Pearson correlation of 0.01 was used to filter out the less useful covariates. In total, 35 covariates were selected as the model input from 41 candidate covariates (Figure 4 for their correlation and units). These covariates included four meteorology (air temperature, wind speed, air pressure, relative humidity), two aerosol covariates (MAIAC AOD and ground aerosol coefficient), NDVI and enhanced vegetation index (EVI), six MERRA2 variables (ozone, cloud fraction, PBLH, AOD, wind stagnation and mixing), an Aura ozone monitoring instrument (OMI) NO 2 , fifteen MERRA-GMI variables (daily satellite overpass fields: nitric oxide (NO), ozone (O 3 ), carbon monoxide (CO), NO 2 , sulfur dioxide (SO 2 ); aerosol diagnostics: organic carbon surface mass concentration, black carbon surface mass concentration, dust surface mass concentration-PM 2.5 , nitrate surface mass concentration PM 2.5 , SO 2 , nitric acid surface mass concentration; bottom layer diagnostics: NO 2 , NO, PM, PM 25 ), latitude and longitude, land-use areal proportion, and two traffic variables.

Modeling Performance
The total loss (Equation (4)) included PM 2.5 loss, PM 10 loss, and the PM 2.5 -PM 10 relationship loss. The learning curves of total loss, PM 2.5 loss and PM 10 loss showed a gradual downward trend ( Figure 5). Especially as the learning progressed, the relationship loss curve approached zero, indicating that the physical relationship of PM 2.5 ≤ PM 10 was maintained during the learning process. The learning curve of R 2 and RMSE in training, testing and site-based independent testing ( Figure 6) showed a trend of learning convergence. The sample size of the training dataset was very large (1,159,199), so a large number of learning epochs (250) was selected to ensure sufficient learning in the dataset to obtain a stable convergence state. After 250 learning epochs, the learning curve was approaching an optimal solution for the model. Through sensitivity analysis, we obtained the optimal solutions for the other hyperparameters, including a minibatch size of 2048, a learning rate of 0.01 and λ r of 0.5, respectively.  The optimal model was trained using the proposed method (  10 . The scatter plots between observed values and predicted values in the site-based independent testing (Figure 7) showed that most of the variance was captured by the trained model with few outliers. The scatter plot of observations and residuals ( Figure 8) showed a slight underestimation of extreme high values, which was normal for most regression models due to data measurement errors and modeling uncertainties [98]. The residuals presented normal distribution (Figure 9), and their averages were close to zero, indicating minimal bias in the independent test. The average SHapley Additive exPlanations (SHAP) [99,100] score of each covariate was summarized as a measure of feature importance (Supplementary Figure S1). Given that the proposed GGHN was a nonlinear modeling method, Pearson's linear correlation between each covariate and the target variable (PM 2.5 or PM 10 ) could not quantify such a nonlinear relationship. Compared with Pearson's correlation, the SHAP value better quantified the contribution of each covariate to the predictions.
Compared with other seven typical methods including a full residual deep network, local graph convolution network, random forest, XGBoost, regression kriging, kriging and a generalized additive model, the proposed geographic graph hybrid network improved test R 2 by 5-57% for PM 2.5 and 4-87% for PM 10 , and independent test R 2 by 8-57% for PM 2.5 and 8-78% for PM 10 ; correspondingly, it decreased test RMSE by 11-49% for PM 2.5 and 6-61% for PM 10 , and independent test RMSE by 14-46% for PM 2.5 and 15-48% for PM 10 10 ), which indicated more improvement in generalization and extrapolation than the two methods. Compared with generalized additive model (GAM), the proposed geographic graph hybrid network achieved the maximum improvement in testing (R 2 by 57% for PM 2.5 and 87% for PM 10 ) and independent testing (R 2 by 57% for PM 2.5 and 78% for PM 10 ).    In addition, without the downstream network structure of the full residual deep network (Figure 3c), the local graph convolutional network only had a low performance (independent testing R 2 : 0.65 for PM 2.5 and PM 10 ; RMSE: 20.98 µg/m 3 for PM 2.5 and 33.78 µg/m 3 for PM 10 ). The downstream full residual deep network improved the per-formance by about 29% in testing R 2 and about 26-28% in site-based independent testing R 2 . This showed that the downstream full residual deep network considerably reduced over-smoothing in the graph convolutional network. Without geographic graph convolutions, the full residual deep network just had a low performance (independent testing R 2 : 0.71-0.72). The addition of GCs improved the performance by about 4-5% in testing R 2 and about 15% in site-based independent testing R 2 . The results showed that geographic graph convolutions made a considerable contribution to the reduction in over-fitting and the improvement in generalization for the trained model. Statistics of the performance measures of the site-based independent test showed significant difference between regions (Supplementary Table S2). In particulate, compared with the other regions, Northwest China had fewer monitoring sites with complex terrain, and lower R 2 and higher RMSE.  (Figure 10j-l) had the highest PM 2.5 concentration, mostly located in Central and North China, and the local area of Urumqi. This winter day also had higher PM concentration than summer and autumn. The ratio of PM 2.5 to PM 10 in autumn and winter was higher than that in other seasons, indicating that many components of PM 10 in these two seasons were composed of PM 2.5 components.

Seasonal and Yearly Variation of the Predicted Surfaces
The daily and seasonal averages of predicted PM 2.5 and PM 10 surfaces were summarized for mainland China (Figure 11 for mean and median) and for seven geographic regions within it (Figures 12 and 13). Due to the influence of extreme high values, the seasonal trend of the mean was higher than that of the median; compared with the mean, the median was more robust to the extreme high values and presented a more stable seasonal trend. In addition, PM 10 showed much higher standard deviation than PM 2.5 (Supplementary Figure S2). PM 2.5 and PM 10 showed an overall downward trend year by year, and the concentration in winter was higher than that in summer. From 2015 to 2018, the annual average concentration decreased by about 15.6% for PM 2.5 and by 8.9% for PM 10 . The ratio of PM 2.5 to PM 10 decreased by about 5.4%, indicating that the decrease in PM 2.5 contributed more to the air pollution caused by particulate matters. The results were consistent with the focus of the "declare war on pollution" policy of the government of China [103,104], which was to reduce PM 2.5 concentration.     The pollution level of PM 2.5 was highest in central China, followed by eastern and northern China, as shown by seasonal changes (Figures 12 and 13). The pollution level of PM 10 was highest in northwestern China, followed by northern and central China. Southern and southwestern China had the lowest levels of PM 2.5 and PM 10 . Compared with the central, eastern and northern regions of mainland China, although the PM 2.5 pollution level in northwestern China was lower, its pollution level of PM 10 was the highest due to the impact of sand dust from the Taklimakan Desert, which was one of the main components of PM 10 in this area. The emission sources of PM 2.5 contained the combustion of gasoline, oil, diesel and wood burning, etc., in the developed regions such as central, eastern and northern China, leading to higher levels of PM 2.5 in these regions compared to the other, less-developed regions. In addition, seasonal statistics ( Figure 13) showed that compared with other regions with low PM 2.5 , the PM 2.5 decline in central, eastern, northern and northeastern China was more pronounced, and the PM 10 decline in central, eastern and northwestern China was more pronounced. Due to the impact of sandstorms from Mongolia, the decrease in PM 10 in northern China was relatively small compared to that in PM 2.5 . Figure 14 shows the enlarged grid surfaces of predicted PM 2.5 and PM 10 for four typical seasonal days and four representative regions: a spring day for the Jingjintang metropolitan area (a,b), a summer day for the surrounding area of Urumqi (c,d), a fall day for the Pearl River Delta (e,f) and a winter day for the Yangtze River Delta (g,h).  These enlarged 1 × 1 km 2 daily surfaces of predicted pollutants clearly showed spatial distribution of PM 2.5 and PM 10 concentrations and significant difference between the two. For the Jingjintang region, the PM 10 level in the whole area was high but the PM 2.5 pollution in the northwest area was low in the sandstorm day of 2015; the desert area of Xinjiang had a higher pollution level of PM than the other regions in the summer day of 2016; the Pearl River Delta had less PM pollution than other regions in the fall day of 2017; the Yangtze River Delta had more PM 2.5 pollution than PM 10 in the winter of 2018.

Discussion
This paper proposes a powerful deep learning method of a geographic graph hybrid network to model the neighborhood feature to improve the generalization and extrapolation accuracy of PM 2.5 and PM 10 . Using Tobler's First Law of Geography and local graph convolutions, the flexible hybrid framework was constructed based on spatial or spatiotemporal distances. Through powerful semi-supervised weighted embedded learning of graph convolutions, the neighborhood feature was learned from multilevel neighbors. Compared with seven representative methods, our geographic graph hybrid method substantially improved the generalization in R 2 by about 8-57% for PM 2.5 and 8-78% for PM 10 , as shown in the site-based independent test. Compared with the transductive graph network, the proposed method modeled the spatial neighborhood feature by a local inductive network structure, and thus was more generable for new samples unseen by the trained model. Compared with the-state-of-the-art methods such as random forest, XGBoost and full residual deep network, the proposed method achieved better generalization although their training performances were pretty similar. Compared with other deep learning methods, the stable learning processes of testing and site-based testing tend to converge as the index of learning epochs increases, and the fluctuations are small, indicating that the generalization has been improved. For remote areas in the study area, such as the northwestern region, compared with the other areas, there were fewer monitoring sites with complex terrain, and the site-based test performance was slightly lower, and the proposed method still worked. As far as we know, this is one of the first studies to propose the geographic graph hybrid network to improve the generalization and extrapolation of the trained model for PM 2.5 and PM 10 .
With the strong learning ability supported by automatic differentiation and embedded learning, the proposed geographic graph hybrid network has the ability to approximate arbitrary nonlinear functions [105]. Compared with traditional spatial interpolation meth-ods such as kriging and regression kriging, it better captured spatial or spatiotemporal correlation, without the need to satisfy the assumptions of second-order stationarity and spatial homogeneity [39,106], therefore substantially improving the generalization by about 15-51% in R 2 for PM 2.5 and about 17-49% in R 2 for PM 10 . Sensitivity analysis showed that three levels of graph convolutions with 12 nearest neighbors had an optimal solution for spatiotemporal neighborhood modeling of PM. The reduction in graph convolutions and/or the number of nearest neighbors reduced the generalization of the trained model. Although a further increase in graph convolutions can further improve the generalization ability of the trained model, this improvement is trivial for PM modeling and requires more intensive computing resources. This showed that compared with neighbors that were closer to the target geo-features, the remote neighbors beyond a certain range of spatial or spatiotemporal distance had limited impact on spatial or spatiotemporal neighborhood modeling.
As the results showed, although the full residual deep network had a performance similar to the proposed geographic graph method, it performed poorer than the proposed method in regular testing and site-based independent testing. In addition, there were considerable differences (≥10%) in the performance between the independent test and test (R 2 increased by about 4-5% vs. 15%; RMSE decreased by about 6-10% vs. 18-20%). This showed that the site-based independent test measured the generalization and extrapolation capability of the trained model better than the regular validation test. Sensitivity analysis also showed that the geographic graph model performed better than the nongeographic model in which all the features were used to derive the nearest neighbors and their distances. This showed that for geo-features such as PM 2.5 and PM 10 with strong spatial or spatiotemporal correlation, it was appropriate to use Tobler's First Law of Geography to construct a geographic graph hybrid network, and its generalization was better than general graph networks.
Compared with decision tree-based learners such as random forest and XGBoost, the proposed geographic graph method did not require discretization of input covariates [55], and maintained a full range of values of the input data, thereby avoiding information loss and bias caused by discretization. In addition, tree-based learners lacked the neighborhood modeling by graph convolution. Although the performance of random forest in training was pretty similar to the proposed method, its generalization was worse compared with the proposed method, as shown in the site-based independent test.
Compared with the pure graph network, the connection with the full residual deep layers is crucial to reduce over-smoothing in graph neighborhood modeling. The residual connections with the output of the geographic graph convolutions can make the error information directly and effectively back-propagate to the graph convolutions to optimize the parameters of the trained model. The hybrid method also makes up for the shortcomings of the lack of spatial or spatiotemporal neighborhood feature in the full residual deep network. In addition, the introduction of geographic graph convolutions makes it possible to extract important spatial neighborhood features from the nearest unlabeled samples in a semi-supervised manner. This is especially useful when a large amount of remotely sensed or simulated data (e.g., land-use, AOD, reanalysis and geographic environment) are available but only limited measured or labeled data (e.g., PM 2.5 and PM 10 measurement data) are available.
For PM modeling, the physical relationship (PM 2.5 ≤ PM 10 ) between PM 2.5 and PM 10 was encoded in the loss through ReLU activation and RSE. Compared with a model with a single output, a model with two or more output variables (such as PM 2.5 and PM 10 concentrations) has the advantage that the parameters in the geographic graph model can be shared and the PM 2.5 -PM 10 relationship can be embedded in the model. Sharing network parameters between different outputs also helps to reduce overfitting and improve generalization ability [107,108]. In particulate, the trained model can maintain a physically reasonable relationship between the output variables, which is important for the generalization and extrapolation of the trained model. Taking into account the significant differences in the emission sources and components of PM 2.5 and PM 10 , the concentration grid surfaces predicted by the trained model presented significant differences in spatial and seasonal changes between the two, which were consistent with observational data and mechanical knowledge [109]. Sensitivity analysis showed that a model with a single output (PM 2.5 or PM 10 concentration) and not restricted by the PM 2.5 -PM 10 relationship generated a few outliers with predicted PM 2.5 greater than predicted PM 10 , indicating that two or more shared outputs and the relational constraint between them made an important contribution to the correct predictions.
This study has several limitations. First, the unavailability of high-resolution meteorological data in certain regions and time periods may limit the applicability of the proposed PM 2.5 and PM 10 inversion method. However, based on the publicly shared measurement data of meteorological monitoring stations and coarse-resolution reanalysis data, reliable high-resolution meteorological data can be easily inversed by using existing deep learning interpolation methods [85,86]. In addition, the other high-resolution meteorological dataset can alternatively be used for the proposed method. For example, the Gridded Surface Meteorological (gridMET) Dataset [110] can be used to estimate PM 2.5 and PM 10 concentrations for contiguous U.S. Second, the proposed method only estimated the total concentrations of PM 2.5 and PM 10 , which was limited for accurately identifying the health risks of PM pollutants. The compositions and sizes of PM are different in different countries and regions, with different toxicity and health effects [102]. Accurate estimation of the hazardous components of the PM pollutants is important for downstream assessment of their health effects, and pollution prevention and control. However, considering the lack of expensive measurement data of PM constituents and their high regional variability, the inversion of PM compositions is really challenging. Third, although a total of 20 geographic graph hybrid networks were trained to obtain average performance, the training model had no uncertainty estimation, which was one of the limitations of this study.
In terms of future prospects, an extension of this research is to adapt the proposed method to effectively predict the most hazardous constituents of PM, in a semi-supervised manner, when only limited measurement data of PM constituents are available. Thereby the health risk of PM pollutants can be more accurately identified. Another future extension is uncertainty estimation, which is important as it can be provided as useful information for downstream applications. For the proposed method, the nonparametric bootstrapping method can be used to estimate the prediction error as an uncertainty measure. This method performs similarly to the parametric one, but it is widely used for various applications, including non-normal noise and nonlinear data, such as PM estimation.

Conclusions
This study presents a novel deep geometric learning method that combines a geographic graph network and a full residual deep network for robust spatial or spatiotemporal prediction of PM 2.5 and PM 10 . According to Tobler's First Law of Geography and local graph convolutions, compared with nongeographic models, the geographic graph hybrid network is constructed to be flexible, inducive and generalizable. The spatial or spatiotemporal neighborhood feature is encoded by local multilevel graph convolutions and extracted from the surrounding nearest sensed data from satellite and/or UAVs. Limited measured or labeled data of the dependent (target) variable(s) are then used to drive adaptive learning of the geographic graph hybrid model. The physical PM 2.5 -PM 10 relationship is also encoded in the loss function to decrease over-fitting and intractable bias in the prediction. In the national forecast of PM 2.5 and PM 10 in mainland China, compared with seven representative methods, the presented method considerably improves R 2 by 8-87% and reduces RMSE by 14-48% in site-based independent tests. With high R 2 of 0.82-0.83 in the independent test, the geographic graph hybrid method created the inversion of PM 2.5 and PM 10 at the high spatial (1 × 1km 2 ) and temporal resolution (daily), which was consistent with observed spatiotemporal trends and patterns. This study has important implications for high-accuracy and high-resolution robust inversions of geo-features with strong spatial or spatiotemporal correlation such as air pollutants of PM 2.5 and PM 10 .
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/rs13214341/s1: Figure S1: Bar plots of SHAP values of the trained model (a for PM 2.5 and b for PM 10 ); Figure S2: Time series plots of the standard deviations of predicted PM 2.5 and PM 10 concentrations across mainland China; Table S1: Statistics of meteorological factors for the PM monitoring sites; Table S2: Statistics of the performance metrics of the site-based independent test in mainland China and its geographic regions.

Acknowledgments:
The support of NVIDIA Corporation through the donation of the Titan Xp GPUs. The author acknowledges the contribution of Jiajie Wu for data processing.

Conflicts of Interest:
The authors declare no conflict of interest. Wind mechanical mixing, derived from wind speeds of single-level diagnostics. m/s w stag = u 2 10 + v 2 10 − u 2 2 + v 2 2