BresNet: Applying Residual Learning in Backpropagation Neural Networks to Predict Ground Surface Concentration of Primary Air Pollutants

Abstract: Monitoring air pollution is important for human health and the environment. Previous studies on predicting air pollutants from satellite imagery have employed machine learning, yet few have enhanced the structure of the model itself. Moreover, while the existing models have been successful in predicting pollutants such as PM2.5, PM10, and O3, they have been less effective in predicting the other primary air pollutants. To improve the overall prediction performance of the existing models, a novel residual learning backpropagation model, abbreviated as BresNet, is proposed in this research. This model can precisely predict the ground-surface concentrations of the six primary air pollutants, PM2.5, PM10, O3, NO2, CO, and SO2, based on the satellite imagery of MODIS AOD. Two of the most commonly used machine learning models, viz. the multilayer backpropagation neural network (MLBPN) and random forest (RF), were employed as controls. In the conducted experiments, the proposed BresNet model demonstrated significant improvements of 18.75%/31.94%, 33.82%/85.71%, 15.00%/35.29%, 39.06%/134.21%, 23.23%/68.00%, and 137.14%/260.87% in terms of R2 for the six primary air pollutants, compared to the RF/MLBPN models. Moreover, compared to the MLBPN, the loss function of the BresNet model at convergence decreased by 55.15%, revealing superior convergence speed with a lower loss.


Introduction
In the past two decades, climate change has become increasingly apparent. Simultaneously, the frequent occurrence of extreme weather events and shifting climate patterns has garnered widespread attention [1]. This phenomenon has a multitude of effects that extend beyond the boundaries of human production, daily life, the economy, and society; it also has the potential to render the global climate uninhabitable. For example, its consequences may include, but are not limited to, the deterioration of glaciers due to global warming, soil erosion caused by desertification, and so on [2][3][4]. These phenomena have demonstrated that the climate is becoming increasingly inhospitable to biological life [5]. Consequently, the capacity to meticulously observe and monitor the pollutant gases generated by anthropogenic activities is of paramount importance.
Moreover, it is of the utmost importance to gain a comprehensive understanding of the patterns of pollutant emissions and the magnitude of their dispersion. Nevertheless, the current ground-based pollutant monitoring and observation methods are constrained in their capacity to provide comprehensive data due to the dispersed nature of pollutant monitoring stations across the globe [6]. This approach is inadequate for fully understanding the continuous spatial distribution and temporal trends of major air pollutants over a wide region. Satellite remote sensing techniques have arisen as a viable and effective approach to overcoming the constraints inherent in the ground-based monitoring methods conventionally employed for atmospheric pollution surveillance. Simultaneously, satellite technology stands out due to its unparalleled efficiency, expansive coverage, and exceptional resolution. Therefore, in order to fully exploit satellite data and harness the synergies of ground and satellite monitoring, it is of utmost importance to develop an advanced model that correlates ground surface pollutant concentrations with satellite imagery.
Deep learning networks have a long history dating back to the 1940s, when the artificial neural network (ANN), inspired by biological neural networks, was first proposed. Over time, artificial neural networks have undergone significant evolution, progressing from basic perceptrons to complex deep learning neural networks [5]. They are adept at processing unstructured and unlabeled data, demonstrate exceptional performance in addressing nonlinear problems, and possess the capacity to learn from vast quantities of data. These characteristics, in conjunction with the recent surge in computational power and data accessibility, have propelled deep learning to the forefront of AI research and make it well suited to the vast quantity of remote sensing data. By leveraging deep learning techniques, models can be developed that accurately correlate ground surface pollutant concentrations with satellite imagery, unlocking new possibilities for environmental monitoring and pollution control. Such models can process vast amounts of satellite data, detect subtle changes in pollutant levels, and provide early warnings of potential environmental hazards [7]. The advancement of optical satellite remote sensing technology and the implementation of novel machine learning methodologies in the field of remote sensing have consistently demonstrated a robust correlation between aerosol optical depth (AOD) and primary atmospheric pollutants [8][9][10][11][12][13][14].
In general, most of the studies mentioned above applied common machine learning models to predict air pollution with no or merely a few minor adjustments. Hence, further research is needed to make more substantial improvements to the model structure to accommodate the characteristics of such vast quantities of data and achieve better predictive capabilities. Inspired by residual learning in computer vision, this research proposes a backpropagation residual learning model, abbreviated as BresNet. The BresNet model innovatively introduces a residual block structure into the optimizable backpropagation learning model and constructs a new algorithm to predict the ground surface concentrations of the six primary air pollutants, including the key particulate matters of PM2.5 and PM10 as well as the trace gases of O3, CO, NO2, and SO2. The prediction was mapped from the satellite data of MODIS AOD, the results of which can provide a scientific basis for characterizing air pollution and for formulating relevant environmental protection measures.

Data Collection and Preprocessing
In the field of environmental and climatic research, accurate data collection and meticulous preprocessing are paramount for drawing meaningful conclusions. This section delves into the methodology adopted for gathering and processing data pertinent to the research area.

Research Area
The Guanzhong Region, situated in the central part of Shaanxi Province, China, is notable for its warm, temperate continental monsoon climate, as demonstrated in Figure 1. Specifically, the average annual temperature fluctuates between 12 °C and 14 °C, with an annual average rainfall of 530 to 750 mm [15]. The region is predominantly influenced by northeastern winds, complemented by southwestern winds, resulting in an average relative humidity within the range of 60% to 70%. Geographically speaking, as illustrated in Figure 1, the Guanzhong Region is delimited by the Qinling Mountains in the south and the expansive Loess Plateau in the north. It extends from 106°56′E to 110°22′E in longitude and from 33°39′N to 35°52′N in latitude, spanning approximately 360 km from east to west, with an average elevation of approximately 500 m [16]. The terrain gradually slopes downward from west to east, showcasing a characteristic plain topography.
The Guanzhong Region, comprising five prominent cities (Xi'an, Baoji, Xianyang, Weinan, and Tongchuan) and spanning a vast area of 55,623 square kilometers, was selected as our research focus primarily due to its unique topographical characteristics and the critically high levels of air pollution it faces [17]. Cities such as Xi'an, Xianyang, and Weinan consistently rank among the bottom 20 of 168 key Chinese cities in terms of air quality, underscoring the urgency of our research. Accurate data collection relies on the extensive network of monitoring stations operated by both CNEMC and CMA-NOAA. Specifically, over 1500 monitoring stations across mainland China are managed by CNEMC, with 41 stations strategically located within the research area. Similarly, CMA-NOAA operates more than 300 cooperative monitoring stations in China, 4 of which are situated in our target region. This comprehensive monitoring setup ensured robust data collection for our scientific analysis (refer to Figure 1 for station locations). As illustrated in Figure 1, due to the limited number of NOAA stations in Guanzhong (only four), the ERA5-Land data were employed to enhance the precision of the dataset.

Data Collection
Effective deep learning applications hinge on a substantial amount of high-quality data. As evident in Table 1, this research utilizes datasets from various sources. These include station data provided by the China National Environmental Monitoring Center (CNEMC), and satellite imagery from the Google Earth Engine [18] and the United States Geological Survey [19]. Specifically, Moderate Resolution Imaging Spectroradiometer (MODIS) AOD imagery was incorporated in this research. Additionally, meteorological data from ground-based monitoring stations operated by both the China Meteorological Administration (CMA) and the National Oceanic and Atmospheric Administration (NOAA) in the United States, as well as reanalysis meteorological data produced by the European Centre for Medium-Range Weather Forecasts (ECMWF) in Europe, were integrated. Importantly, all these datasets are publicly accessible. In this research, daily land AOD data from MODIS satellite imagery were employed, specifically those from the MCD19A2 dataset. The dataset offers spatial resolutions spanning from 250 to 1000 m, thereby ensuring comprehensive spectral coverage. MODIS instruments observe the Chinese mainland frequently, with one or two observations per day [14].
CNEMC, a highly esteemed institution under the Ministry of Ecology and Environment of China, engages in comprehensive environmental surveillance across a multitude of fronts, including air, water, soil, and more. The platform provides real-time monitoring of key pollutants, including PM2.5, PM10, and others, the data of which can be accessed by the public via the following link: http://www.cnemc.cn (accessed on 27 June 2024). Furthermore, the National Oceanic and Atmospheric Administration (NOAA), the U.S. meteorological authority, publicly releases meteorological data for mainland China every three hours through a collaboration with the China Meteorological Administration (CMA), which is accessible via the following link: https://gis.ncdc.noaa.gov/maps/ncei/cdo/hourly (accessed on 27 June 2024) [20].
Meanwhile, an open-access reanalysis dataset, ERA5-Land, made public by the European Centre for Medium-Range Weather Forecasts (ECMWF), also provides highly reliable surface meteorological data. The ERA5-Land reanalysis dataset, with a spatial resolution of 9 km × 9 km and a temporal resolution of 1 h, utilizes atmospheric forcing and lapse rate correction to enhance data quality. The surface meteorological data are available at the following website: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land?tab=overview (accessed on 27 June 2024) [21]. This research employs these valuable datasets to enhance the remote-sensing-based prediction of ground surface air pollutant concentrations by incorporating atmospheric factors such as pressure, humidity, temperature, and wind conditions.

Data Preprocessing
Given the diverse range of data sources employed in this research, substantial variations exist in data formats and resolutions, necessitating rigorous preprocessing and the spatiotemporal alignment of the datasets. Firstly, to address outliers in the land station data, we systematically eliminated them by referencing the "Technical Regulation on Ambient Air Quality Index (AQI) (on trial)". To ensure the integrity of the satellite data, the quality control functionalities of GEE were harnessed to exclude samples significantly impacted by cloud cover. Through techniques such as mosaicking, projection transformation, resampling, reprojection, and cropping, high-quality satellite remote sensing imagery of the designated research area was acquired. Furthermore, Kriging interpolation was employed to generate delta values for the spatial meteorological data, which were used to adjust the meteorological data for the research region derived from the NOAA stations and the ERA5-Land measurements.
Secondly, inconsistencies in the observation timestamps, temporal resolutions, and spatial resolutions among the datasets necessitated spatiotemporal alignment. To reconcile temporal resolution discrepancies, the imaging time frame of each satellite was established as the temporal reference for alignment. Specifically, the air quality and meteorological data were synchronized based on the imaging time characteristics of the respective satellites. Since the MODIS satellite products provide daily averaged data, the hourly air quality and meteorological data were averaged into daily values to align the time scales of all the training data.
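As a minimal sketch of this temporal alignment step (illustrative only: the record format and field layout are assumptions, and the actual pipeline operates on the full multi-station dataset), hourly readings can be averaged into daily values as follows:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def daily_averages(records):
    """Average hourly (timestamp, value) records into daily means.

    `records` is an iterable of (ISO-8601 timestamp string, float) pairs,
    e.g. hourly PM2.5 readings from one station.
    """
    buckets = defaultdict(list)
    for ts, value in records:
        day = datetime.fromisoformat(ts).date()  # group by calendar day
        buckets[day].append(value)
    return {day: mean(vals) for day, vals in buckets.items()}

hourly = [
    ("2023-01-01T00:00", 35.0),
    ("2023-01-01T12:00", 45.0),
    ("2023-01-02T06:00", 20.0),
]
print(daily_averages(hourly))
```

The same grouping would be repeated per station and per variable before joining with the daily MODIS AOD imagery.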
Finally, in order to achieve spatial alignment, the near analysis algorithm (NAA) was employed to merge the CNEMC observations with the NOAA and ERA5-Land datasets. Moreover, the normalized difference vegetation index (NDVI) was directly matched using the NDVI band from the MOD13A2 product. By implementing these comprehensive preprocessing and alignment procedures, the integrity and comparability of the diverse datasets were ensured. Given the differences in data scales and measurement units, normalization of the data was necessary to minimize model errors. This research employed the min-max normalization method, which is defined in Equation (1):

I_n = (I - I_min) / (I_max - I_min), (1)

where I represents the original feature data, I_min and I_max represent the minimum and maximum values of the feature data, respectively, and I_n represents the normalized feature data.
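The min-max normalization of Equation (1) can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' code; the column-wise treatment and the guard for constant features are assumptions):

```python
import numpy as np

def min_max_normalize(features):
    """Min-max normalize each feature column to [0, 1], per Equation (1)."""
    i_min = features.min(axis=0)
    i_max = features.max(axis=0)
    # Guard against constant columns, where i_max == i_min would divide by zero.
    span = np.where(i_max > i_min, i_max - i_min, 1.0)
    return (features - i_min) / span

x = np.array([[10.0, 100.0],
              [20.0, 300.0],
              [30.0, 500.0]])
print(min_max_normalize(x))
```

Each column is scaled independently by its own minimum and maximum, so features measured in different units (e.g. hPa, °C, µg/m³) end up on the same [0, 1] scale.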

Methodology
This section presents a novel model architecture, viz. the backpropagation residual neural network (BresNet), which is based on the common multilayer backpropagation neural network (MLBPN) and founded upon a comprehensive preliminary analysis of recent advancements in machine learning models, with a particular focus on the predictive analysis of the spatial distributions of primary air pollutants. After a meticulous examination of the most advanced methodologies currently available, a deep learning network architecture rooted in residual learning was formulated. The proposed architecture was designed to address the limitations of traditional deep networks, in particular the vanishing gradient problem and the need to improve training efficiency.

Structure of MLBPN
The MLBPN has been employed with increasing frequency over the past decade for nonlinear regression and prediction work in a variety of fields, offering advantages in addressing nonlinear system problems. As illustrated in Figure 2, the MLBPN model comprises an input layer, multiple hidden layers, and an output layer. All layers are fully connected (FC) layers [22]. The input layer receives the feature data, which are then processed through the hidden layers, and finally produces an output through the output layer. Each layer comprises numerous neurons, which execute autonomous computations, as exemplified in Equation (2):

X(n+1) = σ(W X(n) + b), (2)

where X(n+1) and X(n) denote the output matrices transmitted to the next layer and the input matrices received from the preceding layer, respectively. Here, σ signifies the activation function employed by the current layer, while W and b represent the weight matrix and bias vector, respectively.
These neurons undergo weighting and activation function computations, wherein neurons situated in adjacent layers are fully interconnected. Subsequently, the connection weights are adjusted layer by layer, from the output layer to the input layer, to minimize the error between the target and actual outputs. During backward propagation, this error is propagated back through the network, and the weights are adjusted accordingly. Ultimately, the weight parameters are optimized continuously through the calculation of the loss function with automatic differentiation in order to reach the optimal parameters.
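A single layer of Equation (2) can be sketched as follows (an illustrative NumPy sketch with arbitrary layer sizes, not the paper's implementation):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def fc_forward(x, w, b, activation=relu):
    """One fully connected layer: X(n+1) = sigma(W X(n) + b), per Equation (2)."""
    return activation(w @ x + b)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # X(n): four input features
w = rng.standard_normal((3, 4))   # W: weight matrix mapping 4 inputs to 3 neurons
b = rng.standard_normal(3)        # b: bias vector
print(fc_forward(x, w, b).shape)  # (3,)
```

Stacking such layers, with training adjusting W and b via backpropagation, yields the MLBPN described above.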

Establishment of BresNet Model
The continuous development of machine learning and deep learning, which has been a significant area of interest in the last decade, has led to a reassessment of the nonlinear prediction abilities of traditional artificial neural networks. The capacity of deep learning to discern intricate patterns and representations from voluminous data has transformed numerous fields, including remote sensing and environmental studies. In general, the predictive performance of the MLBPN improves with an increased number of neurons in the model. However, simply increasing the number of neurons or the depth of the layers will lead to the phenomenon of gradient vanishing or gradient explosion, which will, in turn, lead to unstable model training results [23]. Consequently, due to this depth limitation, the conventional MLBPN faces constraints that hinder improvements in the model's predictive efficacy through incremental depth augmentation.

Residual Block
The residual block, an extremely powerful construct produced during the evolution of deep learning, has significantly enhanced the performance of convolutional neural networks in the field of computer vision [23]. Therefore, inspired by ResNet, the first deep learning network utilizing residual learning, we introduced residual blocks into a backpropagation neural network with only FC layers and constructed a new residual block, as shown in Figure 3b. Compared with the three FC layers shown in Figure 3a, in the residual block we proposed adding a shortcut connection (SC) layer after the three FC layers, which can be expressed with the following Formula (3):

G(X) = F(X) + X, (3)

where F(X) represents the matrices processed by the three FC layers, X denotes the original feature data (processed by the SC layer where a dimension adjustment is required), and G(X) signifies the matrices output to the next layer.
In the context of predictive modeling, it is common for fully connected layers to have varying numbers of nodes, or neurons, across different layers. This disparity in node count precludes the direct summation of the input of a residual block with its output, as such an operation would be dimensionally inconsistent. Consequently, the introduction of an intermediary fully connected layer becomes crucial for adjusting the channel size, analogous to the function served by a 1 × 1 convolutional kernel in GoogLeNet [24], a convolutional neural network (CNN). By treating the number of nodes in the fully connected layer as akin to the channel size in a CNN, we can gain a deeper understanding of the role played by this critical component. This layer ensures compatibility between the dimensions of the residual input and the main pathway, facilitating the effective integration of information. The detailed mechanism underlying this process is elucidated in Figure 4, providing a comprehensive visualization of the specific workings of this architecture.
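The residual block with its dimension-matching shortcut can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the layer widths are invented for the example, and the single-FC-layer shortcut projection stands in for the intermediary layer described above; it is not the authors' implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, fc_params, sc_w, sc_b):
    """Residual block: three FC layers plus a shortcut connection (Formula (3)).

    Because the block output width (12) differs from the input width (8),
    the shortcut uses its own FC layer (sc_w, sc_b) to project the input,
    playing the role of a 1x1 convolution, so that the sum is well defined.
    """
    out = x
    for w, b in fc_params:            # F(X): the three FC layers
        out = relu(w @ out + b)
    shortcut = sc_w @ x + sc_b        # SC(X): dimension-matching projection of X
    return relu(out + shortcut)       # G(X) = F(X) + SC(X)

rng = np.random.default_rng(0)
dims = [8, 16, 16, 12]                # illustrative widths: 8 -> 16 -> 16 -> 12
fc_params = [(rng.standard_normal((o, i)) * 0.1, np.zeros(o))
             for i, o in zip(dims[:-1], dims[1:])]
sc_w, sc_b = rng.standard_normal((12, 8)) * 0.1, np.zeros(12)
y = residual_block(rng.standard_normal(8), fc_params, sc_w, sc_b)
print(y.shape)  # (12,)
```

When the input and output widths match, the projection can reduce to the identity, recovering the plain G(X) = F(X) + X form.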

Model Design
As previously mentioned, the incorporation of residual blocks to enhance the depth of the MLBPN led us to introduce BresNet, a novel backpropagation neural network based on residual learning, as illustrated in Figure 5. BresNet's architecture is meticulously designed, consisting of three primary modules: the input module, the deep feature learning module, and the output module. The input module serves as the foundation for incoming data and begins with a splitting function that divides the input data into two streams. One stream is fed directly into the residual blocks of the deep feature learning module, while the other is routed to the shortcut layer. This design allows for a direct comparison between the initial and transformed features, facilitating the learning of residual functions.

The deep feature learning module lies at the heart of BresNet and comprises six residual blocks, each carefully designed to extract complex features from the input data. Following the residual blocks, a shortcut layer is introduced. This layer ensures that the number of neurons remains consistent with the initial input, allowing for a seamless integration of the residual features learned by the blocks. The shortcut layer effectively bridges the gap between the original input and the transformed output, enabling the network to learn the residual mappings more efficiently.
The final component of BresNet is the output module. It consists of three FC layers, each designed to learn from the rich feature representations output by the deep feature learning module. These FC layers gradually distill the features, preparing them for the final output layer. The output layer, in turn, produces the predicted pollutant values, leveraging the deep features learned by the preceding layers.
Therefore, BresNet boasts a total of thirty FC layers, exhibiting a depth on par with that of state-of-the-art convolutional neural networks. To enhance the network's nonlinearity and learning capabilities, all activation functions within BresNet utilize ReLU. This choice of activation function contributes to the efficiency and effectiveness of the model, enabling it to handle the complex relationships and patterns inherent in the input data.
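The overall data flow (input split, six residual blocks, a width-restoring shortcut layer, and a three-layer FC output head) can be sketched schematically as follows. This is an illustrative NumPy sketch under stated assumptions: the per-layer neuron counts are invented for the example, the blocks here use identity shortcuts because their widths are kept equal, and the real model is of course trained with backpropagation rather than evaluated with random weights.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def fc_layer(rng, n_in, n_out):
    """Create one fully connected layer as a (weight, bias) pair."""
    return rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out)

def apply_layers(layers, x):
    for w, b in layers:
        x = relu(w @ x + b)
    return x

def residual_block(layers, x):
    # Three FC layers plus an identity shortcut (widths kept equal in this sketch).
    return relu(apply_layers(layers, x) + x)

def bresnet_forward(x, blocks, shortcut, output_layers):
    main, skip = x, x                         # input module: split into two streams
    for block in blocks:                      # deep feature learning: six residual blocks
        main = residual_block(block, main)
    w, b = shortcut                           # shortcut layer keeps the width equal to
    main = relu(w @ main + b) + skip          # the input, so the original features add back
    return apply_layers(output_layers, main)  # output module: three FC layers

rng = np.random.default_rng(0)
n_features = 16
blocks = [[fc_layer(rng, n_features, n_features) for _ in range(3)]
          for _ in range(6)]
shortcut = fc_layer(rng, n_features, n_features)
output_layers = [fc_layer(rng, n_features, 8),
                 fc_layer(rng, 8, 4),
                 fc_layer(rng, 4, 1)]         # final layer yields one pollutant value
print(bresnet_forward(rng.standard_normal(n_features), blocks, shortcut, output_layers).shape)
```

In the actual model, one such network is trained per pollutant, mapping the normalized AOD and meteorological features to a single concentration value.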

Model Optimization
In the process of optimizing a deep learning network, it is of paramount importance to consider the role of the activation functions, loss functions, and optimizers. It is reasonable to posit that the implementation of the aforementioned modules, or enhancements to their existing functionality, will result in an improvement in the predictive performance of the system. This section will therefore concentrate on the selection or optimization of these three modules.

Activation Function in Forward Propagation
Generally, each neuron needs an activation function, enabling it to process complex nonlinear calculations. Among the mainstream activation functions, the sigmoid, tanh, and rectified linear unit (ReLU) are defined in Formulas (4)-(6), respectively:

sigmoid(x) = 1 / (1 + e^(-x)), (4)

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)), (5)

ReLU(x) = max(0, x), (6)

In Figure 6, the graphs of the three functions are presented in (a), (b), and (c). It can be observed that the sigmoid and tanh functions exhibit similar characteristics, and thus these two functions are frequently employed in classification problems.
rectified linear unit (ReLU) are defined in Formulas ( 4)-( 6), respectively: In Figure 6, the graphs of the three functions are presented in (a), (b), and (c).It can be observed that the sigmoid and tanh functions exhibit similar characteristics, and thus, these two functions are frequently employed in classification problems.
Notably, the ReLU activation function, in its simple form, allows for fast evaluation, which significantly reduces the overall computational costs during training and inference, and effectively mitigates the vanishing gradient problem encountered in deep networks.By ensuring a constant gradient of 1 for positive inputs, it maintains healthy gradient flow during backpropagation, thereby facilitating faster convergence.Additionally, due to its unique function, some neurons are programmed to output a value of zero, which is analogous to implementing a dropout operation.This phenomenon can enhance the model's generalization capabilities and reduce overfitting by promoting regularization.Consequently, the activation function utilized for all fully connected layers within our residual block is ReLU, with the objective of ensuring the stability of training.

Loss Function
When considering the implementation of a model, it is essential to determine an appropriate measure of fit.The loss function then quantifies the discrepancy between the actual and predicted values of the target.In most cases, the loss function is non-negative, with smaller values indicating a smaller loss.The loss is zero for a perfect prediction.Notably, the ReLU activation function, in its simple form, allows for fast evaluation, which significantly reduces the overall computational costs during training and inference, and effectively mitigates the vanishing gradient problem encountered in deep networks.By ensuring a constant gradient of 1 for positive inputs, it maintains healthy gradient flow during backpropagation, thereby facilitating faster convergence.Additionally, due to its unique function, some neurons are programmed to output a value of zero, which is analogous to implementing a dropout operation.This phenomenon can enhance the model's generalization capabilities and reduce overfitting by promoting regularization.Consequently, the activation function utilized for all fully connected layers within our residual block is ReLU, with the objective of ensuring the stability of training.

Loss Function
When considering the implementation of a model, it is essential to determine an appropriate measure of fit.The loss function then quantifies the discrepancy between the actual and predicted values of the target.In most cases, the loss function is non-negative, with smaller values indicating a smaller loss.The loss is zero for a perfect prediction.Moreover, the loss function in regression tasks is typically classified into L1 loss and L2 loss based on the MAE and MSE, respectively: where n is the number of samples, y i is the true value, and ŷi is the predicted value for the i-th sample.
L2 loss, compared to L1 loss, is less robust, amplifying it significantly for errors greater than 1, and making the model more sensitive to outliers.Nevertheless, the L2 loss is differentiable at all points, providing smoother gradients during backpropagation, which aids in more efficient and stable optimization.Additionally, the L2 loss penalizes large errors more heavily than smaller ones, encouraging the model to prioritize reducing large prediction discrepancies.Overall, the choice between L1 and L2 losses depends on the specific requirements of the task and the characteristics of the dataset.

Optimizer of Backpropagation
In deep learning, optimizers play a pivotal role in adjusting the model's parameters in order to minimize the loss function and enhance the model's performance.Two of the most commonly utilized optimizers are stochastic gradient descent (SGD) and adaptive moment estimation (Adam) [25].
SGD is a fundamental optimization algorithm widely used in the field of machine learning.Contrary to batch gradient descent, which relies on the entire dataset to calculate gradients in every iteration, SGD refines the model parameters by utilizing merely a single sample or a small batch of samples.This method significantly diminishes the computational expense and facilitates the swifter convergence of the model.The updating principle for SGD is straightforward: Here, θ represents the model parameters, η is the learning rate, and ∇ θ J(θ; x(i), y(i)) denotes the gradient of the loss function, J, with respect to the parameters, θ, for a randomly selected sample (x(i), y(i)).
Adam represents an advancement compared to earlier optimization algorithms, including AdaGrad and RMSProp.Adam combines the momentum approach from SGD with RMSProp's adaptive learning rate, creating a robust and efficient optimizer [26].This optimizer adjusts learning rates for each parameter based on the first and second moments of gradients, and incorporates a momentum term to accelerate convergence by using a running average of past gradients.It also adapts the learning rate for each parameter based on the uncentered variance of its gradients, enabling dynamic step size adjustments.The update equations for Adam are as follows: here, g t represents the gradients of the loss function with respect to the parameters at time step t. m t and v t are the first and second moment predicts, respectively, which track the mean and squared mean of past gradients to adjust the update direction and step size.mt and vt are bias-corrected versions of these predictions to account for initial underestimation.β 1 and β 2 are hyperparameters that control the decay rates of these moment predictions.η is the learning rate determining the step size, while ϵ is a small constant added to prevent division by zero during parameter update, ensuring numerical stability.Collectively, these variables enable Adam to adaptively adjust the learning rate for each parameter, improving optimization performance.

Model Training
As outlined in Section 2.2., the near analysis algorithm is employed in order to merge CNEMC observations with NOAA and ERA5-Land datasets.Training and validation datasets were constructed using MODIS AOD, which were combined with the meteorological, ground surface concentrations of six primary air pollutants and NDVI values, respectively (please refer to Table 1 for further details).The final datasets comprise 15,577 groups from 2019 to 2022, with 485,015 data points in total, which were used for model training and validation purposes.
In constructing our network architecture, Adam was utilized as the optimizer, demonstrating its superior performance in the realm of non-linear prediction.With a precisely set learning rate of 0.001, rigorous model training was conducted employing k-fold crossvalidation (the k value is 5), with each fold enduring 1000 epochs for thorough assessment.The MSE was employed as the loss function, providing invaluable feedback from the original dataset during the backpropagation stage.To optimize computational efficiency, batch processing with the batch size of 256 was adopted.Additionally, a dynamic learning rate tuning system was incorporated that reduced the current learning rate when the epoch number of training progressed to 70% and 90% with a decay weight of 0.5.This sophisticated approach ensured comprehensive model optimization and optimized performance.
Concurrently, to establish a benchmark comparison, we constructed a predictive random forest model based on the findings from Li et al.'s study [27].This model was set up with 100 decision trees and a minimum leaf size of 5. Furthermore, as part of the control group for comparative analysis, a MLBPN featuring a hidden layer structure consisting of [15,15] nodes was included.To provide an additional point of comparison and further validate the regression capabilities of the model, an MLBPN without residual learning was also constructed, keeping its depth consistent with that of BresNet.To ensure a straightforward comparison with BreNet, the same hyperparameters, loss functions, and optimizers were used across all models.
All the experiments were conducted on a Windows 10 Professional 64-bit operating system with a 12th Gen Intel ® Core™ i5-12500 CPU configuration and NVIDIA GeForce GTX 1080Ti GPU configuration.

Evaluation Metrics
Evaluation metrics serve as benchmarks to assess the effectiveness of a model.To ensure a comprehensive analysis of our model's ability to address the limitations of traditional MLBPN, we adopted four statistical metrics: R 2 , RMSE, and MAE [28].Meanwhile, the mean squared error (MSE) was employed as the loss function, providing invaluable feedback from the original dataset during the backpropagation stage.The respective formulas are as follows: R 2 = MSE var(P T )vat(P G ) , ( 16) where P G and P T represent the predicted ground surface concentration of pollutants and the monitored value, respectively, and n denotes the number of samples.

Model Performances 4.3.1. Model Training Performance
To validate the impact of incorporating residual learning into the MLBPN, line graphs were constructed depicting the validation loss of three models that utilized MODIS AOD data to predict the ground surface concentration of PM 2.5 : the original MLBPN, the MLBPN with a depth equivalent to that of BresNet, and BresNet itself.These models were trained across varying epochs, and their performance is visually represented in Figure 7. Due to the k-fold (k = 5) cross-validation methodology being employed, the total epoch number reached 5000.Furthermore, each fold was trained independently from the others.
the monitored value, respectively, and n denotes the number of samples.

Model Training Performance
To validate the impact of incorporating residual learning into the MLBPN, line graphs were constructed depicting the validation loss of three models that utilized MODIS AOD data to predict the ground surface concentration of PM2.5: the original MLBPN, the MLBPN with a depth equivalent to that of BresNet, and BresNet itself.These models were trained across varying epochs, and their performance is visually represented in Figure 7. Due to the k-fold (k = 5) cross-validation methodology being employed, the total epoch number reached 5000.Furthermore, each fold was trained independently from the others.As illustrated in Figure 7, the validation loss declines as the depth of the MLBPN increases, yet the requisite number of epochs to achieve the optimal outcome also increases.The convergence speed of BresNet with the introduction of residual learning can be similar to that of the simple MLBPN while having a lower loss function (especially the third fold).Furthermore, consistent performance of BresNet was observed across different folds in particular, always demonstrating better convergence speed and lower loss function values compared to the other models.Figure 8 demonstrates similar results, with BresNet being the model with the smallest and most stable validation loss among the three models.As illustrated in Figure 7, the validation loss declines as the depth of the MLBPN increases, yet the requisite number of epochs to achieve the optimal outcome also increases.The convergence speed of BresNet with the introduction of residual learning can be similar to that of the simple MLBPN while having a lower loss function (especially the third fold).Furthermore, consistent performance of BresNet was observed across different folds in particular, always demonstrating better convergence speed and lower loss function values compared to the other models.Figure 8 demonstrates similar results, with BresNet being the model with the smallest and most stable validation loss among the 
three models.Table 2 presents the prediction performance of the three models, RF, MLBPN, and BresNet, training on six primary pollutants, based on the MODIS AOD data.The BresNet model demonstrates significant accuracy improvements over the MLBPN and RF, the overall performance of which, based on a comprehensive evaluation across R 2 , RMSE, and MAE metrics for six primary pollutants, showed enhancements of 42.88% and 69.40%, respectively.Table 2 presents the prediction performance of the three models, RF, MLBPN, and BresNet, training on six primary pollutants, based on the MODIS AOD data.The BresNet model demonstrates significant accuracy improvements over the MLBPN and RF, the overall performance of which, based on a comprehensive evaluation across R 2 , RMSE, and MAE metrics for six primary pollutants, showed enhancements of 42.88% and 69.40%, respectively.In Table 2, despite the RF model's already impressive performance in predicting the ground surface concentrations of PM 2.5 , PM 10 , and O 3 , our proposed model still offers slight improvements of 18.75%, 33.83%, and 15.00% in terms of the R 2 metric, respectively, in the predictions of these three pollutants, which were already accurately predicted by the RF model.Furthermore, our model significantly enhances the accuracy of results on the NO 2 , CO, and SO 2 gases, which were previously poorly predicted.Specifically, for R 2 , our model improved the prediction performance by 137.14% for SO 2 , and when compared to the MLBPN, our model's predictive performance for this pollutant increased by a staggering 260.87%.Notably, BresNet demonstrates superior prediction accuracy for gases like NO 2 , CO, and SO 2 , which were previously poorly predicted by other models.In summary, our model exhibits outstanding performance, especially in predicting SO 2 levels, achieving significant advancements, as evident in Figure 9 (in the graph is the percentage improvement in the mean value of R 2 , the RMSE, and the MAE), 
which clearly illustrates the substantial performance difference compared to BresNet.

Model Prediction Performance
As elaborated in Chapter 1, the core objective of our modeling efforts was to visually represent the spatial dissemination patterns of pollutants.To achieve this scientific endeavor, we mapped the spatial distribution of six primary pollutants' ground surface concentrations in Guanzhong, based on BrestNet, by utilizing MODIS AOD data.The aforementioned mappings are presented in Figures 11-14.In these figures, the pollutant concentrations are expressed in units of μg/m 3 , and the time span and time of day vary for each figure.

Model Prediction Performance
As elaborated in Chapter 1, the core objective of our modeling efforts was to visually represent the spatial dissemination patterns of pollutants.To achieve this scientific endeavor, we mapped the spatial distribution of six primary pollutants' ground surface concentrations in Guanzhong, based on BrestNet, by utilizing MODIS AOD data.The aforementioned mappings are presented in Figures 11-14.In these figures, the pollutant concentrations are expressed in units of µg/m 3 , and the time span and time of day vary for each figure.Upon scrutiny of the mapped data of MODIS AOD, in Figures 11 and 12, discernible patterns of pollution distribution manifest within the dense urban agglomeration of Guanzhong.Specifically, elevated pollution concentrations are predominantly clustered in distinct zones, namely southern Xianyang, northern Xi'an, the entirety of Weinan, central Baoji, and segments of southern Tongchuan.These high-pollution areas align with a lower-lying basins, geographically encircled by the expansive Loess Plateau in the north and the Qinling Mountains in the south.This peculiar configuration forms a trumpet- Upon scrutiny of the mapped data of MODIS AOD, in Figures 11 and 12, discernible patterns of pollution distribution manifest within the dense urban agglomeration of Guanzhong.Specifically, elevated pollution concentrations are predominantly clustered in distinct zones, namely southern Xianyang, northern Xi'an, the entirety of Weinan, central Baoji, and segments of southern Tongchuan.These high-pollution areas align with a lower-lying basins, geographically encircled by the expansive Loess Plateau in the north and the Qinling Mountains in the south.This peculiar configuration forms a trumpetshaped topographic entity, wherein the basin serves as the bell, funneling wind currents and pollutants.
A crucial aspect influencing pollutant accumulation within these low-lying basin regions is the consistent easterly and northeasterly winds sweeping through Guanzhong throughout the year.As these winds rush through the narrow eastern gateway of the basin, along the Weihe Plain, they carry pollutants that subsequently become entrapped due to the basin's lower elevation and the encompassing mountainous terrain.
In stark contrast, low-pollution zones are primarily situated in the rural fringes of the region.Characterized by rugged mountainous landscapes, these areas feature lush vegetative cover that serves as a natural filter for pollutants, minimal anthropogenic disturbances, and notably reduced industrial pollution.
Additionally, in the map of six primary pollutants' ground surface concentrations (Figures 11-14), the urban proximity of Xi'an and adjacent regions bordering Shanxi Province and Weinan City exhibit the highest concentration distributions of PM 2.5 , PM 10 , NO 2 , CO, and SO 2 .Conversely, O 3 demonstrates the lowest concentration in these locales.This divergence arises from the contrasting meteorological prerequisites for O 3 pollution formation: low humidity, low atmospheric pressure, elevated temperature, and minimal wind speed [29].These prerequisites are diametrically opposed to the relationship observed between other pollutants and meteorological variables, underscoring the intricacy of atmospheric dynamics and their profound impact on pollutant concentrations in urban milieus.
Simultaneously, to assess the model's generalizability, we employed MODIS AOD data from January 2023, which were not included in the training dataset, for prediction purposes.Additionally, we generated pollutant distribution maps for January 2022 as a control, as illustrated in Figures 13 and 14.
CO, and SO2.Conversely, O3 demonstrates the lowest concentration in these locales.This divergence arises from the contrasting meteorological prerequisites for O3 pollution formation: low humidity, low atmospheric pressure, elevated temperature, and minimal wind speed [29].These prerequisites are diametrically opposed to the relationship observed between other pollutants and meteorological variables, underscoring the intricacy of atmospheric dynamics and their profound impact on pollutant concentrations in urban milieus.Simultaneously, to assess the model's generalizability, we employed MODIS AOD data from January 2023, which were not included in the training dataset, for prediction purposes.Additionally, we generated pollutant distribution maps for January 2022 as a control, as illustrated in Figures 13 and 14.
As demonstrated in Figures 13 and 14, the similar types of pollutants' ground surface concentrations demonstrated a common spatial distribution tendency.While the specific distribution locations may vary, they are, in general, consistent with the previously analyzed resultant areas.The discrepancies in the predicted values for the concentration of the same types of pollutants are minimal, which is acceptable given the differing time periods.Hence, this indicates that BresNet has a certain degree of generalizability.

Discussion
In this research, we constructed RF and MLBPN models, adopting the same structure as outlined in Li et al.'s research, and introduced a novel approach by integrating residual learning into the MLBPN framework, giving birth to the BresNet deep residual network.One MLBPN also constructed with the depth comparable to that of BresNet, as a control variable.This advanced network was designed to predict the ground surface concentrations of key pollutants, including PM2.5, PM10, O3, NO2, CO, and SO2, utilizing MODIS As demonstrated in Figures 13 and 14, the similar types of pollutants' ground surface concentrations demonstrated a common spatial distribution tendency.While the specific distribution locations may vary, they are, in general, consistent with the previously analyzed resultant areas.The discrepancies in the predicted values for the concentration of the same types of pollutants are minimal, which is acceptable given the differing time periods.Hence, this indicates that BresNet has a certain degree of generalizability.

Discussion
In this research, we constructed RF and MLBPN models, adopting the same structure as outlined in Li et al.'s research, and introduced a novel approach by integrating residual learning into the MLBPN framework, giving birth to the BresNet deep residual network.One MLBPN also constructed with the depth comparable to that of BresNet, as a control variable.This advanced network was designed to predict the ground surface concentrations of key pollutants, including PM 2.5 , PM 10 , O 3 , NO 2 , CO, and SO 2 , utilizing MODIS AOD satellite images.
In terms of model training, since the new model was optimized mainly for MLBPNs, we conducted a comparative analysis between an MLBPN with two hidden layers, an MLBPN with the same depth as that of BresNet, and the BresNet ontology.BresNet demonstrated superior performance compared to the remaining two models, yet exhibited unstable validation loss during training.This phenomenon can be attributed to several factors.Firstly, the absence of denoising or interpolation for outliers in the data processing stage might have allowed these outliers to be learned as part of the loss function, thus destabilizing the training process.Secondly, the nature of residual learning, where input features are continuously learned, might have made the model more sensitive to outliers.Additionally, the adoption of the aggressive MSE loss function, while contributing to the observed instability, is acceptable due to its superior learning capabilities, which ultimately do not deteriorate the final results.
In terms of model performance, the trained MLBPN and RF models do not fully align with those reported in the previous study.However, we consider our results to be comparable, despite the MLBPN's performances being marginally lower.A crucial aspect that could explain these disparities lies in the training methodologies adopted.For the MLBPN model, the application of k-fold cross-validation can effectively enhance model interpretability, especially when juxtaposed with the conventional random splitting of the dataset into 80% for training and 20% for validation.For the RF model, conversely, the inherent bagging training characteristics imply that the influence of k-fold cross-validation may not be so efficient as the MLBPN model.It should be mentioned that both the MLBPN and RF model employed in this research are in the Python platform, which has an optimization logic distinct from that of the MATLAB platform.Therefore, their predictive results could be different to those of our previous studies, as reported in Li et al. (2024) [27].Importantly, the conducted experiments demonstrate that the proposed BresNet model has much higher predictive accuracy compared to that of the RF and traditional MLBPN model.
In addition, the randomized selection of input feature variables for training could serve as an ingenious method to enhance robustness.This approach resonates with the prevalent trend of multimodality in contemporary deep learning.Consequently, it is plausible to introduce multimodal training, expanding on the aforementioned broadening of the training scope.The ultimate aspiration is to construct a comprehensive model capable of predicting surface pollutant concentrations across vast temporal and spatial scales.Our long-term target is to develop a multimodal predictive model that can forecast surface pollutant concentrations for multiple pollutants, spanning extended time frames and wide geographical areas.This ambitious undertaking aligns with the latest advancements in deep learning and remote sensing technologies, paving the way for a more comprehensive and accurate understanding of environmental pollutants.

Conclusion
This research has achieved a milestone by successfully constructing a novel backpropagation neural network incorporating residual learning, viz.BresNet.The BresNet model showcases exceptional training results, distinguished by its swift convergence and remarkable prediction accuracy.In comparison to both the RF and common MLBPN models, known as the most widely employed models in the literatures published so far, the BresNet model reveals significant advantages with respect to R 2 , the RMSE, and the MAE.In the conducted experiments, the innovative model improved the R 2 metric of the six primary pollutant averages by 26.87% and 80.48% compared to the RF and MLBPN model, the RMSE metric average by 28.43% and 49.18%, and the MAE metric average by 29.32% and 51.67%, respectively.
More than a methodological advancement, this research provides a fresh perspective on surface pollutant inversion modeling, thereby making a modest yet impactful contribu-tion to the field.Nevertheless, in future research, there is still potential for enhancement in the training methodology and the overall model structure of our model, as described in Chapter 5.By presenting an innovative neural network framework, we aspire to make subsequent breakthroughs in satellite remote sensing inversion techniques.
In summation, the findings of this research hold the potential to inform critical decisions related to environmental conservation measures, helping us to enhance air quality and promote sustainable socioeconomic development, particularly in regions like Guanzhong Area where air pollution issues are extremely serious and such advancements are extremely necessary.

Figure 1 .
Figure 1.Distribution of NOAA and CNEMC monitoring stations in the research area.

Figure 1 .
Figure 1.Distribution of NOAA and CNEMC monitoring stations in the research area.

Figure 3 .
Figure 3.The diagram of the three full connected layers and the residual block; (a) three fully connected layers; (b) the residual block (ours).

Figure 3 .
Figure 3.The diagram of the three full connected layers and the residual block; (a) three fully connected layers; (b) the residual block (ours).

Figure 3 .
Figure 3.The diagram of the three full connected layers and the residual block; (a) three fully connected layers; (b) the residual block (ours).

Figure 4 .
Figure 4.The structure of the shortcut connections layer.

Figure 5 .
Figure 5.The architecture of the BresNet model.

Figure 5 .
Figure 5.The architecture of the BresNet model.

Figure 7 .
Figure 7. Line diagrams of validation loss for different models.

Figure 7 .
Figure 7. Line diagrams of validation loss for different models.

Figure 8 .
Figure 8. Histogram of minimum validation loss for different models.

Figure 8 .
Figure 8. Histogram of minimum validation loss for different models.

Figure 9 .
Figure 9. Percentage histogram of the performance of BresNet compared with that of different models.Scatterplots between the monitored values and the values predicted by BresNet using MODIS AOD data for air pollutants are presented in Figure 10.Similar scatterplots between the monitored values and the values predicted by the RF and MLBPN model using MODIS AOD data are provided in Figures A1 and A2 in Appendix A.

Figure 9 .
Figure 9. Percentage histogram of the performance of BresNet compared with that of different models.Scatterplots between the monitored values and the values predicted by BresNet using MODIS AOD data for air pollutants are presented in Figure 10.Similar scatterplots between the monitored values and the values predicted by the RF and MLBPN model using MODIS AOD data are provided in Figures A1 and A2 in Appendix A.

Figure 9 .
Figure 9. Percentage histogram of the performance of BresNet compared with that of different models.Scatterplots between the monitored values and the values predicted by BresNet using MODIS AOD data for air pollutants are presented in Figure 10.Similar scatterplots between the monitored values and the values predicted by the RF and MLBPN model using MODIS AOD data are provided in Figures A1 and A2 in Appendix A.

Figure 10 .
Figure 10.Scatterplots between the monitored values and values predicted by the BresNet model using MODIS AOD data for the ground surface concentrations of various primary air pollutants: (a) PM 2.5 , (b) PM 10 , (c) O 3 , (d) NO 2 , (e) CO, and (f) SO 2 .

Figure A2 .
Figure A2.Scatterplots between the monitored values and values predicted by the MLBPN model using MODIS AOD data for the ground surface concentrations of various primary air pollutants: (a) PM 2.5 , (b) PM 10 , (c) O 3 , (d) NO 2 , (e) CO, and (f) SO 2 .

Table 1 .
Description of the data employed in the research.

Table 2 .
Prediction performances of the RF, MLBPN, and BresNet models.

Table 2 .
Prediction performances of the RF, MLBPN, and BresNet models.