remote A High-Performance Convolutional Neural Network for Ground-Level Ozone Estimation in Eastern China

: Having a high-quality historical air pollutant dataset is critical for environmental and epidemiological research. In this study, a novel deep learning model based on convolutional neural network architecture was developed to estimate ground-level ozone concentrations across eastern China. A high-resolution maximum daily average 8-h (MDA8) surface ground ozone concentration dataset was generated with the support of the total ozone column from the satellite Tropospheric Monitoring Instrument, meteorological data from the China Meteorological Administration Land Data Assimilation System, and simulations of the WRF-Chem model. The modeled results were compared with in situ measurements in ﬁve cities that were not involved in model training, and the mean R 2 of predicted ozone with observed values was 0.9, indicating the good robustness of our model. In addition, we compared the model results with some widely used machine learning techniques (e.g., random forest) and recently published ozone datasets, showing that the accuracy of our model is higher and that the spatial distributions of predicted ozone are more coherent. This study provides an efﬁcient and exact method to estimate ground-level ozone and offers a new perspective for modeling spatiotemporal air pollutants.


Introduction
Ground-level ozone and short-lifetime greenhouse gases are harmful to human health at high concentrations, with a significant effect on radiative forcing [1]. Although China has implemented strict atmospheric protection measures in recent years, ozone pollution is still worsening [2]. An annual mean trend of 3.3 ± 4.7 ug/m 3 of maximum daily average 8-h (MDA8) ozone in China between 2015 and 2019 was reported [3]. In 2015 and 2020, the ozone-related all-cause and respiratory health impact in China increased by 94.61 and 96.54%, respectively [4]. Ozone pollution is particularly severe in eastern China. In 2015, eastern China had the highest annual mean MDA8 ozone concentration in the country [5], which increased by~13% from 2015 to 2017 [6]. This is also a region with high population density. There is an urgent need to obtain accurate information about the spatiotemporal distribution of ozone in this region. cent inputs using different filters, and the number of trainable parameters can be reduced using the local receptive field [28]. The characteristics of CNN match the requirements of atmospheric modeling well, since many pollution and predictors exhibit spatiotemporal coherence. For example, CNN can learn the temporal pattern of input variables for predicting site observations in advance and conducting outlier detection [29,30]. Furthermore, CNN can extract useful spatiotemporal features for estimating surface air pollution spatial distribution [31,32], which is also this paper's objective.
In this study, a novel CNN-based architecture was developed for estimating groundlevel ozone concentrations. We took advantage of the flexible structure of CNN and used different filters to process the multi-scale spatiotemporal information. The inputs of our model mainly include Tropospheric Monitoring Instrument (TROPOMI) ozone retrievals, China Meteorological Administration Land Data Assimilation System (CLDAS) meteorological data, and WRF-Chem model simulations of three types of air pollution (ozone, NO 2 , and PM 2.5 ). To the best of our knowledge, this study is the first to adopt CNN architecture for ground-level ozone estimation. After validating the overall fitting ability and spatiotemporal extrapolation of our model, full-coverage daily MDA8 ozone at 0.05 • spatial resolution over eastern China in 2020 was generated. Figure 1 shows a schematic diagram of our study area, which ranges from 26.98 • to 38.45 • N and 113.96 • to 124.04 • E. The study area includes 4 provinces (Shandong, Anhui, Jiangsu, and Zhejiang) and one province-level municipality (Shanghai). It is one of the most economically developed and densely populated areas in China. In addition, Shandong has the largest chemical industrial base nationwide, and is a region with one of the highest levels of pollutant emissions in China [33]. This area can be used as a representative area for ground-level ozone modeling.

Satellite-Based Total Ozone Column (TOC) Data
Although satellite-retrieved TOC cannot directly correspond quantitatively to the ground-level ozone concentration, many studies have reported the acceptable accuracy for monitoring long-term ozone variation using a ozone monitoring instrument (OMI) TOC [37]. In particular, Liu et al. evaluated the OMI TOC in east China over ten years and indicated that there is a reasonable consistency between OMI TOC and ground-level ozone [38]. Furthermore, OMI TOC has been widely used as a useful proxy variable for ground-level ozone in both numerical models and machine learning models [14,19,39]. However, there are too many missing values in OMI data due to the sensor's physical obstruction (the so-called row anomaly). As a successor to OMI, TROPOMI is carried on the Sentinel 5 Precursor (S5P) satellite, which launched in 2017. Compared to OMI, TRO-POMI is equipped with a more sophisticated hyperspectral imager for the measurements We collected hourly ground-level ozone concentration data from the China National Environmental Monitoring Center (CNEMC) [34]. Ozone concentrations were measured using UV-spectrophotometry, and the data quality at all monitoring sites was validated using HJ 818-2018 specifications [35]. The arrangement of sampling sites followed HJ 664-2013 specifications, which mainly ensure stable environmental conditions around sites, the proper distance between sites and pollution sources, and the horizontal distance between the sampling port and the highest obstacle [36]. Although satellite-retrieved TOC cannot directly correspond quantitatively to the ground-level ozone concentration, many studies have reported the acceptable accuracy for monitoring long-term ozone variation using a ozone monitoring instrument (OMI) TOC [37]. In particular, Liu et al. evaluated the OMI TOC in east China over ten years and indicated that there is a reasonable consistency between OMI TOC and ground-level ozone [38]. Furthermore, OMI TOC has been widely used as a useful proxy variable for ground-level ozone in both numerical models and machine learning models [14,19,39]. However, there are too many missing values in OMI data due to the sensor's physical obstruction (the so-called row anomaly). As a successor to OMI, TROPOMI is carried on the Sentinel 5 Precursor (S5P) satellite, which launched in 2017. Compared to OMI, TROPOMI is equipped with a more sophisticated hyperspectral imager for the measurements of seven bands, the spatial resolution is significantly improved, and the signal-to-noise ratio is increased by 1-5 times. TROPOMI TOC is retrieved from Huggins spectral bands (325-335 nm) using the GODFIT direct fitting algorithm, which is a least-squares cost function minimization method based on the differences between measured and simulated radiance. The comparison between TROPOMI TOC and ground-based TOC measurements shows that the quality of the TROPOMI TOC is excellent with a bias of max. 3.5-5% [40]. We downloaded the TROPOMI level 2 total ozone column (5.5 × 3.5 km) from the Copernicus Open Access Hub [41]. The S5P satellite passes over our study area at about 13:30 local time. Most of the S5P passing times are within the period of the highest daily concentration of ozone over eastern China. Statistical results show that the coefficient of determination (R 2 ) and mean absolute error (MAE) between the ozone concentration at the S5P passing time and MDA8 simulation were 0.86 and 11.63 µg/m 3 , respectively. Therefore, the ozone concentration at the time of S5P passing largely reflects the daily MDA8 ozone level.
These meteorological data are important in ground-level ozone formation. Ozone is the result of the photochemical reaction of precursor pollutants, and DSR can reflect the solar radiation intensity, which has a direct impact on photochemical reaction rate [46]. TMP reflects the intensity of radiation to a certain extent and also affects the photochemical reaction rate [47]. PRE means there is little radiation near the surface, which can slow the photochemical process. In addition, depositing on water droplets could promote the depletion of ozone [48]. DPT and SH carry the information of water vapor, which can convert NO X into water-soluble HNO 3 , slowing down the production of ozone [49]. BLH is a function of many meteorological factors, which has a variety of complex nonlinear effects on ozone formation; for example, lower BLH transports less moisture to higher layers to form clouds, so the clouds become more transparent and the photolysis rates for NO 2 could be increased [50]. Wind can disperse concentration from the emission sources, so there is always a negative correlation between WIN and ground-level ozone concentration [51].

WRF-Chem Simulations and Other Data
To increase the ozone sources information in our model, we obtained WRF-Chem simulations from the Regional Atmospheric Environmental Modeling System (RAEMS). RAEMS, which aims to provide information on major atmospheric pollutants, such as PM 2.5 and ozone, and key meteorological factors for eastern China, was built based on Remote Sens. 2022, 14, 1640 5 of 17 WRF-Chem version 3.2 [52] and the Multi-Resolution Emission Inventory for China [53]. The R value between simulations and in situ observations of PM 2.5 and ozone is~0.7 and that of NO 2 is~0.5 [54].
The normalized difference vegetation index (NDVI) was obtained from the MODIS 16-day product [55]. Surface elevation (digital elevation model (DEM)) was derived from the Shuttle Radar Topography Mission [56]. They are important variables that reflect ground information. Day of year (DOY) is a commonly used representation of temporal information, which is important for air pollutant modeling.
Specific information of the data used in this paper can be found in Table S1.

Data Preprocessing
The Technical Specification for the Environmental Air Quality Assessment for China addresses the maximum daily 8 h moving average from 8:00 to 24:00 local time as the daily MDA8 ozone level [57]. As for data quality control, MDA8 ozone data will be discarded if there not 14 items of hourly ozone measurements for that time period, unless the MDA8 ozone concentration exceeds the limits of the air quality standard. We processed hourly ozone measurements of the CNEMC and RAMES simulations into MDA8 concentrations based on the specifications. Hourly meteorological data of CLDAS and ERA5 were also averaged over that time period. Hourly simulations of PM 2.5 and NO 2 were processed into daily averages.
It is worth noting that the space coverage of satellite is not uniform due to the uncertainty of observation conditions and sensor's operation status. A quality assurance value, which ranges from 0 (error) to 1 (all is well), of each pixel is provided with the TROPOMI data. The readme of TROPOMI Level-2 product recommends users to use pixels associated with quality values above 0.5 to exclude outliers caused by cloud cover, the presence of snow and ice, etc. As a result, the deficiency rate of TROPOMI data was~2.8%. In this paper, the ordinary Kriging (OK) method was used for supplementing missing data. To evaluate the effectiveness of this method, a randomized sensitivity experiment was performed according to a real case (Oct. 18), which is the worst day in our data, with a missing rate of 8.9%. Then, we imputed missing data using the OK method from TROPOMI data and the bilinear interpolation method from ERA5 TOC data, respectively. The comparison results show that OK method has better performance with a higher R 2 (seen in Figure S1), and a more similar spatial distribution (seen in Figure S2). As for other data, there are few missing values, and we believed that the OK method can guarantee the data quality.
After supplementing missing data, all data except for ground-level ozone concentration data were resampled into 0.05 • spatial resolution by performing cubic convolution interpolation, in which the values of new pixels can be determined by fitting curves through the values of the 16 nearest input pixels. The grid that the site fell into was used as a sampling center grid. We used two spatiotemporal data-matching methods. The first method was used to extract the long-term time series of the center grid, with 10 days of single-point data of each variable fused into a 2-D matrix with a size of 10 × 15 (10 days and 15 variables). The second method was used to extract the short-term spatiotemporal data of the center grid, with 5 days of spatial data of each variable merged into a 4-D matrix with a size of 5 × 7 × 7 × 15 (5 days, 7 × 7 in the spatial dimension, and 15 variables). Figure S3 shows a schematic diagram of our matching methods.

Model Development
Our model consists of two sub-networks, and was developed based on CNN and the Convolutional Block Attention Module (CBAM). The CBAM is formed by the channel attention module (CAM) and the spatial attention module, which is an efficient featurerefinement module with a somewhat increased number of parameters [58]. The CNN can be divided into 1-D and 2-D CNN according to the number of dimensions of the sliding convolution kernels. The convolution kernels of the 2-D CNN slide along two dimensions of the inputs, usually corresponding to spatial scope, so this CNN is widely Remote Sens. 2022, 14, 1640 6 of 17 used in image processing. The 1-D CNN generally corresponds to temporal dimension, so it can be applied in time-series processing. Because the applications of these algorithms are extensive and the principles are complex, the schematic formulas will not be repeated in this study.
A schematic diagram and the specific parameter settings of our model are shown in Figure 2. For sub-network 1, the input is a 4-D matrix, as mentioned above. We first applied a combination of CAM and 2-D CNN to the inputs, and 2-D CNN with 1 × 1 kernel size was applied to each variable, calculating the temporal dimension. The 1 × 1 convolution kernel, also known as network-in-network, is used for cross-dimension information interaction. Here, this algorithm was used for the aggregated temporal information. The temporal feature map of each variable was obtained through two combination layers, and the calculated feature size was 7 × 7 × 15. Then, a CBAM and 2-D CNN with 7 × 7 convolution kernel size were applied to this feature, aiming to extract the spatial and channel (the information between different variables) features, and finally a feature map was obtained with a size of 3 × 3 × 64. This sub-network represents a low-cost and crude way to process temporal information while focusing on extracting spatial and channel information. For sub-network 2, three layers of 1-D CNN were applied to 10-day time-series inputs, obtaining a 6 × 16 feature map. This sub-network worked to extract long-term temporal features. In the end, the features extracted from the two sub-networks were flattened into a 1-D vector and connected, then passed into two fully connected layers for regression. In brief, the model can be summed up as follows:

Cross-Validation
We adopted a 10-fold cross-validation (CV) strategy to validate model performance. To test the overall fit, the widely adopted sample-based CV was selected [14,20,39]. In sample-based CV, all matched samples are randomly divided 10-fold, of which 9 folds are used for model training, and the rest are used for validation. This process is repeated 10 times to ensure that all samples are tested. To test spatiotemporal predictive ability, we adopted space-based and time-based CV, whose operations are similar to sample-based CV, but the sample division criteria are spatial grid (0.05°) and DOY (day of year), respectively [59]. A schematic diagram of the 3 CV methods is shown in Figure 3. In addition, the model's optimizer is Adam, the loss function is the mean square error, the learning rate is reduced by a factor of 0.2 with 5 epochs based on validation loss, and the model stops training when validation loss has stopped improving in 10 epochs. The model was developed using the TensorFlow2.4 machine learning platform.

Cross-Validation
We adopted a 10-fold cross-validation (CV) strategy to validate model performance. To test the overall fit, the widely adopted sample-based CV was selected [14,20,39]. In sample-based CV, all matched samples are randomly divided 10-fold, of which 9 folds are used for model training, and the rest are used for validation. This process is repeated 10 times to ensure that all samples are tested. To test spatiotemporal predictive ability, we adopted space-based and time-based CV, whose operations are similar to sample-based CV, but the sample division criteria are spatial grid (0.05 • ) and DOY (day of year), respectively [59]. A schematic diagram of the 3 CV methods is shown in Figure 3.

Cross-Validation
We adopted a 10-fold cross-validation (CV) strategy to validate model performance.
To test the overall fit, the widely adopted sample-based CV was selected [14,20,39]. In sample-based CV, all matched samples are randomly divided 10-fold, of which 9 folds are used for model training, and the rest are used for validation. This process is repeated 10 times to ensure that all samples are tested. To test spatiotemporal predictive ability, we adopted space-based and time-based CV, whose operations are similar to sample-based CV, but the sample division criteria are spatial grid (0.05°) and DOY (day of year), respectively [59]. A schematic diagram of the 3 CV methods is shown in Figure 3.

Predictive Performance for Major Cites
To further investigate the spatial predictive ability of our model, we evaluated the model performance for 5 major cities, 4 provincial capital cities (Jinan, Nanjing, Hangzhou, and Hefei), and Shanghai. The sites in these cities did not participate in the model training process, and the mean values of all site observations/model predictions in each city were used as city-level ozone concentrations. The geographical locations of the 5 cities are shown in Figure S4.

Predictive Performance for Major Cites
To further investigate the spatial predictive ability of our model, we evaluated the model performance for 5 major cities, 4 provincial capital cities (Jinan, Nanjing, Hangzhou, and Hefei), and Shanghai. The sites in these cities did not participate in the model training process, and the mean values of all site observations/model predictions in each city were used as city-level ozone concentrations. The geographical locations of the 5 cities are shown in Figure S4.

Statistical Metrics
Statistical metrics included coefficient of determination (R 2 ), RMSE, mean absolute percentage error (MAPE), and mean absolute error (MAE), as well as whether the model can provide accurate peak predictions, which is more important for environmental research. We preformed peak validation for samples with MDA8 ozone concentration >160 ug/m 3 , which is the interim target 1 (IT-1) formulated by the World Health Organization. These peaks occupy approximately 9% of all samples; the probability histogram of site observations can be found in Figure S5. The statistical metrics of peak validation include the hit rate (H R ), false alarm ration (F AR ), missing rate (M R ), and threat score (T S ).

Statistical Description
During 2020 in eastern China, the annual MDA8 ozone measurement was 101 ± 44 ug/m 3 (mean ± standard deviation), with a median of 97 ug/m 3 and a 90th percentile value of 160 ug/m 3 . The monthly ozone measurement ranged from 52 ± 20 to 136 ± 40 ug/m 3 , and was the highest in May and the lowest in December ( Figure S6a). Spatially, the annual measurements of Shandong, Jiangsu, Anhui, Shanghai, and Zhejiang were 107 ± 48, 103 ± 44, 97 ± 40, 99 ± 39, and 94 ± 40, respectively. The overall spatial pattern was high in the north and low in the south ( Figure S6b).
We matched 89,331 pairs of variables by monitoring sites for model building, and collected 13,104,264 pairs of variables for ozone data grid generation. Statistical descriptions of these variables can be found in Table S2. measurements of Shandong, Jiangsu, Anhui, Shanghai, and Zhejiang were 107 ± 48, 103 ± 44, 97 ± 40, 99 ± 39, and 94 ± 40, respectively. The overall spatial pattern was high in the north and low in the south ( Figure S6b).

Overall Fitting Results
We matched 89,331 pairs of variables by monitoring sites for model building, and collected 13,104,264 pairs of variables for ozone data grid generation. Statistical descriptions of these variables can be found in Table S2.

Spatiotemporal Variations of Model Performance
The spatial distributions of R 2 values of the three CV methods are shown in Figure 5. In sample-based ( Figure 5a) and space-based CV (Figure 5b), sites with R 2 values greater than 0.9 accounted for~86 and~80%, respectively. In time-based CV (Figure 5c), R 2 values exceeded 0.8 at~80% of sites. The performance of our model was relatively poor in the southwest mountainous and northeast coastal areas, mainly because of the sparse distribution of monitoring sites in these regions. On the individual month level (Figure 6), the model performed best in summer and worst in winter. The R 2 values ranged from 0.91 to 0.97 in sample-based CV and from 0.86 to 0.95 in space-based CV. In time-based CV, the model performance fluctuated more, and the R 2 value was highest in June (0.90) and lowest in December (0.72). In general, there was no obvious spatiotemporal nonstationarity in our model's performance; no exceptionally low R 2 values were shown in the three CV methods for each site and month. In addition, the mountainous and northeastern coastal areas are relatively sparsely populated, and summer is often the season with the highest ozone concentration, which further reduces the disadvantage of our model for environmental research.

Prediction for Major Cities
The time-series diagram (Figure 7)  bution of monitoring sites in these regions. On the individual month level (Figure 6), the model performed best in summer and worst in winter. The R 2 values ranged from 0.91 to 0.97 in sample-based CV and from 0.86 to 0.95 in space-based CV. In time-based CV, the model performance fluctuated more, and the R 2 value was highest in June (0.90) and lowest in December (0.72). In general, there was no obvious spatiotemporal nonstationarity in our model's performance; no exceptionally low R 2 values were shown in the three CV methods for each site and month. In addition, the mountainous and northeastern coastal areas are relatively sparsely populated, and summer is often the season with the highest ozone concentration, which further reduces the disadvantage of our model for environmental research.

Prediction for Major Cities
The time-series diagram (Figure 7) shows that the model's predictions were in good agreement with observations in all cities. For Jinan, Nanjing, Hangzhou, Hefei, and Shanghai, the R 2 values were 0.89, 0.87, 0.84, 0.83, and 0.83, and the RMSE values predicted southwest mountainous and northeast coastal areas, mainly because of the sparse distribution of monitoring sites in these regions. On the individual month level (Figure 6), the model performed best in summer and worst in winter. The R 2 values ranged from 0.91 to 0.97 in sample-based CV and from 0.86 to 0.95 in space-based CV. In time-based CV, the model performance fluctuated more, and the R 2 value was highest in June (0.90) and lowest in December (0.72). In general, there was no obvious spatiotemporal nonstationarity in our model's performance; no exceptionally low R 2 values were shown in the three CV methods for each site and month. In addition, the mountainous and northeastern coastal areas are relatively sparsely populated, and summer is often the season with the highest ozone concentration, which further reduces the disadvantage of our model for environmental research.

Prediction for Major Cities
The time-series diagram (Figure 7) shows that the model's predictions were in good agreement with observations in all cities. For Jinan, Nanjing, Hangzhou, Hefei, and Shanghai, the R 2 values were 0.89, 0.87, 0.84, 0.83, and 0.83, and the RMSE values predicted

Comparisons
We generated the daily ozone data grid for 2020 by using the optimal model in sample-based CV, and the fitting results are shown in Figure S7. Two published datasets,

Comparisons
We generated the daily ozone data grid for 2020 by using the optimal model in samplebased CV, and the fitting results are shown in Figure S7. Two published datasets, China High Air Pollutants (CHAP) and Tracking Air Pollution in China (TAP), were selected to contrast the consistency. CHAP ozone was estimated based on the spatiotemporal extremely randomized trees model, the performance of which in eastern China (samplebased R 2 = 0.93) is similar to our results (sample-based R 2 = 0.94) [60]. TAP ozone was generated using a chemical transport model and the data fusion method with R 2 equal to 0.7 nationwide [61]. First, the quarterly averages ( Figure 8) showed a high degree of consistency between our data, CHAP, TAP, and site observations (the errors of daily averages can be seen in Figure 9), and the predicted ozone showed coherent spatial distribution without visible boundaries caused by model overfitting, which confirms the reasonability of our results. Second, daily comparisons ( Figure 10) showed that our data have mainly two advantages: (i) there were no obvious interpolation centers, and the predictions in locations of monitoring sites were more consistent with the surrounding data; and (ii) more detail on ozone distributions was provided due to the higher spatial resolution. China High Air Pollutants (CHAP) and Tracking Air Pollution in China (TAP), were selected to contrast the consistency. CHAP ozone was estimated based on the spatiotemporal extremely randomized trees model, the performance of which in eastern China (sample-based R 2 = 0.93) is similar to our results (sample-based R 2 = 0.94) [60]. TAP ozone was generated using a chemical transport model and the data fusion method with R 2 equal to 0.7 nationwide [61]. First, the quarterly averages ( Figure 8) showed a high degree of consistency between our data, CHAP, TAP, and site observations (the errors of daily averages can be seen in Figure 9), and the predicted ozone showed coherent spatial distribution without visible boundaries caused by model overfitting, which confirms the reasonability of our results. Second, daily comparisons ( Figure 10) showed that our data have mainly two advantages: (i) there were no obvious interpolation centers, and the predictions in locations of monitoring sites were more consistent with the surrounding data; and (ii) more detail on ozone distributions was provided due to the higher spatial resolution.
.   With regard to the statistical metrics of the models, we compared our model (CNN) with the two sub-networks and three machine learning models, namely the deep neural network (DNN), random forest (RF), and light gradient-boosting machine (LGBM) models. DNN, RF, and LGBM are widely used and easy to deploy, and we built them using TensorFlow 2.4, SciKit-learn 0.24.2, and LightGBM 3.2.1, respectively. The CV results and main parameter settings of these models are summarized in Table 1, and the density scatter plots can be found in Figure S8. As shown in Table 1, the performance of CNN was  With regard to the statistical metrics of the models, we compared our model (CNN) with the two sub-networks and three machine learning models, namely the deep neural network (DNN), random forest (RF), and light gradient-boosting machine (LGBM) models. DNN, RF, and LGBM are widely used and easy to deploy, and we built them using TensorFlow 2.4, SciKit-learn 0.24.2, and LightGBM 3.2.1, respectively. The CV results and main parameter settings of these models are summarized in Table 1, and the density scatter plots can be found in Figure S8. As shown in Table 1, the performance of CNN was With regard to the statistical metrics of the models, we compared our model (CNN) with the two sub-networks and three machine learning models, namely the deep neural network (DNN), random forest (RF), and light gradient-boosting machine (LGBM) models. DNN, RF, and LGBM are widely used and easy to deploy, and we built them using TensorFlow 2.4, SciKit-learn 0.24.2, and LightGBM 3.2.1, respectively. The CV results and main parameter settings of these models are summarized in Table 1, and the density scatter plots can be found in Figure S8. As shown in Table 1, the performance of CNN was superior in comparison to the other models, with the R 2 being increased by 3-8%, 1-5%, and 2-4% for sample-based, space-based, and time-based CV, respectively.

Training Sample Size
The number of training samples is among the determining factors for the performance of a data-driven model. In the above validation, we adopted a 10-fold CV method, where training data accounted for 90% of the total samples. Here, we conducted 2-fold to 9-fold CV, exploring the sensitivity of our model to the volume of training data ( Figure S9). With an increase in training samples, the results of the three CV strategies slightly improved, with R 2 value increments from 0.932 to 0.943, 0.892 to 0.911, and 0.804 to 0.824 for sample-based, space-based, and time-based CV, respectively. The performance of our model remained high even if the training dataset was the same size as the testing dataset, suggesting that the influence of the amount of training data on the model performance is limited.

Site Density
A sampling radius of 0.25 • was used to calculate the spatial density of sites in the study area. The calculated spatial density ranged from 0.9 to 21.8 ( Figure S10a), and we divided the sites into 22 groups with equal spacing of 1. As shown in Figure S10b, model accuracy improved rapidly in the first three groups, and then began to oscillate slightly, suggesting that when the spatial density of the site is greater than 3, the performance of the model does not heavily depend on that.

Spatiotemporal Sampling Size
Spatiotemporal sampling size determines the amount of information contained in the inputs. For sub-network 1, we tested a spatial sampling size range from 3 × 3 to 13 × 13 ( Figure S11a). Our experimental results show that the model performance remained stable when the spatial sampling size was 7 × 7, which corresponds to 0.35 • × 0.35 • geographic space. For sub-network 2, we tested time steps ranging from 2 to 13. As shown in Figure  S11b, when the time steps were greater than 7, the R 2 value of the three kinds of CV methods no longer increased significantly, and 10 time steps was finally selected in this paper.

Configurations of the Sub-Networks
For sub-network 1, we used the CBAM for feature refinement. Although the algorithm is widely used in computer vision, it has not been reported to be applied in atmospheric pollutant estimation. We compared 2-D CNN without CBAM with sub-network 2; the number of parameters in CBAM (1350) accounted for 2.2% of the total parameters (59,958), but the R 2 was increased by 3.0, 3.4, and 3.3% in sample-based, space-based, and time-based CV, respectively (Table S3). We think that the spatiotemporal relationship between different variables and ozone is theoretically different, and CBAM can be used for slight correction of the relationship.
For sub-network 2, RNN-based models are generally regarded as more accurate for time-series processing. We compared two widely used RNN-based models, the gated recurrent unit (GRU) [62] and long short-term memory (LSTM) [63] models, with 1-D CNN. Three groups of experiments were carried out to observe the performance of different models with approximately the same number of parameters; we set the number for each group at~8200,~20,000, and~32,000. The structures of the models were roughly the same (i.e., multiple hidden layers for feature extraction and fully connected layers for regression), and there were 10 time steps. As summarized in Table S4, there were no significant differences in the performance of the three models using the same number of parameters. The reason may be that RNN-based models focus on memorizing information of long time series, which is not required for this task (e.g., 1-D CNN can achieve convergence with >7 time steps). Meanwhile, the training speed of RNN-based models is much greater than that of CNN-based models even if the number of parameters is the same, because the recursive algorithm cannot run in parallel. From this, 1-D CNN was determined to be the preferred model for our task.

Conclusions
In summary, a regional CNN-based model was developed and a high-resolution maximum daily average 8-h surface ground ozone concentration dataset during 2020 was generated. The model was formed by two sub-networks, which were used to extract the 10-day time series features and 5-day spatiotemporal features. The proposed model achieved high performance, with sample-based, space-based, and time-based CV R 2 (T S ) values of 0.94 (0.85), 0.91 (0.81), and 0.83 (0.71), respectively. To further evaluate the spatial generalization ability of our model, we tested five cities that were not part of the model training, and the R 2 values were 0.89, 0.87, 0.84, 0.83, and 0.83 for Jinan, Nanjing, Hangzhou, Hefei, and Shanghai, respectively. The superiority of the model in terms of accuracy was confirmed by comparing it with commonly used machine learning methods, and the generated ozone data grid was superior to existing datasets in terms of spatial distribution coherence and resolution. This paper presents a novel deep learning method for ozone estimation that can provide more accurate data for atmospheric environment and epidemic research.