New Deep Learning Model to Estimate Ozone Concentrations Found Worrying Exposure Level over Eastern China

Ozone (O3), whose concentrations have been increasing in eastern China recently, plays a key role in human health, biodiversity, and climate change. Accurate information about the spatiotemporal distribution of O3 is crucial for human exposure studies. We developed a deep learning model based on a long short-term memory (LSTM) network to estimate the daily maximum 8 h average (MDA8) O3 across eastern China in 2020. The proposed model combines LSTM with an attentional mechanism and residual connection structure. The model employed total O3 column product from the Tropospheric Monitoring Instrument, meteorological data, and other covariates as inputs. Then, the estimates from our model were compared with real observations of the China air quality monitoring network. The results indicated that our model performed better than other traditional models, such as the random forest model and deep neural network. The sample-based cross-validation R2 and RMSE of our model were 0.94 and 10.64 μg m−3, respectively. Based on the O3 distribution over eastern China derived from the model, we found that people in this region suffered from excessive O3 exposure. Approximately 81% of the population in eastern China was exposed to MDA8 O3 > 100 μg m−3 for more than 150 days in 2020.


Introduction
In the stratosphere, naturally occurring ozone (O 3 ) protects organisms from the harmful solar ultraviolet radiation [1]. However, ground-level O 3 is harmful to humans and other organisms at high concentrations [2]. Integrated findings from toxicological animal studies and epidemiological studies allow us to state that O 3 can react with unsaturated fatty acids, amino groups, and other proteins, causing chest pain, memory loss, and decreased vision [3]. In recent years, although many emission control measures have been stringently enforced, China has experienced severe O 3 pollution [4]. During 2015-2020, the O 3 -related health impacts for all-cause and respiratory diseases vastly increased in China [5]. O 3 pollution was particularly severe in eastern China; the annual daily maximum 8 h average (MDA8) O 3 in this region for 2015 was the highest nationwide [6], and increased by 13.3% from 2015 to 2017 [7]. In Anhui Province, which lies in the western reaches of the eastern China region, the annual mean O 3 concentration increased by 15.6% from 2016 to 2020, and the government declared that O 3 has become the primary air pollutant [8]. Thus, a more efficient method is needed to estimate the spatiotemporal distribution of O 3 and assess its population exposure level. observations are relatively low. For example, most trace-gas-monitoring satellites are in a sun-synchronous orbit, and can only provide daily products. (2) The quality control of satellite-based/reanalysis data is not as robust as in situ observations; complicated physical retrieval or assimilation processes can result in several uncertainties. To solve these problems, we embedded an attention mechanism in the LSTM model and adopted advanced atmospheric composition observation satellite Sentinel-5 Precursor total O 3 column retrievals as one of the predictive variables. The model performances including overall fitting ability, spatiotemporal extrapolation, and peak estimation ability were fully evaluated by using observations from the China air pollution monitoring network. The spatiotemporal distribution of MDA8 O 3 was estimated in 2020 over eastern China at a spatial resolution of 0.1 • × 0.1 • . O 3 spatiotemporal patterns and human exposure intensities were determined based on a complete and credible dataset. This work can provide higher quality pollution data products for environmental and epidemiological research.

Study Area and Ground-Level O 3 Observation
The study area ( Figure 1) included four provinces (Shandong, Anhui, Jiangsu, and Zhejiang) and one province-level municipality (Shanghai), located in eastern China, ranging from 26.98 • to 38.45 • N and 113.96 • to 124.04 • E. The total land area of the study area was 517,000 km 2 , with a population of 327 million, accounting for approximately 5.2% and 23% of China's land area and total population, respectively. It is thus one of the most populated zones in the world. In addition, the industrial level of the region is highly developed, with one of China's largest comprehensive industrial centers (Shanghai-Nanjing-Hangzhou Industrial Base), and its GDP in 2020 was 31% of China's total GDP.
The main challenges in the application of LSTM in satellite-based spatial predictions of O3 over a large area can be summarized as follows: (1) The O3 concentration is very dynamic, especially in the troposphere, but the temporal resolutions of remote sensing observations are relatively low. For example, most trace-gas-monitoring satellites are in a sun-synchronous orbit, and can only provide daily products. (2) The quality control of satellite-based/reanalysis data is not as robust as in situ observations; complicated physical retrieval or assimilation processes can result in several uncertainties. To solve these problems, we embedded an attention mechanism in the LSTM model and adopted advanced atmospheric composition observation satellite Sentinel-5 Precursor total O3 column retrievals as one of the predictive variables. The model performances including overall fitting ability, spatiotemporal extrapolation, and peak estimation ability were fully evaluated by using observations from the China air pollution monitoring network. The spatiotemporal distribution of MDA8 O3 was estimated in 2020 over eastern China at a spatial resolution of 0.1° × 0.1°. O3 spatiotemporal patterns and human exposure intensities were determined based on a complete and credible dataset. This work can provide higher quality pollution data products for environmental and epidemiological research.

Study Area and Ground-Level O3 Observation
The study area ( Figure 1) included four provinces (Shandong, Anhui, Jiangsu, and Zhejiang) and one province-level municipality (Shanghai), located in eastern China, ranging from 26.98° to 38.45° N and 113.96° to 124.04° E. The total land area of the study area was 517,000 km 2 , with a population of 327 million, accounting for approximately 5.2% and 23% of China's land area and total population, respectively. It is thus one of the most populated zones in the world. In addition, the industrial level of the region is highly developed, with one of China's largest comprehensive industrial centers (Shanghai-Nanjing-Hangzhou Industrial Base), and its GDP in 2020 was 31% of China's total GDP. The hourly O 3 concentrations for 2020 were obtained from the website of the China National Environmental Monitoring Center (CNEMC) [26]. The arrangements of monitoring sites followed HJ 664-2013 specifications [27]. UV spectrophotometry was employed to measure the O 3 concentrations, and HJ 818-2018 specifications were used to ensure the data quality [28].

Tropospheric Monitoring Instrument Total O 3 Column
Many studies have reported acceptable accuracy and reasonable consistency between satellite-retrieved total O 3 columns and surface O 3 [29,30] concentrations. Additionally, the useful ozone monitoring instrument (OMI)'s total O 3 column in machine learning surface O 3 modeling has been reported. However, OMI has too many invalid data records due to the sensor's physical obstruction, which largely limits the application of OMI. As a successor of the ozone monitoring instrument (OMI), the tropospheric monitoring instrument (TROPOMI) on board the Sentinel-5P (S5P) satellite aims to observe global atmospheric components. Compared to OMI, the spatial resolution of TROPOMI is significantly higher, with a signal-to-noise ratio improving between 1-5 times [31]. Therefore, various trace gases such as O 3 , SO 2 , NO 2 , and CO can be measured more accurately. The total O 3 column Level-2 data product (5.5 × 3.5 km) was obtained from Copernicus Open Access Hub [32]. The TROPOMI-O 3 was retrieved using a direct-fitting algorithm, and the bias with respect to the ground-based O 3 column density measurements was 3.5-5% [33].

Meteorological Data and Other Covariates
Meteorological variables were obtained from the ERA5 datasets (0.25 • × 0.25 • ) of the European Center for Medium-Range Weather Forecasts [34], including surface solar radiation downwards (SSRD), 2 m dew point temperature (D2M), 2 m temperature (T2M), 10 m eastward wind component (U10), 10 m northward wind component (V10), boundary layer height (BLH), mean sea level pressure (MSL), surface pressure (SP), and total precipitation (TP). The selection of meteorological variables was based on prior knowledge of O 3 . For example, SSRD can reflect the intensity of sunlight, which affects the photochemical reaction of O 3 precursor pollutants [35]. BLH affects the diffusion space of pollutants in a vertical direction [36]. Wind usually disperses concentrations from the emission sources [37]. Temperature has an effect on the process of precursor generation [38], and precipitation hints unstable atmospheric conditions [39].
Normalized difference vegetation index (NDVI), road density (RD), surface classification (SC), surface elevation (DEM), latitude (LAT), longitude (LON) and day of year (DOY) were also used as predictors. NDVI was obtained from MODIS 16-day product [40], and RD was obtained from Open Street [41]. The SC and DEM were the input data for the retrieval of O 3 column density and provided together with the TROPOMI-O 3 product. The TROPOMI SC data were derived from the U.S. Geological Survey Global Land Cover Characterization dataset [42], and DEM were derived from ECMWF and GMTED2010 [43]. In addition, the population counts data (30 arc-second) for 2020 were retrieved from Gridded Population of the World dataset v4 [44].

Date Preprocessing
A small proportion of missing values do not significantly change the spatiotemporal characteristics of data for large-scale and long-term studies, and the most widely used method involves omitting the missing sections. However, considering the limited time and space in this study, there was a possibility of a large amount of missing data in specific spatiotemporal domains that may contain significant information. For O 3 monitoring sites, the missing values occupied 0.75% to 22.9%. We compared five imputing methods: latest-valid-observation (Baseline), linear interpolation (Linear), cubic spline interpolation (CUBIC), Bayesian probabilistic matrix factorization (BPMF) [45] and low-rank matrix completion (LMC) [46]. Some sites with high data-missing rates (>8%) were identified, and then 20 testing sites with a missing rate of less than 3% near these sites were selected. The root mean square error (RMSE) and mean absolute percentage error (MAPE) were used as the metrics for evaluation of different imputed results.
For TROPOMI data, a quality threshold (ranging from 0 (poor) to 1 (excellent)) is provided with the total O 3 column product together to remove invalid data caused by poor observation conditions (e.g., cloud cover) and bad retrieved results. Following the official instructions, we only adopted data with a quality threshold greater than 0.5, leading to a few areas with missing data (2.8%). Considering the missing data did not show clear distribution characteristics; we used ordinary kriging (OK) method to process TROPOMI data.
After processing the missing values, we selected the maximal hourly 8 h moving average from 8:00 to 24:00 local time as daily MDA8 ozone level. All predictive variables were resampled to 0.1 • × 0.1 • using appropriate algorithms. The meteorological and TROPOMI-O 3 data were resampled using the OK method. The surface classification was resampled using majority resampling. Then, the TROPOMI-O 3 and meteorological data were extracted at the locations of the ground-based O 3 monitoring sites, 6 days of data for each variable were fused into a 2-D matrix with a size of 6 × 15 (6 days and 15 variables), and the Z-score normalization method was applied to convert the distribution of original variables into a standard normal distribution.

Model Development
Because the applications of LSTM are extensive and the principles are complex, the schematic formulas will not be repeated in this study. We used three groups of trainable parameter matrices to calculate the attention weight of each unit of the LSTM model. The formulas are as follows: where h i denotes the output of LSTM at time i; V and W are the trainable parameter matrices; s i is the output at time i; and s t is the final output. The attention mechanism assigns a weight to the output of each time step, and then the model uses this weight to obtain the final output. First, the attention score at each time step was generated using Equation (1), where the Softmax function was used to convert the calculation result into a probability. Then, the output of each time step was updated via the attentional weight obtained using Equation (2). The final output (s t ) was the summation of the output at each time step, obtained using Equation (3). In addition, residual connection was adopted to solve the degradation problem in the deep learning model. A one-layer LSTM with the residual connection and attention mechanism (AR-LSTM) is shown in Figure 2. The parameters of the AR-LSTM model were determined by using exploratory experiments. The preprocessed data were divided into two parts randomly, the training set and testing set (8:2). For a deep learning model, the number of hidden layers, the number of neurons in each hidden layer, the loss function, and optimization should be determined primarily. The grid search technique can be used to provide optimized parameter configuration, but it is almost impossible because of the huge computational cost. Therefore, we set the following parameters: optimizer (Adam), loss function (MSE), The parameters of the AR-LSTM model were determined by using exploratory experiments. The preprocessed data were divided into two parts randomly, the training set and testing set (8:2). For a deep learning model, the number of hidden layers, the number of neurons in each hidden layer, the loss function, and optimization should be determined primarily. The grid search technique can be used to provide optimized parameter configuration, but it is almost impossible because of the huge computational cost. Therefore, we set the following parameters: optimizer (Adam), loss function (MSE), activation function (tanh) and initial model structure (hidden layers and a fully connected layer with one neuron), which are commonly used [47]. The number of hidden layers was selected from 1 to 6, and the number of neurons in each hidden layer was determined from 16, 32, 64, 128, and 256. The time steps were selected from 1 to 15. In addition, the automatic decay of the learning rate and early stopping techniques were applied to reduce over-fitting.

Model Evaluation
Four cross-validation (CV) methods were used for model validation. The most frequently used CV methods are sample-based CV and site-based CV. The sample-based CV method is appropriate for evaluating the overall performance of the model, whereas the site-based CV method can evaluate the spatial variations in model performance. However, the distribution of sites was not inhomogeneous. There may be some training sites near the testing sites. In this case, the features of the testing sites were closely related to those of the neighboring training sites. Therefore, we introduced a city-based CV method to ensure spatial independence of the testing sites. In the city-based CV method, the data are divided into 10 folds by cities instead of sites. For temporal extrapolation capability, a month-based method was adopted, which left one month out for CV.
The metrics included R 2 , MAE, root mean square error (RMSE), and mean absolute percentage error (MAPE). In addition, peak validation was adopted to judge whether the model could provide a reliable peak concentration estimation. Hit rate (HR), false alarm ratio (FAR), missing rate (MR) and threat score (TS) were used as metrics to evaluate the model ability for estimating MDA8 O 3 over 100 µg m −3 , which is the air quality guideline (AQG) recommended by the World Health Organization (WHO). The formulas are as shown below: where a denotes the number of samples in which observations and predictions are all greater than 100 µg m −3 ; b denotes predictions more than 100 µg m −3 but observations are not; c denotes observations more than 100 µg m −3 but predictions are not.

O 3 Level and Human Exposure Assessments
For a given region and period, the population-weighted MDA8 O 3 concentrations were regarded as the O 3 level and were calculated using the formulation formula [21]: where C pw is the population-weighted MDA8 O 3 for the region (N grid cells); P i is the population density for grid i; and C i is the MDA8 O 3 concentration in grid i.
The nonattainment day was defined as >100 µg m −3 based on AQG. The exposure intensities and durations were the cumulative percent of populations exposed to different levels of MDA8 O 3 [48].

Missing Data Imputation Results
The results from the five methods for site observations imputation are shown in Figure 3. There was a negative correlation between the performances of imputed methods and the duration of missing data, and the LMC method performed with relative stability. As the missing data of TROPOMI-O 3 were very small, the OK method had strong applicability. Figure 4 shows the imputation results for TROPOMI-O 3 data on 11 June; the missing data imputed by the OK method were smooth and agreed well with the surrounding original data.

=1 =1
where is the population-weighted MDA8 O3 for the region ( grid cells); is the population density for grid ; and is the MDA8 O3 concentration in grid . The nonattainment day was defined as >100 μg m −3 based on AQG. The exposure intensities and durations were the cumulative percent of populations exposed to different levels of MDA8 O3 [48].

Missing Data Imputation Results
The results from the five methods for site observations imputation are shown in Figure 3. There was a negative correlation between the performances of imputed methods and the duration of missing data, and the LMC method performed with relative stability. As the missing data of TROPOMI-O3 were very small, the OK method had strong applicability. Figure 4 shows the imputation results for TROPOMI-O3 data on 11 June; the missing data imputed by the OK method were smooth and agreed well with the surrounding original data.

Model Configuration Selections
The performance of the AR-LSTM model with different parameters is shown in Figure 5. The number of neurons directly affected model fitting ability, but too many neurons would increase the computational cost and lead to over-fitting. We found that the model performed optimally with the number of neurons equaling 128 (Figure 5h). Stacked multi-layers can improve the nonlinear capability of deep learning models. As shown in Figure 5f, the model performance gradually stabilized with the number of hidden layers over four. The time steps determine the amount of information contained in time-series data; the mean R 2 value remained around 0.92 when the time steps were greater than 6 days ( Figure 5g). Therefore, the final AR-LSTM model had four hidden layers, 128 neurons in each hidden layer, and the input time-series data had 6 time steps.

Model Configuration Selections
The performance of the AR-LSTM model with different parameters is shown in Figure 5. The number of neurons directly affected model fitting ability, but too many neurons would increase the computational cost and lead to over-fitting. We found that the model performed optimally with the number of neurons equaling 128 (Figure 5h). Stacked multi-layers can improve the nonlinear capability of deep learning models. As shown in Figure 5f, the model performance gradually stabilized with the number of hidden layers over four. The time steps determine the amount of information contained in time-series data; the mean R 2 value remained around 0.92 when the time steps were greater than 6 days ( Figure 5g). Therefore, the final AR-LSTM model had four hidden layers, 128 neurons in each hidden layer, and the input time-series data had 6 time steps.
Stacked multi-layers can improve the nonlinear capability of deep learning models. As shown in Figure 5f, the model performance gradually stabilized with the number of hidden layers over four. The time steps determine the amount of information contained in time-series data; the mean R 2 value remained around 0.92 when the time steps were greater than 6 days ( Figure 5g). Therefore, the final AR-LSTM model had four hidden layers, 128 neurons in each hidden layer, and the input time-series data had 6 time steps.    Figure S1. Like the city-based CV, the results of the month-based CV were also not as good as the first two commonly used CV methods. The R 2 , RMSE, and MAPE values were 0.77, 21.03 µg m −3 and 21.17%, respectively. The results suggest that the spatiotemporal extrapolation of AR-LSTM was not as good as its overall prediction ability, and reveal the weak robustness of data-driven algorithms. However, the accuracy of the AR-LSTM model was still acceptable.

Model Performance and Grid-Data Generation
The optimal model in sample-based CV was selected for the generation of O 3 grid-data. The comparison of results showed that there was a high degree of consistency between the in situ observations and generated data, with the R 2 , RMSE, and MAPE equal to 0.94, 10.95 µg m −3 , and 10.2%, respectively (Figure 7). We selected Tracking Air Pollution China (TAP) MDA8 O 3 dataset, which was generated by using CTMs and the data fusion method [49], for consistency contrast. Figure 8 shows the spatial patterns of our data were generally in good agreement with the TAP data, and our RMSE was lower. The R 2 value for each site and month are shown in Figure S2. Spatially, the quality of generated data was not as good in the southwest. This could be because of the sparse distribution of monitoring sites in the south. Temporally, the highest R 2 values appeared in summer (0.97), and the lowest value was observed in winter (0.90). From the point of view of peak validation, the annual averages of HR, FAR, MR and TS were 0.94, 0.05, 0.06 and 0.90, respectively. We also calculated peak validation metrics of each site ( Figure S3) and month ( Figure S4), and there was no significant spatial nonstationarity. divisions in the city-based CV are provided in Figure S1. Like the city-based CV, the results of the month-based CV were also not as good as the first two commonly used CV methods. The R 2 , RMSE, and MAPE values were 0.77, 21.03 μg m −3 and 21.17%, respectively. The results suggest that the spatiotemporal extrapolation of AR-LSTM was not as good as its overall prediction ability, and reveal the weak robustness of data-driven algorithms. However, the accuracy of the AR-LSTM model was still acceptable. The optimal model in sample-based CV was selected for the generation of O3 griddata. The comparison of results showed that there was a high degree of consistency between the in situ observations and generated data, with the R 2 , RMSE, and MAPE equal to 0.94, 10.95 μg m −3 , and 10.2%, respectively (Figure 7). We selected Tracking Air Pollution China (TAP) MDA8 O3 dataset, which was generated by using CTMs and the data fusion method [49], for consistency contrast. Figure 8 shows the spatial patterns of our data were generally in good agreement with the TAP data, and our RMSE was lower. The R 2 value for each site and month are shown in Figure S2. Spatially, the quality of generated data was not as good in the southwest. This could be because of the sparse distribution of monitoring sites in the south. Temporally, the highest R 2 values appeared in summer (0.97), and the lowest value was observed in winter (0.90). From the point of view of peak validation, the annual averages of HR, FAR, MR and TS were 0.94, 0.05, 0.06 and 0.90, respectively. We also calculated peak validation metrics of each site ( Figure S3) and month ( Figure S4), and there was no significant spatial nonstationarity.

Comparisons with Other Methods
We selected several widely used machine learning and deep learning models including RF, deep neural network (DNN), gate recurrent unit (GRU) [50], original LSTM and CNN [51] for comparisons. The key difference between these models is how the spatiotemporal relationship of variables is processed. RF and DNN have no special spatiotemporal relationship calculation logic, GRU and LSTM are two types of recurrent neural network, and CNN uses spatial convolution structure to process spatial information. As for the parameter selection for these models, GRU and LSTM adopted the same parameters as AR-LSTM because the only difference between them was the structure of the internal calculation unit. The parameters of RF, DNN and CNN were determined by using an exploratory approach similar to that of AR-LSTM; the specific process can be found in Figures S5-S7.

Comparisons with Other Methods
We selected several widely used machine learning and deep learning models including RF, deep neural network (DNN), gate recurrent unit (GRU) [50], original LSTM and CNN [51] for comparisons. The key difference between these models is how the spatiotemporal relationship of variables is processed. RF and DNN have no special Since the purpose of this article was the reconstruction of historical pollution data, the sample-based CV and city-based CV were adopted as validation methods. Table 1 shows the CV results of various models. The performances of LSTM and GRU were better than those of RF and DNN. The results indicated that the time series of variables provided valuable information for O 3 modeling. However, the LSTM and GRU models processed the time series of variables in a relatively simple manner compared with our AR-LSTM model, leading to less accurate results. For a more detailed explanation, the attention weights of some of the samples are illustrated in Figure S8. We found that the largest attention weights did not appear in the last epoch in approximately 8% of the samples. The assumption of a decrease in the relationship between model inputs and O 3 with time was not sufficiently accurate. On the other hand, the performances of the CNN models had no clear advantages, and the results indicate that the spatial information was of limited usefulness in this paper. Presumably, that was due to the coarse spatial resolution of the input variables.
In addition, the performance of AR-LSTM was compared with similar studies published recently (Table S2). Generally speaking, our model reached a relatively high level in term of statistical metrics. However, due to the different study areas and periods, the results are for reference only.

Sensitivity Analysis of Modeling Variables
We carried out sensitivity analysis in the modeling stage. We implemented a variable screening method based on the variable importance of the random forest (RF) model. RF is an ensemble decision tree model, which performs a split according to a given splitting criterion, and the variable importance can be obtained by weighing the improvements in the splitting criterion in all the nodes where the variable appears as a splitter. Therefore, the RF model is more explanatory than other machine learning models and is a natural model for variable sensitivity analysis [52,53]. We built an RF model and removed variables in ascending order based on variable importance; sample-based CV10 was used to evaluate the model performance after removing one variable each time. Figure S9 shows the screening results. The model achieved optimal performance after removing road density (RD), probably because of its low temporal resolution. However, the result was not significant enough to support a definitive conclusion. Considering that adding a variable had little effect on the computational burden, all collected data were used in this paper.

Human Exposure Assessment
According to our predictions, the annual average population-weighted MDA8 O 3 was estimated to be 102.9 ± 33.9 µg m −3 across eastern China in 2020 (Table 2), showing an increase of~13% in five years [54]. Seasonally, the population-weighted MDA8 O 3 was predicted to be 120.1 ± 33.0, 118.6 ± 19.1, 101.3 ± 33.9, and 67.3 ± 18.4 µg m −3 for spring, summer, fall, and winter, respectively. The O 3 level peaked in spring over all provinces except Shandong. Regionally, the population-weighted MDA8 O 3 was found to be the highest in Shandong (110.0 ± 43.5 µg m −3 ) and lowest in Zhejiang (94.8 ± 33.3 µg m −3 ). The overall spatial pattern was higher in the north and lower in the south. To some extent, the spatial pattern was related to the regional industrial distribution. Energy-intensive industries, such as petrochemicals and non-ferrous smelting, which produce higher precursor emissions, are the dominant industries in Shandong, but the economy of Zhejiang is dependent on the information technology and tourism industries. In addition, the increase in VOC due to anthropogenic emissions, such as coal combustion in Zhejiang, could be lower than that in Shandong due to the lower population density. Note that although there were large intra-annual and spatial variations in O 3 levels in eastern China, the general trends showed severe O 3 pollution. This was true even in Zhejiang, which showed the best air quality, where the O 3 levels in all seasons (except winter) exceeded the AQG. Additionally, the summer O 3 level in Shandong (142.3 ± 36.5 µg m −3 ) was closer to the interim target 1 (IT-1:160 µg m −3 ) formulated by the WHO. Figure 9 shows the exposure intensities and durations, which provide more detailed information about the severity of pollution. Approximately 81% of the population in eastern China lived in areas with more than 150 non-attainment days (i.e., population-weighted MDA8 O 3 > 100 µg m −3 ) in 2020, and 15% of the population was exposed to O 3 levels higher than the IT-1 value for more than 60 days. Note that although there were large intra-annual and spatial variations in O3 levels in eastern China, the general trends showed severe O3 pollution. This was true even in Zhejiang, which showed the best air quality, where the O3 levels in all seasons (except winter) exceeded the AQG. Additionally, the summer O3 level in Shandong (142.3 ± 36.5 μg m −3 ) was closer to the interim target 1 (IT-1:160 μg m −3 ) formulated by the WHO. Figure  9 shows the exposure intensities and durations, which provide more detailed information about the severity of pollution. Approximately 81% of the population in eastern China lived in areas with more than 150 non-attainment days (i.e., population-weighted MDA8 O3 > 100 μg m −3 ) in 2020, and 15% of the population was exposed to O3 levels higher than the IT-1 value for more than 60 days.

Conclusions
This study proposed a new deep learning model to estimate the MDA8 O 3 concentrations across eastern China on a 0.1 • × 0.1 • grid. Attention mechanism and residual connection were introduced into each LSTM unit to improve its ability for temporal information processing. With the support of a small number of variables from the public dataset, the AR-LSTM model achieved good performance (sample-based CV R 2 = 0.94 and city-based CV R 2 = 0.85). In addition, this model is not limited in O 3 estimation; it is easy to employ similar model structures for the estimation of other types of air pollution, such as PM 2.5 or NO 2 . The assessments indicate that eastern China suffered severe O 3 pollution in 2020. The annual average population-weighted MDA8 O 3 concentration was estimated to be 102.8 ± 33.1 µg m −3 , and 81% of the eastern China population lived in areas with more than 150 non-attainment days. The rapid implementation of emergency control measures for O 3 pollution is essential. Normalized control measures for VOC emissions should be implemented in earnest, especially in Shandong province. The petrochemical industry and companies that use large amounts of organic solvents, photo-oxidized technology, and activated carbon adsorption cotton must be monitored vigorously. The government should actively urge these companies to upgrade or replace their substandard VOC treatment devices. Stricter measures should also be considered, such as banning asphalt and paint spraying operations during the daytime and regulating the number of vehicles based on the O 3 concentration.
There are some limitations of this study that should be noted. First, the spatiotemporal domain was relatively small. The dataset only contained a one-year temporal series and five provinces. For a data-driven model, the amount of data is one of the key factors affecting model performance. Second, the input variables were not very comprehensive. Regardless of model complexity, more covariates such as emission inventories and other pollutants should be considered, which may improve model accuracy. Finally, TROPOMI-O 3 was derived from a complex physical retrievals process that may introduce more uncertainties.