Prediction of the Tropospheric NO2 Column Concentration and Distribution Using the Time Sequence-Based versus Influencing Factor-Based Random Forest Regression Model

The prediction of air pollutants has always been an issue of great concern to society. In recent years, machine learning has been widely used to predict and simulate air pollutants. In this study, we collected meteorological data and tropospheric NO2 column concentration data for Beijing, China, between 2012 and 2020, and compared time sequence-based and influencing factor-based random forest regression for predicting the tropospheric NO2 column concentration. The results showed that prediction of the tropospheric NO2 column concentration using random forest regression was affected by changes in human activities, especially emergency events and policy variations. The advantage of time sequence analysis lies in its ability to calculate the distribution of air pollutants over a long prediction time scale, but it may produce large errors in numerical value. The advantage of influencing factor prediction lies in its high precision and its ability to identify the specific impact of each influencing factor on the NO2 column concentration, but it needs more data and more work before it can make a prediction about the future.


Introduction
With the development of industry, air pollution has become a problem of increasing concern and has attracted widespread attention from society. Nitrogen oxides are major air pollutants and are directly or indirectly related to atmospheric environmental problems such as photochemical smog, acid deposition and stratospheric ozone depletion [1][2][3]. NO2 is the main component of nitrogen oxides in the atmosphere, and its monitoring and prediction can serve as a guide to the control of atmospheric nitrogen oxides and therefore help formulate policies for their emission reduction and control. Many mathematical and machine-learning models have been developed to calculate and describe the distribution and change of atmospheric NO2. The Weather Research and Forecasting model coupled with the Community Multiscale Air Quality modeling system (WRF-CMAQ) and the Weather Research and Forecasting model coupled with Chemistry (WRF-Chem) have been used extensively [4][5][6]. Shin et al. (2018) [7] made a linear regression analysis of NO2 in Japanese metropolises using the spatiotemporal random tree model and found that it was advantageous to use this model to simulate spatiotemporal changes of NO2. Zhan et al. (2018) [8] established a new model known as random forest space-time Kriging (RF-STK) and used it to assess the exposure risks of NO2 and SO2 in some regions of China.
The most critical issue in the management of air pollution is the prediction of the concentration and distribution of the pollutants, because air pollution cannot be controlled by only analyzing the pollution that has already occurred. Moolchand et al. (2021) [9] established a modified model for extrapolating air pollutants based on historical and current meteorological datasets and evaluated various classifiers on the results from 196 cities in India, finding that the accuracy of linear robust regression was 94-96%. This accuracy could be improved to some extent by using various types of clustering algorithms: the optimal accuracy of the decision-tree classifier was 99.7%, and the use of the random forest classifier could raise the accuracy by 0.02%, indicating that machine-learning algorithms are more accurate than the linear model in predicting air pollutants. Sriram et al. (2021) [10] predicted the air quality index (AQI) in Delhi by using the decision tree, support vector machine (SVM), naive Bayes classifier, logistic regression, random forest and K-nearest neighbor as the supervised machine-learning algorithms, finding that the decision tree method produced the best results with an overall accuracy of 99.8%. The results of prediction models based on big data analysis and machine learning can help assess the current air quality and compare the assessments. In the present study, we established a NO2 column concentration distribution prediction model based on random forest regression, using the time sequence analysis and influencing factor prediction methods, with the purpose of comparing the advantages and disadvantages of the two methods and their respective application settings.
Wang et al. [11] used TROPOMI and HRRR data to develop a random forest model to estimate ground-level ozone concentrations in California. This model allows the contribution of satellite data products to be assessed in a concise modelling framework, and their findings suggest that TROPOMI data improve the estimation of extremes in ground-level ozone modelling. It could also accelerate future research on the application of satellite data products and high-resolution meteorological data to predict ground-level ozone concentrations. Long et al. [12] developed models for estimating daily ground-level NO2 in China using four tree-based machine learning models (decision tree (DT), gradient boosted decision tree (GBDT), random forest (RF) and extra tree (ET)). Through spatio-temporal analysis and comparison, they found that the estimated high-resolution results were consistent with ground-based observations of NO2, and that of the four models, the extra-tree model with spatio-temporal information (the ST-ET model) outperformed the remaining three models for the 2019 estimation. These are in addition to the large number of other studies based on tree models, which together demonstrate the generalizability of tree-based machine learning models for atmospheric pollution studies at a global scale.
Much past research has focused on comparing one or several different models. Rarely has there been an analysis of different ideas and approaches applied to one model. Moreover, in the traditional use of machine learning models, the results of a single model are mostly used as a conclusion. In contrast to previous studies, we discuss two commonly used methods for prediction and analysis based on random forest regression (RFR) models. The advantages, disadvantages and applicability of both methods are investigated, and we also provide a more detailed quantitative analysis of the relationship between influencing factors and atmospheric pollutants as an extension to the random forest regression model.
A study by Rui F et al. (2019) [13] showed that machine learning takes less than one percent of the computation time of traditional atmospheric models. Simulating hourly concentrations of seven air pollutants for 4 months in 2018 with a WRF-based model would take more than 6 days. The same data would take less than 1 h for machine learning on a personal laptop with four cores. Considering that the random forest model has a faster computing speed and lower technical requirements than other models, such as the WRF and neural network models, it is more suitable for wide dissemination and use. Therefore, we chose the random forest regression model for our research discussion.
Beijing is a world-famous ancient capital and modern international city, as well as the capital and the political, economic and cultural center of China. It is located in the north of China on the North China Plain, adjacent to Tianjin in the east and Hebei in the west, with its center at 116°20′ E and 39°56′ N (Figure 1). Geographically, Beijing is high in the northwest and low in the southeast; its west, north and northeast sides are surrounded by mountains, and the southeast side is a plain gently inclining to the Bohai Sea. Beijing has a warm temperate, semi-humid and semi-arid monsoon climate, hot and rainy in summer and cold and dry in winter.
As the capital of China, Beijing is the city that responds most promptly to policy and is also the earliest to monitor air pollutants in China. The changes in air pollutants in Beijing are representative of most major cities in China.

Data Sources
Satellite data were obtained from the ozone monitoring instrument (OMI) aboard NASA's Aura satellite (https://disc.gsfc.nasa.gov/ (accessed on 15 October 2021)) [14]. In the present study, we used the OMI/Aura NO2 tropospheric column L3 global grid 0.25 × 0.25 degrees V3 product. As this product has already been filtered and retains only data with a cloud fraction < 30%, no additional filtering is necessary. In addition, hourly real-time air quality monitoring data released by the National Urban Air Quality Real-time Publishing Platform of China's Environmental Monitoring Station were used (http://www.cnemc.cn/ (accessed on 15 October 2021)). The data used in this study were the daily mean values calculated from the hourly NO2 data.
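The aggregation of hourly station readings into daily means can be sketched as follows. This is a minimal illustration on synthetic values, not the processing code used in the study; the column name `no2`, the function name and the `min_hours` threshold are assumptions introduced here.

```python
import numpy as np
import pandas as pd

def daily_mean_no2(hourly, min_hours=1):
    """Average hourly NO2 readings into one value per calendar day.

    `hourly` must have a DatetimeIndex and a column named "no2"
    (a hypothetical name). Days with fewer than `min_hours` valid
    readings are returned as NaN.
    """
    series = hourly["no2"]
    daily = series.resample("D").mean()     # NaN readings are skipped
    counts = series.resample("D").count()   # number of valid readings per day
    return daily.where(counts >= min_hours)

# Synthetic example: two days of hourly data with one missing reading.
index = pd.date_range("2019-03-01", periods=48, freq=pd.Timedelta(hours=1))
values = np.concatenate([np.full(24, 10.0), np.full(24, 20.0)])
values[30] = np.nan
hourly = pd.DataFrame({"no2": values}, index=index)

daily = daily_mean_no2(hourly, min_hours=18)
```

Requiring a minimum number of valid hours per day is one possible way to avoid daily means dominated by a few readings; the source does not state whether such a threshold was applied.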
From the re-analysis data released by the National Centers for Environmental Prediction (NCEP)/National Center for Atmospheric Research (NCAR) (https://psl.noaa.gov/data/gridded/data.ncep.reanalysis.html (accessed on 15 October 2021)), the lifted index (LI, °C), tropospheric temperature (K), atmospheric pressure (Pa), precipitable water volume (PWV, kg/m²) and relative humidity (RH, %) were selected and calculated.

Methods
In the Python Sklearn random forest regression module, the maximum depth (max_depth) limits how many levels of splits each decision tree may grow: the greater the maximum depth, the more closely the model fits the training data. However, excessive maximum depth may result in overfitting. The number of trees determines the size of the random forest model: in general, the more trees, the more accurate the result obtained [15]. The random number (random seed) controls the randomness of model building. If no random number is specified, each calculation produces a different result; specifying it makes the results reproducible, which helps the user compare and find better hyperparameters. The learning curve of the model indicates that an excessively complex model will reduce the accuracy of the model, meaning that an excessive number of trees and excessive depth will increase the calculation time and reduce the accuracy of the model. For this reason, careful selection of the hyperparameters can greatly increase the accuracy and speed of the random forest model (Figure 2).
Sustainability 2023, 15, x FOR PEER REVIEW

Based on the above knowledge, three main hyperparameters are required to establish a random forest: the number of decision trees to be produced (n_estimators), the depth of the tree model (max_depth) and the random number (random_state) [16].
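A learning-curve style search over these hyperparameters might look like the sketch below. It is only an illustration on synthetic data: the candidate values, the five-fold cross-validation and the helper name `learning_curve_scores` are assumptions, and the source does not state how its learning curve was computed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the real samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + 0.1 * rng.normal(size=300)

def learning_curve_scores(X, y, param_name, values, **fixed):
    """Mean cross-validated R^2 for each candidate value of one hyperparameter."""
    scores = {}
    for value in values:
        model = RandomForestRegressor(
            random_state=0,              # fixed seed keeps runs comparable
            **{param_name: value},
            **fixed,
        )
        scores[value] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    return scores

# Step 1: scan the number of trees at a fixed depth.
tree_scores = learning_curve_scores(X, y, "n_estimators", [10, 50, 100], max_depth=8)
best_n = max(tree_scores, key=tree_scores.get)

# Step 2: scan the depth with the chosen number of trees.
depth_scores = learning_curve_scores(X, y, "max_depth", [2, 4, 8, 12], n_estimators=best_n)
best_depth = max(depth_scores, key=depth_scores.get)
```

Scanning one hyperparameter at a time keeps the search cheap; a joint grid search would be an alternative when the parameters interact strongly.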
In this study, we used the Python GDAL, Pandas, NumPy, SciPy, Sklearn and Jupyter modules to process data and generate images; among these, the GDAL module is particularly powerful for raster calculations. We used GDAL to read the raster images and then performed the raster calculations as matrix operations. To ensure the accuracy of the model while avoiding overfitting, we selected hyperparameters that gave an R² score of less than 0.98 to establish the model.
The time sequence prediction model was established by taking the NO2 column distributions of n successive years as the features and the NO2 concentration distribution of year n + 1 as the label (target value), and training was performed on these data to obtain the optimal hyperparameters. Using the trained model, we predicted the NO2 concentration of year n + 2 and obtained good prediction results.
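One way to read this set-up is sketched below: each pixel is a sample, its values in n successive years are the features, and its value in year n + 1 is the target; the trained model is then applied to the latest n years to estimate year n + 2. This interpretation, the function names and the synthetic raster stack are assumptions for illustration, not the authors' code or real OMI data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def build_training_set(stack, n):
    """Turn a (years, rows, cols) raster stack into per-pixel samples.

    Features are the first n yearly values of each pixel; the target is
    the value of the same pixel in year n + 1.
    """
    years = stack.shape[0]
    if years < n + 1:
        raise ValueError("need at least n + 1 years of data")
    flat = stack.reshape(years, -1).T        # shape: (pixels, years)
    return flat[:, :n], flat[:, n]

def predict_next_year(model, stack, n):
    """Predict year n + 2 from the latest n years (years 2 .. n + 1)."""
    flat = stack.reshape(stack.shape[0], -1).T
    latest = flat[:, 1:n + 1]
    return model.predict(latest).reshape(stack.shape[1:])

# Synthetic stand-in for same-month NO2 rasters of n + 2 successive years.
rng = np.random.default_rng(42)
base = rng.uniform(8.0, 18.0, size=(20, 20))
n = 5
stack = np.stack([base * 0.95 ** t for t in range(n + 2)])
stack *= 1.0 + 0.01 * rng.normal(size=stack.shape)

# Train on years 1 .. n + 1, then predict the held-out year n + 2.
X_train, y_train = build_training_set(stack[:n + 1], n)
model = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=0)
model.fit(X_train, y_train)

predicted = predict_next_year(model, stack[:n + 1], n)
observed = stack[n + 1]
percent_error = np.abs(predicted - observed) / observed * 100.0
```

Because a random forest cannot extrapolate beyond the range of its training targets, a model of this kind is sensitive to abrupt changes in the trend between the training window and the predicted year.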
As no grid images representing large numbers of human activity data were available, especially industrial and traffic data, and only monthly or yearly mean data were available, we only selected part of the meteorological data as influencing data in establishing the influencing factor prediction model in this study, which does not mean that these are the only influencing factors.
This paper only compares the application scenarios of the two methods and does not analyze the NO2 column concentration in the study area in depth, so the influencing factors selected are only those that allow a relatively accurate model to be established.
The model R 2 and RMSE shown in this paper are only for the training set, and the RMSE for the predicted data set is discussed in detail in the paper.
Figure 3 shows the flow diagram for the adjustment of the model parameters used in this paper.

Results and Discussion

Changes of the Tropospheric NO2 Column Concentration in Beijing from 2012 to 2020
As shown in Figure 4, the NO2 column concentration in the target areas decreased gradually year by year from 2012 to 2020, the highest mean value being 17.49 ± 2.80 × 10¹⁵ molec/cm² in 2012 and the lowest mean value being 7.80 ± 1.66 × 10¹⁵ molec/cm². This decrease was cumulative and similar to the observations of Chi et al. in 2021 [17]. Compared with 2013, the NO2 column concentration in 2014 decreased significantly, mainly because of the publication of the "Action Plan of Prevention and Control of Air Pollution" in China during 2013 and 2014 [18]; its main elements are the strengthening of the treatment of air pollutants, the limitation of air pollutant emissions, the requirement to use clean energy, the use of clean technology, the improvement of the monitoring system and the establishment of an early warning system, etc. [19].

The Time Sequence Prediction Model
As air pollutants present a typical seasonal distribution, it is necessary to establish separate models for different months. We selected March, June, September and December to establish the models and used the NO2 column concentration data from 2012 to 2019 to predict the NO2 column concentration in 2020. The data engineering and prediction results are presented in Model 1/Table 1 and Figure 5/Table 2, respectively.
As shown in Figure 5 and Table 2, the error of the result obtained by Model 1 was relatively large, especially for March and December, in which the maximum error was 40.4% and 61.53%, respectively. Considering the outbreak of the COVID-19 pandemic in 2020, human activities may have been greatly limited by the pandemic. To verify this hypothesis, we established a prediction model to predict the NO2 column concentration in 2019. The data engineering and prediction results are presented in Model 2/Table 3 and Figure 6/Table 4, respectively.
As shown in Figure 6 and Table 4, Model 2 was superior to Model 1, especially in the result error; the maximum error appeared in March 2019, being 30.65%. As shown in the distribution map, the error exceeded 20% in only a few areas. The mean error of the four months was less than 10%, and the maximum root mean square error (RMSE) of the four months was 6.71%. The prediction of the NO2 column concentration distribution was more accurate than that of Model 1. These results support the hypothesis that the changes in human activity in 2020 had a great impact on the time sequence-based prediction model. Other than emergency events, policy variations also had a huge impact on human activities and air pollutant emission.
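The error statistics quoted above (maximum error, mean error and RMSE, all in percent) can be computed from a predicted and an observed raster along the following lines. The source does not give the exact formulas, so the per-pixel relative error used here is an assumption, and the small arrays are made-up values for illustration only.

```python
import numpy as np

def percent_error_map(predicted, observed):
    """Signed per-pixel error of the prediction relative to the observation, in %."""
    return (predicted - observed) / observed * 100.0

def summarize_errors(error_map):
    """Maximum, mean and root mean square of the absolute percent error."""
    abs_err = np.abs(error_map[np.isfinite(error_map)])   # drop no-data pixels
    return {
        "max": float(abs_err.max()),
        "mean": float(abs_err.mean()),
        "rmse": float(np.sqrt(np.mean(abs_err ** 2))),
    }

# Tiny made-up rasters (units of 10^15 molec/cm^2), not values from the paper.
observed = np.array([[10.0, 20.0], [5.0, 8.0]])
predicted = np.array([[11.0, 18.0], [5.0, 10.0]])

stats = summarize_errors(percent_error_map(predicted, observed))
```

Keeping the signed error map separate from the summary makes it possible to plot where the model over- or under-predicts before reducing it to single numbers.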
Given the great policy variations in 2014, the data engineering and results of the 2019 NO2 column concentration prediction model based on 2014-2018 data are shown in Model 3/Table 5.
As shown in Figure 7 and Table 6, the error of Model 3 was smaller than that of Model 2. The maximum RMSE of the four months appeared in March 2019, being 5.39%, and the RMSE of each of the other three months was less than 5%. All other results of Model 3 were also superior to those of Model 2. In principle, the more characteristic years used for learning, the better the prediction result should be. The fact that Model 3 was nevertheless superior to Model 2 demonstrates that a time sequence-based prediction model that takes changes in human activities or emission policy into consideration is better than one that does not. In addition, a shorter training period means faster calculation, indicating that policy variations and limitations on human activities should be considered when time-sequence prediction is performed. Although the prediction error was relatively high in some target areas when time sequence was used to predict the NO2 column concentration, the predicted NO2 column concentration distribution is acceptable.
The accuracy has been significantly improved compared to traditional models [6,17]. A comparison with previous studies using machine learning models found that the precision of our estimates was similar to the results of other studies, but slightly lower than that of similar studies that introduced other influencing factors [11,12].

Prediction of Influencing Factors
The model established based on the meteorological factors and NO 2 column concentration from 2014 to 2018 alone was unable to predict the NO 2 column concentration in 2019, and therefore data from the ground monitoring stations were added.The data engineering (Model 4/Table 7) and results are shown in Table 8.As shown in Table 8, the result error was smaller than that of the time-sequence-based prediction model (Model 2/3) and the prediction result was closer to the actual value.However, the NO 2 concentration data obtained from the ground monitoring stations in 2019 were required during model establishment.As a result, it could only predict the pollution events that had occurred.If the time sequence-based prediction model was first used to predict the meteorological data followed by using the predicted data obtained to predict the pollutants, the error would be increased.
The method has predictive power and is more accurate than traditional studies [20,21]. The accuracy of the predictions is similar to that of previous studies using machine learning methods [16]. If the data obtained from the ground monitoring stations were used to predict the tropospheric NO2 column concentration, the result would to some extent lose its predictive meaning, because the predicted air pollution has already occurred at the time of prediction. Model 4 is therefore more similar to an inversion model. The influencing factor-based prediction model is able to obtain the impact of each influencing factor on the NO2 column concentration within the time interval in the target area via the importance interface and identify which influencing factor has the greater impact on the NO2 column concentration. The results are listed in Table 9. Figure 8 is the partial dependence plot (PDP) of the impact of each influencing factor on the NO2 column concentration in March 2019, obtained by using the important parameters from the importance interface. By using this PDP and multiple linear regression, we can establish the conditional functional relationship specific to air pollutants. The results are normalized. As there were not enough data when X1 lies in 32.98~37.23, we were unable to establish the functional relationship for this interval.
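The two steps described here, reading the factor importances from a fitted random forest and tracing a partial dependence curve, can be sketched as follows. The data are synthetic, the factor names are placeholders loosely following Figure 8, and the partial dependence is computed by hand (average prediction with one factor held fixed) rather than with any plotting helper, so this is an illustration of the idea rather than the study's workflow.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def partial_dependence_1d(model, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value            # hold one factor fixed for all samples
        curve.append(model.predict(X_mod).mean())
    return np.array(curve)

# Synthetic, normalized influencing factors (placeholder names).
rng = np.random.default_rng(1)
names = ["ground_no2", "pwv", "temperature", "lifted_index", "pressure", "rh"]
X = rng.uniform(0.0, 1.0, size=(500, 6))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.05 * rng.normal(size=500)

model = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=0)
model.fit(X, y)

# Impurity-based importance of each factor (values sum to 1).
importance = dict(zip(names, model.feature_importances_))

# Partial dependence of the prediction on the first factor.
grid = np.linspace(0.05, 0.95, 10)
pdp_ground = partial_dependence_1d(model, X, 0, grid)
```

The importance values rank the factors, while the partial dependence curve shows the direction and shape of each factor's effect; the two views complement each other.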
The results of the calculation of the NO2 column concentration for March 2019 based on the obtained functional relationship are shown in Figure 9. The results are very close to the values measured by OMI, with a trend line slope of 0.96, an R² of 0.96 and an RMSE of 0.34 × 10¹⁵ molec/cm². This indicates that the obtained functional relationship can describe the relationship between the influencing factors and the NO2 column concentration.
The above demonstrates that the result given by the functional relationship calculated by multiple linear regression is somewhat different from that calculated by RFR, especially in the ranking of PWV and tropospheric temperature, mainly for the following reasons: (1) the relationship between NO2 and the influencing factors is complex and not simply linear, and therefore the multiple linear regression model can only partially achieve a good fit; (2) the 32.98~37.23 interval is lost, but this is the interval in which the greatest change may occur; (3) the concentration ranges of the NO2 concentration released by the ground monitoring stations are not clearly defined. The cause may be that classification into concentration ranges needs sufficient data in each range to ensure the accuracy of the result obtained by the multiple linear regression model. Finer classification of concentration ranges often means less data in each range; this conflict is difficult to resolve because it invites a priori judgements made to obtain a better functional relationship, which is unacceptable in result analysis. The relationship between the various influencing factors and the NO2 column concentration needs to be further explored in future research.
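A conditional (piecewise) multiple linear regression of the kind discussed above, together with the slope, R² and RMSE used to compare it with observations, could be sketched as follows. The interval edges, the minimum sample count and the synthetic data are assumptions chosen to mimic an interval with too few samples; they are not the values or the functional relationship obtained in the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def fit_piecewise(X, y, edges, min_samples=20):
    """Fit one multiple linear regression per interval of the first factor.

    Intervals with fewer than `min_samples` samples are skipped, so no
    functional relationship is established there.
    """
    models = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (X[:, 0] >= lo) & (X[:, 0] < hi)
        if mask.sum() >= min_samples:
            models[(lo, hi)] = LinearRegression().fit(X[mask], y[mask])
    return models

def predict_piecewise(models, X):
    """Predict with the model of the matching interval; NaN where none exists."""
    y_hat = np.full(X.shape[0], np.nan)
    for (lo, hi), model in models.items():
        mask = (X[:, 0] >= lo) & (X[:, 0] < hi)
        if mask.any():
            y_hat[mask] = model.predict(X[mask])
    return y_hat

# Synthetic data with an empty interval between 33 and 37 on the first factor.
rng = np.random.default_rng(7)
X = rng.uniform(0.0, 1.0, size=(800, 3))
X[:, 0] *= 60.0
X = X[(X[:, 0] < 33.0) | (X[:, 0] >= 37.0)]
y = np.where(X[:, 0] < 33.0, 0.2 * X[:, 0] + X[:, 1], 0.5 * X[:, 0] - 2.0 * X[:, 2])
y = y + 0.1 * rng.normal(size=y.shape)

models = fit_piecewise(X, y, edges=[0.0, 33.0, 37.0, 60.0])
y_hat = predict_piecewise(models, X)

valid = ~np.isnan(y_hat)
slope = np.polyfit(y[valid], y_hat[valid], 1)[0]   # trend line: fitted vs. observed
r2 = r2_score(y[valid], y_hat[valid])
rmse = float(np.sqrt(np.mean((y_hat[valid] - y[valid]) ** 2)))
```

The sketch makes the trade-off in the text concrete: narrower intervals follow a nonlinear relationship more closely, but each interval then has fewer samples and may have to be dropped.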
The limitations of the modeling approach discussed in this paper can be avoided by selecting more detailed and richer influencing factors. For example, Brokamp et al. (2018) [22] and Hu et al. (2017) [23] developed daily PM2.5 prediction models for the U.S. using data that mainly included AOD, meteorology and land use. Predictions based on influencing factors for pollutants such as NO2, SO2 and O3 can be made by adding local emission data, such as emission inventories, but the time scale of such predictions is short and it is difficult to achieve long-time-scale prediction. We will conduct research in this area in subsequent studies.

1. Human activities and emission policy variations should be taken into full consideration when using the time sequence-based air pollutant RFR model. Although the result obtained by this model is not accurate enough, it can be used to predict air pollutant distributions and is of positive significance for governments or enterprises in formulating pollutant emission policies.

2. The influencing factor-based air pollutant RFR prediction model is more accurate than the time sequence-based air pollutant RFR model in predicting pollutant concentrations, but it is unable to predict the overall pollutant distributions. It requires a large amount of complex work to select influencing factors and process data. Nevertheless, it can calculate the impact of each influencing factor on air pollutants.

Figure 1. Brief description of the situation in Beijing, China.

Figure 2. Learning curve of the random forest regression model.

Figure 4. Distribution of the tropospheric NO2 column concentration in Beijing, China between 2012 and 2020 (annual average value).

Figure 5. Error distribution between the result obtained by the NO2 column concentration model and the actual result obtained by OMI in 2020.

Figure 6. Error distribution between the result obtained by the NO2 column concentration model and the actual result obtained by OMI in 2019.

Figure 7. Error distribution between the result obtained by the adjusted NO2 column concentration model and the actual result obtained by OMI in 2019.

Figure 8. Partial dependence plot between various influencing factors and the NO2 column concentration obtained by Model 4 using random forest regression. (A): the NO2 concentration obtained by the ground monitoring station; (B): precipitable water volume; (C): tropospheric temperature; (D): lifted index selected; (E): atmospheric pressure; (F): relative humidity.

Figure 9. Comparison of the results calculated from the functional relationship with the measured values of OMI.

Table 1. Data engineering of the 2020 NO2 column concentration prediction model (x for month).

Table 2. Result error of 2020 NO2 column concentration prediction.

Table 3. Data engineering of the 2019 NO2 column concentration prediction model (x for month).

Table 4. Result error of 2019 NO2 column concentration prediction.

Table 5. Adjusted data engineering of the 2019 NO2 column concentration prediction model (x for month).

Table 6. Result error of adjusted 2019 NO2 column concentration prediction.

Table 7. Data engineering of the influencing factor-based NO2 column concentration prediction model.

Table 8. Result of the influencing factor-based NO2 column concentration prediction model.

Table 9. Important parameters of the influencing factor-based NO2 column concentration prediction model.