Global Surface HCHO Distribution derived from Satellite Observations with Neural Networks Technique

Abstract: Formaldehyde (HCHO) is one of the most important carcinogenic air contaminants. However, the lack of global monitoring of surface HCHO concentration currently hinders research on outdoor HCHO pollution. Traditional methods are either restricted to small areas or too data-demanding for research at a global scale. To alleviate this issue, we adopted neural networks to estimate surface HCHO concentration with confidence intervals in 2019, using HCHO vertical column density data from TROPOMI and in-situ data from the HAPs (hazardous air pollutants) monitoring network and the ATom mission. Our results show that the global average surface HCHO concentration is 2.30 μg/m³. In terms of regions, concentrations in the Amazon Basin, Northern China, South-east Asia, the Bay of Bengal, and Central and Western Africa are among the highest. Our study provides a first dataset of global surface HCHO concentration. In addition, the derived confidence intervals of surface HCHO concentration add an extra layer of confidence to our results. As a pioneering work in adopting confidence interval estimation in AI-driven atmospheric pollutant research, and as the first global HCHO surface distribution dataset, our paper will pave the way for rigorous study of the global health risk and economic loss caused by ambient HCHO, thus providing a basis for pollutant control policies worldwide.


Introduction
Formaldehyde (HCHO) is a carcinogenic trace gas and toxic pollutant in the atmosphere [1]. It is considered by the U.S. Environmental Protection Agency (EPA) to be one of the most important carcinogens in outdoor air among 187 hazardous air pollutants (HAPs) [2], and accounts for more than 50% of the total HAP-related cancer risk in the United States [3]. Thirteen out of every million people are expected to develop nasopharyngeal carcinoma after lifetime exposure to an average HCHO concentration of 1 microgram per cubic meter [4]. As the most abundant aldehyde compound in the atmosphere, HCHO is one of the major volatile organic compounds (VOCs) and pollutants in the troposphere [5], and is closely related to the formation and removal of O3 and NO2 in the atmosphere. HCHO pollution is a global-scale issue. Ambient HCHO is produced by both natural and anthropogenic sources, such as the photolysis of isoprene from vegetation [6,7], farmland emissions [8], energy production and automobile exhaust emissions [9,10].
Surface concentration represents the amount of HCHO that people are exposed to, and is the direct data source for health risk estimation. Nevertheless, despite the crucial role of HCHO in human health and in the atmosphere, it is difficult to monitor HCHO systematically and comprehensively with traditional ground-based methods because of their large errors and high cost [11]. As a result, there is still no regular or large-scale monitoring of HCHO over most regions of the world. Most countries and regions with serious pollution do not measure the surface HCHO concentration. Only in the United States does the HAP sampling network collect HCHO information, and even there it is limited to cities and industrial sites [12].
In contrast, remote sensing technology can not only monitor long-term and large-scale dynamics, but also avoid many interference factors. Currently, many satellite missions report HCHO vertical column density (VCD) [13], which provides fundamental datasets for much related research. The main sensors used to measure HCHO VCD in the atmosphere include GOME-1 [14], GOME-2 [15], SCIAMACHY [16], OMI [17] and TROPOMI [18]. In terms of precision, TROPOMI is the most advanced atmospheric monitoring spectrometer with the highest resolution, with a swath of 2600 km and daily global coverage [19]. However, most satellite-based retrievals can only provide the total column concentration because of their limited vertical resolution. Therefore, most studies on ambient HCHO focus only on the total amount in the vertical column in certain regions, such as North America [20], South America [21], Europe [22], Asia [23,24] and Africa [7], instead of on its surface concentration.
With the increasing attention towards health risks and photochemical pollution, the demand for a global view of the HCHO surface concentration distribution is growing more urgent. Many efforts have been made to derive surface concentration from total column concentration, such as using fixed forms of linear models to assess the relationship between VCD and in-situ concentration of NO2, SO2, CO and PM [25], and using R² to assess the relationship between vertical column density and ground in-situ concentration [26]. However, these methods seem to be less accurate and may be limited to specific pollutants. In the few other existing studies, HCHO surface concentration was derived by applying the vertical distribution profile from the GEOS-Chem model to the satellite-derived total column concentration [27]. However, the atmospheric transport model itself requires numerous input parameters, which may impede its application at a global scale with a reasonable spatial and temporal resolution. Therefore, our main focus here is to derive the global surface HCHO concentration distribution from satellite-derived total column HCHO concentration and a very limited amount of in-situ HCHO concentration data.
The neural network, a powerful machine learning algorithm, has gained its reputation for revealing hidden patterns in data with great accuracy in various fields, such as image classification [28], object detection [29], image denoising [30], image synthesis [31] and person re-identification [32]. However, some algorithms, such as the vanilla neural network, assign neither a confidence level nor a confidence interval to their point estimates, although both are necessary for scientific estimation and public policy decision-making. To quantify the uncertainty of results derived from neural networks, a diverse range of approaches has been adopted, including the Bayesian neural network [33], the delta method [34], the bootstrap [35], mean variance estimation [35], and interpreting dropout as variational inference [36]. However, these methods are either computationally demanding or rely on strong assumptions. The quality-driven (QD) method is based on lower upper bound estimation (LUBE) and derives confidence intervals for a neural network by combining the uncertainty-estimating loss and the neural network loss function as a whole [37]. It is not only compatible with gradient descent algorithms, but also shrinks the average confidence interval length by up to 10% compared with previous works [38]. Therefore, to enhance the credibility of our model, this method is leveraged to obtain the interval estimation of the surface concentration of HCHO. Combining the point and interval estimations is expected to strike a balance between maintaining accuracy and controlling uncertainty in the form of a pre-set confidence level.
The potential health impact of HCHO, together with the lack of global surface monitoring data, demands an efficient way to understand the global HCHO surface distribution with limited data. In this paper, we derived the global surface concentration of HCHO in 2019 by feeding TROPOMI VCD data and limited surface HCHO concentration data into a neural network model. In addition to capturing the seasonal changes in key areas, we also estimated confidence intervals for the derived surface HCHO by using the QD method. As a novel work on adopting interval estimation in AI-driven atmospheric pollutant research and deriving the first dataset of global HCHO surface distribution, our paper will pave the way for rigorous study of the global health risk and economic loss caused by ambient HCHO, thus providing a basis for pollutant control policies worldwide.

Figure 1. Data processing workflow
To estimate the global distribution of HCHO surface concentration, we used two discrete in-situ data sources and Sentinel-5P TROPOMI VCD data at the corresponding locations (as shown by red points in Figure 1) to train our neural network model. We then applied the model at a global scale and estimated the surface HCHO distribution with confidence intervals.

Sentinel-5P VCD Data
The data of vertical column density (VCD) of HCHO in this study comes from TROPOMI (Tropospheric Monitoring Instrument), which is carried on Sentinel-5P [19]. Sentinel-5P is a global air pollution monitoring satellite launched by ESA on October 13, 2017, as part of the Copernicus project. TROPOMI can effectively observe trace gas components in the atmosphere around the world, including NO2, O3, SO2, HCHO, CH4, CO and other important indicators closely related to human activities, and can strengthen the observation of aerosols and clouds [39].
In terms of accuracy, TROPOMI is currently the most advanced atmospheric monitoring spectrometer with the highest spatial resolution. The satellite provides daily global coverage with a spatial resolution of 7 km × 7 km and an equator crossing time of about 13:30 local time, which effectively ensures the comparability of data in different regions [19]. Sentinel-5P data are currently available for public access.
We use the data of 2019 because a) 2018 was the first year in which Sentinel-5P was in operation, and the product algorithm was not yet stable; and b) 2020 falls within the global COVID-19 pandemic, which might have had a special impact on anthropogenic sources, making the result less representative of the long-term status. Offline HCHO data from January 1 to December 31, 2019 were collected. Following the technical documents, data points whose quality index (qa_value) is less than 0.5 were removed to ensure the best quality. After mosaicking the datasets and applying Ordinary Kriging interpolation, we obtained the global distribution of the average total column concentration of HCHO at a resolution of 0.05° by 0.05°. Data beyond 60°S and 60°N were discarded because satellite data there are sparse and human activity is scarce, so the omission has little impact on health risk estimation.
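The quality screening and latitude cut-off described above can be sketched as follows. This is a minimal illustration, not the processing chain used in the study: the record layout (`lat`, `lon`, `vcd`, `qa_value` keys) and the sample values are hypothetical, and the mosaicking and Kriging steps are omitted.

```python
QA_THRESHOLD = 0.5      # pixels with qa_value below this are removed (from the text)
MAX_ABS_LATITUDE = 60.0  # data beyond 60°S / 60°N are discarded (from the text)


def filter_by_quality(pixels, threshold=QA_THRESHOLD):
    """Keep only pixels whose qa_value is at least `threshold`."""
    return [p for p in pixels if p["qa_value"] >= threshold]


def within_latitude_band(pixels, max_abs_lat=MAX_ABS_LATITUDE):
    """Keep only pixels located between 60°S and 60°N."""
    return [p for p in pixels if abs(p["lat"]) <= max_abs_lat]


# Hypothetical sample records; the VCD values are illustrative only.
pixels = [
    {"lat": 10.0, "lon": 20.0, "vcd": 1.2e16, "qa_value": 0.9},   # kept
    {"lat": 10.0, "lon": 20.0, "vcd": 0.8e16, "qa_value": 0.3},   # dropped: low quality
    {"lat": 75.0, "lon": 5.0, "vcd": 0.4e16, "qa_value": 0.8},    # dropped: beyond 60°N
    {"lat": -30.0, "lon": 140.0, "vcd": 0.5e16, "qa_value": 0.5},  # kept: at the threshold
]

kept = within_latitude_band(filter_by_quality(pixels))
print(len(kept), "of", len(pixels), "pixels kept")
```

Only the two filtering rules come from the text; any gridding or interpolation would follow as a separate step.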

In-situ Data
Since our study aims to estimate the surface concentration of HCHO on a global level, we need surface-level concentration data which will cover diverse types of underlying surfaces and also different altitudes to train our model. Therefore, the following two data sources are considered.
ATom flight data. NASA's Atmospheric Tomography Mission (ATom) is a U.S. mission that systematically sampled the atmosphere on a global scale from 2016 to 2018, with continuous profiling from 0.2 km to 12 km. The ATom flight data report the volume mixing ratio of HCHO in air. A large number of gas and aerosol payloads were deployed on NASA's DC-8 aircraft, and HCHO on NASA's high-altitude aircraft was measured by the ISAF instrument [40,41]. The instrument uses laser-induced fluorescence (LIF) to obtain the high sensitivity needed to detect HCHO in the upper troposphere and lower stratosphere, where its abundance is about 10 parts per trillion. LIF also responds quickly enough to measure the abundance of HCHO in the fine-structured outflow of convective storms. These HCHO measurements will be used to elucidate the mechanism of convective transport and to quantify the effects of boundary layer pollutants on ozone photochemistry and cloud microphysics in the upper atmosphere [42].
HAPs ground monitoring data. We obtained ground HCHO observations from EPA SLTS network at https://www.epa.gov/outdoor-air-quality-data, which reports average 24-hour HCHO concentration all around the year. Here, we selected 5965 data points from 109 sites in 2019, covering the whole country, as shown in Figure 2 (a).
These two datasets cover a wide range of latitudes, from -8.1977° S to 82.9404° N, and a diverse variety of landscapes in the U.S. The selection of the HAPs dataset is to ensure that the concentration distribution feature at ground level is represented in our model, and the ATom data is to ensure that our model can be generalized and applied to a global extent.
Figure 3. (a) The geographical distribution of our data, where red represents ATom flight data points and green represents the HAPs ground monitoring network. (b) The meaning of "Height" and "Altitude" for ATom mission data.
Since ATom data are obtained far above the surface, and the vertical distribution of HCHO usually changes considerably between the ground and 1~2 km above it [43], we take the "Height" of the aircraft measurements as another input variable in our model to control for the vertical distribution along the column. For the HAPs ground monitoring data, we assign 0 as the height.

Global DEM Data
Since descriptive statistics show a negative relationship between surface altitude and in-situ concentration, with a Pearson's correlation of r = -0.3907 in our in-situ dataset, we use global Digital Elevation Model (DEM) data as one of the input variables, "Altitude", in order to estimate the ground-level concentration. The relationship between the variables "Height" and "Altitude" is shown in Figure 2. In our study, we use the Shuttle Radar Topography Mission (SRTM) DEM product and resample it to a resolution of 0.05°. This dataset has an initial resolution of 90 m at the equator and is provided in WGS84 projection with a 1 arc resolution [44].
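For reference, the Pearson correlation quoted above is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it with the standard library only; the altitude and concentration samples are hypothetical and are not taken from the in-situ dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical samples: surface altitude (m) and in-situ HCHO (μg/m³).
altitude = [0.0, 500.0, 1000.0, 1500.0]
concentration = [4.0, 3.0, 2.5, 1.0]

r = pearson_r(altitude, concentration)
print(f"r = {r:.4f}")  # negative for these samples
```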

Data Processing
After collecting the data and organizing them into a structured format, we first visualize and preprocess them. Then, two neural networks are implemented for point and interval estimation by using PyTorch, a well-known deep-learning framework. Our code is available online.
The preprocessed data with the ground truth in-situ HCHO concentration are then divided into two groups, training and testing dataset, to train our models. After that, global VCD data are fed into the model to derive global surface level HCHO concentration.

Preprocessing
In theory, a neural network is able to handle input data from different distributions. However, we noticed a significant defect in the training process without preprocessing, owing to the highly imbalanced, skewed distribution of the HCHO concentration (both column and in-situ). Therefore, we first applied a log-transformation to the raw data. The logarithm of the HCHO concentration data shows a bell-shaped distribution, and the resulting improvement in estimation accuracy confirmed the effectiveness of the log-transformation.
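The effect of the log-transformation on a skewed sample can be illustrated with a short sketch. The natural logarithm is an assumption here (the base is not stated in the text), and the concentration values are hypothetical.

```python
import math


def log_transform(values):
    """Natural-log transform of strictly positive values (base is an assumption)."""
    return [math.log(v) for v in values]


def inverse_log_transform(values):
    """Map log-scale values back to the original scale."""
    return [math.exp(v) for v in values]


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed concentrations (μg/m³).
raw = [0.5, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 20.0]
logged = log_transform(raw)

print(f"skewness before: {skewness(raw):.2f}, after: {skewness(logged):.2f}")
```

Model outputs on the log scale would be mapped back with `inverse_log_transform` before being reported as concentrations.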

Neural Network Architecture
As a universal function approximator, the neural network plays a vital role in helping us derive the point and interval estimations of the HCHO concentration. But instead of training a single network to get these estimations jointly, two separate neural networks are constructed for point and interval estimation respectively, because several experiments which we carried out indicated that a joint model always has to compromise between point estimation and interval estimation, thus greatly reducing the accuracy of point estimation.
Like ordinary multi-layer perceptrons, each neural network in our model contains three input nodes and three BFR blocks (the ReLU in the last block is disabled). The network for point estimation has one output node, and the network for interval estimation has two. The structure of our model is shown in Figure 3. To stabilize the training and prediction procedures, instead of stacking fully connected and non-linear activation layers directly, we stack BFR blocks, each made up of a batch normalization layer, a fully connected layer and a ReLU activation layer, in that order.
Batch normalization (BN) was first introduced to address Internal Covariate Shift, a phenomenon referring to the unfavorable change of data distributions in the hidden layers. Like data standardization, BN forces each dimension of a hidden layer's input to have the same mean and variance, which not only regularizes the network, but also accelerates training by reducing the dependence of gradients on the scale of the parameters or of their initial values [45].
A fully connected (FC) layer follows immediately after the BN layer to provide a linear transformation, with the number of hidden neurons set to 50. The output of the FC layer is non-linearly activated by the ReLU function [46,47].
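The forward pass through a stack of BFR blocks can be sketched in plain Python as below. This is an illustrative sketch rather than the released implementation: batch normalization uses batch statistics only (no learned scale or shift and no running averages), the weight initialization and the input values are hypothetical, and in PyTorch each block would correspond to `nn.BatchNorm1d`, `nn.Linear` and `nn.ReLU` applied in sequence.

```python
import math
import random


def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) over the batch."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
            for row in batch]


def fully_connected(batch, weights, biases):
    """Linear layer; weights[k][j] connects input j to output k."""
    return [[sum(w * x for w, x in zip(w_row, row)) + b
             for w_row, b in zip(weights, biases)] for row in batch]


def relu(batch):
    return [[max(0.0, v) for v in row] for row in batch]


def bfr_block(batch, weights, biases, use_relu=True):
    """Batch normalization -> fully connected -> (optional) ReLU."""
    out = fully_connected(batch_norm(batch), weights, biases)
    return relu(out) if use_relu else out


def build_network(n_in=3, n_hidden=50, n_out=1, seed=0):
    """Three BFR blocks: n_in -> 50 -> 50 -> n_out (hypothetical initialization)."""
    rng = random.Random(seed)
    sizes = [n_in, n_hidden, n_hidden, n_out]
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        bound = 1.0 / math.sqrt(fan_in)
        weights = [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]
        biases = [rng.uniform(-bound, bound) for _ in range(fan_out)]
        layers.append((weights, biases))
    return layers


def forward(network, batch):
    last = len(network) - 1
    for idx, (weights, biases) in enumerate(network):
        # The ReLU of the last block is disabled so outputs are unconstrained.
        batch = bfr_block(batch, weights, biases, use_relu=(idx != last))
    return batch


# Hypothetical inputs: [log VCD, height (m), altitude (m)].
batch = [[36.5, 0.0, 120.0], [37.1, 0.0, 15.0], [36.9, 1500.0, 300.0], [37.4, 0.0, 900.0]]
point_out = forward(build_network(n_out=1), batch)     # one value per sample
interval_out = forward(build_network(n_out=2), batch)  # two bounds per sample
```

The two networks differ only in the width of the final block, which matches the separation of point and interval estimation described earlier.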

Loss function
Objective functions with suitable forms are crucial for stochastic gradient descent algorithms to converge during training. Although point estimation only needs to take precision into consideration, two conflicting factors are involved in evaluating the quality of interval estimation: a higher confidence level usually yields an interval with greater length, and vice versa.
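Point estimation loss. Instead of more elaborate forms, we found that an $\ell_1$ loss is sufficient for rapid training:
$$\mathcal{L}_{\mathrm{point}} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,$$
where $y_i$ is the observed in-situ concentration and $\hat{y}_i$ is the point estimate.
Interval estimation loss. This loss is relatively complex compared to the point estimation loss. The QD loss takes the confidence level and the interval length into consideration simultaneously [38]:
$$\mathcal{L}_{\mathrm{QD}} = \mathrm{MPIW}_{\mathrm{capt}} + \lambda \, \frac{n}{\alpha (1 - \alpha)} \, \max\left(0,\ (1 - \alpha) - \mathrm{PICP}\right)^2,$$
where PICP is the proportion of true values covered by the predicted intervals, $\mathrm{MPIW}_{\mathrm{capt}}$ is the mean width of the intervals that capture their data points, and $\lambda$ weights the two terms.
On one hand, to control the confidence level of the interval estimator, $\alpha$ is set to indicate the maximum proportion of intervals that may fail to cover the true value. We set multiple values of $\alpha$, including 0.05, 0.10 and 0.20, in our model to derive interval predictions with various confidence levels and average coverage lengths, and verified that a higher $\alpha$ yields shorter intervals.
On the other hand, the average length of intervals subject to $\mathrm{PICP} > 1 - \alpha$ should be minimized. However, intervals that fail to capture their corresponding data point should not be encouraged to shrink further. The average interval length to penalize is, therefore,
$$\mathrm{MPIW}_{\mathrm{capt}} = \frac{\sum_{i=1}^{n} (\hat{y}_{U,i} - \hat{y}_{L,i}) \, \tilde{k}_i}{\sum_{i=1}^{n} \tilde{k}_i},$$
where $\tilde{k}_i = \sigma(s \cdot (y_i - \hat{y}_{L,i})) \cdot \sigma(s \cdot (\hat{y}_{U,i} - y_i))$ works as a continuous approximation of the "hard" indicator $\mathbb{1}\{\hat{y}_{L,i} < y_i < \hat{y}_{U,i}\}$, since the sigmoid function $\sigma$ provides a differentiable alternative to discrete step functions, and $s = 160$ is a hyperparameter controlling smoothness.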
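A numerical sketch of a quality-driven interval loss of this kind is given below. It assumes the form published with the QD method [38]: a captured mean interval width plus a coverage penalty weighted by a factor `lam`. The value `lam=15.0`, the sample data and the use of the soft indicator in both terms are assumptions made for illustration, not settings confirmed by the text; only the softening factor 160 and the tolerated miss rates come from the text.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_coverage(y, lower, upper, s=160.0):
    """Smooth stand-in for the indicator 1{lower < y < upper}."""
    return sigmoid(s * (y - lower)) * sigmoid(s * (upper - y))


def qd_loss(ys, lowers, uppers, alpha=0.10, lam=15.0, s=160.0):
    """Captured mean interval width plus a penalty when coverage falls below 1 - alpha."""
    n = len(ys)
    k = [soft_coverage(y, lo, up, s) for y, lo, up in zip(ys, lowers, uppers)]
    captured = sum(k)
    mpiw_capt = sum((up - lo) * ki for lo, up, ki in zip(lowers, uppers, k)) / (captured + 1e-9)
    picp = captured / n
    penalty = max(0.0, (1.0 - alpha) - picp) ** 2
    return mpiw_capt + lam * n / (alpha * (1.0 - alpha)) * penalty


# Hypothetical targets and interval bounds.
ys = [1.0, 2.0, 3.0, 4.0]
lowers = [0.5, 1.5, 2.5, 3.5]
uppers_cover = [1.5, 2.5, 3.5, 4.5]  # every interval covers its target
uppers_miss = [1.5, 2.5, 3.5, 3.9]   # the last interval misses its target

loss_cover = qd_loss(ys, lowers, uppers_cover)
loss_miss = qd_loss(ys, lowers, uppers_miss)
print(loss_cover, loss_miss)
```

With full coverage the loss reduces to the mean interval width; once coverage drops below the target, the penalty term dominates, which is the trade-off between interval length and confidence level discussed above.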

Point Estimation
The point estimation model in this study shows relatively high accuracy and is generally consistent with previous studies on the vertical distribution of HCHO. Figure 5 shows the point estimate of in-situ concentration as a function of vertical column density (VCD) and height, when the altitude is fixed at sea level. The in-situ concentration is negatively correlated with height and positively correlated with VCD. To evaluate the performance of our model, statistics including the mean absolute error (MAE) and root mean square error (RMSE) are calculated on the training and testing datasets respectively. As shown in Table 1, both MAEs and RMSEs are relatively small, which indicates that the model performs well in point estimation.
By loading the global DEM, the logarithm of VCD and the height (0 m at the surface) into the model, we derived the annual average map of the global surface HCHO distribution. As shown in Figure 5, there are generally six regions where the HCHO surface concentration is high, namely the Amazon area, the south-eastern U.S., Central and Western Africa, North Eastern India, South East Asia, and North China, with an average concentration of more than 4 μg/m³. The seasonal change of HCHO in these key areas is discussed in section 3.3. The uneven distribution of HCHO concentration between sea and land surfaces is also visible in Figure 6, which shows that the HCHO concentration is lower and more homogeneous over the sea than over land. The statistics given in Table 2 also confirm this: the annual mean surface HCHO concentration is about 2.21 μg/m³ over ocean and 2.77 μg/m³ over land.
Cities, as the regions with the densest population, deserve specific attention with regard to their surface HCHO concentration because of its known and potential harm to the people living there. Table 3 shows the surface HCHO concentration of some typical cities in these regions, where Jakarta and Singapore, two major cities in South East Asia, rank highest and second highest, reaching 6.18 and 5.83 μg/m³, respectively.
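The two accuracy statistics used to evaluate the point estimation model, MAE and RMSE, can be computed as in the sketch below. The observed and predicted values are hypothetical and serve only to show the arithmetic.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical observed and predicted surface concentrations (μg/m³).
observed = [2.0, 3.0, 5.0]
predicted = [2.5, 2.0, 5.0]

print(f"MAE = {mae(observed, predicted):.3f}, RMSE = {rmse(observed, predicted):.3f}")
```

RMSE weights large errors more heavily than MAE, so a wide gap between the two would point to a few large misses rather than uniformly small errors.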

Interval Estimation
Besides point estimation, the model in this study also estimates the upper and lower bounds of the surface concentration of HCHO, so that the uncertainty, or variability, of the surface concentration can be evaluated. In Figure 6, the relationship between the estimated upper bound, the lower bound and the point estimate is displayed in a 3D space. It is worth emphasizing that the captured uncertainty, or the interval length, delineates the variability of the data itself, not a lower trustworthiness of our model or its estimates. The confidence level, together with the covering length, lays the foundation for the trustworthiness and precision of our interval prediction. As shown in Table 4, the interval estimation model obtains covering rates, i.e. the ratios of true values covered by the predicted intervals, of 94.41% and 88.74%, exceeding the pre-set confidence levels of 0.9 and 0.8, respectively.
In addition, as expected in section 2.2.1.2, a higher confidence level yields a longer average interval length, which is 4.530 μg/m³ at the 0.9 confidence level, 17% more than the 3.864 μg/m³ at the 0.8 confidence level. The same pattern can be seen in the statistics in Table 4 for the minimum, maximum and mean values of the upper and lower bounds at the two confidence levels. However, the standard deviation of the upper bounds is larger than that of the lower bounds under both scenarios in Table 4. The density scatter plot of the two bounds in Figure 7 shows that the upper bound estimation is not deterministic, though the interval estimation successfully covers the true values (and the point estimates, as discussed below) of surface concentration. Nevertheless, the exploration of seasonal changes of HCHO in some key areas in section 3.3 suggests that seasonal variations of surface HCHO may account for the majority of the uncertainty in the interval estimation. The global distributions of the estimated upper and lower bounds are given in Figure 8(a). The upper and lower bounds generally share the same global pattern, though with different magnitudes, ranging between 3.77 and 8.83 μg/m³ for the upper bounds and from 0.52 to 1.03 μg/m³ for the lower bounds. The interval length of the 90% confidence interval is 4.77 μg/m³.
As shown in Figure 8
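The covering rate and the average interval length reported for the interval model can be computed as sketched below. The targets and bounds are hypothetical; only the two interval lengths used in the final comparison (4.530 and 3.864 μg/m³) are taken from the text.

```python
def coverage_rate(ys, lowers, uppers):
    """Share of true values that fall strictly inside their predicted interval."""
    covered = sum(1 for y, lo, up in zip(ys, lowers, uppers) if lo < y < up)
    return covered / len(ys)


def mean_interval_length(lowers, uppers):
    """Average width of the predicted intervals."""
    return sum(up - lo for lo, up in zip(lowers, uppers)) / len(lowers)


# Hypothetical true values and predicted bounds (μg/m³).
ys = [1.0, 2.0, 3.0, 4.0, 5.0]
lowers = [0.0, 1.5, 3.5, 3.0, 4.0]
uppers = [2.0, 2.5, 4.0, 5.0, 6.0]

rate = coverage_rate(ys, lowers, uppers)
length = mean_interval_length(lowers, uppers)
print(f"covering rate = {rate:.2f}, mean length = {length:.2f}")

# Relative increase in average interval length between the two confidence levels,
# using the lengths quoted in the text.
relative_increase = 4.530 / 3.864 - 1.0
print(f"relative increase = {relative_increase:.1%}")
```

An interval model meets its target when the covering rate is at least the pre-set confidence level, as with the 94.41% and 88.74% figures quoted above.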

Seasonal Changes of HCHO in Some Key regions
To better understand the seasonal variation of surface HCHO, the distribution pattern of four typical months of some key areas where surface concentration is relatively high are analyzed.
America. Figure 9 shows the surface concentration in February, May, August, and November in South America and around the Caribbean Sea. The Amazon Basin, Paraguay, and Eastern Central America have a high HCHO surface concentration in November and February, while the south-east coast of the U.S. has its highest concentration in November and is almost free from HCHO pollution in February and May. The Andes Mountains have a significantly lower concentration, with values of less than 0.5 μg/m³.
Africa. As shown in Figure 10, there are two regions in Africa whose HCHO surface concentration is relatively high. One is in the south of D. R. Congo around the city of Kolwezi, a mining center with a humid subtropical climate. The surface concentration of HCHO here reaches its maximum in February. The other pollution belt stretches along the Gulf of Guinea, which is known for its rainforest climate.

Consistency and innovativeness
The neural network models described above successfully yield the global surface distribution of HCHO with point and interval estimates. As shown in Figure 13, the results obtained through the machine learning technique are generally consistent with results from previous work, which were obtained by combining OMI total column HCHO concentration with the GEOS-Chem model from 2005 to 2016, but show less noise across the satellite track. In the blowup boxes on the right of the figure, which correspond to each result respectively, the results from previous studies bear stripes across the satellite track, whereas the new results from this study do not. In addition, the estimates in this study show a reversed trend in the Cordillera mountains area. Future validation may be needed for this case. However, since this difference occurs in places where the population is sparse, it is unlikely to have a perceivable influence on the estimation of cancer risks. The global surface concentration estimate for 2019 gives a closer look at the global distribution pattern of HCHO. HCHO tends to prevail on continental plains rather than over the ocean or in high-altitude areas. According to previous studies, the lower concentrations there can be attributed to the scarcity of VOC sources such as the chemical industry, combustion and rainforest, which supply common precursors for the free-radical reactions that produce HCHO [46,47,48]. By mapping the distribution of HCHO, two kinds of sources around the world can be preliminarily distinguished. One is plant-related, including the Amazon, South East Asia and the Gulf of Guinea; the other is human-related, including the North China Plain and the Pearl River Delta [49,50]. More work is needed to accurately identify the sources of these HCHO-polluted areas.
In addition, we introduce the interval estimation of neural network model in the conversion from VCD to global surface concentration of HCHO for the first time, increasing the credibility of the model by providing uncertainty information. This new idea can make up for the deficiency of inexplicability of the neural network model [51], thus being useful for the application of neural network models into the field of estimating atmospheric pollutant or health risk in the future.

Limitations and potential improvements
Despite the consistency and innovativeness mentioned above, the shortage of in-situ data hinders further improvement of the model accuracy. On one hand, the existing HCHO in-situ concentration data are seriously insufficient in both the spatial and temporal dimensions. Only the United States monitors HCHO in-situ concentration routinely. Even with ATom data included, in-situ concentration data in low-latitude regions are still sparse, which may lead to estimation bias in low-latitude areas such as Asia and Africa. On the other hand, it is also difficult to reach a better result by adding more covariates to our model. Experiments with additional covariates, such as latitude and month, failed with degenerate or overfitted outputs. In addition, the large gap between true values and the upper bounds from our interval estimation model may suggest that the in-situ concentration of HCHO is distributed heterogeneously across months or seasons, since the model is required to give interval estimates on the scale of a whole year rather than on a finer time scale. The seasonal changes of HCHO in some key areas, discussed in section 3.3, also show this phenomenon directly.
Therefore, as more HCHO in-situ monitoring networks develop, a larger amount of data from more diverse sites could enable a careful design of temporal data input and a better estimation of the in-situ concentration of HCHO. Meanwhile, with more Sentinel-5P data accumulating over time, the model in this study can take more factors, including latitude and season, into consideration, which could provide a more precise estimation of global health risk and economic loss for specific regions and seasons. Besides its significance for health risk, the results from this study can also support research on the generation of photochemical pollution and on the concentrations of VOCs, NO2 and other pollutants related to photochemical reactions. HCHO, as one of the most important carcinogens in the outdoor environment [2], has long drawn little attention because of the lack of ground measurements in most countries and regions, leading to a shortage of knowledge about the associated health and economic loss. Although the vertical column density of HCHO is currently available and addresses part of these concerns, it is the ground-level HCHO concentration that reflects the amount people are actually exposed to.

Health risk of HCHO in major cities
Taking 2019 as an example, we assume that the HCHO concentration remains at its 2019 level. Health risks in the main high-risk cities are calculated from the EPA inhalation unit risk estimate and population data [4,54], and are given in Table 5. More than a thousand people have the potential to develop cancer due to exposure to HCHO in each of Jakarta, Dhaka, Bangkok, Kolkata, Beijing and Guangzhou. Jakarta has the highest number of potential patients due to exposure, up to 2593. Jakarta, Singapore, Kuala Lumpur, Dhaka and Lagos have the highest per-capita risk, with 80.34, 75.79, 72.93, 71.63 and 71.37 potential patients per million, respectively. The main cities with high health risk are concentrated in Southeast Asia, a region previously neglected by academia, and may become the next hotspots for research on HCHO pollution and health risk.
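The per-million figures above are consistent with multiplying the surface concentration by the inhalation unit risk quoted in the Introduction (13 cases per million people per 1 μg/m³ of lifetime exposure). A minimal sketch of that arithmetic follows; the linear form is inferred from the reported numbers, and the population value is hypothetical.

```python
# Inhalation unit risk: 13 cases per million people per 1 μg/m³ (from the text).
UNIT_RISK_PER_UG_M3 = 1.3e-5


def cases_per_million(concentration_ug_m3):
    """Expected lifetime cases per million people at a constant concentration."""
    return concentration_ug_m3 * UNIT_RISK_PER_UG_M3 * 1e6


def expected_cases(concentration_ug_m3, population):
    """Expected lifetime cases for a given exposed population."""
    return concentration_ug_m3 * UNIT_RISK_PER_UG_M3 * population


# Surface concentrations reported in the text for two cities (μg/m³).
jakarta = cases_per_million(6.18)
singapore = cases_per_million(5.83)
print(f"Jakarta: {jakarta:.2f} per million, Singapore: {singapore:.2f} per million")

# Hypothetical population of 10 million exposed at the Jakarta concentration.
print(f"Expected cases: {expected_cases(6.18, 10_000_000):.1f}")
```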

Conclusions
Benefiting from a quality-driven interval estimation algorithm designed for neural networks, we are able to derive confidence intervals and a precise point estimate of the 2019 global surface HCHO at different confidence levels with a limited amount of data. By mapping the HCHO surface concentration distribution, we found that Southeast Asia, North China, Central and Western Africa, and the rainforest area of Latin America have more serious HCHO pollution than other regions. Major cities in these regions, such as Bangkok, Beijing, Guangzhou and Singapore, have an annual concentration over 5.00 μg/m³. The health effects of such high levels of HCHO pollution deserve more attention from academia and governments.
Our work paves a way for research on formaldehyde-related cancers, and provides guidance for policy making and insurance pricing. To the best of our knowledge, we are the first to map the global distribution of HCHO and provide insights on its potential health risks. With more HCHO VCD data from Sentinel-5P accumulated, the surface concentration of HCHO dataset covering a longer period of time will be generated, which will help for better assessment of the global risk distribution of formaldehyde-related cancers.
The data presented in this study are available at https://drive.google.com/file/d/10A2VIEHm22DF_gyCufV-pbgUdYYhNJKf/view?usp=sharing.