Generating Fine-Scale Aerosol Data through Downscaling with an Artificial Neural Network Enhanced with Transfer Learning

Spatially and temporally resolved aerosol data are essential for conducting air quality studies and assessing the health effects associated with exposure to air pollution. Because these data are often expensive to acquire and time consuming to estimate, computationally efficient methods are desirable. When coarse-scale data or imagery are available, fine-scale data can be generated through downscaling methods. We developed an Artificial Neural Network Sequential Downscaling Method (ASDM) with Transfer Learning Enhancement (ASDMTE) to translate time-series data from coarse to fine scale while maintaining between-scale empirical associations as well as inherent within-scale correlations. Using assimilated aerosol optical depth (AOD) from the GEOS-5 Nature Run (G5NR) (2 years, daily, 7 km resolution) and the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2) (20 years, daily, 50 km resolution), coupled with elevation (1 km resolution), we demonstrate the downscaling capability of ASDM and ASDMTE and compare their performance against a deep learning downscaling method, the Super Resolution Deep Residual Network (SRDRN), and a traditional statistical downscaling framework called dissever. ASDM/ASDMTE utilizes empirical between-scale associations and accounts for within-scale temporal associations in the fine-scale data. In addition, within-scale temporal associations in the coarse-scale data are integrated into the ASDMTE model through transfer learning to enhance downscaling performance. These features enable ASDM/ASDMTE to be trained on short periods of data yet achieve good downscaling performance over a longer time series. Across all test sets, ASDM and ASDMTE had mean maximum image-wise R² of 0.735 and 0.758, respectively, while SRDRN, dissever GAM and dissever LM had mean maximum image-wise R² of 0.313, 0.106 and 0.095, respectively.


Introduction
Fine-scale aerosol data provide essential support for air quality studies [1] and downstream health-related applications. Over the past several years, satellite-based aerosol optical depth (AOD) has been used for this purpose, primarily to estimate PM2.5 surfaces at fine spatial scales [2][3][4]. Satellite AOD-derived PM2.5 estimates have been used to examine health outcomes including respiratory [5][6][7] and cardiovascular [8] diseases. Generating fine-scale PM2.5 from satellite AOD has several limitations: it suffers from missing data due to cloud cover and bright surfaces [9], and it requires complex statistical or machine learning techniques that incorporate multiple external data sources [10].
Our study region encompasses several countries across Southwest Asia (Afghanistan, Iraq, Kuwait, Saudi Arabia, the United Arab Emirates, and Qatar; Figure 1). Under the ASDM framework, the fine-scale variable is modeled as a non-linear function of the coarse-scale variable and a sequence of temporally lagged fine-scale variables at the same location, adjusting for geographic information (e.g., elevation), time (day of the year) and location (latitude, longitude). To enhance the performance of ASDM, transfer learning can be incorporated: a second, similar sequential ANN model is trained on the long time series of coarse-scale data to learn its inherent temporal associations, and this model is then transferred into ASDM to enhance its downscaling performance.
We developed ASDM/ASDMTE models to downscale AOD data obtained from the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2), a satellite-based reanalysis product produced by NASA's Global Modeling and Assimilation Office (GMAO). MERRA-2 data are available for a long period (1980-present) at relatively coarse scale (∼50 km). The target for downscaling was fine-scale (∼7 km) AOD from the Goddard Earth Observing System Model, Version 5 (GEOS-5) Nature Run (G5NR), another satellite-based product [16]. At this resolution, G5NR is an informative data source for understanding local-scale air quality and as an exposure metric for health effects studies, but it is limited in temporal range (2005-2007), which restricts its broad use for long-term studies. As the fine-scale G5NR data have a limited temporal range (2 years of daily data), it was difficult to build the stable empirical associations, linking large-scale variables with local-scale variables, that traditional statistical downscaling requires. Furthermore, little external or covariate information was available at fine scales that could help with traditional downscaling. These limitations made it impractical to establish between-scale empirical associations without other prior knowledge, particularly since the single coarse-scale variable did not have enough spatial variability to predict the fine-scale variable. Lastly, even though G5NR and MERRA-2 provide the same variables over the same region and period of time, they are independent datasets that do not match on a point-to-point basis due to algorithmic differences [16]. Specifically, the mean of the G5NR 7 km grid values is not exactly equal to its coincident MERRA-2 50 km coarse grid value. We applied our ASDM and ASDMTE downscaling approaches to G5NR and MERRA-2 data for several countries in Southwest Asia (Figure 1).
ASDM/ASDMTE performance was compared with that of a deep learning downscaling method, the Super Resolution Deep Residual Network (SRDRN), and traditional statistical downscaling methods in the dissever framework, including generalized additive models (GAM) and linear regression models (LM), over the same study domain and period.

G5NR
The GEOS-5 Nature Run (G5NR) is a two-year (16 May 2005-15 May 2007) non-hydrostatic 7 km global mesoscale simulation, also produced by the GEOS-5 atmospheric general circulation model [38]. Its development was motivated by the observing system simulation experiment (OSSE) community's demand for a high-resolution sequel to the existing European Centre for Medium-Range Weather Forecasts (ECMWF) Nature Run. Like MERRA-2, G5NR includes 15 aerosol tracers [16]. It simulates its own weather system around the Earth, constrained only by surface boundary conditions for sea-surface temperatures and sea ice, daily volcanic and biomass burning emissions, and high-resolution inventories of anthropogenic sources [38]. In this study we used the full two years of the available G5NR Total Aerosol Extinction AOD at 550 nm, which has 0.0625° grid resolution (∼7 km) and daily temporal resolution.

GMTED2010 Elevation
The Global Multi-resolution Terrain Elevation Data 2010 (GMTED2010) is a global elevation model developed by the U.S. Geological Survey and the National Geospatial-Intelligence Agency [39]. The data are available at three separate resolutions (horizontal post spacing) of 30 arc-seconds (∼1 km), 15 arc-seconds (∼500 m), and 7.5 arc-seconds (∼250 m) [40]. We used the 30 arc-second resolution data, spatially averaged to match the ∼7 km G5NR grid.
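The regridding step above amounts to averaging blocks of fine elevation cells into each coarser target cell. A minimal numpy sketch, assuming the fine grid nests evenly into the coarse one (real data would also need edge and map-projection handling):

```python
import numpy as np

def block_average(grid, factor):
    """Average non-overlapping factor x factor blocks of a 2-D grid.

    Illustrative sketch of regridding ~1 km elevation toward a coarser
    target grid; `factor` and the even-tiling assumption are simplifications.
    """
    h, w = grid.shape
    assert h % factor == 0 and w % factor == 0, "grid must tile evenly"
    # Reshape so each block occupies axes 1 and 3, then average them away.
    return grid.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
```

For example, `block_average(elev_1km, 7)` would aggregate a 1 km grid by a factor of 7 in each direction.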

Downscaling Model
We propose an Artificial Neural Network Sequential Downscaling Method (ASDM) with Transfer Learning Enhancement (ASDMTE) to generate fine-scale (FS) data from coarse-scale (CS) data. The method can be formulated as follows. Let y_{i,j,t} denote the FS AOD referenced at i, j, t, where i ∈ {1, 2, ..., h}, j ∈ {1, 2, ..., w}, t ∈ {1, 2, ..., d}; h and w index latitude and longitude over the study domain and d is the time index. Similarly, let x_{i′,j′,t′} denote the CS AOD referenced at i′, j′, t′, where i′ ∈ {1, 2, ..., h′}, j′ ∈ {1, 2, ..., w′}, t′ ∈ {1, 2, ..., d′}; h′, w′ and d′ are the corresponding latitude, longitude and time indices. Although the CS data have a longer overall period of temporal coverage, the FS and CS data share the same time step (day).
The estimated downscaling model f̂ can then be denoted as:

ŷ_{i,j,t} = f̂(y_{(i,j,t−1),n}, x_{i′,j′,t′}, Ele_{i,j}, Lat_i, Lon_j, Day_t)    (1)

where Ele_{i,j}, Lat_i, Lon_j, Day_t are elevation, latitude, longitude and day of the year at i, j, t, respectively; x_{i′,j′,t′} represents the CS AOD that spatially covers y_{i,j,t} (at the same time t); and y_{(i,j,t−1),n} is a list of n temporally lagged variables at location i, j. Through f̂, we learned not only empirical associations between the CS and FS variables, x_{i′,j′,t′} and y_{i,j,t}, but also short-term temporal associations within the FS data by including n = 25 time lags of the fine-scale variable, y_{(i,j,t−1),n}. In the model we also adjusted for location (latitude, Lat_i, and longitude, Lon_j), long-term time (day of the year, Day_t), and geographic information (elevation, Ele_{i,j}), making f̂ a function of space and time. This also enabled the use of data at different locations and times to train our model, which provided more information for training and partially alleviated the issue of having limited temporally overlapping data. The larger the spatial area and temporal range, the more data we had for training; at the same time, however, the model f̂ became more complex. Because this increased complexity in the target model makes the learning task harder, we traded off data availability against model complexity.
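The point-wise inputs of the downscaling function can be assembled per grid cell as sketched below. This is an illustrative numpy sketch, not the paper's implementation; `make_samples` and its arguments are hypothetical names, and the day-of-year term is simplified to t mod 365:

```python
import numpy as np

def make_samples(fs_series, cs_series, elev, lat, lon, n_lags=25):
    """Build point-wise (features, target) pairs for one FS grid cell.

    For each day t, the target y_t is paired with the n_lags previous FS
    values, the coincident CS value, and the static covariates
    (elevation, latitude, longitude, day of year).
    """
    X, y = [], []
    for t in range(n_lags, len(fs_series)):
        lags = fs_series[t - n_lags:t]                     # y_(i,j,t-1),n
        covars = [cs_series[t], elev, lat, lon, t % 365]   # x', Ele, Lat, Lon, Day
        X.append(np.concatenate([lags, covars]))
        y.append(fs_series[t])
    return np.array(X), np.array(y)
```

With n_lags = 25, each training sample has 30 features: the 25 fine-scale lags plus the five covariates.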
To enhance the performance of f̂, we incorporated transfer learning [41] into ASDM. Machine learning methods traditionally solve isolated tasks from scratch, which makes them data hungry. Transfer learning attempts to solve this problem by transferring knowledge learned on a source task to improve learning performance on a related target task [42]; a formal definition is given in [41]. In short, transfer learning allows patterns learned within one dataset to be applied to another. Since coarse-scale data are usually cheaper to obtain and more readily available, we can use the inherent knowledge learned within them to improve the predictive performance of f̂. Thus, to make use of the spatiotemporal associations within the CS data, a transfer model was trained on CS data to learn the inherent mapping function ĝ, and this model was then transferred into ASDMTE. The transfer integration of the ASDMTE network structure is shown in Figure 2. The learned inherent function ĝ maps a sequence of lagged CS values to the current CS value and can be denoted, analogously to Equation (1), as:

x̂_{i′,j′,t′} = ĝ(x_{(i′,j′,t′−1),n})    (2)

ASDM/ASDMTE Network Structure
Given its ability to fit non-linear functions, we used an artificial neural network to model f̂; the overall network structure of ASDM/ASDMTE is shown in Figure 2. In the figure, the notation LSTM:8s represents an LSTM layer with 8 nodes that returns sequences, and Building Block:8 represents a building block with 8 nodes. Light yellow blocks indicate use of a dropout layer with dropout rate 0.5. The Transfer Block is used only in ASDMTE and is therefore connected with dashed lines.
For model fitting, longitude, latitude, day of the year and elevation were normalized to the range [0, 1]. The CS and FS AOD variables, X and Y, have a natural range of [0, 6], which is approximately the same scale as [0, 1], so they were kept on their original scale. 'Input I' used all available features except the lagged variables, i.e., X_{i′,j′,t′}, Ele_{i,j}, Lat_i, Lon_j, Day_t, and was processed by 'Process Block I'. 'Input II' was composed of the 25 FS lags y_{(i,j,t−1),25} and went through 'Temporal Block I' and 'Temporal Block II' in ASDM. When using transfer learning enhancement (ASDMTE), 'Input II' was also processed by the 'Transfer Block'. All outputs from 'Process Block I', 'Temporal Block I', 'Temporal Block II' and/or 'Transfer Block' were combined and then processed by 'Process Block II'.
Long Short-Term Memory (LSTM) [43] was used to model the within-scale temporal associations. The building block of ASDM/ASDMTE was composed of a fully connected (FC) layer, a batch normalization layer, and an optional dropout layer. Leaky ReLU [44] was used as the non-linear activation function of the FC layer to prevent dead neurons; it can be expressed as LeakyReLU(z) = z for z > 0 and αz otherwise, where we chose α = 0.1. The batch normalization layer was used to stabilize the learning process and reduce the training time [45,46]. Dropout layers with rate 0.5 were used as regularization to prevent overfitting [47,48]; dropout was applied only in selected building blocks, marked in yellow in Figure 2. The loss function of this model was the mean squared error (MSE), MSE = (1/N) Σ_{k=1}^{N} (y_k − ŷ_k)², where the sum runs over the N training samples.
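The activation and loss above are standard and can be written directly in numpy; this sketch is only to make the two formulas concrete:

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    """Leaky ReLU activation: z for z > 0, alpha * z otherwise.
    Small negative slope keeps gradients flowing, preventing dead neurons."""
    return np.where(z > 0, z, alpha * z)

def mse(y_true, y_pred):
    """Mean squared error loss over a batch of predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))
```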

Transferred Model
The transferred model was trained on CS data (MERRA-2), resulting in the learned function ĝ (Equation (2)). Its network structure is shown in Figure 3. The transferred model captured the within-scale associations in the CS data and carried this spatiotemporal knowledge into the ASDM to enhance its performance. The neural network used to learn ĝ was composed of the same building block and a similar structure as ASDM/ASDMTE. We used mean squared error (MSE) as the loss function and, to prevent overfitting, applied dropout layers and early stopping. We randomly chose 10% of available days as the validation set for early stopping. The 'Transferred Model' was integrated directly into the ASDMTE network by setting it to untrainable (i.e., its weights were not updated during training).
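The "untrainable" integration can be illustrated with a toy gradient step in which only parameters flagged as trainable are updated, so the transferred block keeps the knowledge learned from the coarse-scale series. Names and shapes here are illustrative, not from the paper's implementation:

```python
import numpy as np

# Hypothetical parameter store: (weights, trainable-flag) pairs.
params = {
    "transfer_w": (np.ones(3), False),  # learned on CS data, then frozen
    "head_w":     (np.ones(3), True),   # ASDMTE layers, still trainable
}

def sgd_step(params, grads, lr=0.1):
    """One SGD update that skips any parameter marked untrainable."""
    return {
        name: (w - lr * grads[name] if trainable else w, trainable)
        for name, (w, trainable) in params.items()
    }

grads = {"transfer_w": np.full(3, 2.0), "head_w": np.full(3, 2.0)}
params = sgd_step(params, grads)
```

After the step, only `head_w` has moved; the frozen transfer weights are untouched, which is the effect of marking a layer untrainable in a deep learning framework.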

Training Strategy
There is always a trade-off between model complexity and data size: the larger the spatial and temporal coverage of the training data, the more complex the target function f̂ becomes. As this makes learning more difficult, we simplified the task by splitting the data spatially and temporally, while maintaining a reasonable data size, and fitting separate models on each subset. Spatially, the data were grouped into four regions: 1. Afghanistan; 2. United Arab Emirates and Qatar; 3. Saudi Arabia; and 4. Iraq and Kuwait. Temporally, the data were divided approximately equally into four seasons of 91, 91, 91 and 92 days, respectively. To produce temporally continuous downscaled predictions, a 45-day overlap was added to each season, as shown in Figure 4. The model in Equation (1) illustrates prediction in the forward temporal direction, that is, predicting the future from historical observations. We also trained a backward prediction model with a slight variation of the same model format, using future observations to predict historical data (Figure 4). Training this way allowed downscaling in both directions, forward and backward in time, which was needed for our application because we aimed to downscale both before and after the 2-year training period. Consequently, 32 models (4 regions × 4 seasons × 2 directions) were fitted on all combinations of region, season and direction. Each subset was composed of the same season from two years (2005 and 2006), as shown in Figure 4. The two years of data were evenly divided into 10 parts: the last 10% of the data were used as the test set, the fourth 10% as the validation set, and the remaining 80% as the training set.
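The seasonal windows with their 45-day overlaps can be sketched as index arithmetic over one 365-day year. This is a sketch of the splitting described above; the exact boundary handling at the start and end of the year is an assumption:

```python
def season_windows(days_per_season=(91, 91, 91, 92), overlap=45):
    """Return (start, end) day-index pairs for each season, padded by
    `overlap` days on both sides (clipped to the year) so downscaled
    predictions remain temporally continuous across season boundaries."""
    year = sum(days_per_season)
    windows, start = [], 0
    for length in days_per_season:
        lo = max(0, start - overlap)
        hi = min(year, start + length + overlap)
        windows.append((lo, hi))
        start += length
    return windows
```

For instance, the second season (days 91-181) would train on days 46-226 once padded, overlapping its neighbors on both sides.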

Evaluation
The downscaling results in the same direction and time were combined spatially into whole images for evaluation. The main evaluation metrics were image-wise R² [18] and root mean square error (RMSE), defined for each image (time t) as:

R²_t = 1 − Σ_{i,j} (y_{i,j,t} − ŷ_{i,j,t})² / Σ_{i,j} (y_{i,j,t} − ȳ_t)²

RMSE_t = sqrt((1/(h·w)) Σ_{i,j} (y_{i,j,t} − ŷ_{i,j,t})²)

where ŷ_{i,j,t} is the downscaled AOD value at i, j, t; y_{i,j,t} is the corresponding true value; and ȳ_t is the mean of the true image at time t. The downscaled results of ASDM, ASDM with transfer enhancement (ASDMTE), SRDRN, and the dissever framework with GAM and LM as regressors were compared on the same test sets with the above metrics. The structure of SRDRN can be found in Wang et al. (2021) [26].
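Both image-wise metrics can be computed per daily image with numpy; the standard coefficient-of-determination and RMSE definitions are assumed to match the paper's:

```python
import numpy as np

def imagewise_r2(y_true, y_pred):
    """Image-wise R^2: coefficient of determination over all pixels of
    one day's image (1 = perfect prediction)."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

def imagewise_rmse(y_true, y_pred):
    """Root mean square error over all pixels of one day's image."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```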

Results
Same-day images of the 7 km G5NR and 50 km MERRA-2 AOD are shown in Figure 5. We note similarities in their spatial trends, with higher values in the arid regions of southeast Saudi Arabia and the United Arab Emirates (UAE), but greater definition in the fine-scale G5NR image that is particularly clear over Afghanistan. The bottom left and bottom right plots of Figure 5 show the mean image-wise R² and RMSE (respectively) of the G5NR and MERRA-2 AOD data at different temporal lags. Both G5NR and MERRA-2 show similar temporal associations: the further apart in time two images are, the less they are associated, as indicated by lower image-wise R² and higher RMSE. These similar inherent temporal associations of G5NR and MERRA-2 provided a good foundation for the ASDM assumption that local-scale AOD can be predicted not only from between-scale associations but also from inherent within-scale associations. In addition, due to differences in the generative algorithms of the G5NR and MERRA-2 AOD data, G5NR AOD has a universally higher mean and standard deviation (0.316 (0.258)) than MERRA-2 AOD (0.294 (0.197)), which is why G5NR had higher lagged RMSE and R² (Figure 5). The ASDM and ASDMTE downscaled results (Figure 6a,b) preserved very similar spatial characteristics to the true G5NR data in Figure 6d, while the SRDRN and dissever-based downscaling results (Figure 6c,e,f) exhibit clearly different patterns.

Discussion
In this study we developed an Artificial Neural Network Sequential Downscaling Method (ASDM) with Transfer Learning Enhancement (ASDMTE) that enabled coarse-scale AOD data (∼50 km) to be downscaled to a finer scale (∼7 km) with training on only a limited sample of temporally overlapping images. The ASDM/ASDMTE approach took point-wise inputs of lagged fine-scale AOD data, coarse-scale AOD data, latitude, longitude, time and elevation to predict the fine-scale AOD generated from G5NR. We found that this neural network approach was able to learn complex relationships and produce reliable predictions. Based on the comparison of image-wise R² and RMSE shown in Appendix Figures A1-A4 and Table 1, ASDM/ASDMTE showed superior downscaling performance, outperforming the CNN-based neural network (SRDRN) and the statistical downscaling approaches in dissever (GAM, LM).

Table 1. Image-wise R² and RMSE from downscaling (by method); R² is presented as Max (Mean), and RMSE is presented as Mean (SD).

Statistical downscaling has a long history, rooted in the demand to generate local-scale climate information from GCMs at lower computational cost. Traditional statistical approaches focus on establishing empirical associations between coarse-scale and fine-scale variables [22,49]. For instance, Loew et al. (2008) modeled the associations between soil moisture at 40 km resolution and its corresponding fine-scale (1 km) observations using linear regression [50]. Leveraging temporal replicates, they fit separate linear regression models independently to each fine-scale grid cell, ignoring spatial and temporal associations in either the fine- or coarse-scale data. Recently, deep learning approaches have been used that address spatial features, such as Wang et al. (2021) [26], who developed a CNN-based method, the Super Resolution Deep Residual Network (SRDRN), to downscale precipitation and temperature from coarse resolutions (25, 50 and 100 km) to fine resolution (4 km) by learning the between-scale image-to-image mapping function. However, this approach ignores the temporal associations between images.

Most current downscaling methods focus only on modeling between-scale relationships and ignore the inherent temporal associations in the data. As observed in Figure 5, there are inherent within-scale temporal associations in both the fine- and coarse-scale data: at the same location, temporally near observations tend to be correlated with each other. These associations provided essential support for downscaling and resulted in better fine-scale predictions. Essentially, the target fine-scale variable can be estimated from the coarse-scale variable as well as its own temporal lags, adjusting for geographic features, location and time.
By defining the downscaling problem as above, the ASDM/ASDMTE approach was able to take advantage of both the within-scale temporal associations in the fine-scale data and the between-scale spatial associations, giving the neural network more information to learn from than the between-scale spatial relationships alone. This richness in predictive information is especially important when data are limited, since it allows the model to be trained on a short period of overlapping data without requiring point-to-point matching of the fine- and coarse-scale images.
This setting also enabled the use of transfer learning (through ASDMTE) by leveraging the within-scale temporal associations in the coarse-scale data, which have a much longer time series. Typically in downscaling, only the temporally overlapping coarse- and fine-scale data can be used for modeling. However, in our case we wanted to downscale a longer time series, and we were able to learn from all (2000-2018) coarse-scale MERRA-2 data by training ĝ and transferring it to enhance the downscaling model. ASDM/ASDMTE relies on the same stationarity assumption as other downscaling methods: it assumes the statistical association between coarse- and fine-scale data does not change outside the model training period [51,52]. In addition, we may need to further assume stationarity of the within-scale temporal associations (i.e., temporal lags) used in the model.
Another concern with ASDM/ASDMTE is its robustness at test time. To stabilize f̂, we trained different ASDM/ASDMTE models for each season of the year and separately for different regions/countries, as described in Section 2.2.3. The shorter period of time and smaller target domain simplified the learning task of each model and, at the same time, simplified the domain to which the model needed to generalize, so we obtained more robust results when testing.
In addition, ASDM/ASDMTE was designed to solve a supervised downscaling problem, that is, to downscale coarse-scale data and validate against fine-scale data. It requires the presence of some fine-scale data, whose temporal range it can then extend computationally efficiently by exploiting the within-scale temporal associations. In the absence of any fine-scale data, ASDM/ASDMTE cannot be applied.
A further research direction is to stabilize sequential downscaling performance when only a short temporal range of fine-scale data is available but predictions are needed over a long time series. As shown in Appendix Figures A1-A4, ASDM/ASDMTE can achieve good downscaling performance, and their performance can even recover from previous poor downscaled results, but it still shows a temporally decreasing trend. Our future research will focus on learning more stable temporal associations to improve sequential downscaling performance for long time-series prediction.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: