Himawari-8 Aerosol Optical Depth (AOD) Retrieval Using a Deep Neural Network Trained Using AERONET Observations

Spectral aerosol optical depth (AOD) estimation from satellite-measured top of atmosphere (TOA) reflectances is challenging because of the complicated TOA-AOD relationship and a nexus of land surface and atmospheric state variations. This task is usually undertaken using a physical model to provide a first estimate of the TOA reflectances which are then optimized by comparison with the satellite data. Recently developed deep neural network (DNN) models provide a powerful tool to represent the complicated relationship statistically. This study presents a methodology based on DNN to estimate AOD using Himawari-8 Advanced Himawari Imager (AHI) TOA observations. A year (2017) of AHI TOA observations over the Himawari-8 full disk collocated in space and time with Aerosol Robotic Network (AERONET) AOD data were used to derive a total of 14,154 training and validation samples. The TOA reflectance in all six AHI solar bands, three TOA reflectance ratios derived based on the dark-target assumptions, sun-sensor geometry, and auxiliary data are used as predictors to estimate AOD at 500 nm. The DNN AOD is validated by separating training and validation samples using random k-fold cross-validation and using AERONET site-specific leave-one-station-out validation, and is compared with a random forest regression estimator and Japan Meteorological Agency (JMA) AOD. The DNN AOD shows high accuracy: (1) RMSE = 0.094, R2 = 0.915 for k-fold cross-validation, and (2) RMSE = 0.172, R2 = 0.730 for leave-one-station-out validation. The k-fold cross-validation overestimates the DNN accuracy as the training and validation samples may come from the same AHI pixel location. The leave-one-station-out validation reflects the accuracy for large-area applications where there are no training samples for the pixel location to be estimated. The DNN AOD has better accuracy than the random forest AOD and JMA AOD. In addition, the contribution of the dark-target derived TOA ratio predictors is examined and confirmed, and the sensitivity to the DNN structure is discussed.


Introduction
Aerosols play an important role in the Earth's radiation balance, hydrological cycle, and biogeochemical cycles [1]. Spectral aerosol optical depth (AOD), defined as the extinction of solar radiation due to aerosols, as function of wavelength and integrated over the whole atmospheric column, is one of the principal parameters retrieved from satellite observations. Many different AOD retrieval algorithms have been developed and most of them use radiative transfer models (RTMs) to develop relations between the observed top-of-atmosphere (TOA) reflectance and AOD and assume prior knowledge of the surface reflectance and the aerosol type [2,3]. The RTMs calculate the scattering and absorption of solar radiation by atmospheric molecules and aerosols, which is time consuming and, to save computer time, usually look up tables (LUTs) are created providing RTM results for discrete values of the deciding input parameters such as the angles describing the observation geometry, aerosol models, and discrete values of the AOD. Data from a wide variety of instruments on board different satellite platforms have been used for AOD retrieval. Apart from relying on RTMs, each sensor algorithm may be different and designed specifically for sensor characteristics [4], e.g., spectral band configuration [5,6], multiple view capability [7,8], and polarization [9,10].
Most satellite AOD products are derived from polar orbit satellite sensors. Recently launched new generation geostationary sensors provide bands suitable for AOD retrieval with high revisit times (every 0.5-15 min). The Advanced Himawari Imager (AHI) on board the Himawari-8 satellite, launched in 2014, is such a sensor. The current AHI aerosol product [11] retrieves AOD using a deep-blue (DB)-type method. The DB method, using a pre-calculated static surface reflectance, was originally developed for bright surface [12] and the enhanced DB algorithm can be used over all land types [13][14][15]. In addition, other classical RTM-based AOD retrieval methods were applied to AHI AOD retrieval, such as dark-target (DT) [16,17] and multiangle implementation of atmospheric correction (MAIAC) [18,19]. DT [5] assumes fixed ratios between the visible and 2.1 µm shortwave infrared band surface reflectances. MAIAC jointly retrieves the AOD and the surface anisotropic reflectance using multiple MODIS observations [20,21]. It was originally developed for the MODIS sensor and NASA plans to implement the algorithm for the geostationary satellites including Himawari in the GeoNEX project [18].
In this study, we explore the use of data-driven machine learning methods to derive AOD from AHI observations to investigate whether the data-driven method could achieve at least a similar level of accuracy as the RTM-based AOD retrieval. It is motivated by the recent development of deep neural network (DNN) algorithms [22] providing capability to model complicated relationships and by Aerosol Robotic Network (AERONET) observations providing training samples [23]. A few studies have been undertaken using a shallow artificial neural network (ANN) to directly estimate AOD from satellite TOA measurements, but these were applied to a limited study area [24][25][26][27]. Machine learning methods have also been applied to correct measurement errors in the RTM-based AOD retrieval [28][29][30], to build surface reflectance relationships in the DT algorithm [31], and to estimate AOD with surface reflectance as predictors [32]. To our best knowledge, there has been no publication related to the retrieval of aerosol properties directly using AHI TOA data based on deep neural networks.
The DNN predictors include TOA reflectance, TOA reflectance ratios derived from the DT assumption, sun-sensor geometry, and auxiliary data that are used for the RTM-based AOD retrieval so that the RTM and DNN-based AOD retrieval can be fairly compared (i.e., using the same input information). To train and validate the DNN, a dataset is created consisting of one year (2017) of AHI TOA reflectances collocated in space and time with AERONET AOD over the Himawari-8 full disk. The DNN is trained using state-of-the-art techniques to achieve its full capability. The DNN is compared to an established random forest regression algorithm that has been extensively used for estimation of atmospheric components using remote sensing data [33][34][35][36]. The paper is structured as follows. The Himawari-8, AERONET, and auxiliary data are introduced in Section 2 and the methodology is described in Section 3. The results are presented in Section 4 and the sensitivity to Remote Sens. 2020, 12, 4125 3 of 20 DNN structure and the contributions of TOA reflectance ratio predictors are discussed in Section 5. Conclusions are summarized in Section 6.

Himawari-8 TOA Reflectance and AOD, and Auxiliary Data
Himawari-8 is a geostationary satellite launched by the Japan Meteorological Agency (JMA) in 2014 and positioned at 140.7 • E, with a spatial coverage of 150 • by 150 • . The AHI on board Himawari-8 provides multi-spectral observations every 10 min. The AHI images include six bands in the solar spectrum, including three visible and three near-infrared bands with different spatial resolutions from 0.5 to 2 km at the sub-satellite point. The relative radiometric uncertainty of the AHI measurements is <3% for bands 1-4 and~5% for bands 5-6 [37,38]. The AHI data were processed into different levels. The Himawari L1 data (Himawari L1 Gridded data) distributed by JMA is generated from the Himawari standard data with re-sampling to equal latitude-longitude grids, which have a spatial coverage of 120 • by 120 • , centered at 0 • N, 140 • E. The Himawari L1 data provide the TOA reflectances in all the six reflective bands (Table 1) resampled to 2 km resolution, satellite zenith angle, satellite azimuth angle, solar zenith angle, solar azimuth angle, and observation time (UTC). In this study, the TOA reflectance data in all six reflective bands and the navigation data (including latitude, longitude, solar and sensor zenith angle, and azimuth angle) were used. Himawari L2 500 nm AOD (JMA AOD) products were used as a benchmark for comparison with machine learning-based AOD. JMA AOD is defined at 5 km spatial resolution and 10 min temporal resolution [11]. These data are generated using a radiative transfer code called system for the transfer of atmospheric radiation (STAR) that was developed by the University of Tokyo [39]. In this study, the Version 2.1 JMA Level 2 AOD was used. Both the Himawari L1 TOA and L2 AOD data at integer UTC hours were downloaded from the Japan Aerospace Exploration Agency (JAXA) Earth Observation Research Center (http://www.eorc.jaxa.jp/ptree/terms.html) (open access).
The total columnar water vapor and total columnar ozone from the ERA5 hourly data were used to correct for the influence of gas absorption on aerosol estimation (https://cds.climate.copernicus. eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=form). ERA5 is the fifth generation ECMWF atmospheric reanalysis of the global climate. ERA5 provides estimates of atmosphere variables for each hour of the day with a spatial resolution of 0.25 • × 0.25 • .
Remote Sens. 2020, 12, 4125 4 of 20 Digital elevation map (DEM) is related to atmospheric Rayleigh optical thickness and thus scattering and absorption of solar light by atmospheric molecules [40], which directly affects the estimation accuracy of the AOD. Thus, the Shuttle Radar Topography Mission (SRTM) DEM data (http://srtm.csi.cgiar.org) with a 90-m spatial resolution were used in this study.

AERONET 500 nm AOD
AERONET is a global ground-based network of sun-sky photometers [23]. AERONET sites provide AOD in seven wavelength bands (340, 380, 440, 500, 675, 870, and 1020 nm). AERONET data are categorized into different quality levels. In this study, AOD data at 500 nm with quality Level 2.0 (cloud-screened and quality assured) from the latest Version 3.0 [41] were used. The AERONET AOD has a bias of +0.02 (in AOD, unitless) and one sigma uncertainty of 0.02 [41] and is commonly used as a reference data set for satellite AOD product validation. In this study, all version 3.0 AERONET data between 60 • S-60 • N and 80 • E-180 • E available in the year 2017 were used.

Collocated AHI TOA and AERONET AOD Observations
The training and validation data samples are extracted from the collocated AHI TOA and AERONET AOD observations. The AERONET data around the satellite observation time (±5 min) were averaged, and the AHI pixels nearest to the AERONET sites were used. In addition, satellite data identified as cloud contaminated were discarded, and only AERONET data with a positive 500 nm AOD were used. The collocation results in a total of 14,154 records at 76 AERONET sites, and the detailed information is presented in Table A1 (Appendix A). The average number of collocated samples for each AERONET site is 188.7 and the maximum number is 822.

Seventeen Predictors
Seventeen predictor variables were used to train the machine learning models including the six Himawari-8 AHI TOA reflectances, three TOA reflectance ratios, DEM, AHI solar and viewing zenith, azimuth and scattering angles, and ERA-5 water vapor and ozone concentrations. These variables were selected as they are also used in the radiative transfer model-based AOD retrieval algorithms [6,11,21,42]. The solar and viewing geometries determine the path length of the interaction of the aerosol particles and sunlight, and play an important role in AOD retrieval due to aerosol scattering dependence on the sun-sensor geometry [43]. The scattering angle is calculated following Li et al. [44]. Water vapor and ozone absorb sunlight at certain wavelengths [45]. The three TOA reflectance ratios are ρ TOA blue /ρ TOA red , ρ TOA blue /ρ TOA 2.25µm , and ρ TOA red /ρ TOA 2.25µm with ρ TOA blue , ρ TOA red , and ρ TOA 2.25µm being the blue, red, and 2.25 µm shortwave infrared TOA reflectance (Table 1). These ratios are used because their surface reflectance equivalents are assumed to be fixed for dark-targets [5,6] and thus these TOA ratios may indicate AOD levels.

Deep Neural Network (DNN)
The deep neural network (DNN) is constructed by advancing the three layer (one input, one hidden, and one output layer) artificial neural network (ANN) with multiple hidden layers [22,46]. In this study, the 17 predictors are used as the input layer and the 500 nm AOD (one single value) as the output layer. The hidden layer values (each called a neuron) normally have no physical meaning and they are used as intermediate variables to relate the input and output layers. Each layer is derived from the previous layer and transforms the previous layer to more resembling to the output layer (i.e., more related to the AOD value). To be specific, each neuron value in a layer is derived as a linear combination of all the previous layer neuron values with multiple weights and one bias, and followed by a non-linear activation function. In this study, the rectified linear unit (ReLU) nonlinear function is used [47]: where z is the linear combination of all the previous layer neuron values with multiple weights and one bias and f (z) is the output neuron. The ReLU nonlinear function makes a deeper neural network more feasible to train than other non-linear functions such as the sigmoidal function used in the three-layer ANN [47]. This is because the gradient vanishing (too small gradient value) and explosive (too great gradient value) issues, that are present in the deep networks due to the accumulation of composite functions and that make gradient descent training meaningless, are suppressed by using the ReLU function [47].
The DNN actually represents output as a mathematical manipulation function of the input predictors: whereτ is the estimated AOD, F is the mathematical manipulation function consisting of a series of linear and ReLU nonlinear functions, x 1 , x 2 , . . . , x 17 are the 17 predictors, and w 1 , w 2 , . . . , w nw are the n w weights and b 1 , b 2 , . . . , b nb the n b biases used in the linear combinations to derive the next layer neuron values. The last layer has only one neuron and the neuron value derived as a linear combination of all the neuron values in the penultimate layer is considered as the estimated AOD. The regression last layer does not need the ReLU nonlinear function so that the output AOD value is not biased by Equation (1). The regression last layer does not need the softmax function (multi-class logistic function) to be converted to categorical variables that is commonly used in DNN classification. The DNN training is to find optimal weight (w 1 , w 2 , . . . , w nw ) and bias (b 1 , b 2 , . . . , b nb ) coefficients to minimize: where O is the objective function, n is the number of training samples,τ i and τ AERONET i are the ith estimated and AERONET AODs. The optimal weights and biases are solved using a conventional gradient descent algorithm [48] with partial derivatives of the objective (3) with respect to the unknown weights and biases.
The number of hidden layers and the number of neurons in each layer are predefined and normally a larger number of neurons (i.e., deeper network and wider layers) provide a better representation capability but are harder to train [22,49]. There is no optimal structure defined for all practical problems and the DNN structure was mostly developed for image classification [50][51][52] rather than regression. In this study, we adopted an established regression network to predict sub-pixel values in the computer vision field [53] and three hidden layers are used each with 256, 512, and 512 neurons, respectively. The state-of-the-art practices for DNN training were used. The mini-batch gradient descent search method was used [48] with batch size set as 256 and 200 epochs to ensure a stable and robust solution. The weight and bias were initialized randomly following He et al. [54]. Batch normalization [55] was used as regularization to avoid over-fitting. The learning rate was initialized as 0.1 and decreased to 0.01, 0.001, and 0.0001 in the 80, 120, and 160 epochs, respectively [49]. The DNN is implemented using the TensorFlow application programming interface (API) developed by Google [56].

K-Fold Cross-Validation and Leave-One-Station-Out Validation
The DNN retrieved AOD was compared with the reference AERONET AOD using the coefficient of determination (R 2 ), root mean square error (RMSE), and linear regression slopes. Conventionally, a random fraction of collocated AHI TOA and AERONET AOD observations is set aside in the machine learning training and the remaining data are used as reference for validation. However, validation variability could be produced by setting different random portions from the whole sample as the validation dataset. To reduce variability and to get more reliable validation results, we used a more sophisticated strategy, i.e., cross-validation [57]. Cross-validation is achieved by training and applying DNN multiple times, each with different training and validation samples. Consequently, each sample is used as a validation sample once and only once so that a total of 14,154 samples are validated.
Two cross-validation strategies were used: (i) k-fold cross-validation and (ii) leave-one-station-out validation. In (i) k-fold cross-validation, the 14,154 samples are randomly partitioned into k equally sized subsamples. A single subsample is used to validate the model trained by the other k−1 subsamples. The training and validation process is repeated k times so that every single sample is used as an independent validation sample once [57]. The k-fold cross-validation has been used in validation of PM2.5 estimation using machine learning methods [58][59][60]. However, in the k-fold cross-validation, the training and validation samples may come from the same AERONET site for each DNN training and validation, which could overestimate the accuracy for data-driven models. This is because the trained model may have 'remembered' the surface reflectance characteristics of the AERONET site and thus can estimate AOD well for the following TOA observations at that site. However, when a new location observation with unknown surface reflectance characteristics is fed to the model, the model could fail but it is unknown (i.e., not validated in the k-fold cross-validation). In this study, k was set at 76 so that the numbers of training samples in (i) and (ii) are comparable.
In (ii) leave-one-station-out validation, the 14,154 samples are partitioned into 76 subsamples (there are 76 AERONET stations (Table A1)) and all samples in each subsample are from the same AERONET station. A single subsample is used to validate the model trained by the other 75 subsamples. The training and validation process is repeated 76 times so that every single sample is used as an independent testing sample once and the validation samples come from different locations than the training samples. This corresponds to the real situation where the AERONET stations are distributed sparsely over the globe and the DNN model application site usually does not have AERONET observations to train the model. This has been used in validation of PM2.5 estimation using machine learning methods [36,61,62].
The DNN was compared to the random forest regression [63] algorithm for AOD retrieval. Random forest regression is widely used in remote sensing parameter retrieval [64] such as PM2.5 [64][65][66][67]. Random forest regression is an ensemble learning method by constructing a multitude of regression trees at training time and outputting the prediction by averaging the predictions from all the individual regression trees. The random forest was implemented using the R RANDOMFOREST package (http://www.r-project.org/) with default parameter settings [63]. A total of 500 trees were grown with a random selection of 63.2% of the training data in each tree and a random selection of four of the seventeen predictors per partition in the tree.

Descriptive Statistics
The mean, standard deviation, maximum, and minimum values of the AOD and the 13 basic predictor variables in the 14,154 data records are presented in Table 2. The AOD ranges from 0.00 to 2.92, which represents a wide range of atmospheric aerosol concentrations. The mean AOD value is 0.32, which is greater than the global multi-year mean MODIS AOD value of 0.19 reported in Remer et al. [68] because the Himawari-8 covers mainly Asia where AOD is higher than the global average due to high pollution levels [69]. The mean TOA reflectance values in the visible bands decrease from blue to green to red, which is a typical reflectance pattern that is affected by the path radiance [70]. This is because the path radiance is largest in the blue visible band and smallest in the red visible band. The solar zenith and azimuth and viewing zenith and azimuth angles all cover a wide range except that the minimum viewing zenith is 17.36 • , which is reasonable because the Himawari-8 is positioned over ocean (140.7 • E) and the AERONET sites are mostly located over the continental land.  figures. Both random forest and DNN compare favorably with the AERONET reference data (R 2 of 0.86 and 0.91, respectively, and RMSE of 0.12 and 0.09), showing capability of the two machine leaning models of estimating the AOD. The DNN predictions are better than those of the random forest regression which tends to slightly overestimate the low AOD values and underestimate the high AOD values. It creates a skewed distribution in the random forest AOD scatterplot (Figure 1a) and thus a much lower linear regression slope (0.768) against the AERONET AOD than that of the DNN (0.930).  Figure 2 shows density scatterplots between the leave-one-station-out validation machine learning AOD estimate (left: random forest; right: DNN) and the AERONET AOD. The R 2 , RMSE, and linear regression equations are also shown in the figures. Similar to the k-fold cross-validation, the DNN (RMSE ≈ 0.17, and R 2 ≈ 0.73) has better performance than the random forest in all the evaluation metrics. This is especially true for the linear regression slopes (DNN 0.827 and random forest 0.554), indicating that random forest overestimation of low AOD values and underestimation of high AOD values are more evident than that in the k-fold cross-validation.   Figure 3 shows a map of the study area with the AERONET sites color-coded according to the RMSE for the DNN leave-one-station-out AOD validation at each individual site. The RMSE spatial variation is evident and is smaller than 0.2 for 63 out of the 76 sites and smaller than 0.3 for 73 sites. The three sites with RMSE >0.3 are Lake_Lefroy (0.565), Dhaka_University (0.455), and Bamboo (0.397). The high RMSE in Lake_Lefroy is due to the high surface reflectances and dust aerosol model. Its unique surface and aerosol characteristics make the data collected from other stations for training in the leave-one-station-out validation less representative. The Dhaka_University site high RMSE is related to high pollution in Bangladesh [71] and the average AOD during the study period was 1.00. The number of collocated AOD data pairs at the Bamboo site was only two (Table A1), i.e., insufficient to derive a statistically meaningful value.  in the leave-one-station-out validation less representative. The Dhaka_University site high RMSE is related to high pollution in Bangladesh [71] and the average AOD during the study period was 1.00. The number of collocated AOD data pairs at the Bamboo site was only two (Table A1), i.e., insufficient to derive a statistically meaningful value.  Figure 3 shows a map of the study area with the AERONET sites color-coded according to the RMSE for the DNN leave-one-station-out AOD validation at each individual site. The RMSE spatial variation is evident and is smaller than 0.2 for 63 out of the 76 sites and smaller than 0.3 for 73 sites. The three sites with RMSE >0.3 are Lake_Lefroy (0.565), Dhaka_University (0.455), and Bamboo (0.397). The high RMSE in Lake_Lefroy is due to the high surface reflectances and dust aerosol model. Its unique surface and aerosol characteristics make the data collected from other stations for training in the leave-one-station-out validation less representative. The Dhaka_University site high RMSE is related to high pollution in Bangladesh [71] and the average AOD during the study period was 1.00. The number of collocated AOD data pairs at the Bamboo site was only two (Table A1), i.e., insufficient to derive a statistically meaningful value.

Comparison with the JMA AOD Product
For the comparison of the DNN estimated AOD with the official JMA AOD product, those samples were selected where both methods provided an AOD value. This provided a data sample with 7695 data pairs which were compared with AERONET AOD as reference, i.e., the leave-one-station-out validation results were compared. Density scatterplots of the DNN estimated or JMA AOD versus AERONET AOD are presented in Figure 4. Note that Figure 4b is a subset of Figure 2b. Figure 4 shows that the validation results for the DNN-estimated AOD are much better that for the JMA AOD product.

Comparison with the JMA AOD Product
For the comparison of the DNN estimated AOD with the official JMA AOD product, those samples were selected where both methods provided an AOD value. This provided a data sample with 7695 data pairs which were compared with AERONET AOD as reference, i.e., the leave-onestation-out validation results were compared. Density scatterplots of the DNN estimated or JMA AOD versus AERONET AOD are presented in Figure 4. Note that Figure 4b is a subset of Figure 2b. Figure 4 shows that the validation results for the DNN-estimated AOD are much better that for the JMA AOD product.  To illustrate the advantage of the Himawari-8 capability of obtaining images with high frequency, 10-day time series of the DNN-retrieved AOD are presented in Figures 5 and 6  To illustrate the advantage of the Himawari-8 capability of obtaining images with high frequency, 10-day time series of the DNN-retrieved AOD are presented in Figures 5 and 6, over the Birdsville AERONET site (in Australia) and the Pokhara AERONET site (in Nepal), together with AERONET AOD (black dots) and the JMA (top figures red dots) AOD. The DNN AOD is plotted in the bottom figures (red dots). The two AERONET sites were selected for this comparison because of the large number (822 for Birdsville and 717 for Pokhara) of collocated satellite and AERONET data samples. The data in Figures 5 and 6 show that the DNN AOD is more consistent with the AERONET AOD values than the JMA product and better captured the diurnal variation of AERONET AOD. Furthermore, the DNN AOD was more temporally stable than the JMA AOD. JMA AOD tends to overestimate the AOD values at the Birdsville site ( Figure 5) and underestimate the AOD values at the Pokhara site ( Figure 6). The discrepancies are ascribed to the imprecision in the pre-calculated surface reflectances used in the JMA model-based AOD retrieval.

Differences between the K-Fold Cross-Validation and Leave-One-Station-Out Validation
In both the leave-one-station-out validation ( Figure 2) and k-fold cross-validation (Figure 1), the DNN training and validation was run 76 times so that each collocated TOA and AOD sample was independently validated. The only difference is that in each run, the validation and training samples come from different AERONET stations for the leave-one-station-out validation but they could come from the same AERONET stations for k-fold cross-validation. Comparison of Figures 1 and 2 indicates that the accuracy of the leave-one-station-out validation strategy is much lower than that of the k-fold

Differences between the K-Fold Cross-Validation and Leave-One-Station-Out Validation
In both the leave-one-station-out validation ( Figure 2) and k-fold cross-validation (Figure 1), the DNN training and validation was run 76 times so that each collocated TOA and AOD sample was independently validated. The only difference is that in each run, the validation and training samples come from different AERONET stations for the leave-one-station-out validation but they could come from the same AERONET stations for k-fold cross-validation. Comparison of Figures 1 and 2 indicates that the accuracy of the leave-one-station-out validation strategy is much lower than that of the k-fold cross-validation strategy. This is expected because the trained model may be less representative to a

Differences between the K-Fold Cross-Validation and Leave-One-Station-Out Validation
In both the leave-one-station-out validation ( Figure 2) and k-fold cross-validation (Figure 1), the DNN training and validation was run 76 times so that each collocated TOA and AOD sample was independently validated. The only difference is that in each run, the validation and training samples come from different AERONET stations for the leave-one-station-out validation but they could come from the same AERONET stations for k-fold cross-validation. Comparison of Figures 1 and 2 indicates that the accuracy of the leave-one-station-out validation strategy is much lower than that of the k-fold cross-validation strategy. This is expected because the trained model may be less representative to a new site than the sites used for training since the climate, surface reflectance characteristics, aerosol optical properties, and sun-sensor geometries may vary among AERONET stations. The leave-one-station-out validation reflects the accuracy for large-area applications where there are mostly no AERONET observations for training DNN and the k-fold cross-validation overestimates the accuracy for large-area applications.
The difference between the two validation results is exemplified by the "Fowlers_Gap" station (Australia) results (Figure 7). This station was selected because of the large amount of collocated AHI TOA and AERONET AOD samples (805, Table A1) and because of the contrasting accuracy difference between the two validation results. In the k-fold cross-validation (Figure 7a), each of the 76 subsamples include 2 to 21 "Fowlers_Gap" station samples. The training data for each subsample validation use the other 75 subsamples and thus include 784-803 "Fowlers_Gap" samples. In the leave-one-station-out validation (Figure 7b), only one of the 76 subsamples includes "Fowlers_Gap" station samples and it uses all 805 samples. The training data for this subsample validation do not include samples from the "Fowlers_Gap" station. The low performance of the leave-one-station-out validation is likely due to the location of the "Fowlers_Gap" station in the typical Australian dry desert characterized by high surface reflectance in all six bands which is seldom seen at the other 75 stations. The station performance is expected to be improved when more collocated samples with desert surface AERONET stations are included.
Remote Sens. 2020, 12, x FOR PEER REVIEW 11 of 19 TOA and AERONET AOD samples (805, Table A1) and because of the contrasting accuracy difference between the two validation results. In the k-fold cross-validation (Figure 7a), each of the 76 subsamples include 2 to 21 "Fowlers_Gap" station samples. The training data for each subsample validation use the other 75 subsamples and thus include 784-803 "Fowlers_Gap" samples. In the leave-one-station-out validation (Figure 7b), only one of the 76 subsamples includes "Fowlers_Gap" station samples and it uses all 805 samples. The training data for this subsample validation do not include samples from the "Fowlers_Gap" station. The low performance of the leave-one-station-out validation is likely due to the location of the "Fowlers_Gap" station in the typical Australian dry desert characterized by high surface reflectance in all six bands which is seldom seen at the other 75 stations. The station performance is expected to be improved when more collocated samples with desert surface AERONET stations are included.

The DNN Machine Learning Advantage
DNN performs better than random forest regression which is expected from the many comparisons between the two models for remote sensing image classification reaching similar conclusions [72][73][74]. Furthermore, the random forest regression tends to underestimate high AOD values and overestimate low AOD values due to the nature of the tree split algorithm for regression [75]. In each decision tree of the random forest, the mean value of the node samples are used as the node prediction [76]. The mean value by definition will overestimate low values and underestimate high values.
There is an evident advantage of the DNN machine learning method over an RTM model-based method (i.e., JMA AOD) (Figure 4). The JMA AOD has been systematically validated in many studies

The DNN Machine Learning Advantage
DNN performs better than random forest regression which is expected from the many comparisons between the two models for remote sensing image classification reaching similar conclusions [72][73][74]. Furthermore, the random forest regression tends to underestimate high AOD values and overestimate low AOD values due to the nature of the tree split algorithm for regression [75]. In each decision tree of the random forest, the mean value of the node samples are used as the node prediction [76]. The mean value by definition will overestimate low values and underestimate high values.
There is an evident advantage of the DNN machine learning method over an RTM model-based method (i.e., JMA AOD) (Figure 4). The JMA AOD has been systematically validated in many studies [19,77]. The accuracy metrics of the DNN-based AOD with RMSE 0.172 and R 2 0.730 are promising. These metrics are in the range of those reported for the validation of the most recent MODIS Collection 6.1 3 km DT AOD products using global AERONET observations up to 2015, for which the reported RMSE ranges from 0.13 to 0.18 and R 2 from 0.67 to 0.76 ( Table 1 in Gupta et al. [78]) for two different sensors (Terra or Aqua) and depending on the quality flag considered. The DNN method can potentially be used operationally for AOD estimation.
The DNN model, once trained, can efficiently be applied to pixel observation which can save the time consuming RTM run. The training and prediction need 5.5 h and 0.9 s, respectively, for the 3-hidden-layers DNN model for the leave-one-station-out validation. The 0.9 s are for all the 14,154 sample predictions. The 5.5 h are for the 76 times training, each using~13,967 samples (14,154 × 75/76). The training was run 76 times in this study for validation purposes only and the operational use needs the training run only once (0.07 h). The codes were written in R using Tensorflow API for R and the algorithm was run on a Linux station with 48 cores and 512 GB memory. The DNN was paralleled using 16 cores.

The Contribution of the Dark-Target (DT) Derived TOA Ratio Predictors
In this study, three TOA reflectance ratios ρ TOA blue /ρ TOA red , ρ TOA blue /ρ TOA 2.25µm , and ρ TOA red /ρ TOA 2.25µm were used as predictors, which are derived from the TOA reflectance based on the physical-based DT AOD retrieval algorithm. To examine how these variables contribute to the DNN-based AOD estimation, the 3-layer DNN was estimated without using these three predictors and the results were validated using the leave-one-station-out validation strategy. The density scatterplot for this validation is presented in Figure 8. Comparison of Figure 8 with Figure 2b clearly shows the value of including the TOA ratios. Without these three variables, the R 2 decreased by 0.02 and RMSE increased by about 0.01.

The Contribution of the Dark-Target (DT) Derived TOA Ratio Predictors
In this study, three TOA reflectance ratios / , / . , and / . were used as predictors, which are derived from the TOA reflectance based on the physical-based DT AOD retrieval algorithm. To examine how these variables contribute to the DNN-based AOD estimation, the 3-layer DNN was estimated without using these three predictors and the results were validated using the leave-one-station-out validation strategy. The density scatterplot for this validation is presented in Figure 8. Comparison of Figure 8 with Figure 2b clearly shows the value of including the TOA ratios. Without these three variables, the R 2 decreased by 0.02 and RMSE increased by about 0.01.

Sensitivity to the DNN Structure
Generally, the more complicated the DNN structure, the more representative the estimates will be. However, a more complicated DNN structure requires more training samples for stability and reliability. To find the optimum between the number of hidden layers and representativeness, different numbers of hidden layers (1, 2, 3, 7, and 18) were tested. The number of neurons in each hidden layer is shown in Table 3. The 1-and 2-hidden layer models are just simplified versions of the

Sensitivity to the DNN Structure
Generally, the more complicated the DNN structure, the more representative the estimates will be. However, a more complicated DNN structure requires more training samples for stability and reliability. To find the optimum between the number of hidden layers and representativeness, different numbers of hidden layers (1, 2, 3, 7, and 18) were tested. The number of neurons in each hidden layer is shown in Table 3. The 1-and 2-hidden layer models are just simplified versions of the 3-hidden layer model described in Section 3. The 18-hidden layer model is derived by adapting the state-of-the-art Visual Geometry Group (VGG) model developed by the Oxford University [51] by changing all the convolution layers to plain fully connected layers and the 7-hidden layer model is its simplified version ( Table 3). The results are as expected (Figure 9), i.e., the accuracy increases with the number of layers initially but levels off when more than 3-hidden-layers are introduced. To use more complicated DNN, e.g., a 7-hidden-layer model, future research with more training samples should be conducted. Table 3. The different CNN structures and their numbers of hidden neurons in each layer. These neuron numbers are used in the classical convolutional neural network models in the computer vision field [51,53].  Table 3. The different CNN structures and their numbers of hidden neurons in each layer. These neuron numbers are used in the classical convolutional neural network models in the computer vision field [51,53].

Conclusions
A DNN algorithm to retrieve AOD from Himawari-8 AHI TOA reflectances has been presented. The six band TOA reflectances, three ratios derived from TOA reflectance, the sun-senor geometries, the reanalysis water vapor and O3 concentrations, and the DEM, which are also used in RTM-based AOD retrievals, are used as DNN predictors. One year (2017) of collocated Level 1 AHI TOA and Version 3 Level 2.0 AERONET AOD observations over the Himawari-8 full disk were used to derive the training and validation samples. The DNN was trained with state-of-the-art techniques and applied to independent validation samples that were not used in the training. The cross-validation is used to reduce the sample variability in randomly sampled training data. The results were compared with those from a random forest regression algorithm and with the JMA AOD products. The following conclusions can be drawn: (1) The leave-one-station-out validation shows the capability of the DNN algorithm for systemic AOD retrieval over large-areas using AHI data (RMSE = 0.172, R 2 = 0.730).

Conclusions
A DNN algorithm to retrieve AOD from Himawari-8 AHI TOA reflectances has been presented. The six band TOA reflectances, three ratios derived from TOA reflectance, the sun-senor geometries, the reanalysis water vapor and O3 concentrations, and the DEM, which are also used in RTM-based AOD retrievals, are used as DNN predictors. One year (2017) of collocated Level 1 AHI TOA and Version 3 Level 2.0 AERONET AOD observations over the Himawari-8 full disk were used to derive the training and validation samples. The DNN was trained with state-of-the-art techniques and applied to independent validation samples that were not used in the training. The cross-validation is used to reduce the sample variability in randomly sampled training data. The results were compared with those from a random forest regression algorithm and with the JMA AOD products. The following conclusions can be drawn: (1) The leave-one-station-out validation shows the capability of the DNN algorithm for systemic AOD retrieval over large-areas using AHI data (RMSE = 0.172, R 2 = 0.730). (2) The k-fold cross-validation with RMSE = 0.094 and R 2 = 0.915 overestimates the accuracy for large-area applications. (3) DNN estimated AOD agrees better with AERONET measurements than random forest AOD estimates and the JMA official AOD product. (4) Some variables that are important for physical model-based AOD estimation can also improve the DNN AOD estimation. It highlights the importance of the domain knowledge and expertise when using data driven models for aerosol estimation.
The DNN machine learning method does not use assumptions on the surface reflectance and aerosol models. Future work is encouraged to examine how to use the aerosol model assumptions to help improve data-driven-based AOD predictions. One possible way is to train multiple DNN models each applied for a particular aerosol model which can be predefined based on the location geography and climate characteristics.