Aerosol Optical Depth Retrieval for Sentinel-2 Based on Convolutional Neural Network Method

Abstract: Atmospheric aerosols significantly affect the climate, the environment, and public health, and Aerosol Optical Depth (AOD) is a fundamental optical characteristic parameter of aerosols, so it is important to develop methods for obtaining AOD. In this work, a novel AOD retrieval algorithm based on a Convolutional Neural Network (CNN) that can provide continuous and detailed aerosol distributions is proposed. The algorithm utilizes data from Sentinel-2 and the Aerosol Robotic Network (AERONET) spanning 2016 to 2022. The CNN AOD data are consistent with the AERONET measurements, with an R² of 0.95 and RMSE of 0.049 on the test dataset. CNN demonstrates superior performance in retrieving AOD compared with other algorithms. CNN retrieves AOD well over high-reflectance surfaces, such as urban and bare soil, with RMSEs of 0.051 and 0.042, respectively. CNN efficiently retrieves AOD in different seasons, but it performs better in summer and winter than in spring and autumn. In addition, to study the relationship between image size and model retrieval performance, image datasets of 32 × 32, 64 × 64, and 128 × 128 pixels were created to train and test the CNN model. The results show that the 128-size CNN performs better because large images contain rich aerosol information.


Introduction
Atmospheric aerosols are solid, liquid, or solid-liquid mixed particles suspended in the atmosphere with an aerodynamic diameter of less than 100 µm. These particles are mainly distributed in the troposphere and stratosphere, the lower portion of the atmosphere. Aerosols exhibit a complex and variable composition resulting from a combination of natural emissions and human activities, including anthropogenic aerosols, such as sulfates and biomass-burning aerosols, as well as natural aerosols, such as sea salt, mineral dust, and primary biological aerosol particles [1].
Aerosol, which plays a crucial role in the Earth's atmospheric system, has a significant influence on radiation balance, climate change, atmospheric environmental quality, and human health [2][3][4][5][6]. Aerosol particles can change Earth's radiation budget by scattering and absorbing incoming solar radiation [7]. Aerosols also have an indirect effect on climate by altering cloud microstructure [8,9]. In addition to the above-mentioned aerosol-cloud microphysical effects, aerosols can change the atmospheric thermodynamic characteristics through aerosol-radiation interaction, thereby altering the atmospheric circulation and thus affecting the occurrence of clouds and precipitation [10]. For example, black carbon aerosols can heat the atmosphere, evaporate cloud droplets to reduce cloud cover [11], create stable atmospheric stratification [12], weaken convection, and reduce rain showers. High concentrations of aerosol particles seriously affect human health [13][14][15]. A number of studies have shown that the incidence of cardiopulmonary disease is closely related to the content of particulate matter in polluted air [16][17][18], especially the inhalable particulate matter.

Aerosol Robotic Network (AERONET) Data
AERONET is a global aerosol optical property monitoring network established by the National Aeronautics and Space Administration. This network uses the CIMEL automatic sun-photometer as the basic observing instrument [40], which observes aerosol characteristics at eight wavelengths from the visible to the near-infrared every 15 min, and the network consists of more than 1500 sites around the world [41]. These sites have been providing researchers with long-term, stable, and easily accessible aerosol data for decades. AERONET aerosol data are divided into three levels: Level 1.0 (unscreened), Level 1.5 (cloud-screened and quality-controlled), and Level 2.0 (cloud-screened and quality-assured). The number of ground-based AODs available at Level 2.0 is limited; accordingly, Level 1.5 data are selected for unified sampling to obtain sufficient samples. East Asia is an ideal research area for training and testing an AOD retrieval model because of the complex and variable composition of its aerosols and the abundant distribution of AERONET sites. V3 Level 1.5 data from 22 sites in East Asia from 2016 to 2022 were obtained in this study based on the climate, the land cover types around the sites, and the effectiveness and availability of observation data. Figure 1 shows the distribution of the 22 sites used in this work, and Table 1 provides the site name, location, abbreviation, corresponding coordinates, and the number of matches with satellite images for each site.

Sentinel-2 Data
Sentinel-2 is an Earth observation mission in the European Space Agency's Copernicus program, and it consists of two polar-orbiting satellites: Sentinel-2A and Sentinel-2B. Each satellite carries a multi-spectral instrument that acquires 13 spectral bands from the visible to the shortwave infrared along a 290 km-wide orbital swath. The revisit period is 10 days for each satellite and 5 days for the combined system [42]. Sentinel-2 provides land surface images with spatial resolutions of 10, 20, and 60 m/pixel covering the global land, which is convenient for scholars to apply in remote sensing analysis fields, such as forest monitoring, land cover change monitoring, and natural disaster management. Sentinel-2 is distinguished by combining high spatial resolution with relatively high temporal resolution, making it possible to obtain a continuous and detailed aerosol distribution map. The European Space Agency provides free Sentinel-2 Level-1C products, which are top-of-atmosphere (apparent) reflectance products that have been finely orthorectified and geometrically corrected to sub-pixel accuracy. In this study, Sentinel-2 L1C images in Bands 2, 4, 8, and 12 from 2016 to 2022 were selected (Table 2).

MODIS and Himawari-8 Data
MODIS has provided scholars with different types of global aerosol products for decades; the MODIS AOD standard products over land include AOD retrieved by the MAIAC algorithm, the DT algorithm, the DB algorithm, and the combined Dark Target and Deep Blue (DTB) algorithm [34]. At present, the MODIS MAIAC aerosol products have been developed to the C6.0 version, and the DT, DB, and DTB aerosol products to the C6.1 version [28]. In 2015, the Japan Meteorological Agency began operating the geostationary satellite Himawari-8 and providing remote sensing data with a high temporal resolution (10 min) [43]. The MODIS data used in this study are the daily aerosol products listed in Table 3. The Himawari-8/AHI Level 3 hourly Yonsei Aerosol Retrieval (YAER) product with a spatial resolution of 0.05°/pixel was also used in this work [44]. These aerosol products were selected to evaluate the reasonability of the spatial distribution of the CNN AOD.

Data Preprocessing
The dataset was constructed using multiple data preprocessing steps, including band interpolation, resampling, pixel screening, and spatiotemporal matching (Figure 2). Because the CNN AOD must be compared with MODIS aerosol products, which report AOD at 550 nm, a wavelength AERONET does not observe, band interpolation is required. The AERONET AODs at 500 and 675 nm are interpolated to an AOD at 550 nm using the Ångström exponent:

α = −ln(τ_500 / τ_675) / ln(500/675)

τ_550 = τ_500 × (550/500)^(−α)

where τ_500 denotes the AOD at 500 nm, τ_675 represents the AOD at 675 nm, α is the Ångström exponent over 500-675 nm, and τ_550 indicates the AOD at 550 nm obtained by interpolation.
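The interpolation above can be sketched in a few lines of Python (the function name is illustrative, not from the paper's code):

```python
import math

def aod_550(tau_500: float, tau_675: float) -> float:
    """Ångström interpolation of AOD from the 500/675 nm pair to 550 nm."""
    # Ångström exponent over the 500-675 nm wavelength pair
    alpha = -math.log(tau_500 / tau_675) / math.log(500 / 675)
    # Power-law interpolation to 550 nm
    return tau_500 * (550 / 500) ** (-alpha)
```

By construction, the interpolated value lies between the two measured AODs when the spectrum is monotonic.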
Considering that Sentinel-2 L1C products are resampled at different ground sampling distances (10, 20, and 60 m/pixel) based on the native resolution of the different spectral bands [45], the L1C products must first be resampled to a unified spatial resolution before aerosol retrieval. In this study, all satellite images were resampled to 10 m/pixel. The satellite images are pixel-screened to remove samples affected by light, clouds, cloud shadows, and snow, improving the quality of the samples and minimizing the influence of these factors. The AERONET AOD data after band interpolation are matched with the screened satellite images. Assuming that the aerosol concentration does not change over a short period of time, the average sun-photometer AOD within ±30 min of the satellite overpass is matched with the satellite image. Assuming that the aerosol properties do not change within a 1.28 km × 1.28 km area, a 128 × 128-pixel Sentinel-2 image centered on the AERONET site is cropped as a satellite image sample. The satellite image and the ground-based AOD at the same time and location are paired together to provide samples for the model.
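The ±30 min temporal matching step can be sketched as follows (function and variable names are illustrative, not the paper's implementation):

```python
from datetime import datetime, timedelta

def match_aod(obs, overpass, window_min=30):
    """Average the ground-based AOD records within +/-window_min minutes
    of the satellite overpass time.

    obs: list of (timestamp, aod) sun-photometer records.
    Returns None when no record falls inside the window (no match)."""
    half = timedelta(minutes=window_min)
    hits = [aod for t, aod in obs if abs(t - overpass) <= half]
    return sum(hits) / len(hits) if hits else None
```

A pair is formed only when the returned average is not None, mirroring the matching criterion described above.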
When training deep learning models, data augmentation methods, such as rotation, scaling, and translation, are frequently used to produce more samples and improve the generalization ability and robustness of the model [46]. The remote sensing images are processed with a random horizontal translation of up to 0.1 of the image width to expand the dataset, enabling the model to better learn features from these images and avoid overfitting. In this study, 2575 pairs of samples were successfully matched from the AERONET AOD data and Sentinel-2 images. Before training, all image samples are resized to 128 × 128 pixels. The 2575 pairs of samples are randomly assigned to the training dataset (2060 pairs) and the test dataset (515 pairs) in an 8:2 ratio.
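A minimal sketch of the augmentation and the 8:2 split, assuming numpy and using a wrap-around roll as a stand-in for the edge handling, which the paper does not specify:

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_horizontal(img, frac=0.1):
    """Randomly translate an image left/right by up to frac of its width."""
    max_px = int(img.shape[1] * frac)
    dx = int(rng.integers(-max_px, max_px + 1))
    # np.roll wraps pixels around; a real pipeline would pad instead
    return np.roll(img, dx, axis=1)

def split_8_2(n_samples):
    """Random 8:2 train/test split of sample indices."""
    idx = rng.permutation(n_samples)
    n_train = int(n_samples * 0.8)
    return idx[:n_train], idx[n_train:]
```

With 2575 matched pairs, this split reproduces the 2060/515 partition used in the study.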

CNN Model
CNN, which is a deep learning algorithm, consists of a series of convolution and pooling layers that extract relevant features from the input image, followed by one or more fully connected layers that use these features to make a prediction. It has the characteristics of sparse connections, parameter sharing, and translation invariance, which greatly reduce the number of network parameters to be optimized, improve the training speed, and facilitate the extraction of local features in images. CNN is able to capture spatial features and patterns in images using a hierarchical architecture of layers that perform convolution operations and extract features at different levels of abstraction [47], so this algorithm has remarkable advantages for image analysis and recognition. The CNN model proposed in this work consists of 10 convolutional layers, 4 max-pooling layers, 1 global max-pooling layer, and 1 fully connected layer, as shown in Figure 3. Every convolutional block is followed by a max-pooling layer to reduce the number of parameters and compress the image features, lowering computation and memory consumption. The last max-pooling layer is followed by a global max-pooling layer, which flattens the multidimensional feature maps into a feature vector of size 512. In addition, dropout regularization is used to randomly turn off some neurons with a certain probability during each training iteration [48]. Accordingly, not every neuron participates in every training step, and the updating of network weights no longer depends on the joint action of hidden nodes with fixed relationships. This approach prevents some features from being effective only in the presence of other specific features, reduces the model's dependence on particular local features, and enhances its generalization ability.
All convolutional layers use small 3 × 3 convolutional filters with a stride of 1 and "same" padding to maintain the spatial resolution of the feature map. All max-pooling operations are performed over a 2 × 2 window with a stride of 2, halving the spatial resolution of the feature map. The loss function, activation function, and optimizer must be configured before training. After the training data of each iteration are input into the model, the model outputs a predicted value through forward propagation, and the loss function calculates the difference between the predicted and true values. The model then updates its parameters through back propagation to minimize the loss, bringing the predicted value closer to the true value. Mean Absolute Error (MAE) was used as the loss function of the model in this study. The nonlinear ReLU activation function is selected to minimize the interdependence between parameters. During back propagation, the optimizer guides the parameter updates in the correct direction so that the loss value stays close to the global minimum. The Adam optimization algorithm is used for network training, with the learning rate set to 0.001, to minimize the MAE between the retrieved AOD and the true AOD. After the CNN is trained, the model with the smallest MAE between CNN AOD and AERONET AOD is selected as the final CNN model.
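As a rough illustration of the information flow, the following pure-Python sketch traces how the feature-map shape evolves through the network for a 128 × 128 input. The per-block channel counts are assumptions in the VGG style; the paper specifies only the final 512-dimensional feature vector.

```python
def shape_flow(size=128, channels=(64, 128, 256, 512, 512)):
    """Trace (spatial size, channels) through five conv blocks.

    3x3 'same' convs with stride 1 keep the spatial size; each of the
    four 2x2/stride-2 max pools halves it; the last block ends in global
    max pooling, which collapses the grid to a channel-length vector."""
    h = size
    trace = []
    for i, c in enumerate(channels):
        if i < 4:        # the four max-pooling layers
            h //= 2
        trace.append((h, c))
    feature_len = channels[-1]  # vector length after global max pooling
    return trace, feature_len
```

With these assumptions, the spatial grid shrinks 128 → 64 → 32 → 16 → 8 before global max pooling produces the 512-length feature vector fed to the fully connected layer.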

Evaluation Methods
Spearman R, Pearson R, R², Root Mean Square Error (RMSE), MAE, Within the MSI Expected Error (Within EE), Above EE, and Below EE were used as evaluation indicators in this study. Pearson R is sensitive to outliers and suitable for evaluating linear relationships, whereas Spearman R is not affected by outliers and is suitable for evaluating monotonic, including nonlinear, relationships. The model performance can be effectively evaluated using these two correlation coefficients as evaluation indexes.
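The scalar indicators can be sketched in numpy as follows. The EE envelope ±(0.05 + 0.15·AOD) is the common MODIS over-land form and is an assumption here, since the paper does not spell out the MSI EE definition; Spearman R is omitted (it would use `scipy.stats.spearmanr`).

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute RMSE, MAE, Pearson R, R2, and the EE-envelope fractions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    pearson = float(np.corrcoef(y_true, y_pred)[0, 1])
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    ee = 0.05 + 0.15 * y_true        # assumed EE envelope
    within = float(np.mean(np.abs(err) <= ee))
    above = float(np.mean(err > ee))
    below = float(np.mean(err < -ee))
    return {"RMSE": rmse, "MAE": mae, "PearsonR": pearson, "R2": r2,
            "WithinEE": within, "AboveEE": above, "BelowEE": below}
```

Within EE, Above EE, and Below EE always sum to 1, which is why the paper reports them as complementary percentages.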
Atmosphere 2023, 14, 1400
Random Forest (RF) and VGG16 were selected for comparison with the CNN model to verify its rationality. RF is an ensemble learning algorithm [49] that can evaluate the importance of every feature in classification problems, judge the interaction between different features, and capture the complex relationship between input feature vectors and output values. VGG is a classic CNN structure widely used in computer vision tasks, such as image classification, object detection, and semantic segmentation; VGG16 is one network structure in the VGG family [50].

Overall Validation of the CNN Model
AERONET AODs were used to compare and analyze the AOD retrieved by the CNN model on the training and test datasets. On the training dataset, the retrieved AOD is in good agreement with the AERONET AOD, with an R² of 0.99, RMSE of 0.021, and Within EE of 99%, as shown in Figure 4. Only 1% of the data lie above EE, indicating a very slight overestimation on the training dataset. CNN performance on the test dataset is similar: R² is 0.95, RMSE is 0.049, and Within EE is 95%, indicating that CNN can provide accurate AOD concentration and distribution at a relatively fine temporal resolution (5 days). Approximately 3% of AODs lie above EE and 2% below EE, indicating a slight deviation in the CNN AOD. In satellite images, the top-of-atmosphere signal is a coupled contribution from the surface and the atmosphere; hence, the signal-to-noise ratio of high-AOD scenes is larger than that of low-AOD scenes [35]. In the test dataset, 52% of the AERONET AODs lie in the range 0 to 0.16 and 48% in the range 0.16 to 1.6, and the test dataset was divided into these two parts to examine the accuracy of the CNN model at low and high AOD. Table 4 illustrates that the MAE and RMSE at low AOD are lower than those at high AOD, whereas the Within EE at low AOD is similar to that at high AOD. This finding indicates that CNN can better retrieve AOD in areas with good air quality. At low AOD, 6% of AODs lie above EE and none below; at high AOD, 1% of AODs lie above EE and 4% below.
The result shows that the CNN model slightly overestimates the low values and underestimates the high values. The AOD retrieval performance of the CNN model on the test dataset was subdivided by season and land cover type. CNN performs better in summer and winter than in spring and autumn (Table 5). From the perspective of the various validation indicators, the performance of CNN in winter is the best, with an R², MAE, RMSE, and Within EE of 0.97, 0.024, 0.037, and 97%, respectively. The excellent CNN performance in winter may be related to the large amount of winter data. The performance of CNN in summer is the second best, with an R² of 0.96 and RMSE of 0.045, and the performance in autumn is similar, with an R² of 0.94 and RMSE of 0.046. The CNN performance in spring is poorer compared with those in the other three seasons, with an R² of 0.93 and RMSE of 0.062. CNN efficiently performs AOD retrieval on the different land cover types (Table 6). CNN exhibits excellent retrieval accuracy on vegetation surfaces, with an R² of 0.92 and RMSE of 0.044. When the surface reflectance is high, the satellite sensors obtain less information about aerosols, making it difficult to retrieve aerosols over high-reflectance surfaces [51]. However, CNN still shows good retrieval performance on land cover types with high surface reflectance, such as urban and bare soil, with R² of 0.96 and 0.90 and RMSE of 0.051 and 0.042, respectively.

Retrieval Performance of the CNN Model at Different Scales
The AOD retrieval performance of the CNN model on the test dataset was subdivided into two scale types, regional scale and ground-based AERONET site scale, to analyze CNN performance in different regions and at different sites. The retrieval performance of CNN in North China, Taiwan, South Korea, and Japan is shown in Figure 5. Tibet and Hong Kong are not analyzed here because of the small amount of data in these regions. CNN performed best in South Korea, with an RMSE of 0.037, R² of 0.92, and Within EE of 98%. Moreover, 2% of AODs lie below EE, indicating a slight underestimation of CNN in South Korea. In North China, CNN could accurately retrieve AOD, with an R² of 0.97, RMSE of 0.047, and Within EE of 96%. Furthermore, 2% of AODs lie above EE and 2% below EE, indicating a slight deviation of the CNN AOD in North China. The more data in a region, the more regional features the model could capture. The largest amount of data is in North China. However, the retrieval performance of CNN in North China is not as good as that in South Korea, which may be because the aerosol concentration in North China is complex and changeable, making it difficult for CNN to learn the aerosol characteristics in this region. CNN did not perform well in Taiwan, where its RMSE is the highest of the four regions. CNN performed the worst in Japan, with an R² of 0.78, RMSE of 0.045, and Within EE of 91%, which may be related to the smaller amount of data obtained in Japan. The annual average AOD in Japan is the lowest, and the CNN model slightly overestimates low AODs with an Above EE of 6% (Table 4), so the CNN tends to overestimate the AOD in Japan, which has an Above EE of 8%, significantly higher than those in North China and South Korea.
The CNN performance at each site is shown in Figure 6. Given the limited amount of data at TB01 and KR1, only one TB01 sample and two KR1 samples were assigned to the test dataset; consequently, the retrieval performance at these two sites is not shown. The AOD retrieval accuracy at NC02 is the highest, with an R², RMSE, and Within EE of 0.98, 0.027, and 99%, respectively. This result may be due to NC02 having the largest amount of data, up to 311 pairs of samples, from which CNN could capture more accurate image features. Moreover, the land cover type at NC02 is simple and mainly urban. In addition, CNN performed well at TW01, with the lowest RMSE among all sites (0.016). The land cover type at TW01 is simple, with the vast majority of the area covered by vegetation, and CNN can obtain more aerosol information over an area with low surface reflectance. The AOD retrieval accuracy at JP4 is the lowest, with an R² of 0.73, RMSE of 0.095, and Within EE of 67%. CNN overestimated the AOD at JP4, with an Above EE of 33%. The low accuracy of the CNN AOD at JP4 has two causes. First, high AOD occurs more frequently in China and South Korea than in Japan [52], and the annual average AOD in Japan is the lowest, so CNN tends to overestimate the AOD in Japan. Second, the data at JP4 are few (only 52 pairs of samples).
The deep learning model requires a large number of samples, and CNN would have difficulty exploring the characteristics of a site using only a few samples. In addition, CNN did not perform well at TW03, with the highest RMSE of all sites (0.122). TW03 is characterized by a relatively complex land cover type, primarily dominated by urban areas with some vegetation coverage. Moreover, the amount of data obtained at TW03 is small, with only 55 pairs of samples; accordingly, the accuracy of the CNN AOD at TW03 is low. CNN exhibits a slight AOD deviation at most sites. Nonetheless, the CNN AOD at HK01, HK02, KR4, and JP3 reaches 100% Within EE, with no AOD falling above or below EE at these sites. CNN retrieved AOD better at sites with more data, indicating that CNN can obtain more features where more data are available. Expanding the dataset may help the model to better extract image features related to aerosols, and a larger dataset could be built for model training in the future.

Comparison with Other Models
The MODIS and Himawari-8 aerosol products were selected to evaluate the reasonability of the spatial distribution of the CNN AOD. Figure 7a shows a Sentinel-2 satellite image over Chiba and Ibaraki in Japan on 12 February 2019, with no snow or clouds in most areas, making it suitable for AOD retrieval. Figure 7c-g shows the MODIS MAIAC, DT, DB, DTB, and Himawari-8 aerosol products over Japan, with spatial resolutions of 1 km/pixel, 10 km/pixel, 10 km/pixel, 10 km/pixel, and 0.05°/pixel, respectively. Figure 7b shows the spatial distribution of the AOD retrieved by CNN. CNN AODs are calculated with a step size of 64 pixels and an input image size of 128 × 128 to facilitate display and comparison, resulting in a spatial resolution of 640 m/pixel. Figure 7a depicts some cloud and snow cover in the lower left corner and middle upper areas of the image; incorrect AODs are retrieved by CNN there due to the interference of clouds and snow. Outside the cloud- and snow-covered areas, the spatial distributions of the CNN AOD and MODIS AOD are similar, and the overall AOD of the image coverage area is low. Some coastal areas show relatively high AODs due to dense human habitation and intensive industrial agglomeration, whereas inland and southern mountain areas show relatively low AODs. The Himawari-8 aerosol product shows an overall higher AOD, which is inconsistent with the MODIS AOD and CNN AOD. The MODIS AOD product is more reliable than the Himawari-8 AOD product [34]; specifically, the CNN AOD, being more similar to the MODIS AOD, can reasonably reflect the AOD concentration. The distribution maps of the CNN AOD and Himawari-8 AOD cover the entire land area. In contrast, the data coverage rate of the MODIS aerosol products is low, with many no-value regions. The CNN AOD and Himawari-8 AOD can provide a continuous AOD spatial distribution.
Furthermore, the high spatial resolution map of the CNN AOD shows spatial details that cannot be resolved at the coarse spatial resolutions of 1 km/pixel, 10 km/pixel, and 0.05°/pixel. The AOD retrieved by CNN is also closer to the ground-based AOD, with an RMSE of 0.049; in contrast, the RMSE values of the MODIS MAIAC, DT, DB, and Himawari-8 AODs are 0.15, 0.22, 0.17, and 0.14, respectively, which are less accurate than CNN.
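The arithmetic behind the 640 m/pixel map can be sketched as a simple sliding-window calculation: a 128 × 128 window stepped by 64 pixels over a 10 m/pixel Sentinel-2 scene yields one AOD value per step (function name and the example scene size are illustrative):

```python
def retrieval_grid(scene_px, window=128, stride=64, gsd_m=10):
    """Return the number of window positions along one axis and the
    effective pixel size (in meters) of the resulting AOD map."""
    n_steps = (scene_px - window) // stride + 1
    return n_steps, stride * gsd_m
```

Because the stride (64 px) is half the window (128 px), adjacent retrievals overlap by 50%, and the output grid spacing is 64 × 10 m = 640 m.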

The CNN model was compared with RF and VGG16 to further evaluate its AOD retrieval performance. The RF and VGG16 algorithms were trained and tested using the dataset produced in this study. Because VGG16 is a classification model that ultimately divides inputs into 1000 categories, its output layer was changed to a fully connected layer of size 1. The AOD retrieval performance of each model is shown in Table 7. The results of MAIAC, DT, DB, and YAER are from a study by Choi et al. [53], in which the MODIS and Himawari-8 aerosol products in East Asia were validated. CNN shows outstanding advantages in AOD retrieval: all verification indicators are the best values, with an RMSE of 0.049, significantly lower than those of the other models. The R² of CNN is 0.95, higher than those of the other models. RF performs well, with an R² of 0.86, higher than the 0.70 of VGG16, and an RMSE of 0.085, lower than the 0.123 of VGG16.
The results show that CNN demonstrates significant advancements in addressing AOD retrieval problems compared with most previous algorithms.

Retrieval Performance as a Function of Satellite Image Size
Image datasets of 32 × 32, 64 × 64, and 128 × 128 pixels were constructed to explore the relationship between image size and AOD retrieval performance. The model was retrained with the datasets of different sizes, and the hyperparameters of CNN were adjusted. The smaller the size, the less information the satellite image contains; accordingly, the image feature vector of the fully connected layer in the CNN was reduced to 80 when the model was trained with the 64- and 32-size datasets. The AOD retrieval performance on the datasets of different sizes is shown in Table 8. The 128-size model works best, with an R² of 0.95, which is 0.03 and 0.02 higher than those of the 32-size and 64-size models, respectively. The RMSE of the 128-size model is 0.049, which is 0.016 and 0.011 lower than those of the 32-size and 64-size models, respectively. The 32-size model performed worst, with the lowest R², highest MAE, and highest RMSE among the three models, and its overestimation and underestimation are the most severe. The better results of the model trained with the larger image dataset indicate that the 128-size image contains richer aerosol information.

Conclusions
In this study, a novel AOD-retrieval model based on CNN was developed, which can directly retrieve AOD from Sentinel-2 satellite images, improving the stability and spatiotemporal adaptability of aerosol retrieval. The CNN model consists of 10 convolutional layers, 4 max-pooling layers, 1 global max-pooling layer, and 1 fully connected layer. The image features related to aerosol load in four-band Sentinel-2 images were extracted using the excellent feature-extraction capability of CNN, and the correlation between the AOD and the remote sensing images was modeled to achieve AOD retrieval. Taking East Asia as the research area, data from 22 AERONET sites from 2016 to 2022 were obtained, and satellite images and ground-based AOD data at the same time and location were paired to produce 2575 pairs of samples. The samples were augmented by a random horizontal shift of up to 0.1 of the image width, and the model was trained and tested on an 8:2 split of the samples. The results are as follows: (1) The proposed model can accurately retrieve AOD, with an R² of 0.95, RMSE of 0.049, and Within EE of 95% on the test dataset. The AOD-retrieval accuracy of CNN is higher compared with those of the DT, DB, DTB, MAIAC, YAER, RF, and VGG16 algorithms. In addition, CNN can provide continuous and detailed aerosol distributions to fill the observation gap in existing ground-based monitoring networks. (2) CNN efficiently performs AOD retrieval on different land cover types: vegetation surface, urban, and bare soil. When the surface reflectance is high, the satellite sensors obtain less information about aerosols, making it difficult to retrieve aerosols over high-reflectance surfaces. However, CNN still shows great AOD retrieval potential on surfaces with high reflectance, such as urban and bare soil, with R² of 0.96 and 0.90 and RMSE of 0.051 and 0.042, respectively. (3) CNN performs better in summer and winter than in spring and autumn.
The performance of CNN in winter is the best, with an R² of 0.97 and RMSE of 0.037. The performance of CNN in summer is the second best. The CNN performance in spring is poorer compared with those in the other three seasons, with an R² of 0.93 and RMSE of 0.062. (4) To investigate the relationship between image size and model retrieval performance, datasets of 32, 64, and 128 sizes were created to train and test the CNN. The 128-size CNN performed better because of the rich AOD information in the 128-size image.