Improving Spatial Coverage of Satellite Aerosol Classification Using a Random Forest Model

Abstract: The spatial coverage of satellite aerosol classification was improved using a random forest (RF) model trained with observational data including target (aerosol type) and input (satellite measurement) variables. The AErosol RObotic NETwork (AERONET) aerosol-type dataset was used for the target variables. Satellite input variables with many missing data or low mean-decrease accuracy were excluded from the final input variable set, and good performance in aerosol-type classification was achieved. The performance of the RF-based model was evaluated on the basis of the wavelength dependence of single-scattering albedo (SSA) and fine-mode-fraction values from AERONET. Typical SSA wavelength dependence for individual aerosol types was consistent with that obtained for aerosol types by the RF-based model. The spatial coverage of the RF-based model was also compared with that of previously developed models in a global-scale case study. The study demonstrates that the RF-based model allows satellite aerosol classification with improved spatial coverage, with a performance similar to that of previously developed models.


Introduction
Aerosols, directly and indirectly, affect Earth's radiation budget and climate [1], with the degree of influence being dominated by aerosol type [2][3][4]. Aerosols also play a critical role in the calculation of radiative forcing with high inherent uncertainty [5][6][7][8], with the aerosol type being a particularly important input parameter [9,10]. Aerosol type is also an input parameter in satellite aerosol retrieval algorithms [11,12], and retrieval accuracy can be affected by uncertainties in aerosol type. Accurate aerosol type information is, therefore, important in climate science, particularly in satellite aerosol remote sensing.
Various methods have been proposed for determining aerosol type from spatially continuous satellite data, most of them threshold approaches that rely on empirically determined threshold values. Table 1 summarizes previous works on threshold-based satellite aerosol classification. Various satellite-based variables have been used to classify aerosol types. Aerosol optical properties such as aerosol optical depth (AOD), Ångström exponent (AE), fine-mode fraction (FMF), and aerosol index were used as inputs in earlier satellite aerosol classification methods [12][13][14][15][16]. Column densities of the trace gases NO2, HCHO, SO2, and CO were also used to account for the abundances of several aerosol types [12,17]. Vertically resolved aerosol mask information is provided by Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observation (CALIPSO) measurements [18,19]. Although internal uncertainties in satellite input variables may lead to misclassification of aerosol type [20][21][22], aerosol classification methods have rarely been evaluated; classifications have usually only been compared with the results of aerosol climate modeling and of earlier classification methods. The vertical aerosol mask data were evaluated by comparison with ground-based lidar measurements [23]. However, Mao
Table 1. Summary of previous studies of threshold-based satellite aerosol classification. Abbreviations: SeaWiFS, Sea-viewing Wide Field-of-view Sensor; AVHRR, Advanced Very-High-Resolution Radiometer; TOMS, Total Ozone Mapping Spectrometer; MODIS, Moderate-Resolution Imaging Spectroradiometer; OMI, Ozone Monitoring Instrument; GOME-2, Second Global Ozone Monitoring Experiment; MOPITT, Measurements of Pollution in the Troposphere.

Reference; Parameters; Aerosol Types; Validation (or Comparison)
Higurashi and Nakajima [9]; Spectral radiances in channels 1, 2, 6, and 8

Choi et al. [25] recently proposed a new satellite classification method for identifying aerosol types based on a 'random forest' (RF) model, a machine-learning technique. The RF-based aerosol classification model was trained using a set of observational data including target (AERONET aerosol type) and input (satellite measurement) variables, so that aerosol types can be identified without input from AERONET observations. Choi et al. [25] introduced 11 satellite input variables to account for the combined effects of trace-gas data and aerosol optical properties from the TROPOspheric Monitoring Instrument (TROPOMI) and the Moderate Resolution Imaging Spectroradiometer (MODIS). The importance of each satellite input variable was also investigated with the RF-based model [25], which classified seven aerosol types. Model accuracy for the classification of the seven aerosol types was 59%, improving to 73% when the seven classes were merged into four aerosol types.
The present study follows the work of Choi et al. [25], further developing the RF-based model to investigate the feasibility of satellite aerosol classification. The use of several different satellite products may lead to insufficient spatial coverage in aerosol classification because data are missing from each product. We constructed a new satellite input variable dataset to improve the spatial coverage of the RF-based aerosol classification model. Input variables with high levels of missing data or low mean-decrease accuracy (MDA; i.e., variable importance) were excluded from the final input variable set while maintaining an aerosol type classification performance comparable with that of the previous study [18]. We also compared the spatial coverage of the aerosol type classification algorithm with that of previous methods on a global scale.
Remote Sens. 2021, 13, 1268

Description of Model
The RF is an ensemble of decision trees based on classification and regression trees, in which the outputs of multiple trees are aggregated by majority voting in classification tasks and by averaging in regression tasks [26]. The RF method constructs numerous trees from bootstrap samples to overcome the over-fitting problem common in decision-tree models. The ensemble is based on bagging and randomized node optimization [26], and the RF method also determines how input variables contribute to model development. Figure 1 shows the overall process of the RF-based aerosol classification model. The RF model was trained using the AERONET aerosol type dataset (target variable) and various satellite input variables. The AERONET aerosol type dataset was constructed using the AERONET-based aerosol classification method proposed in [27]. This method identifies seven aerosol types: pure dust (PD), dust-dominant mixed (DDM), pollution-dominant mixed (PDM), and pollution aerosols classified as strongly absorbing (SA), moderately absorbing (MA), weakly absorbing (WA), and non-absorbing (NA). It uses the SSA to determine aerosol absorbance and the dust ratio (Rd), which is derived from the particle linear depolarization ratio at 1020 nm in the AERONET Version 3 product and can distinguish the contributions of non-spherical particles such as dust aerosols. Shin et al. [27] reported that Rd is a more suitable parameter than FMF for distinguishing aerosol types mixed with dust particles, since non-spherical dust aerosols may occur in the fine mode. Threshold values used in classifying aerosol types from AERONET data are shown in Table 2.
The satellite input variable dataset was collected from TROPOMI data and aerosol optical properties from MODIS data. Choi et al. [25] used satellite input variables from previous studies, including AOD, AE, aerosol index, and trace-gas column densities (CO and tropospheric NO2).
Aerosol formation is reported to depend on radiation exposure, particularly through its effect on smog production [28]. Since the amount of solar radiation depends on the solar zenith angle (SZA), Choi et al. [25] introduced the SZA as an input variable to indirectly represent aerosol formation by photochemical reactions. Top-of-atmosphere (TOA) reflectances at three wavelengths (412, 470, and 660 nm) were used because they depend on the aerosol type, especially on aerosol absorbance [29]. Annual land-cover type and percentage urban area were selected to account for the effects of land-cover type on aerosol formation and to serve as proxies for aerosol source information [30,31]. The use of these 11 input variables allowed the RF-based model to account for the combined effects of trace-gas information and aerosol optical properties, with a classification accuracy of up to 73% [25]. However, combining TROPOMI and MODIS measurements can result in loss of data, because a sample is retained only when all collocated satellite input variables are available. In this study, we therefore excluded satellite input variables with many missing data or low MDA (i.e., variable importance). Aerosol index, column densities of CO and tropospheric NO2, and SZA were obtained from TROPOMI level 2 products. The AOD, AE, and TOA reflectances were taken from a MODIS level 2 product (MYD04_L2). Annual land-cover type and percentage urban area were obtained from a MODIS level 3 product (MCD12C1). MODIS products are available from the NASA Level-1 and Atmosphere Archive and Distribution System Distributed Active Archive Center (https://ladsweb.modaps.eosdis.nasa.gov/, accessed on 22 September 2020).

Training and Validation of the RF Model
We trained the RF model using the 'randomForest' package (version 4.6-14) in RStudio (R version 3.6.3). Hyperparameters of the RF model include ntree (the number of classification trees), mtry (the number of input variables tried at each split), and node size (the minimum size of terminal nodes). These hyperparameters must be optimized because model performance depends on them [26,32]. The 'tune.randomForest' function from the 'e1071' package (version 1.7-3) was used to obtain optimal values of ntree and node size. Node sizes of 1-5 with ntree values of 100-1500 were considered in tuning the hyperparameters. A single mtry value, the square root of the number of input variables, was applied, as this value is typically used for classification [33].
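The search space described above can be written out explicitly. The following is a minimal Python sketch (the study itself used R), assuming a step of 100 between ntree values, which the text does not specify, and rounding the square-root rule down to an integer; the function names are illustrative only.

```python
import math
from itertools import product


def default_mtry(n_variables: int) -> int:
    """Square-root rule for classification, rounded down (assumption)."""
    return max(1, math.floor(math.sqrt(n_variables)))


def tuning_grid(ntree_step: int = 100):
    """All (node size, ntree) pairs within the ranges quoted in the text.

    The step between ntree values is an assumption; the text only gives
    the range 100-1500.
    """
    node_sizes = range(1, 6)               # node sizes 1..5
    ntrees = range(100, 1501, ntree_step)  # ntree 100..1500
    return list(product(node_sizes, ntrees))


grid = tuning_grid()
print(len(grid))         # number of candidate settings to cross-validate
print(default_mtry(11))  # mtry for the 11-variable set of Choi et al. [25]
print(default_mtry(6))   # mtry for the reduced 6-variable set
```

Under these assumptions, each pair in the grid would be scored by cross-validation and the best-scoring pair retained.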
The dataset was randomly divided into separate training (60%) and test (40%) datasets. In the training procedure, k-fold cross-validation (k = 5) was used to optimize hyperparameters, with the training dataset randomly divided into five sets of equal size. In each round, four folds were used for training and the remaining (unseen) fold was used to validate the performance of the trained model. This allowed us to choose the hyperparameters giving the best model performance; the test dataset (40% of the data) was finally used to evaluate the classification performance of the RF model.
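The splitting scheme can be illustrated with index lists alone. This is a sketch under stated assumptions: the seed, the rounding of the 60% share, and the interleaved fold assignment are illustrative choices, and N = 8693 is simply the size of the reduced-variable dataset reported later in the text.

```python
import random


def split_train_test(n_samples, train_fraction=0.6, seed=0):
    """Randomly split sample indices into training and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


def k_fold(train_indices, k=5):
    """Yield (fit, validate) index lists; each fold is held out exactly once."""
    folds = [train_indices[i::k] for i in range(k)]
    for held_out in range(k):
        fit = [idx for j, fold in enumerate(folds) if j != held_out
               for idx in fold]
        yield fit, folds[held_out]


train_idx, test_idx = split_train_test(8693)
cv_rounds = list(k_fold(train_idx))
print(len(train_idx), len(test_idx), len(cv_rounds))
```

The test indices are never touched during cross-validation, which mirrors the separation between hyperparameter tuning and final evaluation described above.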

Variable Importance and Data Volume for the Satellite Input Variable
We used data collected previously by Choi et al. [25]. The AERONET aerosol type dataset was obtained from AERONET Level 1.5 (cloud-screened) data, which provide more data than AERONET Level 2.0 (quality-assured) data. Choi et al. [25] used only data with 440 nm AOD above 0.4 to minimize uncertainties in AERONET Level 1.5 data, and collected AERONET aerosol type data at the overpass time (13:30 local time) of TROPOMI aboard the Sentinel-5P satellite and MODIS aboard the Aqua satellite for the period January 2018 to July 2020. The AERONET aerosol type dataset comprised 10,481 data points. Satellite variables (TROPOMI and MODIS data) were then collocated by selecting the satellite pixel nearest each AERONET site to obtain training and validation datasets at the AERONET site locations, as in Choi et al. [25]. Three input variable sets were constructed in order to select the one with the highest classification accuracy: all input variable candidates (11 variables), MODIS input variable candidates (7 variables), and TROPOMI input variable candidates (4 variables). The set with 11 input variables, including both TROPOMI and MODIS variables, was selected by Choi et al. [25] because its classification accuracy (59%) was higher than that of the other sets. However, the 11-variable set retained only 47% (N = 4906) of the total AERONET aerosol type dataset (N = 10,481), as many data were missing from each satellite product. The set of four TROPOMI variables retained more data (8693 data points) than the other sets, with an overall accuracy of 51%. The set of MODIS variables (overall accuracy = 56%) retained only 5714 data points, with many data points missing for all variables except annual land cover and percentage urban area.
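The nearest-pixel collocation, and the way a single missing variable removes a sample, can be sketched as follows. This is an illustration only: the great-circle distance metric, the dictionary layout of a pixel, and the variable names (`aerosol_index`, `aod`) are assumptions and do not reproduce the processing chain of the study.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def collocate(site, pixels, required):
    """Return the required variables of the pixel nearest to `site`.

    Returns None when there is no pixel or when any required variable
    is missing at the nearest pixel, i.e. the sample is lost.
    """
    nearest = min(
        pixels,
        key=lambda p: haversine_km(site[0], site[1], p["lat"], p["lon"]),
        default=None,
    )
    if nearest is None or any(nearest.get(v) is None for v in required):
        return None
    return {v: nearest[v] for v in required}


def retention_rate(sites, pixels, required):
    """Fraction of sites for which a complete collocated sample exists."""
    kept = sum(collocate(s, pixels, required) is not None for s in sites)
    return kept / len(sites)


# Hypothetical pixels: the nearest one has no AOD retrieval.
pixels = [
    {"lat": 0.1, "lon": 0.1, "aerosol_index": 1.2, "aod": None},
    {"lat": 1.0, "lon": 1.0, "aerosol_index": 0.4, "aod": 0.55},
]
site = (0.0, 0.0)
print(collocate(site, pixels, ["aerosol_index"]))
print(collocate(site, pixels, ["aerosol_index", "aod"]))
```

In this toy case the sample survives when only the aerosol index is required but is dropped once AOD is also required, which is the mechanism behind the lower data volume of the 11-variable set (4906 of 10,481 samples, about 47%).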
In Choi et al. [25], the importance of each satellite input variable was investigated using the MDA value, which indicates the accuracy lost when a specific variable is excluded from the RF-based model (Table 3). The MDA of the TROPOMI variables is 64-83% higher than that of the MODIS variables apart from AOD. We excluded the MODIS input variables AE and TOA reflectance (at 412, 470, and 660 nm) because of their missing data and low importance. Despite its high MDA value, AOD was also excluded because many of its data were missing. We finally selected the TROPOMI variables, which have few missing data and high MDA values; the MODIS land-cover type and percentage urban area variables were also selected, as they are annual datasets with few missing data.
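The idea behind MDA can be illustrated with a permutation-style estimate: the values of one variable are shuffled so that it no longer carries information, and the resulting loss of accuracy is averaged over repeats. The sketch below is an assumption-laden illustration, not the study's computation: it uses a fixed toy classifier and a single evaluation set, whereas random forest software typically evaluates each tree on its out-of-bag samples. All data values and names are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted class equals the reference label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def mean_decrease_accuracy(predict, rows, labels, variable,
                           n_repeats=20, seed=0):
    """Average accuracy lost when `variable` is randomly permuted."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        values = [row[variable] for row in rows]
        rng.shuffle(values)
        permuted = [{**row, variable: value}
                    for row, value in zip(rows, values)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


# Hypothetical samples: the toy classifier uses only the aerosol index.
rows = [{"aerosol_index": ai, "noise": n} for ai, n in [
    (2.1, 0.3), (1.8, 0.9), (1.5, 0.1), (2.4, 0.7), (1.9, 0.5),
    (0.2, 0.4), (0.4, 0.8), (0.1, 0.2), (0.6, 0.6), (0.3, 0.0),
]]
labels = ["dust"] * 5 + ["pollution"] * 5


def toy_predict(row):
    return "dust" if row["aerosol_index"] > 1.0 else "pollution"


mda_ai = mean_decrease_accuracy(toy_predict, rows, labels, "aerosol_index")
mda_noise = mean_decrease_accuracy(toy_predict, rows, labels, "noise")
print(mda_ai, mda_noise)
```

A variable the classifier never uses has an MDA of zero, which corresponds to the kind of low-importance input that was a candidate for exclusion.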

Assessment of the New RF-Based Model
The performance of the RF-based model with the variable set used by Choi et al. [25] was compared with that obtained with our selected set (Table 4). As found by Choi et al. [25], the data collection rate was only 47% when all 11 TROPOMI and MODIS variables were used. When only two MODIS land-cover variables and four TROPOMI variables were used, the number of data points was maintained at N = 8693 and the overall accuracy in classifying the seven aerosol types was 56%, 5 percentage points higher than when only TROPOMI variables were used. The accuracy was only 3 percentage points lower than when all 11 TROPOMI and MODIS variables were used, while about 1.77 times more data were retained and spatial coverage improved. A detailed comparison of spatial coverage between the RF-based model of Choi et al. [25] and that of this study is presented in Section 4.2.
Table 4. Summary of variables, data, and overall accuracy for the initial input variable set.

Dataset Name; Input Variables

We investigated the confusion matrix and classification accuracy for each aerosol type to determine which types are usually confused and which are well classified. Aerosol types that were usually confused with other types were merged based on a sensitivity test via confusion-matrix analysis. When aerosols were classified into the seven types (PD, DDM, PDM, SA, MA, WA, and NA), the overall accuracy (OA) of the RF-based model was 56%, with the classification performance for each type generally being comparable with, but slightly poorer than, that of Choi et al. [25], as shown in Figure 2a. Our RF-based model generally yielded reliable detection accuracies for the SA (72%), DDM (70%), PDM (61%), and PD (60%) types, indicating sensitivity to dust and to pollution aerosols with strong absorption. The detection performance for the pollution aerosols MA, WA, and NA was generally poor (<44%), similar to the performance observed by Choi et al. [25]. However, classification performance for the NA type (37%) was higher than the 21% achieved by Choi et al. [25], while the performance for the PD type was lower (60%, compared with 73%).
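The confusion-matrix analysis can be sketched with simple counting. In this illustration, per-type accuracy is taken to be the fraction of reference samples of a type that the model assigns to that same type, which is an assumption about how the detection accuracies were defined; the label lists are hypothetical.

```python
from collections import Counter


def confusion_counts(reference, predicted):
    """Count occurrences of each (reference type, predicted type) pair."""
    return Counter(zip(reference, predicted))


def per_type_accuracy(reference, predicted):
    """Fraction of reference samples of each type that were predicted correctly."""
    counts = confusion_counts(reference, predicted)
    totals = Counter(reference)
    return {t: counts[(t, t)] / totals[t] for t in totals}


def most_confused_pairs(reference, predicted, top=3):
    """Off-diagonal (reference, predicted) pairs, most frequent first."""
    counts = confusion_counts(reference, predicted)
    off_diagonal = [(pair, n) for pair, n in counts.items()
                    if pair[0] != pair[1]]
    return sorted(off_diagonal, key=lambda item: -item[1])[:top]


# Hypothetical labels in which WA and NA are confused with each other.
reference = ["SA", "SA", "WA", "WA", "WA", "NA", "NA", "PD"]
predicted = ["SA", "SA", "NA", "NA", "WA", "WA", "NA", "PD"]

print(per_type_accuracy(reference, predicted))
print(most_confused_pairs(reference, predicted))
```

Pairs that dominate the off-diagonal counts are natural candidates for merging in a sensitivity test of this kind.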
The RF-based model mainly confuses the pollution-related aerosol types (MA, WA, NA, and PDM) with one another, as shown in Figure 2b and as reported by Choi et al. [25], who attributed this to the difficulty the RF-based model has in discriminating between the absorbing features of aerosols (except for the SA type) and in identifying the PDM type, in which pollution and dust aerosols are mixed. Choi et al. [25] solved the problem of confusion between pollution-related aerosols by merging aerosols into the four classes PD, DDM, SA, and NA, achieving increased classification performance. We therefore merged the WA and NA types and the MA and SA types, while the PDM type was reclassified as the NA or SA type to integrate the PDM class into pollution aerosols.
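The effect of merging on overall accuracy can be sketched as a relabelling step followed by a recount. The mapping below follows the merging stated in the text (WA with NA, MA with SA); the rule for assigning PDM to NA or SA is not given there, so in this sketch the caller must supply the target class explicitly. The label lists are hypothetical.

```python
# Merging of the six non-PDM types into four classes, as stated in the text.
MERGE_MAP = {"PD": "PD", "DDM": "DDM", "SA": "SA", "MA": "SA",
             "WA": "NA", "NA": "NA"}


def merge_label(label, pdm_as=None):
    """Map a seven-type label onto the four merged classes.

    PDM has no fixed target here; `pdm_as` must be "NA" or "SA"
    (an assumption, since the assignment rule is not specified).
    """
    if label == "PDM":
        if pdm_as not in ("NA", "SA"):
            raise ValueError("PDM needs an explicit target class (NA or SA)")
        return pdm_as
    return MERGE_MAP[label]


def overall_accuracy(reference, predicted):
    """Fraction of samples whose predicted class matches the reference."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


# Hypothetical labels: most errors are between types that are later merged.
reference = ["WA", "NA", "MA", "SA", "PD", "DDM"]
predicted = ["NA", "WA", "SA", "SA", "PD", "PD"]

oa_seven = overall_accuracy(reference, predicted)
oa_four = overall_accuracy([merge_label(x) for x in reference],
                           [merge_label(x) for x in predicted])
print(oa_seven, oa_four)
```

Errors between types that end up in the same merged class disappear, while errors across merged classes (here DDM predicted as PD) remain.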
The training process was repeated for the newly merged classes (SA, NA, DDM, and PD), with classification performance increasing from 56% to 73% (Figure 3b), i.e., the same OA as that achieved by Choi et al. [25], and with similar per-class performance obtained using only the TROPOMI and MODIS land-cover variables. In particular, our model yields a higher NA classification performance (77%) than that of Choi et al. (74%) [25], as shown in Figure 3a. Although PD classification performance was poorer than that achieved by Choi et al. [25], indicating decreased classification sensitivity for the PD type when using only TROPOMI variables and MODIS land-cover variables, our new classification model ensures extended spatial coverage with a minimized set of input variables when classifying the four aerosol types.
Based on statistical validation, our RF-based model with TROPOMI and MODIS land-cover variables has acceptable accuracy. We also evaluated the model using aerosol optical properties from AERONET data, as done by Choi et al. [25], who applied the spectral dependence of SSA to infer aerosol composition [25,34]. For example, SSA values of dust aerosols tend to increase with increasing wavelength, whereas those of carbonaceous aerosols decrease with increasing wavelength [34][35][36][37][38]. AERONET and RF-model aerosol types displayed similar trends in the wavelength dependence of SSA for each aerosol type (Figure 4a,b). In particular, the SSAs of PD and DDM tended to increase with wavelength, indicating a high contribution of dust aerosols. However, for the DDM type, the rate of SSA increase with wavelength is lower than that for PD, indicating that the RF-based model can distinguish PD from dust aerosols mixed with pollution aerosols. For the SA type, SSA values tend to decrease with increasing wavelength, indicating that the AERONET and RF algorithms reasonably describe the wavelength dependence of carbonaceous aerosol types, consistent with previous studies [27][34][35][36].
The performance of the RF-based model was assessed quantitatively by comparing SSA values at wavelengths of 440, 675, 870, and 1020 nm for AERONET aerosol types (Figure 4a) with those of the RF-based model (Figure 4b), with differences between the two shown in Figure 4c,d. The mean differences in SSA at 440, 675, 870, and 1020 nm were 0.003, 0.005, 0.007, and 0.008, respectively, similar to the values reported by Choi et al. [25] (averaging <0.01). Although the differences in SSA are <0.01, the internal uncertainty of AERONET SSA (0.03) [39] must be considered.
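The spectral check can be illustrated by fitting a straight line to SSA against wavelength and reading the sign of its slope. This is a sketch only: the least-squares slope is one possible way to summarize wavelength dependence, and the two SSA spectra below are hypothetical values chosen for illustration. The mean differences and the AERONET SSA uncertainty are the values quoted in the text.

```python
WAVELENGTHS_NM = (440, 675, 870, 1020)


def ssa_slope(ssa_values):
    """Least-squares slope of SSA against wavelength (per nm)."""
    n = len(WAVELENGTHS_NM)
    mean_x = sum(WAVELENGTHS_NM) / n
    mean_y = sum(ssa_values) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(WAVELENGTHS_NM, ssa_values))
    var = sum((x - mean_x) ** 2 for x in WAVELENGTHS_NM)
    return cov / var


def spectral_tendency(ssa_values):
    """Describe the sign of the SSA wavelength dependence."""
    slope = ssa_slope(ssa_values)
    if slope > 0:
        return "increasing"
    if slope < 0:
        return "decreasing"
    return "flat"


# Hypothetical spectra, for illustration only.
dust_like = (0.91, 0.95, 0.96, 0.97)
carbonaceous_like = (0.90, 0.88, 0.86, 0.85)
print(spectral_tendency(dust_like), spectral_tendency(carbonaceous_like))

# Values quoted in the text.
REPORTED_MEAN_DIFFERENCES = (0.003, 0.005, 0.007, 0.008)
AERONET_SSA_UNCERTAINTY = 0.03
within_uncertainty = all(d < AERONET_SSA_UNCERTAINTY
                         for d in REPORTED_MEAN_DIFFERENCES)
print(within_uncertainty)
```

Because all reported mean differences fall below the AERONET SSA uncertainty, agreement at this level cannot by itself separate the two classifications, which is why the uncertainty has to be kept in mind when interpreting them.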

The effect of merging aerosol types on classification performance, in terms of aerosol optical properties (SSA values at 440, 675, 870, and 1020 nm, FMF, and Rd values), is shown in Table 5, which summarizes the means and standard deviations of differences between AERONET and RF aerosol types. These differences generally decreased when aerosol types were merged. For example, with seven aerosol types, the differences in SSA at 440, 675, 870, and 1020 nm were 0.006, 0.010, 0.013, and 0.015, respectively, and with four aerosol classes, the differences were 0.003, 0.005, 0.007, and 0.008, respectively. The difference in Rd values also decreased from 0.061 to 0.027 through the merging of aerosol classes, while differences in FMF values were relatively unaffected. These decreases indicate that merging aerosol classes reduced classification confusion in the satellite aerosol classification model.
Table 5. Means and standard deviations of differences in SSA, fine-mode fraction (FMF), and dust ratio (Rd) values for aerosol classification models identifying seven or four aerosol types. Values in parentheses are from Choi et al. [25].
Overall accuracy: 56% (59%) for seven aerosol classes; 73% (73%) for four aerosol classes

Spatial Distributions among Different Aerosol Classification Models
Figure 5 shows aerosol types classified on 26 March 2018 by the RF-based algorithm with fewer variables (this study), that of Choi et al. [25], and the threshold-based algorithms [9,12]. Aerosols were classified for pixels where the TROPOMI cloud fraction is <0.2 and the SZA is <70° to reduce uncertainties in input variables due to the presence of cloud and a high solar zenith angle. Over the ocean, we classified only aerosols with high aerosol loadings (AOD > 0.4) for our RF-based algorithm and that of Torres et al. [9], because these algorithms do not classify sea salt aerosols. For the classification of Lee et al. [12], aerosols were classified over the ocean for all aerosol loadings because that algorithm classifies sea salt aerosols.
As shown in Figure 5a-d, on 26 March 2018, aerosols were classified over 2,149,917 and 823,505 pixels using the RF-based model used here and that of Choi et al. [25], respectively. The reduction in input variables thus greatly improved the spatial coverage of the RF-based model. In general, the spatial distributions of aerosol types classified by the two RF-based algorithms are similar, except over parts of South America (including the Atacama, Monte, and Patagonian Deserts), southern Africa (including the Namib and Kalahari Deserts), and Australia (including the Strzelecki and Simpson Deserts). Over these regions, the RF-based model used here (Figure 5a,b) classified aerosol types as PD or DDM (dust; i.e., coarse-mode absorbing aerosols), whereas that of Choi et al. [25] (Figure 5c,d) classified them as SA (fine-mode absorbing aerosols). To clarify this difference, we investigated AE values from AERONET sites. Near the overpass time (13:30) of TROPOMI (aboard Sentinel-5P) and MODIS (aboard Aqua), AERONET data were available only at CEILAP-Neuquen (South America) and Gobabeb (southern Africa), with AE values of 0.48 and 0.59, respectively, indicating the likely presence of coarse-mode aerosols. Therefore, the dust types (PD or DDM) appear more appropriate than the pollution type (SA) over these regions. The presence of coarse-mode aerosols could be confirmed in the future when lidar observations with polarization capability become available. For Australia, further case studies are needed. The PD type was detected with high aerosol loadings (average AOD: 0.98) over the North Atlantic Ocean between the Sahara Desert and North America. The average (maximum) MODIS AE and TROPOMI AI were 0.29 (0.59) and 0.51 (2.15), respectively, suggesting the presence of dust aerosols that may have been transported from the Sahara Desert.
To check the transport of dust aerosols from the Sahara Desert, the Hybrid Single-Particle Lagrangian Integrated Trajectory (HYSPLIT) model was used [40,41]. Figure 6 shows 96-h HYSPLIT back trajectories (1-degree Global Data Assimilation System, GDAS, meteorology) starting at 500, 1000, and 1500 m above ground level (AGL) at the location of the dust plume detected over the North Atlantic Ocean (latitude: 10, longitude: −30). The backward trajectory analysis indicates that the dust plume detected over the North Atlantic Ocean was transported from the Sahara Desert.
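The pixel screening applied before the Figure 5 comparison (cloud fraction < 0.2, SZA < 70°, and AOD > 0.4 over the ocean for algorithms without a sea salt class) can be expressed as a single predicate. The function and argument names below are illustrative assumptions; only the thresholds come from the text.

```python
def is_classifiable(cloud_fraction, sza_deg, is_ocean,
                    aod=None, classifies_sea_salt=False):
    """Return True when a pixel passes the screening described in the text.

    cloud_fraction       TROPOMI cloud fraction (0-1)
    sza_deg              solar zenith angle in degrees
    is_ocean             True for ocean pixels
    aod                  aerosol optical depth, or None when unavailable
    classifies_sea_salt  True for algorithms with a sea salt class,
                         which are applied over ocean at all loadings
    """
    if cloud_fraction >= 0.2 or sza_deg >= 70:
        return False
    if is_ocean and not classifies_sea_salt:
        # Only high aerosol loadings are classified over the ocean.
        return aod is not None and aod > 0.4
    return True


print(is_classifiable(0.1, 50, is_ocean=False))
print(is_classifiable(0.1, 50, is_ocean=True, aod=0.3))
print(is_classifiable(0.1, 50, is_ocean=True, aod=0.3,
                      classifies_sea_salt=True))
```

Treating a missing AOD over the ocean as a failed screening is a conservative choice made for this sketch.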
Figure 5. Comparisons of spatial distributions of aerosol types classified by (a) the RF-based algorithm with a reduced number of input variables in this study (four-type classification), (b) the RF-based algorithm with a reduced number of input variables in this study (seven-type classification), (c) the RF-based algorithm of Choi et al. [25] (four-type classification), (d) the RF-based algorithm of Choi et al. [25] (seven-type classification), (e) Torres et al. [9], and (f) Lee et al. [12] on 26 March 2018.
Aerosol types classified by earlier threshold-based algorithms [9,12] have high spatial coverage (2,342,921 pixels for Torres et al. [9]; 1,486,496 pixels for Lee et al. [12]), as shown in Figure 5e,f. The algorithm of Torres et al. [9] has the highest spatial coverage, especially over land (Figure 5e). Furthermore, the algorithm of Lee et al. [12] classifies aerosols at many more data points over the ocean because it includes sea salt aerosols. However, these earlier classification models [9,12] give much larger differences in SSA values (0.0265-0.0513 for Torres et al. [9]; 0.0190-0.0337 for Lee et al. [12]) than the RF-based model (<0.01) [25]. The RF-based algorithm used here yields improved spatial coverage with performance comparable with that of Choi et al. [25].

Discussion
Previous studies generally used a threshold-based approach for identifying aerosol types with satellite input variables [9][13][14][15][16], with aerosol optical properties and trace-gas information (with internal uncertainties) being obtained from various satellite sensors. Although uncertainties in satellite input parameters may lead to misclassification of aerosols, the early methods were rarely evaluated, as reported in [25]. The RF classification model of Choi et al. [25] was evaluated using AERONET aerosol optical properties and achieved reasonable classification performance, but its spatial coverage was limited by missing data in each satellite product.
Both Choi et al. [25] and this study applied a random forest model to classify aerosol types from satellite data; this study aimed to improve the spatial coverage of the RF model by reducing the number of input variables. Among the 11 input variables used by Choi et al. [25], we excluded those with many missing data points or low variable importance. The reduced set of satellite input variables includes four TROPOMI-based variables (aerosol index, tropospheric CO and NO2 column densities, and SZA) and two MODIS-based variables (annual land-cover type and percentage urban area). The performance of the RF-based model with the reduced number of variables is similar to that of the 11-variable model of Choi et al. [25], although with fewer input variables the model may not fully distinguish pure dust aerosols, especially because the MODIS aerosol optical variables are excluded. Nevertheless, the reduction in input variables led to improved spatial coverage of the satellite aerosol classification model.
Our satellite aerosol classification model was evaluated with the AERONET-based aerosol type dataset constructed by Shin et al. [27] and aerosol optical properties of typical aerosol types. In future work, our results should be compared with other ground-based aerosol classification methods. Hamil et al. [42], Ozdemir et al. [43], and Stefan et al. [44] suggested aerosol classification methods using mainly parameters related to aerosol size and scattering properties based on AERONET measurement data, with Kaskaoutis [45] combining in situ measurement data to classify aerosol types for the first time.

Summary and Conclusions
We improved the spatial coverage of the RF aerosol type classification model. Satellite input variables with many missing data points or low importance were excluded from the final input variable set. Four TROPOMI input variables (aerosol index, tropospheric CO and NO2 column densities, and SZA) and two MODIS variables (annual land-cover type and percentage urban area) were selected. The RF-based model with these reduced input variables gave a performance comparable to that of the model with 11 input variables. The global spatial coverage of the RF algorithm was compared with that of previous methods, with the RF-based model providing improved spatial coverage while maintaining a classification performance comparable with that of other models.
Future studies should improve the accuracy of the RF-based aerosol type classification through the use of additional variables (with fewer missing data), such as meteorological data from numerical weather prediction models and chemical variables from chemical-transport models. Furthermore, the collection of a longer-term dataset should improve the accuracy of the model, which may then provide aerosol type information with spatially continuous coverage, including global climatological distributions of aerosol types even where there are no AERONET sites.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: For the results and data generated during the study, please contact the first author.