Aerosol Optical Properties and Type Retrieval via Machine Learning and an All-Sky Imager

Abstract: This study investigates the applicability of sky information from an all-sky imager (ASI) for retrieving aerosol optical properties and type. Sky information from the ASI, in terms of Red-Green-Blue (RGB) channels and sun saturation area, is imported into a supervised machine learning algorithm for estimating five different aerosol optical properties related to aerosol burden (aerosol optical depth, AOD, at 440, 500 and 675 nm) and size (Ångström Exponent at 440–675 nm, and Fine Mode Fraction at 500 nm). The retrieved aerosol optical properties are compared against reference measurements from the AERONET station, showing adequate agreement (R: 0.89–0.95). The AOD errors increased for higher AOD values, whereas for AE and FMF, the biases increased for coarse particles. Regarding aerosol type classification, the retrieved properties can capture 77.5% of the total aerosol type cases, with excellent results for dust identification (>95% of the cases). The results of this work promote ASI as a valuable tool for aerosol optical properties and type retrieval.


Introduction
Aerosols regulate the Earth's radiative balance by interacting with atmospheric radiation [1]. Under cloudless conditions, the aerosol radiative impact on the climate is described by direct aerosol radiative forcing (DARF), which is often characterized by significant uncertainties due to the high spatiotemporal variability of the aerosols' physical and optical properties [2]. Those properties depend on the emission sources, both natural (e.g., sea salt and dust) and anthropogenic (e.g., biomass burning and fossil fuel combustion), and on aerosol composition and size [3][4][5][6]. The continuous and accurate monitoring of aerosol properties at the finest spatiotemporal resolution is desirable in order to (a) better understand the aerosol effect on climate and (b) model the solar irradiance.
Several techniques exist to monitor aerosol properties, including ground-based measurements and satellite observations. The AERONET (AERosol RObotic NETwork) provides aerosol optical properties at the highest temporal resolution (5-15 min), with over 600 stations worldwide using CIMEL sun-photometers [7], and it is commonly used for validation studies related to aerosol properties retrieval. Although the AERONET stations provide high-quality aerosol information, most stations are located in landlocked and low-elevation regions. On the other hand, satellite-based remote sensing offers a near-global spatiotemporal coverage of aerosol optical properties. Over the past few decades, satellite remote sensing using passive imaging radiometers, such as MODIS (MODerate resolution Imaging Spectroradiometer) [8][9][10], MISR (Multi-angle Imaging SpectroRadiometer) [11], MERIS (MEdium Resolution Imaging Spectrometer) [12], AATSR (Advanced Along Track Scanning Radiometer) [13], GOES (Geostationary Observational Environmental Satellite) [14], POLDER (POLarization and Directionality of the Earth's Reflectances) [15], and VIIRS (Visible Infrared Imaging Radiometer Suite) [16], has measured aerosol properties at regional and global scales. Several limitations characterize the aerosol satellite retrievals, such as (i) low temporal resolution, (ii) sensor calibration, (iii) cloud detection, and (iv) surface reflections and brightness, which induce significant uncertainties in the aerosol retrieval [17][18][19]. Given the limitations of the currently available aerosol retrievals, it is essential to explore alternative techniques for aerosol monitoring that may work in conjunction with remote sensing instruments.
Several studies reported the potential use of an ASI system to characterize aerosol properties. Olmo et al. [41] calculated the AOD at 550 nm using the spectral radiances obtained from ASI and radiative transfer model (RTM) simulations. The predicted AOD underestimated the AERONET observations, with 80% (60%) of the AOD deviations being lower than 0.04 (0.02). Cazorla et al. [42] used radiances derived from a calibrated ASI and an artificial neural network (ANN) to estimate AOD at three wavelengths (440, 675, and 870 nm). Compared to the AERONET, the predicted AODs revealed adequate performance, with a coefficient of determination (R²) higher than 0.90 at all wavelengths. In addition to AOD, they also calculated the Ångström Exponent (AE), revealing a weaker correlation against the AERONET (R² = 0.77). Huo and Lu [43] retrieved AOD at 500 nm by comparing the Blue (B) to Red (R) ratio derived from an ASI against the relevant radiance ratios (between 440 and 650 nm) from RTM simulations (MODTRAN) [44,45] at the Xianghe Observatory in Hebei Province, China. The retrieved AOD, against sun-photometer measurements, showed a correlation coefficient close to 0.95 and an average retrieval error of around 7%. Kazantzidis et al. [46] proposed a methodology to calculate AOD at 440, 500, and 675 nm using the RGB (Red-Green-Blue channels) intensities, the sun-saturated area (SAT) from an ASI, and RTM radiances (libRadtran) [47] at the Plataforma Solar de Almeria, Spain. A validation against the AERONET observations showed mean/median differences and a standard deviation lower than 0.01 and 0.03, respectively.
An ASI has been used in the GRASP code (Generalized Retrieval of Atmosphere and Surface Properties) [48] to retrieve aerosol properties [49]. The normalized sky radiances (NSRs) [50], extracted from the ASI at three effective wavelengths (467, 536, and 605 nm), were imported to GRASP for retrieving AOD. The GRASP-AOD and AERONET-AOD retrievals revealed R² close to 0.87. Generally, the median and standard deviation of the AOD differences were between 0.006-0.010 and 0.024-0.030, respectively. Scarlatti et al. [51] recently proposed a machine learning (ML) approach for AOD and AE retrieval using the smoothed RGB signals along the principal plane, as captured from a well-calibrated ASI installed at the University of Valencia, Spain. Several distinct ML models with varied input information were implemented. More specifically, AOD and AE were predicted using the color signal relevant to the spectral AOD or using solely the red channel. In addition, the B/R ratio was also applied to retrieve AOD and AE. All the aforementioned approaches revealed adequate retrieval performance against the AERONET, with R² exceeding 0.95. A novel point of the Scarlatti et al. [51] study was the inclusion of partially clouded images during ML model training.
Under clear-sky conditions, solar resource estimation is linked mainly to AOD and total column water vapor (TCWV) [52,53]. The global and direct components of the surface solar irradiance (GHI and DNI) are important for PV systems and concentrated solar power (CSP) technologies [53][54][55]. Under clear skies, these two solar parameters are primarily affected by the atmospheric aerosol burden and aerosol scattering effects [56]. Thus, quality-assured AOD measurements with low errors are required to model DNI and GHI accurately. Aerosol optical properties are frequently used to characterize aerosol type. Particle type classification is important in aerosol modelling since different aerosol size modes present different chemical compositions and deposition processes [57]. Various aerosol classification schemes have been proposed in the literature, relying on aerosol absorptivity (single scattering albedo, SSA) and AE and/or FMF using AERONET retrievals [58][59][60][61][62][63][64].
The principal objectives of the presented study are:
1. The application of a supervised learning technique for retrieving AOD at 440, 500, and 675 nm (AOD 440 nm, AOD 500 nm, and AOD 675 nm), the Ångström Exponent between 440 and 675 nm (AE 440-675 nm), and the Fine Mode Fraction at 500 nm (FMF 500 nm) using sky information from an ASI;
2. The assessment of how efficiently these retrievals can be used to perform aerosol-type classification.
The retrievals of the presented methodology can be used to expand and complement the aerosol properties and type information for periods in which reference measurements do not exist but sky information from an ASI is available.
This paper is structured as follows: Section 2 describes the measurement site and the measuring instruments. Section 3 presents the methodology for the retrieval of the aerosol properties and the implemented validation metrics. Section 4 presents the results of the study and, lastly, the main conclusions are summarized in Section 5.

Measurement Site
The data used in this study were collected from an ASI and a CIMEL sun-photometer installed at the National Observatory of Athens (NOA; 37.97° N, 23.72° E; 130 m above sea level) in Thissio, Greece. The site is characterized as urban, located in the city center of Athens. According to the Köppen-Geiger classification system, the climate is of type Csa, with warm and dry summers and wet and mild winters.
Both natural and anthropogenic sources contribute to the aerosol conditions of Athens. For instance, dust episodes from the Sahara Desert are a common natural occurrence, accounting for ~23% of the annually averaged AOD [65]. Other sources are sea-salt aerosols emitted by the surrounding sea bodies [66], biomass burning [67], wildfires [68,69], and photochemical pollution [70,71]. Raptis et al. [72] performed a detailed analysis using decadal measurements of aerosol properties (AOD 440 nm and AE 440-870 nm) from the NOA AERONET station to extract the seasonality and the trends of aerosols in the region. Polluted (27%) and mixed (23%) aerosols are dominant, followed by continental (19%) and dust (16%) particles. Marine (11%) and biomass burning (5%) aerosols have the lowest occurrence frequencies.

AERONET Station
The CE318 sun-sky photometer (CIMEL Electronique, Paris, France) is the standard instrument of all AERONET stations. It measures the sun-collimated direct beam irradiance and sky radiance to provide high-quality aerosol optical and microphysical properties [7,73]. More specifically, sun measurements with a 1.2° field of view were obtained every ~5 to 15 min, and AOD [74] was calculated at eight (8) standard wavelengths between 340 and 1640 nm. Precipitable water was also retrieved via the 940 nm channel [74,75]. Sky radiances in the almucantar geometry (zenith angle set equal to the solar zenith angle, with ±180° azimuthal sweeps) at 440, 675, 870, and 1020 nm also gave aerosol properties such as SSA, aerosol volume distribution, phase function, FMF, refractive index, etc., via an inverse algorithm [7,76]. Typically, the total AOD uncertainty under clear skies is less than ±0.01 for λ > 440 nm and less than ±0.02 for shorter wavelengths [74], reflecting the variations in atmospheric conditions and instrumental calibrations.
The AERONET database is divided into three data quality levels: Level 1.0, Level 1.5, and Level 2.0. In the latest Version 3, Level 1.0 is for prescreened data; Level 1.5 represents data with near-real-time automatic cloud screening and automatic instrument anomaly quality controls; and Level 2.0 includes pre-field and post-field calibrations [74]. The current work uses AERONET Level 2.0 Version 3 (L2V3) measurements and, more specifically, the following parameters: AOD 440 nm, AOD 500 nm, AOD 675 nm, AE 440-675 nm, and FMF 500 nm. The data were downloaded directly from the network's website (https://aeronet.gsfc.nasa.gov/; accessed on 28 June 2023).

All-Sky Imager
In this work, a commercial Mobotix Q24M ASI (www.mobotix.com; accessed on 28 June 2023) was installed at the NOA station. This hemispheric camera is a weatherproof IP network dome camera with a special hemispheric lens (fisheye). It provides sky images every 640 µs, stored in 24-bit JPEG format with a spatial resolution of 1024 × 768 pixels. The sensor has an RGB filter. These channels correspond to 440, 500, and 675 nm wavelengths, with color intensities ranging from 0 to 255. The whole sky is represented as a circle, and the system has no shadowing mechanism. Inevitably, at the images' edges, close to the horizon, obstacles such as other instruments installed on the rooftop of the measurement site, nearby buildings, and trees are detected and excluded from further analysis.
The aerosol measurements from the CIMEL sun-photometer were temporally synchronized with the nearest ASI image. The study period ranges between 1 January 2021 and 18 November 2021, and the measurements were collected during the experimental campaign of the ASPIRE project (Atmospheric parameters affecting SPectral solar IRradiance and solar Energy) (https://aspire.geol.uoa.gr/; accessed on 28 June 2023). Possible cloud-contaminated aerosol data that were not flagged as cloudy instances during the cloud-screening approach were further detected and removed from the dataset based on the ASI images. Moreover, instances with a solar zenith angle (SZA) greater than 70° were discarded to avoid cases with the sun at the edges of the image, where obstacles may interfere. Finally, 3212 images were retained for subsequent analysis.
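The temporal matching and SZA screening described above can be sketched as follows. This is a minimal illustration rather than the authors' processing code: the function names, the record layout (a dictionary with a `sza_deg` key), and the optional `max_gap_s` tolerance are hypothetical, and timestamps are assumed to be plain seconds in a sorted list.

```python
from bisect import bisect_left

def match_nearest(aeronet_times, image_times, max_gap_s=None):
    """Return, for each AERONET time, the index of the nearest ASI image.

    image_times must be sorted in ascending order. max_gap_s is an
    illustrative tolerance (not stated in the text); when set, matches
    farther apart than this are returned as None.
    """
    matches = []
    for t in aeronet_times:
        pos = bisect_left(image_times, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(image_times)]
        if not candidates:
            matches.append(None)
            continue
        best = min(candidates, key=lambda i: abs(image_times[i] - t))
        if max_gap_s is not None and abs(image_times[best] - t) > max_gap_s:
            best = None
        matches.append(best)
    return matches

def keep_low_sza(records, sza_limit_deg=70.0):
    """Discard records whose solar zenith angle exceeds the limit (70 deg in the text)."""
    return [r for r in records if r["sza_deg"] <= sza_limit_deg]
```

A binary search keeps the matching fast even when the image archive is much denser than the photometer record.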

Machine Learning Approach
This work uses an updated Gradient Boosting Machine (GBM) [77] algorithm, the Light GBM (LGBM) [78], for regression. GBM is an ensemble technique that builds decision trees iteratively in an additive and sequential way. The way the decision trees are grown is the principal difference between the two algorithms: LGBM implements leaf-wise tree growth, whereas GBM applies depth-wise tree growth. Leaf-wise growth is associated with lower memory usage, better accuracy, and easier handling of large datasets. Although LGBM constitutes a suitable algorithm, several other supervised machine learning (ML) techniques are also investigated (see Section 4.1). LGBM is ultimately selected on the basis of overall performance, accuracy, and total training time.
LGBM is trained and tested using a 70/30 random splitting procedure. Several parameters, called hyperparameters, describe the LGBM architecture (e.g., learning rate, max depth). The 'optimal' hyperparameter combination is selected through a randomized search combined with 10-fold cross-validation. The mean square error (MSE), a commonly used measure of error in regression analyses, serves as the fitness function, and the combination with the lowest MSE is taken as the 'optimal' one.
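The tuning procedure can be sketched in a library-agnostic way. The sketch below is an assumption-laden illustration, not the authors' implementation: all function names are hypothetical, the number of sampled combinations (`n_iter`) and the search space are not given in the text, and the model is passed in as a callable `fit_predict(params, X_fit, y_fit, X_val)`. In practice that callable could wrap a LightGBM regressor such as `lightgbm.LGBMRegressor`.

```python
import random
from statistics import mean

def split_70_30(n_samples, test_fraction=0.30, seed=0):
    """Random 70/30 split; returns (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def make_folds(indices, k=10):
    """Partition the (already shuffled) indices into k interleaved folds."""
    return [indices[i::k] for i in range(k)]

def cross_validated_mse(fit_predict, params, X, y, train_idx, k=10):
    """Mean MSE over k folds, each fold serving once as the validation set."""
    folds = make_folds(train_idx, k)
    fold_mse = []
    for i, val_idx in enumerate(folds):
        fit_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        preds = fit_predict(params,
                            [X[j] for j in fit_idx], [y[j] for j in fit_idx],
                            [X[j] for j in val_idx])
        fold_mse.append(mean((p - y[j]) ** 2 for p, j in zip(preds, val_idx)))
    return mean(fold_mse)

def randomized_search(fit_predict, search_space, X, y, train_idx,
                      n_iter=20, k=10, seed=0):
    """Sample n_iter random combinations and keep the one with the lowest CV MSE."""
    rng = random.Random(seed)
    best_params, best_mse = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in search_space.items()}
        mse = cross_validated_mse(fit_predict, params, X, y, train_idx, k)
        if mse < best_mse:
            best_params, best_mse = params, mse
    return best_params, best_mse
```

Only the training indices enter the search, so the 30% testing subset stays untouched until the final evaluation.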
Different LGBM models were designed to predict AOD 440 nm, AOD 500 nm, AOD 675 nm, AE 440-675 nm, and FMF 500 nm using the sky information from ASI images, with the SZA and total column water vapor (TCWV) as auxiliary information. Regarding ASI sky information, two different sources of data were extracted. Firstly, RGB values from 60 pixels around the sun are selected (Figure 1c-e): 30 pixels lie at a constant zenith angle and variable azimuth angle (almucantar: curved line (2) in Figure 1c-e), and the other 30 lie at a constant azimuth angle and variable zenith angle (principal plane: straight line (1) in Figure 1c-e). Torres et al. [79] stated that almucantar retrievals, combined with principal plane inversions, using ground-based radiometer observations could result in satisfying aerosol retrievals. Almucantar observations offer the great advantage of symmetry, a decisive factor in minimizing errors and in cloud-screening processes. At the same time, principal plane retrievals provide accuracy during the day, which is crucial for aerosol studies.
Secondly, the saturation area (SAT, in %) is defined as the ratio between the number of pixels around the sun that include sunlight and the total number of image pixels. To illustrate the significance of SAT, two cases with different AOD and almost identical solar geometries are shown in Figure 1a,b. The enlarged solar disk in Figure 1b shows that an increase in SAT (from 1.8% to 4.1%) accompanies an increase in AOD (from 0.12 to 0.42). In particular, a higher SAT corresponds to a higher aerosol burden within the atmosphere, according to the Mie theory [80].
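The SAT definition can be illustrated with a short sketch. The saturation criterion is an assumption: the text does not state how a pixel is judged to "include sunlight", so the example counts pixels whose three channels all reach a hypothetical threshold (255 by default). It also counts such pixels over the whole image instead of isolating the region around the sun, which a real implementation would likely need to do.

```python
def saturation_area_percent(image, threshold=255):
    """Percentage of image pixels that are saturated in all three channels.

    image: list of rows, each row a list of (R, G, B) tuples with values 0-255.
    threshold: assumed saturation level; not specified in the text.
    """
    total = 0
    saturated = 0
    for row in image:
        for r, g, b in row:
            total += 1
            if r >= threshold and g >= threshold and b >= threshold:
                saturated += 1
    if total == 0:
        raise ValueError("image contains no pixels")
    return 100.0 * saturated / total
```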
TCWV strongly correlates with AOD in various aerosol environments (biomass burning, heavy pollution, and dust) [81]. Therefore, TCWV is also included as model input. TCWV data were obtained from the CAMS (Copernicus Atmosphere Monitoring Service, https://ads.atmosphere.copernicus.eu/; accessed on 28 June 2023) Reanalysis product [82] and were linearly interpolated from their native temporal resolution (3 h) to match the AERONET and ASI measurements. TCWV from CAMS agrees well with AERONET TCWV, with R² = 0.94 and RMSE = 0.22 cm (Figure S1).
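The temporal interpolation step can be sketched as below. The function name and the representation of time as numeric values (for example, hours) are illustrative assumptions; the sketch refuses to extrapolate outside the reanalysis time range, which is a design choice and not something stated in the text.

```python
from bisect import bisect_right

def interpolate_linear(grid_times, grid_values, target_time):
    """Linearly interpolate a coarse series (e.g., 3-hourly TCWV) to target_time.

    grid_times must be sorted in ascending order and use the same units as
    target_time. Times outside the grid raise ValueError.
    """
    if not grid_times[0] <= target_time <= grid_times[-1]:
        raise ValueError("target_time lies outside the reanalysis time range")
    hi = bisect_right(grid_times, target_time)
    if hi == len(grid_times):          # target coincides with the last grid time
        return grid_values[-1]
    lo = hi - 1
    t0, t1 = grid_times[lo], grid_times[hi]
    weight = (target_time - t0) / (t1 - t0)
    return grid_values[lo] + weight * (grid_values[hi] - grid_values[lo])
```

For instance, with values of 1.0 cm at 00:00 and 2.5 cm at 03:00, a measurement at 01:00 would be assigned one third of the way between the two.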
Finally, five LGBM models are trained, one for each aerosol parameter. SAT, SZA, and TCWV are standard input parameters for all models. In addition, the models for AOD 440 nm, AOD 500 nm, and AOD 675 nm use the image channel relevant to each wavelength (blue, green, and red, respectively), whereas the models for the aerosol size properties use all RGB values. Before the training process, all input parameters are normalized to the range 0 to 1 through Min-Max normalization, except for SZA, to which the cosine function is applied. Hereafter, the retrievals of the proposed methodology will be abbreviated as 'ML-ASI'.
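The input scaling can be sketched as follows. The function names are hypothetical. The text does not say which data define the minimum and maximum; passing bounds computed on the training subset (via `v_min` and `v_max`) is a common precaution against information leaking from the testing subset, and is offered here only as an assumption.

```python
import math

def min_max_scale(values, v_min=None, v_max=None):
    """Scale values to [0, 1]; bounds default to the min/max of `values`."""
    lo = min(values) if v_min is None else v_min
    hi = max(values) if v_max is None else v_max
    if hi == lo:
        raise ValueError("cannot scale a constant feature")
    return [(v - lo) / (hi - lo) for v in values]

def cos_sza(sza_deg):
    """Cosine of the solar zenith angle, used in place of Min-Max scaling for SZA."""
    return math.cos(math.radians(sza_deg))
```

Since the retained SZA values do not exceed 70°, the cosine already falls between about 0.34 and 1, so no further scaling is needed for that input.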
The proposed methodology can be applied wherever AERONET retrievals and ASI data are jointly available to train the machine learning algorithm. The CIMEL sun photometer can be installed for a certain period (in this study, 11 months) to record the month-to-month variability of aerosol and climate characteristics. After the model training, the ML-ASI can be used to reproduce the aerosol optical properties without the presence of the CIMEL. It should be mentioned that model transferability is possible at sites with similar aerosol and climate conditions without requiring the presence of a sun photometer.
The performance of the applied methodology could be affected by geographic location in different ways. Firstly, the proposed methodology is applied only to clear-sky ASI measurements. In regions with relatively frequent cloud cover (e.g., Northern Europe), training the machine learning algorithm is expected to be more difficult because fewer data samples are available. Secondly, the role of the TCWV input is also expected to depend on location. At higher latitudes (>60° N and >60° S), due to lower water vapor levels, the TCWV-AOD correlation may weaken or vanish. Nevertheless, a weaker TCWV-AOD correlation is not expected to reduce the performance of the applied methodology but rather the importance of TCWV as a feature in the machine learning algorithm.

Validation Metrics
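The predicted aerosol parameters are compared against ground observations in terms of the Mean Bias Error (MBE), relative MBE (rMBE), Root Mean Square Error (RMSE), relative RMSE (rRMSE), and Pearson's correlation coefficient (R):

$$\mathrm{MBE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_{\text{ML-ASI},i} - y_{\text{AERONET},i}\right) \tag{1}$$

$$\mathrm{rMBE} = 100 \times \frac{\mathrm{MBE}}{\overline{y}_{\text{AERONET}}} \tag{2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{\text{ML-ASI},i} - y_{\text{AERONET},i}\right)^{2}} \tag{3}$$

$$\mathrm{rRMSE} = 100 \times \frac{\mathrm{RMSE}}{\overline{y}_{\text{AERONET}}} \tag{4}$$

$$R = \frac{\sum_{i=1}^{N}\left(y_{\text{ML-ASI},i} - \overline{y}_{\text{ML-ASI}}\right)\left(y_{\text{AERONET},i} - \overline{y}_{\text{AERONET}}\right)}{\sqrt{\sum_{i=1}^{N}\left(y_{\text{ML-ASI},i} - \overline{y}_{\text{ML-ASI}}\right)^{2}}\,\sqrt{\sum_{i=1}^{N}\left(y_{\text{AERONET},i} - \overline{y}_{\text{AERONET}}\right)^{2}}} \tag{5}$$

where y ML-ASI and y AERONET are the retrieved aerosol parameters from the ML model and AERONET, respectively, N is the number of paired values, and overbars denote mean values.
The performance of ML-ASI-predicted aerosol retrievals for the aerosol type classification is assessed in terms of total accuracy and precision:

$$\text{Total Accuracy} = \frac{\text{True cases}}{\text{Total cases}} \tag{6}$$

$$\text{Precision} = \frac{\text{Predicted cases per aerosol type}}{\text{Observed cases per aerosol type}} \tag{7}$$

All the above statistical metrics are calculated solely using the testing dataset as described in Section 3.1. The testing dataset is also used to interpret the results of the proposed methodology (Section 4).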
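The metrics named above can be computed with a few lines of code. The sketch below makes several assumptions that the text does not spell out: differences are taken as ML-ASI minus AERONET, the relative metrics are expressed in percent of the mean AERONET value, and "precision" per aerosol type is read as the number of correctly predicted cases of a type divided by the number of observed cases of that type. All function names are hypothetical.

```python
import math

def regression_metrics(predicted, reference):
    """MBE, rMBE (%), RMSE, rRMSE (%), and Pearson's R for paired values."""
    n = len(reference)
    mean_pred = sum(predicted) / n
    mean_ref = sum(reference) / n
    diffs = [p - r for p, r in zip(predicted, reference)]  # ML-ASI minus AERONET
    mbe = sum(diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    cov = sum((p - mean_pred) * (r - mean_ref) for p, r in zip(predicted, reference))
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    var_ref = sum((r - mean_ref) ** 2 for r in reference)
    return {
        "MBE": mbe,
        "rMBE": 100.0 * mbe / mean_ref,    # assumed normalization by the reference mean
        "RMSE": rmse,
        "rRMSE": 100.0 * rmse / mean_ref,  # assumed normalization by the reference mean
        "R": cov / math.sqrt(var_pred * var_ref),
    }

def total_accuracy(predicted_types, observed_types):
    """Fraction of cases whose predicted aerosol type equals the observed type."""
    hits = sum(1 for p, o in zip(predicted_types, observed_types) if p == o)
    return hits / len(observed_types)

def precision_per_type(predicted_types, observed_types):
    """Per type: correctly predicted cases divided by observed cases (assumed reading)."""
    result = {}
    for aerosol_type in set(observed_types):
        n_obs = sum(1 for o in observed_types if o == aerosol_type)
        n_hit = sum(1 for p, o in zip(predicted_types, observed_types)
                    if o == aerosol_type and p == o)
        result[aerosol_type] = n_hit / n_obs
    return result
```

With this sign convention, a negative MBE means that ML-ASI underestimates the AERONET value on average.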


Performance of the Retrieved Aerosol Optical Properties
The LGBM model's efficiency in reproducing aerosol parameters is presented in the following sections. A sensitivity analysis of the model input parameters is presented in Section 4.1.1. The AOD 440 nm, AOD 500 nm, and AOD 675 nm accuracy is shown in Section 4.1.2, and the performance of the AE 440-675 nm and FMF 500 nm is discussed in Section 4.1.3.

Sensitivity Analysis on Model Input Parameters and ML Application
Two sensitivity exercises are performed to achieve the best model accuracy. Firstly, the model performance is evaluated using (a) sky information from all 60 pixels as already described in Figure 1c-e (blue bars in Figure 2) or (b) pixels at specific zenith angles (25-35° and 51-61°) (green bars in Figure 2). Figure S2 shows the channels' intensity variations against the zenith angle. The first range of the specific zenith angles (25-35°) forms an area close to the sun, with intense scattering effects, while the second range (51-61°) is located farther from the sun, where the channel intensity still decreases with increasing zenith angle. At higher zenith angles, the color intensity shows an almost stable pattern (Figure S2). The model performance for all aerosol parameters decreases when the zenith-angle coverage is reduced. R and RMSE ranges are 0.87-0.94 and 0.06-0.16 using all pixels, while for the specific pixels, the corresponding ranges are 0.84-0.93 and 0.07-0.18.
Furthermore, the model's performance is also evaluated (a) with TCWV (Figure 2; orange and red bars) and (b) without TCWV (Figure 2; blue and green bars) as input in the training stage. Including TCWV increases the accuracy of LGBM, with R and RMSE ranges of 0.89-0.95 and 0.05-0.15 (Figure 2). Slightly poorer performance is found for the models without TCWV as input (R: 0.88-0.94 and RMSE: 0.06-0.18). Generally, the user can select which scenario fits the application. The model improves slightly when TCWV is included, but ASIs alone provide accurate estimates that can be used for further analysis.
Based on the above sensitivity analysis, the following sections only discuss the scenario with 60 all-sky pixels of Figure 1c-e and TCWV.
In addition, the performance of other ML algorithms is also investigated using the best input scenario described above. Eight ML models with different prediction mechanisms and algorithmic structures (Table S1) are evaluated, namely, (1) linear-based: MARS (Multivariate Adaptive Regression Splines) and ANN (Artificial Neural Network), (2) tree-based: GBM, LGBM, XGBoost (Extreme Gradient Boosting Machine), and RF (Random Forest), (3) distance-based: KNN (K-Nearest Neighbors), and (4) kernel-based: SVM (Support Vector Machine). The 'best' ML model is selected based on overall performance, accuracy, and tuning/training time. The ML algorithms are evaluated on the same computer system following the hyperparameter tuning strategy described in Section 3.1. According to Figure S3, the models' performance varies between 0.81-0.93 in terms of R (Figure S3a) and between 0.05-0.09 in terms of RMSE (Figure S3b). LGBM achieves the highest R and lowest RMSE values. Its high performance and the relatively low tuning/training time (~2 min; Figure S3c) explain the selection of LGBM in this study.

Aerosol Optical Depth Retrieval Performance
The ML-ASI AOD retrievals are compared against reference AOD measurements from the AERONET. Figure 3a-c display the scatter density plots of ML-ASI AOD against the AERONET AOD at each wavelength. The ML-ASI AOD correlates well with the AERONET, with R values extending from 0.89 to 0.93 at all wavelengths (Table 1). The negative values of MBE show an overall underestimation of ML-ASI for AOD 440 nm and AOD 675 nm, while ML-ASI slightly overestimates the AERONET for AOD 500 nm (Table 1). The results in Figure 3 show an overall underestimation of the estimated AODs, especially under high aerosol burden conditions (AOD > 0.5). However, such high AOD cases were limited in the available dataset. In addition, the dispersion error is relatively low, with RMSEs lying between 0.05 and 0.07, revealing the adequate retrieval accuracy of the ML model.
Table 1. Mean bias error (MBE), root mean square error (RMSE), and correlation coefficient (R) metrics for aerosol optical depth (AOD) at the three wavelengths, along with their relative values (in parentheses).
Figure 3d-f represent the distribution of the differences between the ML-ASI and AERONET AODs (∆AOD = ML-ASI AOD − AERONET AOD). The peak of the distribution at 440 and 675 nm is slightly biased to negative values, explaining the overall underestimation of ML-ASI AOD relative to the AERONET. Regarding the frequency distribution at 500 nm, the peak is centered at zero. The shape of the distribution implies that ML-ASI slightly overestimates the AERONET AOD (more positive AOD differences at the right tail of the statistical distribution). The AERONET's estimated AOD uncertainties in the visible spectrum are 0.01 [74]. More than 40% of ML-ASI AOD retrievals revealed differences lower than the AERONET's uncertainties, whereas 85% of ML-ASI AOD retrievals differed from the AERONET by less than 0.05 (Figure 3).
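The percentages quoted for the AOD differences correspond to a simple counting exercise, sketched below. The function name is hypothetical, and the use of a strict inequality on the absolute difference is an assumption about how "lower than" was applied.

```python
def fraction_within(predicted, reference, tolerance):
    """Fraction of paired values whose absolute difference is below `tolerance`."""
    diffs = [abs(p - r) for p, r in zip(predicted, reference)]
    return sum(1 for d in diffs if d < tolerance) / len(diffs)
```

Calling it with a tolerance of 0.01 (the AERONET AOD uncertainty) and of 0.05 on the testing subset would give the two kinds of fractions reported in the text.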
∆AOD against AOD ranges is shown in Figure 4. The amount of available data per bin is clearly much higher for lower AODs. For AOD > 0.6, only 14-26 data points are found within the testing dataset. ∆AOD is relatively small for AOD < 0.4, with the boxplot median (and mean) value close to zero. The magnitude of ∆AOD increases with AOD, and the median (and mean) becomes negative, indicating the substantial underestimation of ML-ASI AOD at high AERONET AODs.
Figure 4. Boxplots of the differences between ML-ASI and AERONET (∆AOD) at 440 nm (blue), 500 nm (green), and 675 nm (red) for specific AOD ranges relying on AERONET data. The numbers above each boxplot correspond to the total number of measurements.
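The binning behind such boxplots can be sketched as follows. The bin edges are supplied by the caller because the exact AOD ranges are not listed in the text, and the function name and returned dictionary layout are hypothetical. Differences are grouped by the AERONET (reference) AOD, as the caption indicates.

```python
from statistics import mean, median

def binned_differences(predicted, reference, edges):
    """Group (predicted - reference) by reference value into [lo, hi) bins.

    Returns {(lo, hi): {"n": count, "mean": ..., "median": ...}}; statistics
    are None for empty bins.
    """
    summary = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        diffs = [p - r for p, r in zip(predicted, reference) if lo <= r < hi]
        summary[(lo, hi)] = {
            "n": len(diffs),
            "mean": mean(diffs) if diffs else None,
            "median": median(diffs) if diffs else None,
        }
    return summary
```

Reporting the count per bin alongside the median makes it easy to see when a large bias rests on only a handful of cases, as for the highest AOD range here.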

AE 440-675 nm and FMF 500 nm Retrieval Performance
This subsection investigates the AE 440-675 nm and FMF 500 nm parameters. For the AERONET, the Ångström power law is used to calculate the spectral-dependent AE. In addition, FMF 500 nm is derived as the fraction of the fine-mode AOD to the total AOD, accounting for the amount of anthropogenic aerosols in the atmospheric column. Figure 5 shows the scatter density plots for ML-ASI AE 440-675 nm and FMF 500 nm against the AERONET. Both parameters are in good agreement with the observations. R is equal to 0.92 and 0.95 for AE 440-675 nm and FMF 500 nm, respectively (Table 2). The positive values of MBE (0.007-0.017) indicate a slight overestimation of ML-ASI AE 440-675 nm and FMF 500 nm relative to the AERONET. In addition, the low RMSEs, 0.15 and 0.057 for AE 440-675 nm and FMF 500 nm, verify the good accuracy of the ML-ASI-derived aerosol parameters.
The distribution of ∆FMF 500 nm shows that most values are around the central tendencies (mean and median), similar to the AOD distributions of Figure 3, with a slight overestimation (MBE = 0.007). The AERONET's estimated FMF 500 nm and AE 440-675 nm uncertainties are around 0.1 and 0.2, respectively [83]. More than 90% and 80% of ML-ASI FMF 500 nm and AE 440-675 nm retrievals, respectively, lie within the AERONET's uncertainties (Figure 5).
As in Section 4.1.2 for AOD, ∆AE 440-675 nm and ∆FMF 500 nm are calculated at specific ranges (Figure 6). The median of the boxplots is positive at almost all ranges, highlighting that ML-ASI predictions tend to overestimate the AERONET. The higher values of ML-ASI AE 440-675 nm and FMF 500 nm indicate that the ML-ASI model also overestimates the real size of aerosol particles. Only for AE 440-675 nm > 1.8 and FMF 500 nm > 0.8 are the medians negative, showing a minimal underestimation of ML-ASI retrievals relative to the AERONET.
Figure 6. Boxplots of the differences between ML-ASI and AERONET for (a) Ångström Exponent between 440 and 675 nm (AE 440-675 nm) and (b) Fine Mode Fraction at 500 nm (FMF 500 nm). The numbers above each boxplot correspond to the total number of measurements.
According to Section 3.1, an LGBM model is created for retrieving ML-ASI AE440-675 nm. Alternatively, the spectral ML-ASI AOD retrievals at 440 and 675 nm can be used to calculate AE440-675 nm by applying the Ångström power law (hereafter abbreviated as pAE440-675 nm). Based on Figure 7a, pAE440-675 nm provided relatively low accuracy (RMSE = 0.59) and overall performance (R = 0.60) against the AERONET AE440-675 nm. This can also be observed in the wider dispersion of the corresponding differences (ΔpAE440-675 nm) against the AERONET AE440-675 nm (histogram of Figure 7b). For instance, in several cases (Figure 7b), the ML retrievals differ by more than one unit from the reference measurements. The poor performance of pAE440-675 nm could be attributed to: (a) the independence of the LGBM models in the AOD retrievals and (b) the sensitivity of the Ångström power law to small changes in the spectrally determined AOD. For example, a raw measurement can be overestimated by the ML-ASI AOD440 nm and underestimated by the ML-ASI AOD675 nm, and vice versa, leading to significant differences between the pAE440-675 nm and the AERONET AE440-675 nm.
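The sensitivity of the Ångström power law to small AOD errors can be illustrated with a short sketch. It uses the standard two-wavelength form of the law; the function name `angstrom_exponent`, the reference AOD values, and the error of 0.02 are hypothetical choices for illustration and are not taken from the study.

```python
import math


def angstrom_exponent(aod_short, aod_long, wl_short=440.0, wl_long=675.0):
    """Angstrom exponent from AOD at two wavelengths (wavelengths in nm)."""
    if min(aod_short, aod_long) <= 0:
        raise ValueError("AOD values must be positive")
    return -math.log(aod_short / aod_long) / math.log(wl_short / wl_long)


# Hypothetical reference AODs, constructed so that the reference AE is 1.5.
aod675_ref = 0.10
aod440_ref = aod675_ref * (440.0 / 675.0) ** -1.5
ae_ref = angstrom_exponent(aod440_ref, aod675_ref)

# Opposite-sign AOD errors of 0.02 (an assumed magnitude):
# AOD440 too low and AOD675 too high lowers the derived AE ...
ae_low = angstrom_exponent(aod440_ref - 0.02, aod675_ref + 0.02)
# ... while AOD440 too high and AOD675 too low raises it.
ae_high = angstrom_exponent(aod440_ref + 0.02, aod675_ref - 0.02)
```

In this toy case, AOD errors of only 0.02 with opposite signs shift the derived exponent by more than 0.5 in either direction, which is consistent with the wide dispersion of ΔpAE440-675 nm described above.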
Figure 7. (c) Scatter plot between the differences of ML-ASI and AERONET aerosol optical depth (AOD) at 440 nm (x-axis; ΔAOD440 nm) and 675 nm (y-axis; ΔAOD675 nm). The color bar represents ΔpAE440-675 nm.
The scatterplot of Figure 7c represents the differences between the ML-ASI and AERONET AODs and how they are related to the corresponding ΔpAE440-675 nm. The scatterplot is divided into six areas (from A1 to B2) based on the AE440-675 nm difference signs. For cases with different signs in the ΔAOD, like in A1 (ML-ASI AOD675 nm overestimates and ML-ASI AOD440 nm underestimates the AERONET), pAE440-675 nm substantially underestimates the AERONET AE440-675 nm (blue area in the color bar of Figure 7c). A contrasting result is observed in the B2 area, which is characterized by substantial overestimation. For ΔAOD values with the same sign (A2 and B1 in Figure 7c), pAE440-675 nm reveals lower biases, and the bias sign depends on the ΔAOD sign. If ΔAOD440 nm is higher than ΔAOD675 nm, pAE440-675 nm overestimates the AERONET (areas B1-b and A2-b of Figure 7c). The opposite behavior is documented for the B1-a and A2-a areas of Figure 7c.
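The sign pattern described above can be made plausible with a first-order error propagation of the Ångström power law. This is a sketch under the assumption of small errors, not a derivation from the paper; τ440 and τ675 denote the AOD at 440 and 675 nm, and Δτ the corresponding ML-ASI minus AERONET differences.

```latex
\mathrm{pAE} = -\frac{\ln\left(\tau_{440}/\tau_{675}\right)}{\ln\left(440/675\right)}
\quad\Longrightarrow\quad
\Delta\mathrm{pAE} \approx \frac{1}{\ln\left(675/440\right)}
\left(\frac{\Delta\tau_{440}}{\tau_{440}} - \frac{\Delta\tau_{675}}{\tau_{675}}\right)
```

Because ln(675/440) is positive, opposite-sign errors add up in magnitude (a negative Δτ440 with a positive Δτ675 gives a negative ΔpAE, and vice versa), whereas same-sign errors partly cancel, leaving a bias whose sign follows the larger relative error.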

Aerosol Type Classification
This section employs the predicted aerosol properties to perform aerosol-type classification. The aerosol classification scheme of Raptis et al. [72] is adapted, as it was applied to the same AERONET station (see Section 2.1). More specifically, aerosols are classified into six (6) main classes, (1) biomass burning, (2) continental, (3) dust, (4) marine, (5) mixed, and (6) polluted, relying on pre-defined fixed threshold limits of AOD440 nm and AE440-870 nm. The applied aerosol classification scheme defines the prevailing aerosol type. It does not guarantee the purest aerosol-type conditions, whose identification requires additional information about aerosol absorptivity [62]. Both pie charts in Figure 8a,b used the AERONET retrievals at different study periods. More specifically, Figure 8a,b were generated by including the AERONET retrievals for the 2008-2018 and 01/2021-11/2021 time frames, respectively. Figure 8c includes the ML-ASI retrievals of the testing dataset during the study period (January 2021-November 2021).
In this study, a relatively higher percentage of dust particles was observed, highlighting the increased dust activity (~40%; Figure 8b) within the study period compared to the climatological average value (16%; Figure 8a). In addition, lower percentages for the mixed (~14%; Figure 8b) and polluted (~18%; Figure 8b) aerosol types are found compared to the climatological average values (23% and 27%, respectively; Figure 8a). The remaining aerosol types showed small differences (<5%). In regions with both anthropogenic and natural aerosol sources, differences in the yearly averages of aerosol type occurrence are likely to be documented. For example, aeolian dust particles from the Sahara Desert often disperse towards the Mediterranean basin, reaching the Southern European regions, but the seasonal dust concentration levels are strongly related to the variability of cyclones that occur during the year [84].
At first glance, ML retrievals seem to reproduce the aerosol classification scheme of the AERONET quite well, with minor differences (<3%) for most aerosol types (Figure 8c). However, in the case of the mixed aerosol type, the difference is almost 5.0%. Apart from the pie charts, the confusion matrix of Figure 9 gives more insight into the aerosol classification outcomes by presenting the correctly and falsely assigned aerosol types using the ML-ASI results. The last column of the confusion matrix displays the Precision of each aerosol type. In particular, the green/red percentage corresponds to the correctly/falsely assigned aerosol type from the ML-ASI divided by the true assigned aerosol type from the AERONET. Above each percentage, the total number of aerosol type cases is presented, as assigned by ML-ASI. The last row of the last column is the Total Accuracy. Overall, the ML-ASI method captures 77.5% of the total cases. The percentage contribution of each aerosol type to Total Accuracy is presented by the diagonal elements of the matrix, with the dust and continental types providing the highest contribution. The last row of the table shows the true number of cases for each aerosol type as revealed using the AERONET retrievals. The percentages of the last row are calculated in the same way as discussed above for the last column of the confusion matrix, evaluating the AERONET retrievals against ML-ASI.
A great advantage of this confusion matrix is that it reveals the falsely assigned aerosol types. For instance, ML-ASI has the tendency to falsely document the polluted aerosol type as mixed (35 out of 964 total cases, 3.6%), which is related to the underestimation of ML-ASI AE against the AERONET retrievals. A promising finding is the high share of correctly predicted dust aerosol types (Precision > 95.0% of the cases). In addition, the continental, polluted, and marine aerosol types are adequately extracted with a precision exceeding 60%. Moderate precision (~50.0%) is calculated for the mixed aerosol type and low precision (<29.0%) for the biomass-burning aerosols.
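The per-type Precision and the Total Accuracy discussed above can be sketched as follows. The code uses the standard definitions, which are assumed here to match Equations (6) and (7); it further assumes that rows hold the types assigned by ML-ASI and columns the types from the AERONET. The function name `precision_and_accuracy` and the three-class toy matrix are hypothetical and do not reproduce Figure 9.

```python
def precision_and_accuracy(matrix):
    """Per-class precision (row-wise) and total accuracy of a square matrix.

    Assumed layout: matrix[i][j] counts cases predicted as class i
    whose reference class is j.
    """
    n = len(matrix)
    if n == 0 or any(len(row) != n for row in matrix):
        raise ValueError("confusion matrix must be square and non-empty")
    precision = []
    for i, row in enumerate(matrix):
        predicted_total = sum(row)
        # Precision is undefined when a class is never predicted.
        precision.append(row[i] / predicted_total if predicted_total else float("nan"))
    total_cases = sum(sum(row) for row in matrix)
    if total_cases == 0:
        raise ValueError("confusion matrix contains no cases")
    accuracy = sum(matrix[i][i] for i in range(n)) / total_cases
    return precision, accuracy


# Toy counts for three classes, for illustration only.
toy_matrix = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 5],
]
precision, accuracy = precision_and_accuracy(toy_matrix)
```

Swapping rows and columns in the same function would give the column-wise percentages of the last row of the matrix, i.e., the evaluation of the AERONET retrievals against ML-ASI.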

Conclusions
The current study investigates the feasibility of using the detailed sky information from an all-sky imager to retrieve various aerosol optical properties and types using machine learning. The presented retrieval methodology uses the RGBs and the percentage of saturated pixels near the sun extracted from the all-sky imager, along with the solar zenith angle and the total column water vapor, to train the machine learning algorithm (here, LightGBM; LGBM). Five individual models were trained to retrieve aerosol properties related to aerosol burden (AOD440 nm, AOD500 nm, and AOD675 nm) and size (AE440-675 nm and FMF500 nm). Then, the retrieved aerosol properties were used to perform an aerosol type classification.
A comprehensive analysis of the factors affecting the performance of the proposed methodology has been conducted by (i) varying the sky information in terms of sky pixel intensity, (ii) including the total column water vapor, and (iii) applying various machine learning algorithms.The best model performance was obtained by including more information about the sky, such as zenith angles distant from the sun and total column water vapor.Among the various algorithms, the LGBM revealed the highest performance and the best accuracy at a relatively low training time.
The retrieved aerosol properties (ML-ASI) correlated well with reference ground-based measurements (AERONET), recording R values of 0.89-0.93 for AOD and 0.92-0.95 for AE440-675 nm and FMF500 nm. The differences between the ML-ASI and AERONET AOD tend to increase with AOD magnitude. Regarding the properties for aerosol size determination, the largest differences (ΔAE440-675 nm and ΔFMF500 nm) were found for the coarser particles.
The retrieved aerosol optical properties were also used to classify the prevailing aerosol type. The ML-ASI aerosol classification results indicated that the proposed retrieval methodology could predict the dominant aerosol types with relatively high precision (>60.0% in 4 out of 6 aerosol types).
The findings of the presented study highlight the feasibility of using an ASI, in synergy with supervised learning, to retrieve aerosol optical properties accurately under clear skies. Those ASI-based aerosol properties revealed the ability to identify the prevailing aerosol type. The applicability of the ML-ASI system to further retrieve aerosol properties under partial cloud conditions is under investigation.

Figure 1. Two all-sky imager (ASI) images with similar solar zenith angle (SZA) and different aerosol loads on (a) 2 July 2021 at 12:29:02 UTC with SZA = 29.8°, aerosol optical depth (AOD) = 0.12, and sun-saturated area (SAT) = 1.8%, and (b) 31 July 2021 at 12:29:01 UTC with SZA = 30.8°, AOD = 0.42, and SAT = 4.1%. The three heatmap images illustrate the 60 selected pixels in (c) red, (d) green, and (e) blue color scale. Both color bars indicate the intensity of the image color. The different color bars are applied to easily distinguish the selected pixels on the graph. Principal plane: (1) the straight line refers to pixels with constant azimuth and varying zenith angles (30 pixels with a 2° step). Almucantar: (2) the curved line refers to pixels with constant zenith and varying azimuth angles (30 pixels with a 2° step).

Figure 3. Density scatter plots of aerosol optical depth (AOD) retrieved from ML-ASI as a function of the AERONET AOD at (a) 440, (b) 500, and (c) 675 nm. Frequency distributions of ΔAOD at (d) 440, (e) 500, and (f) 675 nm. ΔAOD corresponds to the AOD difference between ML-ASI and AERONET.

Figure 4. Boxplots of the differences between ML-ASI and AERONET (ΔAOD) at 440 nm (blue), 500 nm (green), and 675 nm (red) for specific AOD ranges relying on AERONET data. The numbers above each boxplot correspond to the total number of measurements.

Figure 5. Density scatter plots of retrieved (a) Ångström Exponent between 440 and 675 nm (AE440-675 nm) and (b) Fine Mode Fraction at 500 nm (FMF500 nm) from ML-ASI as a function of AERONET values. Frequency distributions of (c) ΔAE440-675 nm and (d) ΔFMF500 nm. ΔAE440-675 nm and ΔFMF500 nm correspond to the AE440-675 nm and FMF500 nm differences between ML-ASI and AERONET.

Figure 6. Boxplots of the differences between ML-ASI and AERONET for (a) Ångström Exponent between 440 and 675 nm (AE440-675 nm) and (b) Fine Mode Fraction at 500 nm (FMF500 nm). The numbers above each boxplot correspond to the total number of measurements.

Figure 8. Pie charts of aerosol type classification based on Raptis et al.'s [72] aerosol classification scheme at the station of the National Observatory of Athens (NOA; 37.97° N, 23.72° E) in Thissio, Greece using (a) the Raptis et al. approach [72] (May 2008-September 2018), (b) AERONET, and (c) ML-ASI retrievals during the study period of this work, which is the testing dataset within the 1 January 2021-18 November 2021 time frame.

Figure 9. The confusion matrix includes the 6 possible aerosol types based on the research of Raptis et al. [72]. The diagonal elements show the aerosol types that are correctly predicted, while the off-diagonal elements indicate false predictions. The 7th column of the confusion matrix represents the Precision (Equation (7)) of ML-ASI aerosol type classification against the AERONET, with the last row referring to the total ML-ASI Accuracy (Equation (6)).
Figure S1: (a) Scatter plot between CAMS and AERONET total column water vapor retrievals; Figure S2: (a) Red, (b) Green, and (c) Blue channel intensity against zenith angle. The zenith angle points are shown in Figure 1 for the principal plane (straight line 1), ranging from the sun's center point (zenith angle = 0°) up to 75°. Different colors refer to three different AOD ranges. Blue, red, and green colors represent AOD range values that are relatively low (<0.1), moderate (0.1-0.3), and high (>0.3). The shaded areas around the lines correspond to ±1 standard deviation bands. The two rectangles refer to the specific zenith angles (25-35° and 51-61°) which are used in Section 4.1; Figure S3: (a) R correlation coefficient and (b) RMSE for the eight different machine learning models using the 60 pixels of Figure 1 and the total column water vapor as model inputs to retrieve AOD440 nm. (c) Execution time for the model training procedure, including the tuning.

Table 1. Mean bias error (MBE), root mean square error (RMSE), and correlation coefficient (R) metrics for aerosol optical depth (AOD) at the three wavelengths, along with their relative values (in parentheses).

Table S1: Machine learning architectures, including the hyperparameters that are tuned during the training procedure.