A New Satellite-Based Retrieval of Low-Cloud Liquid-Water Path Using Machine Learning and Meteosat SEVIRI Data

: Clouds are one of the major uncertainties of the climate system. The study of cloud processes requires information on cloud physical properties, in particular liquid water path (LWP). This parameter is commonly retrieved from satellite data using look-up table approaches. However, existing LWP retrievals come with uncertainties related to assumptions inherent in physical retrievals. Here, we present a new retrieval technique for cloud LWP based on a statistical machine learning model. The approach utilizes spectral information from geostationary satellite channels of Meteosat Spinning-Enhanced Visible and Infrared Imager (SEVIRI), as well as satellite viewing geometry. As ground truth, data from CloudNet stations were used to train the model. We found that LWP predicted by the machine-learning model agrees substantially better with CloudNet observations than a current physics-based product, the Climate Monitoring Satellite Application Facility (CM SAF) CLoud property dAtAset using SEVIRI, edition 2 (CLAAS-2), highlighting the potential of such approaches for future retrieval developments. reported that the precision (i.e., variance) of the bias is signiﬁcantly when using the daily LWP values instead of using the instantaneous LWP values. This study shows a reduction in the bias by approximately 17.5% on average when daily LWP values are used.


Introduction
The Fifth Assessment Report (AR5) of the Intergovernmental Panel on Climate Change (IPCC) stated that clouds remain one of the largest uncertainties of the climate system [1]. The study of cloud processes requires information on cloud physical properties, such as the liquid water path (LWP). LWP is a critical control variable of short-and longwave cloud radiative effects, and therefore Earth's radiative balance [2]. It is defined as the vertical integral of liquid water content above a unit area. As a parameter summarizing the water content of clouds, LWP can play an important role in water cycle research. For this reason, LWP is among the fundamental elements of the water cycle included in the Essential Climate Variables [3]. This study presents the development of a new procedure to retrieve LWP from satellite data. Processing and re-processing satellite data in this way can be used to build a climate data record (CDR) of LWP, to be used for better understanding of the water cycle and the global climate system. LWP has been retrieved from satellite data at both optical and microwave wavelengths [4,5]. Passive microwave satellite data have been used to estimate LWP as passive microwave sensors have the ability to penetrate clouds, allowing to directly measure liquid cloud condensate amount [5]. However, they have much coarser spatial resolutions than optical sensors, and their LWP retrievals have been limited to ocean regions [5,6]. LWP estimation with optical sensor data is usually based on bispectral reflectances, as the reflection of clouds at the non-absorbing visible (VIS) channels changes with cloud optical thickness (COT), and the reflection at water or ice absorbing near-infrared (NIR) channels varies with cloud particle size. To retrieve COT and cloud droplet effective radius (DER), the satellite observations of reflectances at VIS and NIR wavelengths are usually compared with simulated reflectances stored in lookup tables (LUTs). LUTs are generated using a radiative transfer model for combinations of optical thickness, particle size, and surface albedo. Then, cloud LWP is calculated from the retrieved COT and DER [7].
Several studies have evaluated these satellite retrievals with in-situ measurements (cf. [7][8][9]). However, existing LWP retrievals come with uncertainties related to assumptions inherent in physical retrievals such as the plain-parallel clouds, which is problematic for fractional cloudiness [7,10], and other retrieval issues from viewing angle and scale differences [2,10]. Solar zenith angle (SZA) and scene heterogeneity have been commonly reported as error sources in past studies [6,11]. Seethala and Horváth [6] reported better LWP estimation in overcast scenes (cloud fractions of 95-100%). They also reported errors coming from SZA and scene heterogeneity. Greenwald et al. [11] identified clear-sky biases, cloud-rain partition biases, cloud-fraction-dependent biases, and cloud temperature biases as potential error sources. It was observed that bias-corrected LWP shows a poor agreement with observations on mostly cloudy scenes. Kostsov et al. [9] also mentioned the inhomogeneity of cloud fields as an error source for the differences between satellite and ground-based observations. This inhomogeneity involves the interaction between types of the underlying surface and the atmospheric conditions affecting LWP values as well, which further links to a seasonal dependence of accuracy. Besides, when satellite products are compared to ground-based data, errors can be induced by the choice of the validation method. The scale difference and parallax problem were raised by Greuell and Roebeling [2] and Schutgens and Roebeling [10] as the major causes of uncertainty. Greuell and Roebeling [2] presented a set of standardized validation methods such as averaging LWP values for both satellite and ground-based data over a certain time period for clouds moving across observation sites using a Gaussian weighting function and parallax correction. Machine learning approaches have the potential to avoid some of the assumptions necessary for the LUT approach and find better solutions for other retrieval issues, hence resulting in potentially more robust LWP estimates.
In this study, we present a machine learning-based approach for retrieving cloud LWP to reduce retrieval uncertainties. Ground-based supersite (CloudNet) observations are used as ground truth, and spectral information from geostationary satellite channels (SEVIRI), and satellite viewing geometries are combined to develop a statistical LWP retrieval. Machine learning model-derived LWP estimates are compared with a state-of-the art physics-based cloud property dataset (CLAAS-2, described in Section 2.1). Both LWP retrievals are evaluated and compared with the CloudNet in-situ measurements. We focus on all seasons except for winter (i.e., December, January, and February) to avoid the effect of reflectance of ice and snow. The decisions of the machine learning model are analyzed in detail to examine the influence of input variables on the LWP retrieval, and biases of model-derived LWP are discussed.

Meteosat-9 SEVIRI Data
Meteosat-9, launched on 21 December 2005, is one of the Meteosat Second Generation (MSG) geostationary satellites. It is equipped with the Spinning Enhanced Visible Infra-Red Imager (SEVIRI). It observes a disk covering Europe, the North Atlantic, and Africa with a spatial resolution of 1 km for the high resolution visible (HRV) channel and 3 km for two visible (VIS) and nine infrared (IR) channels and with a nominal repeat cycle of 15 min. While it has a spatial resolution of 3 km × 3 km at nadir, this corresponds to roughly 4 km (E-W) × 6 km (N-S) over Europe [12]. The SEVIRI imaging is done by scanning the full Earth disk from east to west along the south-north direction in about 12 min, followed by the data calibration and transfer for 3 min. In this study, we used two VIS channels of 0.6 (VIS0.6) and 0.8 (VIS0.8) µm; six IR channels of 1.6 (IR1.6), 3.9 (IR3.9), 8.7 (IR8.7), 10.8 (IR10.8), 12.0 (IR12), and 13.4 (IR13.4) µm; and solar zenith angle (SZA) and solar azimuth angle (AZA). The input variables are selected based on previous literature [7,13].
This paper is focused on water phase clouds. Water cloud pixels were identified using single-and multispectral threshold methods [14]. Based on considerations by Strabala et al. [15] and Cermak and Bendix [14], the following tests are applied: If IR12 -IR8.7 is greater than 2 K, then pixels are assigned as water phase clouds. IR10.8 are cut off at very low brightness temperatures of 250 K to exclude ice clouds in a straightforward way. Cirrus clouds are detected and filtered out if IR8.7 > IR10. 8. In addition, situations with SZA over 72 • were not used due to expected high uncertainties [7].

CloudNet Data
CloudNet LWP data are used as the target from the sites Leipzig (51.353 • N, 12.434 • E), Lindenberg (52.2105 • N, 14.13 • E), and Juelich (50.90856 • N, 6.4134 • E) in Germany. Other comparable stations in Europe have insufficient sample sizes to be used as training data, or filtering related to liquid-phase clouds resulted in a large reduction in samples when applied to regions in lower latitudes, such as CloudNet site in Potenza, Italy. CloudNet data were averaged using median over 15 min for the matching times, which is described in Section 2.4. LWP is one of the level-2a CloudNet retrieved cloud parameters at 30-s temporal resolution and the vertically integrated liquid water content in clouds. CloudNet is part of the European Aerosol, Clouds and Trace Gases Research Infrastructure (ACTRIS) project (http://devcloudnet.fmi.fi/ and https://actris.nilu.no/). The main instruments at CloudNet sites are Doppler cloud radars, lidar ceilometers, and dualor multiwavelength microwave radiometers [16]. LWP is derived from microwave radiometer (MWR) measurements [7]. MWRs measure brightness temperatures at two frequencies with different atmospheric absorption characteristics (i.e., one sensitive to water vapor and the other sensitive to cloud liquid water). The algorithm for calculating LWP uses the statistical relationship between the observed brightness temperatures and LWP. The relationship is obtained through simulated brightness temperatures by a radiative transfer model. More details on LWP retrieval methods were described by Löhnert and Crewell [17] and Gaussiat et al. [18].
The influence of inhomogeneity of cloud fields on LWP retrieval was investigated. To this end, we preprocessed the CloudNet LWP time series after Greuell and Roebeling [2] as follows. LWP data were separated into two subsets: one with relatively homogeneous and the other with relatively inhomogeneous cloud fields. The separation was done based on bins of equal size in terms of CloudNet LWP. In each bin, samples were divided into homogeneous ones with their standard deviations (STD) less than an upper bound defined as the third quartile plus 1.5× the interquartile range (IQR) and inhomogeneous ones greater than the upper bound. We used the 3rd quartile + 1.5× IQR criteria instead of the median value used by Greuell and Roebeling [2] since the distribution of STD is highly biased toward low STD values (not shown). The STD was obtained from samples that were collected over 2 h. LWP values over 180 g m −2 are discarded to strictly focus on non-precipitating clouds [6].

CLAAS-2 Data
The Climate Monitoring Satellite Application Facility (CM SAF) of the European Organization for the Exploitation of Meteorological Satellites (EUMETSAT) developed a product called 'Cloud Physical Properties (CPP)' as part of the CLAAS-2 (CM SAF CLoud property dAtAset Using SEVIRI, edition 2) [19]. The CPP algorithm includes COT, DER, and LWP on the basis of VIS (0.6 µm) and NIR (1.6 µm) reflectances from the Spinning-Enhanced Visible and Infrared Imager (SEVIRI) onboard Meteosat satellites [7]. COT and REF are retrieved based on LUTs of top-of-atmosphere reflectances simulated by a radiative transfer model. Then, LWP is calculated with an equation adapted from Stephens [20]. More detailed information on the retrieval can be found in the work by Benas et al. [21]. We used the CLAAS-2 LWP product to compare with the results of this study. The validation of CLAAS-2 LWP product was comprehensively conducted by Benas et al. [21] with other satellite-based LWP products. Roebeling et al. [7] conducted the validation of SEVIRI-based LWP retrievals using the CPP algorithm with two CloudNet sites in Europe, showing a high agreement of the retrieved LWP with CloudNet data during the summer months. The CLAAS-2 cloud mask (CMA) product was used here to mask out clear-sky situations. The CMA product is produced based on a multispectral threshold technique aiming at delineating all cloud-free pixels in satellite images, developed by the EUMETSAT Satellite Application Facility on Nowcasting (NWC SAF). Cloud filled and cloud contaminated pixels identified by the CMA product were used in this study.

Paring of SEVIRI, CLAAS-2 and CloudNet Data
To obtain the best matching possible between SEVIRI and CloudNet data, SEVIRI scan times were calculated for the CloudNet sites. SEVIRI scans the Earth in the east-west direction while rotating progressively to perform the south-north scan. A full disk SEVIRI acquisition is composed of about 1250 scan lines [22]. At each satellite revolution, three image lines are acquired, so 1250 scan lines provide 3750 image lines. The nominal Level 1.5 full disk SEVIRI images (except for HRV) have 3712 lines × 3712 columns (N-S × E-W). The line step in the south-north scan is 9 km at the sub-satellite point, where the satellite and the Earth's center intersect the Earth's surface, and spreads towards the poles, so the sampling distance is defined to be exactly 3 km × 3 km at the sub-satellite point [12,23]. Since the east-west scan is very fast (i.e., 30 ms), only south-north scan duration, that is 0.6 s, is important for the scan time calculation. The calculation processes are briefly explained as follows: (1) The acquisition starts at 81 • S. The number of degrees to the latitude of interest is calculated from the start latitude (e.g., 20 • if 61 • S is required). (2) The assumption about the spreading distance from the equator to the pole between scan lines is made. Since one scan line contains three image lines, for example, the line step in the regions with the ground resolution of 4 km is 4 km × 3 lines = 12 km. In the mid-latitudes the average resolution of 6 km × 6 km is assumed. Therefore, the line steps in kilometers are: 9 km for 0-10 • , 12 km for 10-30 • , 15 km for 30-40 • , and 18 km for 40-81 • . (3) The line steps are used to calculate the number of scan lines which are accomplished until SEVIRI has reached the latitude of interest. (4) The number of scan lines × 0.6 s is the time at the latitude of interest. A full disk SEVIRI scan time image was finally obtained. As a result, it was identified that SEVIRI images with a time interval of 15 min (i.e., 00, 15, 30 and 45 min) are approximately matched with CloudNet times at 11, 26, 41 and 56 min in UTC. We paired SEVIRI, CLAAS-2, and CloudNet data by matching their UTC times. The matching time stamp of CloudNet are the time stamp obtained from the SEVIRI scan time calculation as explained above. The time stamp of CLAAS-2 is the same as SEVIRI as it is based on SEVIRI images, so CLAAS-2 data were easily matched with SEVIRI using the same UTC time. The matching with CloudNet data was also done in the same way as CloudNet-SEVIRI matching.
A parallax correction was applied based on the method suggested by Greuell and Roebeling [2]. The locations of the observation sites were corrected by a distance of where H is the averaged cloud-top height below 3 km from CloudNet cloud top altitude data and θ s is the satellite zenith angle. The sites were found to have mean parallax displacements of approximately 3.5 km, so that the pixel directly north of the SEVIRI pixel that encloses the CloudNet site was used.

Gradient Boosting Regression Trees
Gradient Boosting Regression Trees (GBRTs) are used as a machine learning algorithm to develop a model that estimates LWP. GBRTs are an effective technique to extract nonlinear relationships between a set of predictors and a target [24]. GBRTs are a generalization of boosting algorithm using arbitrary differentiable loss functions [25,26]. As a stage-wise additive model, weak learners (i.e., decision trees) are added one at a time to the model, while existing weak learners in the model are left unchanged. The trees are fitted to minimize a specified loss function through a gradient descent procedure by reducing the residuals of previous learners. The GBRTs were implemented in Python 3 scikit-learn [26].
The GBRTs model is trained based on SEVIRI spectral observations and satellite viewing geometry (SZA and AZA) as predictors and CloudNet LWP data as target. Two different sets of input features are used, as shown in Table 1: (a) all SEVIRI channels described in Section 2.1 and the satellite viewing geometry (hereafter, without feature selection); and (b) only those SEVIRI bands which are commonly used in LWP retrievals [7,27] and the satellite viewing geometry (hereafter, with feature selection). This is done to test if current physics-based retrieval techniques can be improved by adding further information from other bands not exploited yet. A Box-Cox power transformation is applied to the target variable (i.e., LWP) to ensure that lower values are not overrepresented in the data. The transformation is a way of transforming a non-normal dependent variable into a normal distribution [28]. The total numbers of samples were obtained as 450, 1207, and 412 for Leipzig, Lindenberg, and Juelich, respectively, for homogeneous case, and 528, 1415, and 482 for inhomogeneous case, as shown in Table A1. Date were collected from 11 August 2011 to 30 November 2015 for Leipzig, from 1 April 2011 to 29 November 2015 for Lindenberg, and from 8 April 2011 to 24 November 2015 for Juelich. All collected data are randomly separated into 67% of the data as training data to build the model and the remaining 33% to independently test the model accuracy.
Hyper-parameter tuning is applied on the training data in order to find the optimal model set up for the learning algorithm. The following parameters are tuned: the minimum number of samples required to split an internal node, the minimum number of samples required to be at a leaf node, maximum depth (number of layers of the decision trees), learning rate, and the number of estimators (i.e., the number of boosting stages), as shown in Table 2. Randomized five-fold cross-validated grid search is used to find the best combinations of hyper-parameters over a specific parameter settings based on R 2 score. During cross-validation, one fold is held out in a loop for validating the performance of the model that is trained based on the four remaining folds. The least squares loss function is used in this study. Early stopping is applied to terminate training when the validation score is no longer improving, which is to prevent overfitting.
To understand how input features affect the model's prediction, the following model-agnostic methods are examined. Permutation feature importance is used to evaluate the importance of each feature for the model's prediction. It is obtained by measuring the decrease in the model's score when a single feature value is permuted, i.e., replaced with randomly shuffled feature values [29]. The reduction in the model score indicates how much the feature contributes to the model. First, a baseline score matrix (R 2 used in this study) is evaluated on the training data, secondly the column of a feature is permuted and the matrix is evaluated again, and finally the permutation importance is determined as the difference between the baseline matrix and matrix obtained after the permutation of the feature column. Partial dependence (PD) plots are employed to analyze the interaction between individual features and the target [25]. They show how the predictions partially depend on values of the input variables of interest. For a more detailed interpretation of features, SHapley Additive exPlanations (SHAP) interaction plots are explored to examine the interaction effects involving two features on model's predictions [30].

Statistics for Model Performance
In the following, the performance of the new technique is evaluated side-by-side with the physics-based (CLAAS-2) product, which is used as the well-tested reference [2,27]. In Tables A1 and A2 (Appendix A), the predictive skill of the GBRT model along with CLAAS-2 is summarized for the three sites. Figure 1 shows an example of the comparison of LWP predicted by the GBRT model with CloudNet LWP in four different situations including homogeneous and inhomogeneous clouds, and both with and without feature selection for test data. The fitted line of the scatter plot for the homogeneous situation has a slope of 0.47 and an intercept of 38.5, as the higher values tend to be underestimated. These values differ from the CLAAS-2 LWP scatter plot, and from those reported by Greuell and Roebeling [2] (slope 1.12, intercept 0.8). A deviation was to be expected given the differences in locations, time and approach. It should be noted that Greuell and Roebeling [2] used LWP data less than 400 g m −2 and split LWP data into homogeneous and inhomogeneous cloud fields with the median value of samples in each bin for the separation. R 2 is found to be 47.3% for the homogeneous cloud situation for the GBRT approach, and 26% for CLAAS-2. It was observed that inhomogeneous situations feature a lower R 2 value and a higher bias than homogeneous ones (Figure 1), as well as for the other sites (Table A2). This is expected, given that the inhomogeneity of cloud fields has been reported as one reason for the discrepancy between ground-based and satellite-based LWP [2,9,27], but there seems to be no meaningful difference between homogeneous and inhomogeneous cases.
The bias is a measure of how far the predicted values are from the true values, and used as an indicator of accuracy in this study. It is calculated as the difference between the medians of model-predicted LWP and CloudNet LWP [7]. Percent bias (PB) is calculated by dividing the bias by the median of CloudNet LWP and multiplying it by 100, as in [2,7]. PB represents the average tendency where the predicted values are greater or less than the true ones, with the optimal value of 0.
Positive PB values mean model overestimation bias, while negative values model underestimation bias. The PB is found to be around 9%, which is almost the same as that reported by Roebeling et al. [7] (9%) and slightly better than those reported by Greuell and Roebeling [2] (15%) and shown in the CLAAS-2 LWP's scatter plot (14%). In general, both homogeneous and inhomogeneous situations without feature selection show higher R 2 values than those with feature selection for all sites. It may be because the other IR channels that are not used in those with feature selection could give valuable information to obtain a more accurate estimation of LWP, although they contribute much less to the prediction than the other channels, as discussed in Section 3.3. However, the difference is not obvious in the PB values (cf. Tables A1 and A2).  Figure 2 shows the medians and interquartile range of bias distribution with respect to LWP predicted from the GBRT model and CLAAS-2 for both homogeneous and inhomogeneous cloud fields without feature selection. The bias distributions of both the GBRT model and CLAAS-2 LWP show a tendency of increasing bias with increasing LWP values in general for both homogeneous and inhomogeneous cloud fields without feature selection, which is also observed for those with feature selection (not shown), especially when LWP > 100 g m −2 , which was also found by Greuell and Roebeling [2]. Deterioration of performance with increasing LWP values seems to be attributed to the limited samples for large values of LWP, as the number of observations (in red letters) drops substantially.

Bias Analysis for LWP Retrieval
The bias was calculated for bins with a width of 20 g m −2 from instantaneous LWP values as a function of CloudNet LWP for test data for homogeneous cloud fields without feature selection at each site as an example (Figure 3). The biases of LWP predicted by the GBRT model increase with increasing CloudNet LWP values, which is also observed for the other cloud fields with and without feature selection. The reduction in accuracy with increasing LWP values has been observed in past studies [2,7]. Roebeling et al. [7] reported an underestimation of about 30 g m −2 of SEVIRI-retrieved LWP at a CloudNet LWP of 100 g m −2 . This study is fairly consistent with this, with less than 25 g m −2 underestimation of predicted LWP at 100 g m −2 of CloudNet LWP. The limited number of samples at higher LWP values is likely to contribute to an increased bias especially in the GBRT model. Meanwhile, bias variability (represented by the error bars in the bias plots) is much larger for CLAAS-2 than in the GBRT approach. Figure 4 presents the bias calculated in the same way as for Figure 3 but from daily medians of LWP values. The ground-based instrument observes much smaller areas around nadir than satellite pixel size. Using daily medians may mitigate the effect of spatial mismatch caused by the scale differences between ground-based and satellite observations [7]. In general, using longer LWP-averaging times reduces the bias in this study, which agrees with the findings of Roebeling et al. [7]. They reported that the precision (i.e., variance) of the bias is significantly improved when using the daily LWP values instead of using the instantaneous LWP values. This study shows a reduction in the bias by approximately 17.5% on average when daily LWP values are used. Figure 5 shows the bias as a function of SZA and AZA for test data for homogeneous situations without feature selection in Leipzig as an example. At the range of SZA > 63 • , the bias shows a slight decrease, but not much. Figure 6 demonstrates the bias as a function of hours and months. Bias increases with increasing SZA, during late afternoon and in October and November, when SZA is mostly over 60 • or higher, which is consistent with the results of Roebeling et al. [7], who found that LWP from SEVIRI is overestimated at those times relative to CloudNet. It has been reported that at oblique viewing angles cloud properties can be overestimated for both broken and overcast clouds, as reflectances increase much in the forward-scattering direction [7,31]. It should be noted that there is no obvious difference between homogeneous and inhomogeneous cases with and without feature selection for angle and time-dependent errors, which are not shown here.

Relationship between LWP and Input Variables
The model input variables and their influence on the model's predictions are investigated for a better understanding of how the model works. Figures 7 and 8 show the feature importances and the partial dependencies of the GBRT models for the three study sites for without feature selection (Feature Set 1) and with feature selection (Feature Set 2), respectively, for homogeneous cloud fields to get a more reliable analysis. The y-axis of the PD plot shows the relative contribution of the input feature on the prediction. Partial dependence is the average response of an estimator for each possible grid value of the feature (x-axis) based on a weighted tree traversal [25]. It can be interpreted as a relationship between two variables (i.e., input feature and target). For example, positive (negative) values indicate that a specific grid value is corresponding to an increase (decrease) of predicted LWP. The important features identified in the feature importance result on the upper (Figure 7) and the left (Figure 8) can be related to high/obvious partial dependence in the PD plots.
At all sites, VIS channels appear to be the most important feature for the model's prediction in Figures 7 and 8. Other IR channels are much less significant to the prediction, which is also visible in the PD plot in Figure 7 with nearly no variations of partial dependence. It was as expected because VIS0.6 related to cloud optical thickness information, IR1.6 and IR3.9 channels, which are changed by the droplet effective radius, are commonly used as input to LWP retrievals [6,7,27]. As shown in the PD plot, the effect of the VIS0.6 on the LWP estimation variable at all sites is clearly seen. The reflectance increases with LWP that might correspond to the increase of the cloud thickness (cf. [32]). Clouds with high LWP tend to have larger droplets that absorb more NIR radiation and thereby lead to a smaller reflectance at the NIR [32]. This is well represented in the PD plots of the IR1.6 and IR3.9 variables. For smaller reflectance in the IR016 and IR039 plots, the model predicts higher LWP values. Compared to the other two sites, Leipzig shows relatively pronounced changes in the partial dependence of the IR 1.6 µm channel in Figure 8. Chang and Li [32] found that, as the reflectance saturates more quickly with greater DER, the dependence of the reflectance on the DER becomes larger as the DER decreases. According to CLAAS-2 DER data, Leipzig is found to have relatively smaller DER values compared to other sites in the selected samples ( Figure A1). Furthermore, the longer is the wavelength (1.6 → 3.7), the faster does the signal saturate [32], which may explain why Leipzig shows a stronger partial dependence of LWP on changes in IR1.6 compared to the IR3.9. Figure 7. Feature importance (top) and partial dependence plot (bottom) for training data for homogeneous cloud fields without feature selection. The y-axis of the partial dependence plot shows the changes of LWP (in units of LWP) relative to the mean prediction (centered around zero) for each grid value on the x-axis. Figure 9 shows that CloudNet LWP changes depending on satellite viewing geometry in the left panel and the SHAP interaction values in the right panel. SHAP values are useful for interpreting situations in which each input feature has a different contribution to the prediction, but works with each other to obtain the prediction. Each input feature either increases or decreases the prediction. SHAP interaction values allow visualizing a combined contribution as a feature's changes with another feature changing. Higher (lower) SHAP interaction values represent higher (lower) LWP predictions with different contributions from two input features. In the PD plots, it was found that SZA has a stronger effect on retrieved LWP particularly at high values for Leipzig and Lindenberg, while AZA shows a less pronounced partial dependence for all stations. The left panel in Figure 9 shows that LWP varies with both SZA and AZA, although this variation only seems to be systematic in the case of AZA. This suggests that the diurnal cycle of LWP cannot explain the relatively strong influence of SZA on LWP predicted by the model. In this case, the angle information can still play a supplementary role in retrieving LWP by interacting with VIS0.6. In the SHAP interaction plot, the interaction of VIS0.6 and SZA has negative contributions on the prediction in general, which means that the combined contribution reduces the predicted LWP values. Meanwhile, for the range of high SZA, high reflectance in VIS0.6 increases the predicted LWP, while low reflectance of VIS0.6 decreases the predicted LWP. It has been reported that retrieved LWP is strongly dependent on solar geometry [7,33]. This has been largely attributed to the COT retrievals, which directly affect LWP retrievals. COT tends to increase with increasing SZA and to be overestimated at SZA > 60. Note that, although SZA has the strongest interaction with VIS0.6, the interaction might be affected by other features, as SZA and LWP are correlated with other NIR channels to some extent.

Summary and Conclusions
This study presents a machine-learning based technique to retrieve cloud LWP using SEVIRI data. We analyzed the potential of a machine learning approach for designing reliable cloud-property retrievals. A statistical LWP retrieval was developed using spectral information from MSG-2 SEVIRI and satellite viewing geometry. CloudNet ground-based observation data were used to train the model, and CLAAS-2 data as a high-quality reference.
LWP predicted by the GBRT model was validated with independent CloudNet LWP observations in four different situations including homogeneous and inhomogeneous clouds, and both with and without prior selection of input features. During this validation, the skill (R 2 ) of the GBRT approach was found to be higher than that of the physics-based retrieval in all situations. Inhomogeneous situations feature lower R 2 values and higher biases in general, but the difference was not as pronounced as expected. The bias distributions of both the GBRT model and CLAAS-2 LWP show a tendency of increasing bias as LWP values increase in general for both homogeneous and inhomogeneous cloud fields, especially for high LWP values. Deterioration of performance with increasing LWP values seems to be attributed to the limited samples for large values of LWP. It was found that both homogeneous and inhomogeneous situations without feature selection show higher R 2 values than those with feature selection in general for all sites, but there was no meaningful difference between with and without feature selection in terms of bias.
VIS0.6 has been shown to be the most important feature for the model's prediction. The corresponding partial dependence plot also presents the clear dependence of LWP estimation on VIS0.6 at all sites. Meanwhile, LWP values increase with increasing droplet size as more radiation is absorbed by the droplets, leading to smaller reflectance in the NIR. This is well observed in the partial dependence plots. For lower reflectances in IR1.6 and IR3.9, the model predicts higher LWP values. The magnitude of this dependence is site-specific. Leipzig has stronger changes in the partial dependence of the IR 1.6 µm channel than the other two sites. As the reflectance saturates more quickly with larger DER, the dependence of the reflectance on the DER can be larger as the DER decreases. This is indirectly identified by CLAAS-2 DER with Leipzig having relatively smaller DER values than the other sites. SZA and AZA variables have a relatively low importance in the feature importance results, but it is shown that SZA has a strong interaction with VIS0.6 in the SHAP interaction plots. Overall, the interaction of VIS0.6 and SZA has a negative effect on the prediction, but a positive effect of the interaction on LWP prediction is observed for high SZA values as well.
The biases of instantaneous LWP values predicted by GBRT model increase with increasing CloudNet LWP values, which is due to limited samples at larger LWP. Error bars of the bias are much wider in CLAAS-2 compared to GBRT model. The bias from daily median LWP values shows that using longer LWP averaging times reduces the bias, which is consistent with past studies, as spatial mismatch caused by spatial scale difference between ground-based and satellite observations could be mitigated by using daily values instead of using instantaneous values. The bias tends to become negative (i.e., predicted LWP larger than CloudNet LWP) with increasing SZA. It was also found that time-dependent errors increase in late afternoon.
The results suggests that the LWP retrieved by the machine-learning model is useful, as it is in good agreement with ground-based reference measurements. In the cases and locations considered here its overall performance relative to CloudNet was better than the state-of-the-art physics-based CLAAS-2 product. Nevertheless, transferability to sites outside Germany and potentially different conditions would have to be tested. In addition, while CLAAS-2 is aimed at representing all clouds, a particular focus and filter was implemented in the approach presented here. Overall, the results highlight the potential of cloud-property retrievals based on machine learning.

Conflicts of Interest:
The authors declare they have no conflict of interest.
Appendix A Table A1. Summary of the relation between LWP GBRT and LWP gr for test data for each site in different situations. n is the total number of samples.

Situations
Sites Linear Relation of LWP GBRT