Prediction of Leaf Wetness Duration Using Geostationary Satellite Observations and Machine Learning Algorithms

Abstract: Leaf wetness duration (LWD) and plant diseases are strongly associated with each other. Therefore, LWD is a critical ecological variable for plant disease risk assessment. However, LWD is rarely used in the analysis of plant disease epidemiology and risk assessment because it is a non-standard meteorological variable. Satellite observations may facilitate the prediction of LWD because they can represent important related parameters and are particularly useful for meteorologically ungauged locations. In this study, the applicability of geostationary satellite observations for LWD prediction was investigated. GEO-KOMPSAT-2A satellite observations were used as inputs, and six machine learning (ML) algorithms were employed to produce hourly leaf wetness (LW) predictions. The performances of these models were compared with that of a physical model through systematic evaluation. Results indicated that LWD could be predicted using satellite observations and ML. A random forest model exhibited higher accuracy (0.82) than the physical model (0.79) in leaf wetness prediction. The performance of the proposed approach was comparable to that of the physical model in predicting LWD. Overall, the artificial intelligence (AI) models exhibited good performances in predicting LWD in South Korea.


Introduction
Leaf wetness (LW) refers to the presence of free water on the surface of a leaf [1]. Many studies have reported that LW and plant diseases resulting from bacteria and fungi are strongly correlated under temperatures favorable for infection [2,3]. Hence, the duration of LW, termed LWD, is a critical meteorological variable for the risk assessment and control of plant diseases [4]. Despite its importance in plant epidemiology, LWD is rarely used in risk assessment and disease control due to the unavailability of data [5]. No standard protocol exists for the measurement of LW, and LWD is considered a non-standard meteorological variable [6][7][8]. Moreover, the absence of a standard measurement protocol leads to inconsistency among the observations from different observational networks. To overcome these limitations, LWD prediction models using standard meteorological variables such as air temperature ($T_{air}$), wind speed (WS), relative humidity (RH), and shortwave radiation have been suggested as alternatives to in-situ measurement [9][10][11][12]. These models utilize physical mechanisms and empirical relationships to predict LW from meteorological conditions. Among the physical models, the Penman-Monteith (PM) model, which uses the energy balance approach to predict LW, is widely used [13]. Recently, machine learning (ML) algorithms such as
Table 1. Characteristics of the GK-2A Advanced Meteorological Imager (AMI) channels [33][34][35].

In-Situ Meteorological Variables
The hourly observations of meteorological variables, namely LW, $T_{air}$ (°C), RH (%), shortwave radiation (W/m²), and WS (m/s), were collected from sites operated by the Rural Development Administration (RDA). The instruments for $T_{air}$ and RH were installed 1.5 m above ground, and those for shortwave radiation and WS were installed 2.0 m above ground. LW was measured at each site using two adjacent flat plate sensors (Model 237, Campbell Scientific) deployed at 2.0 m and facing north at an angle of 45° to the horizontal [7]. A pair of LW sensors was used to confirm wet and dry conditions. If the two sensors disagreed, that is, one recorded wetness and the other dryness, the measurement from the sensor recording wetness was used. The RDA operates 211 in-situ weather stations. Since LW is a non-standard meteorological variable, a quality control (QC) inspection of the LW observations was necessary. The LW observations were compared with those estimated using the number of hours of relative humidity (NHRH) method, which is the simplest method to estimate LW and determines wetness based on the RH observation [36]. Weather stations were selected if the Pearson correlation coefficient between the observed LW and the LW predicted by the NHRH method was more than 0.6 and if less than 20% of their data were missing. Twenty-one stations were selected after QC of the meteorological variables, particularly LW. The locations of the selected weather stations are presented in Figure 1. The meteorological observations can be downloaded from http://weather.rda.go.kr/. For the purposes of this study, LWD is defined as the sum of hourly LW from 0 to 24 h within a day.
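The station screening and the LWD definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, missing values are assumed to be encoded as `None`, and the 90% RH threshold used for the NHRH rule is an assumption, since the text does not state the threshold that was applied.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # undefined for a constant series; treated here as failing QC
    return sxy / math.sqrt(sxx * syy)

def nhrh_wetness(rh_percent, threshold=90.0):
    """NHRH rule: an hour is wet (1) when RH >= threshold (threshold is an assumption)."""
    return [1 if rh >= threshold else 0 for rh in rh_percent]

def passes_qc(lw_obs, rh_percent, min_cor=0.6, max_missing=0.2):
    """Keep a station when < 20% of hours are missing and Cor(LW, NHRH LW) > 0.6."""
    pairs = [(lw, rh) for lw, rh in zip(lw_obs, rh_percent)
             if lw is not None and rh is not None]
    missing_fraction = 1.0 - len(pairs) / len(lw_obs)
    if missing_fraction >= max_missing or not pairs:
        return False
    observed = [lw for lw, _ in pairs]
    estimated = nhrh_wetness([rh for _, rh in pairs])
    return pearson(observed, estimated) > min_cor

def daily_lwd(hourly_lw):
    """LWD in hours: the sum of the hourly wetness flags (0 or 1) of one day."""
    return sum(hourly_lw)

# One synthetic day: wet before dawn and late in the evening.
day_lw = [1] * 6 + [0] * 14 + [1] * 4
day_rh = [95.0] * 6 + [60.0] * 14 + [92.0] * 4
print(daily_lwd(day_lw), passes_qc(day_lw, day_rh))
```

In practice the correlation would be computed over the full observation record of a station rather than a single day.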
Remote Sens. 2020, 12, 3076

Penman-Monteith Model
The PM model was used as a reference against which the models using satellite observations were compared. The PM model is a physical model that estimates LW based on the latent heat flux (LE), which determines the evaporation or condensation of moisture on a surface such as a leaf [9]. The PM model can be applied in any location where the required meteorological observations are available. Sentelhas, et al. [13] reported that the PM model has low spatial variation in various meteorological conditions with good performance and high applicability. Additionally, unlike other physical models that require the estimation of the dew amount and duration, the PM model does not need air temperature observations at the leaf level to estimate LW. The equation of LE is given as follows:
$$LE = -\frac{s R_n + 1200\,(e_s - e_a)/(r_a + r_b)}{s + \gamma} \tag{1}$$
where $s$, $R_n$, $e_s$, $e_a$, $\gamma$, $r_a$, and $r_b$ are the slope of the saturation vapor pressure curve (hPa/°C), net radiation of the mock leaf (J/min/cm²), saturated vapor pressure at the weather station air temperature (hPa), actual air vapor pressure (hPa), modified psychrometer constant (0.64 kPa/K is adopted in the current study), boundary layer resistance for heat transfer, and boundary layer resistance for WS, respectively. The boundary layer resistance for WS, $r_b$, can be calculated as follows:
$$r_b = 307\sqrt{\frac{D}{WS}} \tag{2}$$
where $D$ is the effective dimension of the mock leaf (0.07 m in this study). The boundary layer resistance for heat transfer, $r_a$, can be calculated as follows:
$$r_a = \frac{\ln\left[(Z_s - d)/Z_0\right]}{0.4\,WS^{*}} \tag{3}$$
where $Z_s$, $d$, $Z_0$, and $WS^{*}$ are the height of the wetness sensor (m), displacement height ($0.5\,Z_c$), roughness length ($0.13\,Z_c$), and friction velocity (m/s), respectively. Crop height is represented by $Z_c$ (m). For LE estimates at a weather station where the wetness sensor was at the same level as the $T_{air}$, RH, and WS sensors, the boundary layer resistance for WS was not required in Equation (1).
Here, Equation (1) was used to predict LE since the levels of the $T_{air}$, RH, and WS sensors are different. When LE is greater than zero, LW is predicted for that time step. Since net radiation was not observed at the employed sites, it was estimated using the method suggested by Walter, et al. [37]. Since the PM model only considers the condensation of moisture in the air near the leaf, soil evaporation and transpiration cannot be considered in estimating LW.
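The wetness rule of the PM model can be illustrated with a short Python sketch. It assumes the form of the LE equation commonly attributed to Sentelhas, et al. [13], that is, LE = -(s·Rn + 1200·(es - ea)/(ra + rb))/(s + γ), with r_b = 307·sqrt(D/WS) and r_a = ln[(Zs - d)/Z0]/(0.4·WS*). The function names, the input values, and the crop height are hypothetical, and keeping the units of the inputs consistent is left to the caller.

```python
import math

def resistance_rb(leaf_dimension_m, wind_speed):
    """r_b = 307 * sqrt(D / WS), with D in m and WS in m/s."""
    return 307.0 * math.sqrt(leaf_dimension_m / wind_speed)

def resistance_ra(sensor_height_m, crop_height_m, friction_velocity):
    """r_a = ln((Z_s - d) / Z_0) / (0.4 * WS*), with d = 0.5 Z_c and Z_0 = 0.13 Z_c."""
    d = 0.5 * crop_height_m
    z0 = 0.13 * crop_height_m
    return math.log((sensor_height_m - d) / z0) / (0.4 * friction_velocity)

def latent_heat_flux(s, r_n, e_s, e_a, r_a, r_b, gamma=0.64):
    """LE under the assumed sign convention: positive values mean condensation."""
    return -(s * r_n + 1200.0 * (e_s - e_a) / (r_a + r_b)) / (s + gamma)

def is_wet(le):
    """The leaf is predicted to be wet when LE is greater than zero."""
    return le > 0.0

# Hypothetical inputs: a calm, clear night versus a sunny afternoon.
r_b = resistance_rb(0.07, 1.0)
r_a = resistance_ra(2.0, 0.1, 0.3)  # 0.1 m crop height is an illustrative value
night = latent_heat_flux(s=1.0, r_n=-50.0, e_s=12.0, e_a=12.0, r_a=r_a, r_b=r_b)
noon = latent_heat_flux(s=1.5, r_n=400.0, e_s=30.0, e_a=15.0, r_a=r_a, r_b=r_b)
print(is_wet(night), is_wet(noon))
```

With negative net radiation and saturated air, the numerator is negative and LE becomes positive, so the night hour is flagged as wet; the afternoon hour is flagged as dry.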

Machine Learning Algorithms
LW prediction is a binary classification problem with two labels, wetness and dryness. Many ML algorithms have been suggested as binary classifiers [38]. We employed logistic regression (LR), extreme learning machine (ELM), generalized boosted model (GBM), random forest (RF), support vector machine (SVM), and deep neural network (DNN) as candidate classifiers for the LW prediction model. Some ML algorithms such as ELM, RF, and DNN have provided good performances in predicting LWD from in-situ meteorological observations [17]. Sixteen channel radiances along with the time of measurement, latitude, and longitude were used as inputs for the ML algorithms. As LW often occurs from dusk to dawn, time is a good input feature. Latitude and longitude were used because they represent the geographical characteristics of a target location. The data sets were divided into training, validation, and test data sets for building the ML-based LW prediction model. The observation data sets (8409 data points) of stations #6 and #11 (blue colored circles in Figure 1) were selected as test data sets, while those of the other stations (red colored circles in Figure 1) were used as raw training data sets. The raw training data sets were split into training and validation data sets. The validation data set comprises 20,000 (approximately 25% of the total) randomly selected data points out of 79,749 data points, and the remaining data points were used as the training data set. The training data set was used to train the ML algorithms, and the validation data set was used to optimize their hyperparameters based on the algorithms trained with the training data set. In the training of the ML algorithms, an oversampling strategy was employed to overcome the class imbalance problem, that is, the number of observations showing dryness being much larger than that showing wetness.
After fixing the hyperparameters of the ML algorithms, the raw training data set was used to train the ML algorithms for building the LW prediction models. Brief descriptions of the employed ML algorithms are presented in the following subsections.
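The oversampling step can be sketched as follows. The text does not state which oversampling variant was used, so this example assumes the simplest one, random duplication of minority-class samples until both classes are equally frequent; the function name and the toy data are hypothetical.

```python
import random

def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for y, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)
    # Shuffle so that duplicated rows are not grouped at the end of each class.
    order = list(range(len(out_x)))
    rng.shuffle(order)
    return [out_x[i] for i in order], [out_y[i] for i in order]

# Toy training set with roughly the 80/20 dry/wet imbalance described in the text.
X_train = [[float(i)] for i in range(100)]
y_train = [0] * 80 + [1] * 20  # 0 = dryness, 1 = wetness
X_bal, y_bal = random_oversample(X_train, y_train)
print(len(y_bal), sum(y_bal))
```

Oversampling is applied to the training data only; the validation and test sets keep their original class proportions so that the evaluation reflects real conditions.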

Logistic Regression Model
LR has been broadly used as a classification model [39]. This model provides the probability that a data point belongs to a class using the logistic function. The LR model is expressed as follows:
$$p(x) = \frac{1}{1 + \exp\left(-(W^{T}X + b)\right)}$$
where $p(x)$, $X$, $W$, and $b$ are the probability of a data point belonging to a class, input features, weights, and bias (also called the intercept), respectively. For regularization, ridge regression, a technique that penalizes the magnitude of the weights in the regression model using the L2 norm, was adopted in the LR. The weights of the LR using ridge regression can be obtained as follows:
$$\min_{W,b}\; \frac{1}{2}W^{T}W + C\sum_{i=1}^{n}\log\left(1 + \exp\left(-y_i\,(W^{T}X_i + b)\right)\right)$$
where $y_i$, $C$, and $n$ are the label of the ith data point (coded as −1 or +1), tuning parameter, and the number of data points, respectively. The hyperparameter in the LR is the tuning parameter. A small value of the tuning parameter indicates strong regularization in the LR. To optimize the tuning parameter, cross validation was performed using the validation data set. Based on the results obtained, the LR with C = 0.1 led to the highest accuracy (ACC, the number of correct predictions over the number of data points). Hence, the value of the tuning parameter in the employed LR was 0.1.
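The role of the tuning parameter can be made concrete with a small Python sketch. It assumes the widely used formulation in which C multiplies the data-fit term, so that the L2 penalty dominates when C is small, and it assumes labels coded as −1 and +1. The function names and numbers are illustrative only.

```python
import math

def predict_proba(x, w, b):
    """Logistic function applied to the linear score w.x + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def ridge_logistic_objective(xs, ys, w, b, c):
    """0.5 * ||w||^2 + C * sum(log(1 + exp(-y_i * (w.x_i + b)))), y_i in {-1, +1}."""
    penalty = 0.5 * sum(wi * wi for wi in w)
    loss = 0.0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        loss += math.log1p(math.exp(-y * z))
    return penalty, c * loss

xs = [[2.0, 1.0], [-1.0, -2.0]]
ys = [1, -1]
w, b = [1.0, 1.0], 0.0

for c in (0.1, 10.0):
    penalty, weighted_loss = ridge_logistic_objective(xs, ys, w, b, c)
    share = penalty / (penalty + weighted_loss)
    print(f"C={c}: penalty share of the objective = {share:.3f}")
```

For the same weights, the penalty makes up a much larger share of the objective at C = 0.1 than at C = 10, which is why a small C corresponds to strong regularization.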

Extreme Learning Machine
ELM is a single-hidden-layer feed-forward network with randomly generated weights and biases between the input and hidden layers [40]. Traditional neural networks need to optimize their weights iteratively. In contrast, the ELM can be trained in a single step because the input weights and biases are randomized. The ELM can be formulated by the following equation:
$$Y = H\beta$$
where $Y$, $\beta$, and $H$ are the labels, the weight matrix between the hidden layer and the output layer, and the output of the hidden layer, respectively. $H$ is called the nonlinear feature mapping and is given by:
$$H = f_a(XW + B)$$
where $f_a(\cdot)$, $W$, $B$, and $X$ are the activation function, the weight matrix between the input layer and the hidden layer, bias, and input features, respectively. In the current study, the sigmoid function ($f_a(x) = \frac{1}{1+\exp(-x)}$) was used as the activation function in the ELM. Since the weights ($W$) and bias ($B$) are randomly generated and the activation function ($f_a(\cdot)$) is known, $H$ is fully determined by the data set. Thus, only $\beta$ needs to be estimated in the ELM.
In the ELM, as in an artificial neural network, finding an appropriate set of weights that avoids overfitting is important. Tuning the output weights in the ELM can be considered as fitting a linear regression model using ordinary least squares. Ridge regression is employed to attenuate multicollinearity in the data set by adding a norm of the parameters to the parameter estimation of the regression model [41]. The ELM model also adopted this strategy in weight tuning. The ELM attempts to achieve a better generalization performance by reaching not only the smallest training error but also the smallest norm of the output weights. This minimization problem can take the form of ridge regression or regularized least squares as [42]:
$$\min_{\beta}\; \frac{1}{2}\lVert\beta\rVert^{2} + \frac{C}{2}\lVert Y - H\beta\rVert^{2}$$
where the first term of the objective function is an $l_2$-norm regularization term that controls the complexity of the model. The second term is the training error associated with the learned model. $C > 0$ is a tuning parameter. Setting the gradient of the objective to zero can be solved analytically, and the closed-form solution can be written as follows:
$$\hat{\beta} = \left(\frac{I}{C} + H^{T}H\right)^{-1}H^{T}Y$$
where $I$ is an identity matrix. The regularized ELM model was used in this study. The tuning parameter and the number of hidden nodes are the hyperparameters of the ELM. The tuning parameter and the number of hidden nodes were systematically tested from 0.5 to 20 and from 100 to 600, respectively, for the validation data set. Based on the results of the cross validation using the validation data set, the ELM model had the highest ACC when the tuning parameter and the number of hidden nodes were 10 and 200, respectively. Thus, these two values were adopted as the optimal tuning parameter and number of hidden nodes of the ELM. Input features were selected via the backward elimination method using the validation data set [43]. Based on the results of the backward elimination method, the ELM model led to the highest ACC when channels #1, #2, and #7 were removed from the input feature set.
The structure of the employed ELM model is presented in Figure 2.
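The single-step training of a regularized ELM can be sketched with NumPy. The sketch assumes the closed-form ridge solution β = (I/C + HᵀH)⁻¹HᵀY, standard normal random input weights, labels coded as 0 and 1, and a 0.5 decision threshold; these choices, the function names, and the synthetic data are illustrative and are not taken from the study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, Y, n_hidden=200, C=10.0, seed=0):
    """Random hidden layer, then a ridge-regularized least-squares fit of beta."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # input-to-hidden weights (fixed)
    B = rng.normal(size=(1, n_hidden))           # hidden biases (fixed)
    H = sigmoid(X @ W + B)                       # nonlinear feature mapping
    A = np.eye(n_hidden) / C + H.T @ H
    beta = np.linalg.solve(A, H.T @ Y)           # solves (I/C + H^T H) beta = H^T Y
    return W, B, beta

def predict_elm(X, W, B, beta, threshold=0.5):
    """Label 1 (wetness) when the network output reaches the threshold."""
    return (sigmoid(X @ W + B) @ beta >= threshold).astype(int)

# Synthetic two-feature problem standing in for the satellite inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
Y = (X[:, :1] + X[:, 1:2] > 0).astype(float)
W, B, beta = train_elm(X, Y, n_hidden=50, C=10.0)
accuracy = (predict_elm(X, W, B, beta) == Y.astype(int)).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Using `np.linalg.solve` instead of an explicit matrix inverse gives the same solution with better numerical behavior.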

Random Forest
The RF has been broadly applied for classification problems [44,45]. The RF was proposed by Breiman [46] and uses bagging to build a number of decision trees with controlled variance. Thus, the RF consists of an ensemble of simple decision trees. Each decision tree in the RF is grown using randomly selected samples. Subsequently, the nodes in each tree use randomly selected input features. The RF has two major steps: (1) randomness and (2) ensemble learning.
The randomness in the RF comes from random sampling of the data set and random selection of the features with which every classification tree is built. The samples in the data set are randomly drawn with replacement to create a subset with which to train one classification tree. Approximately two-thirds of the data were selected as the training subset. At each node, the optimal split rule is determined using one of the features randomly selected, without replacement, from the employed features. The data that are not included in the training subset are denoted as out-of-bag and are used to evaluate the performance of the RF model and the importance of the features.
The ensemble learning method in the RF means that all individual decision trees in a collection of decision trees (called an ensemble) contribute to the final prediction. A training subset is created in the random selection step. A classification and regression tree, without pruning, is used to construct a single decision tree. To grow K trees in the ensemble, this process (resampling a subset and training an individual tree) is repeated K times. The final predicted label is the most frequent label among the predicted labels of all individual trees. The ranger package in R was used to construct the RF model [47]. The number of trees was systematically tested from 50 to 600 for the validation data set. Based on the results of the cross validation using the validation data set, the RF with 320 trees led to the highest ACC. Since the RF provides the importance of input features, the input features for the RF were selected based on the importance obtained from the RF with 320 trees. The features were sequentially removed from the input feature set in order of importance until the ACC of the RF decreased. Channels #1, #2, #5, and #6 and longitude were removed from the input feature set for the RF. The structure of the employed RF model is presented in Figure 3.
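Two ingredients of the RF described above, bootstrap resampling with its out-of-bag remainder and majority voting, can be illustrated with the Python standard library. This is a sketch of the general mechanism rather than of the ranger implementation; the function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n sample indices with replacement (one training subset for one tree)."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag(n, in_bag):
    """Indices never drawn into the bootstrap sample; usable for error estimation."""
    chosen = set(in_bag)
    return [i for i in range(n) if i not in chosen]

def majority_vote(tree_predictions):
    """Final label: the most frequent label predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(42)
n = 10000
in_bag = bootstrap_indices(n, rng)
oob = out_of_bag(n, in_bag)
print(f"unique in-bag fraction: {len(set(in_bag)) / n:.3f}")
print(f"out-of-bag fraction:    {len(oob) / n:.3f}")
print(majority_vote(["dry", "wet", "dry", "dry", "wet"]))
```

Sampling n points with replacement leaves roughly 1/e (about 37%) of the data out of the bag, which matches the statement that approximately two-thirds of the data end up in each training subset. In the case of a tie, `Counter.most_common` returns the label that was encountered first.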

Generalized Boosted Model
The GBM proposed by Friedman [48] is a widely used method for classification problems. Decision stumps or decision trees are widely used as weak classifiers in the GBM [48][49][50]. In the GBM, weak learners are trained sequentially to decrease a loss function, e.g., the mean square error. The residuals of the former weak learners are used to train the current weak learner, so the value of the loss function decreases as weak learners are added. A bagging step is employed to reduce the correlation between weak learners: each weak learner is trained with a subset sampled without replacement from the entire data set. The final prediction is obtained by combining the predictions of the set of weak learners.
The GBM and RF both adopt ensemble learning with a decision tree model (the weak learner). Both models produce one prediction based on a combination of predictions from a set of weak learners. Though the methods seem similar, they are based on different concepts. The major difference between the GBM and RF is that each tree in the GBM is fit on the residuals of the former trees, while the RF trains a set of weak learners independently using a number of subsets. Therefore, the GBM can reduce the bias of prediction while the RF can reduce the variance of prediction. The gbm package in R (https://github.com/gbm-developers/gbm) was used to construct the GBM. The number of trees was systematically tested from 50 to 600 for the validation data set. Based on the results of the cross validation using the validation data set, the GBM with 300 trees led to the highest ACC. Based on the results of the backward elimination method, channels #1, #2, #5, and #6 and longitude were removed from the input feature set for the GBM, as in the RF. The structure of the employed GBM model is presented in Figure 4.
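The idea of fitting each weak learner to the residuals of the former ones can be shown with a deliberately small Python sketch. It assumes a squared-error loss, decision stumps on a single feature, and no subsampling, which is simpler than the configuration of the gbm package; the function names, the learning rate, and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of a 1-D feature under squared error.

    Requires at least two distinct feature values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def fit_boosting(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def boosting_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((boosting_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]  # 0 = dryness, 1 = wetness
for k in (1, 5, 20):
    print(k, round(mse(fit_boosting(xs, ys, n_trees=k), xs, ys), 6))
```

The training error shrinks as trees are added because every new stump corrects what the previous ones left unexplained; this sequential correction is what reduces bias, in contrast to the independent trees of the RF.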

Support Vector Machine
SVM has been applied as a classifier to classification problems in various fields [51][52][53]. Unlike other ML algorithms such as ELM, RF, and GBM, the SVM was developed on a rigorous mathematical basis; for example, for linear classification problems, every procedure in the SVM can be proved mathematically. The SVM builds a classifier that maximizes the margin, that is, the distance between two groups. This distance is determined by the support vectors, that is, the data points of one group nearest to the other group. In the current study, ν-SVM was employed for the SVM algorithm [54]. To find the hyperplane for maximizing the margin between the support vectors, the following optimization problem with its constraints needs to be solved.
where $\nu$, $\xi_i$, and $\Theta(\cdot)$ are the regularization constant that ranges from 0 to 1, the slack variable for the ith data point, and the kernel function, which is the radial basis function in the current study, respectively. The ν-SVM for the LW prediction model was implemented using Scikit-learn in Python [55]. In the ν-SVM, ν is the hyperparameter that represents the tolerance for support vectors that violate the defined hyperplane, and it was optimized using the validation data set. The value of ν was systematically tested from 0.05 to 1 for the validation data set. Based on the validation results, 0.1 was adopted as the value of ν in the ν-SVM.

Deep Neural Network
The DNN is a feed-forward network with multiple hidden layers. The main difference between shallow and deep neural networks is the number of hidden layers in the networks. The deep hidden layers of the DNN allow it to emulate a complex functional relation between input and output by extracting complicated feature structures [56]. In the current study, the sigmoid function was used as the activation function of the DNN model. Four hidden layers were adopted for the employed DNN model, and each of these hidden layers has a different number of nodes. The structure of the DNN model was manually optimized because there is no standard method to find the optimal structure. In addition, layer normalization was applied after the second, third, and fourth hidden layers. Layer normalization improves the accuracy of the trained DNN model [57]. The structure of the employed DNN model is presented in Figure 5. The Adam algorithm with mini-batches was used to train the DNN model [58]. The PyTorch library in Python was used to implement and train the DNN model [59]. The mini-batch size, the number of epochs, and the learning rate were set at 3000, 13,000, and 0.0025, respectively. These hyperparameters were manually optimized based on the evaluation of the trained model for the validation data set.

Evaluation Measures
To evaluate the performances of the proposed LW prediction models, ACC, recall, precision, and F1 score (F) were employed and calculated as follows:
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Recall = \frac{TP}{TP + FN}$$
$$Precision = \frac{TP}{TP + FP}$$
$$F = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
where TP, TN, FP, and FN indicate the numbers of true positives (correct prediction for wetness), true negatives (correct prediction for dryness), false positives (wrong prediction for wetness), and false negatives (wrong prediction for dryness), respectively. ACC is the ratio of the number of correct predictions from a model to the total number of predictions. Precision is the fraction of correct predictions among the wetness predictions, while recall is the fraction of correct predictions among the cases in which wetness is observed. For example, when the numbers of correct predictions for wetness and dryness are 20 and 80 and the total number of predictions is 200, ACC becomes 0.5 (=100/200). Precision and recall have a tradeoff relationship. Thus, F was suggested to consider both precision and recall in evaluating the performance of a classifier. Root mean square error (RMSE), Pearson correlation (Cor), and mean bias (mBias) are employed as evaluation criteria for the LWD. RMSE is presented as follows:
$$RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(E_i - O_i\right)^{2}}$$
where $E_i$, $O_i$, and $m$ are the ith LWD prediction, the ith LWD observation, and the number of data points, respectively. The Pearson correlation can be calculated as follows:
$$Cor = \frac{\sum_{i=1}^{m}\left(E_i - \bar{E}\right)\left(O_i - \bar{O}\right)}{\sqrt{\sum_{i=1}^{m}\left(E_i - \bar{E}\right)^{2}}\sqrt{\sum_{i=1}^{m}\left(O_i - \bar{O}\right)^{2}}}$$
where $\bar{E}$ and $\bar{O}$ are the means of the LWD predictions and observations, respectively. The mBias is given as follows:
$$mBias = \frac{1}{n}\sum_{i=1}^{n}\left(E_i - O_i\right)$$
where $E_i$, $O_i$, and $n$ are the ith LWD prediction, the ith LWD observation, and the number of data points, respectively.
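The evaluation measures can be computed with a few lines of Python. The sketch below uses the standard definitions of ACC, recall, precision, and F from confusion-matrix counts, and of RMSE, Pearson correlation, and mean bias from paired LWD values; the function names and the numbers are illustrative, and zero denominators are mapped to 0.0 as a convention chosen for this example.

```python
import math

def classification_scores(tp, tn, fp, fn):
    """ACC, recall, precision, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return acc, recall, precision, f1

def lwd_scores(est, obs):
    """RMSE, Pearson correlation, and mean bias of LWD estimates (hours)."""
    m = len(est)
    rmse = math.sqrt(sum((e - o) ** 2 for e, o in zip(est, obs)) / m)
    e_bar, o_bar = sum(est) / m, sum(obs) / m
    sxy = sum((e - e_bar) * (o - o_bar) for e, o in zip(est, obs))
    sxx = sum((e - e_bar) ** 2 for e in est)
    syy = sum((o - o_bar) ** 2 for o in obs)
    cor = sxy / math.sqrt(sxx * syy)
    mbias = sum(e - o for e, o in zip(est, obs)) / m
    return rmse, cor, mbias

# 20 correct wetness and 80 correct dryness predictions out of 200 in total.
acc, recall, precision, f1 = classification_scores(tp=20, tn=80, fp=40, fn=60)
print(acc, round(recall, 3), round(precision, 3), round(f1, 3))

rmse, cor, mbias = lwd_scores(est=[10, 8, 12], obs=[9, 8, 10])
print(round(rmse, 3), round(cor, 3), round(mbias, 3))
```

A positive mean bias indicates that the model overestimates LWD on average.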

Performance Evaluation for LW Prediction Models
The ACC, recall, precision, and F of the LW predictions from all the employed models for the test data are presented in Table 2. Based on ACC, RF showed the best performance in LW prediction among all the employed models, including PM, which uses in-situ meteorological observations. The GBM, SVM, and DNN models had ACC values comparable to that of PM. LR and ELM showed poor prediction performance, with ACC values below 0.7. Although ACC is often employed as the main performance measure for classification models, it is of limited use when the data set is imbalanced. In such a situation, F is an alternative for evaluating a classification model, as it is a hybrid measure that combines recall and precision. The F values of the ML-based LW prediction models were lower than that of PM. The F values for LR, GBM, and SVM were lower than 0.4. Although the ACCs of GBM and SVM were relatively high among the tested models, their F values were very low. Since the proportion of dryness was approximately 80%, a model that returned only dryness would have an ACC of approximately 0.8. Hence, this result indicates that GBM and SVM overestimated dryness. Overall, the prediction performance of LR was the poorest among the models based on the evaluation measures.
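The effect of class imbalance on ACC described above can be reproduced with a small sketch. The labels below are hypothetical (20 wet and 80 dry hours, matching the approximate 80% dryness proportion), and the two "models" are fixed prediction lists rather than any of the classifiers used in the study.

```python
def acc_and_f(pred, obs):
    """Return (ACC, F) for binary labels where 1 = wet and 0 = dry."""
    tp = sum(1 for p, o in zip(pred, obs) if p == 1 and o == 1)
    tn = sum(1 for p, o in zip(pred, obs) if p == 0 and o == 0)
    fp = sum(1 for p, o in zip(pred, obs) if p == 1 and o == 0)
    fn = sum(1 for p, o in zip(pred, obs) if p == 0 and o == 1)
    acc = (tp + tn) / len(obs)
    # F written in terms of counts: 2TP / (2TP + FP + FN); zero when undefined.
    denom = 2 * tp + fp + fn
    f = 2 * tp / denom if denom else 0.0
    return acc, f


# Hypothetical observations: 20 wet hours followed by 80 dry hours.
obs = [1] * 20 + [0] * 80

# A model that always answers "dry".
always_dry = [0] * 100

# A cautious model that flags only 5 of the 20 wet hours and raises no false alarms.
cautious = [1] * 5 + [0] * 95

print(acc_and_f(always_dry, obs))  # ACC 0.8, F 0.0
print(acc_and_f(cautious, obs))    # ACC 0.85, F 0.4
```

Both hypothetical models reach an ACC of at least 0.8, yet their F values stay at or below 0.4, which is the pattern reported for GBM and SVM.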
The ELM, RF, and DNN models provided F values comparable to that of PM. The recall values of ELM and RF were much lower than that of PM, while their precision was high. In particular, the LW predictions by RF had the highest precision. These results mean that the wetness predicted by ELM and RF was usually correct, but that these models missed many observed wetness cases. The DNN model provided LW predictions with the highest recall, while its precision was very low: it detected more of the observed wetness than the other models, but many of its wetness predictions were incorrect. These results show that some ML-based LW models provided prediction performance comparable to that of PM, even though PM slightly outperformed them. The evaluation of the LW models indicates that LW prediction using satellite observations and ML algorithms is a viable alternative.

Performance Evaluation for LWD Predictions
The performances of the employed models for predicting LWD were evaluated using RMSE, Cor, and mBias for the test data set; these values are listed in Table 3. The PM had the smallest RMSE between the predicted and observed LWD among all the employed models for the test data set, followed by DNN and RF. Based on Cor, the DNN had the best performance, followed by RF and PM. Overall, PM, RF, and DNN provided superior performances in predicting LWD. To investigate the prediction performances of PM, LR, SVM, GBM, RF, and DNN in detail, density plots of the LWD predictions and observations are shown in Figure 6. The data points having zero values in both prediction and observation were excluded to highlight the other values of LWD, since the number of data points with zero values was much larger than the number of other data points. Since ELM had the largest RMSE in Table 3, it was excluded from the density plots. Figure 6a shows that the LWD predictions by the PM were scattered around the diagonal line. LR and DNN overestimated LWD, while RF, GBM, and SVM underestimated it. The Cor values, after excluding data points having zeros in both prediction and observation, show that the DNN and RF would provide more accurate LWD predictions than the PM. The RMSE of DNN presented in Figure 6 was smaller than that of the PM. These results indicate that RF and DNN provide good performances in LWD prediction that are comparable to that of the PM.
Since LWD is a local phenomenon, investigating the prediction performance of the employed models at individual stations, i.e., at the local scale, is important. Hence, the evaluation measures for each station (stations #6 and #11) were calculated using the test data set and are presented in Table 4. For station #6, PM and DNN provided similar prediction performances based on RMSE and Cor. Though RF showed a better performance than LR, it was worse than PM and DNN. For station #11, RF had the best performance among the employed models. The RMSE and Cor of the LWD predictions by RF were 3.67 h and 0.815, respectively. Notably, RF performed better than PM, which used in-situ meteorological variables for LWD prediction. The DNN also showed good performance and was competitive with PM. Although the results for the individual stations were consistent with the results for all the stations in the test data set, the prediction performances of the employed models varied with the local characteristics. To investigate the detailed characteristics of LWD predictions at the local scale, density plots of the PM, LR, SVM, GBM, RF, and DNN predictions were made for stations #6 and #11 and are presented in Figures 7 and 8. Figures 7 and 8 show that the overall characteristics of the LWD predictions of the employed models were similar to those in Figure 6. LR and DNN overestimated LWD, while RF, GBM, and SVM underestimated it.
For station #6, the RF provided a strong underestimation while the DNN led to a small overestimation. For station #11, the PM seemed to overestimate LWD. Though the RF led to underestimation, the magnitude of underestimation was weaker than that for station #6. Unlike in the case of RF, overestimation by DNN became stronger in station #6.
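The exclusion of data points that are zero in both prediction and observation, used for the density plots, can be sketched as follows. The LWD values are hypothetical and the helper names are illustrative, not part of the study. The sketch shows one reason such a filter matters: a large number of trivially matching zero pairs can dominate the correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_double_zeros(pred, obs):
    """Keep only pairs where prediction and observation are not both zero."""
    kept = [(p, o) for p, o in zip(pred, obs) if not (p == 0 and o == 0)]
    return [p for p, _ in kept], [o for _, o in kept]


# Hypothetical daily LWD in hours; six days are dry in both series.
pred = [0, 0, 0, 0, 0, 0, 3, 8, 5, 0]
obs = [0, 0, 0, 0, 0, 0, 6, 7, 4, 2]

pred_nz, obs_nz = drop_double_zeros(pred, obs)

cor_all = pearson(pred, obs)
cor_nonzero = pearson(pred_nz, obs_nz)
print(len(pred_nz), round(cor_all, 2), round(cor_nonzero, 2))
```

For this toy series the correlation over all pairs is higher than the correlation over the remaining four pairs, so the filtered value is the more conservative measure of skill on days with wetness. The direction and size of the change depend on the data.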
Only the test data sets pertaining to the two stations were analyzed because of limited data availability. To evaluate the spatial distribution of the LWD predictions, only dates with a large number of data points were selected, as the generality of the evaluation could otherwise be lost. Four days, from 1 March 2020 to 4 March 2020, were chosen. In this period, LWD observations from 60 stations that were not used in the raw training or test data sets were available for evaluation. The RMSEs of the LWD predictions by PM, RF, and DNN for these 60 stations were 5.2 h, 3.4 h, and 9.1 h, respectively. Based on the RMSE, RF showed the best performance for LWD prediction during this period. The biases of PM, RF, and DNN were 3.2 h, 1.3 h, and 8.2 h, respectively. All three models had positive biases, and the bias of the DNN model was very large. The poor performance of the DNN was attributed to its large positive bias. The spatial distributions of the LWD predictions of RF and DNN for the selected dates are presented in Figure 9 to investigate their characteristics. Figure 9 shows that the LWD values predicted by DNN were much larger than those predicted by RF, and the overall spatial patterns of the two models were dissimilar. This result indicates that LWD prediction models can produce substantially different predictions even when the differences in their evaluation measures are small. Overall, the LWD prediction model using RF and satellite observations may be a good alternative for predicting LWD in ungauged locations and for obtaining LWD predictions of consistent quality.
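The attribution of the DNN's large RMSE to its bias can be checked with the identity RMSE² = mBias² + Var(E − O), where Var is the population variance of the errors. The following sketch applies this identity to the rounded values reported above; the derived numbers are therefore approximate, and the helper name is hypothetical.

```python
import math

# Reported (RMSE, bias) in hours for the 60 evaluation stations, 1 to 4 March 2020.
reported = {"PM": (5.2, 3.2), "RF": (3.4, 1.3), "DNN": (9.1, 8.2)}


def error_spread(rmse, bias):
    """Standard deviation of the errors implied by RMSE^2 = bias^2 + variance."""
    return math.sqrt(max(rmse ** 2 - bias ** 2, 0.0))


spread = {}
bias_share = {}
for name, (rmse, bias) in reported.items():
    spread[name] = error_spread(rmse, bias)
    # Fraction of the mean squared error explained by the systematic bias.
    bias_share[name] = bias ** 2 / rmse ** 2
    print(name, round(spread[name], 2), round(bias_share[name], 2))
```

Under this decomposition, roughly four fifths of the DNN's mean squared error comes from the bias term, compared with well under half for PM and RF, which is consistent with the statement that the DNN's poor performance in this period stems from its large positive bias.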

Discussion
The results of the study show that satellite observations can be used for LWD prediction. The RF model exhibited the best performance in LW prediction and a good performance in LWD prediction using satellite observations. The DNN model also provided a good performance in LWD prediction. Based on the performance evaluation for the spatial distribution of LWD, RF provided the most accurate prediction, although only snapshots of LWD observations were used in that evaluation. Overall, the PM was the best LWD prediction model among all the employed models because it was among the best three models in all evaluation results. Notably, the values of recall and precision in LW prediction were both higher than 0.5 for PM. Despite its superiority, PM cannot predict LWD in an ungauged location since it requires in-situ meteorological data. Unlike PM, the other LWD prediction models proposed in the current study can make predictions in an ungauged location because they use satellite observations. The superiority of PM among the employed models can be explained by its ability to account for local meteorological characteristics. Although the PM had an advantage in terms of performance evaluation, the other LWD prediction models produced LWD predictions with accuracy comparable to that of the PM. Therefore, LWD prediction models that use satellite observations have the potential to predict LWD in locations where in-situ meteorological data are unavailable.
As listed in Table 1, each channel has a different set of physical properties. Investigating the specific influence of each channel on LWD prediction is critical for understanding the relationship between satellite and LWD observations. The RF model can provide the importance of input features in predicting a variable of interest [17]. Here, importance means the magnitude of the decrease in the accuracy of the model when the selected input feature is not used in it. The RF model randomly selects input features and samples data for building each decision tree, and therefore, some input features and data are not used in the training of that tree. By using the unused data, the importance of the input features can be calculated. As RF emerged as one of the good models for predicting LWD, investigating the importance of its input features helps to clarify the relationship between satellite and LWD observations. Figure 10 presents the importance of the input features from RF for the raw training data set. The results show that channels #1 to #6 have a much lower importance than channels #7 to #16. LW often forms at dawn and during nighttime. As longwave channels can observe radiance at nighttime, these channels would provide critical information for predicting LWD. The most important channel was channel #7, as it can represent low cloud, fog, fire, and wind, which are important for LWD prediction because they account for most of the generating mechanisms of LW. Furthermore, channel #7 would detect the changes in radiance at dawn because it is a mid-infrared channel. Due to these characteristics, this channel may successfully deliver information for predicting LWD. The second most important channel was channel #12, which can represent total ozone, turbulence, and wind. The channels that can represent critical meteorological variables for LWD, such as humidity, solar radiation, wind speed, and temperature, have high values of importance, and so do the related longwave channels. These results suggest that the use of mid- and longwave channels would be beneficial in LWD prediction.
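The notion of importance as a decrease in accuracy can be illustrated with a generic permutation-importance sketch. This is not the RF out-of-bag procedure used in the study: it shuffles one input column at a time on a supplied data set and works with any prediction function. The model `toy_model`, its 280 K threshold, and the two input features are hypothetical stand-ins chosen only to make the example runnable.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model output equals the label."""
    hits = sum(1 for r, y in zip(rows, labels) if predict(r) == y)
    return hits / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    base = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(base - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier: wet (1) when feature 0 is below 280, else dry (0)."""
    return 1 if row[0] < 280.0 else 0


# Feature 0 mimics a brightness temperature; feature 1 is unrelated to the label.
rows = [[270.0 + i, float(i % 3)] for i in range(20)]
labels = [toy_model(r) for r in rows]

importance = permutation_importance(toy_model, rows, labels)
print(importance)
```

Because the toy model ignores feature 1, shuffling that column never changes its predictions and the importance is exactly zero, while shuffling feature 0 lowers the accuracy. A ranking such as the one in Figure 10 can be read in the same way: channels with larger accuracy drops carry more information for the model.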
The results of this study may not represent the typical characteristics of LWD because the observation data cover only two seasons: winter and spring. The generating mechanisms of LW vary with the seasons; for instance, LW in summer is often caused by rainfall, while condensation of water vapor on the leaf (dew) causes LW in winter. In addition, the frequency of LW varies with the seasons. The proportion of dry records in the LW observations used in this study was approximately 80%. If LW observations in summer were included, this proportion would be approximately 60%. To generalize the results of the performance evaluation for LWD prediction using satellite observations, a longer observation period is required. Hence, the results of the performance evaluation in the current study are indicative rather than definitive in deciding the most appropriate model for LWD prediction using satellite observations. All the considered models for LWD prediction will be extensively compared in a future study.
More input features providing local information, e.g., geographic, phenological, and vegetation information, are required to improve the performance of the LWD prediction model. Some channels may approximately reflect geographic and phenological information for the area of interest, but the spatial resolution of the satellite observations is too coarse to describe such local information. Due to this limitation, the proposed models may have limited performance. If this information can be integrated into the model for predicting LWD, its accuracy is expected to improve. A physics-based LWD prediction model using satellite observations should be developed to obtain consistent predictions under large meteorological variations. The approach proposed in the current study may provide different results for other regions or countries because ML algorithms are employed as classifiers in building the LW prediction model using satellite observations. Results may also vary depending on the use of different products from different satellites. The main advantage of a physics-based LWD prediction model, e.g., PM, is that the model can be used in any region [13]. Standard protocols for measuring LWD and maintaining instruments are required to obtain reliable observations and expand their coverage. As shown in Figure 1, the northwestern region of South Korea is not represented in this study. Once a standard protocol is prepared, LWD observations from some stations in this area can be used. This will lead to an increase in coverage and a better performance of LWD prediction.
ML algorithms can provide unrealistic results even if the LWD prediction model is properly developed. For example, the LWD prediction model using DNN showed good performance based on the evaluation measures, yet in Figure 9 it overestimated LWD for all tested days. In addition, stripe noise from the satellite images was amplified by RF in LWD prediction. Though stripe noise in satellite images is inevitable, it can be reduced by noise filtering algorithms [60][61][62]. Strong stripe noise was not observed in the raw satellite images. However, clear stripe noise appeared in Figure 9a-d. This result implies that the application of ML algorithms can amplify stripe noise in the satellite images. Thus, when an LWD prediction model using an ML algorithm is considered for an operational system, the robustness of the prediction model should be thoroughly inspected for various cases in order to reduce the likelihood of wrong predictions.
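As a simple illustration of what a stripe-noise filter does, the sketch below removes a constant offset from each image row by aligning row medians. This is a deliberately basic row-offset correction, not one of the destriping methods of [60][61][62], and it assumes that stripes run along rows and act as additive offsets. The function name and the small test image are hypothetical.

```python
from statistics import median


def destripe_rows(image):
    """Shift each row so that its median matches the median of all row medians."""
    row_levels = [median(row) for row in image]
    reference = median(row_levels)
    return [
        [value - (level - reference) for value in row]
        for row, level in zip(image, row_levels)
    ]


# Hypothetical 5 x 3 image; the third row carries an additive stripe of +3.
image = [
    [10.0, 11.0, 12.0],
    [10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0],
    [10.0, 11.0, 12.0],
    [10.0, 11.0, 12.0],
]

cleaned = destripe_rows(image)
print(cleaned[2])  # [10.0, 11.0, 12.0]
```

A correction of this kind also removes genuine row-to-row gradients in the scene, so it would need to be validated on real imagery before being placed in front of an ML-based LWD model.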

Conclusions
An approach to predicting LWD using satellite data and ML algorithms as classifiers was proposed. The feasibility of the proposed approach was evaluated in South Korea using GK-2A satellite data and several ML models. The performance evaluation showed that the proposed models were comparable to PM in predicting LWD. The proposed models can be utilized to obtain LWD data for locations in which meteorological instruments are not installed. Among the prediction models, RF and DNN showed good performances in predicting LWD in South Korea. These two ML algorithms can be used as classifiers in building an LWD prediction model using satellite observations in other regions. The use of satellite observations in LWD prediction still needs improvement. For example, integrating local information such as elevation and vegetation type into the LWD prediction model may improve its performance. Additionally, ML algorithms may perform poorly when data sets with different meteorological characteristics are used as inputs. Therefore, a physics-based LWD prediction model using satellite observations should be developed in future research.