Icing Detection over East Asia from Geostationary Satellite Data Using Machine Learning Approaches

: Even though deicing or airframe coating technologies continue to develop, aircraft icing is still one of the critical threats to aviation. While the detection of potential icing clouds has been conducted using geostationary satellite data in the US and Europe, there is not yet a robust model that detects potential icing areas in East Asia. In this study, we proposed machine-learning-based icing detection models using data from two geostationary satellites—the Communication, Ocean, and Meteorological Satellite (COMS) Meteorological Imager (MI) and the Himawari-8 Advanced Himawari Imager (AHI)—over Northeast Asia. Two machine learning techniques—random forest (RF) and multinomial log-linear (MLL) models—were evaluated with quality-controlled pilot reports (PIREPs) as the reference data. The machine-learning-based models were compared to the existing models through ﬁve-fold cross-validation. The RF model for COMS MI produced the best performance, resulting in a mean probability of detection (POD) of 81.8%, a mean overall accuracy (OA) of 82.1%, and mean true skill statistics (TSS) of 64.0%. One of the existing models, ﬂight icing threat (FIT), produced relatively poor performance, providing a mean POD of 36.4%, a mean OA of 61.0, and a mean TSS of 9.7%. The Himawari-8 based models also produced performance comparable to the COMS models. However, it should be noted that very limited PIREP reference data were available especially for the Himawari-8 models, which requires further evaluation in the future with more reference data. The spatio-temporal patterns of the icing areas detected using the developed models were also visually examined using time-series satellite data.


Introduction
Aircraft icing is a dangerous threat that results in many accidents which can cause fatalities and financial losses [1,2]. It is a phenomenon in which supercooled droplets (SCDs) collide with a hard surface forming an ice film. Clouds often contain SCDs, but icing occurs when there is a high density of SCDs [3][4][5]. When icing forms on aircraft bodies and wings, the aircraft's balance is disturbed, resulting in a loss of control. For this reason, detecting and avoiding potential icing areas is crucial for aviation safety. The detection of SCDs, especially in the freezing phase of rain, is usually conducted using a thresholding approach based on the subfreezing temperature range and high relative humidity [6].
Research efforts have been made to identify icing regions using ground or airborne observations and human reporting systems such as Tropospheric Airborne Meteorological Data Reporting (TAMDAR) and pilot reports (PIREPs). Such systems are designed to warn other flights by recording the time and location with detailed atmospheric information when airplanes are crossing dangerous areas [7]. Although observation data only provide limited spatiotemporal information on icing, they have been used as validation data in many icing-related studies [8][9][10].
Numerical forecasting models have also been used to estimate potential icing regions. For instance, high resolution numerical weather prediction model results have been used to calculate temperature (T), relative humidity (RH), vertical velocity (VV), and supercooled liquid water (SLW) as input parameters in potential icing calculation algorithms [10,11]. However, numerical models often provide inaccurate results [12], which increase the uncertainty of potential icing clouds identified by the icing algorithms [13].
Geostationary satellite sensors can be an effective alternative because they collect data over wide areas with high temporal frequency (~minutes). SLW, which potentially results in icing, tends to form near cloud tops where the air temperatures typically range from freezing temperature to −30 • C [14]. Thus, satellite observations over cloud tops can provide valuable information on icing [15]. In particular, satellite-derived data are greatly suitable for icing research due to the fact that icing intensity is closely related to several meteorological factors such as cloud temperature, thickness, phase, and distribution, which can be effectively derived from satellite images [15][16][17].
With an increasing interest in the synergistic use of satellite and ground observations for icing detection, many agencies and research groups have developed operational systems for icing detection, including the Flight Icing Threat (FIT) [15], and the National Aeronautics and Space Administration (NASA) Icing Remote Sensing System (NIRSS) [18]. The Alliance Icing Research Study (AIRS) investigated the climatological characteristics of icing with varying intensities over Canada, and compared meteorological data between aircraft measurements and ground-based remote sensing data, such as doppler radar, aircraft weather radar, light detection and ranging (LIDAR), infra-red (IR) and visible sensors [19]. The IR brightness temperature data derived from advanced very high resolution radiometers (AVHRR) onboard the National Oceanic and Atmospheric Administration (NOAA) polar-orbiting satellites were used for icing detection with temperature and relative humidity (T/RH) derived from numerical models, which resulted in a probability of detection (POD) of up to 70% for moderate-or-greater icing [17]. Minnis et al. [16] examined the feasibility of the cloud parameters produced from Geostationary Operational Environmental Satellite (GOES) satellite data as input factors in an icing detection algorithm. They suggested a probability-based algorithm with an empirical equation focusing on the liquid water path (LWP) and evaluated it with PIREPs resulting in a POD of 54.5%.
Bernstein et al. [10] proposed an icing detection algorithm that uses cloud effective radius (CER), cloud optical thickness (COT), and LWP products generated from GOES satellite data. The algorithm was based on a thresholding approach where the thresholds were determined through statistical analyses using a huge amount of in situ data over the contiguous United States (CONUS) region. They combined a fuzzy logic membership function and a human-derived decision tree to determine the thresholds rather than using simple physical thresholds such as those used in the T/RH scheme. The proposed icing detection algorithm was successfully validated using four hindcast icing cases. Smith et al. [15] improved the icing detection model proposed by Bernstein et al. [10], suggesting an operational NOAA FIT algorithm. However, the algorithm produced a high false alarm rate in detecting potential icing areas due to the different scales between satellite and reference data, because icing often occurs in a smaller region than the pixel size (i.e., 4 km by 4 km) of the satellite data used [15]. The thresholding approach based on the physical characteristics of icing has some limitations. Above all, the determination of thresholds for meteorological factors tends to be subjective, depending on experts [20]. Moreover, several factors in a physical icing algorithm come from model outputs (e.g., T, RH, and VV), which implies that the algorithm strongly depends on the accuracy of the corresponding physical model [10].
Although many studies have been conducted to detect and monitor satellite-based flight icing [15,21], they have focused on the areas over North America and Europe. There has been minimal exploration of satellite-based flight icing detection in Asia, even though several meteorological satellite sensor systems are in operation, such as the Communication, Ocean and Meteorological Satellite (COMS) Meteorological Imager (MI) and the Himawari-8 Advanced Himawari Imager (AMI). The Korea Meteorological Administration (KMA) proposed an icing detection system using COMS MI over East Asia [22]. The KMA algorithm produced a high false alarm rate when compared to PIREPs because it is a composite model of thresholding approaches based solely on the physical theory of icing [22]. This implies that the physical properties of in-flight potential icing regions might vary by location and adaptive thresholding approaches may need to be considered in order for a model to work with different areas and data.
A robust icing detection model can be proposed using a number of reference data for calibration and validation. Thus, it is necessary to evaluate icing detection models using long term data covering more than a few years, unlike the existing studies which have mostly focused on the evaluation of a short period of time (e.g., a few months or specific seasons) [10,[15][16][17]19,22].
Although several satellite-derived variables such as COT and CER are related to icing, each variable has its own uncertainty and the relationship between the variables and icing might not be robustly determined through a simple thresholding approach, which implies that more advanced approaches capable of handling non-linear behaviors are necessary for icing detection.
Recently, machine learning approaches have been introduced in many classification and regression tasks using satellite remote sensing [23][24][25][26][27][28]. Unlike typical statistical approaches, machine learning is generally free from data assumptions and has been proven to be effective in modeling non-linear behaviors. For meteorological applications, machine learning has been applied to detect phenomena that pose a risk to aviation, such as overshooting tops and cloud convective initiation [29][30][31].
The objectives of this research were to (1) develop icing detection algorithms for a portion of East Asia based on machine learning approaches with COMS MI and Himawari-8 AHI products; (2) compare the proposed algorithms with the existing FIT and KMA algorithms; and (3) interpret the properties of the potential icing clouds identified by the algorithms. Two machine learning approaches-random forest (RF) and multinomial log-linear model (MLL)-were used in this study.

Study Area
A part of East Asia was selected as the study area based on the spatial extent of the obtained reference data (i.e., PIREPs). The area (i.e., 94-141 • E and 5-47 • N) covers the major countries of East Asia (i.e., the Korean Peninsula, Japan, China, Hong Kong, and Taiwan). Based on the total air passenger traffic in 2016, one quarter of the world's top 20 airports are located in this region [32].

Geostationary Satellite Data
COMS is Korea's first multi-purpose geostationary satellite, and is equipped with a Meteorological Imager (MI) sensor observing East Asia every 15 min [33]. COMS MI provides data at one visible, shortwave infrared (SWIR), water vapor (WV), and two IR channels [22,29] (Table 1). COMS MI also provides level 2 products such as cloud analysis (CA) and cloud top temperature and heights (CTTP), which quantify the inner and upper properties of clouds. CA contains COT, measuring the optical depth of clouds with a scaling from 0 to 100; and cloud phase (CP), providing information on whether the major portion of a near cloud top is water, ice, mixed, unknown, or if the sky is clear [33][34][35]. CTTP provides information on cloud top conditions, including cloud top temperature (CTT), cloud top height (CTH), and cloud top pressure (CTP) [33][34][35]. The spatial resolution of the visible channel is 1 km, while the other products have a resolution of 4 km. COMS MI products were downloaded through the National Meteorological Satellite Center (NMSC) webpage (http://nmsc.kma.go.kr/). Himawari-8, a replacement of the Japanese Multifunctional Transport Satellite 2 (MTSAT-2), was launched by Japan Meteorological Agency (JMA) on 7 October 2014. Its major instrument is the Advanced Himawari Imager (AHI) that has 16 bands with a high spatial resolution (500 m-2 km), and scans Northeast Asia every 10 min [36][37][38]. The bands consist of three visible channels with 0.5-1 km resolution and thirteen IR channels with 1-2 km resolution (Table 1). In this study, we used only 14 bands from the site (ftp://hmwr829gr.cr.chiba-u.ac.jp/) because there are not yet publicly available cloud products from Himawari-8. COMS data from April 2011 to July 2017 and Himawari-8 data from July 2015 to July 2017 were used in this study.

Pilot Reports
Most aero-vehicles provide specific flight logs created by the pilot when they fly through areas of potentially problematic conditions or pre-defined areas. They record the time, position, and external environmental conditions that may cause aviation problems. Such flight logs are called PIREPs. These records provide information on not only basic external conditions (e.g., temperature, wind direction, and wind speed) but also particular events (e.g., turbulence and accretion of ice on the airframe) with the time and location. When ice crystals form on the aircraft body, it is recorded as 'icing' on the PIREP, together with its severity; this has been commonly used as reference data to develop and validate icing detection algorithms [10,14,39]. PIREPs are accumulated extensively in the CONUS, with thousands of icing records per year [10,15]. However, unlike the US, a relatively small number of icing observations are reported in East Asia. In this study, PIREPs from KMA and the Hong Kong Observatory (HKO) were used as reference data to develop icing detection algorithms.
Although PIREPs are commonly used as reference data for icing research, they are subjective and thus might provide inaccurate icing information. Furthermore, some PIREPs are reported in a few specific locations designated by governmental agencies. Thus, a quality control (QC) process for PIREPs should be conducted before using them as reference data. QC consists of several rules to verify that an icing PIREP has been obtained in an area corresponding to the physical characteristics of potential icing (e.g., SLW clouds). Table 2 shows the rules used to conduct the quality control of PIREPs using flight information and satellite-derived products. While a total of 54 icing PIREP cases (36 from HKO and 18 from KMA) were originally collected for COMS applications, only 24 PIREP cases were available for Himawari-8 applications. Through quality control, a total of 48 and 20 icing PIREPs were selected to develop icing detection models for COMS and Himawari applications, respectively. In addition, 113 carefully selected non-icing references from KMA PIREPs were used to develop the icing models (Table 3). Table 2. Quality control rules for selecting pilot reports (PIREPs) to be used to develop icing detection models. Six rules were applied to each pixel in the buffer with a radius of 20 km at an icing PIREP location. Pixels that met the rules, except for rule 3, were excluded. A final decision as to whether an icing PIREP was valid or not was made using the remaining pixels within the buffer.

Methodology
A flow chart of the proposed approach is provided in Figure 1. COMS MI and Himawari-8 AHI products were used as the input data. Since the satellite data have different spatial resolutions, all data were resampled to 1 km before input features were extracted based on the PIREP data. The extracted samples were used to develop icing detection models and a five-fold cross-validation method was used to evaluate the proposed models.

Sample Extraction
Since PIREPs are point-based data, spatial samples should be extracted to develop empirical icing detection models. In this study, a sample extraction method was used to obtain icing and nonicing samples from the COMS and Himawari-8 data. Once resampling of the input data was conducted, icing and non-icing samples were extracted using a cold cloud top temperature mask and spatial buffering (Figure 2). The cold cloud top mask uses the IR1 channel from COMS, and CH13 from Himawari-8 with a threshold of 273 K brightness temperature, since IR1 and CH13 (i.e., ~11 µm) represent cloud top temperature [3,40]. Since point-based icing PIREP data typically have a spatial significance of ~20 km in radius [10], a buffer with a radius of 20 km from an icing PIREP location was used to extract the icing pixels in this study. For non-icing samples, a more conservative buffer size (i.e., a radius of 15 km) was used to ensure the reliability of the training data, as much more nonicing PIREP data exists compared to icing PIREP data.
Through these processes, a total of 53,760 icing and 117,148 non-icing pixels for COMS, and 5506 icing and 12,352 non-icing pixels for Himawari-8, respectively, were extracted as reference data. The reference data were divided into five groups based on icing and non-icing cases, not individual pixels, and a five-fold cross validation was conducted to evaluate the machine-learning-based icing detection models.

Sample Extraction
Since PIREPs are point-based data, spatial samples should be extracted to develop empirical icing detection models. In this study, a sample extraction method was used to obtain icing and non-icing samples from the COMS and Himawari-8 data. Once resampling of the input data was conducted, icing and non-icing samples were extracted using a cold cloud top temperature mask and spatial buffering ( Figure 2). The cold cloud top mask uses the IR1 channel from COMS, and CH13 from Himawari-8 with a threshold of 273 K brightness temperature, since IR1 and CH13 (i.e.,~11 µm) represent cloud top temperature [3,40]. Since point-based icing PIREP data typically have a spatial significance of~20 km in radius [10], a buffer with a radius of 20 km from an icing PIREP location was used to extract the icing pixels in this study. For non-icing samples, a more conservative buffer size (i.e., a radius of 15 km) was used to ensure the reliability of the training data, as much more non-icing PIREP data exists compared to icing PIREP data.

Machine Learning
Random forest (RF) has proven very effective for several meteorological satellite-based applications such as the detection of convective initiation and overshooting tops, and rainfall rate estimation [29][30][31]41]. The multinomial log-linear model (MLL) is an improved version of logistic regression that incorporates an artificial neural network approach for parameter optimization [42]. Logistic regression has often been used in meteorological satellite applications [30,31] and operational systems, such as to monitor rapid development thunderstorms (RDT) using Spinning Enhanced Visible and Infrared Imager (SEVIRI) satellite data [43]. Thus, in this study, the two machine learning classification approaches-RF and MLL via neural networks-were used to develop icing detection models. RF, which is based on classification and regression trees (CART; [44]), produces numerous CARTs and adopts an ensemble approach to obtain the final output from the resultant trees. It incorporates two randomizations to overcome the major limitation of CART, which is the sensitivity to training data configuration, which often results in overfitting. By developing many independent trees from different sets of training samples and input variables, RF tries to provide relatively unbiased results [45][46][47]. RF has proven useful in various remote sensing tasks for both classification and regression [48][49][50][51]. RF also provides information on how input variables contribute to a given task. It calculates the decrease in accuracy using out-of-bag samples when an input variable is perturbed [52][53][54]. The larger the decrease in accuracy, the more significant the variable. RF can provide results in probability form from the ensemble approach.
MLL is a type of linear classification model for predicting the logarithmic form of a dependent variable [55,56]. MLL can be calculated by Equation (1): where y is the instant target variable, i is each input variable, ω is the weight of each variable, and the ƒ(X) is the value of each variable [55][56][57]. MLL works well when a dependent variable has complex relations with the input variables. The coefficients associated with input variables are determined through an advanced coefficient fitting approach, named neural networks. Since the coefficients of Through these processes, a total of 53,760 icing and 117,148 non-icing pixels for COMS, and 5506 icing and 12,352 non-icing pixels for Himawari-8, respectively, were extracted as reference data. The reference data were divided into five groups based on icing and non-icing cases, not individual pixels, and a five-fold cross validation was conducted to evaluate the machine-learning-based icing detection models.

Machine Learning
Random forest (RF) has proven very effective for several meteorological satellite-based applications such as the detection of convective initiation and overshooting tops, and rainfall rate estimation [29][30][31]41]. The multinomial log-linear model (MLL) is an improved version of logistic regression that incorporates an artificial neural network approach for parameter optimization [42]. Logistic regression has often been used in meteorological satellite applications [30,31] and operational systems, such as to monitor rapid development thunderstorms (RDT) using Spinning Enhanced Visible and Infrared Imager (SEVIRI) satellite data [43]. Thus, in this study, the two machine learning classification approaches-RF and MLL via neural networks-were used to develop icing detection models. RF, which is based on classification and regression trees (CART; [44]), produces numerous CARTs and adopts an ensemble approach to obtain the final output from the resultant trees. It incorporates two randomizations to overcome the major limitation of CART, which is the sensitivity to training data configuration, which often results in overfitting. By developing many independent trees from different sets of training samples and input variables, RF tries to provide relatively unbiased results [45][46][47]. RF has proven useful in various remote sensing tasks for both classification and regression [48][49][50][51]. RF also provides information on how input variables contribute to a given task. It calculates the decrease in accuracy using out-of-bag samples when an input variable is perturbed [52][53][54]. The larger the decrease in accuracy, the more significant the variable. RF can provide results in probability form from the ensemble approach. MLL is a type of linear classification model for predicting the logarithmic form of a dependent variable [55,56]. MLL can be calculated by Equation (1): where y is the instant target variable, i is each input variable, ω is the weight of each variable, and the ƒ(X) is the value of each variable [55][56][57]. MLL works well when a dependent variable has complex relations with the input variables. The coefficients associated with input variables are determined through an advanced coefficient fitting approach, named neural networks. Since the coefficients of the normalized regression model imply the relative contribution of the input variables, they can be used to analyze the relative importance of the variables. In this study, MLL is implemented using an 'nnet' package in R, which builds a feed-forward single-hidden-layer neural network to fit MLL [57].

Existing Satellite-Based Icing Detection Algorithms
In the US, the FIT algorithm is used to identify potential icing clouds based on the physical properties of the icing [15,58]. The FIT algorithm consists of condition-based rules using CP and COT to produce a binary icing mask (Table 4). KMA also proposed an icing detection system using COMS MI products over the Northeast Asia region based on the physical characteristics of icing [22]. Similar to the FIT algorithm, the KMA algorithm consists of condition-based rules based on level-1b data and their differences (BTD) ( Table 4).

Accuracy Assessment
The performance of icing detection algorithms was evaluated using five-fold cross-validation based on a 2 × 2 contingency table, helping to quantify the detection probability of models [29][30][31][59][60][61] (Table 5) ). In addition to the performance indices, the standard error (E), which is the standard deviation divided by the total fold number, was also used to compare model stability.

Variable Importance
The relative variable importance identified by the RF and MLL models is shown in Figure 3. The mean decrease accuracy (MDA) values calculated when a variable is randomly permuted were used to identify significant variables in the RF model, while the absolute normalized coefficient values were used in the MLL model [62,63]. It should be noted that these metrics suggest the relative importance of variables in estimating the target variable, not the absolute indicator of variable significance.

Variable Importance
The relative variable importance identified by the RF and MLL models is shown in Figure 3. The mean decrease accuracy (MDA) values calculated when a variable is randomly permuted were used to identify significant variables in the RF model, while the absolute normalized coefficient values were used in the MLL model [62,63]. It should be noted that these metrics suggest the relative importance of variables in estimating the target variable, not the absolute indicator of variable significance.  In COMS models, while CTH was identified as being very significant for both the RF and MLL models, visible, IR2, COT, and CTT were given different relative importances from model to model (Figure 3a,b). According to the cloud product retrieval algorithm of COMS, CTH is produced using CTT and IR channels, and COT is retrieved from the visible channel [34]. Thus, the significant input variables in the COMS models can be grouped into two groups-a long-wave infrared (LWIR) group (i.e., CTH, CTT, and IR) and a visible (VIS) group (COT and visible)-which have been frequently used in icing studies [10,15,20]. While the LWIR and VIS groups are generally identified as important by RF, the LWIR group was only considered significant by the MLL models. This might be due to the skewed distribution of the variables from the VIS group. Visible-based variables are sensitive to thin clouds, but they tend to saturate with thicker clouds [34]. Thus, it is not necessary to have high coefficients for the normalized variables (i.e., sensitive to much smaller values in the range of 0-1) in the MLL model.
The variable importance given by the Himawari-8 models is somewhat different to that of the COMS models. Near-infrared (NIR) channels (i.e., CH05, CH07, and CH08) identified as the most significant variables by the RF model, while the LWIR channels (i.e., CH11, CH13, and CH14) were considered very important in the MLL model. NIR channels are widely used as a source to produce CP and cloud particle size data [33][34][35][64][65][66][67], and LWIR channels are related to the cloud top temperature and the amount of water vapor, which are the important input variables of the FIT algorithm [15].

Model Performance
The RF-and MLL-based icing detection models were developed using PIREP-based reference data for COMS and Himawari-8. Since the existing FIT and KMA models require level 2 products to generate icing masks, they were tested only with COMS data. Figure 4 shows the accuracy assessment results for the two theoretical (i.e., existing) models and the four machine learning models. When the COMS data were used, the RF model yielded the highest POD (~87.1%), OA (~79.5%), and TSS (~62.9%) among the four tested models, followed by the MLL model. However, the RF model resulted in a relatively high POFD (~24.3%) compared to the KMA model (7.6%). The standard errors of the accuracy metrics (i.e., POD, OA, and TSS) of the RF model were slightly higher than those of the MLL model, which implies that MLL showed more consistent performance than RF for icing detection. This might be due to the small number of samples for RF, resulting in slight overfitting for some folds. On the other hand, the rule-based models (i.e., FIT and KMA) consistently performed worse than the RF and MLL models, resulting in low standard errors (average~3.5% indicating stable performance by fold) with OAs of 67.1% and 67.5%, and TSSs of 27.1% and 10.0%, respectively. Overall, the RF icing detection model resulted in the best performance among the four models for COMS.
When the Himawari-8 data were used, the results were similar to the COMS-based models; the RF model produced better performance than the MLL model. However, the accuracy difference between the RF and MLL models for Himawari-8 was smaller than that for COMS. The standard errors of the performance metrics for the Himawari-8 models were much higher than those for COMS-based models (Figure 4). For example, the standard errors of the mean TSS for the machine-learning-based COMS models were 6.5% and 5.1% (i.e., for the RF and MLL models, respectively), while those for the Himawari-8 models were 14.5% and 9.6%. Such differences in the standard errors are possibly due to the much smaller number of samples for Himawari-8 than COMS. This implies that the RF and MLL models for Himawari-8 might not be robust, often resulting in varied performance for the samples even though they produced high accuracy.
icing samples for the Himawari-8-based models were generally smaller than those for COMS-based models. The RF models produced a more scattered icing pattern in a number of small patches, while the MLL models showed more clustered icing regions. These patterns were consistently found in other cases (not shown). Although all machine-learning-based models produced high PODs, the MLL models did not correctly identify the icing/non-icing cases shown in Figures 5 and 6. In summary, the COMS RF model and the Himawari-8 RF model resulted in the best performance in both the quantitative and qualitative analyses.  It is not appropriate to directly compare these results to those from other studies as different data were used. However, the accuracy metrics from the proposed approaches are higher than those from the literature. For example, Choi et al. [22] evaluated the KMA model using COMS-PIREP data, resulting in a POD of 57.9%, a POFD of 27.5%, an OA of 72.46%, and a TSS of 30.4%. Similarly, the FIT model was evaluated using GOES-PIREP data, yielding a POD of 62%, a POFD of 39%, an OA of 58%, and a TSS of 4% in the daytime [15]. PIREPs were used as reference data in both studies. Figures 5 and 6 depict the distribution of the potential icing areas produced from the six models as examples for selected icing and non-icing PIREP cases. Among the six models, the FIT model identified more low cloud areas as potential icing regions compared to the other models. This pattern often occurred when the phase of low clouds was SLW for other cases (not shown). However, such a pattern was not found where the cloud phase was neither water nor mixed.
Among the empirical machine learning models, the Himawari-8-based models classified less areas as potential icing clouds than the COMS based models when the same algorithm was used. This is possibly because a smaller number of training samples were available for the Himawari-8 models (Table 3), which often results in overfitting [62,63]. In particular, the ranges of the input values of the icing samples for the Himawari-8-based models were generally smaller than those for COMS-based models. The RF models produced a more scattered icing pattern in a number of small patches, while the MLL models showed more clustered icing regions. These patterns were consistently found in other cases (not shown). Although all machine-learning-based models produced high PODs, the MLL models did not correctly identify the icing/non-icing cases shown in Figures 5 and 6. In summary, the COMS RF model and the Himawari-8 RF model resulted in the best performance in both the quantitative and qualitative analyses. Remote Sens. 2018, 10, x FOR PEER REVIEW 12 of 19

Temporal Variation of Icing
The spatial-temporal distribution of potential icing clouds was examined using COMS and Himawari-8 time-series data for two selected icing cases. The COMS RF and Himawari-8 RF models yielding the best performance were used to monitor potential icing areas (Figures 7 and 8). A 15 min interval was used for the COMS RF model, while a 10 min interval was applied for the Himawari-8 RF model. Although the Himawari-8 RF model detected more areas with relatively warm clouds (Tb ~ 270 K) as potential icing regions and depicted a relatively sharper pattern, there was no abrupt change in the spatial distribution of the time-series icing maps. Similarly, the COMS RF results for

Temporal Variation of Icing
The spatial-temporal distribution of potential icing clouds was examined using COMS and Himawari-8 time-series data for two selected icing cases. The COMS RF and Himawari-8 RF models yielding the best performance were used to monitor potential icing areas (Figures 7 and 8). A 15 min interval was used for the COMS RF model, while a 10 min interval was applied for the Himawari-8 RF model. Although the Himawari-8 RF model detected more areas with relatively warm clouds (Tb~270 K) as potential icing regions and depicted a relatively sharper pattern, there was no abrupt change in the spatial distribution of the time-series icing maps. Similarly, the COMS RF results for the 5 January 2016 03:30 UTC series showed no drastic changes over time. This implies that both models are relatively stable considering the typical temporal variation of atmospheric conditions. In addition, potential icing areas often occur and decay rapidly in small patches, which indicates that icing detection can greatly benefit from satellite observations from geostationary sensor systems with a high temporal resolution (~10 min).
Remote Sens. 2018, 10, x FOR PEER REVIEW 14 of 19 the 5 January 2016 03:30 UTC series showed no drastic changes over time. This implies that both models are relatively stable considering the typical temporal variation of atmospheric conditions. In addition, potential icing areas often occur and decay rapidly in small patches, which indicates that icing detection can greatly benefit from satellite observations from geostationary sensor systems with a high temporal resolution (~10 min).   the 5 January 2016 03:30 UTC series showed no drastic changes over time. This implies that both models are relatively stable considering the typical temporal variation of atmospheric conditions. In addition, potential icing areas often occur and decay rapidly in small patches, which indicates that icing detection can greatly benefit from satellite observations from geostationary sensor systems with a high temporal resolution (~10 min).

Novelty and Limitations
This study has proposed several machine-learning-based approaches for the detection of potential icing clouds, and compared their performance to the existing physical-theory-based models. The proposed models produced better performance than the existing models, which are based on simple rule-based approaches for the reference data available in this study. The superiority of the proposed models is partly due to the uncertainty of the satellite-derived products used in the existing physical-theory-based models. For example, the fraction correct for COMS CPH was reported as 0.642 when compared to Moderate Resolution Imaging Spectroradiometer (MODIS) data [66]. Similarly, the COMS COT yielded a root mean square error (RMSE) of 3.45 when compared to the MODIS COT [67]. Using such products as inputs for the threshold-based rules may result in a significant increase in uncertainty. The simple threshold-based rules used in the existing models might not be sufficient to model the complex characteristics of potential icing areas. By using additional input variables other than those used in the existing models, the proposed approaches were able to provide more accurate results with more sophisticated modeling techniques. The stable distribution of potential icing time-series confirmed the robustness of the proposed approaches.
However, there are several limitations to this study. First of all, the proposed models are heavily dependent on the reference data (i.e., PIREPs). Although careful quality control was conducted to secure accurate icing and non-icing samples, there is no guarantee that the selected icing and non-icing samples cover all possible icing and non-icing clouds. In particular, it is possible that potential icing areas may be missed simply because there is no visually identifiable icing phenomenon, which often leads to the underestimation of icing regions [68]. Not only the quality of the PIREPs, but also the number of PIREPs, is important to develop successful machine-learning-based icing detection models. For example, there are many PIREPs (more than 20,000) over CONUS for winter from 2006-2008 [15], which is ideal for the development of machine-learning-based models. However, only a small number of icing PIREPs are available in East Asia, which is a major limitation of this research. Such a small sample size often results in overfitting [62,63]. Since satellite-derived products typically provide information on the characteristics of cloud tops, it should be noted that PIREP-based icing information collected far below the cloud tops is not always closely related to satellite-derived cloud top products.

Conclusions
In this study, several machine-learning-based icing detection models were proposed and compared to the existing physical-theory-based models using COMS and Himawari-8 satellite data. Two machine learning approaches-RF and MLL-were used to develop the icing detection models. Both machine learning-based models resulted in better performance (POD of 68-82% and POFD of 16-18%) than those of the existing physical-theory-based models (POD of 12-36% and POFD of 7-27%) when COMS data were used. Overall, the RF models produced better performance than the MLL models according to the five-fold cross validation. However, the RF models resulted in higher standard errors in the performance metrics than the MLL models, which implies a tendency for overfitting. The variable importance identified by the machine learning models generally followed the physical characteristics of icing, e.g., resulting in high contributions by cloud top height.
Icing occurs in a variety of meteorological conditions, and thus a number of reference data collected from such conditions are required to develop more robust icing detection models. Although a small amount of qualified PIREP data, which might only cover a subset of those conditions, were used to develop the machine-learning-based icing detection models in this study, the results are promising. Therefore, the models deserve further research with more qualified reference data in order for governmental agencies to adopt such an approach from an operational perspective. The proposed approaches will be refined when more quality-controlled PIREP data become available. They will also be evaluated and optimized for the Geo-Kompsat-2 Advanced Meteorological Imager, which will be launched later in 2018 by the Korean government. In addition, the synergistic use of empirical machine learning and physical-theory-based models will be investigated to improve the detection of potential icing areas in the future.