4.1. Selection of Interest Fields
Table 5 shows the Interest Fields selected according to their relative variable importance for CC cell detection with the LR and RF models for Los Mochis and Mexico City. In both study areas, the principal component analysis (PCA) showed that five principal components collectively explain 95% of the total variance in the datasets, which means that the dimensionality of the datasets can potentially be reduced to five components while still retaining most of the information in the original dataset. Here, the first five Interest Fields with the highest weighting coefficients in LR and the highest MDI values in RF were selected. In both cases, the Interest Fields contributing most to the detection of CC cells coincided in four predictor variables: CtT, CtH01, CtH02, and CtH03. Accordingly, the dimensionality of the original datasets was reduced from 12 to 6 components.
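The component count quoted above follows from the cumulative explained-variance ratio of the PCA. A minimal sketch of that counting step is given below, assuming the per-component ratios have already been computed. The helper name `components_for_variance` is hypothetical, and the ratios are illustrative placeholders, not the values obtained in this study.

```python
from itertools import accumulate

def components_for_variance(ratios, target=0.95):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `target`."""
    ordered = sorted(ratios, reverse=True)  # PCA orders components by variance
    for count, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative >= target - 1e-12:  # small tolerance for float rounding
            return count
    raise ValueError("target variance is never reached")

# Illustrative ratios for 12 predictors (made-up numbers that sum to 1).
example_ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.01, 0.01,
                  0.005, 0.005, 0.004, 0.003, 0.003]
n_components = components_for_variance(example_ratios)
```

With these placeholder ratios the first four components reach 0.90 and the fifth brings the total to 0.96, so the function returns five components.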
The Interest Fields that recur in both models and zones are the most significant for the detection of deep convective clouds. For example, the Interest Field CtT, corresponding to the cloud-top temperature values detected using ABI-GOES channel 14 (11.2 µm TB), is among those with the highest degree of contribution at both study sites, because a cloud top that is colder than the surrounding environment is a typical property of deep convective clouds. During convection, this type of cloud rises to altitudes where the atmospheric pressure is lower, giving way to adiabatic cooling, in which the rising air expands and cools as it moves upward through the atmosphere [
78], which results in significantly lower cloud-top temperatures. CtT is an Interest Field whose high importance recurs in other works; for example, Refs. [
3,
17,
18] reported that CtT greatly contributed to the identification of Convective Initiation (CI) events (i.e., the probability that a given cumulus cloud object will develop into a ≥35-dBZ-intensity radar echo at −10 °C altitude [
9]). In addition, in [
28,
79], CtT was identified as the variable contributing most to the classification of overshooting convective cloud tops (OTs) using RF and LR models.
The following Interest Fields in order of importance are CtH01, CtH02, and CtH03. These spectral differences between ABI-GOES channels provide information on cloud-top height (cloud depth [
3]). Similar to CtT, CtH01 is one of the variables with a higher degree of contribution in the detection of CC cells. This Interest Field is used to determine lower stratospheric moisture, and it is positive when water vapor is present above the cloud tops, which is an indicator of the presence of OTs [
28]. For the glaciation indicators, only CtG01 was determined as one of the most significant variables.
Although the time-trend variables proposed as Interest Fields provide information on the rate of vertical cloud-top growth, these predictors generally made relatively low contributions in both the LR and RF models (except TChCtH03 at the Mexico City site).
The distributions of the Interest Fields selected after the feature importance analysis for Los Mochis and Mexico City are shown in
Figure 6 and
Figure 7, respectively. For the Los Mochis area, the CC and noCC sets are statistically differentiable in the six selected Interest Fields, which eases the classification task for the ML models. For the CC class, there are extremely small interquartile ranges (IQRs) with a median close to 0 in the case of the spectral difference fields, and in that sense, ref. [
28] claims that as a cloud reaches its local equilibrium level or the height of the tropopause, all channel differences will be almost zero. On the other hand, with the noCC class, the data distribution is much broader. In the case of the cloud-top temperature (CtT) field, the noCC cells present higher values because the tops of the deep convection clouds are colder due to adiabatic cooling. In the case of the spectral difference fields, the noCC values are mostly negative.
For the Mexico City case study, the boxplots of the selected Interest Fields show overlapping distributions, which are reflected in the performance metrics of the ML models; these are notably lower than those for Los Mochis (
Section 4.2). However, the general trends are preserved, and the CtT values are lower for the CC class; moreover, in the spectral difference fields of both height and glaciation, the noCC values are lower. Concerning the IQRs of the predictors, the distributions are more balanced for both classes, which, together with the distribution overlap, allows us to infer greater complexity in the modeling of CC events than at the Los Mochis site. Returning to the CtT Interest Field, its high degree of contribution to the detection of the CC class has been clearly shown, and cells with lower temperature values are associated with the presence of potential storm clouds. In this context, ref. [
80] lists some cloud-top temperature thresholds reported in studies related to the identification of deep convection clouds, for example, 241 and 221 K [
81], 221 K [
82], 225 K [
83], 255 and 206 K [
84]. Here, it was estimated that storm clouds manifest at a CtT threshold of <220 K for the Los Mochis study site and <260 K for Mexico City.
4.2. Performance and Validation of Detection Models
Each ML approach was evaluated on the basis of six performance metrics (POD, FAR, Acc, BIAS, CSI, and F1) that characterize each model’s ability to detect convective cells, its false alarm ratio, and related properties. The effect of a post-processing filter based on lightning incidence was also evaluated.
Figure 8 and
Figure 9 show the results for datasets corresponding to Los Mochis and Mexico City, respectively.
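For reference, the six metrics can all be derived from the 2 × 2 contingency table of the CC/noCC classification. The sketch below assumes the usual forecast-verification definitions (the section does not restate the formulas). The function name and the counts in the example are hypothetical.

```python
def detection_metrics(tp, fp, fn, tn):
    """Contingency-table scores for a binary CC/noCC classifier.

    tp: CC cells predicted as CC      fp: noCC cells predicted as CC
    fn: CC cells predicted as noCC    tn: noCC cells predicted as noCC
    """
    total = tp + fp + fn + tn
    return {
        "POD": tp / (tp + fn),              # probability of detection (recall)
        "FAR": fp / (tp + fp),              # false alarm ratio
        "CSI": tp / (tp + fp + fn),         # critical success index
        "BIAS": (tp + fp) / (tp + fn),      # frequency bias, optimum is 1
        "Acc": (tp + tn) / total,           # overall accuracy
        "F1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
    }

# Hypothetical counts, chosen only to illustrate the arithmetic.
scores = detection_metrics(tp=80, fp=20, fn=20, tn=80)
```

In this example, the number of false alarms equals the number of misses, so BIAS is exactly 1 even though POD is 0.8 and FAR is 0.2. This is why BIAS is read together with the other scores.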
For the Los Mochis site, there is a small difference between the results of the ML approach (ML series) and those with the integrated post-processing filter (ML + LF). In this regard, the ability of the models to detect the CC class, represented by the POD metric, presents values close to 0.8, while the false alarm ratio (FAR) in all the ML approaches is lower than 0.2. This means that the presence of potential deep convection events can be efficiently detected without the need to integrate the lightning incidence data. However, this does not indicate that the presence of lightning is uncommon in this zone; on the contrary, Ref. [
85] reported a significant lightning strike density in northwestern Mexico. The region with the highest lightning activity is the narrow strip between the Sierra Madre Occidental and the Gulf of California, with a maximum in the months of July and August, corresponding to the dates used to construct the datasets of this research work. In addition, lightning strikes from the GLM-GOES were detected in 8 of the 14 images used to construct the training and test datasets.
In contrast, the Mexico City site shows significant changes when the Lightning Filter (LF) is integrated, especially favoring the POD, CSI, and F1 metrics. For example, the highest value was POD = 0.68, obtained using RF, which increased to POD = 0.72 after post-processing. The second-highest POD was achieved with LR, which increased from POD = 0.63 to POD = 0.7 with the integration of the Lightning Filter. Compared with the Los Mochis dataset, in Mexico City the integration of GLM data significantly improves the system’s ability to detect the CC class in all ML approaches because lightning strike incidence is a reliable indicator of deep convection, and, like the Los Mochis site, this zone has intense lightning activity. In addition to POD, the CSI and F1 metrics increased in the ML approaches with lower overall performance, such as LRstacking and Soft-Voting. On the other hand, LF has a negative impact on the BIAS metric, increasing the values above the optimum of 1. This metric ranges from 0 to infinity and allows us to assess whether the forecasting method tends to underestimate the CC class (BIAS < 1) or overestimate it (BIAS > 1 [
86]). In this sense, the values greater than 1 obtained when integrating the Lightning Filter could reveal one of the limitations of labeling convective cells with the MODIS sensor, which has a lower resolution than the ABI-GOES products, since there are areas labeled as noCC that present lightning activity, which clearly indicates a convective environment. Therefore, future work will consider unsupervised learning approaches that simplify the task of generating a reference set with another sensor.
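The joint rise in POD and BIAS can be illustrated with a toy example. This section does not spell out the filter's exact rule, so the sketch below assumes the simplest form: any cell with detected lightning is relabeled as CC. The function names and the eight-cell arrays are hypothetical.

```python
def apply_lightning_filter(predicted, lightning):
    """Promote a cell to CC (1) wherever lightning was observed.
    Assumed rule: the source does not give the exact filter logic."""
    return [1 if flash else label for label, flash in zip(predicted, lightning)]

def pod_and_bias(predicted, reference):
    """Return (POD, BIAS) for binary labels where 1 = CC and 0 = noCC."""
    pairs = list(zip(predicted, reference))
    tp = sum(p == 1 and r == 1 for p, r in pairs)
    fp = sum(p == 1 and r == 0 for p, r in pairs)
    fn = sum(p == 0 and r == 1 for p, r in pairs)
    return tp / (tp + fn), (tp + fp) / (tp + fn)

# Toy data: one flash falls on a missed CC cell, one on a cell labeled noCC.
reference = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
lightning = [0, 0, 1, 0, 0, 1, 0, 0]

filtered = apply_lightning_filter(predicted, lightning)
before = pod_and_bias(predicted, reference)
after = pod_and_bias(filtered, reference)
```

Under this assumed rule, the flash over a missed CC cell raises POD, while the flash over a cell labeled noCC in the reference adds a false alarm and pushes BIAS above 1, which mirrors the behavior described for Mexico City.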
Overall, for the Los Mochis dataset, no significant differences were found between the use of model ensembles and simpler approaches such as LR, which is consistent with the findings of [
17,
23,
28], who reported that this model is a robust alternative for convective-hazard modeling and detection. In contrast, for the Mexico City dataset, significant differences in the performance of each ML approach were observed. These differences vary by metric, but it is apparent that LRstacking, RFstacking, and Soft-Voting are the ML approaches with the poorest performance. By contrast, LR, RF, MLP, Bagging, and Hard-Voting showed similar performance across all six metrics.
The acceptable level of POD ultimately depends on the specific goals and requirements of the forecasting or detection system. However, a high POD may be necessary to minimize the risk of missing hazardous events. For the Los Mochis dataset, the highest value was POD = 0.84, while for the Mexico City dataset it was POD = 0.72, both obtained with LR after post-processing filtering. In this context, Refs. [
28] and [
29] report performances of POD = 0.77 and POD = 0.79, respectively, in their work on OT detection, while [
3] and [
2], who studied CI detection, obtained values of POD ≈ 0.8 and POD > 0.7. Other examples are [
87], whose proposed pre-convective-environment alerting and monitoring system obtained a POD between 0.66 and 0.7, and [
88], whose CNN-based convective storm nowcasting achieved POD values close to 0.7.
The FAR metric indicates the fraction of positive (CC) predictions that are incorrect. In the case of Los Mochis, the FAR values were below 0.2 and did not vary significantly among the different ML approaches, with values between 0.16 and 0.18. In contrast, in Mexico City, significant variations were observed, with the highest values obtained for LRstacking and RFstacking (FAR = 0.51). For the rest of the ML approaches, LF increased the number of false alarms, but to a lesser extent. For example, for LR, a change from FAR = 0.40 to 0.42 was estimated, whereas for RF, the change was from FAR = 0.41 to 0.43. In the literature, FAR values vary significantly depending on the type of convective forecasting performed. For example, [
28] and [
29] reported FAR values of 0.3 and 0.09, respectively, whereas [
89] obtained FAR values between 0.01 and 0.18. For CI detection, ref. [
2] estimated FAR values between 0.46 and 0.83; [
3] a FAR ≈ 0.2; in [
17] the FAR ranges between 0.22 and 0.36; and in [
18] 0.46 to 0.37.
Accuracy (Acc) provides an overview of how effective a model is at correctly classifying samples. In the Los Mochis dataset, there are no significant differences between the Acc values of the ML approaches, which are generally slightly higher than 0.7. In this case, the addition of LF increases the Acc value, but not substantially. In Mexico City, the lowest values were recorded for LRstacking and RFstacking with Acc = 0.62, while LR, RF, Bagging, and Hard-Voting had Acc values of 0.7. This metric is susceptible to bias in datasets with class imbalance because it does not consider the distribution of classes in the dataset. Therefore, it is important to complement the performance analyses of the models with other performance metrics. Here, the Acc values for Los Mochis and Mexico City were more similar than those of the other metrics because there was more class imbalance in the Los Mochis dataset (CC class = 69%, noCC class = 31%) than in the Mexico City dataset (CC class = 42%, noCC class = 58%).
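The sensitivity of accuracy to class imbalance can be made concrete with the class proportions quoted above. The sketch below computes the accuracy of a trivial classifier that always predicts the most frequent class. This baseline is an illustrative reference point added here, not a result reported in the study, and the helper name is hypothetical.

```python
def majority_baseline_accuracy(class_fractions):
    """Accuracy of a trivial classifier that always predicts the most
    frequent class: it is right exactly as often as that class occurs."""
    return max(class_fractions.values())

# Class proportions as stated in the text.
los_mochis = {"CC": 0.69, "noCC": 0.31}
mexico_city = {"CC": 0.42, "noCC": 0.58}

baseline_los_mochis = majority_baseline_accuracy(los_mochis)    # 0.69
baseline_mexico_city = majority_baseline_accuracy(mexico_city)  # 0.58
```

Because the baseline differs between the two datasets, the same Acc value does not represent the same gain over a trivial predictor at both sites, which is one reason to read Acc alongside POD, FAR, CSI, and F1.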
BIAS is another common metric in studies related to the detection of convective events (e.g., [
86]), as well as in the forecasting of convective hazards such as lightning strikes (e.g., [
24]) because it evaluates whether the forecast method tends to underestimate or overestimate the CC occurrence. The estimated BIAS values for Los Mochis range from 0.97 to 1.03, showing a relatively good balance between false alarms and misses. In contrast, in Mexico City, there is an important variation between the ML approaches, with values above 1.15 for RF, which rise above 1.25 after LF. The analysis of different metrics allows for the evaluation of the forecasting framework proposed in this study. In this sense, the integration of data from the GLM increases the overestimation rate of the CC class with respect to the reference target dataset, but it favors the ability of the system to detect the occurrence of CC.
CSI measures the accuracy of a forecast in predicting the occurrence of a specific event (e.g., severe thunderstorms) relative to observations. It is particularly useful in situations where false alarms or missed events can have significant consequences, such as severe weather forecasting. For the Los Mochis dataset, CSI values close to 0.7 were found; a CSI value greater than 0.7 indicates a good classification model. This means that the model is reasonably accurate in identifying positive cases while minimizing false alarms and misses. The CSI results for the Mexico City dataset range between 0.4 and 0.5, with a slight increase after LF, which is larger for the worst-performing approaches such as LRstacking, RFstacking, and Soft-Voting. A CSI between 0.4 and 0.7 indicates a moderate level of success in the classification task. Although the model makes correct positive predictions, there is scope for improvement in reducing false alarms.
Regarding the F1 metric, a high F1 score (closer to 1) indicates that the model has high levels of precision and recall. In other words, it correctly classifies positive instances while avoiding false positives and false negatives. From the results for Los Mochis, it can be concluded that all ML approaches can correctly classify positive cases while controlling false positives and false negatives. Similar to the CSI metric, in the Mexico City dataset, the post-processing LF improves the F1 score of the models with the lowest performance, whereas for LR, RF, MLP, Bagging, and Hard-Voting, the F1 values are estimated at F1 = 0.63.
Figure 10 presents the Receiver Operating Characteristic (ROC) curves for the ML approaches for the Los Mochis and Mexico City datasets. The ROC curves demonstrate the trade-off between sensitivity and specificity at different threshold values. A curve closer to the upper-left corner indicates superior performance because the classifier achieves high true-positive rates while maintaining low false-positive rates. For Los Mochis, homogeneous behavior was observed among the ML approaches, except for RFstacking, which showed a lower discrimination capacity. The values of the area under the ROC curve (AUC) are higher than 0.7, which means that, in general, the ML approaches have an acceptable discrimination capacity between classes. In the case of Mexico City, the ROC curve analysis revealed that LR, RF, MLP, Bagging, and Hard-Voting consistently outperformed the other classifiers across various threshold values. Their ROC curves exhibit a steeper ascent, indicating better discrimination between CC and noCC cases. This result suggests that these ML approaches are well suited for CC detection in this particular dataset. However, it is essential to consider the information provided by the other metrics.
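The AUC summarizes the whole ROC curve in one number: it equals the probability that a randomly chosen CC sample receives a higher score than a randomly chosen noCC sample. The sketch below computes the AUC directly from that pairwise interpretation. The function name, scores, and labels are hypothetical and do not come from the study's classifiers.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (CC, noCC) pairs in which the CC sample has
    the higher score; tied scores count as half a win."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical classifier scores: one CC sample (0.4) ranks below a noCC sample (0.5).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(scores, labels)
```

Here eight of the nine CC/noCC pairs are ordered correctly, so the AUC is 8/9. A value of 0.5 would correspond to random ranking, and a value of 1 to perfect separation.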
Examples of a deep convection event simulated from each ML approach for Los Mochis and Mexico City are shown in
Figure 11 and
Figure 12, respectively; this comparison allows for a qualitative evaluation of the performance of each model, while the IoU metric indicates the overlap between the predicted regions and the ground truth (
Table 6).
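The IoU used for this comparison is the ratio between the number of pixels labeled CC in both masks and the number labeled CC in at least one of them. A minimal sketch for binary masks stored as nested lists is given below. The function name and the small masks are hypothetical, and returning 1.0 when both masks are empty is a convention chosen here.

```python
def intersection_over_union(predicted, reference):
    """IoU between two binary masks of equal shape (1 = CC, 0 = noCC)."""
    cells = [(p, r) for prow, rrow in zip(predicted, reference)
             for p, r in zip(prow, rrow)]
    intersection = sum(p & r for p, r in cells)  # CC in both masks
    union = sum(p | r for p, r in cells)         # CC in at least one mask
    if union == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return intersection / union

# Hypothetical 3 x 4 masks: the prediction misses one edge pixel and adds one.
reference_mask = [[0, 1, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 0]]
predicted_mask = [[0, 1, 1, 1],
                  [0, 0, 1, 0],
                  [0, 0, 0, 0]]
iou = intersection_over_union(predicted_mask, reference_mask)
```

Three pixels are shared and five are CC in at least one mask, so the IoU is 0.6. Both the missed pixel and the spurious pixel lower the score, so IoU penalizes underestimation and overestimation of the cloud area alike.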
In the case of Los Mochis, the simulation of the event that occurred on 15 August 2018 shows a tendency to underestimate the total area of the deep convective cloud at the cloud edges, with the exception of LR, which recorded the highest value (IoU = 0.86). The presence of false-negative pixels at the cloud edge indicates the difficulty of simulating this transition zone, where cloud properties estimated from ABI-GOES data become diffuse. Therefore, future work will consider adding a transition class between CC and noCC. In most ML approaches, except RF, a convective core is simulated in the lower-right part of the simulation domain that does not match the reference labels for this event, which could be related to a better ability to detect deep convection events from ABI-GOES data than from MODIS.
The ensemble models LRstacking and RFstacking show the worst performance, with IoU = 0.44 and IoU = 0.62, respectively. In these ML approaches, a pronounced salt-and-pepper effect is observed, which usually occurs in pixel-based classification tasks when no contextual spatial information is provided [
90]. This pattern is also present in the simulated event for Mexico City (
Figure 12), where LRstacking, RFstacking, and Soft-Voting also show this effect, which is related to their low performance metrics on the test dataset. In this event, two well-separated clouds can be seen forming over the areas of higher topographic elevation, which, together with the presence of nuclei with lower cloud-top temperature, allows us to infer that their formation was due to a forced convection process. However, as in Los Mochis, all models show difficulties in simulating the edge of the clouds, with the difference being that, in this case, the areas of both convective clouds are overestimated, with false-positive pixels present.
As in the previous case, this area of opportunity can be addressed by adding an additional transition class or using alternative approaches that maintain the spatial structure, such as Convolutional Neural Networks (CNN) or other spatially aware models.