Lightning Nowcasting Using Dual-Polarization Weather Radar and Machine Learning Approaches: Evaluation of Feature Engineering Strategies and Operational Integration

Alves, Marcos Antonio; Molina, Rosana Alves; Oliveira, Bruno Alberto Soares; Calvo, Daniel; Araujo Filho, Marcos Cesar Andrade; Ferreira, Douglas Batista da Silva; Santos, Ana Paula Paes; Saraiva, Ivan; Pinto, Osmar; Daher, Eugenio Lopes

doi:10.3390/cli13080168


by Marcos Antonio Alves ¹,*, Rosana Alves Molina ¹, Bruno Alberto Soares Oliveira ¹, Daniel Calvo ¹, Marcos Cesar Andrade Araujo Filho ¹, Douglas Batista da Silva Ferreira ², Ana Paula Paes Santos ², Ivan Saraiva ³, Osmar Pinto, Jr. ⁴ and Eugenio Lopes Daher ¹

¹ FITec Technological Innovations, Belo Horizonte 30140-150, Brazil
² ITV Vale Institute of Technology, Belém 66055-090, Brazil
³ Operations and Management Center of the Amazon Protection System, Tarumã, Manaus 69049-630, Brazil
⁴ INPE National Institute for Space Research, São José dos Campos 12227-010, Brazil
* Author to whom correspondence should be addressed.

Climate 2025, 13(8), 168; https://doi.org/10.3390/cli13080168

Submission received: 23 May 2025 / Revised: 1 August 2025 / Accepted: 2 August 2025 / Published: 14 August 2025


Abstract

Lightning nowcasting is crucial for ensuring safety and operational continuity in weather-exposed industries such as mining. This study evaluates three machine learning (ML)-based approaches for predicting lightning using dual-polarization weather radar data collected in the eastern Amazon, Brazil. The approaches extend the literature through three feature engineering strategies: (i) grouping radar variables by temperature layers, (ii) computing statistical summaries at key altitudes, and (iii) analyzing all 18 levels of reflectivity data combined with Principal Component Analysis (PCA) dimensionality reduction and ensemble models. For each approach, models such as Random Forest, Support Vector Machines, and XGBoost were trained and tested using data from 2021–2022 with class balancing and feature engineering techniques. Among the approaches, the PCA-based ensemble achieved the best generalization (recall = 0.89, F1 = 0.77), while the layer-based method had the highest recall (0.97), and the altitude-based strategy offered a computationally efficient alternative with competitive results. These findings confirm the predictive value of radar-derived features and emphasize the role of feature representation in model performance. Additionally, the best model was integrated into the operational LEWAIS alert system, and four integration strategies were tested. The strategy that combined alerts from both the ML and LEWAIS systems reduced the failure-to-warn rate to 0.0531 and increased the lead time to 10.18 min, making it well suited to safety-critical applications. Overall, the results show that ML models based solely on radar inputs can achieve robust lightning nowcasting, supporting both scientific advancement and industrial risk mitigation.

Keywords:

lightning forecasting; weather radar data; machine learning; polarimetric variables

1. Introduction

Lightning discharges pose significant risks to human safety and industrial operations, particularly in open environments such as mining areas. Electrical storms can lead to fatal accidents, infrastructure damage, and costly interruptions in industrial production. Beyond mining, lightning events also threaten ports, airports, construction sites, agriculture, wind energy operations, offshore activities, mountain-top stations, and power plants, where unexpected (and sometimes unmonitored) lightning strikes can cause severe operational disruptions. In open-air mining environments, lightning poses a particularly high risk due to the extensive exposure of equipment, personnel, and infrastructure. Mining sites often operate across large, elevated, and remote areas with limited shelter and extensive use of tall metallic structures such as drilling rigs, haul trucks, and conveyor belts, all of which are vulnerable to lightning strikes. Sudden storms can halt operations, damage high-value machinery, injure workers, or even cause fatalities, especially when evacuation protocols are delayed or unavailable. Furthermore, blasting operations must be suspended during thunderstorm activity, creating costly delays.

In the eastern Amazon, particularly in the Carajás Mineral Province, Santos et al. [1] have shown that topographic elevation and changes in land cover in mining areas can influence lightning occurrence, further reinforcing the need for localized and high-resolution forecasting in such regions. Accurate nowcasting is therefore essential for minimizing downtime and ensuring worker safety in real-time industrial decisions.

Also, lightning can trigger wildfires, as reported in Pineda et al. [2], emphasizing the need for reliable forecasting systems to mitigate such hazards. The lack of effective forecasting and prediction systems increases risks, as missed alerts expose workers and infrastructure to potential damage, while false alarms disrupt operations unnecessarily, leading to economic losses. Developing accurate and reliable lightning prediction models is, therefore, essential for both safety and productivity.

Efforts to improve lightning monitoring and modeling also extend beyond industrial applications. Kovář et al. [3], for example, explored the use of very-low-frequency signals generated by lightning for long-range radio navigation, highlighting the broader applicability of lightning-based data in geolocation systems. Kákona et al. [4] performed mobile ground-based measurements across central Europe using high-speed cameras and radio receivers, offering new insights into lightning development and the limitations of field-based radiation detection. Although distinct in scope, such studies reinforce the scientific and technological relevance of accurately detecting and modeling lightning events across diverse fields.

Additionally, understanding the spatial distribution of lightning activity is essential for improving model performance in regions with distinct convective behavior. Albrecht et al. [5] provided a high-resolution satellite-derived lightning climatology that identified global hotspots and revealed key geographical influences, such as topography and land–water contrasts, on lightning occurrence. Although climatological in nature, these insights support the development of more localized and data-driven nowcasting strategies.

Recent advances in weather radar technology have provided high-resolution meteorological data for nowcasting severe weather events, including lightning. Various data sources, including ground-based detection networks, satellite observations, and numerical weather prediction (NWP) models, have been used for lightning forecasting. However, studies have shown that polarimetric weather radar data are particularly useful for analyzing storm microphysics and charge separation processes, which are important to lightning formation [6,7,8,9]. The integration of machine learning (ML) models with weather radar data has shown promising results in improving prediction accuracy by capturing nonlinear relationships between storm characteristics and lightning occurrence. Additionally, multi-source data integration has been shown to enhance forecast performance, as highlighted in [10,11], demonstrating that combining radar, satellite, and numerical models can improve prediction accuracy.

Some studies have explored different approaches to lightning forecasting using radar and ML techniques. Abreu et al. [6] demonstrated that increasing reflectivity values in the vertical structure of clouds correlates with lightning activity. Hayashi et al. [7] investigated the relationship between hydrometeor classification and lightning rates using dual-polarization radar, identifying ice-phase hydrometeors as key contributors to storm electrification. Capozzi et al. [12] proposed a multi-parameter approach for cloud-to-ground (CG) lightning detection, demonstrating that quadratic discriminant analysis (QDA) outperformed traditional and single-variable models. More recently, Rombeek, Leinonen, and Hamann [8] highlighted the importance of polarimetric radar variables in nowcasting severe weather hazards, showing that deep learning (DL) architectures can improve lightning predictions. Additionally, studies such as [13,14,15] have explored ML-based approaches using various meteorological inputs, achieving significant improvements in prediction accuracy.

Despite these advancements, there is still a need for optimized methodologies that integrate different radar-based predictive approaches to improve forecasting reliability, particularly in regions with complex meteorological dynamics such as the Amazon region in Brazil. The studies cited above did not directly compare alternative feature-engineering strategies on the same dataset to quantify incremental gains, provide a detailed theoretical analysis of why specific radar signatures drive electrification, or assess the operational trade-offs between model complexity and false-alarm rates in real-time systems. As a result, the unique methodological value and practical applicability of different radar-derived representations remain underexplored.

This study proposes an ML-based lightning prediction approach using polarimetric weather radar data, with a focus on nowcasting over a mining region in Pará, Brazil, in the Amazon. Unlike previous studies, this research evaluates three different approaches for feature extraction and prediction: (i) grouping radar variables into temperature-based layers, (ii) computing descriptive statistics of reflectivity and polarimetric variables at different altitudes, and (iii) applying Principal Component Analysis (PCA) to multi-level radar data and combining multiple models into an ensemble. The main contributions include (i) evaluating different feature engineering strategies for lightning prediction, (ii) optimizing ML models for a high-risk industrial environment, and (iii) integrating the most effective model into the Lightning Early Warning Artificial Intelligence System (LEWAIS) operational forecasting system [16]. This study aims to contribute to both scientific research and industrial safety, with a view to both worker safety and productivity.

This work is organized as follows: Section 2 reviews the related work in lightning prediction and radar-based modeling; Section 3 presents the data sources, including weather radar and lightning datasets, briefly describes the physical properties of lightning, and details the three modeling approaches proposed; Section 4 reports and discusses the results, including performance comparisons and integration with the LEWAIS; and finally, Section 5 provides the conclusions and outlines directions for future research.

2. Related Works

Previous studies have established a relationship between cloud microphysics and lightning occurrences using weather radar data. This is because the electrification process within a thunderstorm is directly linked to the hydrometeors that compose the cloud [6,7,9]. Understanding this relationship is crucial for developing accurate lightning prediction models, particularly those leveraging polarimetric radar data and ML techniques.

Several studies have explored the vertical structure of clouds and their association with lightning. Abreu et al. [6] analyzed the relationship between cloud structure and lightning frequency in northern Brazil using reflectivity profiles from the Tropical Rainfall Measuring Mission satellite radar. Their dataset consisted of reflectivity profiles with 80 vertical levels (one every 250 m), ranging from 0 to 80 dBZ. The study found that as lightning frequency increased, reflectivity values in the vertical profile also increased, demonstrating a clear connection between reflectivity and lightning occurrences. Similarly, Hayashi et al. [7] used dual-polarization radar data, a hydrometeor classification algorithm, and historical lightning data to investigate microphysical properties associated with lightning rate in 10 isolated storm cases over the Kanto Plain, Japan. The study found that ice particles within the 35 dBZ volume (V35IC) had the highest correlation coefficient (r = 0.75) and the lowest normalized root mean square error (NRMSE = 8.3%) for CG lightning, and r = 0.69 and NRMSE = 8.1% for intra-cloud (IC) lightning.

A different approach was taken by Capozzi et al. [12], who developed a multi-parameter method to detect the CG lightning stroke rate in convective cells using a low-cost X-band single-polarization radar. The reported findings demonstrated that a QDA-based classification approach outperformed traditional single-parameter methods. Furthermore, QDA surpassed Fuzzy Logic and Support Vector Machine (SVM)-based models, except for the Heidke Skill Score, where an SVM with a Gaussian kernel performed best.

Advances in ML for lightning prediction have also provided significant improvements in the field. Rombeek, Leinonen, and Hamann [8] emphasized the importance of polarimetric variables in nowcasting thunderstorm hazards using recurrent-convolutional neural networks. By incorporating hydrometeor characteristics from multiple altitudes, their approach enhanced predictions of precipitation, hail, and lightning activity. This research further validates the importance of analyzing radar-derived microphysical features, a key aspect of our layered reflectivity analysis in Approach 1 and height-based feature extraction in Approach 2, which are described in the next section. ML-based lightning prediction methods have been explored in other regions as well. Mostajabi et al. [13] used ML to predict lightning risk within a 30 km radius around 12 meteorological stations in Switzerland. Their model, based on four meteorological variables (air pressure, temperature, relative humidity, and wind speed), was validated using lightning detection system data. Among the ML models tested, XGBoost produced the best results for lead times of up to 30 min.

Further evidence of ML effectiveness for lightning forecasting comes from Shan et al. [14], who applied several ML models to analyze the relationship between Atmospheric Radiation Measurement data and lightning records from the Earth Networks Total Lightning Network. The study identified key variables influencing lightning formation, with Random Forest (RF) emerging as the best predictor. When convective clouds were detected, RF predicted lightning with 76.9% accuracy and an Area Under the Curve (AUC) of 0.850. Bao et al. [15] designed a deep learning-based lightning prediction system using Multi-Layer Perceptron and ResNet50, achieving 88.2% accuracy, 92.2% precision, 81.5% recall, and an F1-score of 86.4%. These studies support the use of ML models in our study, particularly the RF and XGBoost models, which were tested across all three approaches.

More recently, research on lightning nowcasting using weather radar has incorporated Doppler radar, NWP, and ML models to enhance forecasting accuracy. Fata et al. [17] explored CG lightning nowcasting by fusing remote sensing and NWP data, integrating geostationary meteorological sensors and Doppler radar. Their results showed that Gaussian Process Regression improved prediction lead times by up to 15 min with higher spatial confidence. Yin et al. [10] developed a model that integrates GNSS-derived precipitable water vapor, weather radar, and satellite data, demonstrating a 20% increase in prediction accuracy over radar-only or satellite-only models. Hosalikar et al. [18] focused on thunderstorms in eastern India, using Doppler Weather Radars and satellite data to improve CG lightning prediction. Additionally, Cintineo et al. [11] introduced the third version of the ProbSevere model, which integrates radar, lightning, and satellite data with ML for severe weather nowcasting, emphasizing the importance of radar-derived features. Pineda et al. [2] examined lightning-induced wildfires, using radar reflectivity and lightning data to characterize dry thunderstorms. Finally, Kundu et al. [19] conducted a radar-based analysis of severe lightning events in northeast India, offering insights into storm dynamics.

These studies demonstrate the increasing sophistication of lightning prediction methods, driven by advances in weather radar technology, data fusion, and ML techniques. Our study differs from previous works in three key aspects: (i) data representation: whereas most studies use integrated or single-level radar data, we analyze reflectivity and polarimetric variables at multiple altitudes to assess thunderstorm microphysics; (ii) geographic zone: our study is one of the few ML-based studies in the Amazon region, specifically in a mining zone where workers face high exposure to storms, and the region’s intense convection, diverse hydrometeor profiles, and strong electrification [1] make it distinct from those previously studied; (iii) application and integration: beyond predicting lightning, we aim to integrate the best-performing model into the Lightning Early Warning Artificial Intelligence System (LEWAIS) [16] to improve worker safety and operational efficiency. By addressing these aspects, our study contributes to advancing lightning prediction methodologies, particularly in regions with complex meteorological dynamics, while also providing practical applications for early-warning systems.

3. Materials and Methods

3.1. Data Gathering

The data used in this study were obtained from an X-band dual-polarization weather radar installed in the Carajás Urban Center, Pará, Brazil. The radar covers a radius of up to 150 km and provides updates at 5-min intervals. The available data cover the period from 1 March 2021 (when the radar was installed and became operational) to 31 January 2022. More recent data were not used because the series presented problems such as reading errors, with no measurements of rain clouds at the evaluated heights, and gaps due to equipment maintenance.

The raw radar data are stored in HDF5 format [20] and are then processed and converted into NetCDF format [21]. The resulting NetCDF file contains 21 different parameters distributed across 18 vertical levels ranging from 1 to 17 km, including the Constant Altitude Plan Position Indicator (CAPPI) for the following variables: radial velocity, differential phase shift, horizontal reflectivity (zh), differential reflectivity (zdr), specific differential phase (kdp), and correlation coefficient between horizontal and vertical polarization (rhohv). The CAPPI product is generated by combining radar scans at different elevation angles and extracting a horizontal cross-section at a specific altitude.

The radar coverage, with a radius of 150 km, is indicated by the blue circle, whereas this study focused on an area within a 20 km radius around a point of interest, where a mining region is located, represented by the red circle in Figure 1a. The weather radar’s installation site is shown in Figure 1b.

In this work, we used the CAPPI for the variables zh, zdr, kdp, and rhohv. The zh variable measures the power of the returned radar signal in horizontal polarization, providing information about the size and concentration of liquid or solid water particles. Zdr measures the difference in reflectivity between horizontal and vertical polarizations, making it useful for distinguishing between different types of hydrometeors, such as rain and hail. Kdp indicates the rate of change in the differential phase shift with distance, which is commonly used to estimate the rainfall rate. Finally, rhohv measures the correlation between the reflected signals in both polarizations, helping to identify areas of homogeneous precipitation and detect the presence of contaminants such as noise or debris.

To determine whether lightning discharges occurred within a specific time interval and to perform predictions, data from the BrasilDAT dataset [22] were utilized. This dataset, provided by the National Institute for Space Research (INPE), includes the timestamp (hour/minute/second), location (latitude/longitude), and type of discharges, which are categorized as CG and IC. The BrasilDAT dataset compiles multiple data sources to determine the precise time, type, and location of lightning discharges, and it has been previously validated and used in research [1,16]. To create the response variable, which represents the presence or absence of lightning discharges in the subsequent 5 min (matching the processing interval of the NetCDF files), it was sufficient to compare the NetCDF timestamp with the BrasilDAT dataset.
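The labeling rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `label_scans`, the `WINDOW` constant, and the sample timestamps are hypothetical, and the lightning records are assumed to be already restricted to the area of interest.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # assumed to match the radar update interval


def label_scans(scan_times, strike_times, window=WINDOW):
    """Return 1 for each scan followed by a strike within `window`, else 0.

    A strike at time s is attributed to a scan at time t when t <= s < t + window.
    """
    strikes = sorted(strike_times)
    labels = []
    for t in scan_times:
        i = bisect_left(strikes, t)  # index of the first strike at or after the scan
        labels.append(int(i < len(strikes) and strikes[i] < t + window))
    return labels


# Hypothetical timestamps for illustration only.
scans = [datetime(2021, 3, 1, 12, 0) + k * WINDOW for k in range(4)]
strikes = [datetime(2021, 3, 1, 12, 7, 30), datetime(2021, 3, 1, 12, 8, 10)]
labels = label_scans(scans, strikes)
```

Sorting the lightning timestamps once and using a binary search keeps the comparison efficient when many radar files are labeled against a long lightning record.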

It is important to mention that in operational contexts, weather radar data typically require about 5 min for a full volume scan, which is followed by 4–5 min for preprocessing and model input generation. Although this study assumes preprocessed data, such latency should be considered in real-time deployments, effectively shifting the nowcasting window to at least 10 min to ensure timely warnings.

3.2. Physical Properties of Lightning

Lightning is a complex atmospheric electrical phenomenon arising from charge separation processes within convective clouds, particularly cumulonimbus. These discharges play an important role in atmospheric electricity balance and are commonly categorized into several types based on their geometry and propagation:

Cloud-to-Ground: Characterized by a discharge from the cloud base to the Earth’s surface, CG strikes are responsible for most lightning-related damage and are the primary focus of this study. CG events are typically associated with strong electric field build-up near the surface, which is intensified by vertical air movement and microphysical processes within the storm cell [1,16].
Intra-Cloud (IC) and Cloud-to-Cloud (CC): These occur entirely within or between clouds. Although more frequent than CG, they are generally less hazardous to surface infrastructure and are less detectable by ground-based systems [23].
Ground-to-Cloud: A rarer class, often initiated from tall structures upward, usually under high electric field conditions [23,24].

The duration of lightning varies from a few microseconds in leader propagation stages to tens or hundreds of milliseconds in multi-stroke CG events. CG strokes typically exhibit peak currents ranging from 10 to 200 kA, which are accompanied by intense optical and radio-frequency emissions [24]. Optical characteristics, such as the intensity and duration of light emission, differ notably across lightning types and may also vary with atmospheric conditions.

Understanding these physical properties is important for studies that apply ML-based nowcasting models to weather radar data [25,26]. In our application, the lightning data were received already processed and labeled by INPE; therefore, the model is not directly influenced by these characteristics. However, understanding the associated radar signatures is important for evaluating model outputs.

CG discharges are often associated with strong reflectivity in the lower and mid-levels of the cloud due to the presence of large ice and mixed-phase hydrometeors [25,26].
IC lightning, in contrast, may correlate with elevated reflectivity at higher altitudes but weaker signatures near the cloud base [23].
The presence of hail, graupel, and supercooled water, key ingredients in charge separation, can be inferred from dual-polarization radar variables such as kdp, zdr, and rhohv, which were selected as model inputs in this study [25,26].

The radar-based features used in our model, such as vertical reflectivity profiles and polarimetric parameters across atmospheric layers, aim to indirectly capture these physical conditions. However, a limitation remains: most ground-truth datasets only record CG lightning, meaning our target variable is inherently biased toward surface-reaching events.

Despite this constraint, the ML models may implicitly learn to differentiate lightning types based on radar patterns. For future work, more granular datasets could enable multi-class classification that separates CG and IC events. Finally, for an in-depth understanding of the physical aspects of this phenomenon, please refer to Dwyer and Uman [23].

3.3. Lightning Prediction Approaches

This work describes three different approaches that were evaluated for predicting short-term lightning discharges. Furthermore, the most promising approach was intended to be integrated into the LEWAIS lightning prediction system [16] to enhance its two primary goals: improving people’s safety and mining productivity. For clarity, the three approaches are summarized in Table 1. A more detailed explanation, however, is provided in the subsequent subsections.

3.3.1. Approach 1: Grouping of zh, zdr, kdp, and rhohv Data by Layers

Description: The zh, zdr, kdp, and rhohv data in the warm, mixed 1, mixed 2, and cold layers were used as input for training ML models. The study of cloud microphysical behavior by Mattos et al. [9], supported by the findings of Abreu et al. [6], demonstrated that stratifying polarimetric variables by thermal layers is effective in detecting changes in the microphysical profile of clouds during lightning events in the southeastern region of Brazil. Based on this evidence, the variables were organized by altitude ranges within the study area and employed as predictors in the tested models. We used temperature profiles (in °C) as the vertical coordinate instead of height levels (in km), following [9]. The vertical structure of these profiles was divided into four layers: warm (above 0 °C), mixed 1 (between 0 °C and −15 °C), mixed 2 (between −15 °C and −40 °C), and cold (below −40 °C). The temperature profiles were obtained from the University of Wyoming [27], considering all days of 2021 (training set) and 2022 (test set) at 12 UTC. Table 2 defines the classification of atmospheric layers based on temperature profiles and gives the corresponding altitude ranges, with the median height for each layer.

The warm layer consists of regions with temperatures above 0 °C, which are typically associated with liquid precipitation. The mixed layers (Mixed 1 and Mixed 2) correspond to regions where phase transitions occur, such as the coexistence of supercooled water and ice, which are crucial for storm microphysics and charge separation processes. The cold layer, with temperatures below −40 °C, is predominantly composed of ice particles and plays a key role in lightning formation. The altitude thresholds were determined based on median values from observational data.

Pre-processing: For these analyses, only samples with valid reflectivity within a 20 km radius of the point of interest were used. From these samples, the minimum, mean, and maximum values for each variable at each corresponding temperature level were extracted, totaling 48 variables used as input in the models.
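The count of 48 inputs follows from 4 variables × 4 layers × 3 statistics. A minimal sketch of this feature construction is given below. The altitude bounds in `LAYERS` are placeholders chosen only for illustration (the actual ranges are those in Table 2), and the function name and input layout are assumptions.

```python
from statistics import mean

VARIABLES = ("zh", "zdr", "kdp", "rhohv")
# Hypothetical altitude bounds in km; the real values are given in Table 2.
LAYERS = {"warm": (0, 5), "mixed1": (5, 7), "mixed2": (7, 10), "cold": (10, 18)}


def layer_features(samples):
    """Build min/mean/max features per variable and layer.

    samples: {variable: [(altitude_km, value), ...]} holding valid values only.
    """
    features = {}
    for var in VARIABLES:
        for layer, (low, high) in LAYERS.items():
            values = [v for alt, v in samples.get(var, []) if low <= alt < high]
            for stat, func in (("min", min), ("mean", mean), ("max", max)):
                # None marks a layer without valid echoes (a missing value).
                features[f"{var}_{layer}_{stat}"] = func(values) if values else None
    return features


# Hypothetical reflectivity samples for one radar file.
feats = layer_features({"zh": [(2, 30.0), (3, 40.0), (6, 20.0)]})
```

Layers without valid echoes produce missing values, which is consistent with the many gaps reported for the higher layers outside storm periods.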

Even so, the dataset contained many missing values, especially in the higher layers, which is justified by the absence of clouds at those altitudes outside of storm periods. In the descriptive analyses, it was observed that the mean values of the variables, when separated by class, showed significant discrepancies.

Thus, to avoid issues during model training, missing data were imputed based on the mean of each class (i.e., replacing null values with the mean value corresponding to the class label). This imputation method yielded better results compared to using the median or KNN imputation. This is likely because the variables, when separated by class, exhibited approximately symmetric distributions.

Although the Shapiro–Wilk test rejected the strict normality hypothesis (p < 0.05), some polarimetric variables exhibited unimodal distributions and low relative skewness (|γ| < 1.5), particularly in the class where lightning occurred. However, several variables showed high skewness and kurtosis, indicating the presence of extreme values that may reflect real physical phenomena rather than statistical outliers. Additionally, it is important to note that in meteorology, extreme values of reflectivity (ZH > 50 dBZ) or specific differential phase (KDP > 3°/km) represent real physical phenomena (e.g., hail, heavy rain) rather than mere outliers [9]. The mean preserves these physical signatures better than the median or neighborhood-based methods (KNN), which may underestimate critically important extreme values.
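The class-conditional mean imputation can be sketched as below. This is an illustrative implementation, assuming missing values are encoded as `None`; the function name and the toy data are hypothetical.

```python
from statistics import mean


def impute_by_class_mean(rows, labels):
    """Replace None entries with the mean of that column within the same class."""
    n_cols = len(rows[0])
    class_means = {}
    for cls in set(labels):
        for col in range(n_cols):
            observed = [r[col] for r, y in zip(rows, labels)
                        if y == cls and r[col] is not None]
            # If a class has no observed value for a column, the gap is left as is.
            class_means[cls, col] = mean(observed) if observed else None
    return [
        [class_means[y, c] if v is None else v for c, v in enumerate(r)]
        for r, y in zip(rows, labels)
    ]


# Hypothetical two-feature example (e.g., a zh mean and a kdp maximum in one layer).
rows = [[40.0, None], [50.0, 2.0], [10.0, 0.5], [None, 0.1]]
labels = [1, 1, 0, 0]
filled = impute_by_class_mean(rows, labels)
```

Because the rule depends on the class label, it requires labeled samples.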

The standardization was performed using the StandardScaler. The data were split into 2021 for training and 2022 for testing. The RandomUnderSampler technique was used for class balancing, which reduces the majority class by randomly selecting observations until a sample of the same size as the minority class is obtained. As a result, the training set contained 2084 samples from class 0 and 2084 samples from class 1. The test set was not balanced and consisted of 88,343 samples from class 0 and 2033 samples from class 1.
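The balancing and scaling steps can be illustrated with the standard-library sketch below. It mimics the behavior described for RandomUnderSampler and StandardScaler but is not those implementations; the function names, the seed, and the toy data are assumptions.

```python
import random
from statistics import mean, pstdev


def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels


def fit_standardizer(rows):
    """Return per-column (mean, std) estimated from the given rows.

    The population standard deviation is used; constant columns get std 1.0.
    """
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*rows)]


def standardize(rows, params):
    """Apply previously fitted (mean, std) pairs column by column."""
    return [[(v - m) / s for v, (m, s) in zip(r, params)] for r in rows]


# Hypothetical imbalanced data: 8 negatives and 2 positives.
X = [[float(i), float(i % 3)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = undersample(X, y)
params = fit_standardizer(X_bal)
X_std = standardize(X_bal, params)
```

Fitting the scaling parameters on the training rows and reusing them for the test rows keeps the unbalanced test set on the same scale as the training data.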

Methods: The experiments evaluated the following methods: RF, LR, XGBoost, SVM, and EHGB using the set of hyperparameters detailed in Table 3 during the GridSearch procedure.

All model configurations achieved consistently high recall, with values exceeding 90% for the positive class. In light of this, hyperparameters were optimized through an exhaustive GridSearchCV search with precision as the optimization metric. This choice serves two purposes: (i) maintaining the models’ demonstrated ability to identify the positive class while (ii) reducing Type I errors (false alarms) through precision maximization, thereby balancing the precision–recall trade-off.
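To make the trade-off concrete, the sketch below computes precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical and chosen only for illustration. In scikit-learn, selecting precision as the search criterion would typically correspond to passing `scoring="precision"` to GridSearchCV, although the exact configuration used in the study is not stated here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: lightning predicted and observed (hits)
    fp: lightning predicted but not observed (false alarms, Type I errors)
    fn: lightning observed but not predicted (misses)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical counts: 90 hits, 60 false alarms, 10 misses.
p, r, f1 = precision_recall_f1(tp=90, fp=60, fn=10)
```

With these counts, recall is high (0.90) while precision is only 0.60, which is the situation in which tuning for precision reduces false alarms without ignoring the positive class.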

3.3.2. Approach 2: Descriptive Statistics of zh, zdr, kdp, and rhohv Data by Height

Description: The zh, zdr, kdp, and rhohv data at altitudes of 3 km, 6 km, and 9 km were used. These heights allow polarimetric variables to capture key processes in lightning formation, making them fundamental for lightning discharge prediction. At 3 km, which is below the melting level (approximately 4.5 km [9]), it is possible to identify rainfall intensity. At 6 km, ice particle crystallization occurs, which is a crucial process for electric charge transfer. At 9 km, there is an intensification of cloud electrification in the upper layers with the accumulation and separation of ice crystals, leading to the polarization of electrical charges, which is an essential factor for lightning formation. Results presented by Hayashi et al. [7] highlighted the importance of analyzing reflectivity and ice-phase hydrometeors in different cloud layers, reinforcing the approach of stratifying radar data by height as in this approach. For analysis purposes, only valid observed values were considered, meaning that data were included only when all variables were successfully captured by the radar.

Pre-processing: Each row represented the data from one .nc file, and the columns contained the minimum, mean, maximum, and standard deviation of the variables zh, zdr, kdp, and rhohv at altitudes of 3 km, 6 km, and 9 km. Additional variables were created to assist the algorithms, with One-Hot Encoding [28] applied to categorical variables, including the month, day, week, hour, quarter, and period of the day. A total of 4627 samples were collected, with 2896 from class 0 and 1731 from class 1. Two data-splitting schemes were evaluated: random (70% for training and 30% for testing) and temporal (2021 for training and 2022 for testing). Missing values were imputed with SimpleImputer using the median, SMOTEENN (SMOTE + Edited Nearest Neighbors) was used for class balancing, and GridSearchCV with cv = 10 folds was used for hyperparameter tuning. Additionally, Boruta [29] was used for feature selection.
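A minimal sketch of how one such row could be assembled is shown below. The input layout, column names, and period-of-day categories are assumptions made for illustration, and the population standard deviation is used because the text does not specify which estimator was applied.

```python
from statistics import mean, pstdev

VARIABLES = ("zh", "zdr", "kdp", "rhohv")
ALTITUDES_KM = (3, 6, 9)


def altitude_features(cappi):
    """Summarize each variable at each altitude.

    cappi: {(variable, altitude_km): [valid values within the area of interest]}
    """
    row = {}
    for var in VARIABLES:
        for alt in ALTITUDES_KM:
            values = cappi[var, alt]
            row[f"{var}_{alt}km_min"] = min(values)
            row[f"{var}_{alt}km_mean"] = mean(values)
            row[f"{var}_{alt}km_max"] = max(values)
            row[f"{var}_{alt}km_std"] = pstdev(values)  # assumed estimator
    return row


def one_hot(value, categories, prefix):
    """Encode one categorical value as 0/1 indicator columns."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


# Hypothetical values for a single radar file.
cappi = {(v, a): [1.0, 2.0, 3.0] for v in VARIABLES for a in ALTITUDES_KM}
row = altitude_features(cappi)
# Hypothetical period-of-day categories.
row.update(one_hot("afternoon", ("night", "morning", "afternoon", "evening"), "period"))
```

The radar part of the row has 4 variables × 3 altitudes × 4 statistics = 48 columns, to which the indicator columns for the temporal variables are appended.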

Methods: The methods used in this approach were RF, XGBoost, Decision Tree (DT), LDA, KNN, and Naive Bayes (NB). In this approach, GridSearchCV was also used for hyperparameter optimization with the values described in Table 4.

3.3.3. Approach 3: zh Data at 18 Height Levels

Description: In this approach, data from the entire volume (for all 18 levels) were obtained within a 20 km radius around the target. The reflectivity data are organized in a three-dimensional structure with dimensions 18 × 300 × 300, representing 18 levels, where each level is a 300 × 300 matrix of positions. For each level, 1264 reflectivity values were extracted, ranging approximately from −5 to 70. With 18 levels, the total number of reflectivity values per radar file was 22,752. The results reported in [12] highlighted the advantages of multi-feature approaches for lightning prediction, which aligns with this approach, where reflectivity values from 18 height levels are used as model inputs. Also, as in [13], this approach uses high-dimensional reflectivity data as ML model inputs for lightning forecasting.
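The extraction step can be sketched as a circular mask applied to every level of the volume. With 1264 points per level, 18 levels give 18 × 1264 = 22,752 inputs per file, as stated above. In the sketch below, the grid spacing (`pixel_km`) and the position of the target in the grid are hypothetical, since neither is given in the text, so the number of points per level it produces is not expected to match the paper's value.

```python
import numpy as np


def extract_within_radius(volume, center_rc, radius_km, pixel_km):
    """Flatten all reflectivity values within `radius_km` of the target.

    volume: array shaped (levels, rows, cols).
    center_rc: (row, col) index of the target in the grid (assumed).
    pixel_km: horizontal grid spacing in km (assumed).
    """
    _, rows, cols = volume.shape
    rr, cc = np.ogrid[:rows, :cols]
    dist_km = np.hypot(rr - center_rc[0], cc - center_rc[1]) * pixel_km
    mask = dist_km <= radius_km          # same horizontal mask for every level
    return volume[:, mask].ravel(), int(mask.sum())


# Synthetic stand-in for one radar file with 18 levels of 300 x 300 values.
volume = np.zeros((18, 300, 300), dtype=np.float32)
flat_values, per_level = extract_within_radius(volume, (150, 150), 20.0, 1.0)
print(flat_values.size == 18 * per_level)  # True
```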

Pre-Processing: Since negative reflectivity values are considered too low, they were conventionally replaced with zero for analysis purposes. The next step involved labeling each radar file based on the 22,752 reflectivity values. The temporal split resulted in 77,692 samples for the training set (2021), with 75,544 in class 0 and 2148 in class 1, and 75,406 samples for the test set (2022), with 73,649 in class 0 and 1757 in class 1. Class balancing was performed by randomly removing samples from the majority class in the training data. To reduce the number of variables in the problem, the Boruta, Isomap, KPCA, SelectKBest, SVD, and PCA approaches were evaluated, with PCA yielding the best results.
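A minimal sketch of the two training-set steps described above (zeroing negative reflectivity and random undersampling of the majority class) is given below. It uses plain NumPy on synthetic data; the function name and the random seed are illustrative assumptions, and it is not the authors' implementation.

```python
import numpy as np


def clip_and_undersample(X, y, seed=0):
    """Replace negative reflectivity with zero, then balance the classes by
    randomly dropping rows of the majority class (training data only)."""
    rng = np.random.default_rng(seed)
    X = np.clip(X, 0.0, None)
    idx_pos = np.flatnonzero(y == 1)
    idx_neg = np.flatnonzero(y == 0)
    minority, majority = sorted((idx_pos, idx_neg), key=len)
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.sort(np.concatenate([minority, kept_majority]))
    return X[keep], y[keep]


# Synthetic data for illustration only.
rng = np.random.default_rng(1)
X = rng.uniform(-5.0, 70.0, size=(1000, 8))   # reflectivity-like values
y = np.zeros(1000, dtype=int)
y[:30] = 1                                     # 30 synthetic lightning samples
X_bal, y_bal = clip_and_undersample(X, y)
print(X_bal.shape, int(y_bal.sum()))           # (60, 8) 30
```

Only the training data are balanced in this way; the test set keeps its natural class distribution.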

PCA was applied to the training data to transform the dataset and reduce its dimensionality to 300 features. There were two key steps: (1) Eigenvalue decomposition and scree plot: eigenvalues were computed to assess the variance explained, and we retained components covering ≥90% cumulative variance. (2) Component loadings: loading vectors were extracted to determine which radar levels and variables contributed the most to each PC. In terms of physical interpretation, PC1 was dominated by high loadings on z at mid-levels (2–5 km) and elevated kdp in upper elevations, suggesting sensitivity to convective cores rich in ice-phase hydrometeors responsible for charge separation, while PC2 had high weights on zdr near the melting layer (~3 km), indicating liquid–solid transition regions.
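The component-selection rule can be illustrated with a short sketch that fits PCA on the training data only and keeps the smallest number of components whose cumulative explained variance reaches a chosen threshold. It uses a plain NumPy singular value decomposition rather than a library PCA class, the data are synthetic, and the threshold is left as a parameter; none of this is taken from the authors' code.

```python
import numpy as np


def fit_pca(X_train, variance_threshold=0.90):
    """Fit PCA on training data and keep the fewest components whose
    cumulative explained variance reaches `variance_threshold`."""
    mean = X_train.mean(axis=0)
    _, singular_values, components = np.linalg.svd(
        X_train - mean, full_matrices=False
    )
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    n_keep = int(np.searchsorted(cumulative, variance_threshold)) + 1
    n_keep = min(n_keep, len(cumulative))
    return mean, components[:n_keep], cumulative


def transform(X, mean, components):
    """Project data onto the retained components (rows of `components`)."""
    return (X - mean) @ components.T


# Synthetic low-rank data for illustration only.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 3))
X_train = latent @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(200, 50))
mean, components, cumulative = fit_pca(X_train, variance_threshold=0.90)
Z_train = transform(X_train, mean, components)
```

The rows of `components` are the loading vectors, so inspecting the largest absolute entries in each row is one way to see which inputs dominate a given PC. The same `mean` and `components` would be reused to transform the test data.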

Methods: The ML algorithms were Logistic Regression, DT, RF, and SVM, all using GridSearchCV for hyperparameter optimization (see Table 5). After several experiments, it was observed that RF and SVM-based models achieved the best results. Based on this, an additional procedure was performed to ensure the robustness of the results: a loop of 30 iterations was executed, randomly varying the balance of the training and test sets for both RF and SVM.

From the 60 models developed, the top five achieved mean recall values of 88.09%, 88.04%, 87.87%, 87.67%, and 87.56%, respectively. These models were subsequently selected to compose an ensemble, with recall serving as the primary selection criterion to optimize detection performance [30]. The study’s objective is to reduce alert failures (i.e., cases in which the system fails to issue an alert despite a lightning discharge occurring in the monitored area), which is a critical aspect of accident prevention. We therefore prioritized the recall metric, which measures the true positive rate, that is, the proportion of correctly identified lightning events relative to the total number of actual events.

The central idea is that by combining the strengths of multiple models, the ensemble can capture different patterns in the data, leading to a more robust and accurate prediction.

3.4. Model Selection and Programming Environment

To evaluate model performance, well-known classification metrics were used, including accuracy, precision, recall, and F1-score, which are all derived from the confusion matrix. Accuracy (Acc), defined in Equation (1), measures the proportion of correctly classified instances. Precision (P), defined in Equation (2), represents the proportion of correctly predicted lightning events out of all predicted positive cases. Recall (R), also known as Sensitivity or True Positive Rate, defined in Equation (3), measures the proportion of actual lightning events that were correctly identified. It is the most critical metric in our study, because a missed event leaves people unprotected. Finally, the F1-score, defined in Equation (4), is the harmonic mean of precision and recall, providing a balance between the two.

Acc = \frac{TP + TN}{TP + TN + FP + FN}

(1)

P = \frac{TP}{TP + FP}

(2)

R = \frac{TP}{TP + FN}

(3)

F1 = 2 \times \frac{P \times R}{P + R}

(4)

where TPs (True Positives) are the correctly predicted positive instances, TNs (True Negatives) are the correctly predicted negative instances, FPs (False Positives) are the incorrectly predicted positive instances, and FNs (False Negatives) are the incorrectly predicted negative instances.
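For reference, Equations (1)–(4) can be computed directly from the four confusion-matrix counts. The sketch below is a plain Python illustration with made-up counts; the guards against empty denominators are an added convention, not part of the definitions above.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=80, tn=900, fp=40, fn=20)
print({name: round(value, 4) for name, value in metrics.items()})
# {'accuracy': 0.9423, 'precision': 0.6667, 'recall': 0.8, 'f1': 0.7273}
```

With a strongly imbalanced test set, accuracy stays high even when many lightning events are missed, which is why recall and F1 are more informative here.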

These metrics are used to compare different models and select the one with the best performance under varying conditions. In Section 4.1, we further relate these standard ML metrics to operational performance criteria, such as FAR and FTW, used in our real-world scenario.

All experiments were implemented using Python version 3.9.6, with VSCode as the IDE. The main libraries were Pandas, NumPy, Scikit-learn, Boruta (for feature selection), and Matplotlib v. 3.4.2 and Seaborn v. 0.13.2 for visualization and exploratory data analysis.

4. Results and Discussion

This section presents the most promising results obtained from the machine learning approaches for lightning prediction. The weather radar, installed in 2021, has started generating data over the mining area, and these data are currently being analyzed by multiple research teams. Accurate forecasting, which minimizes both false positives and false negatives, is crucial for enhancing safety and optimizing productivity. A missed alert poses significant safety risks to workers involved in outdoor activities, whereas a false alert disrupts operations, resulting in economic losses.

Approach 1 uses zh, zdr, kdp, and rhohv data within a 20 km radius from the point of interest in the warm, mixed 1, mixed 2, and cold layers as input for training ML models. Due to differences in variable means across classes, missing values were imputed using the class mean, and StandardScaler was applied for feature scaling. To address class imbalance, RandomUnderSampler was used, balancing the training set to 2084 samples per class, while the test set remained imbalanced (88,343 samples for class 0 and 2033 for class 1). The evaluated models included RF, LR, XGBoost, SVM, and EHGB. Across all models, accuracy exceeded 97%, with class 0 demonstrating both precision and recall above 98%. The best-performing model was EHGB (with the best parameters: learning_rate = 0.05, max_depth = 5, max_iter = 200), while LR showed the weakest performance, particularly in recall.
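The class-mean imputation and scaling steps of Approach 1 can be sketched as follows. This is a NumPy illustration on a tiny made-up array, not the authors' pipeline; in particular, how missing values were handled for samples whose class is unknown at prediction time is not described in the text and is not addressed here.

```python
import numpy as np


def impute_class_mean(X, y):
    """Fill NaNs in each column with that column's mean within the same class."""
    X = X.astype(float).copy()
    for label in np.unique(y):
        rows = y == label
        block = X[rows]
        class_means = np.nanmean(block, axis=0)
        nan_r, nan_c = np.where(np.isnan(block))
        block[nan_r, nan_c] = class_means[nan_c]
        X[rows] = block
    return X


def standardize(X_train, X_other):
    """Scale both sets with the training mean and (population) standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0                  # avoid division by zero
    return (X_train - mean) / std, (X_other - mean) / std


# Made-up values for illustration only.
X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 10.0], [7.0, 20.0]])
y = np.array([0, 0, 1, 1])
X_filled = impute_class_mean(X, y)
print(X_filled.tolist())  # [[1.0, 4.0], [3.0, 4.0], [7.0, 10.0], [7.0, 20.0]]
```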

In Approach 2, a different perspective was taken by computing descriptive statistics (min, mean, max, and std) of radar variables at 3 km, 6 km, and 9 km altitudes. As in the previous approach, only valid radar recordings were included in the analysis. The best results came from the training and test sets that were created through random splitting (70–30%), and missing values were imputed using the median (SimpleImputer). To address class imbalance, SMOTEENN was applied, which was followed by GridSearchCV (cv = 10 folds) for hyperparameter tuning. The best model in this approach was the DecisionTreeClassifier (ccp_alpha = 0.001, max_depth = 10, random_state = 42), which achieved 0.88 recall for class 0 and 0.71 for class 1.

In Approach 3, PCA proved to be an effective method for reducing dataset dimensionality while preserving essential radar information. The number of components retained is shown in Figure 2 with the selection criterion based on capturing over 95% of the total feature variance.

Special attention is given to this approach, since it produced the best results. After training different models using LR, DT, RF, and SVM, all optimized with GridSearchCV, RF and SVM achieved the best results based on the recall metric. In this application, recall was considered the most appropriate metric for model selection, as it reflects the model’s ability to correctly identify actual lightning events, which is critical in operational warning systems. Using these two algorithms, 60 models were trained, and the top five were selected, with an average recall of 0.87855. These models were then combined into an ensemble to increase generalization and predictive robustness. An ensemble combines the predictions of multiple models to improve overall forecast accuracy. With this strategy, ensemble methods can limit the variance and bias errors associated with single ML models. There are different approaches to creating an ensemble, such as bagging, boosting, and stacking. In this case, the predictions of the top five models were combined to form a bagging ensemble, whose classification was decided by majority vote. Bagging is known for reducing variance without increasing the bias, while boosting reduces bias [30]. The central idea is that by combining the strengths of multiple models, the ensemble can capture different patterns in the data and provide a more robust and accurate forecast.
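The selection and voting steps can be sketched in a few lines of plain Python. The model names, recall scores, and predictions below are made up for illustration, and the helper functions are hypothetical; the sketch only shows ranking by mean recall and a simple majority vote over binary predictions.

```python
def select_top_models(mean_recalls, k=5):
    """Return the names of the k models with the highest mean recall."""
    ranked = sorted(mean_recalls.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def majority_vote(predictions):
    """Combine binary predictions (one list per model) sample by sample.

    A sample is labeled 1 when more than half of the models predict 1.
    """
    return [int(2 * sum(votes) > len(votes)) for votes in zip(*predictions)]


# Hypothetical mean recall per candidate model.
scores = {"rf_03": 0.861, "svm_11": 0.874, "rf_17": 0.880, "svm_02": 0.852,
          "rf_25": 0.869, "svm_29": 0.877, "rf_08": 0.858}
top5 = select_top_models(scores, k=5)
print(top5)  # ['rf_17', 'svm_29', 'svm_11', 'rf_25', 'rf_03']

# Hypothetical predictions of five models on four samples.
model_predictions = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
]
print(majority_vote(model_predictions))  # [1, 0, 1, 0]
```

Using an odd number of models avoids ties in a binary vote.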

The application of the ensemble to the 2022 dataset presented a significant challenge due to the large volume of data (about 75k records). To handle this, the data were split by month; i.e., January data were processed by the ensemble first, followed by February data, and so on. This stepwise approach allowed for efficient memory management and ensured that all data were analyzed without compromising system integrity. The results for each month are described in Table 6. The ensemble model achieved 0.944 recall for class 0 and 0.658 for class 1.
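The month-by-month evaluation can be sketched as a loop that loads, predicts, and scores one month at a time, so that only one month of data is held in memory. The callables `load_month` and `predict` are hypothetical placeholders for the data loader and the ensemble; the toy versions below exist only to make the example runnable.

```python
def evaluate_by_month(load_month, predict, months=range(1, 13)):
    """Score each month separately and return its confusion-matrix counts."""
    results = {}
    for month in months:
        features, labels = load_month(month)   # only this month is in memory
        predictions = predict(features)
        counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
        for label, pred in zip(labels, predictions):
            # "t"/"f": prediction right or wrong; "p"/"n": predicted class.
            key = ("t" if pred == label else "f") + ("p" if pred == 1 else "n")
            counts[key] += 1
        results[month] = counts
    return results


def toy_load_month(month):
    # Two synthetic samples per month: (feature rows, labels).
    return [[month], [month + 10]], [0, 1]


def toy_predict(rows):
    return [1 if row[0] > 6 else 0 for row in rows]


monthly = evaluate_by_month(toy_load_month, toy_predict, months=[1, 12])
print(monthly[12])  # {'tp': 1, 'tn': 0, 'fp': 1, 'fn': 0}
```

Summing the monthly counts gives the yearly confusion matrix, from which the class-wise recall values can be recomputed.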

It is also noteworthy that the results vary by month and, on closer analysis, on days with few lightning strikes. A critical observation is that in months with low lightning activity, the model tends to generate a significant number of FP. To investigate the errors further, we conducted a case-by-case analysis of the 2022 test samples where the model misclassified lightning events, i.e., where the model predicted no lightning occurrence, but at least one lightning strike was observed within the 20 km radius from the mine (the red circle in Figure 3). A few illustrative examples are shown in Figure 3.

In Figure 3a, only one CG lightning strike (red point) was detected within the 20 km monitored area with low reflectivity values in that region. These types of events were common in cases where the ensemble model failed, specifically when one or very few discharges occurred near the edge of the monitored circle but with low radar reflectivity in the central area. A similar pattern is seen in Figure 3b, where reflectivity indicates possible convective activity, but this activity lies outside the monitored area. Since the ML model is spatially constrained to the 20 km radius, it does not consider surrounding convection that could still influence the region of interest. This reveals a limitation of the current model in accounting for storm cells developing just outside the monitoring boundary.

On the other hand, in Figure 3c, strong atmospheric activity is clearly observed via both radar reflectivity and lightning detections. However, this activity is centered outside the monitored radius. Due to natural dispersion, a few strikes (CG and IC) occurred within the zone, which led to misclassification by the model. Finally, Figure 3d presents an anomalous case where strong atmospheric activity was evident in the region, including within the monitored area, but for unknown reasons, the radar failed to register reflectivity values during that period. This may suggest potential radar equipment malfunction and some kind of interference affecting data acquisition.

Overall, in all the FN cases revisited, the number of lightning strikes was very low, typically one or two events, which was similar to the first three cases presented. This suggests that the model is generally effective in associating reflectivity signatures with lightning activity. However, since it is restricted to interpreting atmospheric data only within the 20 km area, it occasionally fails when lightning originates from more distant convection systems but still impacts the target zone.

Finally, a comparison of the three approaches is summarized in Table 7, highlighting the best results in bold. The findings indicate that radar-based polarimetric data are valuable for lightning prediction with Approach 3 showing the best generalization capability.

These results confirm that Approach 3 (zh at 18 levels with PCA) provides the most reliable generalization, while Approach 1 (grouping by layers) prioritizes recall at the expense of precision. Approach 2 (descriptive statistics by height) proved efficient in capturing storm microphysics with a competitive recall. Future studies should refine feature extraction, hyperparameter tuning, and ensemble learning techniques to further optimize prediction accuracy.

4.1. Integration with an Existing System

Recently, LEWAIS was proposed in [16] as an operational lightning warning system for the same area analyzed in this study (referred to as “P2”). The model divides the region into predefined quadrants (the spatial monitoring areas around the target) and applies a two-step grid search to optimize conflicting operational goals. These include minimizing the false alarm rate (FAR), failure-to-warn (FTW) rate, and operational downtime, while maximizing lead time. The Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) multicriteria method was used to rank the solutions. Historical lightning discharge data are used to fine-tune alert issuance strategies that account for both safety and productivity.

To utilize these metrics, we establish a connection between machine learning metrics and operational performance. FP contributes to the FAR, as both represent instances where an unnecessary warning is issued. FN contributes to the FTW, as both indicate missed lightning events. Precision helps reduce FAR by improving the trustworthiness of alerts, while recall is inversely related to FTW, as higher recall signifies fewer missed warnings. The F1-score balances these trade-offs, proving to be a useful metric for selecting the most suitable model for integration.

Thus, to quantitatively evaluate performance, LEWAIS relies on a contingency table comparing system alerts with actual lightning events. The metrics are derived as follows: FAR, given by Equation (5), measures the proportion of false alerts among all alerts issued, and FTW, given by Equation (6), measures the proportion of missed events among all true lightning events. Operational downtime is the proportion of time (in hours) that operations are suspended due to active alerts relative to the total period. Lead time, in turn, is the average time (in minutes) between alert issuance and the first subsequent lightning event, computed for correctly predicted cases.

FAR = \frac{FP}{TP + FP}

(5)

FTW = \frac{FN}{TP + FN}

(6)
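Equations (5) and (6), together with the lead-time definition, can be illustrated with the short sketch below. When FAR and FTW are computed on the same contingency table as precision and recall, Equations (2), (3), (5), and (6) imply FAR = 1 − P and FTW = 1 − R. The counts and timestamps are made up, and the function names are hypothetical; the way LEWAIS counts alerts and events may differ from a per-sample classification table.

```python
from datetime import datetime


def far_ftw(tp, fp, fn):
    """False alarm rate and failure-to-warn rate from contingency counts."""
    far = fp / (tp + fp) if (tp + fp) else 0.0
    ftw = fn / (tp + fn) if (tp + fn) else 0.0
    return far, ftw


def mean_lead_time_minutes(hits):
    """Average minutes between alert issuance and the first lightning event.

    `hits` holds (alert_time, first_strike_time) pairs for correct alerts only.
    """
    minutes = [(strike - alert).total_seconds() / 60 for alert, strike in hits]
    return sum(minutes) / len(minutes) if minutes else 0.0


# Made-up counts and times for illustration only.
far, ftw = far_ftw(tp=80, fp=40, fn=20)
precision, recall = 80 / (80 + 40), 80 / (80 + 20)
print(round(far, 4), round(ftw, 4))  # 0.3333 0.2

hits = [
    (datetime(2022, 1, 11, 21, 15), datetime(2022, 1, 11, 21, 25)),
    (datetime(2022, 1, 11, 23, 5), datetime(2022, 1, 11, 23, 25)),
]
print(mean_lead_time_minutes(hits))  # 15.0
```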

To integrate our radar-based Ensemble ML model into this framework, we replicated the original LEWAIS metrics for the year 2022 using lightning data from INPE and defined this as the baseline (Model 1). We then evaluated three integration strategies:

Model 2—Conditional Ensemble: Uses the ensemble model when radar data are available. Otherwise, defaults to LEWAIS;
Model 3—Combined Alerts: Triggers an alert if either LEWAIS or ensemble generates one;
Model 4—Ensemble Priority: Prioritizes ensemble alerts. If only one model triggers, Ensemble is used.
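The decision rules of Models 2 and 3 can be written as two small functions, sketched below under the assumption that each system yields a Boolean alert per time step and that the ensemble yields `None` when no radar data are available. This encoding is an illustrative assumption. Model 4 is not sketched because its rule is not specified in enough detail here.

```python
def conditional_ensemble(lewais_alert, ensemble_alert):
    """Model 2: use the ensemble when radar data exist, otherwise LEWAIS.

    `ensemble_alert` is None when no radar data are available (assumed encoding).
    """
    return lewais_alert if ensemble_alert is None else ensemble_alert


def combined_alerts(lewais_alert, ensemble_alert):
    """Model 3: alert if either system alerts; a missing ensemble
    output is treated as no alert from the ensemble."""
    return lewais_alert or bool(ensemble_alert)


# (LEWAIS alert, ensemble alert) for a few hypothetical time steps.
steps = [(True, None), (True, False), (False, True), (False, None)]
print([conditional_ensemble(l, e) for l, e in steps])  # [True, False, True, False]
print([combined_alerts(l, e) for l, e in steps])       # [True, True, True, False]
```

Because Model 3 alerts whenever either source does, it can only add alerts relative to each individual system, which is consistent with a lower FTW at the cost of more time under alert.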

The results obtained for the LEWAIS system and for the three integrated models are presented in Table 8. The best results for each metric are highlighted in bold.

Among the four models, Model 3 offered the lowest FTW and the longest lead time, making it the most appropriate for safety-critical environments, such as mining operations. In contrast, Model 4 achieved the lowest FAR and downtime, which may benefit productivity, but its high FTW makes it less suitable when worker safety is the main concern. Model 2 provided moderate improvements in terms of FTW but still had a high FAR. Overall, the results show that integrating radar-based ML predictions with existing systems such as LEWAIS can improve performance, but the choice of integration strategy must be aligned with operational priorities, balancing safety and efficiency.

4.2. Comparison with Literature

The results of this study confirm the effectiveness of ML models combined with polarimetric radar data for lightning prediction, especially in a high-risk region like Pará, Brazil. When compared with related works, our approach demonstrates both consistency with previous findings and relevant methodological advances.

Abreu et al. [6] identified a link between vertical reflectivity profiles and lightning frequency. Our Approach 2, which uses descriptive statistics at key altitudes, builds on this idea and achieved 0.71 recall for class 1 with a DT model, reinforcing the importance of altitude-based features. Hayashi et al. [7] showed that ice-phase hydrometeors identified by dual-polarization radar have a strong correlation with lightning. This supports our proposed Approach 1, which grouped variables by different layers. The best EHGB model achieved a recall of 0.97, confirming the value of microphysical layer analysis. Capozzi et al. [12] demonstrated that multi-parameter classification outperforms traditional methods. Similarly, our Approach 3, using PCA and the Ensemble model, achieved an average recall of 87.9% across models, showing the benefit of dimensionality reduction and model combination. Additionally, Rombeek, Leinonen, and Hamann [8] focused on DL with polarimetric data across altitudes. In Approach 3, the use of 18 radar levels reflects a similar strategy, showing good generalization through ensemble learning (even without DL architectures). Yin et al. [10] and Cintineo et al. [11] emphasized the benefit of fusing radar, satellite, and NWP data. Despite relying solely on radar data, our results demonstrate that properly processed radar features can yield good lightning predictions. Finally, the authors of [14] used statistical summaries of atmospheric data to predict lightning, achieving high accuracy. Our Approach 2 applies a similar analysis to radar inputs, with similar results, validating the use of summarized features at different atmospheric levels.

In summary, the present work aligns with and extends prior research by proposing three radar-based ML approaches, demonstrating good recall and generalization, and offering practical solutions for nowcasting in industrial contexts.

4.3. Sensitivity Analysis

A sensitivity analysis was conducted to evaluate how different preprocessing techniques, feature selection methods, and hyperparameter optimization strategies impacted model performance across the three approaches.

In Approach 1, imputing missing values using the class mean was more effective than other strategies, as it preserved class-specific characteristics. StandardScaler normalization was essential for models sensitive to feature scale, such as SVM and LR. Class imbalance was addressed using RandomUnderSampler, which improved recall for the minority class (lightning in the target location) but reduced training data volume. The use of grid search cross-validation and Boruta did not show any significant improvement. In fact, while SVM with grid search improved precision, it also reduced recall, highlighting the trade-offs in optimizing specific metrics, as shown in [16]. In Approach 2, SMOTEENN outperformed undersampling by keeping more relevant samples. Imputation with the median using SimpleImputer provided robustness against outliers. The DT model, in turn, achieved better results without the need for advanced feature selection (this method is also known for providing a feature importance ranking). This suggests that the statistical features based on altitude were informative, as also reported in the literature; see Abreu et al. [6]. In Approach 3, sensitivity centered on the number of principal components. Using 300 components captured nearly all variance without performance loss. The Ensemble with the top five models significantly improved generalization and reduced variance, confirming the benefit of model aggregation.

Overall, while preprocessing techniques like scaling and balancing were important, the choice of feature representation, by layer, altitude, or vertical profile, had the most impact on predictive performance. These findings highlight the importance of adapting the ML modeling pipeline to the specific characteristics of meteorological data.

5. Conclusions

This study presented and evaluated different machine learning-based approaches for short-term lightning prediction using dual-polarization weather radar data over a mining region in the eastern Amazon. The study addressed a critical challenge in industrial operations exposed to open-air weather hazards: how to predict lightning with sufficient accuracy to protect workers and minimize unnecessary operational interruptions.

We explored three different approaches for feature representation, which are (i) by temperature-based atmospheric layers, (ii) altitude-based statistical summaries, and (iii) full vertical radar profiles processed via PCA. Among them, the PCA-based ensemble (Approach 3) showed the best generalization, while the layer-based method (Approach 1) delivered the highest recall, making it ideal for maximizing detection. The altitude-based statistical model (Approach 2) offered a lightweight yet effective alternative. These findings are in line with other works highlighting the relevance of feature engineering in radar-based lightning prediction and the trade-offs involved between recall, precision, and model robustness.

Additionally, the study tested the integration of the Ensemble model with the operational LEWAIS system, which is a quadrant-based technique used for lightning prediction. Among four integration strategies, the approach that triggered alerts when either LEWAIS or the ML ensemble predicted lightning (Model 3) yielded the lowest FTW rate and longest lead time, achieving the best balance for operational safety. In contrast, prioritizing the Ensemble model (Model 4) reduced FAR and operational downtime, a potential advantage for production-driven contexts, albeit at the expense of higher miss rates.

These results demonstrate the potential of radar-based ML models not only for improving forecasting performance but also for supporting decision making in operational systems. The integration of data science techniques into industrial safety frameworks, such as lightning alert systems, may offer an important path to reduce risks and optimize processes.

Despite the promising results presented in this study, some limitations should be acknowledged. First, the models were trained and tested with data obtained exclusively from a single X-band radar installed in a specific region of the eastern Amazon. In addition, the scanning strategy of this radar was not optimized for the detection of storm severity, which may have negatively influenced the results. This geographic and instrumental restriction may limit the generalization of the models to other regions with different microphysical characteristics or convective regimes. Furthermore, the study did not consider the integration of satellite data or numerical models, which could complement the radar information and improve the robustness of the forecasts.

It is also important to highlight that although the class balancing strategies reduced the influence of imbalance, the occurrence of discharges is naturally sparse, which may impact performance in extreme operational scenarios. Finally, the integrated warning system still depends on technological infrastructure and real-time data that may not be available in all industrial operations. These factors should be considered in future applications and studies to expand the model. Therefore, future work can explore the use of data fusion with satellite or NWP inputs, real-time deployment and DL architectures, aiming at greater scalability and operational intelligence.

Author Contributions

Conceptualization, D.B.d.S.F., A.P.P.S., D.C., O.P.J., I.S. and E.L.D.; methodology, M.A.A., R.A.M., B.A.S.O., M.C.A.A.F. and I.S.; software, M.A.A., R.A.M., B.A.S.O. and M.C.A.A.F.; validation, D.C., D.B.d.S.F., A.P.P.S. and I.S.; formal analysis, M.A.A., R.A.M., M.C.A.A.F. and B.A.S.O.; investigation, M.A.A., R.A.M., M.C.A.A.F. and B.A.S.O.; resources, I.S. and O.P.J.; data curation, I.S. and O.P.J.; writing—original draft preparation, M.A.A. and R.A.M.; writing—review and editing, I.S., B.A.S.O., D.C. and M.C.A.A.F.; visualization, M.A.A., R.A.M. and B.A.S.O.; supervision, D.B.d.S.F., I.S. and D.C.; project administration, D.B.d.S.F. and E.L.D.; funding acquisition, D.B.d.S.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are available upon request from the authors.

Acknowledgments

The authors thank Vale S.A., FITec Technological Innovations, and ITV Vale Institute of Technology.

Conflicts of Interest

Authors Marcos Antonio Alves, Rosana Alves Molina, Bruno Alberto Soares Oliveira, Daniel Calvo, Marcos Cesar Andrade Araujo Filho, and Eugenio Lopes Daher were employed by the company FITec Technological Innovations. Douglas Batista da Silva Ferreira and Ana Paula Paes dos Santos were employed by the company ITV—Vale Institute of Technology, Osmar Pinto Jr was employed by the company INPE National Institute for Space Research, and Ivan Saraiva was employed by the company CENSIPAM—Operations and Management Center of the Amazon Protection System.

References

1. Santos, A.P.P.d.; Ferreira, D.B.d.S.; Nascimento Júnior, W.d.R.; Souza-Filho, P.W.M.e.; Pinto Júnior, O.; Lima, F.J.L.d.; Bourscheidt, V.; Mattos, E.V.; Costa, C.P.W.d.; Nogueira Neto, A.V.; et al. Lightning under different land use and cover, and the influence of topography in the Carajás Mineral Province, Eastern Amazon. Atmosphere 2024, 15, 375.
2. Pineda, N.; Rodríguez, O.; Casellas, E.; Bech, J.; Montanyà, J. Meteorological factors associated with dry thunderstorms and simultaneous lightning-ignited wildfires: The 15 June 2022 outbreak in Catalonia. Agric. For. Meteorol. 2024, 359, 110268.
3. Kovář, P.; Puričer, P.; Mikeš, J. Study of the applicability of radio signals emitted by lightning for long-range navigation. J. Navig. 2023, 76, 641–652.
4. Kákona, J.; Mikeš, J.; Ambrožová, I.; Ploc, O.; Velychko, O.; Sihver, L.; Kákona, M. In situ ground-based mobile measurement of lightning events above central Europe. EGUsphere 2022, 2022, 547–561.
5. Albrecht, R.I.; Goodman, S.J.; Buechler, D.E.; Blakeslee, R.J.; Christian, H.J. Where are the lightning hotspots on Earth? Bull. Am. Meteorol. Soc. 2016, 97, 2051–2068.
6. Abreu, L.P.; Gonçalves, W.A.; Mattos, E.V.; Mutti, P.R.; Rodrigues, D.T.; da Silva, M.P.A. Clouds’ microphysical properties and their relationship with lightning activity in northeast Brazil. Remote Sens. 2021, 13, 4491.
7. Hayashi, S.; Umehara, A.; Nagumo, N.; Ushio, T. The relationship between lightning flash rate and ice-related volume derived from dual-polarization radar. Atmos. Res. 2021, 248, 105166.
8. Rombeek, N.; Leinonen, J.; Hamann, U. Exploiting radar polarimetry for nowcasting thunderstorm hazards using deep learning. Nat. Hazards Earth Syst. Sci. 2024, 24, 133–144.
9. Mattos, E.V.; Machado, L.A.; Williams, E.R.; Albrecht, R.I. Polarimetric radar characteristics of storms with and without lightning activity. J. Geophys. Res. Atmos. 2016, 121, 14–201.
10. Yin, W.; Zhou, C.; Zhou, F.; Tian, Y.; Yang, X.; Wang, X.; Tian, R.; Xiao, Y.; Zhang, W.; Yao, Y. A Lightning Nowcasting Model using GNSS PWV and Multi-source Data. IEEE Trans. Geosci. Remote Sens. 2024, 2024, 5802910.
11. Cintineo, J.L.; Pavolonis, M.J.; Sieglaff, J.M. ProbSevere Version 3: Improved Exploitation of Data Fusion and Machine Learning for Nowcasting Severe Weather. Weather Forecast. 2024, 39, 1937–1958.
12. Capozzi, V.; Montopoli, M.; Mazzarella, V.; Marra, A.C.; Roberto, N.; Panegrossi, G.; Dietrich, S.; Budillon, G. Multi-variable classification approach for the detection of lightning activity using a low-cost and portable X band radar. Remote Sens. 2018, 10, 1797.
13. Mostajabi, A.; Finney, D.L.; Rubinstein, M.; Rachidi, F. Nowcasting lightning occurrence from commonly available meteorological parameters using machine learning techniques. NPJ Clim. Atmos. Sci. 2019, 2, 41.
14. Shan, S.; Allen, D.; Li, Z.; Pickering, K.; Lapierre, J. Machine-learning-based investigation of the variables affecting summertime lightning occurrence over the Southern Great Plains. Atmos. Chem. Phys. 2023, 23, 14547–14560.
15. Bao, R.; Zhang, Y.; Ma, B.J.; Zhang, Z.; He, Z. An artificial neural network for lightning prediction based on atmospheric electric field observations. Remote Sens. 2022, 14, 4131.
16. Alves, M.A.; Oliveira, B.A.S.; Ferreira, D.B.S.; Santos, A.P.P.; Maia, W.F.S.; Soares, W.S.; Silvestrow, F.P.; Rodrigues, L.F.M.; Daher, E.L.; Pinto, O., Jr. An automated technique and decision support system for lightning early warning. Int. J. Environ. Sci. Technol. 2025, 22, 2289–2304.
17. Fata, A.; Moser, G.; Procopio, R.; Bernardi, M.; Fiori, E. A Gaussian Process Regression Method to Nowcast Cloud-to-Ground Lightning from Remote Sensing and Numerical Weather Modeling Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 1963–1981.
18. Hosalikar, K.S.; Mukhopadhyay, P.; Sen Roy, S.; Pawar, S.D.; Zacharia, S.; Kumari, P.; Muppa, S.K.; Mohapatra, M. Unfolding the mechanisms of the development of thunderstorms over eastern India: THUNDER-F field experiment. J. Earth Syst. Sci. 2024, 133, 216.
19. Kundu, S.S.; Chhari, A.; Srivastava, A.; Chakravorty, A.; Gogoi, R.B.; Aggarwal, S.P. Investigation of thunderstorm characteristics with severe lightning events over NE region of India. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2024, 10, 95–101.
20. The HDF Group. The HDF5 Library & File Format. Available online: https://www.hdfgroup.org/solutions/hdf5/ (accessed on 6 August 2024).
21. Unidata. Network Common Data Form (NetCDF). Available online: https://www.unidata.ucar.edu/software/netcdf/ (accessed on 6 August 2024).
22. Pinto, O., Jr.; Pinto, I.R.C.A. BrasilDAT dataset: Combining data from different lightning locating systems to obtain more precise lightning information. In Proceedings of the 25th International Conference on Lightning Detection, Ft. Lauderdale, FL, USA, 12–15 March 2018.
23. Dwyer, J.R.; Uman, M.A. The physics of lightning. Phys. Rep. 2014, 534, 147–241.
24. Peterson, M.; Liu, C. Characteristics of lightning flashes with exceptional illuminated areas, durations, and optical powers and surrounding storm properties in the tropics and inner subtropics. J. Geophys. Res. Atmos. 2013, 118, 11–727.
25. Utsav, B.; Deshpande, S.M.; Das, S.K.; Pawar, S.D.; Pandithurai, G. Relationship between convective storm properties and lightning over the Western Ghats. Earth Space Sci. 2022, 9, e2022EA002232.
26. Reinhart, B.; Fuelberg, H.; Blakeslee, R.; Mach, D.; Heymsfield, A.; Bansemer, A.; Durden, S.L.; Tanelli, S.; Heymsfield, G.; Lambrigtsen, B. Understanding the relationships between lightning, cloud microphysics, and airborne radar-derived storm structure during Hurricane Karl (2010). Mon. Weather Rev. 2014, 142, 590–605.
27. University of Wyoming. Radiosonde Data. Available online: https://weather.uwyo.edu (accessed on 13 March 2024).
28. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
29. Kursa, M.B.; Jankowski, A.; Rudnicki, W.R. Boruta–a system for feature selection. Fundam. Inform. 2010, 101, 271–285.
30. Mienye, I.D.; Sun, Y. A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access 2022, 10, 99129–99149.

Figure 1. (a) CAPPI from horizontal reflectivity (Zh) at an altitude of 6 km on 16 March 2021, at 15:45, and (b) weather radar installed in the Urban Center of Carajás, Pará, Brazil.

Figure 2. Number of components extracted from the PCA in Approach 3.

Figure 3. Examples of FN cases where the model predicted no lightning but at least one lightning occurred within the monitored area at an altitude of 6 km. These events occurred on 11 January 2022, at 9:25 pm (a) and 11:25 pm (b); 23 January 2022, at 2:35 pm (c); and 25 January 2022, at 9:15 am (d).

Table 1. Summary of the approaches employed in this article, considering the description, polarimetric variables, pre-processing techniques, ML methods applied in each, and related works.

Approach	Data	Variables	Pre-Processing	Methods	Related Studies
Grouping data by layers	It uses the variables in the warm, mixed 1, mixed 2, and cold layers.	zh, zdr, kdp, and rhohv	Data imputation using mean per class, StandardScaler normalization, class balancing using RandomUnderSampler; temporal split.	RF, LR, XGBoost, SVM, and EHGB.	[6,8,9]
Descriptive statistics by height	It uses the minimum, mean, maximum, and standard deviation of the variables at 3, 6, and 9 km of altitude.	zh, zdr, kdp, and rhohv	Random or temporal split; imputation with SimpleImputer, SMOTEENN balancing; GridSearchCV for hyperparameter tuning; Boruta for feature selection.	RF, XGBoost, DT, LDA, KNN, NB.	[7,8,14]
Data in 18 levels of height	It uses data from all 18 height levels.	zh	Temporal split; balancing using RandomUnderSampler; PCA for dimensionality reduction.	LR, DT, RF, SVM, Ensemble.	[12,13]

RF: Random Forest; LR: Logistic Regression; XGBoost: Extreme Gradient Boosting; SVM: Support Vector Machine; EHGB: Ensemble Hist Gradient Boosting; DT: Decision Tree; LDA: Linear Discriminant Analysis; KNN: K-Nearest Neighbors; NB: Naïve Bayes.
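
As an illustration of the second row of Table 1, the sketch below builds the minimum, mean, maximum, and standard deviation of each variable at 3, 6, and 9 km. It is a minimal example and not the authors' code: the function name `height_statistics`, the nested-dictionary input layout, the feature names, and the use of the population standard deviation are all assumptions made here for illustration.

```python
from statistics import fmean, pstdev

# Hypothetical input layout: {variable: {height_km: [values inside the monitored area]}}.
def height_statistics(samples, heights_km=(3, 6, 9)):
    """Return min/mean/max/std features for every variable and height."""
    features = {}
    for variable, by_height in samples.items():
        for height in heights_km:
            values = by_height[height]
            prefix = f"{variable}_{height}km"
            features[f"{prefix}_min"] = min(values)
            features[f"{prefix}_mean"] = fmean(values)
            features[f"{prefix}_max"] = max(values)
            # Population standard deviation is an assumption; the table does not
            # say whether the sample or population form was used.
            features[f"{prefix}_std"] = pstdev(values)
    return features

# Toy values for a single variable, invented for the example.
example = {"zh": {3: [10.0, 20.0, 30.0], 6: [5.0, 15.0], 9: [0.0, 0.0]}}
feats = height_statistics(example)
print(len(feats))  # 1 variable x 3 heights x 4 statistics = 12
```

With the four variables listed in Table 1, the same construction would give 4 × 3 × 4 = 48 candidate features before any feature selection.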

Table 2. Definition of atmospheric layers based on temperature and altitude.

Layer	Temperature	Height
Warm	Above 0 °C	Less than 5.112 km
Mixed 1	0 to −15 °C	Greater than or equal to 5.112 km and less than 7.429 km
Mixed 2	−15 to −40 °C	Greater than or equal to 7.429 km and less than 10.960 km
Cold	Below −40 °C	Greater than 10.960 km
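
The height thresholds in Table 2 can be read as a simple lookup from altitude to layer. The sketch below is one possible encoding; the function name `layer_for_height` and the lowercase layer labels are hypothetical, and assigning exactly 10.960 km to the cold layer is an assumption, since the table lists that boundary only as "greater than".

```python
# Height thresholds in km, taken from Table 2.
WARM_TOP_KM = 5.112
MIXED1_TOP_KM = 7.429
MIXED2_TOP_KM = 10.960

def layer_for_height(height_km):
    """Map an altitude in km to one of the four layers of Table 2."""
    if height_km < WARM_TOP_KM:
        return "warm"
    if height_km < MIXED1_TOP_KM:
        return "mixed 1"
    if height_km < MIXED2_TOP_KM:
        return "mixed 2"
    # Assumption: a height of exactly 10.960 km is also treated as cold.
    return "cold"

for h in (3.0, 6.0, 9.0, 12.0):
    print(h, layer_for_height(h))
```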

Table 3. Hyperparameters used in the GridSearch procedure for the ML algorithms in Approach 1.

Algorithm	Hyperparameters
RF	{‘n_estimators’: [100, 200, 300], ‘max_depth’: [None, 10, 20, 30], ‘min_samples_split’: [2, 5, 10]}
LR	{‘solver’: [‘liblinear’, ‘saga’], ‘C’: [0.001, 0.01, 0.1, 1, 10, 100], ‘penalty’: [‘l1’]}
XGBoost	{‘n_estimators’: [100, 200, 300], ‘max_depth’: [3, 6, 9, 10], ‘learning_rate’: [0.001, 0.01, 0.1, 0.2]}
SVM	{‘C’: [0.1, 1, 10, 100], ‘kernel’: [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’], ‘degree’: [2, 3, 4], ‘gamma’: [‘scale’, ‘auto’]}
EHGB	{‘learning_rate’: [0.01, 0.1, 0.2], ‘max_iter’: [100, 200], ‘max_depth’: [None, 10, 20], ‘min_samples_leaf’: [20, 50]}
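
A grid search evaluates every combination of the listed values, so the size of each grid drives the tuning cost. The sketch below enumerates the RF grid from Table 3 with the standard library only; the helper `expand_grid` is hypothetical and stands in for whatever search routine was actually used.

```python
from itertools import product

# RF grid copied from Table 3.
rf_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}

def expand_grid(grid):
    """Return one dict per combination in the Cartesian product of the grid."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

candidates = expand_grid(rf_grid)
print(len(candidates))  # 3 * 4 * 3 = 36
```

Under cross-validation, the number of model fits would be this count multiplied by the number of folds.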

Table 4. Hyperparameters used in the GridSearch procedure for the ML algorithms in Approach 2.

Algorithm	Hyperparameters
RF	{‘n_estimators’: [100, 150, 150], ‘max_depth’: [10, 15, 20], ‘min_samples_split’: [2, 5, 10], ‘min_samples_leaf’: [1, 3, 5]}
XGBoost	{‘lambda’: [1 × 10⁻³, 1.0, log = True], ‘alpha’: [1 × 10⁻³, 1.0, log = True], ‘learning_rate’: [0.001, 0.01, 0.1, 0.2], ‘n_estimators’: [50, 100], ‘max_depth’: [3, 6, 10], ‘subsample’: [0.5, 0.9], ‘colsample_bytree’: [0.5, 0.9], ‘scale_pos_weight’: [1.0, 5.0]}
DT	{‘criterion’: [‘gini’, ‘entropy’], ‘splitter’: [‘best’, ‘random’], ‘max_depth’: [3, 5, 7, 9], ‘min_samples_split’: [2, 5, 10], ‘min_samples_leaf’: [1, 2, 5], ‘ccp_alpha’: [0.0, 0.01, 0.1]}
LDA	{‘solver’: [‘lsqr’, ‘eigen’], ‘shrinkage’: [None, ‘auto’, 0.1, 0.5]}
KNN	{‘n_neighbors’: [3, 5, 7, 10], ‘weights’: [‘uniform’, ‘distance’], ‘metric’: [‘euclidean’, ‘manhattan’], ‘algorithm’: [‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’]}
NB	{‘var_smoothing’: [1 × 10⁻⁹, 1 × 10⁻⁸, …, 1 × 10⁻⁵]}

Table 5. Hyperparameters used in the GridSearch procedure for the ML algorithms in Approach 3.

Algorithm	Hyperparameters
LR	{‘penalty’: [‘l1’, ‘l2’, ‘elasticnet’, ‘None’], ‘C’: [0.001, 0.01, 0.1, 1, 10, 100], ‘solver’: [‘lbfgs’, ‘liblinear’, ‘saga’], ‘l1_ratio’: [0.0, 0.5, 1.0], ‘max_iter’: [100, 200, 500]}
DT	{‘criterion’: [‘gini’, ‘entropy’, ‘log_loss’], ‘splitter’: [‘best’, ‘random’], ‘max_depth’: [None, 5, 10, 20, 30], ‘min_samples_split’: [2, 5, 10], ‘min_samples_leaf’: [1, 2, 4], ‘max_features’: [None, ‘sqrt’, ‘log2’], ‘class_weight’: [None, ‘balanced’]}
RF	{‘n_estimators’: [50, 100, 200, 500], ‘max_depth’: [None, 10, 20, 30], ‘min_samples_split’: [2, 5, 10, 15], ‘min_samples_leaf’: [1, 2, 4, 6], ‘max_features’: [‘auto’, ‘sqrt’, ‘log2’], ‘bootstrap’: [True, False], ‘criterion’: [‘gini’, ‘entropy’]}
SVM	{‘C’: [0.1, 1, 10, 100, 1000], ‘kernel’: [‘linear’, ‘rbf’, ‘poly’, ‘sigmoid’], ‘gamma’: [‘scale’, ‘auto’, 0.1, 0.01, 0.001, 0.0001], ‘degree’: [2, 3, 4], ‘coef0’: [0.0, 0.1, 0.5, 1.0]}

Table 6. Monthly results obtained by the bagging ensemble model using the principal components approach.

Class	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Avg
0	0.84	0.91	0.87	0.93	0.98	0.98	0.97	0.91	0.97	0.98	0.95	0.99	0.94
1	0.92	0.83	0.93	0.87	1.00	0.78	0.00	0.76	0.80	0.71	0.81	0.53	0.66

Table 7. Comparative results achieved in the three approaches for lightning prediction.

Approach	Best Model	Precision	Recall	F1-Score	Key Observations
Approach 1 (Grouping by Layers)	EHGB	0.61	0.99	0.75	High recall; limited precision gains with tuning
Approach 2 (Descriptive Statistics by Height)	Decision Tree	0.83	0.79	0.74	Effective in capturing microphysics but recall can be improved
Approach 3 (zh at 18 Heights with PCA)	Ensemble (Top 5 Models)	0.68	0.89	0.77	Best generalization; computationally expensive but robust
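
For reference, the F1-score is the harmonic mean of precision and recall. The short sketch below applies that standard definition to the Approach 1 and Approach 3 rows of Table 7; the function name `f1_from` is hypothetical. The relation holds only when all three values describe the same class and averaging scheme, which the table does not state.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall values copied from Table 7.
print(round(f1_from(0.61, 0.99), 2))  # Approach 1 -> 0.75
print(round(f1_from(0.68, 0.89), 2))  # Approach 3 -> 0.77
```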

Table 8. Performance comparison of LEWAIS and integration strategies with the Ensemble model.

Prediction System	FAR	FTW	Downtime	Lead Time (min)
Model 1—LEWAIS	0.6517	0.1206	0.0407	8.87
Model 2—Conditional Ensemble	0.8451	0.0967	0.0930	9.57
Model 3—Combined alerts	0.8518	0.0531	0.1031	10.18
Model 4—Ensemble priority	0.5556	0.2202	0.0278	7.93
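
The combined-alerts strategy issues a warning whenever either system does. The sketch below shows that logical OR on toy data, together with simplified per-interval versions of the false alarm ratio (FAR) and failure-to-warn rate (FTW). The function names, the per-interval definitions of both metrics, and all example values are assumptions for illustration; the operational scoring may be event-based and is not reproduced here.

```python
def combine_alerts(ml_alerts, lewais_alerts):
    """Alert in an interval if either system alerts (logical OR)."""
    return [ml or lw for ml, lw in zip(ml_alerts, lewais_alerts)]

def far_and_ftw(alerts, events):
    """Simplified per-interval FAR and FTW (assumed definitions)."""
    hits = sum(a and e for a, e in zip(alerts, events))
    false_alarms = sum(a and not e for a, e in zip(alerts, events))
    misses = sum(e and not a for a, e in zip(alerts, events))
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else 0.0
    ftw = misses / (hits + misses) if hits + misses else 0.0
    return far, ftw

# Toy data: five time intervals, invented for the example.
ml = [True, False, False, True, False]
lewais = [False, False, True, True, False]
events = [True, False, True, False, True]

combined = combine_alerts(ml, lewais)
print(far_and_ftw(ml, events))
print(far_and_ftw(combined, events))
```

In this toy case, the combined alerts miss fewer events than the ML alerts alone, which mirrors the direction of the FTW change between the rows of Table 8. The table also shows a higher FAR and more downtime for the combined strategy.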

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Alves, M.A.; Molina, R.A.; Oliveira, B.A.S.; Calvo, D.; Araujo Filho, M.C.A.; Ferreira, D.B.d.S.; Santos, A.P.P.; Saraiva, I.; Pinto, O., Jr.; Daher, E.L. Lightning Nowcasting Using Dual-Polarization Weather Radar and Machine Learning Approaches: Evaluation of Feature Engineering Strategies and Operational Integration. Climate 2025, 13, 168. https://doi.org/10.3390/cli13080168
