1. Introduction
Lightning, a transient yet high-impact atmospheric discharge phenomenon, serves as a pivotal indicator of severe convective weather systems (e.g., thunderstorms, tornadoes) and a critical variable for long-term climate analysis. For short-term disaster prevention, real-time lightning detection enables early warnings of extreme weather, reducing casualties and economic losses caused by lightning-induced wildfires, power grid failures, and structural damage [
1]. In climate research, decadal-scale lightning pattern analysis provides insights into atmospheric convection dynamics, global warming-induced changes in storm intensity, and variations in atmospheric electrical activity [
2]. Among diverse lightning-monitoring technologies, optical detection—relying on high-speed cameras or ground-based optical sensors—offers unique advantages: it captures direct visual information of discharge morphology (e.g., channel structure, brightness dynamics) that complements electromagnetic or radar-based methods, making it indispensable for fine-grained lightning classification [
3].
However, practical optical lightning monitoring faces three intertwined challenges that have long hindered the development of robust classification systems. The first is lightning’s instantaneity: a typical cloud-to-ground lightning discharge lasts only 10–100 milliseconds, manifesting as abrupt brightness spikes across 3–5 consecutive frames in high-speed video. This “single-frame abrupt change” contrasts sharply with the gradual feature variations targeted by most image recognition models, often leading to confusion with transient noise (e.g., sunlight glints on clouds, fast-moving birds) [
4]. The second is morphological diversity: lightning discharges exhibit highly variable forms—branched (tree-like channels), spherical (rare ball lightning), and cloud-to-ground (single thick channels)—each with distinct spatial features (e.g., edge sharpness, brightness distribution). This diversity renders manually designed feature sets (e.g., edge density, texture statistics) ineffective, as a feature optimized for branched lightning fails to generalize to spherical events [
5]. Third, and most restrictive, is the scarcity of labeled samples: lightning events are spatially scattered and unpredictable, making large-scale labeled dataset construction logistically costly and time-consuming. For example, collecting over 1000 annotated lightning video clips requires years of continuous observation across multiple sites, leaving most studies with limited samples that exacerbate overfitting in deep learning models [
6].
Remote sensing and meteorological communities have long sought efficient methods to mitigate such sample scarcity, and recent advances in few-shot learning (FSL) have shown promise in remote sensing scene classification—Wang et al. evaluated a meta-transfer approach for few-shot remote sensing scene classification, demonstrating that transferable meta-learners can mitigate data scarcity issues even with extremely limited samples (e.g., 1–5 labeled samples per class) [
7]. However, direct adaptation of these FSL methods to lightning optical data remains underexplored: most existing FSL frameworks for remote sensing focus on static scenes (e.g., forests, deserts) and fail to account for lightning’s unique “millisecond-scale instantaneity”—a critical constraint for capturing the transient brightness changes that define true discharges. This gap means even state-of-the-art FSL models struggle to generalize to lightning’s dynamic morphology, let alone meet the real-time inference demands of field-deployed monitoring systems.
Existing solutions have struggled to address these challenges comprehensively. Traditional optical classification methods rely on handcrafted features and shallow classifiers: edge detection (e.g., Canny operator) identifies lightning channels but confuses jagged cloud edges with branched lightning [
4]; optical flow tracks motion but is disrupted by fast-moving cloud masses, leading to false positives in stormy weather [
5]; and frame difference methods (e.g., Otsu-thresholded frame subtraction) reduce static background noise but fail to capture the discriminative temporal patterns of dim or short-duration lightning [
2]. These methods lack adaptability to complex environments (e.g., fog, heavy rain) and cannot scale to diverse lightning morphologies.
Deep learning has advanced optical recognition by automating feature extraction, but its application to lightning classification is constrained by the few-shot dilemma. Convolutional Neural Networks (CNNs) like MobileNetV2 [
8] achieve high accuracy on large datasets but overfit severely when labeled samples are scarce—for instance, a CNN trained on 50 branched lightning samples may misclassify spherical lightning as non-lightning due to limited exposure to diverse morphologies [
6]. Few-shot learning (FSL) frameworks, designed to generalize from limited examples, have shown promise in remote sensing but remain underadapted to lightning optical data: Prototypical Networks [
9] learn class centroids but assume uniform feature distribution, which fails for lightning’s scattered morphological features; adaptive metric learning [
10] improves inter-class separation but ignores lightning’s transient temporal cues; and self-supervised pre-training [
3] (e.g., Masked Autoencoders on waveforms) focuses on 1D electromagnetic signals, not 2D optical frame dynamics critical for classification [
4].
To bridge these gaps, this study proposes a Frame Difference Triplet Network (FD-TripletNet), a novel deep learning framework tailored for few-shot optical lightning classification. The core innovations address the three key challenges: (1) frame difference matrices—computed as the absolute pixel-wise difference between consecutive frames—explicitly capture lightning’s “single-frame abrupt brightness change,” suppressing noise from static backgrounds or slow-moving clouds [
2]; (2) Triplet Loss with dynamic hard example mining enhances discriminative feature learning by compacting intra-class features (e.g., different branched lightning events) and separating inter-class features (e.g., lightning vs. strong light glare), even with limited labeled samples [
3]; (3) non-consecutive frame selection balances efficiency and robustness by retaining only the most informative frames (optimal K = 4), avoiding redundancy while ensuring coverage of transient discharge processes [
5]. Additionally, the framework adopts a lightweight MobileNetV2 backbone [
11], enabling real-time inference on edge devices (e.g., remote weather stations) to fill the ground-based monitoring gap in sparsely instrumented areas [
1].
The remainder of this manuscript is structured as follows:
Section 2 reviews related work on lightning detection, few-shot learning, and efficient network architectures;
Section 3 details the design of FD-TripletNet, including data preprocessing, network architecture, loss function, and training strategies;
Section 4 presents experimental results and comparative analyses with baseline methods;
Section 5 discusses the model’s strengths, limitations, and practical implications; and
Section 6 concludes with future research directions.
3. Model Design and Research
This study designs a deep learning model that integrates multimodal information and triplet learning mechanisms for lightning detection tasks. The model is optimized across data processing, network architecture, loss function, and training strategy; the following subsections detail the design rationale and key technical implementations.
3.1. Dataset Construction and Preprocessing
To achieve accurate lightning detection, this study constructs a video dataset with two types of labels: "with lightning" and "without lightning." In the data preprocessing stage, a multi-frame processing strategy is designed around the temporal characteristics of video data.
Optical lightning discharges are characterized by abrupt and transient brightness variations embedded within complex and dynamic backgrounds. To emphasize discriminative temporal changes while suppressing static scene content, temporal difference is adopted as the core preprocessing and representation strategy.
Given an input optical video sequence $V = \{F_1, F_2, \dots, F_T\}$, each frame $F_t$ is first converted to grayscale and normalized to the range [0, 1] to reduce sensor-dependent variability and stabilize network training. These preprocessing steps are intentionally lightweight and do not introduce explicit denoising operations, allowing the subsequent temporal representation to remain physically interpretable. Noise suppression is achieved primarily through temporal differencing and frame selection rather than explicit spatial denoising, ensuring that physically meaningful brightness variations are preserved.
To highlight lightning-induced brightness changes, the frame difference representation is defined as the absolute pixel-wise intensity difference between frames separated by a temporal interval $\Delta t$:
$$D_{\Delta t}(t) = \left| F_{t} - F_{t-\Delta t} \right|.$$
Adjacent-frame differencing ($\Delta t = 1$) effectively captures abrupt intensity changes associated with return strokes. However, this formulation implicitly assumes that discriminative temporal information is concentrated within a single-frame interval and may fail to adequately represent gradually evolving discharge processes, such as leader development or decay phases, which often exhibit weaker but temporally extended optical emissions.
To accommodate the heterogeneous temporal dynamics of lightning discharges, the frame difference operation is extended to a multi-scale temporal formulation by computing difference maps over multiple temporal intervals:
$$\left\{ D_{\Delta t}(t) = \left| F_{t} - F_{t-\Delta t} \right| \;:\; \Delta t \in \mathcal{T} \right\},$$
where $\mathcal{T}$ denotes the set of selected temporal intervals. This multi-scale representation preserves sensitivity to instantaneous brightness spikes while improving temporal coverage for weaker and gradually evolving discharge processes. It should be noted that the selected temporal intervals do not impose strict or exclusive assumptions on lightning discharge phases. Instead, they provide complementary temporal coverage for heterogeneous discharge dynamics, allowing the model to flexibly capture both abrupt and gradual brightness variations without explicitly segmenting physical stages.
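As an illustration, a minimal NumPy sketch of this multi-scale frame differencing is given below; the example interval set (1, 2, 3) is an illustrative assumption rather than the exact configuration used in this study.

```python
import numpy as np

def multi_scale_frame_difference(frames, intervals=(1, 2, 3)):
    """Compute absolute frame-difference maps at several temporal intervals.

    frames    : ndarray of shape (T, H, W), grayscale frames normalized to [0, 1]
    intervals : temporal intervals (in frames) over which differences are taken
    Returns a dict mapping each interval dt to an array of shape (T - dt, H, W).
    """
    diffs = {}
    for dt in intervals:
        # |F_t - F_{t - dt}| highlights brightness changes over the interval dt
        diffs[dt] = np.abs(frames[dt:] - frames[:-dt])
    return diffs

# Example: 16 synthetic 64x64 frames with an abrupt brightness spike at t = 8
frames = np.zeros((16, 64, 64), dtype=np.float32)
frames[8, 20:30, 20:30] = 1.0
diff_maps = multi_scale_frame_difference(frames)
print({dt: d.shape for dt, d in diff_maps.items()})
```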
3.2. Network Architecture Design
The proposed separated-path model processes frame difference information and original frame information along independent paths and then fuses them. As shown in
Figure 1, its architecture includes four core modules: frame difference feature-extraction module, original frame feature-extraction module, temporal aggregation module, and classification decision module.
3.3. Frame Difference Feature-Extraction Module
To effectively extract features from the frame difference matrix, a 1 × 1 convolutional layer is first introduced to convert the single-channel frame difference matrix into a three-channel representation. This operation not only increases the number of data channels but also extracts local features through the convolution operation. MobileNetV2 is then connected as the backbone network; its core depthwise separable convolution decomposes a traditional convolution into a depthwise convolution and a pointwise convolution. The depthwise convolution performs spatial feature extraction independently for each input channel, and the pointwise convolution (1 × 1 convolution) fuses information across channels. For a convolutional layer with input feature map size $H \times W$, input channel number $C_{in}$, output channel number $C_{out}$, and kernel size $k$, the computational complexity of a traditional convolution is
$$k^{2} \cdot C_{in} \cdot C_{out} \cdot H \cdot W,$$
whereas the computational complexity of the depthwise separable convolution is
$$k^{2} \cdot C_{in} \cdot H \cdot W + C_{in} \cdot C_{out} \cdot H \cdot W,$$
which reduces the cost by a factor of approximately $\frac{1}{C_{out}} + \frac{1}{k^{2}}$, significantly lowering the computational complexity.
In addition, MobileNetV2 adopts a linear bottleneck structure consisting of three parts: a 1 × 1 convolution for dimension expansion, a depthwise convolution for spatial feature extraction, and a 1 × 1 convolution for dimension reduction. A linear activation function is used during dimension reduction to avoid information loss in low-dimensional spaces. In this model, MobileNetV2 is initialized with ImageNet pre-trained weights and fine-tuned for the lightning detection task. The extracted feature maps are compressed into fixed-length vectors by an adaptive average pooling layer.
Table 1 shows the parameter settings of each module in the model for this experiment, and
Figure 2 shows the model structure.
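A minimal PyTorch sketch of this frame difference feature-extraction path is given below, assuming torchvision's ImageNet-pretrained MobileNetV2; module and variable names are illustrative rather than the exact implementation used in this study.

```python
import torch
import torch.nn as nn
from torchvision import models

class FrameDiffEncoder(nn.Module):
    """Encode a single-channel frame-difference map into a fixed-length vector."""
    def __init__(self):
        super().__init__()
        # 1x1 convolution expands the single-channel difference map to 3 channels
        self.channel_adapter = nn.Conv2d(1, 3, kernel_size=1)
        # MobileNetV2 backbone initialized with ImageNet pre-trained weights
        backbone = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
        self.features = backbone.features        # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)      # compress feature maps to vectors

    def forward(self, x):
        # x: (B, 1, H, W) frame-difference map
        x = self.channel_adapter(x)
        x = self.features(x)
        return self.pool(x).flatten(1)           # (B, 1280) fixed-length embedding

# Example usage with a dummy 224x224 frame-difference map
encoder = FrameDiffEncoder()
vec = encoder(torch.randn(2, 1, 224, 224))
print(vec.shape)  # torch.Size([2, 1280])
```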
3.4. Loss Function Design
To enhance the model’s discriminative ability and classification accuracy, a joint loss function integrating Triplet Loss and Binary Cross-Entropy Loss is designed to supervise model training from different perspectives.
3.4.1. Adaptive Triplet Loss
In the sample construction stage, triplet samples are generated through the custom TripletLightningDataset class. For each anchor sample (a), the positive sample (p) is selected from samples of the same class with the closest frame difference statistics, and the negative sample (n) is selected from samples of different classes with the most significant frame difference statistics to ensure that the sample group has high discriminability.
Regarding distance metrics, traditional Triplet Loss mostly uses a single distance measure. The adaptive Triplet Loss proposed in this study introduces a learnable weight parameter $\alpha$, which is constrained to the interval [0, 1] through the Sigmoid function, to dynamically fuse the Euclidean distance and the Cosine distance. The specific calculation process is as follows:
1. Euclidean Distance Calculation: measures the absolute distance in the feature vector space. The Euclidean distances between the anchor and the positive/negative samples are
$$d_{E}(a, p) = \lVert f(a) - f(p) \rVert_{2}, \qquad d_{E}(a, n) = \lVert f(a) - f(n) \rVert_{2},$$
where $f(\cdot)$ denotes the feature embedding extracted by the network.
2. Cosine Distance Calculation: measures the directional difference between feature vectors,
$$d_{C}(a, p) = 1 - \frac{f(a) \cdot f(p)}{\lVert f(a) \rVert \, \lVert f(p) \rVert}, \qquad d_{C}(a, n) = 1 - \frac{f(a) \cdot f(n)}{\lVert f(a) \rVert \, \lVert f(n) \rVert}.$$
3. Distance Fusion: the Euclidean and Cosine distances are combined as
$$d(\cdot) = \alpha \, d_{E}(\cdot) + (1 - \alpha) \, d_{C}(\cdot),$$
which allows the model to automatically learn the relative importance of the two distances.
The final loss function is
$$\mathcal{L}_{\text{triplet}} = \max\bigl(d(a, p) - d(a, n) + m,\; 0\bigr),$$
where the margin value m = 0.8. By minimizing this loss, the model learns more discriminative feature representations.
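A minimal PyTorch sketch of the adaptive Triplet Loss is shown below; the parameterization of the learnable weight (a scalar passed through a Sigmoid) follows the description above, while variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTripletLoss(nn.Module):
    """Triplet loss with a learnable fusion of Euclidean and cosine distances.

    A minimal sketch: `alpha_raw` is passed through a Sigmoid so that the
    fusion weight stays in [0, 1]; the margin defaults to 0.8 as in the paper.
    """
    def __init__(self, margin=0.8):
        super().__init__()
        self.margin = margin
        self.alpha_raw = nn.Parameter(torch.zeros(1))   # learnable, Sigmoid -> [0, 1]

    def fused_distance(self, x, y):
        alpha = torch.sigmoid(self.alpha_raw)
        d_euc = F.pairwise_distance(x, y)               # Euclidean distance
        d_cos = 1.0 - F.cosine_similarity(x, y)         # cosine distance
        return alpha * d_euc + (1.0 - alpha) * d_cos

    def forward(self, anchor, positive, negative):
        d_ap = self.fused_distance(anchor, positive)
        d_an = self.fused_distance(anchor, negative)
        return F.relu(d_ap - d_an + self.margin).mean()

# Example usage with random 128-dimensional embeddings
loss_fn = AdaptiveTripletLoss()
a, p, n = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
print(loss_fn(a, p, n))
```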
3.4.2. Binary Cross-Entropy Loss
For the classification task, the Binary Cross-Entropy Loss function (BCEWithLogitsLoss) is adopted:
$$\mathcal{L}_{\text{BCE}} = -\bigl[\, y \log \sigma(\hat{y}) + (1 - y) \log\bigl(1 - \sigma(\hat{y})\bigr) \,\bigr],$$
where $y$ is the true label, $\hat{y}$ is the raw output (logit) of the model, and $\sigma(\cdot)$ is the Sigmoid function; the loss measures the difference between the predicted probability and the true label.
3.4.3. Joint Loss Function
The final joint loss function is
$$\mathcal{L} = \lambda_{1} \, \mathcal{L}_{\text{triplet}} + \lambda_{2} \, \mathcal{L}_{\text{BCE}}.$$
In this study, both $\lambda_{1}$ and $\lambda_{2}$ are set to 1.0, and these weights can be adjusted to balance the influence of the two losses on training.
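A compact sketch of the joint loss is given below, assuming the AdaptiveTripletLoss sketch above and PyTorch's BCEWithLogitsLoss; the function signature is illustrative.

```python
import torch.nn as nn

# Joint loss: weighted sum of the adaptive triplet loss (metric learning) and
# BCEWithLogitsLoss (classification), with lambda1 = lambda2 = 1.0 as in the paper.
bce_loss = nn.BCEWithLogitsLoss()

def joint_loss(triplet_loss_fn, anchor, positive, negative, logits, labels,
               lambda1=1.0, lambda2=1.0):
    l_triplet = triplet_loss_fn(anchor, positive, negative)
    l_bce = bce_loss(logits, labels.float())
    return lambda1 * l_triplet + lambda2 * l_bce
```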
3.5. Temporal Frame Selection and Sequence Construction
Lightning events are temporally sparse and typically occupy only a limited subset of frames within a continuous video stream. Therefore, temporal selection is performed at two complementary levels: frame-level screening and sequence-level construction.
3.5.1. Frame-Level Screening Based on Intensity Variation
In practical implementation, a statistical screening strategy based on the average pixel intensity variation of the frame-difference matrix is adopted to identify informative frames. For each multi-scale frame-difference map, a response value $\bar{d}$ is computed as the mean pixel value of the corresponding matrix.
According to the magnitude of $\bar{d}$, frame-difference maps are categorized into three intensity intervals:
Low-intensity (small $\bar{d}$), corresponding to weak brightness changes typically observed during leader or decay phases;
Medium-intensity (moderate $\bar{d}$), representing transitional discharge states;
High-intensity (large $\bar{d}$), associated with the abrupt brightness spikes of return stroke phases.
Within each intensity interval, frame-difference maps are sorted in descending order of $\bar{d}$ while preserving their original temporal order to maintain the integrity of discharge dynamics. For a target sequence length K, frames are sampled using a proportional allocation strategy based on a 2:1:1 intensity template (high:medium:low). The exact number of frames selected from each interval is scaled with K, prioritizing high-intensity responses while maintaining coverage of medium-intensity and low-intensity discharge phases. This design emphasizes phase representation rather than enforcing fixed frame counts and therefore generalizes naturally across different sequence lengths.
For rare lightning events, such as dim intra-cloud flashes that lack high-intensity responses, the required frames are supplemented by selecting the highest-$\bar{d}$ samples from adjacent intensity intervals to ensure sufficient temporal coverage.
To further suppress background interference, an isolated-frame exclusion mechanism based on temporal continuity is introduced. Low-intensity frame-difference maps are retained only if temporally adjacent frames within a fixed-length window exhibit consistent intensity responses. Slow and continuous wind-induced interference, such as swaying vegetation, typically generates consecutive low-intensity frame differences with stable spatial distributions, whereas lightning discharges produce transient and spatially dispersed brightness changes. By exploiting this difference in temporal continuity, the proposed mechanism provides a principled way to distinguish slow background motion from lightning-induced variations.
The selected frames preserve their relative temporal order to ensure logical continuity of the discharge process.
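A minimal NumPy sketch of the intensity-based frame screening is given below; the thresholds and the exact realization of the 2:1:1 allocation are illustrative assumptions, not the values used in this study.

```python
import numpy as np

def select_frames(diff_maps, k=4, low_thr=0.02, high_thr=0.10):
    """Intensity-based frame screening (sketch).

    diff_maps : ndarray of shape (N, H, W), frame-difference maps in temporal order
    k         : target number of frames to retain
    """
    means = diff_maps.reshape(len(diff_maps), -1).mean(axis=1)   # mean response per map

    # Partition indices into high / medium / low intensity intervals
    high = [i for i, m in enumerate(means) if m >= high_thr]
    med = [i for i, m in enumerate(means) if low_thr <= m < high_thr]
    low = [i for i, m in enumerate(means) if m < low_thr]

    # 2:1:1 allocation scaled with k, taking the strongest responses per interval
    quota = [(high, k // 2), (med, k // 4), (low, k - k // 2 - k // 4)]
    selected = []
    for idx, n in quota:
        idx_sorted = sorted(idx, key=lambda i: means[i], reverse=True)
        selected.extend(idx_sorted[:n])

    # Supplement from the remaining strongest frames if an interval is too small
    if len(selected) < k:
        rest = [i for i in np.argsort(means)[::-1] if i not in selected]
        selected.extend(rest[:k - len(selected)])

    # Preserve the original temporal order of the selected frames
    return sorted(selected)

# Example with 20 random difference maps and a simulated return-stroke spike
maps = np.random.rand(20, 64, 64) * 0.05
maps[7] += 0.3
print(select_frames(maps, k=4))
```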
3.5.2. Sequence-Level Temporal Length Determination
After frame-level screening, the selected frame-difference maps are assembled into temporal sequences of fixed length K, which serve as the input to the network. The choice of sequence length directly affects the balance between capturing complete discharge dynamics and avoiding redundant background information.
From a physical perspective, optical lightning discharges typically manifest over several consecutive frames in high-speed video. Based on this observation, candidate sequence lengths in the range K = 3 to 7 are evaluated. Shorter sequences may fail to capture complete discharge signatures, while longer sequences tend to introduce redundant background fluctuations.
The optimal sequence length is determined empirically through comparative experiments, ensuring a balance between physical interpretability and data-driven optimization.
For short video segments containing fewer than K valid frames, or in degenerate cases where only a single frame is available, a default padding mechanism is applied. All-zero matrices are used to fill missing positions, ensuring consistency of input dimensions without introducing artificial brightness patterns.
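For completeness, a minimal sketch of the sequence assembly with zero-padding is shown below; the function name and argument layout are illustrative.

```python
import numpy as np

def build_sequence(diff_maps, selected_idx, k):
    """Assemble a fixed-length input sequence, zero-padding when needed (sketch)."""
    seq = [diff_maps[i] for i in selected_idx[:k]]
    h, w = diff_maps.shape[1:]
    while len(seq) < k:
        seq.append(np.zeros((h, w), dtype=diff_maps.dtype))  # all-zero padding frame
    return np.stack(seq)  # shape (k, H, W)
```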
Figure 3 shows the experimental procedure for sequence length optimization.
3.6. Training Strategies
To achieve efficient model training and stable convergence, a combination of mixed-precision training, gradient accumulation, and optimizer and learning rate adjustment strategies is used.
3.6.1. Mixed-Precision Training
The autocast and GradScaler of the torch.cuda.amp library are used to implement mixed-precision training. Autocast automatically switches part of the computation to half-precision to reduce memory usage and accelerate computation. GradScaler dynamically adjusts the gradient scaling factor to avoid numerical instability caused by half-precision computation, ensuring training stability and accuracy.
3.6.2. Gradient Accumulation
To solve the problem of small batches under hardware resource constraints, gradient accumulation technology is adopted. Gradients of multiple small batches are accumulated to achieve the effect of equivalent large-batch training. In this study, the gradient accumulation steps are set to 2; that is, model parameters are updated once after calculating the gradients of two small batches, thereby improving training stability and convergence speed without increasing memory consumption.
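The following PyTorch sketch illustrates mixed-precision training combined with two-step gradient accumulation; the model, data loader, optimizer, and loss function are assumed to be defined elsewhere.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
accum_steps = 2  # update parameters once every two small batches

def train_one_epoch(model, train_loader, optimizer, criterion, device="cuda"):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.to(device), labels.to(device)
        with autocast():                       # run the forward pass in mixed precision
            logits = model(inputs)
            loss = criterion(logits, labels) / accum_steps
        scaler.scale(loss).backward()          # scale gradients to avoid underflow
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)             # unscale gradients and apply the update
            scaler.update()
            optimizer.zero_grad()
```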
3.6.3. Optimizer and Learning Rate Adjustment
The Adam optimizer is selected to update model parameters, which combines the advantages of Adagrad and RMSProp and dynamically adjusts the learning rate according to the gradient history, with an initial learning rate set to 0.0001. A learning rate decay strategy is adopted during training: when the validation set accuracy no longer improves for several consecutive epochs, the learning rate is reduced to 0.1 times the original. Meanwhile, an early stopping mechanism is introduced: if the validation set accuracy does not improve for 10 consecutive epochs, training is stopped to avoid overfitting and balance the model’s generalization ability and fitting ability.
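A minimal sketch of this optimizer, learning-rate decay, and early-stopping configuration is shown below; the scheduler patience and the surrounding training/evaluation helpers are illustrative assumptions.

```python
import torch

# Adam optimizer with plateau-based learning-rate decay and early stopping (sketch).
# `model`, `criterion`, `train_loader`, `val_loader`, and the helpers
# `train_one_epoch` / `evaluate` are assumed to be defined elsewhere.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=3)  # patience value is illustrative

best_acc, epochs_without_improvement = 0.0, 0
for epoch in range(50):                       # 50 training epochs
    train_one_epoch(model, train_loader, optimizer, criterion)
    val_acc = evaluate(model, val_loader)
    scheduler.step(val_acc)                   # decay LR when validation accuracy stalls
    if val_acc > best_acc:
        best_acc, epochs_without_improvement = val_acc, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 10:  # early stopping after 10 stagnant epochs
            break
```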
4. Experiments and Results
4.1. Dataset
The data used in this study come from two optical observation stations built by the research team, located in Nanchang, Jiangxi (28.7°N, 115.5°E), and Zhongshan, Guangdong (22.6°N, 113.4°E). The video acquisition system deployed at both stations is standardized to ensure data consistency: high-speed industrial cameras operating at 1000 fps (frames per second) are used to capture the millisecond-scale dynamics of lightning discharges. To ensure stability in outdoor meteorological environments, each camera is installed 2 m above the ground on a weatherproof platform and equipped with a dustproof and waterproof housing to prevent lens fogging in high-humidity conditions. The raw video data are stored in uncompressed raw format to preserve maximum brightness detail and transferred to a local storage server via Ethernet. A total of 459 optical video clips were collected from May 2024 to October 2024, each with a resolution of 1280 × 1024. The dataset includes lightning samples under different lighting conditions and degrees of cloud occlusion, while the non-lightning data include strong light irradiation, moving objects, and other events easily mistaken for lightning. Using OpenCV 4.8.0, the raw uncompressed videos were processed as follows: files were loaded via cv2.VideoCapture(), the frame count and 1000 fps frame rate were validated, and 5–10 s valid segments were retained. Pixel-wise absolute difference matrices were then computed for consecutive frames to highlight lightning's transient brightness changes, and the data were finally archived as compressed .npz files.
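As an illustration of this preprocessing pipeline, a minimal OpenCV/NumPy sketch is given below; the file paths, grayscale conversion, and normalization details are illustrative assumptions rather than the exact processing scripts used in this study.

```python
import cv2
import numpy as np

def preprocess_video(path, out_path, expected_fps=1000):
    """Load a raw video, validate its frame rate, compute consecutive frame
    differences, and archive them as a compressed .npz file (sketch)."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    if abs(fps - expected_fps) > 1:
        raise ValueError(f"Unexpected frame rate: {fps}")

    frames = []
    ok, frame = cap.read()
    while ok:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
        frames.append(gray)
        ok, frame = cap.read()
    cap.release()

    frames = np.stack(frames)                    # (T, H, W), normalized to [0, 1]
    diffs = np.abs(frames[1:] - frames[:-1])     # consecutive frame differences
    np.savez_compressed(out_path, frames=frames, diffs=diffs)

# preprocess_video("clip_0001.avi", "clip_0001.npz")  # illustrative paths
```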
The dataset structure is visualized in
Figure 4. The dataset is divided into Lightning and Non-lightning categories: for lightning samples, we further distinguish strong lightning (sharp edges) and weak lightning (blurred edges); the Non-lightning category includes no lightning (background only) and living (interference like moving objects). To address the issue that lightning channels are not easily distinguishable in original grayscale images, we use a lightning channel-enhanced visualization to highlight the discharge paths, which is only for better visual presentation (no temporal compositing methods involved in the model).
In this dataset, 319 video files are used for training, and 140 video files are used for testing. The divided dataset is shown in the
Table 2.
4.2. Evaluation Metrics
This study uses accuracy, precision, recall, F1-score, False Positive Rate (FPR), False Negative Rate (FNR), and confusion matrix to measure the experimental results of the lightning recognition method. The specific calculation formulas are as follows:
1. Accuracy: measures the overall correctness of predictions.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
2. Precision: reflects the proportion of correctly predicted lightning events among all predicted lightning events.
$$\text{Precision} = \frac{TP}{TP + FP}$$
3. Recall: indicates the proportion of actual lightning events that are correctly identified.
$$\text{Recall} = \frac{TP}{TP + FN}$$
4. F1-Score: a balanced measure that combines precision and recall.
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
5. FPR: quantifies the proportion of non-lightning events incorrectly classified as lightning.
$$\text{FPR} = \frac{FP}{FP + TN}$$
6. FNR: quantifies the proportion of actual lightning events incorrectly classified as non-lightning.
$$\text{FNR} = \frac{FN}{FN + TP}$$
7. Confusion Matrix: a tabular representation of prediction outcomes in terms of TP, TN, FP, and FN.
In these formulas, the variables are defined as follows:
TP (True Positives): Number of lightning events correctly classified as lightning.
TN (True Negatives): Number of non-lightning events correctly classified as non-lightning.
FP (False Positives): Number of non-lightning events incorrectly predicted as lightning.
FN (False Negatives): Number of lightning events incorrectly predicted as non-lightning.
These metrics collectively provide a comprehensive assessment of the model’s ability to distinguish between lightning and non-lightning events, particularly critical for minimizing false alarms in real-world applications.
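For clarity, the sketch below computes these metrics directly from confusion-matrix counts; the example call uses the single-scale counts reported in Section 4.6.3.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics used in this study from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "fpr": fpr, "fnr": fnr}

# Example: single-scale confusion-matrix counts from Section 4.6.3
print(classification_metrics(tp=81, tn=49, fp=5, fn=5))
```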
4.3. Experimental Settings
The experiments in this study are implemented in Python 3.12 on a Windows 10 system with an NVIDIA GeForce RTX 4060 GPU. The training configuration includes an initial learning rate of 0.001, a batch size of 32, and 50 training epochs with cosine decay scheduling. The joint loss function combines Triplet Loss and cross-entropy loss to optimize the lightning recognition model, effectively addressing the challenges of few-shot learning and instantaneous feature extraction in lightning optical classification.
4.4. Baseline Methods
The proposed method is compared with the following baselines:
1. Traditional Frame Difference with SVM [5]: uses Otsu threshold segmentation to extract frame difference features, followed by classification with SVM.
2. MobileNetV2 [8]: directly inputs original frames into MobileNetV2 without frame difference or Triplet Loss.
3. TripletNet (Static Input) [24]: uses Triplet Loss but inputs static single frames instead of frame difference matrices.
4. Single-Scale FD-TripletNet ($\Delta t = 1$): uses only adjacent-frame differences.
5. Multi-Scale FD-TripletNet (multiple temporal intervals $\Delta t$): the proposed variant incorporating multi-scale temporal differencing.
In addition, we discuss the work of Qian et al. [
14], which achieves competitive lightning identification performance on large-scale datasets using deep learning. However, their approach relies on extensive labeled data (over 10,000 samples), whereas FD-TripletNet targets few-shot scenarios (fewer than 300 labeled samples per class), which better reflects realistic constraints for rare lightning events. Therefore, direct performance comparison under identical data regimes is non-trivial.
4.5. Overall Performance Comparison
We first compare the overall classification performance of all methods.
Table 3 summarizes the results on the test set in terms of accuracy, precision, recall, F1-score, false positive rate (FPR), and false negative rate (FNR).
Traditional methods perform poorly because their manual features are easily disturbed by cloud movement and noise. MobileNetV2 shows improvement but lacks robustness in few-shot scenarios, with lower recall for weak lightning. TripletNet (Static Input) benefits from Triplet Loss yet fails to capture lightning's temporal dynamics, resulting in lower recall than both FD-TripletNet variants. The Single-Scale FD-TripletNet ($\Delta t = 1$) excels at capturing instantaneous return stroke signals but misses gradual discharge processes, limiting its performance on weak or complex lightning. The Multi-Scale FD-TripletNet achieves comprehensive improvement, with accuracy increasing by 2.5%, recall rising by 4.4%, and FNR dropping by 2.7% compared to the single-scale variant. These gains stem from its multi-scale fusion design: adjacent-frame differences ($\Delta t = 1$) capture instantaneous return stroke signals, larger temporal intervals retain gradual leader phase dynamics, and the model automatically weights discriminative features across scales to avoid redundancy. The multi-scale variant also demonstrates stronger robustness to weak lightning and low-SNR scenarios, addressing the single-scale variant's blind spot in gradual discharge processes.
It is worth emphasizing that the proposed multi-scale fusion does not rely on an explicit attention or weighting mechanism. Instead, the relative contribution of different temporal scales is learned implicitly through end-to-end training within the metric learning framework. This design avoids introducing additional model complexity while allowing the network to adaptively emphasize discriminative temporal information.
4.6. Ablation Study
4.6.1. Impact of Temporal Scale Selection
To investigate the effect of temporal scale selection, we compare the Single-Scale FD-TripletNet ($\Delta t = 1$) with the Multi-Scale FD-TripletNet (multiple temporal intervals).
The results show that incorporating multiple temporal intervals significantly improves recall and reduces FNR. This improvement indicates that larger temporal gaps ($\Delta t > 1$) effectively capture cumulative brightness variations associated with gradual leader phases, complementing the instantaneous return stroke signals captured by adjacent-frame differencing ($\Delta t = 1$).
4.6.2. Impact of Frame Selection Strategies
Figure 5 illustrates the performance difference between frame selection strategies for both single-scale and multi-scale settings.
For the Single-Scale FD-TripletNet ($\Delta t = 1$), the classification accuracy of the non-consecutive frame strategy reaches its maximum of 92.3% at K = 4, and remains stable at approximately 91.3% for larger K, indicating strong resistance to temporal redundancy. Precision further improves with increasing K, peaking at 91.2% at K = 6, while recall remains consistently around 90.5% across all tested sequence lengths—an important property for minimizing missed lightning detections. The corresponding F1-score, which jointly reflects precision and recall, also demonstrates the clear advantage of the non-consecutive frame selection strategy over consecutive sampling.
Error rate analysis shows that non-consecutive frame selection maintains a stable false positive rate below 0.08 and a false negative rate no higher than 0.09, substantially outperforming the larger fluctuations observed under consecutive frame selection. This behavior is consistent with the physical characteristics of lightning discharges, which typically span 3–5 frames. By avoiding redundant adjacent frames, non-consecutive sampling preserves salient abrupt optical changes while effectively suppressing background noise.
For the Multi-Scale FD-TripletNet, accuracy reaches a peak of 94.8% at K = 5 and only exhibits a mild decline to 93.7% at K = 7, reflecting enhanced robustness to temporal redundancy and improved adaptability to multi-scale discharge dynamics. Precision achieves its highest value of 93.5% at K = 6, while recall remains consistently high at approximately 94.1% across all K values, which is particularly critical for safety-sensitive lightning detection tasks. The resulting F1-score further confirms the superiority of the non-consecutive strategy under multi-scale settings.
From an error-rate perspective, non-consecutive frame selection stabilizes the false positive rate below 0.08 and reduces the false negative rate to ≤0.04, outperforming consecutive sampling. This outcome aligns well with the physical structure of lightning discharges: abrupt return strokes are effectively captured at $\Delta t = 1$, while more gradual leader-phase dynamics are retained at larger temporal intervals. When combined with non-consecutive frame selection, the multi-scale differencing strategy aggregates discriminative temporal features across scales while filtering redundant background responses and isolated interference, thereby maximizing the synergy between multi-scale temporal representation and efficient frame sampling.
This superiority stems from the multi-scale strategy's ability to distinguish "lightning dynamics" from "background redundancy": adjacent-frame differencing ($\Delta t = 1$) captures core return stroke signals, larger temporal intervals supplement leader phase features, and non-consecutive selection filters redundant background frames without losing critical temporal information.
4.6.3. Confusion Matrix Analysis
Figure 6 and
Figure 7 present the confusion matrices of the Single-Scale and Multi-Scale FD-TripletNet, respectively.
For the Single-Scale FD-TripletNet ($\Delta t = 1$), the model correctly classifies 81 lightning events and 49 non-lightning events, with 5 false negatives and 5 false positives. Among the false negatives, three cases correspond to weak lightning dominated by prolonged leader phases, which are insufficiently captured by adjacent-frame differencing alone.
For the Multi-Scale FD-TripletNet, the number of false negatives is reduced from 5 to 2, and true positive detections increase accordingly. Notably, only one of the two remaining false negatives corresponds to a weak lightning event, indicating that the inclusion of larger temporal intervals effectively captures gradual discharge dynamics.
Overall, the multi-scale variant demonstrates improved discriminative ability, particularly in reducing missed detections of weak and complex lightning events.
5. Discussion
The experimental results on the optical lightning dataset demonstrate the effectiveness of the proposed Frame Difference Triplet Network (FD-TripletNet) and provide insight into its design rationale, practical applicability, and remaining limitations. Rather than attributing the performance gains to network complexity, the advantages of FD-TripletNet stem from several design choices that align with the physical and observational characteristics of lightning discharges.
5.1. Effectiveness of Design Choices for Lightning Classification
Lightning discharges exhibit extreme temporal instantaneity, often manifesting as abrupt brightness changes within a single or very few frames. By adopting adjacent frame difference matrices as input, FD-TripletNet directly targets this "frame abruptness" characteristic. This design choice is consistent with previous findings that frame difference representations are particularly effective for detecting abrupt temporal changes in dynamic scenes. In the context of lightning, the proposed adaptation further aligns with millisecond-scale discharge dynamics, enabling reliable capture of rapid optical transitions.
In addition, the use of Triplet Loss with dynamic hard example mining addresses the challenge of scarce labeled samples. By optimizing relative distances in the embedding space, the network learns compact intra-class representations while enforcing clear separation between lightning and non-lightning samples. Compared with a TripletNet using static frame input, this strategy improves feature generalization and contributes to a notable increase in classification performance, particularly in terms of F1-score.
The non-consecutive frame selection strategy further enhances robustness and efficiency. By reducing temporal redundancy, it stabilizes classification performance and avoids the erratic behavior that can arise when consecutive frames are dominated by background fluctuations. At the same time, this strategy significantly reduces computational cost, which is essential for deployment on resource-constrained edge devices.
Nevertheless, the effectiveness of multi-scale temporal differencing is influenced by dataset characteristics and noise conditions, and should be interpreted as a conditional improvement rather than a universally optimal choice for all optical lightning monitoring scenarios.
5.2. Practical Value for Real-World Monitoring
From an application perspective, FD-TripletNet's lightweight architecture enables real-time inference on low-power edge platforms such as remote weather observation stations. This capability is particularly valuable in regions where ground-based lightning monitoring infrastructure remains sparse. The achieved low false positive rate reduces unnecessary alarms and operational costs, while the low false negative rate ensures that critical lightning events are unlikely to be missed, supporting applications such as wildfire prevention, power grid protection, and severe weather monitoring.
Rather than replacing space-based lightning detection systems, the proposed framework complements them by providing high-resolution, ground-level validation and localized monitoring, thereby enhancing overall situational awareness.
5.3. Limitations Related to Signal Strength and Environmental Conditions
Despite its effectiveness, FD-TripletNet inherits limitations associated with single-modal optical sensing. Extremely dim lightning events produce weak frame difference signals that can be submerged by sensor noise, accounting for the majority of false negative cases observed in the experiments. This limitation reflects a fundamental constraint of optical-only approaches rather than deficiencies in the network design.
Performance degradation is also observed under adverse weather conditions such as heavy rain or fog. Atmospheric scattering and attenuation distort brightness patterns and reduce contrast, directly affecting the reliability of frame difference representations. While adaptive preprocessing may alleviate this issue to some extent, it cannot be fully resolved within a purely optical framework.
5.4. Background Interference and Motion-Induced False Positives
The model demonstrates strong robustness against slow wind-induced interference, such as swaying tree branches or shaking weeds. These background motions typically generate low-intensity, temporally continuous frame differences with stable spatial distributions, which can be effectively suppressed through intensity-based screening and non-consecutive frame selection.
However, fast-moving small-scale objects, including birds or high-speed insects, remain a challenging source of false positives. Such objects produce localized, transient, and high-intensity brightness changes that partially overlap with lightning features. Their discrete motion trajectories do not satisfy temporal continuity assumptions, making them difficult to filter using frame-level screening alone. This limitation highlights the intrinsic ambiguity of short-duration optical signals in complex outdoor environments.
5.5. Adaptability to Different Optical Configurations
Another practical limitation concerns adaptability to different focal length lenses, which is crucial for large-scale deployment across heterogeneous monitoring stations. In this study, experimental validation of focal length adaptability was not feasible due to objective constraints. Lightning events are highly seasonal and concentrated within a limited collection window, making coordinated sampling across multiple lens types impractical. Moreover, capturing synchronized recordings of the same lightning event with different focal lengths at the same location and time is extremely challenging under natural conditions.
5.6. Future Directions
Future work will focus on integrating multi-modal data sources, such as atmospheric electric field measurements and radar observations, to improve detection of extremely dim lightning. Weather-aware adaptive preprocessing strategies will be explored to mitigate scattering effects under adverse conditions. In addition, multi-scale feature fusion and attention-based spatio-temporal modeling will be investigated to further suppress motion-induced false positives.
To address focal length adaptability, we plan to collaborate with regional meteorological observatories to establish a multi-focal-length lightning monitoring network. By collecting synchronized optical recordings of the same lightning events across different focal lengths, the robustness and generalization of FD-TripletNet under varying optical configurations can be systematically evaluated.
Overall, FD-TripletNet demonstrates that aligning representation learning with lightning’s physical characteristics enables a practical and deployable solution for optical lightning classification.
6. Conclusions
This study proposes a Frame Difference Triplet Network (FD-TripletNet) to address the challenges of scarce labeled samples, extreme temporal instantaneity, and diverse discharge morphologies in optical lightning classification. By leveraging adjacent frame difference representations, metric learning with Triplet Loss, and non-consecutive temporal sampling, the proposed framework effectively alleviates the limitations of traditional handcrafted methods and baseline deep learning models under few-shot conditions.
Experimental evaluation on a self-built dataset of 459 optical lightning samples demonstrates that FD-TripletNet achieves a classification accuracy of 94.8%, a false positive rate of 7.4%, and a false negative rate of 3.2%, outperforming frame difference plus SVM, MobileNetV2, and static-input TripletNet baselines. The lightweight MobileNetV2-based architecture further enables real-time inference on edge devices, supporting practical deployment in remote observation stations and severe convective weather monitoring systems.
At the same time, the study highlights inherent limitations of single-modal optical approaches, including reduced sensitivity to extremely dim lightning, performance degradation under heavy rain or fog, and susceptibility to fast-moving non-lightning objects. Addressing these challenges will require multi-modal sensing, adaptive temporal modeling, and more comprehensive interference-aware training strategies.
In summary, FD-TripletNet provides a physically informed, data-efficient, and deployable framework for optical lightning classification, laying a solid foundation for advancing real-time ground-based meteorological monitoring and early warning applications.