1. Introduction
With the growing development of modern power grids, the secure and stable operation of transmission lines has become an essential premise for ensuring the reliable supply of electricity. In actual engineering applications, transmission lines are often routed through mountainous areas, forests, and regions of high human activity, where the environmental conditions are extremely complex and dynamic. Under the joint influence of extreme weather conditions, geographical environment, and human activities, the risk of wildfires is high in the vicinity of transmission lines [1]. During wildfires, hot flames, thick smoke, and suspended particles can greatly weaken the air insulation strength around conductors. In extreme cases, this may cause electrical discharge, insulation failure, or even massive power outages [2,3]. These situations directly affect the security of power systems and may cause serious economic and social losses. Therefore, the development of effective wildfire identification techniques suited to the monitoring of transmission corridors has great engineering value.
Traditional wildfire early warning systems for transmission corridors are based on manual patrols, fixed-point infrared temperature measurement, smoke sensors, multi-sensor fusion systems, or rule-based image processing pipelines [4,5,6]. In practical implementations, the fire region of interest is extracted using heuristic methods such as color thresholding, and classification is performed using handcrafted texture, morphological, or motion features. Due to their simplicity and ease of implementation, these methods were preferred in early engineering applications.
However, the actual environments of transmission corridors are much more difficult than those considered in such approaches. Corridor monitoring images are often affected by strong artificial lighting at night, such as street lighting, car headlights, and their reflections on conductors, towers, and other objects. Other sources of interference are welding flames in maintenance work, open burning, sunlight spots, red-orange objects, and frequent occlusion by smoke or haze. In such environments, manually designed features are very sensitive to changes in illumination and background clutter, leading to high false alarm rates and unstable performance across different corridors, seasons, and time periods, as reported in corridor-related works [7].
Deep learning has greatly influenced the field of visual fire and smoke recognition. Cheng et al. presented a thorough survey on deep learning-based visual fire detection and stated that the performance improvement is closely linked to the data composition and the realism of negative samples [8]. Gragnaniello et al. surveyed fire and smoke detection in video sequences and proposed a taxonomy that takes into account factors that are often underrepresented in benchmark datasets, such as viewpoint, background dynamics, and lighting conditions [9].
From a system point of view, Boroujeni et al. reviewed AI-assisted unmanned aerial systems in the pre-, active, and post-wildfire phases and emphasized that sensing platforms and operational conditions greatly affect the performance of visual recognition systems [10]. Vasconcelos et al. and Saleh et al. reviewed recent progress in deep learning-based fire detection and emphasized that multi-scale feature learning and data-driven learning have become the mainstream design approaches, but failure cases are still prevalent under complex illumination and cluttered backgrounds [11,12]. Özel and Elhanashi et al. further highlighted the long-standing challenges of biased datasets, a lack of hard negative examples, and the trade-off between recognition performance and real-time processing, especially in early warning systems [13,14]. UAV-centric reviews by Bouguettaya et al. and Danish et al. also indicated that viewpoint changes, motion blur, and real-time processing capabilities are major factors that influence detection performance in real-world conditions [15,16]. For smoke recognition, Chaturvedi et al. indicated that smoke is often confused with clouds, fog, and haze in distant outdoor surveillance scenarios [17].
Within the particular context of power transmission line corridors, the task of wildfire identification is faced with challenges over and above those experienced in general outdoor fire detection. The corridor fire detection system is normally designed with fixed cameras running continuously with limited viewpoints and the need to correctly identify actual wildfires among a vast number of light sources and reflective interferences that resemble fire. Wang et al. put forward a metric learning-based improved oriented R-CNN to improve the separability of feature targets and interference patterns within the context of corridors [18]. At the same time, Huang et al. established a physical Bayesian modeling framework to evaluate the vulnerability of transmission line tripping due to wildfires, clearly showing the relationship between wildfire identification performance and power system operation risk [19].
Under practical transmission corridor applications, the performance of wildfire recognition systems is limited not only by model structure but also by operational environment and data characteristics. Fixed corridor cameras are always on and have limited viewpoints, and most of the collected samples are non-fire images dominated by artificial lighting and reflective surfaces. At night, fire-like interferences tend to have visual appearances that are highly similar to real fires, making discrimination much harder. Nighttime fire detection studies in urban areas have indicated that even state-of-the-art deep learning models are highly susceptible to strong illumination interference and temporal variability, causing persistent false alarms [20]. In addition, power line inspection and wildfire risk prediction surveys have highlighted that practical monitoring systems must meet very strict real-time constraints, further limiting the direct application of computationally intensive models [21,22]. Consequently, the problem of accurate wildfire identification with low false alarms and stable real-time performance remains a pressing unsolved issue in transmission corridor monitoring.
To address the problems mentioned above, this paper focuses on wildfire image recognition in the visual monitoring of power transmission corridors. The main contributions are as follows:
(I) A wildfire image dataset for power transmission corridors is built by combining on-site monitoring images with publicly available images. It covers common complex situations such as strong light at night, fire-like interference, and smoke obstruction, and training and testing follow the actual sample distribution;
(II) Multi-scale feature extraction and reparameterization structures are added to the AlexNet backbone network to enhance the model’s representation of wildfire features at multiple scales while preserving inference efficiency;
(III) Lightweight multi-scale attention and hybrid pooling attention mechanisms are designed to improve the model’s ability to distinguish wildfire targets in low-contrast and complex background environments;
(IV) Comparative experiments, ablation experiments, and cross-dataset verification are conducted to analyze the proposed method in terms of recognition accuracy, robustness, and generalization ability.
3. Transmission Line Wildfire Identification Based on Improved AlexNet
This study is based on the AlexNet architecture and integrates the RepGhost-Inception (RGI) module, Light-MSA module, and Hybrid Pooling Attention (HPA) module to construct an improved AlexNet-based wildfire recognition model for power transmission lines. As shown in Figure 5, the proposed network backbone consists of five convolutional layers, with the RGI module embedded after the Pool2 layer for multi-scale feature fusion. The Light-MSA and HPA modules are sequentially attached after the Conv5 layer to enhance global context modeling and interference suppression capabilities. To prevent the fully connected layer parameters from becoming excessively large due to high-resolution inputs, an adaptive average pooling layer is used to fix the feature dimensions, followed by a single-hidden-layer fully connected classification head.
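The role of the adaptive average pooling layer can be illustrated with a minimal sketch in plain Python. It uses one common binning rule (window start at floor(i·n/out), window end at ceil((i+1)·n/out)); the output size of 6 × 6 is an assumption for illustration only, since the pooled size is not stated here, and the function names are hypothetical.

```python
import math

def adaptive_avg_pool_1d(values, out_size):
    """Average `values` into exactly `out_size` bins, whatever len(values) is."""
    n = len(values)
    pooled = []
    for i in range(out_size):
        start = (i * n) // out_size               # floor(i * n / out_size)
        end = math.ceil((i + 1) * n / out_size)   # ceil((i + 1) * n / out_size)
        window = values[start:end]
        pooled.append(sum(window) / len(window))
    return pooled

def adaptive_avg_pool_2d(fmap, out_h, out_w):
    """Apply the 1D rule along the rows, then along the columns, of one channel."""
    rows = [adaptive_avg_pool_1d(row, out_w) for row in fmap]
    cols = [adaptive_avg_pool_1d(list(col), out_h) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]

# Two different spatial sizes map to the same (assumed) 6 x 6 output,
# so the first fully connected layer always sees the same number of inputs.
for size in (20, 39):
    fmap = [[float(r * size + c) for c in range(size)] for r in range(size)]
    pooled = adaptive_avg_pool_2d(fmap, 6, 6)
    print(size, "->", len(pooled), "x", len(pooled[0]))
```

Because the pooled size is independent of the input resolution, the parameter count of the classification head depends only on the channel count and the chosen output size.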
3.1. Input and Task Definition
The transmission line monitoring images were resized to 640 × 640 pixels, then normalized, and data augmentation was performed using random horizontal flips, brightness disturbance, and contrast disturbance.
The preprocessed images were then used as input to the network.
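The model outputs the binary classification probability:
$$p = f_\theta(x), \quad p \in [0, 1].$$
Here, $x$ is the preprocessed input image and $f_\theta(\cdot)$ represents the improved AlexNet model parameterized by $\theta$, where $\theta$ denotes the set of learnable network parameters.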
3.2. Shallow Feature Extraction
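Let the output feature map of the $l$-th layer be denoted as $F_l \in \mathbb{R}^{C_l \times H_l \times W_l}$, where $C_l$, $H_l$, and $W_l$ are its channel count, height, and width.
The shallow feature extraction stage is defined as follows: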
Conv1: 11 × 11, s = 4, p = 2, C = 64;
MaxPool1: 3 × 3, s = 2;
Conv2: 5 × 5, s = 1, p = 2, C = 192;
MaxPool2: 3 × 3, s = 2.
This stage primarily extracts low-level texture information, such as edges and contours, providing fundamental features for subsequent multi-scale and attention-based modules.
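As a worked example of the layer settings above, the following sketch traces the spatial size of a 640 × 640 input through the shallow stage using the standard output-size formula floor((n + 2p − k)/s) + 1. It assumes a three-channel input and pooling without padding in floor mode, neither of which is stated explicitly; the function names are illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def shallow_stage_shapes(input_size=640):
    """Trace (name, channels, spatial size) through Conv1 .. MaxPool2."""
    layers = [
        # name, out_channels (None = unchanged), kernel, stride, padding
        ("Conv1", 64, 11, 4, 2),
        ("MaxPool1", None, 3, 2, 0),
        ("Conv2", 192, 5, 1, 2),
        ("MaxPool2", None, 3, 2, 0),
    ]
    size, channels, trace = input_size, 3, []  # 3 input channels is an assumption
    for name, out_ch, k, s, p in layers:
        size = conv_out(size, k, s, p)
        channels = out_ch if out_ch is not None else channels
        trace.append((name, channels, size))
    return trace

for name, channels, size in shallow_stage_shapes(640):
    print(f"{name}: {channels} x {size} x {size}")
```

Under these assumptions the stage maps the input to 159, 79, 79, and finally 39 pixels per side, so the feature map entering the later modules is 192 × 39 × 39.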
3.3. RepGhost-Inception Multi-Scale Reparameterization Fusion Module
To better capture flame and smoke patterns at various spatial scales with controlled computational complexity, the RGI module is inserted after Pool2 and takes the Pool2 output feature map as its input.
In the training stage, the RGI module uses multiple convolutional branches $\phi_i(\cdot)$, each with a different kernel size, to learn features with varying receptive fields. The multi-branch features are concatenated along the channel axis.
The Ghost mechanism is then applied to generate redundant features at low cost and thereby reduce computational complexity: a low-cost operator $g(\cdot)$ produces additional features, which are combined with the concatenated features through $\oplus$, where $\oplus$ represents channel concatenation or element-wise fusion. In the inference stage, the multi-branch convolution and BN can be equivalently folded and reparameterized into a single convolution kernel with a bias term, allowing RGI to perform only one standard convolution operation during inference to produce its output.
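The folding step can be made concrete with a small numerical sketch. It is not the RGI implementation: it uses a one-dimensional, single-channel convolution in plain Python, hypothetical weights and BN statistics, and assumes the two branches are fused by element-wise addition. Folding BN into a convolution rescales the kernel by γ/√(σ² + ε) and shifts the bias to β + (b − μ)·γ/√(σ² + ε); a smaller kernel is then zero-padded to the larger size and added.

```python
import math

def conv1d(x, kernel, bias=0.0):
    """Valid 1D cross-correlation of a list `x` with `kernel`, plus a bias."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN with fixed running statistics."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * (v - mean) + beta for v in y]

def fold_bn(kernel, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (kernel', bias') such that conv' equals BN(conv)."""
    scale = gamma / math.sqrt(var + eps)
    return [w * scale for w in kernel], beta + (bias - mean) * scale

def merge_branches(kernel3, bias3, kernel1, bias1):
    """Fold a 1-tap branch into a 3-tap branch whose outputs are summed."""
    padded = [0.0, kernel1[0], 0.0]  # zero-pad the 1-tap kernel to 3 taps
    return [a + b for a, b in zip(kernel3, padded)], bias3 + bias1

x = [0.5, -1.0, 2.0, 3.0, -0.5]  # hypothetical input signal

# Training-time view: two conv + BN branches, summed element-wise.
reference = [a + b for a, b in zip(
    batch_norm(conv1d(x, [0.2, -0.1, 0.4], 0.1), 1.5, 0.3, 0.2, 0.9),
    batch_norm(conv1d(x[1:-1], [0.7], -0.2), 0.8, -0.1, 0.05, 1.2))]

# Inference-time view: one fused kernel and bias.
k3, b3 = fold_bn([0.2, -0.1, 0.4], 0.1, gamma=1.5, beta=0.3, mean=0.2, var=0.9)
k1, b1 = fold_bn([0.7], -0.2, gamma=0.8, beta=-0.1, mean=0.05, var=1.2)
merged_kernel, merged_bias = merge_branches(k3, b3, k1, b1)
fused = conv1d(x, merged_kernel, merged_bias)

max_diff = max(abs(a - b) for a, b in zip(reference, fused))
print("max difference between two-branch and fused outputs:", max_diff)
```

When branches are concatenated rather than added, the folded kernels would instead be stacked along the output-channel axis of a single convolution; the BN folding itself is unchanged.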
3.4. Lightweight Multi-Scale Self-Attention Module
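To accommodate the scale variation from initial fire points to smoke dispersion in wildfires and enhance modeling capabilities for long-range dependencies, Light-MSA is introduced after the Conv5 output. Let the input be $X \in \mathbb{R}^{C \times H \times W}$.
Light-MSA constructs attention at two scales: scale $s = 1$ corresponds to the original resolution, $X^{(1)} = X$, while scale $s = 2$ represents downsampling via 2 × 2 average pooling, $X^{(2)} = \mathrm{AvgPool}_{2 \times 2}(X)$.
For each scale, 1 × 1 convolutions are used to generate $Q$, $K$, and $V$:
$$Q^{(s)} = \mathrm{Conv}^{Q}_{1 \times 1}(X^{(s)}), \quad K^{(s)} = \mathrm{Conv}^{K}_{1 \times 1}(X^{(s)}), \quad V^{(s)} = \mathrm{Conv}^{V}_{1 \times 1}(X^{(s)}).$$
The features are flattened into a token sequence of length $N_s = H_s W_s$, and the self-attention is computed as
$$A^{(s)} = \mathrm{softmax}\!\left(\frac{Q^{(s)} \left(K^{(s)}\right)^{\top}}{\sqrt{d}}\right) V^{(s)},$$
where $d$ denotes the projection dimension. To control computational complexity, the number of tokens at scale $s = 2$ is reduced from $N$ to $N/4$, since 2 × 2 average pooling halves both spatial dimensions.
$A^{(2)}$ is upsampled to the original resolution and then fused:
$$F_{MSA} = \lambda_1 A^{(1)} + \lambda_2 \, \mathrm{Up}\!\left(A^{(2)}\right),$$
where $\lambda_1$ and $\lambda_2$ are learnable weights.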
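The two ingredients of this module, scaled dot-product attention over flattened tokens and token reduction by 2 × 2 average pooling, can be sketched in plain Python as follows. This is an illustrative sketch only: the 1 × 1 projections are omitted, the feature map is single-channel with even side lengths, and all names and values are hypothetical.

```python
import math

def avg_pool_2x2(fmap):
    """2 x 2 average pooling (stride 2) on a single-channel map with even sides."""
    return [[(fmap[r][c] + fmap[r][c + 1] + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4
             for c in range(0, len(fmap[0]), 2)]
            for r in range(0, len(fmap), 2)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for token lists of shape N x d."""
    d = len(Q[0])
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

def token_counts(height, width):
    """Tokens at scale 1 and at scale 2 (after 2 x 2 pooling, floor division)."""
    return height * width, (height // 2) * (width // 2)

n1, n2 = token_counts(8, 8)
print("tokens:", n1, n2, "attention matrix entries:", n1 * n1, n2 * n2)
```

Since the attention matrix has one entry per token pair, reducing the token count by a factor of 4 reduces the attention matrix at the coarse scale by a factor of 16.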
3.5. Hybrid Pooling Attention Module
To suppress fire-like interference, such as strong light reflections at night and welding sparks, the HPA module is introduced after Light-MSA for joint channel and spatial recalibration. Let the input be $F_{MSA}$.
After concatenating the two, apply direction-sensitive convolutions:
Here, α and β are learnable weights, and the kernel size is k = 7.
3.6. Loss Function and Optimization Strategy
The proposed wildfire recognition network is trained using the cross-entropy loss function. Given an input sample $x_i$ with ground-truth label $y_i \in \{0, 1\}$, the predicted probability of the wildfire class is denoted as $p_i = f_\theta(x_i)$. The cross-entropy loss is defined as
$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],$$
where $N$ denotes the number of samples in a mini-batch.
Considering that the wildfire dataset exhibits an imbalanced distribution between wildfire and non-wildfire samples, a weighted cross-entropy formulation is adopted. Let $N_1$ and $N_0$ denote the number of wildfire and non-wildfire samples, respectively. The corresponding class prior probabilities can be expressed as
$$\pi_1 = \frac{N_1}{N_1 + N_0}, \quad \pi_0 = \frac{N_0}{N_1 + N_0}.$$
To alleviate the bias toward the dominant class, class-dependent weighting coefficients $w_1$ and $w_0$ are introduced so that samples from the minority class receive larger gradient contributions during optimization.
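A minimal sketch of a class-weighted binary cross-entropy is given below. The weighting scheme (inverse class prior, normalized so that the two weights average to 1) is an assumption chosen for illustration, as are the sample counts; the exact coefficients used in this work may differ.

```python
import math

def class_weights(n_fire, n_nonfire):
    """Inverse-prior weights, normalized to average 1 (an assumed scheme)."""
    total = n_fire + n_nonfire
    prior_fire, prior_nonfire = n_fire / total, n_nonfire / total
    w_fire, w_nonfire = 1.0 / prior_fire, 1.0 / prior_nonfire
    norm = (w_fire + w_nonfire) / 2.0
    return w_fire / norm, w_nonfire / norm

def weighted_bce(probs, labels, w_fire, w_nonfire, eps=1e-12):
    """Mean weighted binary cross-entropy over a mini-batch."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(w_fire * y * math.log(p) + w_nonfire * (1 - y) * math.log(1.0 - p))
    return total / len(probs)

# Hypothetical counts: wildfire is the minority class.
w1, w0 = class_weights(n_fire=100, n_nonfire=300)
probs, labels = [0.9, 0.2, 0.6, 0.1], [1, 0, 1, 0]
print("weights:", w1, w0)
print("weighted loss:", weighted_bce(probs, labels, w1, w0))
print("unweighted loss:", weighted_bce(probs, labels, 1.0, 1.0))
```

With these weights, a misclassified wildfire sample contributes three times as much to the loss as a misclassified non-wildfire sample, which mirrors the 1:3 class ratio in the hypothetical counts.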
The Adam optimizer is used to optimize the network parameters. The initial learning rate is set to $lr = 1 \times 10^{-3}$, weight decay is set to $1 \times 10^{-4}$, the batch size is fixed to 16, and the maximum number of training epochs is set to 200. A cosine annealing learning rate schedule is adopted during training. The final model is selected based on the best F1 score on the validation set. Early stopping is applied when the validation performance does not improve over several consecutive epochs, which helps prevent overfitting and improves training efficiency.
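The schedule and stopping rule described above can be sketched as follows. The cosine annealing formula is the standard one; the minimum learning rate of 0 and the patience of 10 epochs are assumptions, since neither value is specified, and the class name is illustrative.

```python
import math

def cosine_lr(epoch, base_lr=1e-3, max_epochs=200, min_lr=0.0):
    """Cosine annealing from base_lr (epoch 0) down to min_lr (epoch max_epochs)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / max_epochs))

class EarlyStopping:
    """Track the best validation F1 and signal a stop after `patience` bad epochs."""

    def __init__(self, patience=10):  # patience value is an assumption
        self.patience = patience
        self.best = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_f1):
        """Record one validation F1; return True when training should stop."""
        if val_f1 > self.best:
            self.best, self.best_epoch, self.bad_epochs = val_f1, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

for epoch in (0, 50, 100, 150, 200):
    print(epoch, cosine_lr(epoch))
```

In a training loop, `cosine_lr(epoch)` would set the optimizer learning rate at the start of each epoch, and the checkpoint saved at `best_epoch` would be kept as the final model.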
3.7. Architecture Overview
The overall architecture of the proposed improved AlexNet is illustrated in Figure 5. The input transmission corridor image is first fed into Conv1 to extract low-level visual features, followed by a MaxPool layer to reduce the spatial resolution and retain dominant responses. The resulting feature maps are then processed by Conv2 to further capture intermediate representations.
After the early convolution stages, the Light-MSA module is introduced to enhance multi-scale contextual dependency modeling and improve the representation of weak wildfire regions under complex backgrounds. The refined features are subsequently passed through Conv3, followed by the RepGhost-Inception (RGI) module, which performs lightweight multi-scale feature extraction and feature fusion through re-parameterized branches. This design improves the network’s ability to characterize wildfire-related patterns of different scales while maintaining computational efficiency.
Next, the fused feature maps are fed into Conv4 to obtain higher-level semantic representations. To further suppress fire-like interference caused by strong light, reflections, and other background disturbances, the Hybrid Pooling Attention (HPA) module is applied after Conv4 to adaptively refine the discriminative features. Finally, the refined feature maps are sent to the fully connected classification layer, which outputs the final recognition result. Through the coordinated integration of Light-MSA, RGI, and HPA into the AlexNet backbone, the proposed model achieves enhanced feature representation and interference suppression while preserving a relatively lightweight architecture. The overall pseudocode for our method can be presented in Algorithm 1.
| Algorithm 1 Pseudocode of the proposed improved AlexNet |
| Input: Training set $D = \{(x_i, y_i)\}_{i=1}^{N}$ |
| Output: Predicted label ŷ ∈ {wildfire, non-wildfire} |
| Initialize the parameters of the improved AlexNet |
| for each training epoch do |
| for each mini-batch {(xi, yi)} in D do |
| Resize and normalize the input image xi |
| Perform data augmentation on xi |
| Extract low-level features using Conv1 |
| Reduce the spatial dimension using MaxPool |
| Extract intermediate features using Conv2 |
| Enhance multi-scale contextual information using Light-MSA |
| Further extract semantic features using Conv3 |
| Perform lightweight multi-scale feature fusion using the RepGhost-Inception module |
| Generate high-level feature maps using Conv4 |
| Refine discriminative representations using the Hybrid Pooling Attention module |
| Feed the refined features into the fully connected classification layer |
| Obtain the prediction probability pi |
| Compute the classification loss between pi and yi |
| Update the network parameters by backpropagation |
| end for |
| end for |
4. Experimental Results and Analysis
4.1. Construction of Wildfire Dataset and Experimental Environment
To validate the effectiveness of the proposed method in the visual monitoring scenario of power transmission line corridors, this study constructed a wildfire image classification dataset for power transmission lines. This dataset mainly comes from two sources:
(I) Field-collected data: A total of 1246 images were collected, captured by visual monitoring devices deployed along the transmission line corridors of the State Grid Corporation of China. These images cover various scenarios, including daytime and nighttime, sunny and cloudy weather, and include some typical false alarm sources such as strong lights and reflections from street lamps and vehicle headlights, welding or burning flames, red-orange objects, sunlight spots, smoke obstructions, and haze.
(II) Public supplementary data: 600 typical wildfire and non-wildfire event images were collected from publicly available online sources. These data supplement the dataset, increasing the visual diversity caused by differences in geographic terrain, vegetation types, and imaging devices, thereby enhancing the model’s cross-scenario generalization capability.
In order to guarantee the reproducibility of the split of data and the validity of the assessment, all samples were subjected to task-specific image-level cleaning and deduplication. For a series of images from the same event or video clip, sample selection was performed by a combination of similarity screening and visual review, so that similar samples would not appear together in both the training and testing sets. Then, the dataset was split randomly into training, validation, and testing sets in a ratio of 7:2:1, with the aim being to preserve the original distribution of wildfire and non-wildfire samples in each subset in order to reduce the effect of distribution drift on the assessment results.
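A stratified 7:2:1 split that preserves the class distribution in each subset can be sketched as below. The label counts are hypothetical, and the sketch does not implement the similarity screening that keeps images from the same event in a single subset; that step would have to be applied before splitting.

```python
import random

def stratified_split(labels, ratios=(0.7, 0.2, 0.1), seed=0):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical dataset: 300 wildfire (1) and 700 non-wildfire (0) images.
labels = [1] * 300 + [0] * 700
train_idx, val_idx, test_idx = stratified_split(labels)
for name, idx in (("train", train_idx), ("val", val_idx), ("test", test_idx)):
    fire_share = sum(labels[i] for i in idx) / len(idx)
    print(name, len(idx), round(fire_share, 3))
```

Splitting each class separately is what keeps the wildfire share the same in all three subsets, which a plain random split only achieves approximately.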
Figure 6 illustrates some examples of wildfire and non-wildfire samples in the generated dataset.
The experiment was performed on a workstation with an Intel Xeon Silver 4214R CPU (Intel Corporation, Santa Clara, CA, USA), an NVIDIA RTX 3080 Ti GPU with 12 GB VRAM (NVIDIA Corporation, Santa Clara, CA, USA), and a Linux operating system. The proposed approach was developed with Python 3.11 and the PyTorch 2.1.0 deep learning library. The input images were uniformly resized to 640 × 640 pixels, the batch size was fixed to 16, and the maximum number of training epochs was fixed to 200. To improve the training stability in the small-sample setting, we conducted transfer learning using weights pre-trained on ImageNet, and the parameters of the newly added modules were initialized with Kaiming initialization.
4.2. Training Strategy
The proposed wildfire recognition network is trained using the cross-entropy loss function as the optimization objective. Given an input sample $x_i$ with ground-truth label $y_i \in \{0, 1\}$, the predicted probability of the wildfire class is denoted as $p_i = f_\theta(x_i)$. The cross-entropy loss is defined as
$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],$$
where $N$ denotes the number of samples in a mini-batch.
Considering that the wildfire recognition dataset contains an imbalanced distribution between wildfire and non-wildfire samples, a weighted cross-entropy formulation is adopted to reduce the bias toward the dominant class. The weighted loss function can be expressed as
$$\mathcal{L}_{WCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ w_1 y_i \log p_i + w_0 (1 - y_i) \log (1 - p_i) \right],$$
where $w_1$ and $w_0$ denote the weights associated with wildfire and non-wildfire classes, respectively.
4.3. Performance Evaluation Metrics
In this paper, precision, recall, accuracy, and F1 measure are employed to assess the performance of the proposed neural network model for wildfire detection. Precision is defined as the ratio of the number of correctly predicted positive instances to the total number of predicted positive instances, which indicates the model’s capability to suppress false positives. Recall, or the true positive rate (TPR), is the ratio of the number of correctly predicted positive instances to the total number of actual positive instances, which indicates the model’s sensitivity to the wildfire event. Accuracy is defined as the ratio of the number of correctly classified instances to the total number of instances. The F1 measure is the harmonic mean of precision and recall. The definitions of these metrics are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN},$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
TP is the number of wildfire samples that are correctly classified as wildfire. FP is the number of non-wildfire samples that are incorrectly classified as wildfire. FN is the number of wildfire samples that are incorrectly classified as non-wildfire. TN is the number of non-wildfire samples that are correctly classified as non-wildfire.
In addition, a confusion matrix is provided to illustrate the distribution of classification results and to support the calculation of evaluation metrics such as precision, recall, and F1-score.
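The four metrics follow directly from the confusion matrix counts, as the short sketch below shows. The counts in the example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy, and F1 from confusion matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

# Hypothetical confusion matrix for a balanced test set of 100 images.
metrics = classification_metrics(tp=45, fp=5, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The guards against empty denominators return 0 when a model predicts no positives or when the test set contains no wildfire samples, which avoids division errors on degenerate inputs.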
4.4. Model Comparison
In order to verify the effectiveness and superiority of the proposed network, we conducted comparative experiments between the proposed network and several representative models. The experiments were conducted under the same hardware environment, and the corresponding input image sizes were adopted for each network to achieve the best results [29,30,31,32,33,34,35]. The experimental results are listed in Table 1, and the comparison of the training accuracy of different models is shown in Figure 7.
As shown in Table 1, the proposed improved AlexNet achieves the best overall performance among the compared models in terms of accuracy, precision, recall, and F1-score. In particular, the accuracy of the proposed model is 11.3% higher than that of the original AlexNet. In addition, the proposed model also achieves higher F1-scores than lightweight architectures such as MobileNetV2 and deeper networks such as ResNet50. Compared with more recent deep learning architectures, including EfficientNetV2, Vision Transformer (ViT), and ConvNeXt, the proposed model still maintains competitive performance. These results indicate that the proposed network has strong discriminative capability and robustness for wildfire recognition in complex transmission corridor monitoring environments, especially in suppressing fire-like interference.
Although formal statistical significance testing is not included in this study, the consistent improvements across multiple evaluation metrics (accuracy, precision, recall, and F1-score) and comparisons with several representative baseline models provide empirical evidence of the effectiveness of the proposed method.
To further analyze the convergence behavior and training stability of different models, the training accuracy and loss curves during the training process are shown in Figure 7 and Figure 8. As illustrated in Figure 8, the proposed model exhibits a relatively faster decrease in loss during the early stage of training, indicating effective feature learning and parameter optimization. After a certain number of training iterations, the loss gradually stabilizes, suggesting that the model converges well during the training process and demonstrates good training stability. Compared with the baseline AlexNet model, the proposed method maintains a consistently lower loss value throughout the training process. This observation further demonstrates the effectiveness of the proposed multi-scale feature extraction strategy and attention recalibration mechanism in improving feature representation under complex wildfire monitoring scenarios.
To evaluate the trade-off between recognition performance and computational complexity, we further analyzed the number of parameters and floating-point operations (FLOPs) of different models, and the results are presented in Table 2. As shown in Table 2, VGG16 and Vision Transformer have relatively large numbers of parameters and higher computational complexity. MobileNetV2 has the smallest model size and the lowest computational cost, but its recognition performance is relatively limited. Recent architectures such as EfficientNetV2 and ConvNeXt achieve improved efficiency but still require moderate computational resources. In comparison, the improved AlexNet proposed in this paper achieves a favorable balance between computational complexity and recognition performance. Specifically, the proposed model significantly reduces the number of parameters compared with the original AlexNet while maintaining relatively low FLOPs. At the same time, it achieves the best recognition performance among the compared models, demonstrating its effectiveness for wildfire recognition in transmission corridor monitoring scenarios.
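For reference, the parameter count and multiply-accumulate (MAC) count of a single convolutional layer can be estimated as in the sketch below, shown for the Conv1 and Conv2 settings of Section 3.2. The three-channel input and the 159 × 159 and 79 × 79 output sizes are assumptions derived from a 640 × 640 input, and FLOPs conventions differ (one MAC is counted as either one or two FLOPs), so the numbers are illustrative and are not the entries of Table 2.

```python
def conv_params(c_in, c_out, kernel, bias=True):
    """Learnable parameters of a k x k convolution."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

def conv_macs(c_in, c_out, kernel, out_h, out_w):
    """Multiply-accumulate operations for one forward pass (bias ignored)."""
    return kernel * kernel * c_in * c_out * out_h * out_w

# (name, c_in, c_out, kernel, assumed output height/width)
layers = [
    ("Conv1", 3, 64, 11, 159),
    ("Conv2", 64, 192, 5, 79),
]
for name, c_in, c_out, k, out in layers:
    params = conv_params(c_in, c_out, k)
    macs = conv_macs(c_in, c_out, k, out, out)
    print(f"{name}: {params} parameters, {macs / 1e9:.3f} GMACs")
```

The sketch makes the trade-off visible: convolutional layers contribute few parameters but many operations at high resolution, whereas fully connected layers dominate the parameter count, which is why fixing the pooled feature size matters for model size.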
4.5. Ablation Study
In order to verify the effectiveness of each proposed improvement module, an ablation study was performed by gradually adding the RepGhost-Inception (RGI), Light-MSA, and Hybrid Pooling Attention (HPA) modules to the AlexNet backbone network. The experimental results of the ablation study are shown in Table 3.
The experimental results show that the highest overall performance, with an accuracy of 96.9%, is obtained only when the RepGhost-Inception module, Light-MSA, and the hybrid pooling attention module are combined. Compared with the combination of only the RepGhost-Inception module and Light-MSA, the accuracy is improved by 1.8 percentage points. These results further verify that a reasonable combination of the proposed modules improves the model’s recognition performance.
4.6. Model Stability Validation
To further evaluate the generalization capability of the proposed model, additional experiments were conducted on the Internet Forest Fire dataset, and the results are presented in Table 4 [36].
As shown in Table 4, the recognition performance of all models decreases to some extent on the Internet Forest Fire dataset compared with the results obtained on the original dataset. This phenomenon can be attributed to differences in data distribution, image quality, and environmental conditions between the two datasets. Nevertheless, the improved AlexNet model still achieves the best overall performance in terms of accuracy and F1-score among the compared methods. This result indicates that the proposed model maintains strong robustness and stable recognition capability under different wildfire scenarios, demonstrating its effectiveness in handling fire-like interference and complex monitoring environments.
4.7. Qualitative Visualization and Error Analysis
In order to further analyze the behavior of the proposed model in real transmission corridor monitoring scenarios, qualitative visualization and error case analysis are conducted in this section. Quantitative metrics such as accuracy and F1-score provide an overall evaluation of classification performance, but they do not fully reveal the characteristics of model predictions in complex environments. Therefore, confusion matrix visualization and representative misclassification cases are analyzed to better understand the strengths and limitations of the proposed method.
Figure 9 presents the confusion matrix of the proposed improved AlexNet model on the test dataset. The confusion matrix provides a comprehensive overview of classification performance, including true positives, false positives, true negatives, and false negatives. It can be observed that most wildfire and non-wildfire samples are correctly classified, indicating that the proposed model has strong discrimination ability in transmission corridor environments.
However, a small number of misclassification cases still exist. To further investigate these situations, several representative error examples are illustrated in Figure 10. The false positive cases mainly occur in scenes containing strong artificial light sources such as vehicle headlights and street lamps, as well as the sun. These light sources often exhibit similar color distributions and high-intensity regions resembling flame characteristics, which may confuse the model.
On the other hand, false negative cases typically occur when the wildfire region is very small, partially occluded by smoke, or located at long distances from the monitoring camera. Under these conditions, the visual features of flames become weak and difficult to distinguish from the background, leading to missed detections.
Overall, the qualitative analysis demonstrates that the proposed model significantly improves the robustness of wildfire recognition in complex transmission corridor environments. Nevertheless, extremely small fire regions and intense artificial light interference remain challenging scenarios. Future work will explore temporal information from video sequences and larger-scale datasets to further enhance detection reliability.
5. Conclusions
The visual monitoring of power transmission line corridors faces challenges such as high false positive rates in wildfire image recognition and poor robustness to complex backgrounds. To address these challenges, this paper proposed an improved AlexNet-based wildfire image recognition method for transmission line corridors. By introducing multi-scale feature extraction and attention enhancement modules into the classical AlexNet, the method improves the ability to distinguish wildfires from fire-like interference while keeping the computational complexity controllable.
The major contributions and conclusions of this study are summarized as follows:
(I) A wildfire image recognition network architecture is designed for complex transmission corridor scenarios. The architecture uses the AlexNet backbone network and incorporates the RepGhost-Inception multi-scale reparameterization module to enhance the representation of flame and smoke features at different scales. By incorporating the lightweight multi-scale self-attention mechanism (Light-MSA) and the hybrid pooling attention mechanism (HPA), the architecture better handles strong night light reflections, fire-like interference, and low-contrast fire targets.
(II) An image dataset of wildfires has been developed specifically for transmission line scenarios. The dataset combines 1246 complex-condition images collected on site by the transmission corridor visual monitoring system with 600 high-resolution images from internet resources, covering factors such as strong night light, fire-like interference, and smoke occlusion. The dataset provides a solid basis for model training and performance testing.
(III) We carried out extensive experiments to show the effectiveness and generalization ability of the proposed approach. In the real imbalanced class conditions, the recognition accuracy reached 96.9% with the improved AlexNet on the constructed dataset, and it was significantly better than that of the original AlexNet. The ablation experiment results confirm the effectiveness of each improvement module, and the cross-dataset validation results show that the proposed method is stable under different wildfire scenarios.
In summary, the enhanced AlexNet model is able to achieve high recognition accuracy and a low false alarm rate in wildfire image recognition under complex transmission line monitoring scenarios; therefore, it is suitable for practical online wildfire monitoring and operational decision-making in transmission corridors.
On the other hand, our work is not without limitations. First, the number of real samples of wildfires is relatively small compared to the total number of samples in the monitoring data, and there are many types of extreme conditions of wildfires that cannot be included in the current monitoring data. Second, the approach only relies on the single modality of visible light images, which might also be insufficient to achieve good performance for dense smoke images and strong illuminations. In the future, it is important to consider how to enlarge the scale of real cases, develop multimodal information fusion methods, and further optimize the deployment of lightweight models to better apply the approach in engineering applications.