An Improved AlexNet-Based Image Recognition Method for Transmission Line Wildfires

Zhao, Zilin; Duan, Guoyong

doi:10.3390/a19040245

by Zilin Zhao ¹ and Guoyong Duan ²,*

¹ College of Electrical and New Energy, China Three Gorges University, Yichang 443000, China

² Key Laboratory of Geological Hazards on Three Gorges Reservoir Area, China Three Gorges University, Yichang 443002, China

* Author to whom correspondence should be addressed.

Algorithms 2026, 19(4), 245; https://doi.org/10.3390/a19040245

Submission received: 20 January 2026 / Revised: 21 March 2026 / Accepted: 22 March 2026 / Published: 24 March 2026

(This article belongs to the Special Issue AI-Based Techniques in Smart Grid Operations)


Abstract

Wildfires near power transmission corridors are characterized by sudden onset and rapid spread, and at night their recognition is easily disturbed by fire-like light sources. They can readily cause line discharge and trip accidents, thereby threatening the safe operation of the power system. To address the high false alarm rate and poor generalization of wildfire image recognition in complex transmission corridor environments, this paper proposes a wildfire image recognition method based on an improved AlexNet. The method improves the description of flame and smoke features at different scales through a reparameterized multi-scale feature extraction structure, and mitigates the influence of strong light reflection and fire-like interference at night through lightweight multi-scale attention and hybrid pooling attention mechanisms. A wildfire image dataset is constructed from 1246 on-site images of power transmission corridors captured by visual monitoring devices and 600 wildfire images downloaded from the internet, and the method is evaluated under a real-world imbalanced class distribution. Experimental results show that the proposed method recognizes wildfire images with an accuracy of 96.9% and an F1 value of 94.9% on the test dataset, which is considerably higher than that of the original AlexNet, and that it adapts well in cross-dataset tests. This work can provide technical support for the online monitoring and operation and maintenance of power transmission corridors threatened by wildfires.

Keywords:

transmission lines; wildfire image recognition; re-parameterization; multi-scale fusion; class imbalance

1. Introduction

With the growing development of modern power grids, the secure and stable operation of transmission lines has become an essential premise for ensuring the reliable supply of electricity. In actual engineering applications, transmission lines are often routed through mountainous areas, forests, and regions of high human activity, where the environmental conditions are extremely complex and dynamic. Under the joint influence of extreme weather conditions, geographical environment, and human activities, the risk of wildfires is high in the vicinity of transmission lines [1]. In the process of wildfires, hot flames, thick smoke, and suspended particles can greatly weaken the air insulation strength around conductors. In extreme cases, this may cause electrical discharge, insulation failure, or even massive power outages [2,3]. These situations directly affect the security of power systems and may cause serious economic and social losses. Therefore, the development of effective wildfire identification techniques particularly suited to the monitoring of transmission corridors has great engineering value and application.

The traditional wildfire early warning system employed in the transmission corridor is based on manual patrols, fixed-point infrared temperature measurement, smoke sensors, multi-sensor fusion systems, or rule-based image processing pipelines [4,5,6]. In practical implementations, the fire region of interest is extracted using heuristic methods such as color thresholding, and classification is performed using handcrafted texture, morphological, or motion features. Due to their simplicity and ease of implementation, these methods were preferred in early engineering applications.

However, real transmission corridor environments are far more challenging than these approaches assume. Corridor monitoring images are often affected by strong artificial lighting at night, such as street lighting, car headlights, and their reflections on conductors, towers, and other objects. Other sources of interference include welding flames during maintenance work, open burning, sunlight spots, red-orange objects, and frequent occlusion by smoke or haze. In such environments, manually designed features are very sensitive to changes in illumination and background clutter, leading to high false alarm rates and unstable performance across different corridors, seasons, and time periods, as has been reported in corridor-related works [7].

Deep learning has greatly influenced the field of visual fire and smoke recognition. Cheng et al. gave a thorough survey on deep learning-based visual fire detection and stated that the performance improvement is closely linked to the data composition and the realism of negative samples [8]. Gragnaniello et al. gave a survey on fire and smoke detection in video sequences and gave a taxonomy that takes into account factors that are often underrepresented in benchmark datasets, such as viewpoint, background dynamics, and lighting conditions [9].

From a system point of view, Boroujeni et al. reviewed AI-assisted unmanned aerial systems in the pre-, active, and post-wildfire phases and emphasized that sensing platforms and operational conditions greatly affect the performance of visual recognition systems [10]. Vasconcelos et al. and Saleh et al. reviewed recent progress in deep learning-based fire detection and emphasized that multi-scale feature learning and data-driven learning have become the mainstream design approaches, but failure cases are still prevalent under complex illumination and cluttered backgrounds [11,12]. Özel and Elhanashi et al. further introduced the long-standing challenges of biased datasets, a lack of hard negative examples, and the trade-off between recognition performance and real-time processing, especially in early warning systems [13,14]. UAV-centric reviews by Bouguettaya et al. and Danish et al. also indicated that viewpoint changes, motion blur, and real-time processing capabilities are great factors that influence detection performance in real-world conditions [15,16]. For smoke recognition, Chaturvedi et al. indicated that smoke is often mistakenly detected as clouds, fog, and haze in distant outdoor surveillance scenarios [17].

Within the particular context of power transmission line corridors, the task of wildfire identification is faced with challenges over and above those experienced in general outdoor fire detection. The corridor fire detection system is normally designed with fixed cameras running continuously with limited viewpoints and the need to correctly identify actual wildfires among a vast number of light sources and reflective interferences that resemble fire. Wang et al. put forward a metric learning-based improved oriented R-CNN to improve the separability of feature targets and interference patterns within the context of corridors [18]. At the same time, Huang et al. established a physical Bayesian modeling framework to evaluate the vulnerability of transmission line tripping due to wildfires, clearly showing the relationship between wildfire identification performance and power system operation risk [19].

Under practical transmission corridor applications, the performance of wildfire recognition systems is limited not only by model structure but also by operational environment and data characteristics. Fixed corridor cameras are always on and have limited viewpoints, and most of the collected samples are non-fire images dominated by artificial lighting and reflective surfaces. At night, fire-like interferences tend to have visual appearances that are highly similar to real fires, making discrimination much harder. Nighttime fire detection studies in urban areas have indicated that even state-of-the-art deep learning models are highly susceptible to strong illumination interference and temporal variability, causing persistent false alarms [20]. In addition, power line inspection and wildfire risk prediction surveys have highlighted that practical monitoring systems must meet very strict real-time constraints, further limiting the direct application of computationally intensive models [21,22]. Consequently, the problem of accurate wildfire identification with low false alarms and stable real-time performance remains a pressing unsolved issue in transmission corridor monitoring.

To address these problems, this paper focuses on wildfire image recognition in the specific setting of visual monitoring of power transmission corridors. The main contributions are as follows:

(I) A wildfire image dataset for power transmission corridors is built by combining on-site monitoring images with publicly available images; it covers common complex situations such as strong light at night, fire-like interference, and smoke occlusion, and training and testing follow the actual sample distribution;

(II) Multi-scale feature extraction and reparameterization structures are added to the AlexNet backbone to enhance the model’s representation of wildfire features at multiple scales while preserving inference efficiency;

(III) Lightweight multi-scale attention and hybrid pooling attention mechanisms are designed to improve the model’s ability to distinguish wildfire targets in low-contrast and complex background environments;

(IV) Comparative experiments, ablation experiments, and cross-dataset verification are used to analyze the recognition accuracy, robustness, and generalization ability of the proposed method.

2. Improved AlexNet Network Architecture Design

2.1. AlexNet Basic Structure

AlexNet has five convolutional layers and three fully connected layers, which implement the step-by-step abstraction from low-level textures to high-level semantics through a series of convolution, pooling, and nonlinear activation operations. Adding more convolutional layers can improve the extraction of features from large-scale input images, but the depth of the convolutional layers may impair the fitting capability of the model [23,24,25,26]. Considering the high resolution and limited sample size of transmission corridor monitoring images, this study uses five convolutional layers as a lightweight backbone network and simplifies the classification head to a single hidden layer fully connected structure, thereby reducing model parameters and the risk of overfitting while maintaining sufficient discriminative capability. To further improve the distinction between wildfires and fire-like interferences in complex scenarios, multi-scale feature fusion and attention recalibration mechanisms are introduced into the backbone network. The architectural parameters of the classic AlexNet network are shown in Figure 1.

2.2. RepGhost-Inception Module

To balance multi-scale feature representation capability and computational efficiency, this study introduces the RepGhost-Inception (RGI) module after the Conv2 and Pool2 layers. The RGI module combines the multi-branch Inception architecture with the Ghost feature generation concept and uses a re-parameterization strategy to ensure structural consistency between the training and inference phases.

Let the input feature map be

F \in R^{C \times H \times W}

(1)

where C, H, and W denote the number of channels, height, and width of the feature map, respectively.

In the training stage, the RGI module is composed of four parallel branches, each representing convolution operations at different scales to extract wildfire-related visual patterns such as flames and smoke. The feature map generated by the i-th branch can be expressed as

F_{i} = σ (W_{i} * F + b_{i}), i = 1, 2, 3, 4

(2)

where W_i denotes the convolution kernel of the i-th branch, b_i is the bias term, and σ(⋅) represents the nonlinear activation function.

The outputs of the four branches are concatenated along the channel axis to build a comprehensive multi-scale primary feature map:

F_{ms} = Concat (F_{1}, F_{2}, F_{3}, F_{4})

(3)

Then, the Ghost mechanism is employed to produce redundant feature maps via depthwise separable convolutions, which are computationally inexpensive.

After training, the multi-branch convolution structure and its corresponding batch normalization layers are equivalently folded and re-parameterized into a single convolution kernel with corresponding bias terms. The equivalent convolution kernel can be written as

W_{eq} = \sum_{i = 1}^{4} W_{i}

(4)

Thus, the feature extraction process during inference only requires a single standard convolution operation:

Y = W_{eq} * F + b

(5)

In this manner, the RGI module greatly improves the efficiency of inference while maintaining the ability of multi-scale representation, which is very suitable for online monitoring in power transmission corridors. The structure of the RGI module is shown in Figure 2.
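Equation (4) rests on the linearity of convolution. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the branch outputs are added rather than concatenated, that smaller kernels are zero-padded to the largest kernel size, and that any batch normalization has already been folded into each branch. The 1-D signal, the kernel values, and the helper names `conv1d_same` and `pad_kernel` are illustrative only.

```python
def conv1d_same(signal, kernel):
    """Cross-correlate a 1-D signal with an odd-length kernel using zero 'same' padding."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]


def pad_kernel(kernel, size):
    """Zero-pad an odd-length kernel symmetrically to the target odd size."""
    extra = (size - len(kernel)) // 2
    return [0.0] * extra + list(kernel) + [0.0] * extra


# Hypothetical branch kernels of sizes 1, 3, and 5 (values are illustrative).
branches = [[2.0], [1.0, 0.0, -1.0], [0.5, 0.25, 0.0, 0.25, 0.5]]
signal = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]

# Training-time view: run every branch, then add the branch outputs.
multi_branch = [sum(vals) for vals in zip(*(conv1d_same(signal, k) for k in branches))]

# Inference-time view: add the zero-padded kernels once, then run one convolution.
size = max(len(k) for k in branches)
w_eq = [sum(vals) for vals in zip(*(pad_kernel(k, size) for k in branches))]
single_branch = conv1d_same(signal, w_eq)
```

Under these assumptions the two views give the same output up to floating-point rounding, which is why the merged kernel can replace the branches at inference time.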

2.3. Lightweight Multiscale Attention Module

In order to better fit the special attributes of the wildfires at different development stages (namely, the initial stage of ignition, the stage of active burning, and the stage of smoke diffusion), a lightweight multi-scale attention (Light-MSA) module was proposed in this paper. The proposed Light-MSA module is capable of modeling long-range dependencies at both the original spatial resolution and the downsampled spatial resolution.

Let the input feature map be

X \in R^{H \times W \times C}

(6)

where C, H, and W denote the number of channels, height, and width of the feature map, respectively.

In order to control the computational complexity, Light-MSA introduces a 2 × 2 average pooling operation in the second-scale branch, which reduces the number of tokens from N = H × W to N/4. The pooled feature map can be written as

X_{s} = AvgPool (X)

(7)

To capture the dependency relationships among spatial features, query, key, and value representations are first generated through linear projections:

Q = W_{q} X, K = W_{k} X, V = W_{v} X

(8)

where W_q, W_k, and W_v are learnable projection matrices.

The attention weights are computed as

A = Softmax (\frac{Q K^{T}}{\sqrt{d}})

(9)

where d denotes the channel dimension used for normalization.

The refined feature map can then be obtained by

Y = A V

(10)
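Equations (8)–(10) describe standard scaled dot-product attention. A minimal plain-Python sketch follows, assuming tokens are stored as rows and the projections W_q, W_k, and W_v are identity matrices, so that Q = K = V = X. The token values and the function names are illustrative and not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Return (weights, output) for scaled dot-product attention over token rows."""
    d = len(keys[0])
    weights = [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d) for k_row in keys])
        for q_row in queries
    ]
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, values)) for c in range(len(values[0]))]
        for w_row in weights
    ]
    return weights, output


# Three hypothetical tokens with d = 2; identity projections are assumed.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_weights, refined = attention(tokens, tokens, tokens)
```

Each row of `attn_weights` sums to one, so every refined token is a convex combination of the value vectors.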

The input feature map is processed through two parallel branches, namely the channel attention branch and the spatial attention branch. The outputs of the two branches are fused through a feature fusion operation to generate the refined output feature map:

F_{out} = F_{c} + F_{s}

(11)

where F_c and F_s denote the output features of the channel attention branch and the spatial attention branch, respectively.

Figure 3 shows the architecture of the proposed Light-MSA module.

2.4. Hybrid Pooling Attention Module

Low contrast, smoke occlusion, and heavy light interference are typical difficulties in transmission line monitoring images. In order to improve the ability of models in complex situations, this paper introduces a Hybrid Pooling Attention (HPA) module, which simultaneously recalibrates feature responses in both channel and spatial dimensions.

Let the input feature map be

N \in R^{H \times W \times C}

(12)

where C, H, and W denote the number of channels, height, and width of the feature map, respectively.

The HPA module begins with the extraction of channel-level statistics via global average pooling and global max pooling. The channel descriptors can be written as

z_{avg} = GAP (N)

(13)

z_{max} = GMP (N)

(14)

where GAP and GMP denote global average pooling and global max pooling operations, respectively.

The pooled descriptors are fused to generate the channel attention weights:

w_{c} = σ (W_{1} z_{avg} + W_{2} z_{max})

(15)

where W₁ and W₂ are learnable weight matrices and σ(⋅) denotes the sigmoid activation function.

The channel-refined feature map is then obtained by

N_{c} = w_{c} ⊙ N

(16)

where ⊙ represents channel-wise multiplication.

Later, in the spatial dimension, direction-sensitive convolutions are performed on the pooled features to promote the recognition of linear structures such as flames and smoke. The spatial attention map can be expressed as

M_{s} = σ (Conv (N_{c}))

(17)

The final refined feature map is therefore given by

F_{out} = M_{s} ⊙ N_{c}

(18)

Through the combined action of channel attention and spatial attention, the HPA module can effectively suppress fire-like interference and raise the visibility of flame targets in complex background scenarios.

Figure 4 shows the architecture of the proposed Hybrid Pooling Attention module.
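The channel recalibration in Equations (13)–(16) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the feature map is stored as a list of C channels, the weight matrices W₁ and W₂ are set to the identity for the example, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def channel_attention(feature, w1, w2):
    """feature: list of C channels, each an H x W nested list of floats."""
    flat = [[v for row in channel for v in row] for channel in feature]
    z_avg = [sum(ch) / len(ch) for ch in flat]  # global average pooling
    z_max = [max(ch) for ch in flat]            # global max pooling
    weights = [sigmoid(a + b) for a, b in zip(matvec(w1, z_avg), matvec(w2, z_max))]
    # Channel-wise multiplication: every pixel of a channel is scaled by that channel's weight.
    refined = [[[w * v for v in row] for row in channel] for w, channel in zip(weights, feature)]
    return weights, refined


# Hypothetical 2-channel, 2 x 2 feature map; channel 1 holds one strong response.
feature = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 4.0], [0.0, 0.0]],
]
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed W1 = W2 = I, for illustration only
weights, refined = channel_attention(feature, identity, identity)
```

In this toy case the silent channel receives the neutral weight 0.5, while the channel with a strong peak is weighted close to 1, because the max-pooled descriptor preserves the peak that average pooling would dilute.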

2.5. Transfer Learning

Given the relatively small amount of data available for wildfires in the context of power transmission lines, this paper uses ImageNet pre-trained weights for transfer learning. The pre-trained weights are loaded into the backbone network without changing the network structure; the newly added RGI, Light-MSA, and HPA modules are initialized with Kaiming normal initialization. Where channel numbers or parameter shapes do not match, a selective parameter loading method is adopted that skips the mismatched weights, which helps to prevent instability during training and promotes convergence while limiting overfitting [27,28].
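The selective loading step can be sketched as a shape comparison between the pre-trained parameters and the target model. The example below is a simplified illustration in plain Python: parameters are represented only by their shapes, and all layer names and shapes are hypothetical rather than taken from the paper.

```python
def select_loadable(pretrained_shapes, model_shapes):
    """Split pretrained entries into names that can be loaded and names that are skipped."""
    loadable, skipped = [], []
    for name, shape in pretrained_shapes.items():
        # Load only when the target model has a parameter of the same name and shape.
        if model_shapes.get(name) == shape:
            loadable.append(name)
        else:
            skipped.append(name)
    return loadable, skipped


# Hypothetical parameter names and shapes, for illustration only.
pretrained_shapes = {
    "conv1.weight": (64, 3, 11, 11),
    "conv2.weight": (192, 64, 5, 5),
    "fc.weight": (1000, 4096),
}
model_shapes = {
    "conv1.weight": (64, 3, 11, 11),
    "conv2.weight": (192, 64, 5, 5),
    "rgi.branch1.weight": (48, 192, 1, 1),  # new module, absent from the pretrained set
    "fc.weight": (2, 4096),                 # binary head, shape differs
}
loadable, skipped = select_loadable(pretrained_shapes, model_shapes)
```

In PyTorch, a dictionary filtered this way would typically be passed to `load_state_dict` with `strict=False`, so that the new modules keep their own initialization.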

3. Transmission Line Wildfire Identification Based on Improved AlexNet

This study is based on the AlexNet architecture and integrates the RepGhost-Inception (RGI) module, Light-MSA module, and Hybrid Pooling Attention (HPA) module to construct an improved AlexNet-based wildfire recognition model for power transmission lines. As shown in Figure 5, the proposed network backbone consists of five convolutional layers, with the RGI module embedded after the Pool2 layer for multi-scale feature fusion. The Light-MSA and HPA modules are sequentially attached after the Conv5 layer to enhance global context modeling and interference suppression capabilities. To prevent the full connection parameters from becoming excessively large due to high-resolution inputs, an adaptive average pooling layer is used to fix the feature dimensions, followed by a single-hidden-layer fully connected classification head.

3.1. Input and Task Definition

The transmission line monitoring images were resized to 640 × 640 pixels, then normalized, and data augmentation was performed using random horizontal flips, brightness disturbance, and contrast disturbance.

The preprocessed images were then used as input to the network.

I \in R^{3 \times 640 \times 640}

(19)

The model outputs binary classification probabilities:

\hat{p} = [{\hat{p}}_{fire}, {\hat{p}}_{non-fire}] = f_{θ} (I)

(20)

{\hat{p}}_{fire} + {\hat{p}}_{non-fire} = 1

(21)

Here, f_θ(⋅) represents the improved AlexNet model parameterized by θ, where θ denotes the set of learnable network parameters.

3.2. Shallow Feature Extraction

Let the output feature map of the l-th layer be denoted as

F_{l} \in R^{C_{l} \times H_{l} \times W_{l}}

(22)

The shallow feature extraction stage is defined as follows:

Conv1: 11 × 11, s = 4, p = 2, C = 64;
MaxPool1: 3 × 3, s = 2;
Conv2: 5 × 5, s = 1, p = 2, C = 192;
MaxPool2: 3 × 3, s = 2.

This stage primarily extracts low-level texture information, such as edges and contours, providing fundamental features for subsequent multi-scale and attention-based modules.
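The spatial size after each shallow layer follows from the usual output-size formula, floor((n + 2p − k) / s) + 1. The sketch below applies it to the 640 × 640 input, assuming floor rounding and unpadded max pooling, which the paper does not state explicitly; the helper name `out_size` is illustrative.

```python
def out_size(size, kernel, stride, padding=0):
    """Spatial size after a convolution or pooling layer, with floor rounding."""
    return (size + 2 * padding - kernel) // stride + 1


size = 640
trace = {}
for name, kernel, stride, padding in [
    ("Conv1", 11, 4, 2),
    ("MaxPool1", 3, 2, 0),
    ("Conv2", 5, 1, 2),
    ("MaxPool2", 3, 2, 0),
]:
    size = out_size(size, kernel, stride, padding)
    trace[name] = size  # side length of the square feature map after this layer
```

Under these assumptions the side length shrinks from 640 to 159, 79, 79, and 39 across the four layers, so the stage hands a 192 × 39 × 39 feature map to the following modules.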

3.3. RepGhost-Inception Multi-Scale Reparameterization Fusion Module

To better capture flame and smoke pattern characteristics at various spatial scales with controlled computational complexity, the RGI module is placed after Pool2. Let the input feature map be represented as

F \in R^{C \times H \times W}

(23)

In the training stage, the RGI module uses a multi-branch convolutional module to learn features with varying receptive fields.

U_{i} = ϕ_{i} (F), i \in \{1, \dots, B\}

(24)

Here, ϕ_i(·) denotes the convolutional module of the i-th branch, each with a different kernel size, and B is the number of branches. The multi-branch features are concatenated along the channel axis:

U = Concat (U_{1}, \dots, U_{B})

(25)

Next, the Ghost mechanism is applied to produce redundant features at low computational cost:

V = U \oplus g (U)

(26)

In this equation, g(·) represents the low-cost operator, and ⊕ represents channel concatenation or element-wise fusion. In the inference stage, the multi-branch convolution and BN can be equivalently folded and reparameterized into a single convolution kernel with a bias term, allowing RGI to perform only one standard convolution operation during inference. The RGI output is represented as

F_{RGI} = RGI (F) .

(27)
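Folding batch normalization (BN) into the preceding convolution is the step that makes the single-convolution inference form possible. A minimal sketch for one output channel is given below, assuming BN in inference mode with fixed running statistics; all numeric values and function names are hypothetical.

```python
import math


def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold one BN channel into the weights and bias of the preceding convolution."""
    scale = gamma / math.sqrt(var + eps)
    return [w * scale for w in weight], beta + (bias - mean) * scale


def conv_at(window, weight, bias):
    """Convolution response at a single position."""
    return sum(w * x for w, x in zip(weight, window)) + bias


# Hypothetical single-channel kernel, BN statistics, and input window.
weight, bias = [0.2, -0.5, 1.0], 0.1
gamma, beta, mean, var = 1.5, -0.3, 0.4, 2.0
window = [3.0, 1.0, -2.0]

# Unfused: convolution followed by batch normalization.
raw = conv_at(window, weight, bias)
unfused = gamma * (raw - mean) / math.sqrt(var + 1e-5) + beta

# Fused: one convolution with the folded parameters.
folded_w, folded_b = fold_bn(weight, bias, gamma, beta, mean, var)
fused = conv_at(window, folded_w, folded_b)
```

The two results agree up to floating-point rounding, so the BN layer adds no cost at inference once it has been folded.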

3.4. Lightweight Multi-Scale Self-Attention Module

To accommodate the scale variation from initial fire points to smoke dispersion in wildfires and enhance modeling capabilities for long-range dependencies, Light-MSA is introduced after Conv5 output. Let the input be

F_{5} \in R^{C \times H \times W}

(28)

Light-MSA constructs attention at two scales: scale s = 1 corresponds to the original resolution, while scale s = 2 represents downsampling via 2 × 2 average pooling.

F^{(1)} = F_{5}

(29)

F^{(2)} = {AvgPool}_{2 \times 2} (F_{5})

(30)

For each scale, a 1 × 1 convolution is used to generate Q, K, and V:

Q^{(s)} = W_{Q} F^{(s)}

(31)

K^{(s)} = W_{K} F^{(s)}

(32)

V^{(s)} = W_{V} F^{(s)}

(33)

Flatten the features into a token sequence (N_s = H_sW_s), then compute the self-attention:

A^{(s)} = Softmax (\frac{Q^{(s)} {(K^{(s)})}^{⊤}}{\sqrt{d}}) V^{(s)}

(34)

Here, d denotes the projection dimension. To control computational complexity, the number of tokens at scale s = 2 is reduced from N to N/4. A⁽²⁾ is upsampled to the original resolution and then fused:

{\tilde{A}}^{(2)} = Up (A^{(2)})

(35)

F_{MSA} = F_{5} + λ_{1} A^{(1)} + λ_{2} {\tilde{A}}^{(2)}

(36)

Here, λ₁ and λ₂ are learnable weights.
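The two-scale pipeline of Equations (29)–(36) can be illustrated on a toy single-channel map. The sketch below is a simplification under several assumptions that the paper does not specify: tokens are scalars (d = 1), the Q, K, and V projections are identities, upsampling uses nearest-neighbour interpolation, and λ₁ and λ₂ are fixed constants rather than learned values. All function names are illustrative.

```python
import math


def self_attention(tokens):
    """Scaled dot-product self-attention for scalar tokens (d = 1, identity projections)."""
    out = []
    for q in tokens:
        scores = [q * k for k in tokens]  # division by sqrt(d) = 1 is omitted
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append(sum(e / total * v for e, v in zip(exps, tokens)))
    return out


def avg_pool_2x2(grid):
    """2 x 2 average pooling with stride 2 on an even-sized grid."""
    return [
        [(grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
         for c in range(0, len(grid[0]), 2)]
        for r in range(0, len(grid), 2)
    ]


def upsample_2x(grid):
    """Nearest-neighbour upsampling by a factor of two (an assumed choice)."""
    return [[grid[r // 2][c // 2] for c in range(2 * len(grid[0]))] for r in range(2 * len(grid))]


def attend(grid):
    """Flatten a grid to tokens, apply self-attention, and reshape back."""
    width = len(grid[0])
    flat = self_attention([v for row in grid for v in row])
    return [flat[i:i + width] for i in range(0, len(flat), width)]


def light_msa(grid, lam1, lam2):
    a1 = attend(grid)                                # scale s = 1, N tokens
    a2_up = upsample_2x(attend(avg_pool_2x2(grid)))  # scale s = 2, N/4 tokens, back to H x W
    return [
        [x + lam1 * p + lam2 * q for x, p, q in zip(rx, rp, rq)]
        for rx, rp, rq in zip(grid, a1, a2_up)
    ]


feature = [[0.0, 1.0, 0.0, 0.0],
           [1.0, 2.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
pooled = avg_pool_2x2(feature)
fused = light_msa(feature, lam1=0.5, lam2=0.5)
```

Because attention cost grows with the square of the token count, the pooled branch with N/4 tokens needs roughly one sixteenth of the pairwise score computations of the full-resolution branch.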

3.5. Hybrid Pooling Attention Module

To suppress fire-like interference, such as strong light reflections at night and welding sparks, an HPA is introduced after the Light-MSA for joint channel and spatial recalibration. Let the input be F_MSA.

(I) Channel Attention:

f_{avg} = GAP (F_{MSA})

(37)

f_{\max} = GMP (F_{MSA})

(38)

w_{c} = σ (MLP (f_{avg}) + MLP (f_{\max}))

(39)

F_{c} = w_{c} ⊙ F_{MSA}

(40)

(II) Spatial Attention:

f_{avg}^{s} = {Avg}_{c} (F_{c})

(41)

f_{\max}^{s} = {Max}_{c} (F_{c})

(42)

After concatenating the two, apply direction-sensitive convolutions:

M_{h} = σ ({Conv}_{1 \times k} ([f_{avg}^{s}; f_{\max}^{s}]))

(43)

M_{v} = σ ({Conv}_{k \times 1} ([f_{avg}^{s}; f_{\max}^{s}]))

(44)

m_{s} = α M_{h} + β M_{v}

(45)

F_{HPA} = F_{c} ⊙ m_{s}

(46)

Here, α and β are learnable weights, and the kernel size is k = 7.
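The spatial branch in Equations (41)–(46) can be sketched in plain Python as shown below. This is an illustrative sketch only: it uses k = 3 instead of k = 7 to keep the example small, zero padding at the borders, all-ones kernels, and fixed α = β = 0.5, none of which are specified in this form by the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_stats(feature):
    """Per-pixel mean and max across channels; feature is C x H x W."""
    height, width = len(feature[0]), len(feature[0][0])
    avg = [[sum(ch[r][c] for ch in feature) / len(feature) for c in range(width)]
           for r in range(height)]
    mx = [[max(ch[r][c] for ch in feature) for c in range(width)] for r in range(height)]
    return avg, mx


def directional_conv(maps, kernels, horizontal):
    """Zero-padded 1 x k (horizontal) or k x 1 (vertical) convolution, then sigmoid."""
    height, width = len(maps[0]), len(maps[0][0])
    half = len(kernels[0]) // 2
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            for grid, kernel in zip(maps, kernels):
                for j, w in enumerate(kernel):
                    rr, cc = (r, c + j - half) if horizontal else (r + j - half, c)
                    if 0 <= rr < height and 0 <= cc < width:
                        total += w * grid[rr][cc]
            out[r][c] = sigmoid(total)
    return out


def spatial_attention(feature, k_h, k_v, alpha, beta):
    maps = channel_stats(feature)
    m_h = directional_conv(maps, k_h, horizontal=True)
    m_v = directional_conv(maps, k_v, horizontal=False)
    m_s = [[alpha * h + beta * v for h, v in zip(rh, rv)] for rh, rv in zip(m_h, m_v)]
    refined = [[[x * m for x, m in zip(row, mrow)] for row, mrow in zip(ch, m_s)]
               for ch in feature]
    return m_s, refined


# Hypothetical 2-channel, 3 x 3 input with a vertical streak in channel 0.
feature = [
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
ones = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # one kernel per pooled map (avg, max)
m_s, refined = spatial_attention(feature, ones, ones, alpha=0.5, beta=0.5)
```

In this toy input the vertical kernel accumulates the whole streak at the centre pixel while the horizontal kernel sees only one streak pixel, which illustrates why separate 1 × k and k × 1 kernels respond differently to elongated structures.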

3.6. Loss Function and Optimization Strategy

The proposed wildfire recognition network is trained using the cross-entropy loss function. Given an input sample x_i with ground-truth label y_i ∈ {0,1}, the predicted probability of the wildfire class is denoted as p_i = f_θ(x_i). The cross-entropy loss is defined as

L = - \frac{1}{N} \sum_{i = 1}^{N} [y_{i} \log (p_{i}) + (1 - y_{i}) \log (1 - p_{i})]

(47)

where N denotes the number of samples in a mini-batch.

Considering that the wildfire dataset exhibits an imbalanced distribution between wildfire and non-wildfire samples, a weighted cross-entropy formulation is adopted. Let N₁ and N₀ denote the number of wildfire and non-wildfire samples, respectively. The corresponding class prior probabilities can be expressed as

P_{1} = \frac{N_{1}}{N_{0} + N_{1}}, P_{0} = \frac{N_{0}}{N_{0} + N_{1}}

(48)

To alleviate the bias toward the dominant class, class-dependent weighting coefficients are introduced

w_{1} = \frac{1}{P_{1}}, w_{0} = \frac{1}{P_{0}}

(49)

so that samples from the minority class receive larger gradient contributions during optimization.
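As a worked example of Equations (48) and (49), the sketch below computes the inverse-prior class weights. The sample counts are hypothetical and chosen only for illustration; they are not the class counts of the dataset used in this paper.

```python
def class_weights(n_fire, n_non_fire):
    """Inverse-prior weights (w1, w0) for the wildfire and non-wildfire classes."""
    total = n_fire + n_non_fire
    p1 = n_fire / total      # prior of the wildfire class
    p0 = n_non_fire / total  # prior of the non-wildfire class
    return 1.0 / p1, 1.0 / p0


# Hypothetical counts: 200 wildfire and 800 non-wildfire images.
w1, w0 = class_weights(n_fire=200, n_non_fire=800)
```

With these counts the priors are 0.2 and 0.8, so the minority wildfire class is weighted 5.0 against 1.25 for the majority class, a ratio of 4 to 1.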

The Adam optimizer is used to optimize the network parameters. The initial learning rate is set to l_r = 1 × 10⁻³, weight decay is set to 1 × 10⁻⁴, the batch size is fixed to 16, and the maximum number of training epochs is set to 200. A cosine annealing learning rate schedule is adopted during training. The final model is selected based on the best F1 score on the validation set. Early stopping is applied when the validation performance does not improve over several consecutive epochs, which helps prevent overfitting and improves training efficiency.
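The cosine annealing schedule can be written in closed form. The sketch below assumes a single cosine cycle spanning all 200 epochs and a minimum learning rate of zero; the paper does not report these two settings, so both are assumptions, as is the function name `cosine_lr`.

```python
import math


def cosine_lr(epoch, base_lr=1e-3, total_epochs=200, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing without restarts."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# Learning rate for epochs 0 through 200 under the assumed settings.
schedule = [cosine_lr(e) for e in range(201)]
```

The rate starts at the initial value of 1 × 10⁻³, passes through half of that value at epoch 100, and decays smoothly to the minimum at epoch 200.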

3.7. Architecture Overview

The overall architecture of the proposed improved AlexNet is illustrated in Figure 5. The input transmission corridor image is first fed into Conv1 to extract low-level visual features, followed by a MaxPool layer to reduce the spatial resolution and retain dominant responses. The resulting feature maps are then processed by Conv2 to further capture intermediate representations.

After the early convolution stages, the Light-MSA module is introduced to enhance multi-scale contextual dependency modeling and improve the representation of weak wildfire regions under complex backgrounds. The refined features are subsequently passed through Conv3, followed by the RepGhost-Inception (RGI) module, which performs lightweight multi-scale feature extraction and feature fusion through re-parameterized branches. This design improves the network’s ability to characterize wildfire-related patterns of different scales while maintaining computational efficiency.

Next, the fused feature maps are fed into Conv4 to obtain higher-level semantic representations. To further suppress fire-like interference caused by strong light, reflections, and other background disturbances, the Hybrid Pooling Attention (HPA) module is applied after Conv4 to adaptively refine the discriminative features. Finally, the refined feature maps are sent to the fully connected classification layer, which outputs the final recognition result. Through the coordinated integration of Light-MSA, RGI, and HPA into the AlexNet backbone, the proposed model achieves enhanced feature representation and interference suppression while preserving a relatively lightweight architecture. The overall pseudocode for our method can be presented in Algorithm 1.

Algorithm 1 Pseudocode of the proposed improved AlexNet

Input: Training set D = {(x_i, y_i)}_{i=1}^{N}

Output: Predicted label ŷ ∈ {wildfire, non-wildfire}

Initialize the parameters of the improved AlexNet
for each training epoch do
    for each mini-batch {(x_i, y_i)} in D do
        Resize and normalize the input image x_i
        Perform data augmentation on x_i
        Extract low-level features using Conv1
        Reduce the spatial dimension using MaxPool
        Extract intermediate features using Conv2
        Enhance multi-scale contextual information using Light-MSA
        Further extract semantic features using Conv3
        Perform lightweight multi-scale feature fusion using the RepGhost-Inception module
        Generate high-level feature maps using Conv4
        Refine discriminative representations using the Hybrid Pooling Attention module
        Feed the refined features into the fully connected classification layer
        Obtain the prediction probability p_i
        Compute the classification loss between p_i and y_i
        Update the network parameters by backpropagation
    end for
end for

4. Experimental Results and Analysis

4.1. Construction of Wildfire Dataset and Experimental Environment

To validate the effectiveness of the proposed method in the visual monitoring scenario of power transmission line corridors, this study constructed a wildfire image classification dataset for power transmission lines. This dataset mainly comes from two sources:

(I) Field-collected data: A total of 1246 images were collected, captured by visual monitoring devices deployed along the transmission line corridors of the State Grid Corporation of China. These images cover various scenarios, including daytime and nighttime, sunny and cloudy weather, and include some typical false alarm sources such as strong lights and reflections from street lamps and vehicle headlights, welding or burning flames, red-orange objects, sunspots, smoke obstructions, and haze.

(II) Public supplementary data: 600 typical wildfire and non-wildfire event images were collected from publicly available online sources. These data supplement the dataset, increasing the visual diversity caused by differences in geographic terrain, vegetation types, and imaging devices, thereby enhancing the model’s cross-scenario generalization capability.

In order to guarantee the reproducibility of the split of data and the validity of the assessment, all samples were subjected to task-specific image-level cleaning and deduplication. For a series of images from the same event or video clip, sample selection was performed by a combination of similarity screening and visual review, so that similar samples would not appear together in both the training and testing sets. Then, the dataset was split randomly into training, validation, and testing sets in a ratio of 7:2:1, with the aim being to preserve the original distribution of wildfire and non-wildfire samples in each subset in order to reduce the effect of distribution drift on the assessment results. Figure 6 illustrates some examples of wildfire and non-wildfire samples in the generated dataset.
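A class-preserving 7:2:1 split of the kind described above can be sketched with the standard library alone. The example is a simplified illustration: the label counts are hypothetical, the function name is illustrative, and the event-level deduplication described in the text is assumed to have been done beforehand.

```python
import random


def stratified_split(labels, ratios=(0.7, 0.2, 0.1), seed=0):
    """Return index lists (train, val, test) that preserve the class ratio."""
    rng = random.Random(seed)  # fixed seed so that the split is reproducible
    splits = ([], [], [])
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(ratios[0] * len(idx))
        n_val = round(ratios[1] * len(idx))
        splits[0].extend(idx[:n_train])
        splits[1].extend(idx[n_train:n_train + n_val])
        splits[2].extend(idx[n_train + n_val:])  # the remainder goes to the test set
    return splits


# Hypothetical labels: 1 = wildfire, 0 = non-wildfire (counts are illustrative).
labels = [1] * 300 + [0] * 700
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting each class separately keeps the wildfire share at 30% in all three subsets for this toy label list, which is the property the random split is meant to preserve.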

The experiments were performed on a workstation with an Intel Xeon Silver 4214R CPU (Intel Corporation, Santa Clara, CA, USA), an NVIDIA RTX 3080 Ti GPU with 12 GB VRAM (NVIDIA Corporation, Santa Clara, CA, USA), and a Linux operating system. The proposed approach was developed with Python 3.11 and the PyTorch 2.1.0 deep learning library. The input images were uniformly resized to 640 × 640 pixels, the batch size was fixed to 16, and the maximum number of training epochs was fixed to 200. To improve the training stability in the small-sample setting, we conducted transfer learning using pre-trained weights on ImageNet, and the parameters of the newly added modules were initialized with Kaiming initialization.

4.2. Training Strategy

The proposed wildfire recognition network is trained using the cross-entropy loss function as the optimization objective. Given an input sample x_i with ground-truth label y_i ∈ {0,1}, the predicted probability of the wildfire class is denoted as p_i = f_θ(x_i). The cross-entropy loss is defined as

L = - \frac{1}{N} \sum_{i = 1}^{N} [y_{i} \log (p_{i}) + (1 - y_{i}) \log (1 - p_{i})]

(50)

where N denotes the number of samples in a mini-batch.

Considering that the wildfire recognition dataset contains an imbalanced distribution between wildfire and non-wildfire samples, a weighted cross-entropy formulation is adopted to reduce the bias toward the dominant class. The weighted loss function can be expressed as

L = - \frac{1}{N} \sum_{i = 1}^{N} [w_{1} y_{i} \log (p_{i}) + w_{0} (1 - y_{i}) \log (1 - p_{i})]

(51)

where w₁ and w₀ denote the weights associated with wildfire and non-wildfire classes, respectively.
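The weighted loss in Equation (51) can be evaluated directly, as in the minimal sketch below. The predicted probabilities, labels, and class weights are hypothetical values chosen for illustration, and the small `eps` clamp is an added numerical safeguard that is not part of the equation.

```python
import math


def weighted_bce(probs, labels, w1, w0, eps=1e-12):
    """Weighted binary cross-entropy averaged over a mini-batch."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += w1 * y * math.log(p) + w0 * (1 - y) * math.log(1.0 - p)
    return -total / len(probs)


# Hypothetical mini-batch: predicted wildfire probabilities and ground-truth labels.
probs = [0.9, 0.2, 0.6, 0.1]
labels = [1, 0, 0, 1]

plain = weighted_bce(probs, labels, w1=1.0, w0=1.0)      # reduces to Equation (50)
weighted = weighted_bce(probs, labels, w1=5.0, w0=1.25)  # assumed class weights
```

With the larger wildfire weight, the missed wildfire sample (p = 0.1, y = 1) dominates the loss, which is the intended effect of the weighting on the minority class.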

4.3. Performance Evaluation Metrics

In this paper, precision, recall, accuracy, and F1 measure are employed to assess the performance of the proposed neural network model for wildfire detection. Precision is defined as the ratio of the number of correctly predicted positive instances to the total number of predicted positive instances, which indicates the model’s capability to suppress false positives. Recall, or the true positive rate (TPR), is the ratio of the number of correctly predicted positive instances to the total number of actual positive instances, which indicates the model’s sensitivity to the wildfire event. Accuracy is defined as the ratio of the number of correctly classified instances to the total number of instances. The F1 measure is a hybrid metric that takes into account both precision and recall. The definitions of these metrics are as follows:

Pre = \frac{TP}{TP + FP}

(52)

Rec = \frac{TP}{TP + FN}

(53)

ACC = \frac{TP + TN}{TP + TN + FN + FP}

(54)

F1\text{-score} = \frac{(\beta^{2} + 1) \times Pre \times Rec}{\beta^{2} \times Pre + Rec}

(55)

TP is the number of wildfire samples that the model correctly classifies as wildfire. FP is the number of non-wildfire samples that the model incorrectly classifies as wildfire. FN is the number of wildfire samples that the model fails to classify as wildfire. TN is the number of non-wildfire samples that the model correctly classifies as non-wildfire. The parameter β weights recall relative to precision; setting β = 1 gives the F1-score.

In addition, a confusion matrix is provided to illustrate the distribution of classification results and to support the calculation of evaluation metrics such as precision, recall, and F1-score.
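The following sketch shows how the four metrics can be computed from the entries of a confusion matrix. Accuracy is taken as the share of correctly classified samples (TP + TN) among all samples, as defined in the text, and β = 1 is used for the F1-score. The counts are hypothetical and are not taken from Figure 9.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Precision, recall, accuracy, and F-beta score from confusion-matrix counts.

    The counts are assumed to give nonzero denominators.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    b2 = beta ** 2
    f_score = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_score": f_score,
    }

# Hypothetical test set: 100 wildfire images and 400 non-wildfire images.
metrics = classification_metrics(tp=90, fp=5, fn=10, tn=395)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

In an imbalanced setting such as this one, accuracy is dominated by the majority class, which is why precision, recall, and the F1-score are reported alongside it.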

4.4. Model Comparison

In order to verify the effectiveness and superiority of the proposed network, we conducted comparative experiments between the proposed network and several representative models. The experiments were conducted under the same hardware environment, and the corresponding input image sizes were adopted for each network to achieve the best results [29,30,31,32,33,34,35]. The experimental results are listed in Table 1, and the comparison of the training accuracy of different models is shown in Figure 7.

As shown in Table 1, the proposed improved AlexNet achieves the best overall performance among the compared models in terms of accuracy, precision, recall, and F1-score. In particular, the accuracy of the proposed model is 11.3 percentage points higher than that of the original AlexNet (96.9% versus 85.6%). In addition, the proposed model achieves higher F1-scores than lightweight architectures such as MobileNetV2 and deeper networks such as ResNet50. It also outperforms more recent deep learning architectures, including EfficientNetV2, Vision Transformer (ViT), and ConvNeXt, on all four metrics. These results indicate that the proposed network has strong discriminative capability and robustness for wildfire recognition in complex transmission corridor monitoring environments, especially in suppressing fire-like interference.

Although formal statistical significance testing is not included in this study, the consistent improvements across multiple evaluation metrics (accuracy, precision, recall, and F1-score) and comparisons with several representative baseline models provide empirical evidence of the effectiveness of the proposed method.

To further analyze the convergence behavior and training stability of different models, the training accuracy and loss curves during the training process are shown in Figure 7 and Figure 8. As illustrated in Figure 8, the proposed model exhibits a relatively faster decrease in loss during the early stage of training, indicating effective feature learning and parameter optimization. After a certain number of training iterations, the loss gradually stabilizes, suggesting that the model converges well during the training process and demonstrates good training stability. Compared with the baseline AlexNet model, the proposed method maintains a consistently lower loss value throughout the training process. This observation further demonstrates the effectiveness of the proposed multi-scale feature extraction strategy and attention recalibration mechanism in improving feature representation under complex wildfire monitoring scenarios.

To evaluate the trade-off between recognition performance and computational complexity, we further analyzed the number of parameters and floating-point operations (FLOPs) of different models, and the results are presented in Table 2. As shown in Table 2, VGG16 and Vision Transformer have relatively large numbers of parameters and higher computational complexity. MobileNetV2 has the smallest model size and the lowest computational cost, but its recognition performance is relatively limited. Recent architectures such as EfficientNetV2 and ConvNeXt achieve improved efficiency but still require moderate computational resources. In comparison, the improved AlexNet proposed in this paper achieves a favorable balance between computational complexity and recognition performance. Specifically, the proposed model significantly reduces the number of parameters compared with the original AlexNet while maintaining relatively low FLOPs. At the same time, it achieves the best recognition performance among the compared models, demonstrating its effectiveness for wildfire recognition in transmission corridor monitoring scenarios.
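The trade-off described above can be made concrete with a short calculation on the values in Table 1 and Table 2. The sketch below is only illustrative arithmetic on the tabulated numbers; the variable names are chosen for this example.

```python
# (accuracy %, parameters in M, FLOPs in G), transcribed from Table 1 and Table 2.
models = {
    "AlexNet": (85.6, 61.0, 1.4),
    "VGG16": (90.2, 138.0, 15.5),
    "ResNet50": (92.4, 25.6, 4.1),
    "MobileNetV2": (89.1, 3.4, 0.6),
    "EfficientNetV2": (94.3, 24.0, 4.3),
    "Vision Transformer": (93.7, 86.6, 17.6),
    "ConvNeXt": (94.9, 28.6, 4.5),
    "Improved AlexNet": (96.9, 24.8, 2.3),
}

base_acc, base_params, base_flops = models["AlexNet"]
acc, params, flops = models["Improved AlexNet"]

# Relative change of the improved model with respect to the original AlexNet.
param_reduction_pct = 100.0 * (base_params - params) / base_params
flops_ratio = flops / base_flops
acc_gain_points = acc - base_acc

print(f"parameter reduction: {param_reduction_pct:.1f}%")
print(f"FLOPs ratio (improved / original): {flops_ratio:.2f}")
print(f"accuracy gain: {acc_gain_points:.1f} percentage points")
```

With the tabulated values, the parameter count falls by about 59.3% relative to the original AlexNet, while the FLOPs rise from 1.4 G to 2.3 G, which is still lower than every compared model except AlexNet and MobileNetV2.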

4.5. Ablation Study

In order to verify the effectiveness of each proposed improvement module, an ablation study was performed by gradually adding the RepGhost-Inception (RGI), Light-MSA, and Hybrid Pooling Attention (HPA) modules to the AlexNet backbone network. The experimental results of the ablation study are shown in Table 3.

The experimental results show that only by combining the RepGhost-Inception module, Light-MSA, and the hybrid pooling attention module can the highest overall performance with an accuracy of 96.9% be obtained. Compared with the combination of only the RepGhost-Inception module and Light-MSA, the accuracy has been improved by 1.8 percentage points. These experimental results further verify that the reasonable combination of multiple advanced modules can make the model’s generalization performance better.
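To make the contribution of each module easier to read off, the sketch below computes two simple quantities from Table 3: the gain when a module is added alone to the AlexNet backbone, and the gain when it is added last to the other two modules. This is illustrative arithmetic on the reported accuracies, not an additional experiment.

```python
# Accuracy (%) from Table 3, keyed by (RGI, Light-MSA, HPA) inclusion flags.
ablation = {
    (0, 0, 0): 85.6,
    (1, 0, 0): 90.3,
    (0, 1, 0): 88.7,
    (0, 0, 1): 89.2,
    (0, 1, 1): 92.3,
    (1, 0, 1): 94.5,
    (1, 1, 0): 95.1,
    (1, 1, 1): 96.9,
}
names = ("RGI", "Light-MSA", "HPA")

def gain_alone(idx):
    """Gain in percentage points when only module idx is added to the backbone."""
    alone = tuple(1 if i == idx else 0 for i in range(3))
    return round(ablation[alone] - ablation[(0, 0, 0)], 1)

def gain_last(idx):
    """Gain in percentage points when module idx is added to the other two modules."""
    without = tuple(0 if i == idx else 1 for i in range(3))
    return round(ablation[(1, 1, 1)] - ablation[without], 1)

for i, name in enumerate(names):
    print(f"{name}: alone +{gain_alone(i)}, added last +{gain_last(i)}")
```

Every module still yields a positive gain when added last, which suggests that the three modules are complementary; the gain from adding HPA last is the 1.8 percentage points quoted in the text.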

4.6. Model Stability Validation

To further evaluate the generalization capability of the proposed model, additional experiments were conducted on the Internet Forest Fire dataset, and the results are presented in Table 4 [36].

As shown in Table 4, the recognition performance of all models decreases to some extent on the Internet Forest Fire dataset compared with the results obtained on the original dataset. This phenomenon can be attributed to differences in data distribution, image quality, and environmental conditions between the two datasets. Nevertheless, the improved AlexNet model still achieves the best overall performance in terms of accuracy and F1-score among the compared methods. This result indicates that the proposed model maintains strong robustness and stable recognition capability under different wildfire scenarios, demonstrating its effectiveness in handling fire-like interference and complex monitoring environments.
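The size of the performance drop can be quantified directly from Table 1 and Table 4. The sketch below computes the per-model accuracy drop in percentage points; it is illustrative arithmetic on the reported accuracies only.

```python
# Accuracy (%) on the constructed dataset (Table 1) and on the
# Internet Forest Fire dataset (Table 4).
acc_original = {
    "Improved AlexNet": 96.9, "AlexNet": 85.6, "VGG16": 90.2, "ResNet50": 92.4,
    "MobileNetV2": 89.1, "EfficientNetV2": 94.3, "Vision Transformer": 93.7,
    "ConvNeXt": 94.9,
}
acc_external = {
    "Improved AlexNet": 95.8, "AlexNet": 83.0, "VGG16": 88.2, "ResNet50": 90.5,
    "MobileNetV2": 86.7, "EfficientNetV2": 93.2, "Vision Transformer": 92.5,
    "ConvNeXt": 93.6,
}

# Accuracy drop in percentage points when moving to the external dataset.
acc_drop = {
    name: round(acc_original[name] - acc_external[name], 1)
    for name in acc_original
}
best_external = max(acc_external, key=acc_external.get)

for name, drop in sorted(acc_drop.items(), key=lambda item: item[1]):
    print(f"{name}: -{drop} points")
print(f"best accuracy on the external dataset: {best_external}")
```

The improved AlexNet loses 1.1 percentage points, the same as EfficientNetV2 and less than the other compared models, while the original AlexNet loses 2.6 points.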

4.7. Qualitative Visualization and Error Analysis

In order to further analyze the behavior of the proposed model in real transmission corridor monitoring scenarios, qualitative visualization and error case analysis are conducted in this section. Quantitative metrics such as accuracy and F1-score provide an overall evaluation of classification performance, but they do not fully reveal the characteristics of model predictions in complex environments. Therefore, confusion matrix visualization and representative misclassification cases are analyzed to better understand the strengths and limitations of the proposed method.

Figure 9 presents the confusion matrix of the proposed improved AlexNet model on the test dataset. The confusion matrix provides a comprehensive overview of classification performance, including true positives, false positives, true negatives, and false negatives. It can be observed that most wildfire and non-wildfire samples are correctly classified, indicating that the proposed model has strong discrimination ability in transmission corridor environments.

However, a small number of misclassification cases still exist. To further investigate these situations, several representative error examples are illustrated in Figure 10. The false positive cases mainly occur in scenes containing strong artificial light sources such as vehicle headlights and street lamps, as well as the sun. These light sources often exhibit similar color distributions and high-intensity regions resembling flame characteristics, which may confuse the model.

On the other hand, false negative cases typically occur when the wildfire region is very small, partially occluded by smoke, or located at long distances from the monitoring camera. Under these conditions, the visual features of flames become weak and difficult to distinguish from the background, leading to missed detections.

Overall, the qualitative analysis demonstrates that the proposed model significantly improves the robustness of wildfire recognition in complex transmission corridor environments. Nevertheless, extremely small fire regions and intense artificial light interference remain challenging scenarios. Future work will explore temporal information from video sequences and larger-scale datasets to further enhance detection reliability.

5. Conclusions

The visual monitoring of power transmission lines faces challenges such as a high false positive rate in wildfire image recognition and poor robustness to complex backgrounds. To address these challenges, this paper proposed an improved AlexNet-based wildfire image recognition method for transmission line corridors. By introducing multi-scale feature extraction and attention enhancement modules into the classical AlexNet, the method improves the ability to distinguish wildfires from fire-like interference while keeping the computational complexity under control.

The major contributions and conclusions of this study are summarized as follows:

(I) A wildfire image recognition network architecture is designed for complex transmission corridor scenarios. The architecture uses AlexNet as the backbone network and incorporates the RepGhost-Inception multi-scale reparameterization module to enhance the representation of flame and smoke features at different scales. By incorporating the lightweight multi-scale self-attention mechanism (Light-MSA) and the hybrid pooling attention mechanism (HPA), the architecture better handles strong nighttime light reflections, fire-like interference, and low-contrast fire targets.

(II) A wildfire image dataset has been constructed specifically for transmission line scenarios. It combines 1246 on-site images captured under complex conditions by a transmission corridor visual monitoring system with about 600 high-resolution images collected from the internet. The images vary in factors such as strong night light, fire-like interference, and smoke occlusion. The dataset provides a reliable basis for model training and performance testing.

(III) We carried out extensive experiments to show the effectiveness and generalization ability of the proposed approach. In the real imbalanced class conditions, the recognition accuracy reached 96.9% with the improved AlexNet on the constructed dataset, and it was significantly better than that of the original AlexNet. The ablation experiment results confirm the effectiveness of each improvement module, and the cross-dataset validation results show that the proposed method is stable under different wildfire scenarios.

In summary, the enhanced AlexNet model is able to achieve high recognition accuracy and a low false alarm rate in wildfire image recognition under complex transmission line monitoring scenarios; therefore, it is suitable for practical online wildfire monitoring and operational decision-making in transmission corridors.

Nevertheless, this work has limitations. First, the number of real wildfire samples is relatively small compared with the total number of samples in the monitoring data, and many types of extreme wildfire conditions are not covered by the current monitoring data. Second, the approach relies only on the single modality of visible-light images, which may be insufficient for scenes with dense smoke or strong illumination. Future work should enlarge the set of real cases, develop multimodal information fusion methods, and further optimize the deployment of lightweight models so that the approach can be better applied in engineering practice.

Author Contributions

Conceptualization, Z.Z. and G.D.; methodology, Z.Z.; software, Z.Z.; validation, Z.Z. and G.D.; formal analysis, Z.Z.; investigation, Z.Z.; resources, G.D.; data curation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, G.D.; visualization, Z.Z.; supervision, G.D.; project administration, G.D.; funding acquisition, G.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Open Fund of Hubei Transmission Line Engineering Technology Research Center, project titled “Research on Structural Damage Assessment Methods for Transmission Towers Based on Fiber Bragg Grating Technology”, grant number 2024KXL05.

Data Availability Statement

Some of the publicly available online fire datasets and related documents collected for this study are available at: https://github.com/a1412060476/Improved-AlexNet-architecture (accessed on 1 March 2026). Due to privacy and secrecy laws related to power grid infrastructure security, the complete data supporting this study cannot be publicly provided, but they still may be obtained from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank all the staff, teachers, and students of the Transmission Operation and Maintenance Branch Company for their invaluable support in the data collection process, the implementation of the research, and engineering practice. We would also like to thank the anonymous reviewers and editors for taking the time to review the paper and provide feedback that contributes to the progress of the research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Chen, W.; Zhou, Y.; Zhou, E.; Xiang, Z.; Zhou, W.; Lu, J. Wildfire Risk Assessment of Transmission-Line Corridors Based on Naïve Bayes Network and Remote Sensing Data. Sensors 2021, 21, 634.
2. Khan, I.; Ghassemi, M. A Probabilistic Approach for Analysis of Line Outage Risk Caused by Wildfires. Int. J. Electr. Power Energy Syst. 2022, 139, 108042.
3. Zhou, F.; Geng, H.; Wen, G.; Ma, Y.; Ma, Y.; Wang, G.; Cao, J.; Xu, J.; Mei, H. Influence of Mountain Wildfires on the Insulation Properties of Air Gaps in Power Grids. Energies 2025, 18, 225.
4. Xu, Z.; Li, J.; Cheng, S.; Rui, X.; Zhao, Y.; He, H.; Guan, H.; Sharma, A.; Erxleben, M.; Chang, R. Deep Learning for Wildfire Risk Prediction: Integrating Remote Sensing and Environmental Data. ISPRS J. Photogramm. Remote Sens. 2025, 227, 632–677.
5. Sun, B.; Cheng, X. Smoke Detection Transformer: An Improved Real-Time Detection Transformer Smoke Detection Model for Early Fire Warning. Fire 2024, 7, 488.
6. Faisal, M.A.A.; Mecheter, I.; Qiblawey, Y.; Fernandez, J.H.; Chowdhury, M.E.; Kiranyaz, S. Deep Learning in Automated Power Line Inspection: A Review. Appl. Energy 2025, 385, 125507.
7. Wang, M.; Yue, P.; Jiang, L.; Yu, D.; Tuo, T.; Li, J. An Open Flame and Smoke Detection Dataset for Deep Learning in Remote Sensing Based Fire Detection. Geo-Spat. Inf. Sci. 2025, 28, 511–526.
8. Cheng, G.; Chen, X.; Wang, C.; Li, X.; Xian, B.; Yu, H. Visual Fire Detection Using Deep Learning: A Survey. Neurocomputing 2024, 596, 127975.
9. Gragnaniello, D.; Greco, A.; Sansone, C.; Vento, B. Fire and Smoke Detection from Videos: A Literature Review under a Novel Taxonomy. Expert Syst. Appl. 2024, 255, 124783.
10. Boroujeni, S.P.H.; Razi, A.; Khoshdel, S.; Afghah, F.; Coen, J.L.; O’Neill, L.; Fule, P.; Watts, A.; Kokolakis, N.-M.T.; Vamvoudakis, K.G. A Comprehensive Survey of Research towards AI-Enabled Unmanned Aerial Systems in Pre-, Active-, and Post-Wildfire Management. Inf. Fusion 2024, 108, 102369.
11. Vasconcelos, R.N.; Franca Rocha, W.J.; Costa, D.P.; Duverger, S.G.; Santana, M.M.d.; Cambui, E.C.; Ferreira-Ferreira, J.; Oliveira, M.; Barbosa, L.d.S.; Cordeiro, C.L. Fire Detection with Deep Learning: A Comprehensive Review. Land 2024, 13, 1696.
12. Saleh, A.; Zulkifley, M.A.; Harun, H.H.; Gaudreault, F.; Davison, I.; Spraggon, M. Forest Fire Surveillance Systems: A Review of Deep Learning Methods. Heliyon 2024, 10, e23127.
13. Özel, B.; Alam, M.S.; Khan, M.U. Review of Modern Forest Fire Detection Techniques: Innovations in Image Processing and Deep Learning. Information 2024, 15, 538.
14. Elhanashi, A.; Essahraui, S.; Dini, P.; Saponara, S. Early Fire and Smoke Detection Using Deep Learning: A Comprehensive Review of Models, Datasets, and Challenges. Appl. Sci. 2025, 15, 10255.
15. Bouguettaya, A.; Zarzour, H.; Taberkit, A.M.; Kechida, A. A Review on Early Wildfire Detection from Unmanned Aerial Vehicles Using Deep Learning-Based Computer Vision Algorithms. Signal Process. 2022, 190, 108309.
16. Danish, S.; Piran, M.J.; Khan, S.U.; Khan, M.A.; Dang, L.M.; Zweiri, Y.; Song, H.-K.; Moon, H. Vision-Based Fire Management System Using Autonomous Unmanned Aerial Vehicles: A Comprehensive Survey. Artif. Intell. Rev. 2025, 59, 16.
17. Chaturvedi, S.; Khanna, P.; Ojha, A. A Survey on Vision-Based Outdoor Smoke Detection Techniques for Environmental Safety. ISPRS J. Photogramm. Remote Sens. 2022, 185, 158–187.
18. Wang, X.; Wang, B.; Luo, P.; Wang, L.; Wu, Y. A Metric Learning-Based Improved Oriented R-CNN for Wildfire Detection in Power Transmission Corridors. Sensors 2025, 25, 3882.
19. Huang, H.; Chen, K.; Song, B.; Chen, C.; Li, L.; Ling, J. Susceptibility Assessment of Wildfire-Induced Transmission Line Tripping Using a Physical-Bayesian Modeling Approach. Sci. Rep. 2025, 15, 44540.
20. Park, M.; Ko, B.C. Two-Step Real-Time Night-Time Fire Detection in an Urban Environment Using Static ELASTIC-YOLOv3 and Temporal Fire-Tube. Sensors 2020, 20, 2202.
21. Jenssen, R.; Roverso, D. Automatic Autonomous Vision-Based Power Line Inspection: A Review of Current Status and the Potential Role of Deep Learning. Int. J. Electr. Power Energy Syst. 2018, 99, 107–120.
22. Ghali, R.; Akhloufi, M.A. Deep Learning Approaches for Wildland Fires Using Satellite Remote Sensing Data: Detection, Mapping, and Prediction. Fire 2023, 6, 192.
23. Zhao, X.; Wang, L.; Zhang, Y.; Han, X.; Deveci, M.; Parmar, M. A Review of Convolutional Neural Networks in Computer Vision. Artif. Intell. Rev. 2024, 57, 99.
24. Safonova, A.; Ghazaryan, G.; Stiller, S.; Main-Knorn, M.; Nendel, C.; Ryo, M. Ten Deep Learning Techniques to Address Small Data Problems with Remote Sensing. Int. J. Appl. Earth Obs. Geoinf. 2023, 125, 103569.
25. Rather, I.H.; Kumar, S.; Gandomi, A.H. Breaking the Data Barrier: A Review of Deep Learning Techniques for Democratizing AI with Small Datasets. Artif. Intell. Rev. 2024, 57, 226.
26. Liu, Y.; Liang, H.; Zhao, S. LMSFF: Lightweight Multi-Scale Feature Fusion Network for Image Recognition under Resource-Constrained Environments. Expert Syst. Appl. 2025, 262, 125584.
27. Raghu, M.; Zhang, C.; Kleinberg, J.; Bengio, S. Transfusion: Understanding Transfer Learning for Medical Imaging. Adv. Neural Inf. Process. Syst. 2019, 32, 3342–3352.
28. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of Tricks for Image Classification with Convolutional Neural Networks. arXiv 2018, arXiv:1812.01187.
29. Tan, M.; Le, Q.V. EfficientNetV2: Smaller Models and Faster Training. arXiv 2021, arXiv:2104.00298.
30. Ravikiran, K.H.; Kumar, P.S.M.; Bindu, K.; Jayanth, J. Emotion Image Classification in Animals Using a Bat Algorithm-Based Hyperparameter Tuned VGG16 Model. SN Comput. Sci. 2026, 7, 239.
31. Danyo, A.; Dontoh, A.; Aboah, A. An Improved ResNet50 Model for Predicting Pavement Condition Index (PCI) Directly from Pavement Images. Road Mater. Pavement Des. 2026, 27, 682–699.
32. Yuliansyah, H.; Saputro, R.A.; Khoirunnisa, I.I.; Ali, W.N.S.W.; Nur, Y.S.R.; Radzuan, N.F.M. Plastic Bottle Defect Detection Based on Convolutional Neural Network with MobileNetV2 Architecture. Frankl. Open 2026, 14, 100522.
33. Yin, H.; Cao, Y.; Liu, L.; Chen, D.; Zhang, Q. Meteorological Observation Research Based on an Improved EfficientNetV2 Model. Environ. Model. Softw. 2026, 197, 106835.
34. Saiveena, K.; Praveen, P. Edge Intelligence-Enabled Vision Transformer with YOLOv5 for Accurate Vehicle Detection and Image Segmentation Using Adaptive Multiscale Deep Learning Mechanism. Comput. Electr. Eng. 2026, 134, 111086.
35. Yang, Y.; Deng, X. Power Transformer Winding Fault Diagnosis Method Based on Time–Frequency Diffusion Model and ConvNeXt-1D. Appl. Sci. 2026, 16, 2528.
36. Bouthillier, X.; Delaunay, P.; Bronzi, M.; Trofimov, A.; Nichyporuk, B.; Szeto, J.; Mohammadi Sepahvand, N.; Raff, E.; Madan, K.; Voleti, V. Accounting for Variance in Machine Learning Benchmarks. Proc. Mach. Learn. Syst. 2021, 3, 747–769.

Figure 1. AlexNet network structure. In the figure, k denotes the convolution kernel size, s denotes the stride, and n denotes the number of neurons in the fully connected layers. MaxPool represents the max pooling operation. The branches use convolution layers with different kernel sizes to capture multi-scale features. Specifically, “1 × 1”, “3 × 3”, and “5 × 5” denote convolution operations with kernel sizes of 1 × 1, 3 × 3, and 5 × 5, respectively.

Figure 2. RepGhost-Inception module. The base feature map is processed by multiple parallel branches with different operations. The labels “1 × 1”, “3 × 3”, and “5 × 5” denote convolution layers with kernel sizes of 1 × 1, 3 × 3, and 5 × 5, respectively. “Pool” represents the pooling operation, and “Filter Concat” denotes the concatenation of feature maps from different branches.

Figure 3. Light-MSA Attention Module. The input feature map is processed through two parallel branches, namely the channel attention branch and the spatial attention branch. The outputs of the two branches are fused through a feature fusion operation to generate the refined output feature map.

Figure 4. Hybrid Pooling Attention Module. The input feature map is first processed by a convolution operation to generate the feature map. The feature map is then processed by two parallel pooling operations, namely average pooling and max pooling. The resulting features are fused to form the pooled feature representation, which is further refined through a convolution operation to produce the final output feature map.

Figure 5. Improved AlexNet architecture.

Figure 6. Representative Images from the Transmission Corridor Wildfire Dataset.

Figure 7. Training Accuracy Comparison Chart.

Figure 8. Training Loss Comparison Chart.

Figure 9. Confusion matrix of the proposed wildfire image recognition model on the test dataset.

Figure 10. Visualization of typical misclassification cases in the wildfire dataset.

Table 1. Comparison results of different models.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
Improved AlexNet	96.9	95.4	93.5	94.9
AlexNet	85.6	84.3	89.5	87.6
VGG16	90.2	90.1	88.8	89.6
ResNet50	92.4	91.8	90.6	91.2
MobileNetV2	89.1	88.7	87.3	88.0
EfficientNetV2	94.3	93.1	91.6	92.3
Vision Transformer	93.7	92.5	91.2	91.8
ConvNeXt	94.9	93.7	92.1	92.9

Table 2. Model Complexity Comparison.

Model	Parameters (M)	FLOPs (G)
AlexNet	61.0	1.4
VGG16	138.0	15.5
ResNet50	25.6	4.1
MobileNetV2	3.4	0.6
EfficientNetV2	24.0	4.3
Vision Transformer	86.6	17.6
ConvNeXt	28.6	4.5
Improved AlexNet	24.8	2.3

Table 3. Ablation Study Results. ✗ indicates that the corresponding module is not included, while ✓ indicates that the corresponding module is included.

RGI	Light-MSA	HPA	Accuracy (%)
✗	✗	✗	85.6
✓	✗	✗	90.3
✗	✓	✗	88.7
✗	✗	✓	89.2
✗	✓	✓	92.3
✓	✗	✓	94.5
✓	✓	✗	95.1
✓	✓	✓	96.9

Table 4. Internet Forest Fire Dataset Test Results.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
Improved AlexNet	95.8	94.7	92.9	94.3
AlexNet	83.0	81.5	87.8	85.1
VGG16	88.2	87.6	86.1	86.8
ResNet50	90.5	89.9	89.2	89.5
MobileNetV2	86.7	85.9	84.6	85.2
EfficientNetV2	93.2	92.0	90.8	91.4
Vision Transformer	92.5	91.3	90.1	90.7
ConvNeXt	93.6	92.4	91.2	91.8
