A Metric Learning-Based Improved Oriented R-CNN for Wildfire Detection in Power Transmission Corridors

Wang, Xiaole; Wang, Bo; Luo, Peng; Wang, Leixiong; Wu, Yurou

doi:10.3390/s25133882

by Xiaole Wang ¹, Bo Wang ¹,*, Peng Luo ¹, Leixiong Wang ² and Yurou Wu ²

¹ School of Electrical Engineering and Automation, Wuhan University, Wuhan 430072, China
² Wuhan Power Supply Company, State Grid Hubei Electric Power Company, Wuhan 430013, China
* Author to whom correspondence should be addressed.

Sensors 2025, 25(13), 3882; https://doi.org/10.3390/s25133882

Submission received: 10 May 2025 / Revised: 9 June 2025 / Accepted: 20 June 2025 / Published: 22 June 2025

(This article belongs to the Special Issue Object Detection and Recognition Based on Deep Learning)

Abstract

Wildfire detection in power transmission corridors is essential for providing timely warnings and ensuring the safe and stable operation of power lines. However, this task faces significant challenges due to the large number of smoke-like samples in the background, the complex and diverse target morphologies, and the difficulty of detecting small-scale smoke and flame objects. To address these issues, this paper proposed an improved Oriented R-CNN model enhanced with metric learning for wildfire detection in power transmission corridors. Specifically, a multi-center metric loss (MCM-Loss) module based on metric learning was introduced to enhance the model’s ability to differentiate features of similar targets, thereby improving the recognition accuracy in the presence of interference. Experimental results showed that the introduction of the MCM-Loss module increased the average precision (AP) for smoke targets by 2.7%. In addition, the group convolution-based network ResNeXt was adopted to replace the original backbone network ResNet, broadening the channel dimensions of the feature extraction network and enhancing the model’s capability to detect flame and smoke targets with diverse morphologies. This substitution led to a 0.6% improvement in mean average precision (mAP). Furthermore, an FPN-CARAFE module was designed by incorporating the content-aware up-sampling operator CARAFE, which improved multi-scale feature representation and significantly boosted performance in detecting small targets. In particular, the proposed FPN-CARAFE module improved the AP for fire targets by 8.1%. Experimental results demonstrated that the proposed model achieved superior performance in wildfire detection within power transmission corridors, achieving a mAP of 90.4% on the test dataset—an improvement of 6.4% over the baseline model. 
Compared with other commonly used object detection algorithms, the model developed in this study exhibited improved detection performance on the test dataset, offering research support for wildfire monitoring in power transmission corridors.

Keywords: wildfire detection; transmission corridors; metric learning; group convolutions; content-aware up-sampling

1. Introduction

The safe and stable operation of the power system is fundamental to the development of the national economy. In recent years, due to increasingly dry climatic conditions and frequent wildfires, numerous transmission lines have been disrupted by flame and smoke, leading to aerial power line faults and posing serious threats to grid safety and stability [1,2,3]. Consequently, effective monitoring and prevention strategies are urgently needed to mitigate the adverse impacts of wildfires on the power industry [4,5,6]. Traditional monitoring approaches typically rely on manual inspections of wildfire-prone areas along transmission corridors; however, such methods are inefficient, labor-intensive, and incapable of providing continuous coverage across wide regions, often resulting in delayed responses and increased risk of wildfire-related incidents [7]. Therefore, developing real-time, online monitoring techniques for wildfire detection in transmission corridors is of great significance for safeguarding the secure and stable operation of the power grid.

Currently, the primary online monitoring methods for wildfires in power grids include satellite remote sensing [8,9], LiDAR scanning [10,11,12], and image/video-based surveillance [13]. Among them, satellite remote sensing is often susceptible to interference from clouds and non-wildfire thermal sources [14], while LiDAR systems can be significantly affected by adverse weather conditions such as heavy fog. These limitations hinder their reliability and applicability in real-time wildfire monitoring scenarios.

With the ongoing digitalization of power systems, the deployment of image and video surveillance equipment along transmission lines has become increasingly widespread. Wildfire detection technologies based on image and video data offer advantages, such as rapid response and low cost, making them highly attractive for practical applications. Wildfire image and video detection typically consists of two main tasks: flame detection and smoke detection.

In recent years, deep learning-based image and video monitoring techniques have advanced rapidly. According to their detection paradigms, these methods are broadly categorized into single-stage and two-stage detection algorithms. Single-stage algorithms, such as YOLO and SSD, are known for their high detection speed. For instance, reference [15] proposed a lightweight YOLOv5 variant by simplifying the neck network structure to enable real-time wildfire monitoring in power transmission scenarios. However, single-stage methods often suffer from relatively lower detection accuracy.

In contrast, two-stage detection algorithms typically operate in two sequential steps: the first stage generates a set of candidate object proposals based on extracted features, while the second stage performs fine-grained classification and bounding box regression [16]. Although less efficient, these methods generally achieve higher accuracy and better generalization, making them more suitable for tasks like flame and smoke detection that require precise recognition.

Reference [17] embedded a weakly supervised learning strategy into the Faster R-CNN framework to improve the model’s ability to extract features of flames and smoke. Reference [18] introduced a parallel attention module (PAM) into the region proposal network (RPN), where the combined use of channel and spatial attention mechanisms significantly improved flame recognition accuracy. Moreover, Oriented R-CNN, as proposed in [19], utilizes rotated bounding boxes to better capture irregular objects with arbitrary orientations and shapes, making it well suited for detecting smoke and flame patterns distorted by wind or other environmental factors. However, when applied directly, its recognition accuracy remains suboptimal, especially under complex background interference and feature ambiguity. In power transmission corridor scenarios, existing Oriented R-CNN models are prone to misidentifying smoke targets due to visually similar background interference and often fail to accurately detect small-scale wildfire instances.

In the task of wildfire detection within power transmission corridors, smoke targets often exhibit high visual similarity to background elements, such as clouds, water reflections, and distant blurred terrain. Since computer vision algorithms primarily rely on low-level visual cues—such as local texture, color, and transparency—while lacking high-level semantic understanding, they are particularly prone to feature confusion and misclassification when encountering smoke-like backgrounds [20,21,22]. Moreover, both smoke and flame targets exhibit highly variable and dynamic shapes throughout the combustion process, which further blurs their visual features and complicates the extraction of stable and discriminative representations. Additionally, the spatial scale of wildfire targets varies significantly [23,24]; in the early stages of a wildfire, both smoke and flame regions tend to occupy only small portions of the image, making feature extraction especially challenging. These factors collectively exacerbate the difficulty of accurate and robust wildfire detection in power transmission environments.

To address the above challenges, this paper proposed a wildfire detection method for power transmission corridors based on an improved Oriented R-CNN model enhanced with metric learning. The proposed approach integrated three key modules: a multi-feature center metric loss (MCM-Loss) module derived from metric learning principles, the group convolutional backbone ResNeXt [25], and the content-aware reassembly of features (CARAFE) up-sampling operator [26].

Taking the Oriented R-CNN model as the baseline, the major contributions and innovations of this paper are summarized as follows:

(i): A multi-feature center metric loss (MCM-Loss) module based on metric learning was proposed to enhance the model’s ability to distinguish between smoke and visually similar background samples. This effectively reduced false positives and missed detections and improved the overall recognition accuracy for smoke targets.
(ii): The original ResNet backbone in the Oriented R-CNN was replaced with the group convolution network ResNeXt, which expanded the channel capacity for feature extraction without increasing model complexity, thereby enhancing the model’s performance in detecting flames and smoke with complex shapes and varying morphologies.
(iii): An FPN-CARAFE structure was proposed by integrating the content-aware up-sampling operator CARAFE into the traditional feature pyramid network (FPN), which improved multi-scale feature fusion and preserved fine-grained information, leading to more accurate detection of small and irregular wildfire targets.

Together, these components enhanced the network’s ability to extract and represent discriminative features from complex wildfire imagery, thereby improving detection accuracy and robustness. The proposed method was trained and evaluated on a dedicated dataset of wildfire images in power transmission scenarios. Experimental results demonstrated that the model achieved superior performance in terms of accuracy, recall, and average precision in both flame and smoke detection tasks, compared to baseline and other commonly used detection algorithms.

2. The Metric Learning-Based Wildfire Detection Model for Power Transmission Corridors

2.1. A Framework for Wildfire Detection Based on Metric Learning

In wildfire detection tasks, flames and smoke often spread in different directions and at varying angles, making it difficult for horizontal bounding boxes to accurately localize targets. To address this issue, this paper adopted the Oriented R-CNN, a rotated bounding box detection algorithm, as the baseline model due to its suitability for capturing arbitrarily oriented targets. Building upon this framework, an improved Oriented R-CNN model incorporating metric learning was proposed for wildfire detection in power transmission corridors. The overall detection framework is illustrated in Figure 1.

2.2. Overview of the Oriented R-CNN Model

Oriented R-CNN is a two-stage rotated object detection model based on region proposals. Compared with other commonly used rotated bounding box detection algorithms, it adopts a midpoint offset representation to generate oriented proposals, which reduces computational redundancy and improves overall efficiency. The model primarily consists of five components: the input module, the backbone network, the FPN, the oriented region proposal network (oriented RPN), and the detection head. The overall network architecture is illustrated in Figure 2.

First, the images collected around the transmission lines are preprocessed and passed through the backbone network ResNet for feature extraction via convolutional operations. The extracted features are then fused through the FPN layer to generate multi-scale feature maps of flame and smoke targets. Subsequently, the up-sampled deep features are fused with the shallow, fine-grained features, integrating detailed information across scales. These fused features are fed into the oriented RPN, a lightweight, fully convolutional network with six regression parameters, to generate rotated proposal boxes. Meanwhile, the features are also passed to the detection head, which performs classification and refines the spatial positions of the proposals. Ultimately, the model outputs detection results that include the category, location, and confidence score of the flame and smoke targets.
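As an illustration of the six regression parameters used by the oriented RPN, the following minimal sketch decodes a midpoint offset tuple into the four vertices of an oriented proposal. It assumes the vertex formulation of the original Oriented R-CNN paper [19], in which (x, y, w, h) describe the external horizontal box and two offsets shift the midpoints of its top and right edges; the function and argument names are hypothetical and are not taken from the authors' code.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def decode_midpoint_offset(x: float, y: float, w: float, h: float,
                           d_alpha: float, d_beta: float) -> List[Point]:
    """Return the four vertices of an oriented proposal (a parallelogram).

    (x, y) is the center of the external horizontal box, w and h are its
    width and height, d_alpha is the horizontal offset of the top vertex
    from the midpoint of the top edge, and d_beta is the vertical offset
    of the right vertex from the midpoint of the right edge.
    """
    v1 = (x + d_alpha, y - h / 2)  # vertex on the top edge
    v2 = (x + w / 2, y + d_beta)   # vertex on the right edge
    v3 = (x - d_alpha, y + h / 2)  # bottom vertex, symmetric to v1 about the center
    v4 = (x - w / 2, y - d_beta)   # left vertex, symmetric to v2 about the center
    return [v1, v2, v3, v4]


vertices = decode_midpoint_offset(100.0, 80.0, 60.0, 40.0, 10.0, -5.0)
```

Because opposite vertices are mirrored about the box center, the decoded shape is always a parallelogram; with both offsets equal to zero it reduces to the diamond formed by the four edge midpoints.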

Compared with traditional detection methods, Oriented R-CNN employs rotated bounding boxes, making it more suitable for detecting irregular flame and smoke regions in the wild, which are often influenced by wind and other environmental factors. However, the task of wildfire detection remains challenging. Enhancing the accuracy of wildfire recognition by improving the Oriented R-CNN model is therefore the primary objective of this study.

2.3. Multi-Feature Center Metric Loss (MCM-Loss) Module Based on Metric Learning

Metric learning is a method based on distance measurement, designed to capture the similarities or differences between samples. It aims to reduce the distance between samples of the same category while increasing the distance between samples from different categories [27,28,29]. The principle of metric learning is illustrated in Figure 3.

Wildfire images are frequently affected by severe background interference, particularly because the smoke category contains a large number of visually similar samples. Inspired by the principle of metric learning—which aims to maximize inter-class distances while minimizing intra-class distances to effectively distinguish between easily confused target categories—this paper designed a multi-feature center metric loss (MCM-Loss) module based on metric learning.

Traditional metric learning methods typically rely on a single feature center for each category [30]. However, the feature distribution of real-world smoke data is often complex and highly variable. Smoke targets exhibit strong nonlinearity and dynamic characteristics—their diffusion is influenced by factors such as wind direction, temperature, and humidity, leading to diverse distribution patterns across different temporal and spatial scales. To address this, the MCM-Loss module proposed in this paper assigned multiple initialized feature centers to each category, enabling a more accurate representation of intra-class feature variations and improving the model’s adaptability to smoke detection tasks involving highly variable target morphologies.

Specifically, the initial prototype feature center library is defined as:

$$P = \{p_{c,j}\}_{c=1,\,j=1}^{C+1,\,K} \tag{1}$$

where $P$ is the set of proxy feature centers of all categories; $p_{c,j}$ is the $j$th feature center of category $c$; $C$ is the total number of categories (class-num); and $K$ is the number of feature centers per category (num-proxies), which was set to 3 in the experiments in this paper.

These feature centers are defined as trainable parameters:

$$P \in \mathbb{R}^{(C+1) \times K \times d} \tag{2}$$

where $d$ is the feature dimension, and the additional $(C+1)$th category holds the feature centers of the background class.

At the beginning of training, the feature centers were initialized with random values drawn from a Gaussian distribution using the Kaiming normal initialization method, which provided a well-distributed starting point and facilitated effective convergence in the early stages of training.

To dynamically update the feature centers and better adapt to the evolving sample distribution, this paper adopted the exponential moving average (EMA) strategy for center updates:

$$p_{c,j}^{(t+1)} = \frac{0.9\, p_{c,j}^{(t)} + 0.1\, \bar{x}_{c}}{\left\| 0.9\, p_{c,j}^{(t)} + 0.1\, \bar{x}_{c} \right\|} \tag{3}$$

The mean feature $\bar{x}_{c}$ of the samples of category $c$ is defined in Equation (4):

$$\bar{x}_{c} = \frac{1}{N_{c}} \sum_{i \in S_{c}} x_{i} \tag{4}$$

where $S_{c}$ is the set of all samples of category $c$ in a batch and $N_{c}$ is the number of samples in $S_{c}$. With this update, each feature center gradually approaches the actual data distribution without being dominated by the drastic fluctuations of a single batch of data.
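The update in Equations (3) and (4) can be sketched in a few lines. This is a minimal, framework-free illustration rather than the authors' implementation: features are plain Python lists, the helper names are hypothetical, and the momentum of 0.9 is taken from Equation (3).

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def batch_mean(features: List[List[float]]) -> List[float]:
    """Mean feature of the samples of one category in a batch, Equation (4)."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[k] for f in features) / n for k in range(dim)]


def ema_update(center: List[float], class_mean: List[float],
               momentum: float = 0.9) -> List[float]:
    """Blend the old center with the batch mean, then renormalize, Equation (3)."""
    blended = [momentum * p + (1.0 - momentum) * x
               for p, x in zip(center, class_mean)]
    return l2_normalize(blended)


center = [1.0, 0.0]
batch_of_class_c = [[0.0, 1.0], [0.0, 3.0]]
updated = ema_update(center, batch_mean(batch_of_class_c))
```

In this toy case the center moves only slightly toward the batch mean and stays on the unit sphere, which is the behavior the momentum term is meant to provide.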

The basis of metric learning is to represent the distance between two samples. This paper used cosine similarity as the distance measurement method between the sample and feature center, as shown in Equation (5):

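$$s_{i,j} = \cos(x_{i}, p_{j}) = \frac{x_{i} \cdot p_{j}}{\|x_{i}\| \, \|p_{j}\|} \tag{5}$$

where $s_{i,j}$ is the cosine similarity between the sample $x_{i}$ and the feature center $p_{j}$, with a value range of [−1, 1]. In the equations that follow, $s_{i,c,j}$ denotes the similarity between $x_{i}$ and $p_{c,j}$, the $j$th feature center of category $c$.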

Introducing metric learning requires determining positive and negative samples. For a sample $x_{i}$ of category $y$, all feature centers of category $y$ form the positive set $P_{i}$, while the feature centers of all other categories form the negative set $N_{i}$:

$$P_{i} = \{p_{y,j}\}_{j=1}^{K} \tag{6}$$

$$N_{i} = \{p_{z,j}\}_{z \neq y,\; j=1}^{K} \tag{7}$$

When calculating the metric learning loss function, the positive sample loss $L_{pos}$ was weighted by the similarity between the sample $x_{i}$ and all positive sample feature centers, and the negative sample loss $L_{neg}$ was weighted by the similarity between the sample $x_{i}$ and all negative sample feature centers:

$$L_{pos} = \frac{1}{N} \sum_{i} \log\left(1 + \sum_{j=1}^{K} e^{-\alpha (s_{i,y,j} - m)}\right) \tag{8}$$

$$L_{neg} = \frac{1}{N} \sum_{i} \log\left(1 + \sum_{z \neq y,\; j=1}^{K} e^{\alpha (s_{i,z,j} + m)}\right) \tag{9}$$

where $\alpha$ controls the loss magnification, and $m$ is the boundary hyperparameter (margin), which is used to increase the discrimination. In our implementation, the scaling factor $\alpha$ was set to 32, and the margin $m$ was set to 0, based on empirical tuning on the validation set.
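To make Equations (5), (8) and (9) concrete, the sketch below computes the positive and negative losses for a small batch. It is an illustration under stated assumptions, not the authors' code: features and centers are plain lists, `centers[c][j]` is a hypothetical layout for the $j$th center of category $c$, and the toy example uses two categories with two centers each instead of the $K = 3$ centers and extra background category described above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors, Equation (5)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pos_neg_loss(samples, labels, centers, alpha=32.0, margin=0.0):
    """Return (L_pos, L_neg) averaged over the batch, Equations (8) and (9)."""
    l_pos = 0.0
    l_neg = 0.0
    for x, y in zip(samples, labels):
        # Sum over the K centers of the sample's own category.
        pos_sum = sum(math.exp(-alpha * (cosine(x, p) - margin))
                      for p in centers[y])
        # Sum over every center of every other category.
        neg_sum = sum(math.exp(alpha * (cosine(x, p) + margin))
                      for z, group in enumerate(centers) if z != y
                      for p in group)
        l_pos += math.log1p(pos_sum)
        l_neg += math.log1p(neg_sum)
    n = len(samples)
    return l_pos / n, l_neg / n


centers = [
    [[1.0, 0.0], [0.9, 0.1]],  # centers of category 0
    [[0.0, 1.0], [0.1, 0.9]],  # centers of category 1
]
correct = pos_neg_loss([[1.0, 0.05]], [0], centers)
swapped = pos_neg_loss([[1.0, 0.05]], [1], centers)
```

A sample that lies close to its own centers yields a much smaller pair of losses than the same sample given the wrong label, which is the pull-together and push-apart behavior the two terms encode.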

To enhance the model’s discriminative capability, we assigned greater weight to additional hard positive and hard negative samples during training. This strategy was motivated by the fact that these difficult samples were more prone to misclassification and thus required greater attention during learning. By explicitly focusing on these samples, the model can better learn to distinguish fine-grained differences.

For a given sample $x_{i}$, the hard positive was defined as the positive sample with the lowest cosine similarity to it, while the hard negative was the negative sample with the highest cosine similarity:

$$s_{i,y}^{\min} = \min_{j} s_{i,y,j} \tag{10}$$

$$s_{i,z}^{\max} = \max_{z \neq y,\; j} s_{i,z,j} \tag{11}$$

For hard positive and hard negative samples, the metric learning loss values were calculated separately. The computation followed the same principles as described in Equations (8) and (9).

$$L_{hard\text{-}pos} = \frac{1}{N} \sum_{i} \log\left(1 + e^{-\alpha (s_{i,y}^{\min} - m)}\right) \tag{12}$$

$$L_{hard\text{-}neg} = \frac{1}{N} \sum_{i} \log\left(1 + e^{\alpha (s_{i,z}^{\max} + m)}\right) \tag{13}$$

The final loss was composed of four components: the positive sample loss, the negative sample loss, the hard positive sample loss, and the hard negative sample loss. The weight coefficients for the hard positive and hard negative sample losses were both set to 2, with the goal of enhancing the model’s focus on difficult samples and guiding the network to pay more attention to those that are challenging to distinguish:

$$L = (L_{pos} + L_{neg}) + 2\,(L_{hard\text{-}pos} + L_{hard\text{-}neg}) \tag{14}$$
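The combination of the four terms in Equation (14) can be sketched as follows. The function takes precomputed cosine similarities as input; its name and argument layout are hypothetical, and the defaults reuse the values reported above (scaling factor 32, margin 0, hard-sample weight 2).

```python
import math


def mcm_loss(pos_sims, neg_sims, alpha=32.0, margin=0.0, hard_weight=2.0):
    """Total loss of Equation (14) for a batch.

    pos_sims[i] lists the similarities between sample i and the centers of
    its own category; neg_sims[i] lists its similarities to all other centers.
    """
    l_pos = l_neg = l_hard_pos = l_hard_neg = 0.0
    for sp, sn in zip(pos_sims, neg_sims):
        l_pos += math.log1p(sum(math.exp(-alpha * (s - margin)) for s in sp))
        l_neg += math.log1p(sum(math.exp(alpha * (s + margin)) for s in sn))
        # Hard positive: the least similar positive center, Equations (10), (12).
        l_hard_pos += math.log1p(math.exp(-alpha * (min(sp) - margin)))
        # Hard negative: the most similar negative center, Equations (11), (13).
        l_hard_neg += math.log1p(math.exp(alpha * (max(sn) + margin)))
    n = len(pos_sims)
    return (l_pos + l_neg + hard_weight * (l_hard_pos + l_hard_neg)) / n


easy = mcm_loss([[0.90, 0.80, 0.95]], [[-0.5, -0.2, -0.7]])
hard = mcm_loss([[0.90, 0.10, 0.95]], [[-0.5, 0.6, -0.7]])
```

In the second call one positive center is far from the sample and one negative center is close to it, so the hard-sample terms dominate and the total loss is much larger than in the first call.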

Based on this, the working principle of the multi-feature center metric loss (MCM-Loss) module based on metric learning is illustrated in Figure 4.

As shown in the figure, the multi-feature center metric loss (MCM-Loss) module updated the model’s weight parameters through the metric learning loss function, enhancing the model’s ability to distinguish between visually similar samples. This improved the recognition accuracy for easily confused targets, thereby enhancing overall detection performance in the wildfire detection task for power transmission corridors.

2.4. Group Convolutional Structure ResNeXt Replaces the Backbone Network

When processing flame and smoke targets, traditional convolution methods often result in the mixing of extracted information, such as color and texture, due to the lack of specificity among different feature channels during the feature extraction process. Moreover, flames typically exhibit localized high-brightness regions, while smoke is characterized by large-scale diffusion. A single feature extraction and fusion approach struggles to capture these distinct scales and forms simultaneously, thereby compromising detection performance.

To expand the channel domain of the feature extraction network without increasing the number of model parameters and to enhance the feature extraction capability for detecting flames and smoke in power transmission lines, this paper adopted the ResNeXt network, which is based on the concept of group convolution. ResNeXt is a significant improvement over the ResNet architecture. While retaining the advantages of ResNet’s residual connections, it successfully incorporates the multi-branch design concept from the Inception series [31], introducing a more innovative grouping strategy that offers strong architectural flexibility and superior performance. The block structure comparison between ResNet and ResNeXt is illustrated in Figure 5.

From an overall perspective, the basic unit (block) of ResNeXt is similar to that of ResNet, retaining the shortcut residual connection to alleviate the gradient vanishing problem commonly encountered in deep networks. However, by widening and grouping the network, ResNeXt achieves more efficient feature representation capabilities. In practical applications, the topological structure within each group is typically consistent, which makes the network more modular and reusable. This not only simplifies model construction but also significantly reduces the effort required for hyperparameter tuning. Group convolution divides the channels into multiple subgroups, with each group performing convolution operations independently. This effectively achieves feature decoupling and parallel learning. Specifically, this mechanism enables the network to learn features across different dimensions—such as color, texture, and shape—in various channel domains and then fuse the outputs from each group. Different groups can focus on extracting distinct types of low-level features that differentiate flames and smoke—for example, some groups capture brightness and sharp edges typical of flames, while others learn the more diffuse texture and lower contrast patterns characteristic of smoke. As a result, the model effectively reduces feature confusion between flame and smoke targets. This design allows for more efficient feature extraction without increasing the number of model parameters, thereby improving overall model performance.
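The statement that grouping widens the network at roughly constant parameter cost can be checked with simple arithmetic. The sketch below counts convolution weights (biases and normalization layers ignored) for a standard ResNet bottleneck block and for a ResNeXt block with 32 groups of 4 channels. The 32×4d setting is an assumption borrowed from the commonly used ResNeXt-50 configuration, since the variant used in this paper is not stated here.

```python
def conv_params(c_in: int, c_out: int, kernel: int, groups: int = 1) -> int:
    """Number of weights in a 2D convolution without bias.

    With grouping, each output channel only sees c_in / groups input channels.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return kernel * kernel * (c_in // groups) * c_out


# ResNet bottleneck: 256 -> 64 (1x1), 64 -> 64 (3x3), 64 -> 256 (1x1)
resnet_block = (conv_params(256, 64, 1)
                + conv_params(64, 64, 3)
                + conv_params(64, 256, 1))

# ResNeXt block (assumed 32 groups x 4 channels): 256 -> 128, grouped 3x3, 128 -> 256
resnext_block = (conv_params(256, 128, 1)
                 + conv_params(128, 128, 3, groups=32)
                 + conv_params(128, 256, 1))

print(resnet_block, resnext_block)  # 69632 70144
```

The 3×3 stage is twice as wide in the grouped block (128 channels instead of 64), while the weight count differs by less than 1%.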

In summary, in the baseline model Oriented R-CNN, the ResNeXt network was employed to replace the original ResNet backbone. This not only reduced the computational complexity of the model but also enhanced its adaptability to the complex variations in flame and smoke targets, ultimately improving detection performance.

2.5. Introducing the CARAFE Operator to Construct the FPN-CARAFE Layer Structure

In wildfire detection tasks, images typically contain objects of varying scales, particularly in the early stages of a fire, when small-scale flames and smoke targets are prevalent. Traditional feature up-sampling operators in the feature pyramid network (FPN) layer, such as nearest neighbor interpolation and bilinear interpolation [32], are commonly used. While these operators can handle multi-scale features, they often result in the loss of fine-grained details during up-sampling. This is especially problematic when detecting small and blurred targets.

To address the decrease in detection accuracy caused by the loss of feature information during up-sampling in traditional FPN layers, especially for small targets, this paper adopted the content-aware up-sampling operator CARAFE (Content-Aware ReAssembly of FEatures). Unlike traditional interpolation methods, CARAFE preserves details more effectively by using content-aware weight allocation during the up-sampling process. Specifically, CARAFE determines how to fill in new pixel values based on the feature content at each position, rather than relying solely on the spatial position of neighboring pixels. This approach better preserves the fine-grained details of the feature map during up-sampling, avoids the “distortion” issues commonly seen with traditional up-sampling methods, and is more suitable for small target detection tasks. The working principle of CARAFE is illustrated in Figure 6.

The workflow of the CARAFE operator consists of two main stages. In the first stage, the feature map $\chi$, which contains the target position information, is channel-compressed by the channel compressor. The compressed feature is passed through a lightweight content encoder that predicts a unique reassembly kernel (up-sampling kernel) for each spatial location. These kernels represent weights over the local neighborhood pixels around the target up-sampled position. The kernel normalizer then applies softmax normalization to these weights, forming a probability distribution that dynamically emphasizes important details, such as edges and small targets like smoke and flames, while suppressing less relevant areas. In the second stage, based on the up-sampling rate $\sigma$, the predicted reassembly kernels are used to reassemble the original feature map of size $C \times H \times W$, producing a new feature map $\chi'$ of size $C \times \sigma H \times \sigma W$, thus achieving up-sampling.

Specifically, we adopted a CARAFE configuration with an up-sampling kernel size of 5, an encoder kernel size of 3, an encoder dilation of 1, and a group number of 1. These settings followed the recommendations in [26] and were empirically validated in our experiments as a good trade-off between accuracy and computational efficiency.
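The second stage of CARAFE, in which softmax-normalized kernels reassemble the feature map, can be sketched for a single channel as follows. This is a simplified illustration: the kernel scores are passed in as inputs rather than predicted by a learned content encoder, positions outside the map are treated as zeros, and the function names are hypothetical. The defaults follow the up-sampling kernel size of 5 given above; the up-sampling rate of 2 is an assumption, being the usual factor between adjacent FPN levels.

```python
import math


def softmax(logits):
    """Normalize raw kernel scores into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def carafe_reassemble(feature, kernel_logits, sigma=2, k_up=5):
    """Up-sample one H x W channel to (sigma*H) x (sigma*W).

    kernel_logits[i][j] holds the k_up*k_up raw scores for output
    position (i, j), in row-major order over the neighborhood.
    """
    h, w = len(feature), len(feature[0])
    r = k_up // 2
    out = [[0.0] * (sigma * w) for _ in range(sigma * h)]
    for i in range(sigma * h):
        for j in range(sigma * w):
            weights = softmax(kernel_logits[i][j])  # kernel normalizer
            ci, cj = i // sigma, j // sigma          # source location
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = ci + di, cj + dj
                    if 0 <= y < h and 0 <= x < w:    # zero padding outside
                        acc += weights[(di + r) * k_up + (dj + r)] * feature[y][x]
            out[i][j] = acc
    return out


# A kernel that puts nearly all weight on the center reproduces
# nearest-neighbor up-sampling.
peaked = [0.0] * 25
peaked[12] = 50.0
logits = [[peaked for _ in range(4)] for _ in range(4)]
up = carafe_reassemble([[1.0, 2.0], [3.0, 4.0]], logits)
```

Nearest-neighbor interpolation is thus one special case of the operator; a learned encoder can instead spread the weights according to local content, which is what allows fine details to survive up-sampling.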

In summary, the CARAFE operator was introduced as the up-sampling method in the model’s feature pyramid network to construct the FPN-CARAFE layer structure. Through its content-aware up-sampling mechanism, CARAFE more effectively preserved the fine-grained feature information of small targets, enhancing the model’s detection accuracy for small flame and smoke targets.

3. Experimental Results and Analysis

3.1. Dataset Creation

The dataset used in this study was a self-constructed wildfire dataset. The original images were primarily captured on-site by an affiliated maintenance company, supplemented by a small number of publicly available images from the internet, resulting in a total of 1894 wildfire images. However, due to the limited data sources and insufficient quantity, the dataset exhibited significant category imbalance, which would likely lead to poor model performance if used directly for training.

To address this issue, data augmentation techniques such as cropping, rotation, and resizing were employed to expand the dataset and improve model robustness. Specifically, a sliding window cropping strategy with a window size of 1024 × 1024 and 50% overlap was used. Only image patches containing valid rotated bounding boxes were retained, and the angle information was preserved. Additionally, some image patches were randomly augmented using rotation within ±15°, resizing, and horizontal flipping. After augmentation, the final dataset contained 22,999 images, which were used for the experiments in this paper. Example images from the dataset are shown in Figure 7.
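The sliding window cropping described above can be sketched as a computation of window coordinates. The window size of 1024 and the 50% overlap come from the text; how the image border is handled is not specified, so the sketch assumes that a final window is aligned to the border when the regular stride does not reach it. Filtering out patches without valid rotated boxes is omitted, and the function names are hypothetical.

```python
def window_origins(length: int, window: int = 1024, overlap: float = 0.5):
    """Start offsets of the sliding windows along one image axis."""
    stride = int(window * (1 - overlap))
    if length <= window:
        # Image smaller than the window: a single crop (padding would be needed).
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] + window < length:
        # Assumed border policy: add one last window flush with the edge.
        starts.append(length - window)
    return starts


def crop_boxes(width: int, height: int, window: int = 1024, overlap: float = 0.5):
    """All crop rectangles as (left, top, right, bottom) tuples."""
    return [(x, y, x + window, y + window)
            for y in window_origins(height, window, overlap)
            for x in window_origins(width, window, overlap)]


print(window_origins(2500))  # [0, 512, 1024, 1476]
```

With 50% overlap the stride is 512 pixels, so a target cut by one window boundary usually appears whole in a neighboring window.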

Labeling software was used to generate the corresponding XML annotation file for each image, in which the wildfire targets were categorized into two classes: “shanhuo” (flame) and “yanwu” (smoke). For consistency with the original annotation format, these Pinyin labels were retained in the model outputs and evaluation results. The dataset was then randomly divided into training, validation, and test sets in a ratio of 8:1:1, comprising 18,399, 2300, and 2300 images, respectively. To ensure reliable evaluation of the model’s robustness, the three sets were strictly separated with no image overlap. The composition of the dataset is shown in Table 1.
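The 8:1:1 split and the reported set sizes can be reproduced with a short sketch. The rounding rule and the fixed seed are assumptions made for illustration; only the ratio and the total of 22,999 images come from the text.

```python
import random


def split_indices(n: int, ratios=(0.8, 0.1, 0.1), seed: int = 0):
    """Shuffle n image indices and cut them into train, val and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = round(n * ratios[1])
    n_test = round(n * ratios[2])
    n_train = n - n_val - n_test  # the remainder goes to training
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(22999)
print(len(train), len(val), len(test))  # 18399 2300 2300
```

Because the three lists are slices of one shuffled permutation, no index can appear in more than one set.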

3.2. Experimental Platform and Training Parameters

The network model for detecting wildfires in power transmission corridors was trained on a high-performance computing platform. The detailed configuration of the experimental training environment is presented in Table 2.

The training hyperparameters for the model in this study were configured as follows: The total number of training epochs was set to 50, with a batch size of 4. The optimizer used was stochastic gradient descent (SGD), with an initial learning rate of 0.005, a momentum of 0.9, and a weight decay coefficient of 0.0001. The learning rate adjustment followed the step schedule, incorporating a linear warm-up strategy. Specifically, the warm-up lasted for 500 iterations, with a warm-up ratio of 1/3. Learning rate decay was applied at the 46th and 49th epochs. Gradient clipping was employed using the maximum norm method, with the maximum norm set to 35 and the norm type defined as the L2 norm.
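The learning rate schedule described above can be written as a small function. The base rate, warm-up length, warm-up ratio, and decay epochs come from the text. The decay factor of 0.1, the 0-based epoch indexing, and the exact form of the linear warm-up (starting at one third of the base rate) are assumptions that mirror common detection toolkits.

```python
def learning_rate(epoch: int, iteration: int, base_lr: float = 0.005,
                  warmup_iters: int = 500, warmup_ratio: float = 1 / 3,
                  decay_epochs=(46, 49), gamma: float = 0.1) -> float:
    """Learning rate at a given 0-based epoch and 0-based global iteration."""
    # Step schedule: multiply by gamma once for every decay epoch already reached.
    lr = base_lr * gamma ** sum(epoch >= e for e in decay_epochs)
    if iteration < warmup_iters:
        # Linear warm-up from warmup_ratio * lr up to lr.
        k = (1 - iteration / warmup_iters) * (1 - warmup_ratio)
        lr *= 1 - k
    return lr


for step in (0, 250, 500):
    print(step, learning_rate(0, step))
```

Under these assumptions the rate ramps from about 0.00167 to 0.005 over the first 500 iterations, stays at 0.005 until epoch 46, and is then reduced twice before training ends at epoch 50.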

3.3. Evaluation Indicators and Model Training

To objectively evaluate the performance of the improved model proposed in this study for wildfire target detection in power transmission lines, several evaluation metrics were adopted, including precision ($P$), recall ($R$), average precision ($AP$), mean average precision ($mAP$), and the number of network parameters. These metrics served as the basis for performance assessment.

The average precision (AP) measured the area under the precision–recall (P-R) curve for a single class and reflected the trade-off between precision and recall. It was calculated as the integral of precision with respect to recall. The mean average precision (mAP) was the mean value of APs across all object categories.

The calculation formulas for the evaluation indicators are as follows:

$$P = \frac{N_{TP}}{N_{TP} + N_{FP}} \times 100\% \tag{15}$$

$$R = \frac{N_{TP}}{N_{TP} + N_{FN}} \times 100\% \tag{16}$$

$$AP = \int_{0}^{1} p(r)\, dr \tag{17}$$

$$mAP = \frac{1}{m} \sum_{i=1}^{m} AP_{i} \tag{18}$$

where $N_{TP}$ is the number of correct detections (predicted as positive and actually positive); $N_{FP}$ is the number of false detections (predicted as positive but actually negative); $N_{FN}$ is the number of missed detections (predicted as negative but actually positive); $N_{TP} + N_{FP}$ is the total number of samples judged as positive; $N_{TP} + N_{FN}$ is the total number of positive samples; and $AP_{i}$ is the average precision of the $i$th category. To compute it, the $P$-$R$ curve of the $i$th category is drawn, where $p(r)$ is the precision at recall $r$ on that curve, and the curve is integrated as in Equation (17). In Equation (18), $m$ is the number of target categories to be detected.
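The metrics defined above can be computed as in the sketch below. The integral over the P-R curve is approximated by a rectangle sum over ranked detections without interpolation; detection toolkits often use an interpolated variant, so exact values may differ from theirs. The function names and the toy detections are hypothetical.

```python
def precision_recall(n_tp: int, n_fp: int, n_fn: int):
    """Precision and recall in percent from detection counts."""
    precision = 100.0 * n_tp / (n_tp + n_fp)
    recall = 100.0 * n_tp / (n_tp + n_fn)
    return precision, recall


def average_precision(scores, is_tp, n_gt: int) -> float:
    """Area under the P-R curve of one category, as a fraction in [0, 1].

    scores: confidence of each detection; is_tp: whether it matched a
    ground-truth object; n_gt: number of ground-truth objects.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recall = tp / n_gt
        precision = tp / (tp + fp)
        ap += precision * (recall - prev_recall)  # rectangle rule
        prev_recall = recall
    return ap


def mean_average_precision(aps) -> float:
    """Mean of the per-category AP values."""
    return sum(aps) / len(aps)


ap_example = average_precision([0.9, 0.8, 0.7, 0.6],
                               [True, False, True, True], n_gt=4)
```

In the toy example three of four ground-truth objects are found, with one false detection ranked second, so the AP is 0.25 + 1/6 + 0.1875, or about 0.604.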

3.4. Ablation Experiment

To verify the effectiveness of the three optimization modules proposed in this study and to evaluate their impact on the wildfire detection task in power transmission corridors, ablation experiments were conducted using the custom-built dataset. Starting from the baseline model Oriented R-CNN, the multi-feature center metric loss (MCM-Loss) module, the group convolutional network ResNeXt, and the FPN-CARAFE layer structure with content-aware up-sampling were added incrementally. In total, four experimental configurations (the baseline and Experiments 1 to 3) were tested by progressively integrating the optimization components. The detection performance of each model variant was evaluated, and the results are presented in Table 3.

As shown in Table 3, each of the proposed improvements raised the recognition accuracy over the baseline Oriented R-CNN model to a different degree. In Experiment 1, introducing the multi-feature center metric loss (MCM-Loss) module increased the mAP by 1.7 percentage points (from 84.0% to 85.7%). This module strengthened the model's ability to distinguish smoke features through a metric learning mechanism, raising the AP of smoke targets by 2.5 percentage points (from 87.2% to 89.7%), reducing false and missed detections, and improving the overall detection performance.

Experiment 2 built upon Experiment 1 by replacing the original ResNet backbone with the group convolutional ResNeXt network. With no increase in the reported parameter count (42.79 million in both cases), the mAP improved by 0.6 percentage points (from 85.7% to 86.3%). This suggests that the group convolution structure of ResNeXt enabled the model to capture different kinds of features (such as shape, color, and texture) in different channel groups, thereby enhancing its detection capability for flame and smoke targets with diverse morphologies.
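The observation that the backbone swap barely changes the parameter count follows from how group convolution scales: splitting a convolution into g groups divides its weight count by g, which leaves room for wider layers. The sketch below counts the weights of a single bottleneck block. The 256-channel block and the 32 × 4d ResNeXt setting are taken from the standard configurations in the ResNeXt paper and are assumptions here, since the cardinality used in this study is not restated in this section; biases and normalization layers are ignored.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 2-D convolution without bias."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kernel * kernel


def bottleneck_params(channels, width, groups=1):
    """1x1 reduce -> 3x3 (optionally grouped) -> 1x1 expand."""
    return (
        conv_params(channels, width, 1)
        + conv_params(width, width, 3, groups)
        + conv_params(width, channels, 1)
    )


# Assumed configurations: ResNet bottleneck (width 64, no grouping) and
# ResNeXt 32x4d bottleneck (width 128 split into 32 groups).
resnet_block = bottleneck_params(256, 64)
resnext_block = bottleneck_params(256, 128, groups=32)
print(resnet_block, resnext_block)
```

Under these assumptions the two blocks have 69,632 and 70,144 weights, respectively, so the grouped block is twice as wide in its 3 × 3 layer at almost the same cost.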

In Experiment 3, the FPN-CARAFE layer structure incorporating the content-aware up-sampling operator CARAFE was introduced. CARAFE adaptively up-sampled features based on local content, allowing the model to better retain fine-grained information about small-scale flames and smoke in the early stages of a wildfire. This enhancement raised the mAP by 4.1 percentage points (from 86.3% to 90.4%) and the AP of flame targets by 8.1 percentage points (from 82.5% to 90.6%), highlighting the model's stronger capability in small object detection.
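To illustrate what "adaptive up-sampling based on local content" means, the following is a simplified single-channel sketch of the CARAFE reassembly step in plain Python. In the real operator the per-location kernel scores are predicted from the feature map by a small learned sub-network; here they are passed in as an argument, and the function names, the 3 × 3 kernel size, and the zero padding at the border are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Normalize raw kernel scores so the reassembly weights sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def carafe_reassemble(feature, kernel_scores, scale=2, k_up=3):
    """Up-sample a 2-D feature map with one kernel per output location.

    feature: H x W nested list; kernel_scores: (H*scale) x (W*scale) nested
    list whose entries are lists of k_up*k_up raw scores (hypothetical input).
    """
    h, w = len(feature), len(feature[0])
    r = k_up // 2
    out = [[0.0] * (w * scale) for _ in range(h * scale)]
    for y in range(h * scale):
        for x in range(w * scale):
            sy, sx = y // scale, x // scale  # source location of this output
            weights = softmax(kernel_scores[y][x])
            acc = 0.0
            for i in range(k_up):
                for j in range(k_up):
                    yy, xx = sy + i - r, sx + j - r
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding outside
                        acc += weights[i * k_up + j] * feature[yy][xx]
            out[y][x] = acc
    return out


# A kernel that puts all weight on the centre reduces to nearest-neighbour
# up-sampling, which shows that fixed interpolation is a special case.
centre_only = [-1e9] * 9
centre_only[4] = 0.0
feature = [[1.0, 2.0], [3.0, 4.0]]
scores = [[centre_only for _ in range(4)] for _ in range(4)]
upsampled = carafe_reassemble(feature, scores)
```

Because each output location receives its own kernel, locations around a small flame can weight their neighbourhood differently from locations in flat background, which is the property the paragraph above relies on.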

To more intuitively demonstrate the detection performance of the improved model proposed in this paper, several test set samples were selected to compare the baseline Oriented R-CNN model with the improved version in real detection tasks, as shown in Figure 8. Subfigures (a–d) present the detection results of the baseline Oriented R-CNN model, while (a′–d′) show the detection results of the improved model.

By comparing (a) and (b) with (a′) and (b′) in Figure 8, it can be observed that the baseline model tended to misidentify background areas—such as lake reflections—as smoke targets. After introducing the multi-feature center metric loss (MCM-Loss) module and replacing the backbone with the group convolutional ResNeXt network, the improved model effectively reduced such false detections and significantly enhanced its ability to distinguish between easily confused samples.

Furthermore, from the comparison of (c) and (d) with (c′) and (d′), it can be seen that the baseline model suffered from missed detections of small-scale flame and smoke targets. By integrating the content-aware up-sampling operator CARAFE into the FPN layer, the improved model significantly boosted its capability in detecting small objects, enabling it to successfully identify most low-resolution targets in the images.

In conclusion, these experimental results further validated the effectiveness of the three proposed enhancements—MCM-Loss, ResNeXt, and FPN-CARAFE—within the Oriented R-CNN framework, demonstrating notable improvements in detection accuracy for wildfire recognition tasks in power transmission corridors.
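The incremental gains quoted for the three experiments can be re-derived from the AP and mAP columns of Table 3. The short script below does this arithmetic; the variable and function names are illustrative, and the numbers are copied from Table 3.

```python
# (configuration, flame AP/%, smoke AP/%, mAP/%) copied from Table 3
ABLATION = [
    ("Baseline", 80.8, 87.2, 84.0),
    ("Exp 1", 81.7, 89.7, 85.7),
    ("Exp 2", 82.5, 90.0, 86.3),
    ("Exp 3", 90.6, 90.2, 90.4),
]


def stepwise_gains(rows):
    """Gain of each configuration over the previous one, in percentage points."""
    gains = {}
    for prev, cur in zip(rows, rows[1:]):
        gains[cur[0]] = tuple(round(c - p, 1) for p, c in zip(prev[1:], cur[1:]))
    return gains


gains = stepwise_gains(ABLATION)
total_map_gain = round(ABLATION[-1][3] - ABLATION[0][3], 1)
for name, (flame, smoke, m_ap) in gains.items():
    print(f"{name}: flame +{flame}, smoke +{smoke}, mAP +{m_ap}")
print(f"Total mAP gain over baseline: +{total_map_gain}")
```

The per-step mAP gains of 1.7, 0.6, and 4.1 percentage points sum to the overall improvement of 6.4 percentage points over the baseline.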

3.5. Comparative Experiment

To evaluate the proposed model more comprehensively, we compared it not only with the baseline Oriented R-CNN model but also with several mainstream detection algorithms. These included the single-stage detector YOLOv8-l and several high-performing rotated object detection algorithms from recent years: ReDet, RoI Transformer, Rotated Faster R-CNN, and Gliding Vertex. In addition, we included a model that combines the recent Swin Transformer backbone with the RoI Transformer detection head (denoted as Swin-T RoI Transformer). All experiments were conducted on the same self-constructed dataset. The detection results are presented in Table 4.

The experimental results showed that the proposed model achieved the best performance in the wildfire detection task, with the highest precision (95.8%) and recall (90.5%) among all compared models. In terms of mAP, the proposed model surpassed YOLOv8-l, ReDet, RoI Transformer, Rotated Faster R-CNN, Swin-T RoI Transformer, and Gliding Vertex by 14.4, 9.2, 9.1, 5.7, 5.6, and 5.2 percentage points, respectively. The visual detection results of several test samples are illustrated in Figure 9.
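The margins listed above are differences between the mAP of the proposed model and that of each compared model in Table 4. A small check, with illustrative names and the values copied from Table 4:

```python
# mAP/% values copied from Table 4
MAP_BY_MODEL = {
    "YOLOv8-l": 76.0,
    "ReDet": 81.2,
    "RoI Transformer": 81.3,
    "Rotated Faster R-CNN": 84.7,
    "Swin-T RoI Transformer": 84.8,
    "Gliding Vertex": 85.2,
}
OURS = 90.4

# Margin of the proposed model over each compared model, in percentage points
margins = {name: round(OURS - value, 1) for name, value in MAP_BY_MODEL.items()}
closest = min(margins, key=margins.get)
print(margins)
print("Closest competitor:", closest)
```

The smallest margin belongs to Gliding Vertex, which is therefore the strongest of the compared models on this dataset.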

Based on the comparative results, the performance of the different network models in wildfire detection was analyzed. YOLOv8-l, as a typical single-stage detector, lacks fine-grained feature alignment mechanisms such as RoIAlign, which makes it difficult to accurately identify small-scale, blurred, or translucent flame and smoke targets against complex backgrounds and puts it at a disadvantage in overall accuracy. In contrast, two-stage detection models such as ReDet, RoI Transformer, Rotated Faster R-CNN, Swin-T RoI Transformer, and Gliding Vertex have stronger region localization and feature extraction capabilities. In particular, with the rotated anchor mechanism, these models adapt more effectively to smoke and flame targets that have no fixed orientation. However, they still face limitations in power transmission corridors with complex backgrounds: they produce false detections on smoke-like regions such as reflective water surfaces and miss some real targets, which leads to lower precision, recall, and mAP than the improved model proposed in this paper.
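For readers unfamiliar with rotated detection, the sketch below shows one common way to describe an oriented box, as a centre, a size, and a rotation angle, and how its four corners follow from those values. This five-parameter form and the function name are assumptions chosen for illustration; the individual detectors compared here each use their own box encoding.

```python
import math


def oriented_box_corners(cx, cy, w, h, theta):
    """Corners of a box centred at (cx, cy), rotated by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    half_extents = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each corner offset, then translate it to the box centre.
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half_extents]


# theta = 0 gives an ordinary axis-aligned box
axis_aligned = oriented_box_corners(0.0, 0.0, 4.0, 2.0, 0.0)
# a quarter turn swaps the roles of width and height
quarter_turn = oriented_box_corners(0.0, 0.0, 4.0, 2.0, math.pi / 2)
```

An elongated, drifting smoke plume can be enclosed tightly by such a box, whereas an axis-aligned box around the same plume would include much more background.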

The improved Oriented R-CNN model introduced in this paper incorporated the multi-feature center metric loss (MCM-Loss) module, the group convolutional network ResNeXt, and the FPN-CARAFE layer structure with a content-aware up-sampling mechanism. These enhancements improved the model's ability to distinguish similar samples and strengthened feature learning for small targets, thereby improving its recognition of flames and smoke. The experimental results showed that, compared with the other algorithms, the proposed model achieved higher precision and recall in wildfire detection. It reduced the risk of missed and false detections and could accurately recognize and locate multi-type and multi-scale flame and smoke targets in complex environments.

4. Conclusions

To address the challenges in wildfire detection for power transmission lines, such as the presence of smoke-like samples, diverse target morphologies, and the difficulty of detecting small flame and smoke targets, this paper proposed a wildfire detection method for power transmission corridors based on an improved Oriented R-CNN with metric learning. The key conclusions are as follows.

This paper constructed a high-quality dataset of flame and smoke targets around transmission lines and proposed a wildfire detection method based on the Oriented R-CNN model. This method integrated the multi-feature center metric loss (MCM-Loss) module, the group convolutional network ResNeXt, and the FPN-CARAFE structure with content-aware up-sampling. Experimental results demonstrated that the proposed model outperformed other widely used detection algorithms on key indicators such as precision, recall, and mAP.

The proposed MCM-Loss module improved the model’s ability to distinguish smoke from visually similar backgrounds. The ResNeXt backbone enhanced feature representation without significantly increasing the model complexity. The FPN-CARAFE structure further boosted small target detection by refining multi-scale feature fusion.

Moreover, the overall model maintained a relatively small parameter size, suggesting its potential for deployment on edge devices in transmission line scenarios.

However, there are still some limitations in the current study. First, the detection accuracy of smoke instances in highly cluttered backgrounds remains to be further improved. Second, although the model is relatively lightweight, its real-time inference speed on embedded devices has not been fully evaluated.

In future work, we will focus on lightweight network design (e.g., model pruning and structure simplification) and real-time performance optimization, aiming to achieve accurate and efficient wildfire detection in resource-constrained environments.

Author Contributions

Conceptualization, X.W. and B.W.; methodology, X.W. and P.L.; software, X.W., L.W. and Y.W.; validation, X.W., B.W., P.L., L.W. and Y.W.; formal analysis, X.W.; data curation, P.L.; writing—original draft preparation, X.W.; writing—review and editing, X.W.; visualization, B.W.; supervision, B.W., P.L., L.W. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not publicly available because they are self-built and contain sensitive information. Requests to access the datasets should be directed to 2024202070023@whu.edu.cn.

Conflicts of Interest

Authors Leixiong Wang and Yurou Wu were employed by Wuhan Power Supply Company, State Grid Hubei Electric Power Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Overall detection framework.

Figure 2. The network architecture of the Oriented R-CNN model.

Figure 3. Principles of metric learning.

Figure 4. Working principle of the multi-feature center metric loss (MCM-Loss) module.

Figure 5. Comparison of the block structures of ResNet and ResNeXt.

Figure 6. Work principles of CARAFE.

Figure 7. Transmission corridor wildfire dataset example.

Figure 8. Comparison of detection effects between the baseline model Oriented R-CNN (subfigures a–d) and the improved model in this paper (subfigures a′–d′).

Figure 9. Visual comparison of different algorithms for wildfire smoke detection.

Table 1. Composition of the wildfire dataset for transmission corridors.

Dataset	Flame (Shanhuo) Number of Instances	Smoke (Yanwu) Number of Instances	Dataset Partitioning
Transmission Corridor Wildfire Dataset	1661	2182	8:1:1

Table 2. Experimental platform parameters.

Operating System	CPU	GPU	Memory
Ubuntu 20.04	Dual-core Intel Xeon E5-2640v4 2.4 GHz (Intel Corporation, Santa Clara, CA, USA)	4 Nvidia Tesla V100 16 GB NVLink (NVIDIA Corporation, Santa Clara, CA, USA)	128 GB ECC DDR4

Table 3. Ablation experiment results.

Model	MCM-Loss	ResNeXt	FPN-CARAFE	P/%	R/%	Flame (Shanhuo) AP/%	Smoke (Yanwu) AP/%	mAP/%	Parameters/Million
Baseline Model				86.7	85.8	80.8	87.2	84.0	41.13
Exp 1	√			88.4	88.2	81.7	89.7	85.7	42.79
Exp 2	√	√		89.0	89.1	82.5	90.0	86.3	42.79
Exp 3	√	√	√	95.8	90.5	90.6	90.2	90.4	48.40

Exp 1, Exp 2, and Exp 3 refer to Experiment 1, Experiment 2, and Experiment 3, respectively.

Table 4. Comparative experiment results.

Model Name	P/%	R/%	Flame (Shanhuo) AP/%	Smoke (Yanwu) AP/%	mAP/%	Parameters/Million
YOLOv8-l	79.9	76.4	74.5	77.5	76.0	43.61
ReDet	87.8	82.3	81.8	80.5	81.2	33.37
RoI Transformer	86.0	85.2	81.8	80.8	81.3	55.12
Rotated Faster R-CNN	86.5	86.3	81.9	87.4	84.7	41.14
Swin-T RoI Transformer	86.2	87.1	82.8	86.7	84.8	58.75
Gliding Vertex	89.7	87.9	83.2	87.2	85.2	41.16
Our model	95.8	90.5	90.6	90.2	90.4	48.40

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
