1. Introduction
Power facilities are the core carriers of power grid operation, and their health directly determines the reliability of power supply and the safety of grid operation and maintenance (O&M). Key components such as transformer bushings, insulators, and conductor joints are prone to defects like overheating, aging, and damage after long-term operation. Failure to detect and address these defects in a timely manner may lead to equipment burnout, line tripping, and even large-scale power outages [1]. Therefore, accurate defect detection and efficient O&M for power facility components are core requirements for the safe and stable operation of power grids [2]. Under normal operating conditions, power facility components exhibit structural symmetry and symmetric thermal distributions, while defects typically manifest as symmetry breaks in the thermal field. Nevertheless, traditional detection methods rarely exploit this symmetry.
Current power inspection mainly relies on manual on-site investigation with infrared thermal imagers, and this mode has significant limitations. On one hand, manual inspection is highly susceptible to terrain, weather, and personnel experience. It has limited capability to identify concealed defects such as internal aging of insulator sheds and micro-overheating of conductor joints, and is prone to missed and false detections [3]. On the other hand, traditional inspection requires defect information to be recorded manually and transmitted to the backend for analysis, which delays the response. Meanwhile, the lack of visual interaction tools restricts decision-making efficiency [4]. To break through these bottlenecks, infrared detection technology based on computer vision has been extensively researched in recent years. Researchers have successively proposed object detection models based on Faster R-CNN and the YOLO series for localizing power facility components and detecting their defects [5]. However, infrared images of power facilities feature complex scenes, with mutual occlusion of equipment, severe background interference, and small differences in thermal radiation among components. As a result, existing models localize core components with insufficient accuracy and recognize defects with poor robustness, making it difficult to meet the detection accuracy requirements of practical engineering [6,7].
To address the aforementioned limitations, this study proposes an infrared detection method for power facility components based on deep learning and multi-scale feature fusion. Taking an improved YOLOv10 as the basic network framework, this method enhances the ability to extract component features and defect details from infrared images primarily through two custom-designed modules. The first module is the modified MAPC module, whose core function is to optimize the capture of fine-grained features in infrared images. By optimizing the edge-aware convolution kernel and thermal texture enhancement mechanism, this module significantly improves the model’s capability to extract subtle edge contours of power facility components and discriminative thermal texture details. The second module is the innovative Transattn module, specifically designed to tackle the issues of complex background interference and mutual equipment occlusion in power scenarios. This module introduces a dynamic spatial attention mechanism to perform adaptive weighting on feature maps: on one hand, it focuses on the core regions of power facilities to aggregate local thermal anomaly information; on the other hand, it suppresses invalid background noise. Additionally, this module integrates a cross-scale feature interaction mechanism to fuse global equipment structure information with local defect details, thereby achieving comprehensive and accurate characterization of power facility features. The main contributions of this study are as follows:
(1) To enable more accurate extraction of edge contours and thermal texture details from infrared images of power facility components, the modified MAPC module is utilized to capture the feature representations of the components’ local structures and global contextual information. This enhances the model’s ability to resolve micro thermal defects in infrared images, thereby significantly improving the performance of power facility component localization and defect recognition tasks.
(2) To effectively extract and emphasize information about core component regions and key defect locations in the infrared detection task of power facility components, this study proposes a convolutional neural network that integrates MAPC and Transattn. This method aims to enhance the model’s ability to perceive important feature regions in infrared images, which not only strengthens the accurate localization of core components of power facilities but also refines the capture of micro thermal defect details, thereby improving the accuracy and robustness of infrared detection for power facility components.
(3) To break the limitation of disconnection between detection results and on-site scenarios in traditional power inspection, the optimized infrared detection model for power facility components mentioned above is deeply integrated with head-mounted AR devices, enabling real-time visualization of detection results.
The remainder of this paper is organized as follows. Section 2 reviews existing methods related to object detection. Section 3 elaborates on the proposed object detection method. Section 4 presents qualitative and quantitative evaluations of the proposed method. Section 5 discusses the practical application value of the proposed method, analyzes potential limitations of the current research, and puts forward directions for future optimization. Finally, Section 6 summarizes the work presented in this paper.
2. Related Work
Our work covers two primary research directions: image processing and deep learning-based target networks. Herein, we focus on introducing several methods that are closely relevant to our work.
CNN (Convolutional Neural Network): As a deep learning model dedicated to processing grid-structured data, convolutional neural networks (CNNs) exhibit robust capability in extracting features of power components in computer vision-related tasks such as infrared detection of power facilities and equipment condition monitoring, thereby providing key technical support for the intelligent operation and maintenance (O&M) of power grids. In 2023, Li et al. [
8] proposed the PowerResNeXt-V2 model, which was specifically optimized for the characteristics of power infrared images based on ResNeXt. On the one hand, it upgraded the original fixed group convolution to a “dynamic grouping mechanism,” which can adaptively adjust the number of groups according to the size of power equipment (e.g., insulators, transformers) in the input image—fine grouping is adopted for small-sized components to enhance the capture of local thermal details, while coarse grouping is used for large-sized equipment to preserve global structural information. On the other hand, it introduced a “temperature-aware convolution kernel,” which adjusts convolution weights by integrating temperature gradient information of infrared images, enabling the model to more easily focus on thermal anomaly regions of equipment. Without increasing parameter complexity, this design significantly improved the model’s ability to represent features of thermal defects in power equipment. On the substation infrared detection dataset, the defect recognition accuracy was increased by 14.2% compared with the original ResNeXt, while the computational efficiency remained basically unchanged. In the same year, Wang et al. [
9] proposed the PowerSENet++ model, which improved the channel attention mechanism of the classical SENet to adapt to power scenarios. Its core module added a “spatial-temperature dual-dimensional calibration” step to the “squeeze-and-excitation” process: first, it aggregates the global thermal distribution information of equipment through the squeeze operation, then assigns weights to channels in different temperature ranges through the excitation operation, and simultaneously combines spatial attention to filter background interferences such as trees and buildings. In 2023, Wang et al. [
10] addressed the issue of insufficient detection robustness of the single infrared modality under strong illumination and rainy weather conditions by designing a dual-branch CNN fusion model entitled “Cross-Modal CNN Fusion for Power Facility Defect Detection: Fusing Infrared Thermal and Visible Light Images”. In this model, the infrared branch extracts thermal anomaly features of power equipment, while the visible light branch captures the geometric contours of components. An adaptive weight fusion module is used to integrate the features from the two branches. Under extreme weather scenarios, the detection accuracy is improved by 18.7% compared with the single-infrared CNN. This study overcomes the limitations of the single-modality approach; however, its fusion module relies on fixed weight allocation and fails to consider the differences in modal feature priorities among different defect types of power equipment. In 2023, Li [
11] proposed an adaptive receptive field CNN model in the research “Multi-Scale CNN with Adaptive Receptive Field for Infrared Defect Detection of Power Equipment”, aiming to solve the detection challenge of coexisting large components and micro-defects in power equipment. By dynamically adjusting the convolution kernel size and dilation rate, the model enables the shallow network to focus on the thermal texture details of micro-defects and the deep network to capture the global structural features of large components. On the 220 kV substation infrared dataset, the recognition rate of micro-defects with a diameter of less than 3 mm is increased by 21.3% compared with traditional CNNs. Its core innovation lies in embedding the size distribution law of power equipment into the network structure design. Nevertheless, it does not integrate an attention mechanism to optimize background interference suppression, and the model parameter count reaches 8.6 M, making it difficult to adapt to edge inspection devices. In 2024, Zhao et al. [
12] proposed the PowerEfficientNet-X model, which innovatively designed a “power scenario-adapted scaling rule” based on the compound scaling strategy of EfficientNet. Instead of adjusting network depth, width, and resolution at a fixed ratio, it dynamically allocates scaling resources according to the size distribution characteristics of power equipment—for shallow networks used to capture micro-defects, priority is given to increasing width to enhance the extraction of detailed features; for deep networks used to model the global equipment layout, emphasis is placed on increasing depth to strengthen spatial correlation. Meanwhile, to address the large resolution difference in infrared images, an “adaptive resolution input interface” was designed to automatically adjust the input size according to the imaging accuracy of inspection equipment. This collaborative optimization method enables the model to be efficiently adapted to devices with different computing capabilities: on edge-side inspection devices, the lightweight version can achieve real-time detection at 30 fps. In 2024, Zhang et al. [
13] proposed a lightweight CNN based on channel pruning, titled “Lightweight CNN Based on Channel Pruning for Real-Time Infrared Inspection of Power Lines”, to address the computing power constraints of edge devices such as UAVs and head-mounted AR devices. Based on MobileNetV3, this study uses L1 regularization to screen feature channels strongly related to power defects and removes redundant convolutional layers. Eventually, the model parameter count is reduced to 1.2 M, the frame rate reaches 320 FPS, and an mAP50 of 82.5% is maintained in the transmission line insulator crack detection task. Its advantage lies in balancing real-time performance and lightweight design; however, it adopts a single-scale feature extraction mechanism, which has limited effectiveness in solving the feature confusion problem caused by the overlap of multiple components in complex scenarios. In 2024, Zhao et al. [
11] focused on the dynamic evolution characteristics of power equipment defects and proposed the “Temporal CNN for Dynamic Thermal Defect Monitoring of Power Transformers”. By stacking 1D temporal convolutional layers and 2D spatial convolutional layers, this model extracts the changing trends of thermal features from continuous infrared frames to classify the development stages of defects. On the long-term transformer monitoring dataset, the accuracy of defect development trend prediction reaches 89.2%. Its innovation lies in the introduction of temporal dimension information, but it does not optimize the real-time detection performance in static scenarios, and the response delay for sudden defects reaches 1.2 s, making it difficult to meet the requirements of rapid fault early warning.
Attention mechanism: The attention mechanism originates from the tendency of the human visual system to focus on specific information. Introducing this mechanism into deep learning models enables them to capture key features in images more accurately, thereby improving model accuracy and overall performance. This study therefore incorporates attention mechanisms to enhance model performance. In relevant research, a variety of novel attention mechanisms have continuously emerged, bringing new ideas and breakthroughs to the field of infrared detection for power facility components. In 2022, Wang et al. [
14] proposed the ECA-Net (Efficient Channel Attention Network), which adaptively adjusts channel attention weights. While retaining critical feature channel information, it reduces redundant computations. Its efficiency is particularly evident in feature screening for power equipment under complex backgrounds—for instance, in scenarios with densely distributed multiple devices in substations, it can quickly focus on key channels where overheating defects are located, avoiding feature interference from non-target components. In the same year, Guo et al. [
15] proposed the Coordinate Attention mechanism, which innovatively embeds spatial coordinate information into attention calculations. By accurately locating the spatial position of targets in images, it significantly improves the detection accuracy of position-sensitive defects in power equipment, and is especially suitable for transmission line inspection scenarios with complex component layouts. In 2023, Chen et al. [
16] proposed the Dynamic-Attention Network, which constructs a dynamic weight allocation strategy to enable the model to dynamically adjust the attention focus according to the complexity of input infrared images of power facilities and the distribution of defects. Faced with scenarios where complex backgrounds (e.g., tree occlusion, building reflections) and micro thermal defects (e.g., initial local overheating of insulators, slight damage-induced heating of conductors) coexist in power facility scenes, this mechanism can adaptively strengthen attention to key regions—it not only accurately locates defect positions with weak thermal anomaly signals but also avoids redundant computations on meaningless background regions. While improving detection accuracy, it significantly optimizes model inference efficiency, making it more suitable for the dual requirements of real-time performance and accuracy in power inspection. In the same year, Liu et al. [
17] proposed the Multi-Scale Group Attention (MSGA) module, which innovatively combines multi-scale feature extraction with the group attention mechanism. When processing infrared images of power facilities, its core advantage lies in performing group attention calculations separately for features of different scales: for small-scale features (corresponding to micro electrical joints, gaps in insulator sheds, etc.), it enhances the capture of local details; for large-scale features (corresponding to large components such as transformer tanks and transmission towers), it focuses on global structural correlations. Through intra-group feature interaction and inter-group weight balancing, it effectively addresses the detection challenges of “large differences in component sizes and diverse defect types” for power facilities, realizing full-scale coverage recognition from micro thermal spots to large-area aging-induced thermal anomalies and comprehensively improving detection accuracy and robustness. In 2024, Zhang et al. [
18] proposed the Adaptive Contextual Attention (ACA) mechanism, which improves upon the limitations of traditional attention mechanisms—“fixed weights of contextual information and high susceptibility to complex background interference”. By means of an adaptive contextual perception module, it can dynamically adjust contextual weights according to the actual scenes of infrared images of power facilities: when imaging is blurred in rainy weather or the difference in thermal radiation between equipment and the background is small, it automatically enhances the contextual correlation signals of target components and defects; when there is strong light reflection or tree shadow occlusion, it accurately filters out invalid background information to ensure that the localization of defect regions is not disturbed. Experimental results show that this mechanism significantly improves the accuracy of power facility defect detection in complex environments, and is particularly suitable for outdoor inspection scenarios with variable conditions. In addition, Wang et al. [
19] proposed the Spatial-Temporal Attention Transformer, which is specifically designed for dynamic power production scenarios (e.g., real-time monitoring of substation equipment, dynamic inspection of transmission lines). This mechanism breaks through the limitations of traditional spatial attention and incorporates temporal dimension information: by capturing the evolution of thermal features of power facilities during continuous operation (e.g., the expanding trend of overheating regions at joints with changing loads, the progressive change in thermal anomalies caused by insulator aging), it not only enables real-time defect detection but also predicts the development trend of defects. It provides dual support of “detection + early warning” for power operation and maintenance, greatly improving quality control and risk prevention capabilities in dynamic production environments. In 2024, Li et al. [
20] proposed the Cross-Modal Attention Fusion mechanism, which integrates temperature features from infrared images with texture features from visible light images, providing a new approach for cross-modal detection of power facilities. In nighttime or low-light environments, visible light textures can assist in locating infrared thermal anomaly regions, effectively addressing the detection blind spots of the single infrared modality under complex lighting conditions. The emergence of these novel attention mechanisms has brought new technical breakthroughs to the field of infrared detection for power facility components—they not only solve the problem of defect recognition in complex scenarios but also adapt to the real-time and dynamic requirements of power inspection, continuously promoting the improvement of model performance and the expansion of practical applications.
Object Detection Networks: In the fields of deep learning and computer vision, object detection serves as a core technology dedicated to accurately identifying and localizing specific target objects from images or videos, and it holds extremely important applications in the power sector. With the vigorous development of deep learning technology, deep learning-based power object detection methods have continuously emerged and achieved significant breakthroughs. In 2022, Liu et al. [
21] proposed Swin Transformer V2, which enhances model capacity and resolution adaptability. This significantly strengthens the ability to model features of subtle defects in high-resolution infrared images of power facilities, providing architectural support for detecting tiny overheating spots on transformer bushing surfaces. In the same year, Zhou et al. [
22] put forward an end-to-end Transformer detection framework. By realizing direct association between targets and features through cross-attention mechanisms, it performs excellently in locating dense power components (such as multiple sets of disconnectors in switchgear), effectively reducing positioning deviations caused by traditional anchor box mechanisms. In 2023, Qiao et al. [
23] proposed MobileViT, which combines the global modeling capability of Transformers with the local feature extraction advantages of convolutional networks. Its lightweight design offers a feasible solution for real-time detection on power edge devices (e.g., UAV inspection terminals), reducing model parameters by more than 40% while ensuring detection accuracy. In 2023, Zhang et al. [
24] proposed the InfraredPowerDet model, specifically designed for infrared detection scenarios of power facilities, which innovatively introduced a “temperature-spatial dual-modal attention mechanism”. By processing the temperature feature channel and spatial structure channel of infrared images in parallel, this mechanism can not only accurately capture the temperature gradient features of equipment thermal anomaly regions in substation equipment detection but also strengthen the geometric contours of equipment through spatial attention. It effectively addresses the missed detection issue of traditional models for equipment with “weak thermal signals but complete structures” or “blurred structures but obvious thermal features” under complex backgrounds. In inspection images under harsh conditions such as fog, cloud, and backlighting, the detection accuracy is improved by more than 18%. In the same year, Zhao et al. [
25] proposed the PowerYOLOv9 model, which improved the “dynamic anchor box generation mechanism” and “cross-scale feature calibration module” based on the classical YOLO architecture. Aiming at the large size difference in power equipment, the dynamic anchor boxes can adaptively adjust the anchor ratio according to the actual distribution of equipment in the input image, avoiding the localization deviation of fixed anchor boxes for small-sized equipment. The cross-scale calibration module strengthens the correlation between micro thermal defects and global equipment structures through bidirectional interaction among layers of the feature pyramid. In transmission line inspection, the detection recall rate for defects with a diameter of less than 5 pixels is increased to 92.3%, which is 23.5% higher than that of the traditional YOLOv8. In 2024, Li et al. [
26] proposed the PowerViT-X model, which designed a “hierarchical cascaded attention” mechanism based on the vision Transformer architecture. In the encoder stage, small-window attention is used to capture local details of equipment, medium-window attention to model correlations between equipment components, and large-window attention to integrate global scene information. In the decoder stage, “defect-equipment-scene” three-level feature mapping is adopted to achieve accurate localization from global scenes to local defects. In the detection of complex substation infrared images, the model achieves a recognition accuracy of 96.7% for overlapping equipment, with a mean average precision (mAP) 12.4% higher than that of the contemporaneous Faster R-CNN, and its inference speed meets the requirements of real-time inspection. In terms of statistical validation of object detection results, Johnson et al. [
27] systematically sorted out the application specifications of statistical methods such as t-tests and ANOVA in comparing model performance in 2023, providing methodological guidance for analyzing the significant differences between different models in power equipment detection. In 2024, Zhao et al. [
28] proposed a confidence interval calculation method based on Bootstrap sampling targeting the small-sample characteristics of power defect detection, which effectively improved the accuracy of reliability evaluation for detection results on small-sample datasets.
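To make the idea of a Bootstrap confidence interval on a small sample concrete, the following is a minimal sketch of a generic percentile bootstrap for the mean of a metric. It is not the method of [28]; the function name `bootstrap_ci`, its parameters, and the eight score values are hypothetical and chosen purely for illustration.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the resampled means.
    lo_idx = round((alpha / 2) * (n_resamples - 1))
    hi_idx = round((1 - alpha / 2) * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


# Hypothetical per-run detection scores (not taken from any experiment).
scores = [0.81, 0.84, 0.79, 0.86, 0.83, 0.80, 0.85, 0.82]
lo, hi = bootstrap_ci(scores)
print(f"95% CI for the mean score: [{lo:.3f}, {hi:.3f}]")
```

With only a handful of runs, the interval width gives a more honest picture of reliability than a single averaged score.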
3. Detection Methods for Defects in Power Facilities
The CMTA model is a deep learning network designed for infrared defect detection of power facility components. It addresses the limitations of traditional inspection methods, such as low accuracy in defect recognition under complex scenarios and poor adaptability to mobile devices, by integrating multi-angle feature perception, symmetry-aware design, and Transformer-based attention feature fusion.
Figure 1 presents the overall architecture of the image object detection system. The standard YOLOv10 components are omitted from the diagram, while the modules designed in this work are highlighted. In the backbone network, multi-level and multi-scale features are extracted from the original image through a series of operations including convolution, the MAPC module, and the Transattn module. The abstraction level of these features gradually increases, transitioning from simple local features to more representative high-level semantic features. This helps the network capture features of the target at different scales, including texture, shape, and edges. In the head network, features at different levels are fused via upsampling and concatenation to compensate for the varying capabilities of different convolutional layers in capturing targets of different sizes. The improved modules are described in detail below.
In the CMTA network architecture illustrated in Figure 1, Conv-pool denotes the convolution-pooling module designed in this study. Infrared images of power facility components are input into this module: convolution operations first extract basic thermal radiation features and equipment contour information; subsequent pooling operations then compress the spatial dimension of the feature maps while preserving key information, yielding low-level feature information of power facilities. The feature maps output by the convolution-pooling module are fed into the Multi-Angle Perceptual Lightweight Convolution (MAPC) module and the Transattn feature fusion module, respectively. In Branch 1, to enhance the model’s ability to extract features of micro thermal defects in power facilities, low-level features are processed through the following workflow: 1 × 1 convolution for channel expansion → depthwise separable convolution for dimensionality reduction and efficiency improvement → global average pooling and 1D convolution for generating channel attention weights → residual connection for information reuse. This workflow not only adaptively calibrates feature channels strongly associated with power defects (e.g., thermal radiation intensity, temperature gradient) via channel attention but also retains the spatial structure information of equipment components. Meanwhile, a lightweight design is adopted to reduce computational complexity. In Branch 2, to integrate multi-scale features of power facilities and suppress background interference, the positional attention mechanism within this module enhances the perception of the spatial distribution of defects, while the channel attention mechanism screens key thermal feature channels.
Combined with the window multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA) of Swin Transformer, global correlation of features of power equipment at different scales is realized—this not only accurately captures details of micro thermal defects but also correlates global equipment structure information, avoiding feature loss caused by equipment occlusion and background interference. Finally, the features extracted from the two branches are efficiently integrated using the feature fusion strategy designed in this study: first, the local defect features output by the MAPC module and the multi-scale global features output by the Transattn module are concatenated along the channel dimension; then, 1 × 1 convolution is used to adjust the channel dimension and strengthen feature correlation. Simultaneously, feature similarity modulation signals related to the FSM-CIoU loss function are introduced to calibrate the localization accuracy of power defect boundaries. The fused feature maps are input into the head network, where upsampling operations restore the resolution of feature maps. Combined with the anchor box optimization strategy, accurate localization of power facility components and determination of defect categories are achieved, and the infrared detection results of power facility components are finally output.
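The last stage of the Branch 1 workflow described above (global average pooling, a 1D convolution that yields channel attention weights, and a residual connection) can be sketched in pure Python as follows. This is an illustrative sketch rather than the implementation used in this study: the function names, the sigmoid gating, the residual form x + w·x, and the toy kernel are assumptions, and the 1 × 1 and depthwise separable convolutions are omitted.

```python
import math


def global_average_pool(feature_map):
    """Collapse each channel (an H x W grid) to its mean value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def conv1d_same(values, kernel):
    """1D convolution along the channel axis with zero padding (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [
        sum(kernel[k] * padded[i + k] for k in range(len(kernel)))
        for i in range(len(values))
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention_residual(feature_map, kernel):
    """Reweight channels by attention and add the input back (assumed x + w * x)."""
    pooled = global_average_pool(feature_map)
    weights = [sigmoid(v) for v in conv1d_same(pooled, kernel)]
    out = [
        [[x + w * x for x in row] for row in channel]
        for channel, w in zip(feature_map, weights)
    ]
    return out, weights


# Toy input: 3 channels of 2 x 2 "thermal" responses (hypothetical values).
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[4.0, 4.0], [4.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
out, weights = channel_attention_residual(features, kernel=[0.0, 1.0, 0.0])
print(weights)
```

In this toy case the channel with the strongest average response receives the largest weight, which mirrors the intended calibration of channels associated with thermal anomalies.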
3.1. Anchor Box Selection Optimization Strategy
In the object detection task of infrared detection for power facility components, the annotation of data often faces the problem of a significant mismatch between the ground truth bounding boxes and the predefined anchor box sizes [
29]. On one hand, power facility components exhibit extremely large size variations, making it difficult for fixed anchor boxes to cover the full size range. On the other hand, the bounding boxes of defect regions in infrared images have irregular shapes, resulting in poor matching performance with conventional anchor boxes. This issue not only prolongs the convergence time of model training but also leads to component localization deviations and defect missed detections, exerting a significant impact on detection accuracy. The setting of initial anchor boxes usually lacks specificity for the two types of detection targets (i.e., “components-defects”) of power facilities. For instance, anchor boxes designed for large-scale equipment will miss micro thermal defects, while anchor boxes adapted to small defects fail to accurately localize large-scale components. This further reduces model inference efficiency and may even cause decision-making deviations due to matching mismatches between anchor boxes and targets. Given that datasets from different power inspection scenarios possess unique component size distributions and defect morphological characteristics, it is obviously unreasonable to set anchor boxes of fixed sizes for all datasets. Therefore, exploring an anchor box optimization strategy adapted to power scenarios is crucial for improving the infrared detection performance of power facility components. To address the above issues, this study introduces an innovative anchor box selection optimization strategy, which is a customized improvement based on the Density Peaks Clustering (DPC) algorithm combined with the annotation characteristics of the infrared detection dataset for power facilities. 
Unlike traditional clustering algorithms such as K-Means that require predefining the number of clusters, the DPC algorithm, based on the core concepts of “local density” and “relative distance”, can automatically identify density peaks in the bounding box dataset as cluster centers without manual intervention in the number of clusters. This makes it more suitable for the characteristics of “multiple component types and complex defect morphologies” in power-related data. Specifically, for each point $x_i$ in the dataset, its local density $\rho_i$ can be calculated using the following formula:
$$\rho_i = \sum_{j \neq i} \chi\left(d_{ij} - d_c\right) \quad (1)$$
where $d_{ij}$ represents the distance between data points $x_i$ and $x_j$, $d_c$ denotes a cutoff distance, and $\chi(\cdot)$ is a kernel function. When $x < 0$, $\chi(x) = 1$; otherwise $\chi(x) = 0$. After determining the local density, the relative distance $\delta_i$ of each point is calculated. The relative distance refers to the distance from the point to its nearest neighbor with a higher local density, and its calculation formula is given as follows:
$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij} \quad (2)$$
For the point with the maximum local density in the dataset (typically corresponding to the target with the highest quantity proportion and most representative size in the data distribution, such as the bounding boxes of insulator sheds that frequently appear in power scenarios), its relative distance requires a special definition. Since there is no neighboring point with a higher local density, the relative distance of this point is set as the maximum distance from this point to all other points in the dataset. This ensures that the point exhibits prominent peak characteristics in the “local density-relative distance” decision graph, avoiding the missed selection of core cluster centers due to the lack of distance definition.
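The two DPC quantities described above can be illustrated with a minimal sketch. It assumes that each bounding box is represented as a normalized (width, height) pair and that the Euclidean distance is used; the text does not state the distance metric, and the function name, the sample boxes, and the cutoff value are hypothetical.

```python
import math

def dpc_density_and_distance(points, d_c):
    """Return (rho, delta) for 2-D points using the cutoff kernel."""
    n = len(points)
    # ASSUMPTION: Euclidean distance between (width, height) pairs.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distances to all points with strictly higher local density.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # No denser point exists: use the maximum distance to any other point.
        # With tied densities, several points can take this branch.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Hypothetical normalized (width, height) boxes: two groups and one outlier.
boxes = [(0.10, 0.10), (0.11, 0.10), (0.10, 0.12),
         (0.50, 0.50), (0.52, 0.49), (0.90, 0.90)]
rho, delta = dpc_density_and_distance(boxes, d_c=0.05)
```

Points that combine a high `rho` with a high `delta` stand out in the decision graph and are the natural candidates for cluster centers.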
The core objective of clustering the input data via the DPC algorithm is to explore the size distribution patterns of the two types of targets (“components-defects”) in power scenarios. For example, the clustering process can automatically distinguish different size categories such as large-scale components, medium-scale components, small-scale components, and micro-defects, and use the cluster centers of each category as the initial anchor box sizes of the model. This data distribution-based anchor box generation method not only avoids the localization deviation of large-scale components caused by excessively small anchor box sizes but also solves the problem of micro-defect missed detection caused by excessively large anchor box sizes, increasing the matching degree between anchor boxes and real targets by more than 40%. To ensure the stability and reliability of anchor box sizes, this study adopts a “multi-round iterative clustering + result fusion” strategy: a total of 700 independent iterations are conducted. In each iteration, 80% of the annotated data is randomly selected as the training set to perform DPC, and the cluster centers obtained in each iteration are recorded. After all iterations are completed, the size mean value of cluster centers in the same category is calculated, and finally, 6 groups of anchor box sizes adapted to different feature map scales are determined. The anchor box results after normalization are shown in
Table 1, where the 3 groups of anchor boxes under each feature map scale correspond to targets in different size ranges under that scale, providing accurate initial anchor box parameter support for subsequent model training.
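The “multi-round iterative clustering + result fusion” strategy can be sketched as follows. This is an illustration under several assumptions that are not taken from the text: the number of centers `k` is fixed in advance, centers are chosen as the points with the largest product of local density and relative distance, density ties are broken by index, and centers from different rounds are matched by sorting them by area. The synthetic boxes and all function names are hypothetical, and far fewer rounds are used than the 700 reported.

```python
import math
import random

def dpc_centers(points, d_c, k):
    """Pick k centers as the points with the largest rho * delta."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(d < d_c for d in row) - 1 for row in dist]  # exclude the point itself
    delta = []
    for i in range(n):
        # ASSUMPTION: ties in density are broken by index so that each group
        # of equally dense points yields a single peak.
        higher = [dist[i][j] for j in range(n) if (rho[j], -j) > (rho[i], -i)]
        delta.append(min(higher) if higher else max(dist[i]))
    ranked = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    return [points[i] for i in ranked[:k]]

def fused_anchors(boxes, k, d_c, rounds, fraction=0.8, seed=0):
    """Average the cluster centers found on random subsets of the boxes."""
    rng = random.Random(seed)
    sums = [[0.0, 0.0] for _ in range(k)]
    for _ in range(rounds):
        subset = rng.sample(boxes, int(fraction * len(boxes)))
        # ASSUMPTION: sorting by area aligns "the same category" across rounds.
        centers = sorted(dpc_centers(subset, d_c, k), key=lambda wh: wh[0] * wh[1])
        for total, (w, h) in zip(sums, centers):
            total[0] += w
            total[1] += h
    return [(sw / rounds, sh / rounds) for sw, sh in sums]

# Synthetic (width, height) boxes scattered around three nominal sizes.
data_rng = random.Random(42)
nominal = [(0.05, 0.05), (0.30, 0.30), (0.80, 0.80)]
boxes = [(w + data_rng.uniform(-0.01, 0.01), h + data_rng.uniform(-0.01, 0.01))
         for w, h in nominal for _ in range(20)]
anchors = fused_anchors(boxes, k=3, d_c=0.05, rounds=20)
```

Averaging over many random subsets reduces the dependence of the final anchor sizes on any single sample of the annotations.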
3.2. Multi-Angle Perceptual Lightweight Convolution Module (MAPC)
To boost feature extraction capacity and enhance network performance, we have innovatively replaced certain original modules. Specifically, in the backbone architecture, a more sophisticated MAPC module has been employed to take the place of the conventional convolution module. In comparison with traditional feature extraction approaches, the MAPC module markedly elevates the effectiveness of feature extraction through multi-perspective feature fusion and a lightweight convolution strategy, all while cutting down on computational complexity. The MAPC module’s multi-angle feature fusion and channel attention mechanism specifically enhance the extraction of symmetry-related features, such as consistent edge contours and thermal intensity of symmetric components, to identify defect-induced asymmetry.
Traditional convolutional neural networks (CNNs) generally depend on standard convolution operations as the core of feature extraction. From the fundamental perspective of feature extraction, traditional convolution operations primarily concentrate on local edge and texture features, yet they lack adequate consideration of the spatial coordinate information of features during extraction. This trait largely undermines the model’s capability to accurately pinpoint small targets. The underlying cause is that spatial coordinate information plays a vital role in precisely identifying the position and scope of targets, and the disregard of such information in traditional methods might result in substantial inaccuracies in small target localization tasks. Moreover, traditional convolutional networks typically adhere to a single-path forward propagation pattern in their information processing pipeline. This singular information processing approach shows clear shortcomings in feature fusion, especially when handling tasks with complex feature distributions (such as detecting small and irregularly distributed defects). Due to the use of only a single information processing pathway, the network is incapable of achieving multi-perspective feature fusion, thereby failing to thoroughly explore and make use of feature information from various angles. Consequently, it performs inadequately in offering sufficient detailed information support for such detection tasks, which severely limits the model’s performance and adaptability—this limitation becomes even more pronounced, particularly in tasks with high demands for detailed information. The structure of the MAPC module is shown in
Figure 2.
In object detection, this study introduces the multi-angle perceptual lightweight convolution module (MAPC) to enhance the model’s ability to capture spatial structures. First, let the input feature map be denoted as $X \in \mathbb{R}^{C \times H \times W}$, where $C$ represents the number of channels of the feature map, and $H$ and $W$ denote the height and width of the feature map, respectively. In the expansion layer, a $1 \times 1$ convolution is used to perform channel expansion on the input feature map, increasing the channel dimension according to the preset expansion ratio and providing richer feature representations for the subsequent depthwise separable convolution. Next, the feature map output by the expansion layer is fed into the depthwise separable convolution. Because the depthwise stage convolves each channel independently, the computational load is significantly reduced. Meanwhile, according to the set stride, the spatial dimension of the output feature map changes accordingly (e.g., downsampling is performed), which facilitates the extraction of feature information at different scales. Subsequently, global average pooling is applied to the feature map after depthwise separable convolution to compress spatial information, thereby obtaining the global feature representation of each channel.
In this stage, the quantities involved are the bias term, the channel-expanded feature map, the feature map that preserves the spatial correlation of the basic features in a single image, the detail branch feature map, and the multi-scale feature map of a single image. Building on this foundation, channel attention weights are computed using 1D convolution, where the kernel size of the 1D convolution is adaptively configured based on the number of input channels. The calculated channel attention weights are fed into a Sigmoid activation function to normalize their values within the range of (0, 1). These normalized weights undergo element-wise multiplication with the original feature map, enabling adaptive feature recalibration. This process preserves the spatial dimensions of the feature map while enhancing the information from critical channels and attenuating that from less relevant ones, which substantially improves the information representation quality of the feature map. Subsequently, a convolution operation is employed to project the channel count of the feature map to the target output channel number. This step adjusts the feature map’s channel dimension to match the requirements of subsequent network layers, thereby facilitating smoother connections and more efficient information transmission between adjacent layers. The specific formula is as follows:
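$$z = \mathrm{GAP}\left(F_{\mathrm{cat}}\right), \quad w = \sigma\left(\mathrm{Conv1D}_k(z)\right), \quad \tilde{F} = w \otimes F_{\mathrm{cat}} \quad (4)$$
Herein, $z$ represents the result of global average pooling, $F_{\mathrm{cat}}$ denotes the result of feature concatenation, $w$ stands for attention weights, and $\tilde{F}$ indicates the feature map after attention calibration; $\sigma$ is the Sigmoid function and $k$ is the adaptive kernel size of the 1D convolution. When the residual connection conditions are satisfied (i.e., stride = 1 and the number of input channels is identical to the number of output channels), the output feature map of the projection layer is added to the original input feature map. This operation fuses information from the initial input feature map and the feature map processed through the aforementioned steps, effectively preventing information loss and boosting both the stability of feature representation and the smoothness of gradient flow. In cases where the residual connection conditions are not met, the output feature map of the projection layer is directly utilized as the input for the next stage. The specific formula corresponding to this process is presented below:
$$Y = \begin{cases} X + F_{\mathrm{proj}}, & \text{stride} = 1 \ \text{and} \ C_{\mathrm{in}} = C_{\mathrm{out}} \\ F_{\mathrm{proj}}, & \text{otherwise} \end{cases} \quad (5)$$
where $X$ is the original input feature map, $F_{\mathrm{proj}}$ is the output feature map of the projection layer, $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ are the numbers of input and output channels, and $Y$ is the output of the module.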
Through the aforementioned multi-level operations, the multi-angle perceptual lightweight convolution module (MAPC) exhibits numerous advantages. In the expansion layer, the channel expansion operation enriches the information dimension of the feature map, laying a more information-rich foundation for subsequent operations. The depthwise separable convolution significantly reduces the computational load while ensuring feature extraction capability. The channel attention mechanism optimizes channel information representation through adaptive recalibration. The projection layer enables flexible adjustment of the number of channels. Residual connections facilitate information fusion and gradient flow. These operations work synergistically, endowing the model with stronger adaptability to targets of different scales and complexities in object detection tasks. It can more effectively capture spatial structure information, thereby improving the accuracy and robustness of object detection.
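The channel attention and residual steps of the module can be illustrated with a small dependency-free sketch. It is not the authors’ implementation: the kernel-size rule is borrowed from ECA-style attention as an assumption (the text only says the size adapts to the channel count), the 1D convolution weights are uniform placeholders rather than learned values, and all function names are hypothetical. Feature maps are plain nested lists of shape channels × height × width.

```python
import math

def adaptive_kernel_size(channels, gamma=2, b=1):
    """ASSUMED ECA-style rule: an odd kernel size that grows with the channel count."""
    k = int(abs((math.log2(channels) + b) / gamma))
    return k if k % 2 == 1 else k + 1

def channel_attention(feature):
    """Recalibrate each channel with a Sigmoid weight; returns (output, weights)."""
    c = len(feature)
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    k = adaptive_kernel_size(c)
    pad = k // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    kernel = [1.0 / k] * k  # placeholder for learned 1D convolution weights
    logits = [sum(w * padded[i + j] for j, w in enumerate(kernel)) for i in range(c)]
    weights = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # Sigmoid
    # Element-wise rescaling keeps the spatial dimensions unchanged.
    out = [[[w * v for v in row] for row in ch] for w, ch in zip(weights, feature)]
    return out, weights

def maybe_residual(block_input, block_output, stride, c_in, c_out):
    """Add the skip connection only when stride == 1 and channel counts match."""
    if stride == 1 and c_in == c_out:
        return [[[a + b for a, b in zip(ra, rb)] for ra, rb in zip(ca, cb)]
                for ca, cb in zip(block_input, block_output)]
    return block_output

# Four 2 x 2 channels filled with the constants 0, 1, 2 and 3.
feature = [[[float(c)] * 2 for _ in range(2)] for c in range(4)]
recalibrated, weights = channel_attention(feature)
fused = maybe_residual(feature, recalibrated, stride=1, c_in=4, c_out=4)
```

In this toy input, channels with larger average activation receive larger weights, and the skip connection is applied because the stride is 1 and the channel counts match.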
3.3. Transattn Module
In the infrared detection task of defects in power facility components, this study innovatively proposes the Transattn module. This module deeply integrates the advantages of the attention mechanism with the powerful feature expression capability of the Transformer architecture, aiming to accurately capture key local defect features and global equipment layout information in infrared images, thereby significantly improving the accuracy of power defect localization and classification. The attention mechanism plays a crucial role in processing image features: it can dynamically allocate computing resources based on the importance of features, and by learning the attention degree of different regions, enable the model to efficiently focus on defect features in detection tasks while weakening the interference of irrelevant backgrounds. Integrating it into a deep convolutional network can effectively enhance the model’s ability to focus on key features and its discriminative performance in power infrared images, strongly promoting the improvement of the network’s detection performance for power defects with “small samples and multi-scales”. In the process of extracting key information from power infrared images, a pre-convolutional layer module with an expanded receptive field is first used to generate multi-scale feature maps, which are then input into the convolution-attention sub-module. This sub-module mainly includes two mechanisms: positional attention and channel attention. The positional attention mechanism is committed to improving the model’s perceptual sensitivity to image spatial information, helping the model accurately capture the boundary contours and spatial distribution of defects, which is crucial for determining the shape and range of defects. 
The channel attention mechanism, on the other hand, focuses on channel features closely related to power defect detection (such as thermal radiation intensity and temperature gradient), and can screen out features with high discriminability, enhancing the model’s ability to identify different types of power defects. The Transattn module’s positional attention and Swin Transformer component model spatial symmetry correlations of power components, highlighting asymmetric thermal regions caused by defects while suppressing background interference. The diagram of the Transattn Module is shown in
Figure 3.
Position Attention Module (PAM): As illustrated in Figure 3, three feature maps, denoted as A, B, and C, respectively, with $\{A, B, C\} \in \mathbb{R}^{H \times W \times C}$, are generated. These feature maps are obtained by applying a nonlinear transformation to the output derived from processing local features via a two-dimensional convolutional pooling layer. They are then reshaped to $M \times C$, where $M = H \times W$. Subsequently, matrix B is transposed, and matrix A is multiplied by the transposed matrix B. Finally, the softmax function is employed to calculate the positional attention map $S \in \mathbb{R}^{M \times M}$, and its calculation formula is presented as follows:
$$s_{ji} = \frac{\exp\left(A_i \cdot B_j\right)}{\sum_{i=1}^{M} \exp\left(A_i \cdot B_j\right)} \quad (6)$$
Here, $s_{ji}$ denotes the impact exerted by the i-th position on the j-th position. Each row of the positional attention map corresponds to one position of the input feature map $F$. Its values lie in (0, 1), where a larger value indicates a more significant position. After this, matrix multiplication is performed between the attention map and C, and the result is reshaped into a new feature map with dimensions H × W × C. The obtained feature map is multiplied by the learnable scale parameter $\alpha$ to better extract attention-related information. Subsequently, an element-wise summation operation is carried out with the input feature map $F$, yielding the output feature map $E$, which can be expressed by the following equation:
$$E_j = \alpha \sum_{i=1}^{M} \left(s_{ji} C_i\right) + F_j \quad (7)$$
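The position attention computation described above can be sketched with plain Python lists. The names `S`, `alpha`, `F`, and `E` are illustrative labels for the attention map, the learnable scale, the input feature map, and the output; the matrices are assumed to be already reshaped to M × C, and the toy values are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a sequence of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def position_attention(A, B, C, F, alpha):
    """A, B, C, F are M x C matrices (M = H * W flattened positions)."""
    # scores[i][j] = A_i . B_j, i.e. A multiplied by the transposed B.
    scores = [[sum(a * b for a, b in zip(A_i, B_j)) for B_j in B] for A_i in A]
    # S[j][i] = s_ji: softmax over source positions i for each target position j.
    S = [softmax(column) for column in zip(*scores)]
    E = []
    for s_j, F_j in zip(S, F):
        # Weighted sum of the rows of C, then scale by alpha and add the input.
        attended = [sum(s * C_i[ch] for s, C_i in zip(s_j, C))
                    for ch in range(len(F_j))]
        E.append([alpha * a + f for a, f in zip(attended, F_j)])
    return E, S

# Toy example with M = 3 positions and 2 channels.
A = [[0.0, 0.0]] * 3
B = [[0.0, 0.0]] * 3
C = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
F = [[1.0, 1.0]] * 3
E, S = position_attention(A, B, C, F, alpha=0.5)
```

With all-zero `A` and `B` the attention is uniform, so every output position receives the mean of the rows of `C`, scaled by `alpha` and added to the input.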
The feature map processed by the Position Attention Module (PAM) is fed into the Swin Transformer module. Firstly, three depthwise separable convolutional layers are employed to fuse feature information from different branches, thereby extracting key feature information of power equipment images. Secondly, the self-attention mechanism is utilized to capture long-range dependencies within the feature map. By partitioning the feature map into windows of fixed size and gradually reducing the number of tokens at different stages, multi-scale spatial features are captured, which enables the recognition of both fine-grained details and large-scale structures, making it suitable for detecting targets of varying sizes. In the attention calculation, relative position bias is introduced to help the model better understand the spatial layout of features and enhance the perception of positional relationships between different elements in the image. Finally, a Multi-Layer Perceptron (MLP) is used to further process the feature map, introducing nonlinear transformations to enable the model to learn more complex and abstract feature representations. The specific formula is expressed as Equation (8):
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{QK^{T}}{\sqrt{d_k}} + B\right)V, \quad \mathrm{MLP}(x) = W_2\,\phi\left(W_1 x + b_1\right) + b_2, \quad \mathrm{LN}(x) = \gamma\,\frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta \quad (8)$$
Among them, $Q$, $K$, $V$ denote the query matrix, key matrix, and value matrix, respectively; $d_k$ represents the dimension of the key; $B$ is the relative positional bias, which is employed to model the relative positional relationships between elements within the window and improve the perception of spatial positions. $W_1$, $W_2$ are weight matrices; $b_1$, $b_2$ are bias terms; $\phi$ stands for the activation function. In addition, $\mu$ and $\sigma$ represent the mean and standard deviation of features, respectively; $\gamma$ and $\beta$ are learnable scaling and shifting parameters; $\epsilon$ is a small constant introduced to avoid division by zero. Such operations can effectively capture local spatial relationships and enable the recognition of local feature information (e.g., target shapes and textures) within specific regions.
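Window attention with a relative position bias can be illustrated as follows. This is a single-head sketch for one square window, written with plain lists; the bias table would be learned in practice, so the values used here are placeholders, and the function names are hypothetical. The index layout follows the common convention of one table entry per relative (row, column) offset.

```python
import math

def relative_position_index(window):
    """Index into a (2*window-1)**2 bias table for every token pair in a window."""
    coords = [(r, c) for r in range(window) for c in range(window)]
    span = 2 * window - 1
    return [[(r1 - r2 + window - 1) * span + (c1 - c2 + window - 1)
             for (r2, c2) in coords] for (r1, c1) in coords]

def window_attention(Q, K, V, bias_table, window):
    """Single-head SoftMax(Q K^T / sqrt(d) + B) V over one window of tokens."""
    index = relative_position_index(window)
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  + bias_table[index[i][j]] for j, k in enumerate(K)]
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        out.append([sum(e * v[ch] for e, v in zip(exps, V)) / total
                    for ch in range(len(V[0]))])
    return out

# One 2 x 2 window (4 tokens) with 2-dimensional features.
Q = [[0.0, 0.0]] * 4
K = [[0.0, 0.0]] * 4
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
uniform = window_attention(Q, K, V, [0.0] * 9, window=2)
# Placeholder table that strongly favours the zero offset (a token itself).
self_bias = [0.0] * 9
self_bias[relative_position_index(2)[0][0]] = 50.0
focused = window_attention(Q, K, V, self_bias, window=2)
```

With a zero bias table every token averages all values in the window, while a bias that favours the zero offset makes each token attend almost only to itself, which shows how the bias term injects positional preference independently of the content.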
3.4. FSM-CIoU Loss Function
To address the limitation of traditional CIoU, which ignores feature heterogeneity, the FSM-CIoU introduces a feature similarity modulation mechanism. It quantifies the matching degree between predicted features and ground-truth features using cosine similarity, dynamically weights the CIoU loss, and enhances the ability to distinguish defects with similar geometries but different features [30]. On one hand, some power defects often present similar bounding box sizes and spatial positions, but there are essential differences between them in the infrared feature space. Relying solely on CIoU [31] tends to ignore such feature information, making it difficult for the model to distinguish defects with similar geometry but heterogeneous features, thus leading to false detections. On the other hand, the coexistence of multiple defects is common in power scenarios, and there are feature correlations between defects. However, CIoU cannot incorporate such correlations into loss calculation, which prevents the model from learning the feature dependency relationships among multiple defects and further affects the accuracy of judging the overall defect status of equipment. The FSM-CIoU loss function incorporates symmetry similarity into feature modulation, increasing loss weight for samples with symmetry-breaking features to optimize defect localization and classification.
To address the above issues, this study introduces FSM-CIoU into the loss function design of YOLOv10. With CIoU as its basic framework, this loss function innovatively integrates a feature similarity modulation mechanism: different defects of power facilities have unique feature distributions in the infrared feature space, and these features are the key to accurate defect classification. The core of FSM-CIoU lies in dynamically weighting the CIoU loss by calculating the cosine similarity between predicted features and ground-truth features—specifically, increasing the CIoU loss weight for samples with low feature similarity. This guides the network to focus on learning samples with significant feature differences, promotes the model to more efficiently capture the essential features of power defects, and improves detection accuracy. Its calculation formula is as follows:
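$$S = \cos\left(f_p, f_g\right) = \frac{f_p \cdot f_g}{\lVert f_p \rVert \, \lVert f_g \rVert} \quad (9)$$
Here, $S$ represents feature similarity, $f_p$ denotes predicted features, and $f_g$ stands for ground-truth features. Based on this, the modulation factor $\lambda$ is defined in this study as a function of $S$ that assigns a larger weight to samples with lower feature similarity.
The final FSM-CIoU loss function weights the CIoU loss of each sample by its modulation factor, so that the i-th sample contributes $\lambda_i L_{\mathrm{CIoU}}^{(i)}$, where $\lambda_i$ denotes the modulation factor of the i-th sample and $L_{\mathrm{CIoU}}^{(i)}$ represents the CIoU loss of the i-th sample.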
By incorporating feature similarity, the proposed FSM-CIoU enables the model not only to focus on the localization accuracy of bounding boxes during learning but also to take into account the unique characteristics of different categories in the feature space, thereby helping the network learn and distinguish various categories more effectively. The CMTA algorithm framework is detailed in Table 2.
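The feature similarity modulation can be sketched as a re-weighting of per-sample CIoU losses. The exact modulation factor is not reproduced here, so the form `2 - S` below is purely an assumption chosen so that the weight is 1 for perfectly matching features and grows as the similarity drops; averaging over samples is likewise an assumption. The CIoU losses are taken as precomputed inputs, and all names and values are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def fsm_ciou_loss(ciou_losses, pred_feats, gt_feats):
    """Mean of the CIoU losses re-weighted by an ASSUMED factor lambda_i = 2 - S_i."""
    weighted = []
    for loss, f_p, f_g in zip(ciou_losses, pred_feats, gt_feats):
        s = cosine_similarity(f_p, f_g)
        lam = 2.0 - s  # ASSUMED form: 1 for identical features, larger otherwise
        weighted.append(lam * loss)
    return sum(weighted) / len(weighted)

# Two samples with the same geometric loss but different feature agreement.
ciou_losses = [0.2, 0.2]
pred_feats = [[1.0, 0.0], [1.0, 0.0]]
gt_feats = [[1.0, 0.0], [0.0, 1.0]]
loss = fsm_ciou_loss(ciou_losses, pred_feats, gt_feats)
```

The second sample has orthogonal features, so its CIoU loss is doubled under the assumed factor while the first sample keeps its original weight.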
4. Experimental Results
To verify the effectiveness of the improved method proposed in this paper, a refined improvement and experimental verification were carried out based on the YOLOv10s [32] architecture. The control variable method was adopted in the experiments, and the effects of different improvement strategies were systematically compared through multiple sets of comparative experiments. To intuitively present the key results, the optimal performance indicators in all experimental data are marked in bold, which clearly highlights the advantages of the improved scheme proposed in this paper.
4.1. Data Source
The dataset utilized in this study is sourced from the Electrical Laboratory of Xidian University, comprising a total of 10,908 infrared images of power facilities, covering 12 core power component categories: bushing, arrester, breaker, clamp, conservator, current-transformer, disconnector, disconnector2, disconnector3, heat-sink, insulator, and transformer. Among these, insulators (1820 images, 16.7%) and transformer bushings (1560 images, 14.3%) rank as the top two categories by sample size, while disconnector2 (436 images, 4.0%) and disconnector3 (476 images, 4.4%) are rare components with proportions below 5%. In terms of defect types, the dataset contains 8920 images of overheating defects (81.7%, with temperatures ranging from 60 °C to 150 °C) and 2008 images of non-overheating defects (18.3%, including component cracks and mechanical failures). All samples were collected from 220 kV substations in Qingdao, covering only temperate monsoon climate scenarios. To ensure the consistency and validity of input data, the following preprocessing steps were uniformly applied to all images in the dataset: first, a 3 × 3 Gaussian filter was used to reduce high-frequency noise generated by thermal imaging equipment; subsequently, morphological closing operations with a 5 × 5 rectangular structuring element were performed to repair edge fractures of power components, preventing edge information loss during feature extraction. Second, an inverse mapping method for infrared thermal radiation values was employed to convert pixel grayscale values into actual temperatures, preserving key thermal features of overheating defects in power equipment. The specific formula is as follows:
In this mapping, the actual temperature is obtained from the pixel grayscale value of the image. Derived from the measurement range of the infrared thermal imager used in the experiment, the formula ensures a linear correspondence between grayscale values and temperatures. The original thermal features of regions with temperatures above 60 °C are retained, while a grayscale compression factor of 0.8 is applied to non-defective regions with temperatures ≤ 60 °C. By enhancing the grayscale difference between defective regions and the background, the model’s ability to capture features of overheating defects is improved. Finally, all images are uniformly resized to 640 × 640 pixels using the bilinear interpolation method to avoid image stretching and distortion; the coordinates of target bounding boxes are normalized from pixel units to relative coordinates, ensuring that model training is not affected by differences in image size.
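The thermal mapping, background compression, and box normalization steps can be illustrated with a short sketch. The temperature range of the imager is not given in the text, so `T_MIN` and `T_MAX` below are hypothetical placeholders, as is the 8-bit grayscale assumption; the center-based relative box format is also an assumption, since the text only states that coordinates are made relative.

```python
T_MIN, T_MAX = 0.0, 150.0  # HYPOTHETICAL imager range in deg C (not from the text)
THRESHOLD_C = 60.0         # regions hotter than this keep their original grayscale
COMPRESSION = 0.8          # grayscale compression factor for the remaining regions

def gray_to_temperature(gray):
    """Linear map from an 8-bit grayscale value to a temperature (assumed range)."""
    return T_MIN + (T_MAX - T_MIN) * gray / 255.0

def compress_background(gray):
    """Keep pixels hotter than 60 deg C; scale all other pixels by 0.8."""
    if gray_to_temperature(gray) > THRESHOLD_C:
        return gray
    return round(COMPRESSION * gray)

def normalize_box(x_min, y_min, x_max, y_max, img_w, img_h):
    """Pixel corners -> relative (x_center, y_center, width, height)."""
    return ((x_min + x_max) / 2 / img_w, (y_min + y_max) / 2 / img_h,
            (x_max - x_min) / img_w, (y_max - y_min) / img_h)

hot_pixel = compress_background(255)   # above the threshold, left unchanged
cool_pixel = compress_background(100)  # below the threshold, compressed
box = normalize_box(160, 160, 480, 320, 640, 640)
```

Compressing only the cooler pixels widens the grayscale gap between overheated regions and the background, which is the effect the preprocessing aims for.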
To effectively evaluate the model’s performance in the infrared detection task of power facility components, the infrared detection dataset of power facility components is divided into a training set, a validation set, and a test set at a ratio of 7:2:1. Specifically, the training set is used for iterative optimization of model parameters and feature learning; the validation set is employed for adjusting model hyperparameters and monitoring overfitting during the training process; and the test set is utilized to simulate real detection scenarios, so as to objectively measure the model’s generalization ability and detection accuracy, and ensure the reliability and fairness of the experimental results.
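A 7:2:1 split of this kind can be sketched as a seeded shuffle followed by slicing. The function name and the seed are hypothetical, and the sketch does not stratify by category, which the text does not specify either way.

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle once with a fixed seed, then cut into train, validation, and test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(ratios[0] * len(items))
    n_val = int(ratios[1] * len(items))
    # The test set takes whatever remains, so no image is dropped by rounding.
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# Image indices standing in for the 10,908 infrared images.
train_set, val_set, test_set = split_dataset(range(10908))
```

Fixing the seed keeps the three subsets identical across runs, which matters when different model variants are compared on the same test set.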
4.2. Experimental Setup
To further clarify the experimental implementation process and ensure the consistency of replication environments, the training, validation, and testing workflows of the CMTA model are visualized in a flowchart. The hardware and software environments used in the experiment are also specified to avoid ambiguity in environment configuration. The experimental workflow is shown in Figure 4.
The validation experiments of the method proposed in this paper are all implemented in the environment of PyTorch 2.0.1 and Python 3.10. The hardware used includes an Intel Core i7-10700K CPU and an NVIDIA GeForce RTX 4090 GPU. The experimental settings are shown in Table 3.
4.3. Core Evaluation Metrics and Calculation Methods
In this study, to evaluate the overall performance of the improved model comprehensively and objectively, mean Average Precision (mAP), model parameter count (Parameters), floating-point operations (GFLOPs), and frames per second (FPS) were selected as core evaluation metrics to construct a multi-dimensional performance evaluation system. Among these metrics, mean Average Precision (mAP) is a key indicator for measuring the model’s detection and classification effectiveness. It is obtained by calculating the Average Precision (AP) of all target categories (such as insulator defects, conductor thermal anomalies, etc.) and then taking the average value, which can comprehensively reflect the model’s recognition accuracy and localization precision for multi-category power targets. This metric is calculated based on Precision (P) and Recall (R), as shown in the following equations:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$
$$AP = \int_{0}^{1} P(R)\,dR, \quad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
In the formulas, $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively, and $N$ represents the total number of categories. In practical applications, a higher mAP value indicates that the model performs better in classification tasks and can more accurately recognize and localize targets of various categories.
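The AP and mAP computation can be sketched as follows. The sketch assumes that each detection has already been labelled as a true or false positive by an upstream matching step (for example an IoU threshold), and it uses all-point interpolation of the precision-recall curve, which is one common convention rather than necessarily the one used in the experiments. Names and sample values are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one category."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Interpolation: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(per_class_ap):
    """mAP: the mean of the per-category AP values."""
    return sum(per_class_ap) / len(per_class_ap)

# Category 1: three detections, two ground-truth objects, one false positive.
ap_1 = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
# Category 2: a perfect detector.
ap_2 = average_precision([0.9, 0.8], [True, True], num_ground_truth=2)
m_ap = mean_average_precision([ap_1, ap_2])
```

The false positive ranked between the two correct detections lowers the AP of the first category, and the mAP is simply the mean over both categories.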
Parameters: Refers to the total number of all learnable parameters in the model training process. Its scale directly determines the model’s structural complexity and the demand for hardware computing resources. An excessively large number of parameters may lead to difficulties in model deployment, while an excessively small number may restrict the model’s feature expression capability, thus requiring a balance between performance and deployment costs.
GFLOPs: As a core metric for quantifying model computational complexity, it is defined as the number of floating-point operations, expressed in billions, required for the model to complete one full forward propagation. Its value directly reflects the computational overhead during the model inference process and serves as a key basis for evaluating whether the model can be adapted to edge-side inspection devices.
FPS: A core metric for measuring the model’s real-time inference performance, representing the number of image frames that the model can complete detection for per unit time. A higher FPS value indicates that the model has stronger real-time response capability, which can meet the requirements for detection timeliness in scenarios such as dynamic shooting and real-time defect early warning in power inspection, and serves as an important consideration criterion for the practical application of the model.
4.4. Ablative Analysis
To verify the effectiveness of the proposed MAPC module, the Transattn feature fusion mechanism, and FSM-CIoU loss function in this study under their synergistic effect, a series of ablation experiments with different module combinations were designed on the infrared detection dataset of power facility components. By comparing the impacts of different module combinations on model performance, the independent contributions and synergistic gains of each module were clarified. The specific settings and results of the ablation experiments are presented in
Table 4:
In the experiment, YOLOv10s was adopted as the baseline model for infrared detection of defects in power facility components. Among the proposed modules, the MAPC module focused on in-depth mining and integration of local features in power infrared images. Through a refined thermal texture and edge feature extraction mechanism, it enhances the model’s perceptual capability for subtle targets such as micro-cracks in insulators and local overheating of conductor joints. The Transattn module, by contrast, emphasizes multi-scale feature fusion and global context modeling, helping the model capture global information, including substation equipment layout and the correlation of transmission line components, thereby improving target localization accuracy in complex scenarios. From the ablation experiment data, the following observations can be made: When the model lacks only the MAPC module, the mAP value drops to 81.85, which is significantly lower than that of the complete model. This indicates that the MAPC module is crucial for extracting features of local subtle defects in power infrared images; its absence will reduce the model’s ability to recognize micro thermal anomalies and component edge defects. When the model lacks only the Transattn module, the mAP value further decreases to 80.39, demonstrating that the Transattn module plays a key role in feature fusion of multi-scale power facility components and global information capture; its absence will make it difficult for the model to correlate scattered equipment features, thus reducing detection accuracy in complex scenarios. When both the MAPC and Transattn modules are missing, the mAP value drops to the lowest of 78.21, whereas when the two modules work synergistically, the mAP value rises back to 85.01.
4.5. Ablation Comparison and Complexity Analysis Between Transattn and Standard Feature Fusion
To further verify the innovation and efficiency advantages of the Transattn module in terms of feature fusion, this section takes the standard feature fusion method of YOLOv10s (Concat + 1 × 1 Conv) as the baseline. Ablation experiments are conducted to compare the performance differences between the two methods, and the complexity overhead is quantitatively analyzed. The experimental results are shown in
Table 5.
As indicated by the comprehensive comparison table, on the infrared dataset of power facility components used in this study, the Transattn fusion method demonstrates significant advantages in balancing performance and complexity compared with the standard Concat + 1 × 1 Conv fusion method. In terms of performance, the mAP50 increases by 4.86 percentage points, the recall rate rises by 3.65 percentage points, and the precision and F1-score improve by 3.05 and 3.35 percentage points, respectively. This is primarily attributed to its ability to focus on core components and micro-defects through the attention mechanism, which avoids the issue of target features being diluted by background features in standard fusion. In terms of complexity, although the parameter count increases by 37.5% and FLOPs rise by 28%, the inference latency only increases by 0.4 ms per frame—far below the 30 ms real-time threshold of head-mounted AR devices—thus balancing accuracy and deployment requirements.
4.6. Experimental Analysis of Loss Functions
In the infrared detection of power facility components, the traditional CIoU only calculates the loss based on the geometric information of bounding boxes. When facing power defects with “geometric similarity but feature heterogeneity”, it is difficult to capture the essential differences at the feature level, leading the model to mistakenly classify different defects into the same category and affecting the learning efficiency of key features. The FSM-CIoU loss function proposed in this paper innovatively introduces a feature similarity modulation mechanism: aiming at the unique distribution of power defects in the infrared feature space, it quantifies the matching degree between the model-predicted feature vectors and the ground-truth feature vectors by calculating their cosine similarity. A cosine value closer to 1 indicates a higher feature similarity; conversely, a more significant feature difference is implied. Based on this similarity evaluation criterion, FSM-CIoU dynamically weights the traditional CIoU loss: it increases the CIoU loss weight for samples with low feature similarity, forcing the network to focus on learning the feature differences of such samples and avoiding the essence of features being obscured by geometric similarity. Thus, the model’s ability to distinguish different defects of power facilities and its detection accuracy are improved. The ablation experiment for the added loss function is shown in
Table 6:
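The weighting idea described above can be illustrated with a minimal sketch. The CIoU term follows the standard formulation; the modulation rule `1 + gamma * (1 - similarity)` and the parameter `gamma` are assumptions made here for illustration only, since this section does not state the exact form used in FSM-CIoU.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector carries no similarity
    return dot / (norm_u * norm_v)


def ciou_loss(pred, gt, eps=1e-9):
    """Standard CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    # Intersection over union.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)
    # Squared center distance over squared diagonal of the enclosing box.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4.0
    c2 = cw * cw + ch * ch + eps
    # Aspect-ratio consistency term.
    v = (4.0 / math.pi ** 2) * (
        math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))
    ) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - iou + rho2 / c2 + alpha * v


def fsm_ciou_loss(pred_box, gt_box, pred_feat, gt_feat, gamma=1.0):
    """CIoU loss scaled up when predicted and ground-truth features disagree.

    The weight 1 + gamma * (1 - similarity) is a hypothetical choice:
    it equals 1 for identical features and grows as similarity drops.
    """
    similarity = cosine_similarity(pred_feat, gt_feat)
    weight = 1.0 + gamma * (1.0 - similarity)
    return weight * ciou_loss(pred_box, gt_box)


if __name__ == "__main__":
    box_p, box_g = (0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)
    print(fsm_ciou_loss(box_p, box_g, [1.0, 0.0], [1.0, 0.0]))  # similar features
    print(fsm_ciou_loss(box_p, box_g, [1.0, 0.0], [0.0, 1.0]))  # dissimilar features
```

With identical geometry, the sample whose features differ more from the ground truth receives the larger loss, which is the behavior the text attributes to FSM-CIoU.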
4.7. Baseline Comparison Experiments
To further ensure the statistical reliability and stability of the experimental results, this study strictly controlled the experimental environment and training parameters to be consistent across all experimental groups. Comprehensive training and testing were conducted on the self-constructed infrared dataset of power facility components for the proposed CMTA model and mainstream comparative models, including YOLOv3-tiny [
33], YOLOv5n [
34], YOLOv7-tiny [
35], YOLOv8n [
36], YOLOv10s [
37], and YOLOv11n [
38]. To reduce the impact of randomness on the results, each model was run 5 times, once with each of 5 different random seeds (123, 456, 789, 1011, 1213). For each key evaluation metric, the mean value was calculated to reflect the stable performance of the model, and the standard deviation (Std) was computed to quantify the fluctuation of the results. The final multi-metric comparison results are presented in
Table 7.
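The mean and standard deviation reporting described above can be sketched as follows. The metric values are hypothetical placeholders, not results from this study, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not say which estimator was used.

```python
import statistics

# Random seeds listed in the text.
SEEDS = (123, 456, 789, 1011, 1213)


def summarize_runs(values):
    """Return (mean, sample standard deviation) over repeated runs."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical mAP50 values (in %), one per seed, for illustration only.
map50_by_seed = {123: 80.0, 456: 81.0, 789: 79.0, 1011: 80.5, 1213: 79.5}

mean, std = summarize_runs([map50_by_seed[seed] for seed in SEEDS])
print(f"mAP50 = {mean:.2f} ± {std:.2f}")
```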
As indicated by the data in
Table 7, after introducing repeated experiments with multiple random seeds, all key metrics of the CMTA model still exceed those of all comparative models. This result indicates that the performance advantages of the CMTA model are not an accident of random factors but stem from the design of its core modules. In terms of specific metrics, the recall advantage of the CMTA is attributed to the ability of the MAPC module to accurately capture features of micro-defects; moreover, the low standard deviation indicates that this module stably extracts fine-grained thermal textures and edge information from infrared images under different random initialization conditions. Similarly, the stable precision and F1-score are ascribed to the ability of the Transattn module to suppress complex backgrounds. In terms of model complexity and real-time performance, the CMTA achieves a detection frame rate of 252 ± 2 FPS with 2.8 M parameters and 7.3 GFLOPs, balancing performance, complexity, and real-time capability. The frame-rate standard deviation of only 2 FPS further indicates that the CMTA can maintain stable real-time response in practical deployment.
To further verify whether the performance advantage of CMTA over baseline models is statistically reliable, paired t-tests were conducted on the mAP50 values (
n = 5, 5 independent runs with different random seeds) of CMTA and each comparative model. The experimental results are shown in
Table 8:
The results show that the t-statistics of CMTA versus YOLOv3-tiny, YOLOv5n, YOLOv7-tiny, YOLOv8n, YOLOv10s, and YOLOv11n are 18.62, 15.37, 12.89, 10.54, 8.73, and 7.91, respectively, with corresponding p-values all <0.001 (far below the significance level α = 0.01). The null hypothesis of no difference in mAP50 between CMTA and each baseline model is therefore rejected, and the performance advantage is statistically significant. This supports the conclusion that the advantage stems from the design of the core modules (MAPC, Transattn, FSM-CIoU) rather than from random factors.
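For reference, the paired t-statistic over n = 5 seed-matched runs can be computed as in the sketch below. The two score lists are hypothetical and are not the values behind Table 8. The constant 4.604 is the two-sided critical value of Student's t distribution for 4 degrees of freedom at α = 0.01; comparing against it is a simplification of the p-value computation reported in the paper.

```python
import math
import statistics

# Two-sided critical value of Student's t, df = 4, alpha = 0.01.
T_CRIT_DF4_ALPHA_001 = 4.604


def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic for two equally long lists of per-seed scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two lists of equal length with at least 2 runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical mAP50 values (in %) for two models over the same 5 seeds.
model_a = [80.0, 81.0, 79.0, 80.5, 79.5]
model_b = [78.0, 78.8, 77.2, 78.3, 77.7]

t_stat = paired_t_statistic(model_a, model_b)
significant = abs(t_stat) > T_CRIT_DF4_ALPHA_001
print(f"t = {t_stat:.2f}, significant at alpha = 0.01: {significant}")
```

Pairing by seed matters here: the test operates on per-seed differences, so seed-to-seed variation shared by both models does not inflate the variance estimate.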
4.8. Evaluation of Generalization Ability of the CMTA Model Based on Simulated Cross-Scenario Testing
To evaluate the generalization ability of the CMTA model, this study leveraged the inherent feature differences in the self-constructed infrared dataset of power facility components and constructed three types of simulated cross-scenario test sets through targeted data partitioning and feature screening strategies, so as to simulate the variations in environmental and equipment features that may be encountered in practical power inspection. The simulated environmental interference difference test set included 1200 images with complex background interference screened from the dataset, covering typical scenarios such as equipment occlusion by trees, reflections from substation buildings, and strong light interference. These images formed a background complexity difference compared with the data characterized by no occlusion and low interference in the original training set. The simulated equipment state difference test set contained 900 images of equipment under different operating loads, including two types of infrared images captured during peak and off-peak electricity consumption periods. The thermal radiation characteristics of the same power equipment in these two types of images showed significant differences, which simulated the scenario differences across equipment operating states and were used to verify the model’s adaptability to dynamic changes in the thermal radiation characteristics of equipment. The simulated imaging condition difference test set comprised 800 images with different imaging distances and angles, divided into two categories: close-range close-ups and long-range wide-angle shots. These images corresponded to the imaging scenario differences between manual close-range shooting and UAV long-range shooting in practical inspection, and could be used to verify the model’s robustness to changes in target scale and image clarity. 
The training parameters in the experiment were kept consistent with those in the baseline experiment, and the results are presented in
Table 9.
The results show that in the scenarios of simulated environmental interference and imaging condition differences, although the mAP50 of the CMTA model decreases by 2.47 to 3.47 percentage points compared with that on the original test set, it still remains above 81%. Moreover, its recall rate is significantly higher than that of the comparative models. This confirms that the background suppression capability of the Transattn module and the fine-grained feature extraction capability of the MAPC module remain effective under non-ideal imaging conditions and complex backgrounds.
4.9. Results on PASCAL VOC 2007
We evaluate the performance of the proposed CMTA on the PASCAL VOC 2007 dataset, one of the gold-standard benchmarks for object detection. As shown in
Table 10, CMTA achieves 82.9% mIoU.
4.10. SOTA Comparison Experiments
To further demonstrate the superiority of the proposed model, a comparison was conducted between the improved model and other models that have exhibited excellent detection performance in recent years. The experimental results are presented in
Table 11:
In the comparative experiment on infrared defect detection for power facility components, the CMTA model proposed in this paper achieves the best performance, with an mAP50 of 85.01%. The existing mainstream power defect detection models all fall short of this accuracy: the mAP50 of DF-YOLOv7 is 78.63%, that of DE-RetinaNet is 79.85%, and that of EDDN is 81.27%; even the best-performing of them, the DDN model, reaches only 82.94%. This result indicates that in the power facility defect detection task, the CMTA model recognizes and localizes targets such as insulator cracks and conductor thermal anomalies more accurately, and adapts better to the difficulties of power scenarios, such as “small-sized defects and complex background interference”.
Figure 5 visually compares the infrared detection effects of the CMTA model, mainstream models, and advanced power defect detection models on power facility components. In scenes with small defects and complex backgrounds (such as transformer bushings overlapping with tree shadows), the CMTA model shows clear advantages owing to its modules: the MAPC module enhances the extraction of thermal texture and edge features to capture subtle thermal anomalies that are easily missed by other models; the Transattn module performs global context modeling and multi-scale feature fusion to suppress background interference and avoid localization deviations; and the FSM-CIoU loss function addresses the misdetection of geometrically similar but feature-heterogeneous defects. These observations support the effectiveness of the CMTA model’s design in practical detection.