1. Introduction
As power systems and smart grids continue to advance, power equipment inspection is progressively transitioning from traditional manual methods toward intelligent and automated visual monitoring. Leveraging high-resolution cameras, unmanned aerial vehicles (UAVs), and mobile inspection robots, modern power systems continuously generate massive volumes of image data. These images provide critical visual information for equipment condition analysis, fault detection, and defect identification. However, the exponential growth of visual data also introduces a major challenge: data redundancy. In particular, power image datasets often contain a large proportion of redundant or highly similar samples [1,2]. Such redundancy primarily arises from periodic inspections that repeatedly capture the same equipment under similar viewpoints, as well as from derivative images generated during preprocessing steps such as format conversion, resizing, and rotation. Redundant data not only impose significant storage burdens but also degrade the quality of downstream tasks, for instance, by causing deep learning models to overfit redundant features and reducing generalization performance. Furthermore, deep learning methods have been applied broadly to vision and language applications [3,4,5,6,7,8,9] and to other learning-based applications [10,11,12,13,14,15,16].
Existing techniques for redundancy reduction in power images can be broadly categorized into three types: conventional content-driven feature matching, perceptual hashing for rapid comparison, and deep learning models for capturing image similarity. Conventional feature matching offers fast computation but struggles with the complex backgrounds, lighting variations, and geometric structures typical of power equipment images. Perceptual hashing improves efficiency and provides some robustness, yet often fails to capture deeper semantic differences. Deep learning models provide stronger representational capacity but require substantial computational resources and large labeled datasets, and may still lack sufficient semantic understanding of power-specific content. More importantly, existing approaches frequently cannot distinguish between genuine equipment state changes and superficial imaging variations. These shortcomings limit their effectiveness in real-world power inspection scenarios and motivate a more robust and semantically aware redundancy removal framework.
To address these challenges, we propose a dual-phase redundancy removal framework that integrates both perceptual cues and high-level semantic information. In the first stage, an improved Discrete Cosine Transform (DCT)-based hashing algorithm combined with multi-scale structural similarity (MS-SSIM) performs fast and lightweight initial filtering, emphasizing visual similarity with low computational cost. In the second stage, a Vision Transformer enhanced with a hierarchical top-k sparse attention mechanism extracts semantic representations, and cosine similarity is employed for precise redundancy judgment. The framework introduces several key innovations: (1) a disturbance-resilient hashing algorithm that maintains stable responses under rotations and brightness variations; (2) a dynamic threshold-gated sparse attention mechanism within the Transformer that reduces computation by approximately 42%; and (3) a three-level feature processing pipeline that combines low-level MS-SSIM filtering, mid-level perceptual hashing, and high-level Transformer-based semantic analysis to balance efficiency and accuracy.
To enhance robustness under real-world conditions, we incorporate a Retinex-theory-based preprocessing module to mitigate lighting variability and an attention-guided feature enhancement mechanism to focus on critical structural regions of power equipment. The approach is evaluated on a large-scale dataset covering six categories of power equipment: substations, transmission lines, towers, switches, junction boxes, and insulators. Experimental results demonstrate that our method surpasses state-of-the-art alternatives, improving detection sensitivity, reducing false detections, and responding better to subtle equipment changes. Specifically, sensitivity to minor visual anomalies, such as small cracks or corrosion, is improved by approximately 15% over the best-performing baseline, while overall processing speed is increased by over 4× due to the two-stage design. These findings underscore the proposed framework's strong potential for real-world deployment in engineering applications.
2. Related Work
2.1. Image Retrieval
The need for fast image retrieval over large datasets has driven the development of diverse hashing methods, in which complex high-dimensional visual descriptors are transformed into compact binary codes. These methods aim to preserve semantic similarity while significantly reducing computational cost. Hashing methods are conventionally divided into two principal classes: data-independent schemes and data-dependent algorithms that learn from the underlying data distribution.
Data-independent hashing does not rely on training data to define its transformation functions. A classic example is Locality Sensitive Hashing (LSH) [17], which uses random projections to ensure that similar inputs are likely to be mapped to the same hash code. Despite its efficiency in retrieving data from high-dimensional feature representations, LSH exhibits certain limitations: its effectiveness is constrained to specific similarity metrics, and it lacks adaptiveness to complex data distributions.
In contrast, data-driven hashing methods derive hash functions from the training data itself, enabling improved conformity to the intrinsic structure of the data. Depending on whether annotated supervision is available during training, these learning-based methods are further divided into unsupervised and supervised categories.
In unsupervised hashing, the algorithms exploit statistical patterns without requiring class annotations. One prominent method, Spectral Hashing [18], formulates the hash learning task as an eigen-decomposition problem based on graph Laplacians. Despite its elegance, it assumes a uniform data distribution, limiting its generalization. Another variant, Kernelized LSH [19], incorporates nonlinear transformations using kernel functions, enhancing the capacity to capture more complex structures without supervision.
Supervised hashing methods, in contrast, leverage semantic labels to guide the learning of hash codes that reflect class-level similarity. For instance, Binary Reconstructive Embedding [20] minimizes the reconstruction error between the original feature space and its representation in the Hamming space. Similarly, Minimum Loss Hashing (MLH) [21] reduces quantization loss to better approximate semantic distances. Kernel Supervised Hashing (KSH) [22] further introduces kernel-based metrics to emphasize separation between dissimilar categories while clustering similar ones. These methods often achieve higher retrieval precision but incur increased training cost and require sufficient labeled samples.
Nevertheless, supervised approaches face a major obstacle due to their reliance on large-scale annotations, which require considerable manual effort and expense. While some semi-supervised variants attempt to reduce this burden by utilizing partial labels [23], their performance can degrade with small labeled subsets or under distribution shifts. Moreover, many conventional hashing models rely on linear projections, which are insufficient to model the intricate nonlinear relationships inherent in image data [24].
Recently, deep learning has significantly advanced image retrieval by combining feature extraction and hash code generation within a single framework. Convolutional neural networks are extensively used to encode semantic content into binary representations [25,26]. These networks are typically trained with label supervision, and the learned embeddings are binarized via thresholding mechanisms applied to fully connected layers.
To overcome the dependence on annotated data, unsupervised deep hashing methods have also emerged [27]. These approaches often employ techniques such as autoencoders, contrastive learning, or generative models to learn effective hash functions by exploiting the inherent data structure. While still an active area of research, they offer promising alternatives for large-scale retrieval scenarios where labeled data is scarce.
In summary, image hashing techniques have evolved from randomized projections to sophisticated deep models. Although supervised deep hashing attains leading accuracy, issues related to scalability, annotation expenses, and generalization persist. As such, developing lightweight, robust, and label-efficient hashing frameworks continues to be a crucial focus for modern image retrieval tasks.
2.2. Vision Transformer
Originally developed for natural language processing tasks, the Transformer architecture has progressively gained prominence as a powerful model in computer vision. Its self-attention mechanism facilitates flexible modeling of long-range relationships, benefiting visual representation learning significantly. Nonetheless, the computational burden increases quadratically with input size, making direct application to high-resolution images challenging.
To mitigate these computational limitations, a range of vision-specific adaptations have been proposed. The Vision Transformer (ViT) [28] represents one of the earliest efforts in this direction. It segments images into non-overlapping patches and processes them as a sequence of tokens, effectively lowering input dimensionality. This approach has motivated numerous subsequent studies [29,30] focusing on improving tokenization techniques or tailoring the architecture for downstream vision applications such as detection and segmentation [31].
Another design principle adopted by many architectures is the gradual reduction of spatial resolution in feature maps, which controls the number of tokens involved in attention operations. For instance, the Pyramid Vision Transformer (PVT) [32,33] introduces a hierarchical encoder and incorporates a sparsified attention mechanism to improve scalability. The Deformable Attention Transformer (DAT) [34] advances this idea by leveraging a deformable attention module that learns adaptive sampling locations based on input content.
Other approaches emphasize localizing attention to limit the receptive field during early processing stages. The Swin Transformer [35], for example, applies attention within shifted non-overlapping windows, striking a balance between efficiency and expressivity. Similarly, NAT [36] adopts a query-wise strategy in which each query token computes attention independently using convolution-inspired local priors.
In addition to architectural innovations, some researchers explore hybrid designs that fuse convolutional operations with self-attention to enhance both performance and efficiency. For example, CMT [37] integrates depthwise convolutions [38] into Transformer blocks, capitalizing on their lightweight characteristics. ACmix [39] further streamlines computation by sharing operations between convolution and attention modules. Such hybridization is motivated by the complementary nature of local convolution and global attention, leading to enhanced feature modeling.
Efforts have also been made to optimize training efficiency and reduce inference latency. For instance, MobileFormer [40] utilizes dual processing streams, one convolutional and one Transformer-based, to combine the benefits of both paradigms. Dynamic Perceiver [41] implements early exit mechanisms, enabling adaptive computation paths based on input complexity. MobileViT [42] extends the MobileNet [38] backbone by integrating Transformer modules into a compact convolutional architecture, rendering it highly suitable for resource-constrained environments.
Overall, Vision Transformers have rapidly evolved through innovations in tokenization, attention mechanisms, and hybrid architectures. Despite these advancements, balancing accuracy, computational efficiency, and scalability remains a formidable challenge, especially for deployment on edge hardware and latency-sensitive real-time systems.
3. Methodology
3.1. Overview
This study presents a redundancy removal framework for power inspection imagery that combines perceptual cues with high-level semantic information, as schematically depicted in Figure 1. The approach targets large-scale power image datasets and aims to efficiently filter out visually redundant samples, thereby improving data processing efficiency and enhancing the accuracy of downstream recognition tasks.
Initially, low-level visual attributes such as texture, edges, and structural details are extracted from each image at the perceptual stage. Based on these features, both hash similarity and structural similarity between images are computed. Hash similarity provides a rapid assessment of overall visual similarity, while structural similarity captures spatial arrangement and detailed differences. Threshold-based filtering is then applied, with empirical thresholds for hash similarity and structural similarity determined from validation experiments that balance recall and precision. Only image pairs exceeding both thresholds are retained as candidate redundant samples.
In the next stage, a sparse Transformer network derives high-dimensional semantic representations from the selected candidates. These features capture abstract semantic information, including object-level understanding, contextual relationships, and scene semantics, complementing the limitations of perceptual features in capturing high-level meaning. Cosine similarity is then calculated between semantic feature vectors, and a semantic similarity threshold is applied to distinguish samples that are perceptually similar but semantically different. Finally, this second-stage semantic filtering ensures that only image pairs with both high perceptual and semantic consistency are identified as truly redundant.
By explicitly defining the threshold parameters and their selection rationale, the proposed dual-path filtering mechanism combining perceptual and semantic analysis is both precise and reproducible. This not only augments the discriminability and informational richness of the preserved instances but also furnishes high-fidelity data for downstream applications, including power apparatus identification and anomaly localization.
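As a reading aid, the following minimal Python sketch outlines the two-stage decision logic described above. The threshold values and the helper callables (hash similarity, MS-SSIM, semantic encoder, cosine similarity) are placeholders rather than the exact components or settings used in this work.

```python
from typing import Callable

def is_redundant(img_a, img_b,
                 hash_sim: Callable, ms_ssim: Callable,
                 encode: Callable, cos_sim: Callable,
                 tau_hash: float = 0.9, tau_ssim: float = 0.85,
                 tau_sem: float = 0.8) -> bool:
    """Two-stage redundancy decision: perceptual screening, then semantic check."""
    # Stage 1: cheap perceptual filters; reject early if either score is low.
    if hash_sim(img_a, img_b) < tau_hash or ms_ssim(img_a, img_b) < tau_ssim:
        return False
    # Stage 2: semantic confirmation with sparse-attention Transformer embeddings.
    return cos_sim(encode(img_a), encode(img_b)) >= tau_sem
```

In practice, only the candidate pairs that survive the cheap first stage are passed to the more expensive Transformer encoder, which is what yields the overall speedup of the two-stage design.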
3.2. Visual Perception Branch
To measure the low-level visual similarity between power inspection images, we employ a perceptual hashing technique based on the Discrete Cosine Transform (DCT). This approach encodes each image into a compact 64-bit fingerprint by extracting essential luminance and structural patterns, and then compares image pairs using Hamming distance. The process consists of several stages, as described below.
Given an input image $I$, we first resize it to a fixed low resolution to standardize the input and eliminate variations caused by image size and aspect ratio. The resized image is denoted by $I_r$. Next, $I_r$ is converted into a single-channel grayscale image $G$, which discards chromatic components and focuses solely on structural and brightness information:

$$G = \mathrm{Gray}(I_r).$$

We then apply the two-dimensional Discrete Cosine Transform (DCT) to $G$ to obtain the frequency-domain representation $D$:

$$D = \mathrm{DCT}(G).$$

The DCT decomposes the image into a set of cosine basis functions of varying frequency, capturing both global and local visual patterns. Low-frequency components in the top-left region of $D$ preserve most of the essential structural information, while higher frequencies correspond to finer details and noise.

To isolate these dominant features, we extract the upper-left $8 \times 8$ submatrix from the DCT coefficient matrix:

$$D_L = D_{1:8,\,1:8}.$$

We compute the mean of all elements in $D_L$, which serves as a threshold to binarize the coefficients:

$$\mu = \frac{1}{64} \sum_{u=1}^{8} \sum_{v=1}^{8} D_L(u, v).$$

The perceptual hash code is generated by comparing each coefficient in $D_L$ with the mean value $\mu$. For each position $(u, v)$, the binary value is assigned as

$$b(u, v) = \begin{cases} 1, & D_L(u, v) \ge \mu, \\ 0, & \text{otherwise}. \end{cases}$$

The final hash code $h$ is obtained by flattening the $8 \times 8$ binary matrix in row-major order:

$$h = \big[\, b(1,1),\, b(1,2),\, \ldots,\, b(8,8) \,\big].$$
For any two images $I_1$ and $I_2$, with respective hash codes $h_1$ and $h_2$, we compute their Hamming distance $d_H$ to measure perceptual dissimilarity:

$$d_H(h_1, h_2) = \sum_{j=1}^{64} \mathbb{1}\big[ h_1(j) \neq h_2(j) \big],$$

where $\mathbb{1}[\cdot]$ is the indicator function. A smaller $d_H$ indicates higher perceptual similarity between the two images.

To normalize the distance into a similarity score, we define:

$$S_{\mathrm{hash}}(I_1, I_2) = 1 - \frac{d_H(h_1, h_2)}{64}.$$

A larger $S_{\mathrm{hash}}$ implies stronger visual resemblance. If $S_{\mathrm{hash}}$ exceeds a predefined threshold $\tau_h$, the images are considered visually redundant:

$$\mathrm{Redundant}_{\mathrm{hash}}(I_1, I_2) = \begin{cases} 1, & S_{\mathrm{hash}}(I_1, I_2) \ge \tau_h, \\ 0, & \text{otherwise}. \end{cases}$$
This DCT-based perceptual hashing method provides several advantages. First, it yields highly compact binary descriptors that enable fast pairwise comparisons. Second, it is robust to small variations in image scale, illumination, and compression. Third, the computational cost is minimal, making it well-suited for large-scale redundancy analysis. By integrating this hashing mechanism into the visual perception branch, we ensure efficient and scalable filtering of redundant samples prior to downstream semantic processing.
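For concreteness, a minimal sketch of such a DCT-based perceptual hash is given below. The 32 × 32 resize resolution is a common convention assumed here rather than a value taken from this work, while the 8 × 8 low-frequency block follows from the 64-bit hash length described above.

```python
import numpy as np
from PIL import Image
from scipy.fft import dct

def phash(path: str, resize: int = 32, block: int = 8) -> np.ndarray:
    """Return a 64-bit perceptual hash as a flat boolean array."""
    gray = np.asarray(
        Image.open(path).convert("L").resize((resize, resize)), dtype=np.float32)
    # 2-D DCT (orthonormal), applied along rows and then columns.
    freq = dct(dct(gray, axis=0, norm="ortho"), axis=1, norm="ortho")
    low = freq[:block, :block]            # keep the low-frequency structure
    return (low >= low.mean()).flatten()  # binarize against the mean coefficient

def hash_similarity(h1: np.ndarray, h2: np.ndarray) -> float:
    """Normalized similarity: 1 minus Hamming distance over hash length."""
    return 1.0 - np.count_nonzero(h1 != h2) / h1.size

# Example: pairs whose similarity exceeds a chosen threshold become candidates.
# s = hash_similarity(phash("a.jpg"), phash("b.jpg"))
```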
3.3. Semantic Feature Extraction Branch
To extract high-level semantic representations from power grid images, we employ a Transformer-based feature extractor enhanced with a sparse attention mechanism. The process begins by converting input images into a sequence of patch embeddings, followed by semantic modeling through attention-based encoding. The resulting feature vectors are used to compute semantic similarities among images. The detailed procedure is described below.
Let the input image be denoted as $X \in \mathbb{R}^{H \times W \times C}$. Initially, the image is segmented into multiple non-overlapping patches of size $P \times P$, yielding a total of $N = HW / P^2$ patches. Each patch is then reshaped into a one-dimensional vector and transformed into a $d$-dimensional feature embedding via a trainable linear layer. We refer to the resultant sequence of embedded patches as

$$E = [\, e_1, e_2, \ldots, e_N \,], \quad e_i \in \mathbb{R}^{d},$$

where $e_i$ represents the embedded vector of the $i$-th patch. To encode positional information, we add a learnable position embedding $p_i$ to each patch embedding:

$$z_i = e_i + p_i, \quad i = 1, \ldots, N.$$
The resulting patch sequence $Z = [\, z_1, \ldots, z_N \,]$ is input into a Transformer encoder that leverages sparse attention. As illustrated in Figure 2, each layer integrates multi-head sparse self-attention to capture long-range contextual dependencies among visual tokens. For every attention head, queries, keys, and values are produced through separate linear projections:

$$Q = Z W_Q, \quad K = Z W_K, \quad V = Z W_V,$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ are learnable weight matrices. The full attention score matrix $A$ is computed as:

$$A = \frac{Q K^{\top}}{\sqrt{d_k}}.$$
To enhance the interpretability and robustness of attention, sparsity is imposed by retaining only the top-$k$ entries in every row of the attention matrix. Concretely, a binary mask $M$ is defined as

$$M_{ij} = \begin{cases} 1, & A_{ij} \in \mathrm{top}\text{-}k(A_{i,:}), \\ 0, & \text{otherwise}, \end{cases}$$

where $\mathrm{top}\text{-}k(A_{i,:})$ denotes the $k$ largest entries in the $i$-th row of $A$. The sparse attention matrix is then defined as

$$\tilde{A} = \mathrm{softmax}\big( A \odot M + (1 - M) \cdot (-\infty) \big),$$

so that masked positions receive zero attention weight. The output of a single sparse attention head is obtained as

$$\mathrm{head} = \tilde{A} V,$$

and the outputs from all attention heads are concatenated and transformed via a linear projection to yield the multi-head sparse attention result:

$$\mathrm{MSA}(Z) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W_O,$$

where $W_O$ is the output projection matrix. To ensure training stability and expedite convergence, batch normalization (BN) is applied before and after the attention and feedforward sublayers. The resulting output of the Transformer encoder layer is

$$Z' = \mathrm{BN}\big( Z + \mathrm{MSA}(\mathrm{BN}(Z)) \big), \qquad Z_{\mathrm{out}} = \mathrm{BN}\big( Z' + \mathrm{FFN}(Z') \big),$$

where FFN denotes a two-layer position-wise feedforward network with non-linear activation.
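To make the sparse attention computation concrete, the following PyTorch sketch implements row-wise top-k masking inside multi-head self-attention. The head count, hidden dimension, top-k ratio, and the omission of the surrounding batch normalization and residual wrapping are simplifications for illustration, not the exact layer configuration used in this work.

```python
import torch
import torch.nn as nn

class TopKSparseAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, topk_ratio: float = 0.5):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.topk_ratio = topk_ratio

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (B, N, dim)
        B, N, _ = z.shape
        # Project to queries, keys, values and split into heads: (B, heads, N, dk).
        q, k, v = (t.view(B, N, self.heads, self.dk).transpose(1, 2)
                   for t in self.qkv(z).chunk(3, dim=-1))
        scores = q @ k.transpose(-2, -1) / self.dk ** 0.5  # (B, heads, N, N)
        # Keep only the top-k scores in each query row; mask the rest to -inf.
        kth = max(1, int(self.topk_ratio * N))
        thresh = scores.topk(kth, dim=-1).values[..., -1:]  # k-th largest per row
        scores = scores.masked_fill(scores < thresh, float("-inf"))
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)
```

Masking before the softmax renormalizes the retained entries over the surviving positions only, which is one common way to realize the sparse attention matrix defined above.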
After passing through multiple such Transformer encoder blocks, the sequence $Z$ contains contextually enriched representations for each image patch. To consolidate these into a single semantic feature vector representing the entire image, average pooling is performed over the patch dimension:

$$f = \frac{1}{N} \sum_{i=1}^{N} z_i,$$

where $f \in \mathbb{R}^{d}$ is the final semantic embedding of the input image. Consequently, all images are projected into a unified $d$-dimensional latent feature space.
To quantify the semantic similarity between two images $I_1$ and $I_2$, represented by embeddings $f_1$ and $f_2$, we calculate their cosine similarity as follows:

$$S_{\mathrm{sem}}(I_1, I_2) = \frac{f_1 \cdot f_2}{\lVert f_1 \rVert\, \lVert f_2 \rVert}.$$

A value close to 1 indicates that the semantic contents of the two images are highly similar. If the cosine similarity exceeds a predefined threshold $\tau_s$, the image pair is considered semantically redundant:

$$\mathrm{Redundant}_{\mathrm{sem}}(I_1, I_2) = \begin{cases} 1, & S_{\mathrm{sem}}(I_1, I_2) \ge \tau_s, \\ 0, & \text{otherwise}. \end{cases}$$
This semantic feature extraction pipeline, from patch embedding to sparse attention and vector-based comparison, enables the model to robustly filter redundant images based on their high-level meaning, complementing the low-level perception branch and enhancing the overall effectiveness of image redundancy governance.
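The final pooling and comparison step can be sketched as follows; `encoder` stands for any stack of the sparse-attention blocks above, and the threshold value is a placeholder rather than the exact setting used in this work.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def semantic_redundant(tokens_a: torch.Tensor, tokens_b: torch.Tensor,
                       encoder: torch.nn.Module, tau_sem: float = 0.8) -> bool:
    """Average-pool encoder outputs and compare the embeddings by cosine similarity."""
    f_a = encoder(tokens_a).mean(dim=1)  # (B, N, d) -> (B, d)
    f_b = encoder(tokens_b).mean(dim=1)
    cos = F.cosine_similarity(f_a, f_b, dim=-1)  # (B,)
    return bool((cos >= tau_sem).all())
```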
4. Experiments
4.1. Implementation Details
To improve model robustness and mitigate overfitting, several data augmentation techniques are applied to the training set, including the following:
Random rotations within a small angular range;
Horizontal and vertical flips;
Random brightness and contrast adjustments;
Random cropping and resizing.
These augmentations increase the diversity of the training data and ensure that the model is resilient to common variations encountered in real-world power grid imagery.
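As an illustration, a torchvision pipeline along these lines is sketched below; the rotation range, jitter strength, and crop scale are hypothetical placeholders rather than the exact values used in our experiments.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),                     # assumed range
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),      # assumed strength
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # crop and resize
    transforms.ToTensor(),
])
```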
The semantic feature extraction branch is based on a Vision Transformer with sparse attention. The network consists of 12 Transformer encoder blocks, each with 8 attention heads and a hidden dimension of 512. The patch size is set to 16 × 16, so each image is converted into a fixed-length sequence of patch tokens. Each encoder block contains the following:
Multi-head sparse self-attention layer;
Two-layer feedforward network with GELU activation;
Batch normalization before and after attention and feedforward layers.
The final image embedding dimension is 512. The total number of trainable parameters in the semantic branch is approximately 21.3 million.
The model is trained using the AdamW optimizer with weight decay for regularization, and a cosine annealing scheduler gradually reduces the learning rate over 100 epochs. The batch size is set to 32, and dropout with a rate of 0.1 is applied in each Transformer block to prevent overfitting. All models are implemented in PyTorch 2.7.1 and trained on an NVIDIA RTX 4090 GPU.
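A minimal sketch of this optimization setup in PyTorch is shown below; the learning rate and weight decay are placeholder values, and a simple linear layer stands in for the semantic branch.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the semantic branch
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one pass over the training loader, with optimizer.step() per batch ...
    scheduler.step()  # anneal the learning rate once per epoch
```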
4.2. Dataset Construction
To evaluate the efficacy of our sparse attention-based semantic feature extraction framework for power image analysis, we curated a dedicated dataset designed for this study. A total of 295 high-resolution images were collected from multiple real-world power grid operation scenarios, including three substations and two transmission line segments. These images cover a variety of typical elements such as electric towers, maintenance personnel, operation tools, and the surrounding environmental context. The dataset is primarily intended for evaluating redundancy detection and semantic representation in power inspection imagery, supporting tasks such as anomaly identification and equipment condition assessment.
To emulate authentic redundancy scenarios within training datasets and to meticulously evaluate the model’s capacity for redundancy detection while maintaining semantic distinctiveness, we augmented the original images via a suite of transformation operations:
Cropping: Random regions of interest (ROI) were selected and cropped to introduce spatial variation.
Scaling: Images were rescaled using random scale factors to simulate different zoom levels.
Filtering: Image filters such as Gaussian blur, edge enhancement, and brightness/contrast adjustment were applied to mimic sensor-induced variations.
Compression: JPEG compression with varying quality factors was applied to simulate storage or transmission artifacts.
These augmentations resulted in over 6000 redundant images with different degrees of similarity to the original set. Each original image and its augmented variants form a similar-image group, yielding a total of 295 groups. The images within each group are visually and semantically similar but exhibit minor content and quality discrepancies, reflecting common scenarios in real-world power image datasets. A sample group of similar images is shown in Figure 3, highlighting the high intra-group similarity and the subtle inter-instance differences.
This dataset provides a solid foundation for evaluating both semantic feature extraction and redundancy detection tasks.
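For illustration, the following Pillow-based sketch shows how such redundant variants could be generated from an original image; the crop region, scale factor, blur radius, and JPEG quality are arbitrary examples rather than the exact settings used to build the dataset.

```python
import io
from PIL import Image, ImageFilter

def make_variants(path: str) -> dict:
    """Generate a few visually redundant variants of one image."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    variants = {
        "crop":  img.crop((int(0.1 * w), int(0.1 * h), int(0.9 * w), int(0.9 * h))),
        "scale": img.resize((int(0.75 * w), int(0.75 * h))),
        "blur":  img.filter(ImageFilter.GaussianBlur(radius=2)),
    }
    # Simulate storage/transmission artifacts by re-encoding at low JPEG quality.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=40)
    variants["jpeg"] = Image.open(io.BytesIO(buf.getvalue()))
    return variants
```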
4.3. Evaluation Metrics
To comprehensively evaluate the model, we adopt the following classification-oriented metrics.
Recall measures the degree to which the model retrieves all relevant (positive) targets. It is calculated as:

$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$

Here, $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
Precision quantifies the proportion of correctly identified redundant images among all instances classified as redundant by the model, reflecting its efficacy in redundancy discernment. Formally, it is defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
The F1-score provides a balanced evaluation by combining precision and recall into a single measure of the model's ability to identify redundant instances. It is calculated as:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
ROC-AUC quantifies the model's ability to distinguish between redundant and non-redundant images across all possible classification thresholds. The ROC curve plots the true positive rate against the false positive rate as the threshold varies, and the area under this curve (AUC), which typically ranges from 0.5 (random guessing) to 1 (perfect separation), summarizes the overall quality of the ROC curve.
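These metrics can be computed directly with scikit-learn, as sketched below; `y_true` holds ground-truth redundancy labels, `y_score` the cosine similarities, and the decision threshold shown is a placeholder.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def evaluate(y_true, y_score, threshold: float = 0.8) -> dict:
    """Threshold the similarity scores and report the four metrics used above."""
    y_pred = [int(s >= threshold) for s in y_score]
    return {
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "roc_auc":   roc_auc_score(y_true, y_score),  # threshold-free ranking metric
    }
```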
4.4. Results
4.4.1. Redundant Image Detection
We further test the model’s capability in identifying similar images. Given the vectorized representations produced by the sparse attention semantic branch, we compute the cosine similarity between pairs of images. When the similarity exceeds a predefined threshold, the pair is marked as redundant.
The performance in this classification task is shown in Table 1. The model achieves a precision of 0.9961 and a recall of 0.8605, indicating its robust ability to capture visual redundancy even when differences are subtle.
We compare the sensitivity of the proposed method against other approaches commonly used for semantic-feature-based image similarity detection, including ResNet and VGG. Note that sensitivity is the percentage of targets that the algorithm correctly identifies out of the total number of targets.
The comparison in Table 2 shows that the image similarity detection model based on the proposed method outperforms other mainstream models, mainly because the self-attention mechanism of the Transformer builds global dependencies and captures image features more effectively.
4.4.2. Ablation Study
This ablation study examines the effect of the sparsity level in sparse attention. To analyze the role of attention sparsity in image similarity calculations, we designed a comparative experiment to examine how varying sparsity levels affect the results. The top-k parameter was set to 25%, 50%, and 75%, representing different degrees of sparsity. Across four experimental groups, we compared the impact of sparsity and its degree on the computational results. As shown in Table 3, the sparsity-enhanced attention mechanism improved algorithm sensitivity by 5–6 percentage points. This demonstrates that sparsity filtering effectively removes distracting semantic features, allowing semantic representations to concentrate on critical information. Notably, setting the sparsity parameter at 50% performed slightly better than 75%, indicating that excessive sparsity may lead to information loss. Therefore, practical implementations typically adopt a 50% sparsity configuration.
4.4.3. Visualization
To substantiate the efficacy of the proposed approach, Figure 4 shows the similarity scores and decisions for several redundant and non-redundant image pairs. The cosine similarity between the two images in the first row is 0.874, and they are judged to be a redundant pair. The cosine similarity between the two images in the second row is 0.153, and they are judged to be non-redundant. The cosine similarity between the two images in the third row is 0.419; the pair is judged non-redundant, which is a missed detection. The main reason is that some of the augmented data are overexposed, which distorts the extracted image features and leads to misjudgment.
5. Discussion
The findings confirm that the proposed model effectively evaluates image quality and identifies redundant images within power grid visual data. The use of sparse attention improves the discriminative capability of feature representations, making them more resilient to minor perturbations introduced by augmentations. Furthermore, the cosine similarity between semantic embeddings serves as an effective measure of image affinity, aligning well with human judgments of visual similarity. These characteristics are essential for ensuring data diversity and minimizing overfitting during training on real-world datasets.
Nevertheless, the current dataset size is relatively limited, which may constrain the assessment of the model’s generalization capacity. Future work will involve validating the proposed method on larger and more diverse datasets, encompassing various power grid environments, imaging conditions, and equipment types, to further ascertain its robustness and applicability in broader real-world scenarios.
In addition, although the dual-path filtering mechanism enhances redundancy detection accuracy, it inevitably introduces extra computational overhead due to the combined use of perceptual and semantic feature extraction. For deployment in real-time or resource-constrained power grid monitoring systems, trade-offs between accuracy, processing speed, and hardware requirements must be carefully considered. Potential optimizations, such as model compression, hardware acceleration, and adaptive filtering strategies, will be explored to ensure efficient integration into large-scale operational environments.
Beyond these points, other limitations should be noted. While the model demonstrates robustness to moderate rotations and lighting variations, its performance under extreme environmental conditions or severe image distortions has not been fully evaluated. Additionally, the current framework is designed for static images, and extending it to continuous video streams or multi-temporal monitoring scenarios requires further investigation. Addressing these limitations will guide future improvements and facilitate deployment in more diverse operational settings.
Future work will address these limitations by expanding the evaluation to larger and more diverse datasets, encompassing a wider variety of power grid environments, imaging conditions, and equipment categories. We also plan to investigate lightweight or real-time variants of the model, explore adaptive attention mechanisms for dynamic scene understanding, and integrate multi-modal data sources, such as infrared or LiDAR imagery, to further enhance redundancy detection accuracy and applicability in operational power grid monitoring systems.
6. Conclusions
We have presented a semantic-aware sparse attention framework designed for redundancy detection and perceptual quality assessment in power grid image collections. The proposed framework first decomposes high-resolution inspection images into a structured sequence of patch-level tokens, which are embedded into a latent space that retains both local detail and global context. This embedded representation is then refined through a sparse self-attention mechanism, in which computational emphasis is selectively allocated to the most salient interactions across token pairs. Concretely, by retaining only the top-k attention coefficients in each query row, the model avoids the inefficiencies of dense attention and suppresses the propagation of semantically inconsequential signals.
Such a design not only mitigates the redundancy inherent in large-scale visual datasets but also augments the model’s capacity to focus on discriminative regions—such as damaged components, abnormal textures, or anomalous patterns—crucial for downstream analytic tasks. The sparse attention module thus serves a dual role: as a computational economizer and as a semantic distiller, ensuring that only the most informative visual dependencies are preserved for subsequent layers. Through this synergy of tokenization, semantic abstraction, and structured sparsity, our method delivers enhanced discriminability and robustness in scenarios demanding both precision and interpretability.
To further improve semantic discrimination, a feature extraction backbone incorporating batch normalization, multi-head sparse attention, and a feedforward module was designed. This architecture enables the network to learn robust and informative features from power operation scenes with high precision, even in the presence of noise and low-quality inputs.
We constructed a dataset of 295 original power images and generated over 6000 redundant samples using various augmentation techniques. Experimental results on this dataset corroborate the efficacy and robustness of our proposed framework. Specifically, for binary classification tasks such as redundancy detection, the model achieves a precision of 0.9961 and a recall of 0.8605, verifying its utility in identifying redundant samples effectively.
Overall, our approach provides a practical and efficient solution for semantic representation learning and redundancy management in power image datasets. By reducing data redundancy and emphasizing high-quality, informative samples, the proposed method lays the foundation for more scalable and intelligent visual inspection systems in power grid operations. Future work will explore the integration of cross-modal attention to incorporate text or sensor metadata and the application of our approach to larger-scale industrial image analysis tasks.