LEAD-Net: Semantic-Enhanced Anomaly Feature Learning for Substation Equipment Defect Detection

Zhang, Linghao; Kuang, Junwei; Teng, Yufei; Xiang, Siyu; Li, Lin; Zhou, Yingjie

doi:10.3390/pr13082341

Open AccessArticle

LEAD-Net: Semantic-Enhanced Anomaly Feature Learning for Substation Equipment Defect Detection

by

Linghao Zhang

¹,

Junwei Kuang

^2,*,

Yufei Teng

¹,

Siyu Xiang

¹,

Lin Li

² and

Yingjie Zhou

^3,4

¹

State Grid Sichuan Electric Power Research Institute, Chengdu 610072, China

²

State Grid Luzhou Electric Power Supply Company, Luzhou 646000, China

³

Chuanshen Hongan Intelligent (Shenzhen) Co., Ltd., Shenzhen 518055, China

⁴

School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China

^*

Author to whom correspondence should be addressed.

Processes 2025, 13(8), 2341; https://doi.org/10.3390/pr13082341

Submission received: 4 May 2025 / Revised: 21 May 2025 / Accepted: 30 May 2025 / Published: 23 July 2025

(This article belongs to the Special Issue Smart Optimization Techniques for Microgrid Management)

Download

Browse Figures

Versions Notes

Abstract

Substation equipment defect detection is a critical aspect of ensuring the reliability and stability of modern power grids. However, existing deep-learning-based detection methods often face significant challenges in real-world deployment, primarily due to low detection accuracy and inconsistent anomaly definitions across different substation environments. To address these limitations, this paper proposes the Language-Guided Enhanced Anomaly Power Equipment Detection Network (LEAD-Net), a novel framework that leverages text-guided learning during training to significantly improve defect detection performance. Unlike traditional methods, LEAD-Net integrates textual descriptions of defects, such as historical maintenance records or inspection reports, as auxiliary guidance during training. A key innovation is the Language-Guided Anomaly Feature Enhancement Module (LAFEM), which refines channel attention using these text features. Crucially, LEAD-Net operates solely on image data during inference, ensuring practical applicability. Experiments on a real-world substation dataset, comprising 8307 image–text pairs and encompassing a diverse range of defect categories encountered in operational substation environments, demonstrate that LEAD-Net significantly outperforms state-of-the-art object detection methods (Faster R-CNN, YOLOv9, DETR, and Deformable DETR), achieving a mean Average Precision (mAP) of 79.51%. Ablation studies confirm the contributions of both LAFEM and the training-time text guidance. The results highlight the effectiveness and novelty of using training-time defect descriptions to enhance visual anomaly detection without requiring text input at inference.

Keywords:

substation defect detection; multi-modal learning; text-guided; detection transformer; anomaly detection

1. Introduction

Substations are pivotal components of power grids, functioning as essential nodes for voltage transformation, current distribution, and power management [1]. They play a critical role in ensuring the stability and efficiency of electricity transmission and distribution [2]. In modern power systems, the reliable operation of substations is indispensable in preventing power outages, equipment failures, and disruptions to critical infrastructure [3]. However, the increasing complexity of substation configurations, the diversity of the installed equipment, and the growing demand for uninterrupted power supply have introduced new challenges in their operation and maintenance [4,5]. Substation equipment is prone to various types of defects, such as insulator cracks, corrosion, loose connections, overheating, and oil leaks [6]. These anomalies, if not detected and addressed promptly, can escalate into major failures, causing significant economic losses and threatening grid stability. Traditional methods for inspecting and maintaining substation equipment primarily rely on manual inspections conducted by skilled personnel [7]. While manual inspections are still widely used, they suffer from several limitations, including high labor costs, time inefficiency, and subjective evaluations [8]. Moreover, the increasing scale and automation of power systems have made manual inspections insufficient to meet the growing demand for efficient and accurate equipment monitoring [9]. Thus, intelligent substation inspections using computer vision and deep learning are gaining traction.

The advent of computer vision and deep learning has brought about new possibilities for automating substation equipment inspection [10,11]. Numerous studies have explored deep learning techniques for fault detection, achieving remarkable progress in image-based anomaly detection [12]. For example, convolutional neural networks (CNNs) have been employed for insulator crack detection [13], and advanced object detection frameworks such as Faster R-CNN [14], YOLO [15], and SSD [16] have been adapted for defect localization. More recently, transformer-based detection models, such as DETR and Deformable DETR [17], have demonstrated superior performance on generic object detection benchmarks. However, despite these advancements, several challenges remain unresolved in the domain of substation defect detection.

Substation defect datasets are often highly imbalanced, with very few samples of rare but critical fault types. Furthermore, due to data confidentiality and security concerns, acquiring large, annotated datasets from substations can be difficult. This scarcity of labeled data limits the applicability of data-hungry deep learning models, resulting in suboptimal performance.

The definition of “defect” or “abnormality” is often context dependent and varies between substations. Factors such as the equipment age, model, operational environment, and maintenance history can influence what is considered a defect. For instance, a crack on a newer insulator may be deemed critical, while a similar crack on older equipment might be considered acceptable due to different tolerance levels. This variability poses challenges for designing generalizable defect detection models. While various deep learning models have been proposed for defect detection, effectively enhancing the subtle and diverse anomaly features in complex substation environments remains a significant hurdle. Challenges such as the low visual distinctiveness of certain defects (e.g., incipient cracks, slight corrosion), the high variability in defect appearance, and the influence of operational context on defect definition all complicate the task of designing robust feature enhancement mechanisms. Existing generic feature enhancement techniques, including standard attention modules, often struggle to adapt to these specificities, sometimes amplifying irrelevant dominant features or failing to capture context-dependent anomaly cues. Integrating multi-modal information, such as text, images, audio, video, structured data, etc., into deep learning frameworks remains an underexplored area.

To address this, we propose the Language-Guided Enhanced Anomaly Power Equipment Detection Network (LEAD-Net), a novel framework built upon the DETR (De-tection Transformer) architecture. Specifically, unlike most existing methods which rely solely on image annotations during training, LEAD-Net uniquely leverages textual descriptions of equipment conditions as auxiliary guidance during its training phase. This approach effectively helps the network to learn intricate target feature information, thereby enhancing its defect detection performance. A key component of our framework is the LA-FEM module, which dynamically adjusts channel attention in the image encoder based on these text-derived features. This overcomes the limitations of standard channel attention, which can be biased towards dominant, non-anomalous features, allowing LA-FEM to dynamically focus on the image channels most relevant to text-described anomalies, thus improving detection accuracy. Crucially, during the inference (testing) phase, LEAD-Net operates solely on image data and does not require any additional text input. This distinguishes LEAD-Net from other multi-modal models by requiring fewer inputs during testing, significantly increasing its practical applicability and real-world deployment value in substation environments.

To operationalize this semantic-guided feature enhancement, we propose the Language-Guided Enhanced Anomaly Power Equipment Detection Network (LEAD-Net), built upon the DETR architecture. A key advantage of our framework is that, while textual descriptions guide the learning of enhanced anomaly features during training, LEAD-Net retains fully image-driven during inference, ensuring its feasibility for real-world deployment in substation environments where contemporaneous text input may not be available.

The remainder of this study is organized as follows: Section 2 provides a review of the research background for anomaly defect detection in power systems. Section 3 describes the architecture of LEAD-Net and details the design of LAFEM. Section 4 presents the experimental setup, results, and analysis. Section 5 shows the ablation experiment and analysis of the innovation module. Section 6 concludes the study with a discussion of future research directions.

2. Research Background

Ensuring the reliability and stability of power systems necessitates the accurate and timely detection and analysis of equipment defects, particularly within critical infrastructure such as substations. This section provides a focused review of existing research methodologies applied to this domain, highlighting key advancements, inherent limitations, and remaining research gaps that motivate the present study. We review the evolution from traditional approaches to deep learning paradigms and discuss emerging trends relevant to anomaly and defect analysis.

2.1. Traditional Approaches and Early Automation

Historically, substation equipment inspections and fault diagnosis heavily relied on the experience of skilled personnel and rudimentary, often visual, detection methods [18]. Early attempts at automation leveraged traditional image processing techniques such as edge detection, thresholding, and morphological analysis to identify visible defects such as cracks or corrosion [19]. While these methods offered preliminary automation, their effectiveness was severely limited by sensitivity to environmental factors including lighting, the complexity of substation scenes, and the intricate nature of many equipment failures [20]. These approaches struggled to capture the diverse and subtle visual cues associated with various fault types and lacked the ability to conduct deeper fault analysis or root cause identification [21].

2.2. Deep-Learning-Based Defect Detection

The advent of deep learning marked a significant leap forward, quickly becoming the dominant paradigm for image-based defect detection [22,23].

2.2.1. CNN-Based Methods

CNNs have been extensively applied, with researchers adapting object detection frameworks to localize common defects in substation equipment and transmission lines. Advanced CNN-based methods incorporated techniques including feature pyramid networks for multi-scale detection or attention mechanisms to improve robustness and capture local/global features [24]. Despite their success in many visual tasks, standard CNN-based supervised learning methods face fundamental challenges in the power system domain, primarily stemming from data limitations and difficulties in handling subtle or context-dependent anomalies.

2.2.2. Addressing Data Scarcity and Imbalances

Training high-performing deep learning models typically requires vast amounts of labeled data, which are difficult to acquire for substation defects due to security, confidentiality, and the rare occurrence of critical faults [25]. This data scarcity leads to highly imbalanced datasets, where minority classes (rare but critical defects) are underrepresented, resulting in models that are biased towards common defects and exhibit poor performance on rare critical ones [26]. To mitigate these issues, techniques such as traditional data augmentation and metric learning [27] have been explored, although they often fall short in fully leveraging potential semantic information or adapting to specific application scenes. Methods specifically addressing imbalances through loss function re-weighting [28] or advanced feature modeling under data constraints, such as the HSFE model’s use of class covariance matrices and Gumbel noise for feature amplification, represent efforts to improve performance despite data limitations.

2.2.3. Transformer-Based Models and Multimodal Analysis

More recent advancements in deep learning include the application of Transformer-based architectures (e.g., DETR, Deformable DETR) for object detection, leveraging their ability to model global relationships. Concurrently, Multimodal Large Models (MLLMs) such as GPT-4 [29] are being explored for more comprehensive fault analysis, integrating information from various modalities (text, images, etc.) to generate detailed reports and analyze root causes. However, general-purpose MLLMs lack the specialized domain knowledge required for accurate interpretation within the nuanced context of substation operations. Fine-tuning MLLMs with domain-specific datasets and knowledge bases is crucial for their effective application in fault analysis, as demonstrated by SubstationAI [30], which focuses on generating fault analysis reports based on integrated knowledge.

2.3. Research Gaps and Motivation for This Work

Based on the review of existing literature, several key challenges persist in the field of automated substation equipment defect detection and analysis. While significant progress has been made with deep learning, it remains challenging to effectively handle data limitations and detect subtle, context-dependent anomalies. Emerging MLLMs show promise for comprehensive fault analysis, but their adaptation to specialized domains and focus on report generation differs from the core task of the robust detection of visual anomalies. A notable gap exists in leveraging the rich semantic information available in textual descriptions (e.g., maintenance logs, inspection reports) to directly guide the visual feature learning and enhancement process for anomaly detection during training, in a way that improves detection accuracy without requiring text input at inference for practical deployment. Our proposed LEAD-Net aims to fill this void by introducing a novel, training-time language-guided feature enhancement approach for improved substation equipment defect detection.

3. Proposed Method

3.1. Framework

The proposed Language-Guided Enhanced Anomaly Power Equipment Detection Network (LEAD-Net) is illustrated in Figure 1. LEAD-Net builds upon the DETR framework and extends its capabilities by introducing a novel text-guided learning mechanism to address key challenges in detecting anomalies in substation equipment. During the training phase, LEAD-Net incorporates textual information, such as maintenance records and inspection reports, as auxiliary guidance to enhance the network’s ability to learn subtle anomaly features. This integration of multi-modal information enables LEAD-Net to overcome the limitations of vision-only approaches, such as difficulties in detecting rare or subtle anomalies. Importantly, during the inference phase, LEAD-Net operates solely on image data, ensuring its practicality in real-world applications where textual descriptions may not always be available.

LEAD-Net consists of three main components: an image encoder, a text encoder, and the LAFEM module. The image encoder employs a CNN-Transformer architecture identical to the standard DETR framework, enabling the network to extract both local and global contextual information from substation equipment images. This ensures robust feature representation of both coarse and fine-grained visual patterns. Simultaneously, textual descriptions are processed by a BERT-based text encoder, which generates semantic feature representations. These textual features provide critical contextual information about equipment conditions and defects, which are leveraged during training to guide anomaly detection. LAFEM serves as the core innovation of LEAD-Net, refining image feature representations by dynamically incorporating text-derived guidance. This module ensures that anomaly-related features are effectively emphasized, addressing the limitations of traditional attention mechanisms that often fail to prioritize subtle anomalies in complex visual data.

In the training phase, LEAD-Net extracts feature from images using an image encoder. This process can be described as follows:

f_{i m g} = E n c o d e r_{i m g} (X)

(1)

where

f_{i m g}

represents the extracted image features, and

X

is the input image. Here,

E n c o d e r_{i m g} (\cdot)

serves as the image feature extractor. In LEAD-Net, the image encoder employs the same CNN-Transformer architecture as DETR, allowing it to capture both local and global contextual information.

Additionally, the textual input is processed by a text feature extractor to generate corresponding textual features. This process is defined as:

f_{t e x t} = E n c o d e r_{t e x t} (T)

(2)

where

f_{t e x t}

denotes the extracted textual features, and

T

is the input text. The text encoder

E n c o d e r_{t e x t} (\cdot)

in LEAD-Net adopts BERT as the feature extractor, which is well-suited for capturing semantic representations from textual inputs.

Channel attention enhances feature representation by emphasizing informative channels. However, in power equipment anomaly detection, unguided attention can prioritize dominant components over subtle anomalies, hindering accuracy. We address this with the LAFEM, detailed in Section 2.2. LAFEM uses text to guide channel attention, focusing on anomaly-related features. A significant innovation in LEAD-Net is the Language-Guided Anomaly Feature Enhancement Module (LAFEM), which addresses the limitations of standard channel attention mechanisms. Traditional attention mechanisms, such as SE-Net, often prioritize dominant visual features in an image, inadvertently suppressing subtle anomaly patterns that are crucial for effective defect detection. This is particularly problematic in substation environments, where many anomalies, such as small cracks, discoloration, or surface corrosion, are visually inconspicuous and can be easily overshadowed by larger, non-anomalous regions. LAFEM overcomes this limitation by dynamically guiding attention based on textual features. Specifically, LAFEM projects textual features into the same feature space as the image features and computes cross-modal attention to emphasize the image channels most relevant to the described anomalies.

The enhanced features generated by LAFEM are passed through the DETR decoder, which predicts the categories, positions, and dimensions of the detected anomalies. The decoding process is expressed as follows.

The enhanced features

f_{e}

generated by LAFEM are decoded using the DETR decoder to predict the categories, positions, and dimensions of anomalies. This process is expressed as:

(c l s, b b o x) = D e c o d e r (f_{e})

(3)

where

c l s

represents the predicted anomaly categories (e.g., cracks, corrosion), and

b b o x

denotes the bounding box parameters, including the center coordinates and dimensions. The DETR decoder, following the standard DETR structure, employs a set-based matching algorithm to directly predict the final anomaly detections without requiring post-processing steps such as non-maximum suppression (NMS). This end-to-end detection process ensures efficiency and accuracy in anomaly detection.

LEAD-Net operates differently during training and inference, ensuring both robustness and practicality. In the training phase, both image and textual inputs are utilized. The image encoder extracts visual features, while the text encoder generates semantic features from the accompanying textual descriptions. LAFEM combines these features to dynamically guide the network’s attention toward anomaly-related patterns. During the inference phase, the textual input is removed, and the network relies solely on image data. This decoupling of textual guidance during training and inference makes LEAD-Net practical for deployment in real-world substation environments, where textual descriptions may not always be available.

In summary, LEAD-Net introduces a novel approach to anomaly detection by integrating textual guidance into the DETR framework. The multi-modal learning process enables the network to effectively capture subtle and rare anomaly features that vision-only methods often miss. LAFEM plays a central role in this process by dynamically refining channel attention based on text-derived information, ensuring that anomaly-related features are emphasized while suppressing dominant but less relevant patterns. With its ability to operate solely on image data during inference, LEAD-Net strikes a balance between enhanced detection abilities and practical applicability, making it a promising solution for intelligent substation equipment inspection.

3.2. LAFEM Module

The LAFEM module is the core innovation in LEAD-Net, designed to overcome the inherent limitations of traditional channel attention mechanisms in visual anomaly detection. Standard attention mechanisms are often biased toward dominant, non-anomalous features, especially in complex substation equipment images, where anomalies such as small cracks or surface discoloration may be visually subtle. LAFEM addresses this issue by dynamically guiding channel attention using textual features, enabling the network to amplify responses to anomaly-related features described in the accompanying text. By embedding text-guided contextual information into the image feature space, LAFEM significantly enhances the network’s ability to detect subtle and rare anomalies.

As shown in Figure 2, the architecture of LAFEM consists of three key stages: (1) the aggregation of textual features, (2) the aggregation of image features, and (3) the generation of channel guidance to refine image feature representations. During the training phase, textual features

f_{text}

, extracted using a BERT-based text encoder, are aggregated through a global pooling operation to distill the most salient semantic information. This process can be described as:

a f_{text} = MLP (Avgpool (f_{text}))

(4)

where

a f_{text}

represents the aggregated textual features,

Avgpool (\cdot)

denotes global average pooling, and

MLP (\cdot)

is a multi-layer perceptron. This aggregation ensures that the textual information is condensed into a compact representation, retaining only the most relevant semantic cues related to potential anomalies. This resulting feature vector

a f_{text}

provides a high-level semantic summary of the textual description, ready for integration with visual features.

Similarly, LAFEM aggregates the image features

f_{img}

, extracted from the CNN-Transformer-based image encoder, to derive channel-wise feature embedding. The aggregation process for image features is defined as:

a f_{img} = Conv (Avgpool (f_{img}))

(5)

where

a f_{img}

denotes the image features embedded with channel guidance and

Conv (\cdot)

is, again, a 1 × 1 convolution. This operation reduces the dimensionality of the image feature map while retaining its spatial structure, facilitating the subsequent fusion of textual and image features. The output provides a spatially aware yet channel-condensed representation of the image features, summarizing their content for cross-modal interactions.

Once the textual features

a f_{text}

and image features

a f_{img}

are aggregated, LAFEM combines them to generate channel guidance. This fusion process is performed by computing a weighted combination of the two modalities, embedding the textual semantic information into the image feature space. The combined features

p f_{img}

are expressed as:

p f_{img} = Conv (a f_{text} + a f_{img}) \cdot f_{img}

(6)

where

p f_{img}

denotes the image features embedded with channel guidance. This operation first fuses the aggregated text and image features to derive channel-wise attention weights, which are then applied in an element-wise manner to the original image features. This process effectively re-weights the image features based on the combined textual and visual cues, enhancing anomaly-relevant information.

The enhanced features

p f_{img}

amplify the response of anomaly-related channels, allowing the network to focus on features relevant to power equipment anomalies. This enhancement process can be described as:

S = Softmax (A \otimes A^{T})

(7)

e f_{img} = Conv (S \otimes A) + f_{img}

(8)

where

A \in ℝ^{C \times N}

represents the reshaped version of

p f_{i m g} \in ℝ^{C \times H \times W}

, and

e f_{img}

denotes the enhanced features. Here,

\otimes

refers to matrix multiplication, and

S

is the self-attention weight calculated based on channel correlations.

Through this process, LAFEM utilizes textual features during training to guide the network’s focus on anomaly information while reducing overfitting to non-anomalous regions of the equipment.

During the testing phase, since textual features are not available, LAFEM generates channel guidance solely from image features. This simplifies the formulation to:

p f_{img} = Conv (a f_{img}) \cdot f_{img}

(9)

LAFEM introduces a novel mechanism for text-guided channel attention, addressing the limitations of standard attention mechanisms in visual anomaly detection. By dynamically fusing textual and image features during training, LAFEM enables the network to focus on subtle, text-described anomalies while reducing overfitting to dominant, non-anomalous regions. The ability to operate effectively in both the training and testing phases makes LAFEM a versatile and practical component of LEAD-Net, significantly contributing to its overall performance in substation equipment defect detection.

4. Experiments and Analysis

4.1. Dataset and Setup

To evaluate the performance and robustness of our proposed LEAD-Net framework, we conducted experiments on a real-world substation equipment defect dataset, meticulously collected from a major power grid company in China. This dataset was curated to encapsulate the complexities and challenges of practical substation equipment inspections. It comprises a total of 8307 high-quality image–text pairs, where each image corresponds to a visible-light photograph of substation equipment, accompanied by a relevant textual description. The dataset was carefully designed to provide a comprehensive benchmark for anomaly detection, encompassing a diverse range of scenarios and conditions.

The dataset includes images with diverse original resolutions and aspect ratios. To ensure a uniform input for our deep learning model and facilitate efficient batch processing, all images were subjected to a preprocessing pipeline involving resizing and padding to a fixed input size of 800 × 1333 before being utilized for training and inference. In the experiment, we adopted random flipping, random scaling and cropping, random color jitter, and random rotation as methods for data augmentation.

The visible defects captured in the images span multiple categories, including cracks, discoloration, rust, corrosion, insulation damage, loose components, and other signs of wear and tear. The dataset also includes images of normal equipment without defects, ensuring a balanced representation of anomalous and non-anomalous samples.

The textual descriptions paired with each image complement the visual data by providing additional semantic context related to the equipment’s operational state, potential defects, and maintenance history. These descriptions were meticulously compiled from multiple sources, including inspection reports, maintenance logs, equipment manuals, and annotations provided by experienced substation inspectors.

The dataset focuses on eight distinct types of equipment defects, providing a fine-grained categorization of anomalies. To ensure consistency and facilitate text-guided learning, we established a standardized textual annotation protocol for each defect type. Table 1 provides representative examples of these annotations, showcasing the format and level of detail used to describe each anomaly, along with corresponding example images. The text annotations are concise and precise, focusing on the visual characteristics. For instance, a blurred meter dial is consistently described as “This meter has blurred dial”, while an oil leakage is denoted as “This equipment has oil leakage/oil stain on the ground”.

Figure 3 presents the distribution, by train/validation/test, of examples of each of the eight defect types. This carefully curated dataset, with its standardized textual annotations, is crucial for training and evaluating the text-guided feature enhancement capabilities of LEAD-Net.

The experiments are conducted using PyTorch (version 2.1.0) with Python 3.12 and CUDA version 12.6. DETR served as the base architecture. The textual features are extracted using BERT-base. The model is trained with the AdamW optimizer (10⁻⁴ learning rate, 10⁻⁴ weight decay), and a cosine annealing schedule is applied. Training runs for 50 epochs with a batch size of 16 on two NVIDIA A6000 GPUs.

We evaluate the detection performance using average precision (AP) and mean average precision (mAP). The AP is calculated for each class, and the mAP is computed by averaging the AP values across all classes. The formula for AP is given by:

A P = \sum_{k = 1}^{n} P (k) Δ R (k)

(10)

where

P (k)

is the precision at the

k

-th threshold, and

Δ R (k)

is the change in recall from the

k - 1

-th threshold to the

k

-th threshold. The mAP is then given by:

m A P = \frac{1}{N} \sum_{i = 1}^{N} A P_{i}

(11)

where

N

is the number of classes.

4.2. Comparative Experiments

Table 2 presents a detailed performance comparison between the proposed LEAD-Net and several state-of-the-art object detection algorithms, including Faster R-CNN, YOLOv9, DETR, and Deformable DETR. The evaluation metrics include average precision (AP) for eight distinct defect categories and the overall mean average precision (mAP) across all categories. LEAD-Net achieves the highest mAP of 79.51%, surpassing all baseline models, which include Faster R-CNN (68.79%), YOLOv9 (75.88%), DETR (72.64%), and Deformable DETR (77.94%). This improvement underscores the effectiveness of our proposed method, which integrates a text-guided feature enhancement mechanism during training to optimize anomaly detection in complex substation equipment images.

A closer examination of the results reveals that LEAD-Net exhibits balanced and robust performance across all defect categories, achieving the highest AP scores in IR (incorrect reading), OLS (oil leakage/stain), and DMC (damaged/missing cover), with AP values of 82.80%, 80.30%, and 84.00%, respectively. These scores significantly outperform other methods, demonstrating the model’s ability to detect subtle and context-dependent anomalies with high precision. Notably, YOLOv9 and Deformable DETR also perform well in these categories; however, LEAD-Net’s ability to outperform these strong baselines by 3.63% and 1.57% in mAP, respectively, highlights its superior feature extraction and anomaly localization capabilities.

This superior performance can be attributed to the Language-Guided Anomaly Feature Enhancement Module (LAFEM), which uses textual descriptions during training to guide the network’s attention toward anomaly-relevant features. This mechanism helps the model focus on subtle defect patterns that are otherwise overshadowed by dominant, non-anomalous features, which is a common limitation in traditional object detection algorithms. For instance, defects such as oil stains or cracked covers can often blend with the background or normal surface textures, making them harder to detect without additional semantic guidance. This is reflected in the notably high AP scores achieved by LEAD-Net in categories such as OLS (80.30%) and DMC (84.00%), which involve challenging visual characteristics, demonstrating LAFEM’s effectiveness in highlighting subtle visual cues.

While LEAD-Net achieves exceptional results in most categories, certain challenges remain in categories such as FO (foreign object), where the AP score is slightly lower at 72.20%. Although this score is still higher than the corresponding results from other algorithms (e.g., YOLOv9 achieves 66.20%, Deformable DETR achieves 70.20%), it indicates potential room for improvement. The slightly lower performance in this category can likely be attributed to the inherent ambiguity in the visual features of foreign objects, which may vary significantly in shape, size, and appearance. Addressing these issues through enhanced data augmentation strategies or category-specific feature refinement could further boost performance in future work.

Another interesting observation is LEAD-Net’s performance in categories such as ADC (abnormal door closure) and BN (bird’s nest), where it achieves AP scores of 84.30% and 79.10%, respectively. These categories often involve irregular shapes or non-standard equipment states, making anomaly detection more challenging. The results highlight LEAD-Net’s ability to generalize well across diverse and complex defect types, ensuring robust performance even in non-standard scenarios. The strong performance in ADC (84.30%) and BN (79.10%) suggests that LEAD-Net effectively leverages both visual and textual context to identify anomalies defined by relative positioning or unusual configurations.

YOLOv9 and Deformable DETR stand out among the baseline algorithms, achieving competitive mAP values of 75.88% and 77.94%, respectively. These models are well-known for their efficiency and ability to capture spatial relationships, with Deformable DETR providing additional flexibility in handling object deformations. However, LEAD-Net’s integration of multi-modal learning, particularly its use of textual guidance during training, gives it a clear edge. By leveraging textual information to refine channel attention and enhance anomaly-related features, LEAD-Net introduces a significant advancement over these baselines, particularly in scenarios where defects are context-dependent or visually subtle. The consistent performance gain across categories, culminating in the highest overall mAP, validates the efficacy of incorporating semantic guidance for object detection in complex industrial settings.

The consistent improvement of LEAD-Net across all defect categories validates its effectiveness in real-world substation inspection scenarios. The incorporation of textual features as guidance during training addresses a critical limitation in existing object detection algorithms, which often struggle with subtle defects or low inter-class variance. Moreover, the use of transformer-based architectures, such as DETR, provides LEAD-Net with the ability to model long-range dependencies and global context, further improving its performance in complex images with dense equipment layouts.

5. Ablation Study

Figure 4 and Table 3 present the results of the ablation study, which evaluates the contributions of the LAFEM module and the training-time text-guided feature mechanism to the overall performance of LEAD-Net. The study systematically removes these components to quantify their individual impacts on the model’s ability to detect anomalies across various defect categories. The results provide valuable insights into the effectiveness and synergy of these components in improving the robustness and accuracy of LEAD-Net.

The removal of LAFEM leads to a substantial 2.5 percentage point decrease in mAP (from 79.51% to 77.01%), demonstrating its critical role in enhancing the detection of both structural and visually subtle anomalies. As shown in Figure 4, the absence of LAFEM has the most pronounced effect on categories including FO (foreign object) and DMC (damaged/missing cover), where AP scores drop by 2.7 and 2.6 percentage points, respectively. These categories often involve structural irregularities or abnormalities with inconsistent visual patterns, making them particularly challenging to detect. LAFEM addresses these challenges by dynamically refining channel attention and amplifying anomaly-specific features, enabling the model to focus on subtle defects that may otherwise be overlooked.

Interestingly, even categories such as IR (incorrect reading) and OLS (oil leakage/stain), which generally exhibit more consistent visual characteristics, experience performance degradation without LAFEM. This further underscore its general importance in learning robust and discriminative features across a wide range of defect types. The significant mAP drops upon removing LAFEM highlights its fundamental role in feature refinement and anomaly localization accuracy across diverse visual characteristics.

While text input is not used during inference, the removal of training-time text guidance results in a 1.47 percentage point mAP reduction (from 79.51% to 78.04%), highlighting the successful knowledge transfer from the textual modality to the visual features during training. As seen in Figure 4, categories such as ADC (abnormal door closure) and OLS are most affected, with AP drops of approximately 1.4 percentage points each. These categories often require contextual understanding, where defects are defined by the relationships between different visual elements rather than standalone features. The textual descriptions during training provide crucial semantic cues to guide the model in learning these nuanced, context-dependent relationships, ultimately enhancing its ability to detect such anomalies during inference. The performance degradation without text guidance confirms its value in providing essential semantic context during training, particularly benefiting categories that are reliant on relational information.

For example, in the ADC category, text guidance might describe a “partially closed door” or “misaligned hinges”, which directs the model’s attention to specific spatial relationships and edge patterns, enabling better feature learning. The degradation in performance when text guidance is removed suggests that visual features alone may not sufficiently capture such contextual anomalies without the semantic grounding provided by text.

The highest performance is achieved when both LAFEM and text guidance are applied together, yielding a mAP of 79.51%. This demonstrates the strong synergy between these components, as LAFEM amplifies the anomaly-relevant features identified through textual guidance during training. The complementary nature of these mechanisms ensures that LEAD-Net is both structurally aware (via LAFEM) and contextually grounded (via text guidance), making it highly effective in handling a diverse range of anomalies.

The results also reveal that LAFEM contributes more significantly to overall performance than text guidance when evaluated individually. This is evident from the larger mAP drop (2.5 vs. 1.47 percentage points) upon its removal. However, the persistent impact of text guidance on inference-time performance highlights its critical role in enabling effective modality transfer, where knowledge from the textual modality enhances the learning of visual features.

Figure 5 provides a visual comparison of defect detection performance under different configurations of LEAD-Net, illustrating the impact of removing LAFEM and training-time text guidance. These qualitative results complement the quantitative findings by vividly demonstrating how each component contributes to more accurate and precise defect detection.

From Figure 5, it is evident that LAFEM plays a critical role in refining features and improving localization precision for defect detection. Without LAFEM (column d), the detection results show significant localization errors. For example, in the top image, the bounding box for the “damaged/missing cover” is excessively large, covering a substantial portion of non-defective areas. Similarly, in the middle image, the bounding box for the “damaged dial” is misaligned, extending beyond the actual damaged region. The absence of LAFEM hinders the model’s ability to suppress irrelevant features and focus on defect-specific areas, leading to imprecise and overly broad detections. This highlights LAFEM’s importance in enhancing the detection of structural anomalies, such as missing or damaged components.

Training-time text guidance (column c) contributes significantly to the model’s understanding of semantic context. Without text guidance, the model’s performance improves compared to when LAFEM is absent, but it still falls short of optimal precision. For instance, in the top image, the bounding box for the “damaged/missing cover” is slightly smaller than in column d but remains misaligned, capturing portions of non-defective areas. In the middle image, the “damaged dial” is detected with reasonable accuracy; however, the model struggles with more context-dependent defects, where semantic relationships (e.g., “partially missing” or “misaligned”) are critical for accurate detection. These results indicate that text guidance provides meaningful semantic cues during training, enabling the model to better capture contextual relationships and achieve superior performance during inference.

The complete LEAD-Net model (column b), which incorporates both LAFEM and text guidance, achieves the best detection performance. The bounding boxes are tightly aligned with the ground truth annotations (column a) across all examples, demonstrating the model’s robustness in detecting both subtle defects (e.g., damaged dials) and structural anomalies (e.g., damaged powerline components). LAFEM enhances feature refinement and precise localization, while text guidance provides global semantic context, ensuring the model can handle both visually subtle and contextually complex defects. The synergy between these components addresses the dual challenges of feature precision and context awareness, making LEAD-Net highly effective for real-world substation equipment inspection tasks.

6. Conclusions

This study presents LEAD-Net, a novel and effective framework for substation equipment defect detection. Unlike traditional methods, LEAD-Net uniquely leverages training-time text guidance, in the form of defect descriptions, to significantly enhance anomaly detection accuracy. The core innovation, the Language-Guided Anomaly Feature Enhancement Module (LAFEM), uses these descriptions to refine channel attention within the image encoder. This enables the network to focus on subtle visual cues related to anomalies that might otherwise be missed, effectively addressing a key limitation of existing approaches. A crucial advantage of LEAD-Net is its ability to operate without text input during inference, ensuring practicality for real-world deployment and minimizing computational overheads. The experimental results when the model was used on a challenging real-world dataset demonstrate LEAD-Net’s superior performance compared to state-of-the-art object detection methods, including Faster R-CNN, YOLOv9, DETR, and Deformable DETR.

Comprehensive ablation studies further validated the individual and combined contributions of both LAFEM and the training-time text guidance, emphasizing the synergistic benefits of this approach. Future work will focus on expanding the range of textual information, exploring methods to further enhance LAFEM, and investigating the deployment of LEAD-Net on edge computing devices for real-time inspection. The presented methods have shown significant potential for related anomaly detection tasks, paving the way for more advanced and practical defect detection solutions in industrial applications.

Author Contributions

Methodology, L.Z. and J.K.; validation, S.X. and L.L.; formal analysis, Y.T.; investigation, Y.Z.; writing—original draft preparation, L.Z.; writing—review and editing, J.K.; visualization and supervision, L.Z. and Y.T.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of State Grid Sichuan Electric Power Company, grant number 521997240003.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. The datasets and code generated and/or analyzed during the current study are not publicly deposited at this time but are available from the corresponding author upon reasonable request. Further inquiries, including requests for data and code, can be directed to the correspondence author.

Conflicts of Interest

Authors Junwei Kuang and Lin Li were employed by the State Grid Luzhou Electric Power Supply Company. Author Yingjie Zhou was employed by the Chuanshen Hongan Intelligent (Shenzhen) Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Yu, H.; Niu, S.; Shang, Y.; Shao, Z.; Jia, Y.; Jian, L. Electric Vehicles Integration and Vehicle-to-Grid Operation in Active Distribution Grids: A Comprehensive Review on Power Architectures, Grid Connection Standards and Typical Applications. Renew. Sustain. Energy Rev. 2022, 168, 112812. [Google Scholar] [CrossRef]
Li, X.; Zhang, L.; Zhang, J.; Li, J. Research and application of joint inspection technology for box type intelligent substations. In Proceedings of the 2024 International Conference on Advances in Electrical Engineering and Computer Applications (AEECA), Dalian, China, 16–18 August 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 147–153. [Google Scholar]
Mohanty, A.; Ramasamy, A.K.; Verayiah, R.; Bastia, S.; Dash, S.S.; Cuce, E.; Khan, T.M.Y.; Soudagar, M.E.M. Power System Resilience and Strategies for a Sustainable Infrastructure: A Review. Alex. Eng. J. 2024, 105, 261–279. [Google Scholar] [CrossRef]
Zhou, Y.; Li, S.; Li, Q.; Wei, F.; Yang, D.; Liu, J.; Yu, D. Energy Savings in Direct Air-Side Free Cooling Data Centers: A Cross-System Modeling and Optimization Framework. Energy Build. 2024, 308, 114003. [Google Scholar] [CrossRef]
Han, R.; Liu, L.; Liu, S.; Jiang, P.; Han, Y.; Yang, Z. Research and application of substation intelligent inspection technology based on multi spectral image recognition. In Proceedings of the 2020 IEEE International Conference on High Voltage Engineering and Application (ICHVE), Beijing, China, 6–10 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–4. [Google Scholar]
AJ, C.; Salam, M.A.; Rahman, Q.M.; Wen, F.; Ang, S.P.; Voon, W. Causes of Transformer Failures and Diagnostic Methods—A Review. Renew. Sustain. Energy Rev. 2018, 82, 1442–1456. [Google Scholar] [CrossRef]
Lu, S.; Zhang, Y.; Su, J. Mobile Robot for Power Substation Inspection: A Survey. IEEE/CAA J. Autom. Sin. 2017, 4, 830–847. [Google Scholar] [CrossRef]
Qin, Z.; Xu, Z.D.; Sun, Q.C.; Poovendran, P.; Balamurugan, P. Investigation of Intelligent Substation Inspection Robot by Using Mobile Data. Int. J. Human. Robot. 2022, 20, 2240003:1–2240003:21. [Google Scholar] [CrossRef]
Dai, K.; Shao, J.; Gong, B.; Jing, L.; Chen, Y. CLIP-FSSC: A Transferable Visual Model for Fish and Shrimp Species Classification Based on Natural Language Supervision. Aquac. Eng. 2024, 107, 102460. [Google Scholar] [CrossRef]
Nguyen, V.N.; Jenssen, R.; Roverso, D. Automatic Autonomous Vision-Based Power Line Inspection: A Review of Current Status and the Potential Role of Deep Learning. Int. J. Electr. Power Energy Syst. 2018, 99, 107–120. [Google Scholar] [CrossRef]
Oliveira, B.A.S.; De Faria Neto, A.P.; Fernandino, R.M.A.; Carvalho, R.F.; Fernandes, A.L.; Guimaraes, F.G. Automated Monitoring of Construction Sites of Electric Power Substations Using Deep Learning. IEEE Access 2021, 9, 19195–19207. [Google Scholar] [CrossRef]
Albuquerque Filho, J.E.D.; Brandao, L.C.P.; Fernandes, B.J.T.; Maciel, A.M.A. A Review of Neural Networks for Anomaly Detection. IEEE Access 2022, 10, 112342–112367. [Google Scholar] [CrossRef]
Tulbure, A.-A.; Tulbure, A.-A.; Dulf, E.-H. A Review on Modern Defect Detection Models Using DCNNs—Deep Convolutional Neural Networks. J. Adv. Res. 2021, 35, 33–48. [Google Scholar] [CrossRef] [PubMed]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
Chen, W.; Huang, Z.; Mu, Q.; Sun, Y. PCB Defect Detection Method Based on Transformer-YOLO. IEEE Access 2022, 10, 129480–129489. [Google Scholar] [CrossRef]
Kang, L.; Ge, Y.; Huang, H.; Zhao, M. Research on PCB defect detection based on SSD. In Proceedings of the 2022 IEEE 4th International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Dali, China, 12–14 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1315–1319. [Google Scholar]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
Wróbel, J.; Bury, P.; Zając, M.; Kierzkowski, A.; Tubek, S.; Blaut, J. Fault Detection and Diagnostic Methods for Railway Systems—A Literature Survey. Adv. Sci. Technol. Res. J. 2024, 18, 361–391. [Google Scholar] [CrossRef]
Lu, X.; Quan, W.; Gao, S.; Zhang, G.; Feng, K.; Lin, G.; Chen, J.X. A Segmentation-Based Multitask Learning Approach for Isolating Switch State Recognition in High-Speed Railway Traction Substation. IEEE Trans. Intell. Transport. Syst. 2022, 23, 15922–15939. [Google Scholar] [CrossRef]
de Jong, E.C.W.; Vaessen, P.T.M. European experiences with an intelligent substation intended for smart distribution networks. In Proceedings of the IEEE Power and Energy Society Conference and Exposition in Africa: Intelligent Grid Integration of Renewable Energy Resources (PowerAfrica), Johannesburg, South Africa, 9–13 July 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 1–5. [Google Scholar]
Clinchant, S.; Csurka, G.; Chidlovskii, B. A domain adaptation regularization for denoising autoencoders. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; Volume 2: Short Papers, pp. 26–31. [Google Scholar]
Du, F.-J.; Jiao, S.-J. Improvement of Lightweight Convolutional Neural Network Model Based on YOLO Algorithm and Its Research in Pavement Defect Detection. Sensors 2022, 22, 3537. [Google Scholar] [CrossRef]
Li, Y.; Huang, H.; Xie, Q.; Yao, L.; Chen, Q. Research on a Surface Defect Detection Algorithm Based on MobileNet-SSD. Appl. Sci. 2018, 8, 1678. [Google Scholar] [CrossRef]
Zhao, H.; Gao, Y.; Deng, W. Defect Detection Using Shuffle Net-CA-SSD Lightweight Network for Turbine Blades in IoT. IEEE Internet Things J. 2024, 11, 32804–32812. [Google Scholar] [CrossRef]
Yang, L.; Jin, R.; Sukthankar, R.; Liu, Y. An efficient algorithm for local distance metric learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Boston, MA, USA, 16–20 July 2006; pp. 543–548. [Google Scholar]
Moutafis, P.; Leng, M.; Kakadiaris, I.A. An Overview and Empirical Comparison of Distance Metric Learning Methods. IEEE Trans. Cybern. 2017, 47, 612–625. [Google Scholar] [CrossRef]
Yun, S.; Park, J.; Lee, K.; Shin, J. Regularizing class-wise predictions via self-knowledge distillation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 13873–13882. [Google Scholar]
Zhao, C.; Shu, X.; Yan, X.; Zuo, X.; Zhu, F. RDD-YOLO: A Modified YOLO for Detection of Steel Surface Defects. Measurement 2023, 214, 112776. [Google Scholar] [CrossRef]
Sanderson, K. GPT-4 is here: What scientists think. Nature 2023, 615, 773. [Google Scholar] [CrossRef] [PubMed]
Wang, J.; Song, Q.; Qian, L.; Li, H.; Peng, Q.; Zhang, J. SubstationAI: Multimodal Large Model-Based Approaches for Analyzing Substation Equipment Faults. arXiv 2024, arXiv:2412.17077. [Google Scholar]
Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024. [Google Scholar]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar]

Figure 1. The proposed lead-net framework for text-guided anomaly detection. During the training phase, textual information is incorporated to guide the network in learning defect features.

Figure 2. The architecture and components of the proposed LAFEM. Textual information assists the network in learning the relationships between feature channels.

Figure 3. Data distribution of defect types. The approximate ratio of the training, validation, and test sets for various types of samples is 7:1:2.

Figure 4. Ablation study results for LEAD-Net; “w/o” denotes “without”. The involvement of text and LAFEM significantly improved the proposed LEAD-Net’s detection accuracy on various targets.

Figure 5. Qualitative results of the ablation study on sample images from the substation defect dataset. The proposed method achieved the best performance compared to other advanced methods. (a) Ground truth (GT) annotations. (b) LEAD-Net (Proposed) detections. (c) Detections without training-time text guidance. (d) Detections without LAFEM.

Table 1. Examples of image–text pairs. The text annotation for each sample describes which device underwent what changes.

Image (Example)	Text
	This meter has a blurred dial.
	This meter has a damaged dial.
	This meter has an incorrect reading.
	This equipment has oil leakage/oil stains on the ground.
	This cabinet door has an abnormal door closure.
	This equipment has a foreign object.
	There is a bird’s nest on the equipment.
	This equipment has a damaged or missing cover plate.

Table 2. Comparative performance of object detection algorithms on various defect categories. The proposed LEAD-Net achieved the best overall performance and exhibited optimal accuracy across most categories.

Algorithm	AP for Each Category (%)								mAP (%)
Algorithm	BLD *	DAD	IR	OLS	ADC	FO	BN	DMC	mAP (%)
Faster R-CNN [14]	61.20	69.80	70.30	67.10	75.90	58.60	76.40	71.00	68.79
YOLOv9 [31]	70.30	74.70	80.30	75.90	77.30	66.20	78.90	83.40	75.88
DETR [32]	63.10	72.40	77.90	73.60	75.80	63.40	75.20	79.70	72.64
Deformable DETR [17]	72.50	79.90	81.00	76.30	81.90	70.20	79.60	82.10	77.94
LEAD-Net (Proposed)	73.30	80.10	82.80	80.30	84.30	72.20	79.10	84.00	79.51

* (BLD = blurred dial; DAD = damaged dial; IR = incorrect reading; OLS = oil leakage/stain; ADC = abnormal door closure; FO = foreign object; BN = bird’s nest; DMC = damaged/missing cover).

Table 3. Ablation study: Impact of LAFEM and text guidance on overall mAP. Text and LAFEM notably improved the proposed LEAD-Net’s detection accuracy for various targets.

Configuration	mAP
LEAD-Net (Full)	79.51
Without LAFEM	77.01
Without Text Guidance	78.04

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, L.; Kuang, J.; Teng, Y.; Xiang, S.; Li, L.; Zhou, Y. LEAD-Net: Semantic-Enhanced Anomaly Feature Learning for Substation Equipment Defect Detection. Processes 2025, 13, 2341. https://doi.org/10.3390/pr13082341

AMA Style

Zhang L, Kuang J, Teng Y, Xiang S, Li L, Zhou Y. LEAD-Net: Semantic-Enhanced Anomaly Feature Learning for Substation Equipment Defect Detection. Processes. 2025; 13(8):2341. https://doi.org/10.3390/pr13082341

Chicago/Turabian Style

Zhang, Linghao, Junwei Kuang, Yufei Teng, Siyu Xiang, Lin Li, and Yingjie Zhou. 2025. "LEAD-Net: Semantic-Enhanced Anomaly Feature Learning for Substation Equipment Defect Detection" Processes 13, no. 8: 2341. https://doi.org/10.3390/pr13082341

APA Style

Zhang, L., Kuang, J., Teng, Y., Xiang, S., Li, L., & Zhou, Y. (2025). LEAD-Net: Semantic-Enhanced Anomaly Feature Learning for Substation Equipment Defect Detection. Processes, 13(8), 2341. https://doi.org/10.3390/pr13082341

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

LEAD-Net: Semantic-Enhanced Anomaly Feature Learning for Substation Equipment Defect Detection

Abstract

1. Introduction

2. Research Background

2.1. Traditional Approaches and Early Automation

2.2. Deep-Learning-Based Defect Detection

2.2.1. CNN-Based Methods

2.2.2. Addressing Data Scarcity and Imbalances

2.2.3. Transformer-Based Models and Multimodal Analysis

2.3. Research Gaps and Motivation for This Work

3. Proposed Method

3.1. Framework

3.2. LAFEM Module

4. Experiments and Analysis

4.1. Dataset and Setup

4.2. Comparative Experiments

5. Ablation Study

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI