MDCNet: A Multi-Neighborhood Dense Connectivity Network for Infrared Transmission Line Clamp Segmentation

An, Guocheng; Lu, Wanrong; Zhai, Guohua; Wang, Xiaolong; Zhang, Yanwei

doi:10.3390/electronics15091926


by Guocheng An ¹, Wanrong Lu ¹,*, Guohua Zhai ², Xiaolong Wang ¹,³ and Yanwei Zhang

¹ Industry Digital Intelligence Division, ECCOM Network System Co., Ltd., Shanghai 200127, China

² School of Information and Electronic Engineering, East China Normal University, Shanghai 200062, China

³ School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(9), 1926; https://doi.org/10.3390/electronics15091926

Submission received: 2 April 2026 / Revised: 24 April 2026 / Accepted: 29 April 2026 / Published: 2 May 2026

(This article belongs to the Special Issue AI-Based Image Processing Detection and Classification Analysis for Multidisciplinary Approaches)


Abstract

Advancements in infrared imaging technology have introduced a novel perspective for inspecting power transmission lines. Nevertheless, the inherent low contrast and indistinct edges of infrared images present significant challenges, rendering the direct application of traditional semantic segmentation algorithms unsatisfactory. To mitigate this problem, we propose a multi-neighborhood densely connected network architecture. This framework incorporates two pivotal modules: the Multi-Head Squeeze-and-Excitation (MHSE) module and the Multi-Neighborhood Feature Fusion (MNFF) module. The MHSE enhances local feature representations by capturing nuanced feature interactions, thereby alleviating the issue of imbalanced global feature weight distribution. The MNFF aggregates feature data from multiple adjacent nodes at each node’s input, which not only facilitates the integration of multi-scale target features but also leverages neighborhood information to precisely localize and amplify features within specific regions. Furthermore, we have built the first Infrared Dataset of Power Transmission Line Suspension Clamp (CLAMPTISS) to substantiate our approach. Empirical evidence demonstrates that our proposed network surpasses state-of-the-art networks across three key metrics: the mean Intersection over Union (mIoU) and localization accuracy (Pd) have increased by 8.3% and 13.3%, respectively, while the false alarm rate (Fa) has decreased by 38.2%.

Keywords:

infrared imaging; transmission line inspection; semantic segmentation; attention mechanism; feature fusion

1. Introduction

Transmission lines, the backbone infrastructure for power delivery, are essential to the reliability and quality of the power supply [1,2,3]. However, they are exposed to harsh natural environments over long periods, which makes them susceptible to a variety of external factors. One of the most prevalent issues is abnormal temperature rise in suspension clamps, which can cause local overheating, insulation deterioration, and even line failures, posing a severe threat to grid safety.

Traditional inspection methods predominantly rely on manual patrols, which are not only inefficient but also entail significant safety risks, especially in complex terrains or adverse weather conditions. In recent years, the swift advancement of Unmanned Aerial Vehicle (UAV) technology has offered a revolutionary solution for transmission line inspection. UAVs have advantages such as high mobility, strong adaptability, and the capability to operate in complex environments, enabling close-range detection across the transmission network and effectively compensating for the limitations of manual methods [4,5]. When equipped with infrared thermography, UAVs can capture the thermal distribution of clamp equipment irrespective of lighting conditions or atmospheric interference, facilitating the rapid identification of abnormal temperature zones and providing an efficient technical means for safety monitoring [6,7,8].

Despite these technological strides, accurately detecting infrared target pixels remains a formidable challenge in practical applications. Factors such as long imaging distances, complex shooting environments, and the typically small proportion of clamp target pixels within the overall image contribute to this difficulty [9,10]. These issues not only heighten the complexity of detection tasks but may also lead to misjudgments or missed detections, thereby undermining the reliability and efficiency of transmission line inspections. Consequently, achieving high-precision infrared target detection under complex conditions has become a critical research focus in both academia and industry.

Traditional solutions to this problem, including filtering methods, local contrast measures, and low-rank representations, often hinge on manually designed features or are highly sensitive to hyperparameter tuning. While these approaches perform adequately for salient targets, they exhibit substantial limitations in robustness and accuracy when applied to infrared images characterized by low contrast, high noise, or small target sizes [11]. A growing body of research substantiates that such conventional techniques fall short of meeting the high-precision requirements of modern transmission line inspection [12].

In contrast, semantic segmentation methods based on deep learning networks can learn hierarchical features in a data-driven manner and have become the mainstream baseline for infrared pixel-level detection [13,14,15,16]. For example, Zhang et al. [17] proposed an attention-guided pyramid context network that exploits both global information and multi-scale features, with a context attention module to capture contextual information and a context pyramid module to handle multi-scale feature fusion. Similarly, Dai et al. [18,19] explored the challenges of fusing high- and low-level semantic information and minimizing data-driven features. They developed an asymmetric context modulation module and a novel model-driven deep network for infrared small target detection, integrating discriminative networks with traditional model-driven approaches to overcome the limited receptive fields imposed by convolution kernels and to encode long-range contextual interactions. Li et al. [16] tackled the issues of limited target pixels and insufficient detail in infrared images by designing a densely nested interaction module that enables progressive interaction between high-level and low-level features.

Research on Transformer-based infrared small target detection has also developed rapidly in recent years. Hu et al. [20] used a dynamic attention Transformer to simulate central differential convolution (CDC) and extract target gradient details. Their CNN-free design combines CDC with a global feature extraction module (GFEM) to balance local details against the global background, effectively improving the detection accuracy of weak targets in complex backgrounds. The Transformer-based HAFNet [21] introduces a dual-branch semantic perception Transformer module, a hierarchical feature fusion encoder, and attention-guided skip connections. Together these improve the detection accuracy of infrared small targets in complex backgrounds and reduce false alarms from three aspects: global-local feature modeling, multi-scale feature fusion, and accurate transmission of target features. HAFNet has achieved excellent performance on mainstream open-source datasets. However, the Transformer architecture is costly to train and difficult to optimize: compared with mature convolutional networks, it requires more GPU memory and longer training cycles, is more sensitive to hyperparameters, and is less stable and less efficient to train.

Although these recent advances have yielded impressive results on general open-source benchmarks, their direct application to transmission line inspection scenarios, particularly on our infrared suspension clamp dataset, results in suboptimal performance. This performance gap is primarily due to the increased environmental complexity in inspection scenarios, where background interference is more pronounced and targets are more easily obscured.

To address these specific challenges, this paper introduces a Multi-Neighborhood Dense Connectivity Network (MDCNet) for semantic segmentation of suspension clamp equipment in infrared transmission line images. Our network incorporates two innovative modules:

A Multi-Head Squeeze-and-Excitation (MHSE) module, designed to capture fine-grained channel-wise interactions and enhance local features while alleviating the imbalanced weight distribution commonly associated with global attention mechanisms.

A Multi-Neighborhood Feature Fusion (MNFF) module, which aggregates feature information from multiple adjacent nodes to effectively integrate multi-scale contextual information and focus on target regions for enhanced feature representation.

The primary contributions of this work are twofold:

Dataset Contribution: Given the scarcity of publicly available infrared transmission line data, we have constructed and released CLAMPTISS, which is the first publicly available high-quality infrared dataset for pixel-level detection of suspension clamps. This dataset is specifically designed to train and evaluate deep learning models in this field, addressing a critical gap in the research community.
Methodological Contribution: We propose MDCNet, a novel semantic segmentation architecture tailored for infrared transmission line images. By integrating the MHSE and MNFF modules, the network effectively leverages limited infrared details and fuses semantic information across different scales and levels, achieving robust performance in complex inspection environments.

To validate the effectiveness of MDCNet, we conducted extensive experiments comparing it with recent state-of-the-art methods in infrared target segmentation. Comprehensive ablation studies were performed to evaluate the individual contributions of the MHSE and MNFF modules, including an analysis of variant structures at different depths. Experimental results on the CLAMPTISS dataset demonstrate that our proposed network achieves superior performance across multiple evaluation metrics.

2. Related Works

This section first outlines the practical context and technical workflow for infrared inspection of transmission line clamps, and then introduces CLAMPTISS, the first infrared dataset specifically designed for detecting suspension clamps in real-world inspections.

2.1. Pixel Segmentation of Infrared Clamp Equipment for Transmission Lines

Statistical data from a Chinese provincial power company shows that from 2019 to 2021, overheating in transmission line clamps caused over a hundred outages, with economic losses totaling tens of millions of yuan [22,23,24,25]. These incidents underscore the importance of reliable clamp monitoring for grid safety and efficiency.

Typical clamps include suspension, tension, and parallel groove clamps. This study focuses on suspension clamps because of their widespread use and thermal vulnerability.

Heat in clamps usually comes from current-induced heating, with notable temperature fluctuations enabling thermal anomaly detection. Thus, infrared thermography is the primary method for non-contact temperature monitoring.

Figure 1 shows the standard workflow for temperature extraction from infrared imaging, comprising three main stages:

Infrared Image Capture: Thermal images of transmission lines and hardware, including clamps, are taken using specialized infrared equipment on UAVs or handheld devices.

Target Localization via Semantic Segmentation: Semantic segmentation algorithms pinpoint pixel coordinates of clamp regions in infrared images. This pixel-level localization is vital for accurate temperature measurement, isolating targets from background.

Temperature Analysis and Diagnosis: Temperature values are extracted from identified coordinates using SDK tools from the infrared equipment maker. Operators analyze this data to assess equipment condition, detect anomalies, and guide maintenance.

This workflow highlights the crucial role of accurate semantic segmentation, which directly affects the reliability of temperature analysis. Inaccuracies in target localization can lead to misdiagnosis and overlooked failures.

2.2. Infrared Dataset of Suspension Clamp for Transmission Lines

To tackle the shortage of publicly accessible datasets for infrared transmission line clamp segmentation, we created CLAMPTISS (Infrared Suspension Clamp Dataset for Transmission Line Inspection). It includes 1475 infrared images, divided into 1046 training and 429 test samples, all captured by drones with high-resolution infrared cameras during routine main grid transmission line inspections.

As shown in Figure 2, the CLAMPTISS dataset captures the complexity of real-world inspections. The background often includes interfering elements, such as dense vegetation, buildings, mountains, and artificial lighting, that can be mistaken for target clamps. This background noise makes accurate semantic segmentation difficult and sets our dataset apart from simpler, lab-collected infrared datasets.

Target Size Distribution: Clamps in CLAMPTISS are typically small targets in infrared imagery [26]. Among 2158 annotated clamps across all images, the pixel proportion distribution (ratio of target to total image pixels) is:

<0.5%: 577 instances (proportion 26.7%);
[0.5%, 1%): 901 instances (proportion 41.8%);
[1%, 5%): 583 instances (proportion 27.0%);
≥5%: 97 instances (proportion 4.5%).

Note: [a, b) means values ≥ a and <b.

This shows over 68% of targets occupy under 1% of image pixels, confirming small objects’ dominance. These traits make CLAMPTISS a tough benchmark for evaluating infrared small-target segmentation algorithms in real-world transmission line inspections.
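The proportions quoted above can be re-derived from the instance counts. The short check below is only an arithmetic sketch: the counts are copied from the list above, while the variable names are illustrative.

```python
# Instance counts per target-size bin, copied from the list above.
bins = {
    "<0.5%": 577,
    "[0.5%, 1%)": 901,
    "[1%, 5%)": 583,
    ">=5%": 97,
}

total = sum(bins.values())  # total number of annotated clamps
shares = {name: 100.0 * count / total for name, count in bins.items()}

# Share of targets that occupy less than 1% of the image pixels.
small_share = shares["<0.5%"] + shares["[0.5%, 1%)"]

for name, share in shares.items():
    print(f"{name:>11}: {share:.1f}%")
print(f"under 1% of image pixels: {small_share:.1f}%")
```

The four bins sum to the 2158 annotated clamps, and the two smallest bins together account for roughly 68.5% of them.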

3. Materials and Methods

This section presents the overall architecture of MDCNet, followed by the MHSE attention module and the MNFF feature fusion module.

3.1. MDCNet Overall Structure

As shown in Figure 3, the proposed MDCNet takes a single infrared thermal image as input and processes it through a hierarchical feature extraction backbone based on ResNet [27]. To enhance the representational power, an MHSE module is inserted at the output of the backbone. This module aggregates fine-grained features across different channel groups, enabling the model to capture local channel-wise interactions and focus on subtle details within a constrained receptive field. The resulting feature maps are then fed into the MNFF module, which performs cross-node feature interaction and aggregation. By leveraging information from multiple neighborhood nodes, the MNFF module effectively integrates multi-scale contextual cues and refines the target representations. Finally, the fused multi-scale features are passed to a segmentation head to generate the pixel-level prediction map.

3.2. Multi-Head Squeeze-and-Excitation Module

Although the residual connections in ResNet enable deeper network architectures and the extraction of high-level abstract features, they struggle to simultaneously capture global semantic information and local fine-grained details—both of which are critical for accurate infrared target segmentation.

To address this limitation, we introduce the Multi-Head Squeeze-and-Excitation (MHSE) module into the ResNet backbone, as shown in Figure 4. Unlike the traditional SE module [28], which relies solely on global average pooling to extract channel-wise features, MHSE explicitly models fine-grained interactions between channels. The conventional SE mechanism applies a simple weighting strategy that often overlooks intra-group feature correlations, leading to a loss of discriminative capacity.

MHSE addresses these shortcomings by partitioning the channels into multiple subgroups, with each subgroup computing attention independently. This design enables the module to:

Capture finer feature interactions across channels within localized groups;
Enhance local feature representations that are essential for small target detection;
Mitigate the imbalanced distribution of global feature weights commonly observed in standard SE blocks.

By preserving intra-group feature characteristics while enabling cross-group information flow, MHSE achieves a more balanced and expressive channel attention mechanism tailored for infrared transmission line inspection scenarios.

The idea behind MHSE is to group the channels of the input feature map so that each group (i.e., a head) computes attention independently. The feature map X with C channels undergoes a channel transformation to obtain the feature map U with $\tilde{C}$ channels, i.e.,

$$F_{ct}\left(X_{B \times C \times H \times W}\right) \Rightarrow U_{\tilde{B} \times \tilde{C} \times H \times W} \tag{1}$$

where $F_{ct}(\cdot)$ denotes the channel transformation that converts the input X into U. Both tensors use the four-dimensional layout B × C × H × W and satisfy $\tilde{B} = B \times h$ and $\tilde{C} = C / h$, where h is the number of groups (heads).

$F_{sq}(\cdot)$ uses global average pooling to compress the features along the spatial dimensions:

$$F_{sq}(U) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U(i, j) \tag{2}$$

The $F_{ex}(\cdot)$ operation consists of two fully connected layers and a ReLU activation. The two fully connected layers perform channel compression and recovery, respectively. Finally, a sigmoid activation function generates one weight per channel:

$$F_{ex}(F_{sq}, W) = \mathrm{sigmoid}\left(W_{2}\, \delta\left(W_{1} F_{sq}\right)\right) \tag{3}$$

In this formula, $W_{1} F_{sq}$ represents the first fully connected layer, $\delta(\cdot)$ represents the ReLU activation function, and $W_{2}\, \delta(\cdot)$ represents the second fully connected layer.

Then, the weight vector generated in the previous step is multiplied by the input feature map U through $F_{scale}(\cdot, U)$ to obtain the feature map $\tilde{U}$:

$$\tilde{U} = F_{scale}(F_{ex}, U) \tag{4}$$

Finally, the shape of the feature map $\tilde{U}$ is restored to obtain the final weighted feature map $\tilde{X}$.
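The four steps in Equations (1) to (4) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mhse`, the reduction ratio, the absence of bias terms, and the sharing of the two fully connected weight matrices across heads (a consequence of folding the heads into the batch axis) are all assumptions made for the example.

```python
import numpy as np


def mhse(x, w1, w2, heads):
    """Multi-head SE sketch for x of shape (B, C, H, W).

    w1 has shape (C/h/r, C/h) and w2 has shape (C/h, C/h/r), where r is a
    hypothetical reduction ratio. Weights are shared across heads (assumption).
    """
    b, c, height, width = x.shape
    if c % heads != 0:
        raise ValueError("C must be divisible by the number of heads")
    c_head = c // heads

    # Eq. (1): fold the heads into the batch axis -> (B*h, C/h, H, W).
    u = x.reshape(b * heads, c_head, height, width)

    # Eq. (2): squeeze with global average pooling over H and W -> (B*h, C/h).
    s = u.mean(axis=(2, 3))

    # Eq. (3): excitation, FC -> ReLU -> FC -> sigmoid, one weight per channel.
    hidden = np.maximum(s @ w1.T, 0.0)
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2.T)))

    # Eq. (4): rescale every channel of U by its attention weight.
    u_tilde = u * weights[:, :, None, None]

    # Restore the original (B, C, H, W) layout.
    return u_tilde.reshape(b, c, height, width)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4, 4))
heads, reduction = 2, 2
c_head = x.shape[1] // heads
w1 = rng.standard_normal((c_head // reduction, c_head))
w2 = rng.standard_normal((c_head, c_head // reduction))
x_tilde = mhse(x, w1, w2, heads)
print(x_tilde.shape)  # (2, 8, 4, 4)
```

Because each head pools only over its own C/h channels, the attention weights of one group are computed without reference to the statistics of the other groups. With `heads=1` the sketch reduces to a standard SE block.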

3.3. Multi-Neighborhood Feature Fusion Module

Infrared thermal images inherently lack the rich textural details present in visible-light images while exhibiting significant background noise. These characteristics pose substantial challenges for accurately segmenting small targets such as suspension clamps. To enhance the model’s sensitivity to target regions and improve defect perception, we design a progressive multi-neighborhood feature fusion (MNFF) structure, as shown in Figure 5.

The core innovation of the MNFF module lies in its progressive aggregation strategy. As the network deepens, the input to each node progressively incorporates feature information from multiple neighboring nodes at different scales and levels. This design enables two key advantages:

Multi-Scale Target Fusion: By aggregating features from adjacent nodes, the module effectively integrates contextual information across different receptive fields, enabling more robust handling of target scale variations.
Target-Focused Feature Enhancement: The incorporation of neighborhood information guides the network to focus more attentively on target regions, suppressing background interference while capturing higher-level and semantically richer representations.

In contrast to conventional feature fusion methods that typically aggregate information from a single previous layer or a fixed set of neighbors, MNFF dynamically leverages a broader neighborhood context. This progressive multi-neighborhood aggregation ensures that both fine-grained details and high-level semantics are preserved and reinforced throughout the network, ultimately leading to more precise segmentation of infrared clamp targets in complex backgrounds.

Each node in the MNFF is equipped with a ResNet feature extraction network incorporating MHSE. $X_{n}^{C}$ denotes the convolution configuration parameters of conv2, where C represents the number of convolution channels for conv2 and n signifies the repetition count of conv2. The update of a single node can then be expressed as:

$$L(i, j) = F_{i,j}\left(Z(i, j)\right) + Z(i, j) \tag{5}$$

Here, i represents the oblique layer index (represented by different node colors), j represents the horizontal layer index, and Z(i, j) represents the multi-nearest-neighbor input, i.e.,

$$Z(i, j) = P_{i-1, i}\left(L(i-1, j)\right) + Q_{i+1, i}\left(L(i+1, j+1)\right) + \mathrm{Concat}_{i' > j,\, j' > j}\left(R_{i', i}\left(L(i', j')\right)\right) \tag{6}$$

where $i'$ and $j'$ are the “source node” coordinates of the multi-nearest-neighbor skip connection (red dashed line), $P_{i-1, i}$ represents horizontal input at the same level, $Q_{i+1, i}$ represents diagonal input to the parent node, and $R_{i', i}$ represents multi-nearest-neighbor skip input.

Finally, the multi-scale feature maps of the last nodes in the second through fifth diagonal layers are fed into the loss function for network optimization. The calculation can be expressed as:

$$L_{total} = \frac{1}{N} \sum_{i=2,\, j=2}^{5} \mathrm{SoftIoULoss}\left(U\left(L(i, j)\right), Y_{gt}\right) \tag{7}$$

where N represents the number of nodes, $U(\cdot)$ represents upsampling, and $Y_{gt}$ represents the ground-truth label.
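The averaging in Equation (7) can be illustrated with a small NumPy sketch. Several details are assumptions made for the example rather than facts from the paper: the Soft-IoU loss is written in its common form (one minus the ratio of soft intersection to soft union, with a small smoothing constant), upsampling is nearest-neighbor, and the maps are square with integer scale factors.

```python
import numpy as np


def soft_iou_loss(prob, target, eps=1e-6):
    """1 - soft IoU between predicted probabilities and a binary mask."""
    inter = np.sum(prob * target)
    union = np.sum(prob + target - prob * target)
    return 1.0 - (inter + eps) / (union + eps)


def upsample_nearest(feat, factor):
    """Nearest-neighbor upsampling of a 2-D map by an integer factor (assumption)."""
    return np.repeat(np.repeat(feat, factor, axis=0), factor, axis=1)


def total_loss(node_probs, target):
    """Average the Soft-IoU loss over all supervised node outputs, as in Eq. (7)."""
    losses = []
    for prob in node_probs:
        factor = target.shape[0] // prob.shape[0]
        losses.append(soft_iou_loss(upsample_nearest(prob, factor), target))
    return sum(losses) / len(losses)


# Toy ground truth: a 16 x 16 mask with one 8 x 8 target block.
target = np.zeros((16, 16))
target[:8, :8] = 1.0

# Four node outputs at decreasing resolution that all match the target.
perfect = [target, target[::2, ::2], target[::4, ::4], target[::8, ::8]]
# Four node outputs that predict background everywhere.
empty = [np.zeros_like(p) for p in perfect]

loss_perfect = total_loss(perfect, target)
loss_empty = total_loss(empty, target)
print(loss_perfect, loss_empty)
```

Supervising several nodes at once means every resolution receives a gradient signal tied directly to the ground truth, rather than only the final output.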

4. Experiment and Discussion

4.1. Dataset and Evaluation

A series of comparative experiments were carried out on the CLAMPTISS dataset, which comprises 1046 training images and 429 test images. In addition, to better validate the generalization of MDCNet, we conducted quantitative analysis on two other publicly available datasets in the field of infrared small target detection. The IRSTD-1k dataset contains 1001 infrared images, while the NUAA-SIRST dataset contains 427 infrared images collected in real environments. The image resolutions used for these two datasets are 512 × 512 and 256 × 256, respectively.

The evaluation metrics follow those used by mainstream SOTA methods: the mean Intersection over Union (mIoU), the target detection accuracy (P_d), and the pixel-level false alarm rate (F_a). These three metrics provide a holistic assessment of segmentation quality: mIoU evaluates contour fitting accuracy, P_d measures target localization completeness, and F_a assesses background suppression capability. This multi-faceted evaluation ensures that both the positive detection performance and the false positive control of our model are rigorously examined.

Mean Intersection over Union (mIoU): This metric measures the pixel-wise overlap between the predicted segmentation mask and the ground truth, defined as the ratio of the intersection area to the union area, averaged over all target classes. mIoU serves as a primary indicator of the model’s ability to accurately fit the target contours, as expressed by the following formula:

$$\mathrm{mIoU} = \frac{1}{n} \sum \frac{A_{inter}}{A_{union}} \tag{8}$$

In this formula, n represents the total number of labels, A_inter indicates the area of intersection between the predicted and true labels, and A_union signifies the area of union between them.
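Equation (8) can be sketched as follows. This is an illustrative reading, assuming that each label is a binary mask and that pairs with an empty union are skipped; the function name `mean_iou` is hypothetical.

```python
import numpy as np


def mean_iou(pred_masks, gt_masks):
    """Average A_inter / A_union over pairs of binary masks."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            # Nothing predicted and nothing labeled: skip this pair (assumption).
            continue
        ious.append(np.logical_and(pred, gt).sum() / union)
    return float(np.mean(ious)) if ious else 0.0


gt = np.zeros((4, 4), dtype=int)
gt[1:3, 1:3] = 1            # 4 target pixels
pred = np.zeros((4, 4), dtype=int)
pred[1:3, 1:4] = 1          # 6 predicted pixels, 4 of them correct

# First pair has IoU 4/6, second pair is a perfect match with IoU 1.
miou = mean_iou([pred, gt], [gt, gt])
print(round(miou, 4))  # 0.8333
```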

Detection Precision (P_d): P_d evaluates the model’s capability to correctly locate target regions. It is defined as the ratio of correctly detected targets to the total number of ground-truth targets. A higher P_d indicates better target capture sensitivity, i.e.,

$$P_{d} = \frac{T_{correct}}{T_{all}} \tag{9}$$

where T_all denotes the total number of labels, while T_correct indicates the number of correctly predicted labels. A prediction is deemed correct if the distance between the centroid pixel coordinates of the predicted label and those of the ground-truth label is below the threshold D_threshold. The threshold is scaled linearly with the input image edge length to keep the relative distance consistent.
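The centroid-matching rule can be sketched in plain Python. The base threshold of 3 pixels at an edge length of 256, the greedy one-to-one matching, and all function names are assumptions chosen for illustration; the paper does not specify them.

```python
import math


def scaled_threshold(base_threshold, base_size, image_size):
    """Scale the centroid-distance threshold linearly with the image edge length."""
    return base_threshold * image_size / base_size


def detection_rate(pred_centroids, gt_centroids, threshold):
    """P_d = T_correct / T_all with greedy one-to-one centroid matching (assumption)."""
    unused = list(pred_centroids)
    correct = 0
    for gy, gx in gt_centroids:
        # Closest prediction that has not been matched yet.
        best = min(
            unused,
            key=lambda p: math.hypot(p[0] - gy, p[1] - gx),
            default=None,
        )
        if best is not None and math.hypot(best[0] - gy, best[1] - gx) < threshold:
            correct += 1
            unused.remove(best)
    return correct / len(gt_centroids) if gt_centroids else 0.0


# Hypothetical base threshold of 3 pixels at 256 x 256, rescaled to 640 x 640.
d_threshold = scaled_threshold(3.0, 256, 640)

gt_centroids = [(100, 100), (300, 400), (500, 50)]
pred_centroids = [(103, 104), (310, 400), (20, 20)]
pd = detection_rate(pred_centroids, gt_centroids, d_threshold)
print(d_threshold, pd)
```

In this toy case only the first target is matched: its predicted centroid is 5 pixels away, which is under the scaled threshold of 7.5, while the second prediction is 10 pixels off and the third is far from any target.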

False Alarm Rate (F_a): F_a quantifies the model’s tendency to mistakenly classify background pixels as targets. A lower F_a reflects stronger robustness against background interference. F_a is defined as the ratio of the number of pixels with erroneously predicted labels, P_incorrect, to the total number of pixels in the input image, P_all, i.e.,

$$F_{a} = \frac{P_{incorrect}}{P_{all}} \tag{10}$$
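A minimal sketch of Equation (10) is given below. It assumes that P_incorrect counts only background pixels predicted as target, which follows the description of F_a as a background-misclassification measure; the function name is hypothetical.

```python
import numpy as np


def false_alarm_rate(pred, gt):
    """F_a = P_incorrect / P_all, counting background pixels predicted as target."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    false_positives = np.logical_and(pred, ~gt).sum()
    return float(false_positives) / gt.size


gt = np.zeros((4, 4), dtype=bool)
gt[1, 1] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1, 1] = pred[1, 2] = pred[3, 3] = True  # one hit, two false alarms

fa = false_alarm_rate(pred, gt)
print(fa)  # 0.125
```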

4.2. Experimental Details

Comparison Methods: We compare MDCNet with several state-of-the-art infrared segmentation networks: ACM [18], ALCNet [19], DNANet [16], MSHNet [29], HCFNet [30], and HAFNet [21].

Data Preprocessing: In addition to our self-constructed CLAMPTISS dataset, the experimental data also incorporates two publicly available datasets in the field of infrared small target detection: IRSTD-1k and NUAA-SIRST. In the comparative experiments, a uniform input size was specified for each of the three datasets: 640 × 640 for CLAMPTISS, 512 × 512 for IRSTD-1k, and 256 × 256 for NUAA-SIRST. All models were evaluated using the corresponding input sizes for each dataset.

Training Details: MDCNet is trained for 500 epochs using the AdaGrad optimizer with a loss-oriented learning rate adjustment. The loss function is Soft-IoU Loss. For other models, all parameters are set to their default values, except for the epoch count and input size, which are aligned with those of MDCNet. All experiments are conducted on the same hardware for fair comparison.
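As a rough illustration of this training setup, the sketch below pairs an AdaGrad update with a plateau-based learning-rate rule on a toy one-parameter problem. Reading "loss-oriented learning rate adjustment" as "reduce the rate when the loss stops improving" is an assumption, as are the class name `PlateauLR`, the reduction factor, the patience, and the toy objective.

```python
import math


class PlateauLR:
    """Multiply lr by `factor` once the loss has not improved for more than `patience` epochs."""

    def __init__(self, lr, factor=0.5, patience=2):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.stale = math.inf, 0

    def step(self, loss):
        if loss < self.best:
            self.best, self.stale = loss, 0
        else:
            self.stale += 1
            if self.stale > self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr


def adagrad_step(theta, grad, accum, lr, eps=1e-10):
    """One AdaGrad update: scale the step by the root of the accumulated squared gradients."""
    accum += grad * grad
    theta -= lr * grad / (math.sqrt(accum) + eps)
    return theta, accum


# Toy objective (theta - 3)^2 standing in for the segmentation loss.
theta, accum = 0.0, 0.0
scheduler = PlateauLR(lr=1.0)
for epoch in range(50):
    loss = (theta - 3.0) ** 2
    lr = scheduler.step(loss)
    grad = 2.0 * (theta - 3.0)
    theta, accum = adagrad_step(theta, grad, accum, lr)
final_loss = (theta - 3.0) ** 2

# A flat loss curve triggers one reduction after the patience is exceeded.
flat = PlateauLR(lr=0.1, factor=0.5, patience=2)
flat_lrs = [flat.step(1.0) for _ in range(4)]
print(final_loss, flat_lrs)
```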

4.3. Quantitative Results

As shown in Table 1, MDCNet outperforms the other SOTA methods on CLAMPTISS across all three metrics. Compared to the baseline, it achieves an 8.3% improvement in mIoU, a 13.3% increase in P_d, and a 38.2% reduction in F_a. Notably, the substantial reduction in F_a indicates that MDCNet learns object shapes more precisely, enabling better discrimination of the pixels surrounding each target. This improvement stems from the MHSE module’s capacity to capture fine-grained features and the MNFF module’s effective cross-layer interaction between semantic and detailed information, allowing MDCNet to leverage both global context and local details efficiently.

Because the heat detection model for suspension clamps may be deployed at the edge, we also compared the models’ parameter counts (Params), FLOPs, and FPS. On these efficiency indicators, the ACM model has a significant advantage over all other models, owing to its lightweight backbone network and single-scale context modulation strategy. MDCNet ranks in the middle among all models and can generally meet real-time requirements; the MNFF module achieves substantial improvements in overall performance at the cost of some reduction in speed.

To assess the generalization capability of MDCNet beyond CLAMPTISS, we conducted comparative experiments on two publicly available infrared small target datasets, IRSTD-1k and NUAA-SIRST, with the experimental results presented in Table 2. MDCNet achieved the highest mIoU on the IRSTD-1k dataset and the highest P_d on the NUAA-SIRST dataset. Although MDCNet did not attain the best result on every metric, its overall performance consistently ranked among the top of all compared models.

4.4. Qualitative Results

As shown in Figure 6, we present a visual comparison of segmentation results between MDCNet and other methods on the CLAMPTISS test set. The selected samples cover diverse scenarios with varying target quantities, target scales, target shapes, brightness levels, and background complexities. From the three distinctly different test samples shown in Figure 6a,b, it is evident that MDCNet achieves superior detection performance for small targets. Although low overlap still occurs occasionally, MDCNet identifies more suspension clamp targets than the other methods, demonstrating its enhanced adaptability to scene variations.

Furthermore, it is worth noting that the MHSE module, through its group-wise attention mechanism, enables the model to focus on fine-grained feature interactions within local regions, thereby improving its capability to model target shapes. As depicted in Figure 6c,d, the predicted target contours by MDCNet closely align with the ground truth, validating the effectiveness of this module in shape preservation. In addition, the multi-neighborhood and cross-node incremental connections within the MNFF module facilitate progressive feature fusion, selectively amplifying deep features associated with targets. This, in turn, empowers the network to more effectively learn and predict weak targets. As shown in Figure 6e,f, even when the target is almost entirely submerged in the background, MDCNet still achieves accurate localization and segmentation.

During infrared image acquisition, some images exhibit “occlusion”, “flip”, or “severe tilt” of suspension clamps because of limitations in shooting angles or equipment installation space. Figure 7 illustrates the recognition results for such test images, with four algorithms from Table 1 selected for comparison. As shown in the figure, MDCNet accurately detects the target positions but encounters a “low overlap” problem when predicting the “flip” type. All algorithms are capable of detecting “occlusion” samples; however, only HAFNet and MDCNet successfully predict the “severe tilt” samples. For the first “severe tilt” sample, although HAFNet achieves accurate localization, its predictions of shape and pixels are notably inadequate. This may be because the Transformer architecture requires a larger volume of data, whereas the training set contains only 1046 samples. These results indicate that MDCNet also performs well on such “hard cases”: the MNFF feature fusion module retains and exploits the limited fine-grained features as far as possible. The results of HAFNet further underscore the strong feature extraction capability of the Transformer architecture, suggesting substantial room for future improvement.

4.5. Ablation Experiment

To validate the efficacy of each module within MDCNet, with a particular focus on the complementarity between MHSE and MNFF within the MDCNet framework, a series of ablation studies were conducted.

Because MNFF is the core feature fusion component of MDCNet, we conducted ablation experiments on MNFF modules with different node depths (specifically, 3, 4, and 5 layers). The corresponding network architectures are shown in Figure 8, and the experimental results are summarized in Table 3. As the network depth increases, model performance exhibits a consistent improvement trend. This observation further confirms that deeper architectures enhance the model’s feature representation capability, which aligns with the design motivation of the MNFF module. Specifically, with increasing network depth, the number of neighboring nodes involved in feature fusion correspondingly increases, enabling the integration of more diverse feature information from multiple sources.

As a key module of MDCNet, the MNFF module is designed for progressive multi-neighborhood feature fusion. Figure 9 shows four variants of the MNFF architecture, and Table 4 reports their experimental results in comparison with our proposed MNFF module. Similar to the findings from the depth ablation experiments, the four variants achieve results that closely approximate those of MNFF. The performance differences are modest, primarily because all four variants have a depth of 5, ensuring a sufficient number of nodes in the network. Their slightly inferior performance relative to MNFF can be attributed to the relatively simplistic and linear nature of the node inputs, as seen in variants b and c. In contrast, variants a and d incorporate non-linear input sources and thus exhibit marginally better overall performance than variants b and c.

Figure 10 presents the experimental results of three feature fusion modules commonly used in semantic segmentation scenarios—FPN [31], ASPP [32], and U-Net [33]—along with the MNFF module. Overall, the performance metrics of all methods are relatively close. Among them, MNFF remains the most suitable feature fusion module for the MDCNet architecture, followed by U-Net. The commonality between these two modules lies in their adoption of skip connections, which establish a direct information highway between deep semantic information and shallow texture information, thereby enhancing feature reusability and improving the localization accuracy of object boundaries in segmentation tasks.

Having validated the effectiveness of the MNFF module through the preceding ablation studies, we further investigate the contribution of the MHSE module. Specifically, we combined the same MNFF architecture with different classical attention modules and conducted comparative experiments to evaluate MHSE. The results are presented in Table 5. The proposed MHSE module achieves the best overall performance. Notably, while maintaining an F_a comparable to that of SE and CA, MHSE yields substantial improvements in both mIoU and P_d. This advantage can be attributed to its multi-head grouping mechanism, which captures fine-grained channel-wise interactions without compromising background suppression capability. In addition, the standard SE module performed poorly in this task, with an mIoU decrease of 3.8% compared to the baseline. This can be attributed to the dilution effect of SE’s global average pooling on infrared small target features: when the proportion of target pixels is extremely low (1%), global statistics struggle to retain the channel response of sparse targets.
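The dilution argument can be made concrete with a small numerical illustration. The following Python sketch is not part of the MDCNet implementation; the 100 × 100 map size and the response values (1.0 for the target, 0.1 for the background) are hypothetical choices made only to reproduce the 1% target proportion mentioned above. It compares the globally pooled value of a single-channel feature map with and without the target.

```python
def global_average(feature_map):
    """Mean over all spatial positions of a single-channel map (list of rows)."""
    total = sum(sum(row) for row in feature_map)
    count = sum(len(row) for row in feature_map)
    return total / count


def make_map(size, target_side, target_value, background_value):
    """Square map with a target_side x target_side patch in the top-left corner."""
    return [
        [
            target_value if (r < target_side and c < target_side) else background_value
            for c in range(size)
        ]
        for r in range(size)
    ]


# Hypothetical settings (assumptions, not values from the paper).
SIZE = 100         # 100 x 100 feature map
TARGET_SIDE = 10   # 10 x 10 patch, i.e. 1% of all pixels

with_target = make_map(SIZE, TARGET_SIDE, target_value=1.0, background_value=0.1)
without_target = make_map(SIZE, TARGET_SIDE, target_value=0.1, background_value=0.1)

target_fraction = TARGET_SIDE ** 2 / SIZE ** 2
pooled_with = global_average(with_target)
pooled_without = global_average(without_target)

print(f"target fraction:       {target_fraction:.2%}")
print(f"pooled with target:    {pooled_with:.4f}")
print(f"pooled without target: {pooled_without:.4f}")
print(f"shift caused by target: {pooled_with - pooled_without:.4f}")
```

Under these assumed values, a target whose local response is ten times the background shifts the pooled channel descriptor by less than 0.01, which illustrates why a single global statistic may carry little information about a sparse target.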

We further analyzed how the performance of MHSE varies with the number of groups (heads). The experimental results are detailed in Table 6. F_a remained at a broadly consistent level. Notably, mIoU and P_d are best when the number of heads is set to 2, while F_a peaks when the number of heads is 4. This observation indicates that there is no direct linear relationship between the number of groups and performance. A possible explanation is that infrared small target segmentation exhibits an extreme foreground-background imbalance, so the attention mechanism should prioritize the fidelity of local features over the integration of global context. The heads = 2 configuration meets this requirement.

5. Conclusions

This paper tackles the challenging task of semantic segmentation for transmission line clamp equipment in infrared images, a scenario where low contrast, background clutter, and diminutive target sizes significantly impair the performance of traditional methods. We propose MDCNet, a multi-neighborhood densely connected network specifically designed for this context. The network integrates two novel modules: MHSE, which leverages a grouped attention mechanism to capture fine-grained channel-wise interactions and enhance local feature representations, and MNFF, which progressively aggregates multi-scale information from adjacent nodes to reinforce target region features. Additionally, we have constructed the dataset CLAMPTISS, the first infrared dataset tailored for suspension clamp segmentation in real-world inspection scenarios, providing a benchmark for future research in this domain.

Extensive experiments demonstrate that MDCNet outperforms existing state-of-the-art methods on the three evaluation metrics, achieving notable improvements in mIoU and P_d while substantially reducing F_a. Ablation studies further validate the individual contributions of the MHSE and MNFF modules, confirming their effectiveness in enhancing segmentation accuracy and robustness. Qualitative visual analyses illustrate that MDCNet adapts well to varying target scales, shapes, and background complexities, and is particularly effective at detecting weak and small targets under challenging imaging conditions.

Future work will explore the generalization of MDCNet to other infrared inspection tasks, such as tension clamp and damper detection, and investigate lightweight network architectures to facilitate deployment on edge devices for real-time UAV-based inspections.

Author Contributions

Conceptualization, W.L., G.A. and G.Z.; methodology, W.L., G.Z. and G.A.; software, W.L.; validation, W.L., G.A. and G.Z.; formal analysis, W.L. and G.Z.; investigation, W.L.; resources, G.A.; data curation, W.L. and X.W.; writing—original draft preparation, W.L.; writing—review and editing, W.L. and G.A.; visualization, W.L. and Y.Z.; supervision, G.A. and X.W.; project administration, G.A.; funding acquisition, G.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key R&D Program of China, No. 2023YFC3006700; Topic Five, No. 2023YFC3006705.

Data Availability Statement

The dataset used in this article, together with high-quality masks, has been made publicly available at https://github.com/donglu110/InfraredData- (accessed on 10 February 2026).

Conflicts of Interest

Authors Guocheng An, Wanrong Lu, Xiaolong Wang, and Yanwei Zhang are employed by the company Industry Digital Intelligence Division, ECCOM Network System Co., Ltd. The authors declare that this study received funding from the National Key R&D Program of China. The funder was not involved in the study design; the collection, analysis, or interpretation of data; the writing of this article; or the decision to submit it for publication.

References

1. Hamaguchi, R.; Fujita, A.; Nemoto, K. Small Object Segmentation Using Dilated Convolutions with Increasing-Decreasing Dilation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 19016–19034.
2. Hao, J.; Tao, Y. Power grid inspection based on multimodal foundation models. EAI Endorsed Trans. Energy Web 2025, 12.
3. Sun, S.; Song, C.; Cong, X.; Fei, L.; Ding, L. Intelligent Inspection and Deicing System for Power Transmission Lines Based on UAV Platform. In Proceedings of the 2025 4th International Conference on Cyber Security, Artificial Intelligence and the Digital Economy, Kuala Lumpur, Malaysia, 7–9 March 2025.
4. Wang, H.; Shang, X.; Xiong, Q.; Liang, J.; Huang, Y. An Innovative Approach to Multi-Objective Infrared Temperature Measurement for Electrical Power Equipment Using K-Means. In Proceedings of the 2023 8th International Conference on Control, Robotics and Cybernetics (CRC), Changsha, China, 22–24 December 2023.
5. Zhai, Y.; Chen, R.; Yang, Q.; Li, X.; Zhao, Z. Insulator fault detection based on spatial morphological features of aerial images. IEEE Access 2018, 6, 35316–35326.
6. Jin, L.; Xia, J.; Yan, S.; Duan, S.; Yao, S.; Zhao, L. Temperature rising recognition of IR image of electrical equipment based on seeded region growing. In Proceedings of the 2013 2nd International Conference on Electric Power Equipment - Switching Technology (ICEPE-ST), Matsue, Japan, 20–23 October 2013.
7. Liu, X.; Zhang, Z.; Hao, Y.; Zhao, H.; Yang, Y. Optimized OTSU segmentation algorithm-based temperature feature extraction method for infrared images of electrical equipment. Sensors 2024, 24, 1126.
8. Liu, Q.; Yan, J.; Huang, H. Substation Inspection Method Based on Air-Ground Collaboration. In Proceedings of the 2024 IEEE 2nd International Conference on Control, Electronics and Computer Technology (ICCECT), Jilin, China, 26–28 April 2024.
9. Cai, Z.; Wang, T.; Han, W.; Ding, A. PGE-YOLO: A Multi-Fault-Detection Method for Transmission Lines Based on Cross-Scale Feature Fusion. Electronics 2024, 13, 2738.
10. Chen, Y.; Song, B.; Du, X.; Guizani, M. Infrared Small Target Detection Through Multiple Feature Analysis Based on Visual Saliency. IEEE Access 2019, 7, 38996–39004.
11. Yasaswi, V.; Keerthi, S.; Jainab Begum, S.; Krishna Sravan, Y.; Sridhar, B. Infrared thermal image segmentation for fault detection in electrical circuits using watershed algorithm. Int. J. Eng. Trends Technol. 2015, 21, 423–429.
12. Wang, D.; Shen, T. Research on infrared weak and small target detection algorithm in complex sky background. Acta Opt. Sin. 2020, 40, 0512001.
13. Zhou, J.; Liu, G.; Gu, Y.; Deng, J.; Wen, Y.; Chen, S. A box-supervised instance segmentation method for insulator infrared images based on shuffle polarized self-attention. IEEE Trans. Instrum. Meas. 2023, 72, 5026111.
14. Li, T.; Wang, S.; Hu, Q.; Liu, L.; Zhou, L.; Deng, Y. Temperature distribution characteristics and heat defect judgment method based on temperature gradient of suspended composite insulator in operation. IET Gener. Transm. Distrib. 2021, 15, 2554–2566.
15. Andrei, A.-T.; Grigore, O. Development of a very low-cost deforestation monitoring system based on aerial image clustering and compression techniques. Adv. Electr. Comput. Eng. 2024, 24, 73–84.
16. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2023, 32, 1745–1758.
17. Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention-Guided Pyramid Context Networks for Detecting Infrared Small Target Under Complex Background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261.
18. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Virtual, 5–9 January 2021.
19. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824.
20. Hu, C.; Huang, Y.; Li, K.; Zhang, L.; Long, C.; Zhu, Y.; Pu, T.; Peng, Z. DATransNet: Dynamic attention transformer network for infrared small target detection. IEEE Geosci. Remote Sens. Lett. 2025, 22, 7001005.
21. Zhang, Y.; Bao, W.; Yang, Y.; Wan, W.; Xiao, Q.; Zou, X. HAFNet: Hierarchical attention fusion network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5007316.
22. Huang, D.; Yang, J.; Yan, Y.; Sun, X.; Huang, X. Typical wire clamps segmentation of transmission lines based on infrared image. In Proceedings of the International Conference on Artificial Intelligence, Virtual Reality, and Visualization (AIVRV 2022), Chongqing, China, 23–25 September 2022.
23. Yao, Y.; Du, Z.; Zhou, G.; Wang, Q. Multi-spectral fusion power equipment fault recognition based on prompt learning. J. Northwest. Polytech. Univ. 2025, 43, 410–417.
24. Siraj, F.M.; Ayon, S.T.K.; Samad, M.A.; Uddin, J.; Choi, K. Few-Shot Lightweight SqueezeNet Architecture for Induction Motor Fault Diagnosis Using Limited Thermal Image Dataset. IEEE Access 2024, 12, 50986–50997.
25. Yang, X.; Tu, Y.; Yuan, Z.; Zheng, Z.; Chen, G.; Wang, C.; Xu, Y. Intelligent overheating fault diagnosis for overhead transmission line using semantic segmentation. High Volt. 2024, 9, 309–318.
26. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
28. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 2011–2023.
29. Liu, Q.; Liu, R.; Zheng, B.; Wang, H.; Fu, Y. Infrared small target detection with scale and location sensitivity. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
30. Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. HCF-Net: Hierarchical context fusion network for infrared small object detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024.
31. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
32. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
33. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015.

Figure 1. Scheme for temperature extraction from infrared imaging. The thermal matrix is acquired through the SDK supplied by the manufacturer, while the coordinate set is predicted by the model.

Figure 2. CLAMPTISS datasets.

Figure 3. Overall structure of MDCNet. Each node within the MNFF module extracts features via ResNet with MHSE attention. The outputs from each level of MNFF (denoted by distinct colors) are upsampled (“up”) to a uniform shape for use in subsequent stages.

Figure 4. Schematic diagram of MHSE attention. Output location added to ResNet.

Figure 5. Schematic diagram of MNFF structure. Different colors represent different levels.

Figure 6. Qualitative comparison between MDCNet and other networks. The red box within the figure denotes correct detection, yellow signifies missed detection, blue indicates false detection, and green represents low overlap with the true target pixels. (a,b) verify small target detection capability. (c,d) verify the target shape fitting ability. (e,f) verify the detection ability under complex background.

Figure 7. Comparison of recognition performance on “hard case” samples. The enlarged target in the red box is a “hard case” target as defined in this paper; unmarked targets are common targets.

Figure 8. MNFF structures at different depths. Purple represents a node depth of 3, green represents a node depth of 4, and the front view color represents a node depth of 5.

Figure 9. Four variant structures of MNFF. Solid arrows indicate connections between adjacent nodes, and dashed arrows indicate cross-node connections. (a–d) represent the four connection types of MNFF, respectively.

Figure 10. Comparison of four feature fusion modules: FPN, ASPP, U-Net, and MNFF. In the experiment, only the MNFF module was replaced, and other settings remained unchanged.

Table 1. Quantitative comparison between MDCNet and other SOTA methods on the CLAMPTISS dataset.

Method	mIoU ↑	P_d ↑	F_a (×10^−5) ↓	Params (M)	GFLOPs	FPS
ACM [18] (baseline)	0.6476	0.7983	0.34	0.39	1.78	147.0
ALCNet [19]	0.6672	0.8163	0.30	1.45	6.63	94.0
HCFNet [30]	0.6596	0.8367	0.27	14.40	135.42	16
DNANet [16]	0.6652	0.8522	0.26	4.70	89.66	1.5
MSHNet [29]	0.6448	0.8143	0.33	4.07	24.43	62.5
HAFNet [21]	0.6683	0.8659	0.28	13.46	122.50	5.9
MDCNet (ours)	0.7012	0.9051	0.21	4.68	88.92	6.3
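
As a quick arithmetic check on Table 1, the following Python sketch computes the relative change of MDCNet with respect to the ACM baseline for the three accuracy metrics. The numbers are copied from the table; the dictionary layout and the helper name `relative_change` are illustrative choices, and the sketch assumes that improvements are measured relative to the baseline row.

```python
# Values copied from Table 1; F_a is expressed in units of 1e-5.
baseline = {"mIoU": 0.6476, "P_d": 0.7983, "F_a": 0.34}  # ACM (baseline)
proposed = {"mIoU": 0.7012, "P_d": 0.9051, "F_a": 0.21}  # MDCNet (ours)


def relative_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


changes = {name: relative_change(proposed[name], baseline[name]) for name in baseline}

for name, pct in changes.items():
    # Positive values are gains; a negative F_a means fewer false alarms.
    print(f"{name}: {pct:+.2f}%")
```

Under this assumption, the relative changes come out at roughly +8.3% for mIoU, +13.4% for P_d, and −38.2% for F_a, which is in line with the gains summarized in the abstract up to rounding.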