1. Introduction
Image segmentation has long been a core task in computer vision, serving as the foundation for applications in diverse fields such as medical diagnostics [1], remote sensing [2], and industrial inspection [3]. Accurate segmentation enables fine-grained image analysis, providing reliable structural information that supports downstream recognition, detection, and decision-making tasks. Nevertheless, segmentation in real-world environments remains highly challenging due to factors such as noisy backgrounds, low-resolution imagery, blurred boundaries, and imbalanced class distributions. These challenges are particularly evident in industrial scenarios, where defects are often small, irregular, and visually ambiguous, making their precise localization critical for ensuring the reliability of automated monitoring and fault diagnosis. Although the presence of fewer defective samples reflects a desirable industrial outcome, such imbalance leads to limited training diversity and model bias toward dominant ‘normal’ classes. Moreover, even within a single image, the defective regions occupy only a small fraction of pixels, posing an additional pixel-level imbalance challenge for precise segmentation.
Traditional approaches based on handcrafted features, including edge- and texture-based descriptors [4], region-growing strategies [5,6], and graph-cut algorithms [7,8], have shown limited robustness when applied to noisy and complex data. With the advent of deep learning, significant progress has been achieved in semantic segmentation, with networks such as FCN [9], U-Net [1], PSPNet [10], and DeepLab variants [11,12,13,14] demonstrating impressive performance across multiple domains. For example, traditional texture- and edge-based segmentation methods typically achieve less than 70% pixel accuracy on noisy industrial images, while deep convolutional models exceed 85% on the same datasets. This gap illustrates the limited robustness of handcrafted approaches under complex illumination and texture conditions. More recently, Transformer-based models have introduced powerful global context modeling through self-attention mechanisms, complementing the local feature extraction capabilities of convolutional neural networks (CNNs). Hybrid architectures that combine CNNs and Transformers, such as TEC-Net and Next-ViT, attempt to leverage the strengths of both paradigms. However, these models often suffer from high structural complexity and substantial computational costs, which hinder their scalability and deployment in resource-constrained environments. Moreover, existing solutions still struggle with fine-grained boundary delineation and class imbalance when applied to industrial fault datasets. Fine-grained boundary delineation is particularly difficult in industrial inspection due to subtle gray-level transitions between defective and normal areas and inherent annotation noise at object borders, which often cause traditional edge detectors to fail in distinguishing true defect boundaries from background texture.
In this work, we address these limitations by proposing a Defect-Aware Fine Segmentation Framework (DAFSF) designed to achieve accurate and robust fault localization in challenging industrial datasets while maintaining efficiency. The key contributions of this paper are as follows:
We design a multi-scale hybrid encoder that integrates CNN-based local feature extraction with Transformer-based global context modeling, enabling balanced representation of fine-grained structures and long-range dependencies.
We introduce a boundary refinement module that explicitly enhances edge representation, mitigating boundary ambiguity and improving segmentation accuracy in small or blurred defect regions.
We propose a defect-adaptive loss function that dynamically reweights pixel contributions by considering boundary proximity, classification difficulty, and class imbalance, thereby improving robustness in noisy and imbalanced datasets.
We conduct extensive validation on both a proprietary infrared electrolyzer dataset and public benchmarks including Aeroscapes [15], Magnetic Tile Defect [16], and MVTec AD [17], demonstrating significant improvements over state-of-the-art methods in segmentation accuracy, boundary precision, and cross-domain generalization.
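The defect-adaptive loss listed among the contributions reweights each pixel by boundary proximity, classification difficulty, and class imbalance. Its exact formulation is not given in this section, so the following is only a minimal sketch under stated assumptions: a focal-style difficulty term `(1 - p_t) ** gamma`, a boundary term `1 + lam * b` with a boundary-proximity score `b` in [0, 1], and a class weight `alpha`. The names `defect_adaptive_loss`, `lam`, `gamma`, and `alpha` are hypothetical, and the sample values are illustrative only.

```python
import math


def defect_adaptive_loss(probs, labels, boundary, alpha, lam=1.0, gamma=2.0):
    """Weighted binary cross-entropy over a flat list of pixels (illustrative).

    probs    : predicted probability of the defect class per pixel
    labels   : ground-truth label per pixel (1 = defect, 0 = normal)
    boundary : boundary-proximity score per pixel in [0, 1] (1 = on a boundary)
    alpha    : weight of the defect class; normal pixels receive 1 - alpha
    lam      : boundary weighting factor (assumed form: 1 + lam * boundary)
    gamma    : hard-sample focusing parameter (assumed focal-style form)
    """
    eps = 1e-7
    total = 0.0
    for p, y, b in zip(probs, labels, boundary):
        p = min(max(p, eps), 1.0 - eps)           # keep log() finite
        p_t = p if y == 1 else 1.0 - p            # probability of the true class
        class_w = alpha if y == 1 else 1.0 - alpha
        difficulty_w = (1.0 - p_t) ** gamma       # larger for poorly classified pixels
        boundary_w = 1.0 + lam * b                # larger near defect boundaries
        total += -class_w * difficulty_w * boundary_w * math.log(p_t)
    return total / len(probs)


# Illustrative values only (not taken from any dataset in the paper).
probs = [0.9, 0.6, 0.2, 0.1]
labels = [1, 1, 0, 0]
boundary = [0.0, 1.0, 1.0, 0.0]
loss_plain = defect_adaptive_loss(probs, labels, boundary, alpha=0.75, lam=0.0)
loss_edge = defect_adaptive_loss(probs, labels, boundary, alpha=0.75, lam=2.0)
```

With `lam=0` and `gamma=0` the sketch reduces to class-weighted cross-entropy, which makes the separate effect of each term easy to inspect.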
5. Results
Table 3 summarizes the model complexity and segmentation accuracy (mIoU) of various methods across three datasets: Aeroscapes, Magnetic Tile Defect, and MVTec AD. From the table, it is clear that traditional CNN-based architectures such as U-Net and PSPNet maintain relatively low parameter counts (31 M and 46 M, respectively) and moderate computational costs (FLOPs 255 G and 280 G), yet they achieve limited segmentation performance, with mIoU values ranging from 69.4% to 73.5% across datasets. This indicates that while these models are efficient, their ability to capture the global context and subtle defect structures is insufficient, particularly under noisy or heterogeneous industrial conditions.
Transformer-based methods, including Swin Transformer and TransUNet, achieve higher mIoU scores (up to 77.5%), demonstrating the advantage of global self-attention for modeling long-range dependencies. However, these gains come with substantially increased computational costs (FLOPs up to 315 G for TransUNet) and parameter counts, which may hinder deployment in resource-constrained industrial settings. Hybrid architectures such as Next-ViT, FA-HRNet, and UIU-Net further improve mIoU, benefiting from the combination of convolutional local feature extraction and Transformer-based global reasoning. In contrast, lighter networks such as Patch-NetVLAD and BiSeNetV2, despite very low FLOPs in some cases, fail to achieve competitive mIoU, reflecting a trade-off between efficiency and fine-grained segmentation capability.
Our proposed DAFSF model strikes a favorable balance between accuracy and efficiency. With a moderate parameter count of 36.2 M and FLOPs of 276 G, DAFSF attains mIoU values of 80.5%, 81.4%, and 82.6% on Aeroscapes, Magnetic Tile Defect, and MVTec AD, respectively, outperforming all baselines by a significant margin. The consistent cross-dataset improvements indicate that the multi-scale hybrid encoder effectively captures both local details and global context, while the boundary refinement and defect-adaptive loss mechanisms enhance segmentation of fine-grained defect structures. In addition, the qualitative results illustrated in Figure 2 and Figure 3 further demonstrate the capability of DAFSF to accurately delineate fine-grained defect regions and maintain consistent generalization across different datasets. These results suggest that DAFSF not only delivers high segmentation precision but also generalizes robustly across heterogeneous domains, making it highly suitable for practical industrial applications where both accuracy and computational efficiency are critical.
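For readers less familiar with the mIoU metric reported in Table 3, the sketch below shows the standard definition: per-class IoU = TP / (TP + FP + FN), averaged over classes. Whether the authors skip classes that are absent from an image or dataset is not stated, so skipping them here is an assumption; the function names and the toy labels are hypothetical.

```python
def confusion_matrix(pred, target, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Average per-class IoU, skipping classes with no pixels at all."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, two classes (0 = normal, 1 = defect).
target = [0, 0, 0, 0, 1, 1]
pred = [0, 0, 0, 1, 1, 1]
cm = confusion_matrix(pred, target, num_classes=2)
miou = mean_iou(cm)  # class 0: 3/4, class 1: 2/3
```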
Table 4 presents inference latency, mean F1-score (mF1), and pixel accuracy (PA) for all compared methods across three datasets. Lightweight CNN models such as U-Net and PSPNet exhibit the lowest latency (22–36 ms), which is beneficial for real-time applications, but their segmentation quality is relatively modest, with mF1 values ranging from 77.8% to 81.0% and PA between 85.7% and 87.5%. This reflects a limited capacity to capture fine-grained defect structures and the global context, especially in heterogeneous or high-noise industrial environments.
Transformer-based architectures, including Swin Transformer and TransUNet, improve mF1 (up to 83.7%) and PA (up to 90.2%) by modeling long-range dependencies. However, this performance comes at the cost of substantially higher latency (68–76 ms), which may constrain their deployment in scenarios requiring rapid inspection or real-time monitoring. Hybrid models such as Next-ViT, FA-HRNet, and UIU-Net strike a better balance, achieving both moderate latency and improved segmentation metrics, demonstrating the benefit of combining convolutional local feature extraction with global attention.
Lightweight hybrid networks such as Patch-NetVLAD and BiSeNetV2 achieve very low latency (12–19 ms), yet their mF1 and PA remain lower than those of larger hybrid or Transformer models, revealing the trade-off between efficiency and robust defect modeling. FreeSeg also exhibits reduced latency (28–30 ms) with moderate performance, further illustrating this compromise.
The proposed DAFSF model demonstrates an effective balance between inference efficiency and segmentation quality. With latency around 25–26 ms, comparable to lightweight CNNs, DAFSF attains the highest mF1 (85.3–87.2%) and PA (91.5–92.0%) across all datasets. These results, illustrated in Figure 4, indicate that the multi-scale hybrid encoder, boundary refinement module, and defect-adaptive loss not only enhance fine-grained defect recognition but also maintain computational efficiency suitable for real-time industrial applications. Importantly, the consistent performance gains across heterogeneous datasets reflect strong cross-domain generalization, confirming DAFSF’s robustness to varying defect types, noise levels, and imaging conditions.
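Because Table 4 reports both mF1 and PA, it may help to see why the two can diverge when defect pixels are rare. The sketch below uses the usual definitions (PA = correct pixels / all pixels; per-class F1 = 2TP / (2TP + FP + FN), averaged over classes). The confusion-matrix counts are invented for illustration and are not taken from the paper; the function names are hypothetical.

```python
def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class equals the true class."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(len(cm)))
    return correct / total


def mean_f1(cm):
    """Average per-class F1; rows are ground truth, columns are predictions."""
    n = len(cm)
    scores = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = 2 * tp + fp + fn
        if denom > 0:
            scores.append(2 * tp / denom)
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative, heavily imbalanced counts: 95 normal pixels, 5 defect pixels.
cm = [[90, 5],
      [3, 2]]
pa = pixel_accuracy(cm)   # dominated by the majority class
mf1 = mean_f1(cm)         # pulled down by the poorly segmented defect class
```

In this toy case PA stays high while mF1 drops sharply, which is why class-averaged metrics are the more informative of the two under pixel-level imbalance.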
5.1. Ablation Studies
Table 5 presents the ablation results of different components in DAFSF on the infrared electrolyzer dataset. The baseline model, implemented with a standard U-Net backbone, achieves an mIoU of 72.5% and a BIoU of 64.2%, which reflects limited capability in handling boundary ambiguity and small-scale defects. Incorporating the multi-scale hybrid encoder (MHE) significantly improves performance, raising mIoU to 77.3% and BIoU to 68.9%. This demonstrates the effectiveness of combining convolutional local feature extraction with Transformer-based global context modeling, which enhances representation quality under noisy industrial conditions.
Adding the boundary-aware refinement module (BARM) on top of MHE further increases BIoU from 68.9% to 72.4%, while also improving the F1-score by 2.2%. These gains highlight the importance of explicit boundary supervision in capturing fine-grained defect contours and reducing misclassification along blurred edges. Alternatively, combining MHE with the defect-adaptive loss (DAAL) achieves a similar boost in performance, with mIoU reaching 80.1% and the F1-score improving to 84.0%. This improvement is attributed to the dynamic reweighting of boundary and minority-class pixels, which alleviates the class imbalance problem and emphasizes difficult samples during training. The full DAFSF model, which integrates MHE, BARM, and DAAL, achieves the highest performance across all metrics, with an mIoU of 83.5%, BIoU of 76.2%, and F1-score of 87.1%. Compared to the baseline, this corresponds to gains of +11.0 percentage points in mIoU, +12.0 in BIoU, and +9.0 in the F1-score, demonstrating the complementary effects of the three modules. These results confirm that each component contributes uniquely to fine-grained segmentation, and their joint integration yields superior robustness and precision in industrial fault localization tasks.
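The ablation reports a boundary IoU (BIoU) alongside mIoU. The exact BIoU definition used is not given in this section, so the following is a simplified sketch under an assumption: the boundary of a mask is the set of foreground pixels with at least one 4-neighbour that is background or outside the image, and BIoU is the IoU of the two boundary sets. The function names and the toy masks are hypothetical.

```python
def boundary_pixels(mask):
    """Foreground pixels that touch background (or the image border)."""
    h, w = len(mask), len(mask[0])
    band = set()
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if nr < 0 or nr >= h or nc < 0 or nc >= w or not mask[nr][nc]:
                    band.add((r, c))
                    break
    return band


def boundary_iou(pred, target):
    """IoU restricted to the one-pixel-wide boundary of each mask."""
    bp, bt = boundary_pixels(pred), boundary_pixels(target)
    union = bp | bt
    return len(bp & bt) / len(union) if union else 1.0


def region_iou(pred, target):
    """Ordinary IoU over all foreground pixels, for comparison."""
    inter = union = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0


# Toy 5x5 example: a 3x3 defect, predicted one pixel too far to the right.
target = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]
pred = [[1 if 1 <= r <= 3 and 2 <= c <= 4 else 0 for c in range(5)] for r in range(5)]
iou = region_iou(pred, target)
biou = boundary_iou(pred, target)
```

A one-pixel shift costs the boundary metric more than the region metric in this toy case, which is consistent with BIoU being the more sensitive indicator of contour quality.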
To further interpret the effectiveness of the proposed multi-scale hybrid encoder (MHE), we visualize the intermediate feature responses from different layers of the backbone, as shown in Figure 5. The results reveal a clear progression of attention from shallow to deep layers. Specifically, the features from layer2 mainly emphasize low-level textures and local contrast, which are useful for capturing surface irregularities. Moving to layer3, the activations begin to highlight more coherent defect regions while suppressing irrelevant background noise, indicating the emergence of mid-level semantic understanding. Finally, layer4 exhibits highly focused responses along defect areas and their structural boundaries, confirming that deeper layers integrate both local details and long-range contextual cues. This progressive refinement validates the design of the hybrid encoder, which combines convolutional local descriptors with Transformer-based global reasoning to achieve robust and fine-grained defect segmentation.
To evaluate the robustness of DAFSF under challenging real-world conditions, we conducted experiments with augmented images simulating variations in illumination, motion blur, and sensor noise. The augmentations are consistent with Table 2, with additional condition-specific perturbations applied:
Illumination: Overexposure and underexposure to simulate varying lighting conditions.
Motion Blur: Gaussian blur with kernel size 3–7 to mimic fast camera motion.
Sensor Noise: Gaussian noise with standard deviation up to 0.05 to represent real sensor perturbations.
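The three perturbations listed above can be sketched in a few lines. This is an illustration under assumptions, not the authors' augmentation code: images are 2D lists of intensities in [0, 1], exposure is modeled as a multiplicative `gain`, the blur `sigma` is chosen arbitrarily, and the kernel size is assumed to be odd. All function and parameter names are hypothetical.

```python
import math
import random


def clip01(x):
    return min(1.0, max(0.0, x))


def adjust_exposure(img, gain):
    """gain > 1 simulates overexposure, gain < 1 underexposure (assumed model)."""
    return [[clip01(v * gain) for v in row] for row in img]


def gaussian_kernel(size, sigma):
    """Normalized 1D Gaussian kernel; size is assumed to be odd (3, 5, or 7)."""
    half = size // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(img, kernel):
    """Convolve each row with the kernel, replicating edge pixels."""
    half = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        out.append([
            sum(k * row[min(n - 1, max(0, c + j - half))] for j, k in enumerate(kernel))
            for c in range(n)
        ])
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def gaussian_blur(img, size=5, sigma=1.0):
    """Separable Gaussian blur: filter rows, then columns."""
    kernel = gaussian_kernel(size, sigma)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))


def add_gaussian_noise(img, std, rng):
    """Add zero-mean Gaussian noise and clip back to [0, 1]."""
    return [[clip01(v + rng.gauss(0.0, std)) for v in row] for row in img]


# Single bright pixel on a dark 7x7 background.
rng = random.Random(0)
img = [[0.0] * 7 for _ in range(7)]
img[3][3] = 1.0
over = adjust_exposure(img, 1.5)
blurred = gaussian_blur(img, size=5, sigma=1.0)
noisy = add_gaussian_noise(img, 0.05, rng)
```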
The results in Table 6 demonstrate that DAFSF is robust under diverse real-world conditions. Under illumination variations, although minor degradation occurs in shadowed or overexposed regions, the model is able to maintain accurate object boundaries across all datasets. Motion blur slightly smooths edges, yet small and medium-sized objects are still correctly segmented, which can be attributed to the multi-scale hybrid encoder’s ability to aggregate contextual information effectively. When sensor noise is introduced, boundary precision is slightly reduced under higher noise levels; however, the overall segmentation structure and defect localization remain largely unaffected. These observations indicate that DAFSF consistently preserves segmentation quality even under challenging conditions such as varying lighting, motion blur, and sensor perturbations, complementing the quantitative results reported in Table 3 and demonstrating its practical applicability in industrial and aerial scenarios.
5.2. Parameter Sensitivity Analysis
The experimental results provide comprehensive evidence of the effectiveness and robustness of the proposed DAFSF framework. As shown in Figure 6, DAFSF consistently outperforms all baseline models across three representative datasets, namely Aeroscapes, Magnetic Tile Defect, and MVTec AD, in terms of mIoU. This indicates that the framework not only achieves higher segmentation accuracy but also generalizes well across diverse domains characterized by small, irregular, and noisy defect patterns. Furthermore, the parameter sensitivity analysis in Figure 7 highlights the influence of the boundary weighting factor and the hard-sample focusing parameter in the defect-adaptive loss function. Specifically, the model performance improves as the boundary weighting factor increases up to an optimal point, beyond which the gain diminishes, suggesting that moderate boundary emphasis is critical for capturing structural details without introducing noise. Similarly, the variation of the focusing parameter demonstrates that carefully calibrated hard-sample weighting enhances the model’s ability to learn from challenging regions, whereas excessive focusing may lead to instability and performance degradation. The presence of optimal parameter points, marked by red dots, further confirms the balanced trade-off between boundary precision and sample difficulty. Overall, these results not only validate the superiority of DAFSF over existing approaches but also demonstrate its stability and adaptability under different parameter settings, reinforcing its practical applicability to real-world fault segmentation scenarios.
Table 7 summarizes the parameter sensitivity analysis of the proposed DAFSF on the infrared electrolyzer dataset. We first vary the boundary weighting factor while keeping the focusing parameter fixed. Results show that increasing the boundary weighting factor from 0 to 2 progressively improves both mIoU and BIoU, with the best performance obtained at the upper end of this range (mIoU = 83.5%, BIoU = 76.2%). This indicates that moderate boundary emphasis helps the model focus on fine-grained defect contours, thereby improving structural fidelity. However, further increasing the boundary weighting factor beyond 2 leads to slight performance degradation, suggesting that overemphasizing boundary pixels may destabilize region-level predictions by underweighting interior pixels. Next, we vary the focusing parameter while keeping the boundary weighting factor fixed. The results reveal that an intermediate setting of the focusing parameter achieves the best balance between hard-sample emphasis and overall stability. Lower values provide weaker discrimination, resulting in reduced accuracy, whereas higher values overly emphasize difficult samples, leading to marginal drops in performance. These findings confirm that the proposed DAAL is robust to parameter variation within a broad range, and that its optimal configuration corresponds to the best-performing values of the two parameters reported in Table 7.
5.3. Training Cost and Resource Consumption
DAFSF contains 36.2 M parameters and requires 276 G FLOPs per forward pass, representing a moderate model size compared to large Transformer-based baselines (e.g., TransUNet and Next-ViT). On an A100 GPU with a batch size of 16, each epoch requires approximately 4.5 min of wall-clock time for DAFSF, while larger Transformer-based models require 6–8 min per epoch. This demonstrates that DAFSF achieves competitive segmentation accuracy while maintaining moderate computational demand. DAFSF also converges faster than heavier baselines, reaching stable validation mIoU within roughly 120–140 epochs, whereas larger models require up to 180–200 epochs to achieve comparable performance. The faster convergence reduces the total training time, facilitating model tuning and iteration. The moderate parameter size and FLOPs of DAFSF allow training on a single A100 GPU with a batch size of 16, avoiding the need for multi-GPU setups or gradient accumulation. Additionally, lower inference FLOPs result in faster per-image processing speed, which is advantageous for real-time deployment or resource-constrained environments. Overall, the proposed DAFSF model achieves a favorable balance between segmentation accuracy, computational cost, and training efficiency, highlighting its practical advantage for industrial and aerial datasets.
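The per-epoch times and epoch counts quoted above imply a total training budget, which the short calculation below makes explicit. It uses only the figures stated in this section; pairing the lower bounds together and the upper bounds together is an assumption, and the helper name `total_hours` is hypothetical.

```python
def total_hours(minutes_per_epoch, epochs):
    """Wall-clock training time in hours for a fixed per-epoch cost."""
    return minutes_per_epoch * epochs / 60.0


# DAFSF: about 4.5 min per epoch, stable after roughly 120-140 epochs.
dafsf_hours = (total_hours(4.5, 120), total_hours(4.5, 140))

# Larger Transformer-based models: 6-8 min per epoch, up to 180-200 epochs.
larger_hours = (total_hours(6, 180), total_hours(8, 200))

print(f"DAFSF:  {dafsf_hours[0]:.1f} to {dafsf_hours[1]:.1f} h")
print(f"Larger: {larger_hours[0]:.1f} to {larger_hours[1]:.1f} h")
```

Under these figures DAFSF needs roughly 9 to 10.5 hours to converge, against roughly 18 to 27 hours for the larger models, so the saving comes from both the shorter epochs and the smaller number of epochs.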
5.4. Applicability to Medical Image Segmentation
Although the primary focus of this work is on industrial and aerial datasets, the proposed DAFSF framework is naturally extendable to medical image segmentation tasks that involve fine-grained defects or subtle structural variations. For instance, in ultrasound imaging, speckle noise and low contrast can obscure small lesions, making boundary delineation challenging; in PET or low-resolution MRI scans, the inherent resolution limitations affect functional–anatomical image fusion, complicating reliable segmentation. These challenges are analogous to industrial defects, where small or low-contrast anomalies must be localized accurately despite background clutter or sensor noise. The multi-scale hybrid encoder and boundary-aware supervision in DAFSF allow effective aggregation of local textural cues and global contextual information, which can improve the detection of micro-structures in medical images. Furthermore, medical imaging datasets often suffer from biased annotations or limited atlas coverage, analogous to missing or corrupted samples in industrial settings. By leveraging the adaptive feature fusion and boundary weighting mechanisms of DAFSF, such biases can be mitigated, improving robustness and reliability in segmentation tasks. Previous studies have highlighted the challenges of accurate fine-structure segmentation in medical applications, and the principles demonstrated in DAFSF can be adapted to these contexts, potentially supporting improved diagnosis or treatment planning without necessitating extensive retraining on each new dataset.
6. Limitations and Future Work
Despite the strong performance of DAFSF across multiple datasets and challenging conditions, several limitations remain that warrant further investigation. First, the use of hybrid multi-scale encoders and explicit boundary supervision contributes to increased computational complexity and memory consumption during training, which may limit scalability to extremely large datasets or prolonged training regimes. To address this, future work will investigate model compression strategies such as pruning, knowledge distillation, and low-rank factorization, aiming to reduce both training and inference costs while preserving segmentation accuracy. Second, DAFSF relies on high-quality pixel-wise annotations, particularly for boundary regions, which are labor-intensive and time-consuming to produce. Exploring weakly supervised or semi-supervised learning paradigms, potentially leveraging image-level labels, scribbles, or synthetic data, could substantially reduce annotation dependency and facilitate broader deployment. Third, although DAFSF demonstrates robustness under various illumination, motion blur, and noise conditions, performance may still degrade when faced with domain shifts such as previously unseen sensors, lighting conditions, or environmental contexts. Incorporating domain adaptation or domain generalization techniques will be essential to maintain consistent accuracy in diverse real-world scenarios. Finally, while the current implementation achieves competitive inference speed on high-end GPUs, deploying DAFSF on edge devices or embedded hardware for real-time applications remains challenging. Future work will focus on lightweight model design, hardware-aware optimization, and efficient inference strategies to enable practical edge deployment without compromising segmentation quality. Collectively, addressing these limitations will further enhance the applicability and efficiency of DAFSF in industrial, aerial, and resource-constrained environments.
7. Conclusions
In this work, we proposed the Defect-Aware Fine Segmentation Framework (DAFSF) to address the challenges of industrial defect segmentation, where defects are small, irregular, and embedded in noisy backgrounds. The framework integrates a multi-scale hybrid encoder for capturing both local and global contextual features, a boundary-aware refinement module for enhancing structural fidelity, and a defect-adaptive loss function to mitigate class imbalance and emphasize hard-to-segment regions. Extensive experiments on both proprietary and public datasets, including infrared electrolyzer images, Aeroscapes, Magnetic Tile Defect, and MVTec AD, demonstrate that DAFSF consistently outperforms state-of-the-art and hybrid architectures in terms of mIoU, mF1, and pixel accuracy, while maintaining competitive inference latency suitable for real-time industrial deployment. The ablation studies further confirm the effectiveness of each individual component, highlighting the importance of multi-scale feature fusion, boundary supervision, and adaptive loss weighting. Overall, DAFSF provides a robust, efficient, and generalizable solution for precise defect segmentation across heterogeneous domains, offering strong potential for practical applications in industrial inspection and quality control.