Article

Segmentation of 220 kV Cable Insulation Layers Using WGAN-GP-Based Data Augmentation and the TransUNet Model

1 Chengdu Power Supply Company, State Grid Sichuan Electric Power Company, Chengdu 610041, China
2 School of Electrical and Electronic Engineering, Chongqing University of Technology, Chongqing 400054, China
* Authors to whom correspondence should be addressed.
Energies 2025, 18(17), 4667; https://doi.org/10.3390/en18174667
Submission received: 28 July 2025 / Revised: 27 August 2025 / Accepted: 1 September 2025 / Published: 2 September 2025
(This article belongs to the Special Issue Fault Detection and Diagnosis of Power Distribution System)

Abstract

This study presents a segmentation framework for images of 220 kV cable insulation that addresses sample scarcity and blurred boundaries. The framework integrates data augmentation using the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) and the TransUNet architecture. Considering the difficulty and high cost of obtaining real cable images, WGAN-GP generates high-quality synthetic data to expand the dataset and improve the model’s generalization. The TransUNet network, designed to handle the structural complexity and indistinct edge features of insulation layers, combines the local feature extraction capability of convolutional neural networks (CNNs) with the global context modeling strength of Transformers. This combination enables accurate delineation of the insulation regions. The experimental results show that the proposed method achieves mDice, mIoU, MP, and mRecall scores of 0.9835, 0.9677, 0.9840, and 0.9831, respectively, with improvements of approximately 2.03%, 3.05%, 2.08%, and 1.98% over a UNet baseline. Overall, the proposed approach outperforms UNet, Swin-UNet, and Attention-UNet, confirming its effectiveness in delineating 220 kV cable insulation layers under complex structural and data-limited conditions.

1. Introduction

In recent years, the growing demand for electricity and the rapid expansion of urban power grid infrastructure have led to the widespread adoption of 220 kV cables in transmission projects. As a key component ensuring the safe and stable operation of high-voltage cable systems, the installation quality of intermediate joints is of paramount importance [1]. In engineering practice, inadequate insulation during joint installation may trigger severe faults such as partial discharge or dielectric breakdown, thereby posing serious threats to the reliability and safety of the power grid [2]. Accordingly, accurate segmentation and identification of the cable insulation layer during joint installation are essential for effective quality assessment.
However, on-site insulation quality assessment is still largely dependent on manual measurement and visual inspection, which are inefficient and highly susceptible to human error. This results in significant judgment variability and increases the risk of unstable joint quality and operational failures [3,4,5].
With the rapid advancement of deep learning in computer vision, CNN-based semantic segmentation methods have been extensively adopted in industrial inspection and object recognition, substantially outperforming traditional image processing approaches [6,7]. U-Net, proposed by Ronneberger et al. in 2015 [8], is a widely used CNN-based region delineation framework that utilizes a symmetric encoder–decoder architecture and skip connections to fuse local details with global semantic information. However, its dependence on local convolution operations limits its ability to model long-range dependencies, reducing its effectiveness in scenarios involving blurred boundaries or complex structural features.
These limitations have motivated the introduction of attention mechanisms and Transformer-based architectures to improve the modeling of complex backgrounds and critical semantic regions. For example, Oktay et al. proposed Attention U-Net [9], which incorporates learnable attention gates into the U-Net framework. These gates dynamically suppress irrelevant features and highlight key regions via skip connections, thereby enhancing the model’s responsiveness to boundary-sensitive structures such as organs. However, its attention mechanism remains constrained to local contextual information and lacks the capacity for global semantic reasoning. To address this, Cao et al. developed Swin-UNet [10], which embeds Swin Transformers into the encoder–decoder structure. By employing a hierarchical architecture and shifted window attention, it effectively integrates local and global feature representations. Despite its strong performance across various medical image delineation benchmarks, Swin-UNet remains limited by its fixed local window design, which restricts its ability to capture long-range dependencies and maintain structural continuity at object boundaries.
More recently, TransUNet has emerged as a promising architecture that combines the global modeling capability of Transformers [11,12] with the local feature extraction strengths of CNNs, demonstrating robust performance in complex medical and industrial region delineation tasks [13,14,15,16]. Compared with U-Net, the architecture can effectively model long-range dependencies, thereby enhancing global semantic representation. Relative to Attention U-Net, it introduces explicit positional encoding and multi-head self-attention, which improve contextual understanding and semantic integration. Unlike Swin-UNet, the model adopts a global attention mechanism instead of fixed local windows, yielding more consistent and holistic structural representations. These advantages make it particularly suitable for extracting insulation and core regions from cable images with complex geometries and indistinct boundaries. However, its application to insulation layer extraction remains insufficiently explored and warrants further investigation.
Furthermore, due to the complex and variable conditions of construction site environments, acquiring high-quality cable images is both challenging and expensive. As a result, the available datasets are typically small in scale and lack diversity, which severely limits the performance and generalization capability of delineation models. To address this issue, generative adversarial networks (GANs) have been widely adopted as an effective data augmentation technique [17,18,19]. However, traditional GANs often suffer from unstable training and low-quality sample generation. The WGAN-GP addresses these problems by introducing a gradient penalty term and optimizing the Wasserstein distance, thereby significantly improving training stability and the quality of generated images [20]. In this study, WGAN-GP is employed to enrich the limited dataset and enhance the generalization performance of the region extraction model.
Accordingly, this study presents a region delineation method that integrates WGAN-GP-based data augmentation with the TransUNet model. The approach targets accurate extraction of 220 kV cable insulation layers while enhancing the model’s adaptability and performance under real-world engineering conditions.

2. Related Work and Methodology

2.1. Dataset Construction

Segmentation of the insulation layer in 220 kV 1 × 2500 single-core high-voltage cables was supported by constructing a representative dataset. Cross-sectional images were captured using an industrial-grade camera (MER2-2000-19U3C, Beijing Daheng Image Vision Co., Ltd., Beijing, China). Following the guidelines of IEEE Std 525–2025 [21] and the surface characteristics of actual cables, the images were classified into four categories, i.e., normal, worn, scratched, and damaged cables, as shown in Figure 1.
Image analysis reveals the following features: Figure 1a shows a normal cable, with an intact insulation structure, well-defined boundaries, and distinct grayscale contrast between the conductor and insulation. Figure 1b displays a worn cable, where fine cracks or rough textures appear on the insulation due to cutting or prolonged use. Figure 1c presents a scratched cable, characterized by deep linear marks and locally uneven grayscale distribution. Figure 1d depicts a damaged cable with irregular morphology and significant insulation loss, particularly in the red-elliptical-highlighted region, indicating severe structural degradation.
These variations in cable condition, often caused by human operation or environmental interference, lead to structural degradation, reduced grayscale contrast, and blurred boundaries between the insulation and inner conductor. Such factors pose significant challenges to the accuracy of region delineation. Traditional image-based partitioning methods are generally less robust. Such methods are prone to over- or under-segmentation between the conductor and the shielding layer and often fail to extract the insulation layer accurately, which affects subsequent parameter analysis.
Given the structural complexity of high-voltage cables, the variability in image quality, and the demand for both high region delineation accuracy and strong model generalization, a deep learning approach is employed. The method aims to provide accurate and robust extraction of the insulation layer and cable core regions.

2.2. WGAN-GP Data Augmentation Model

2.2.1. Principle of WGAN-GP-Based Data Augmentation

In deep learning, deep neural networks have demonstrated strong capabilities in feature extraction and complex pattern recognition and are widely applied across image recognition tasks. However, their performance often depends on access to large-scale, high-quality datasets, which are essential for capturing the underlying data distribution and mitigating overfitting. For 220 kV cable cross-sectional images, data collection is limited by high acquisition costs and complex on-site environments, making it challenging to build large datasets. This constraint hinders model training and reduces generalization performance.
This study employs WGAN-GP to address data scarcity through the generation of realistic and structurally consistent synthetic samples. The augmented dataset enhances the delineation model’s robustness and improves its generalization performance.
WGAN-GP is an adversarial generative framework optimized with the Wasserstein distance. Its key innovation lies in enforcing a Lipschitz continuity constraint on the discriminator, which improves training stability and mitigates vanishing gradients. The Wasserstein distance is defined as
$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right]$$
where $P_r$ and $P_g$ denote the real and generated data distributions, respectively, and $\Pi(P_r, P_g)$ represents the set of all joint distributions with marginals $P_r$ and $P_g$.
WGAN-GP satisfies the Lipschitz constraint required for Wasserstein distance optimization by introducing a gradient penalty (GP) term into the discriminator loss. The penalty term is defined as
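$$L_{GP} = \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$$
where $\hat{x}$ denotes interpolated samples between real and generated data, and $\lambda$ is the penalty coefficient.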
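The penalty term can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: a finite-difference gradient stands in for automatic differentiation, the critics are hypothetical linear functions, and the coefficient value of 10 is an assumption (the text does not state the value of λ used).

```python
import math
import random

def numerical_gradient(critic, x, h=1e-5):
    """Central-difference gradient of a scalar critic at point x (list of floats)."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((critic(up) - critic(down)) / (2 * h))
    return grad

def gradient_penalty(critic, real_batch, fake_batch, lam=10.0, rng=None):
    """Batch average of lam * (||grad D(x_hat)||_2 - 1)^2 over interpolated samples."""
    rng = rng or random.Random(0)
    total = 0.0
    for x_real, x_fake in zip(real_batch, fake_batch):
        eps = rng.random()  # one mixing coefficient per real/fake pair
        x_hat = [eps * r + (1 - eps) * f for r, f in zip(x_real, x_fake)]
        grad = numerical_gradient(critic, x_hat)
        norm = math.sqrt(sum(g * g for g in grad))
        total += (norm - 1.0) ** 2
    return lam * total / len(real_batch)

# Hypothetical linear critic D(x) = w . x, whose gradient equals w everywhere.
w = [0.6, 0.8]  # ||w||_2 = 1, so the penalty is (numerically) zero

def critic_unit(x):
    return sum(wi * xi for wi, xi in zip(w, x))

def critic_steep(x):
    return 3.0 * critic_unit(x)  # gradient norm 3, penalty 10 * (3 - 1)^2 = 40

real = [[1.0, 2.0], [0.5, -1.0]]
fake = [[0.0, 0.0], [2.0, 1.0]]
gp_unit = gradient_penalty(critic_unit, real, fake)
gp_steep = gradient_penalty(critic_steep, real, fake)
```

The two critics show the intended behaviour: a critic whose gradient norm is 1 incurs no penalty, while a steeper critic is penalized quadratically in the deviation from 1.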

2.2.2. Network Architecture of WGAN-GP

The generator G receives a 256-dimensional latent random vector as input and initially projects it into a 4 × 4 × 1024 feature map through a transposed convolutional layer. This is followed by seven residual upsampling blocks, which progressively increase the spatial resolution to 512 × 512 while simultaneously reducing the number of channels to 3. A final convolutional layer outputs an RGB image with a resolution of 512 × 512 × 3.
The discriminator D then receives the generated images together with real samples for adversarial learning. The discriminator consists of six downsampling convolutional blocks and a MiniBatch standard deviation module. It terminates with a convolutional layer with an 8 × 8 kernel to produce a scalar output representing the global authenticity score. During training, spectral normalization and the GP mechanism are applied to enforce Lipschitz continuity, thereby stabilizing the adversarial learning process. The architecture of the WGAN-GP model is illustrated in Figure 2.
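The resolution bookkeeping of the two networks can be checked with a few lines of arithmetic. The sketch below assumes that every upsampling or downsampling block changes the spatial resolution by a factor of 2, which the text implies but does not state explicitly.

```python
def upsample_sizes(start, blocks, factor=2):
    """Spatial size after each upsampling block, starting from `start`."""
    sizes = [start]
    for _ in range(blocks):
        sizes.append(sizes[-1] * factor)
    return sizes

def downsample_sizes(start, blocks, factor=2):
    """Spatial size after each downsampling block, starting from `start`."""
    sizes = [start]
    for _ in range(blocks):
        sizes.append(sizes[-1] // factor)
    return sizes

gen_sizes = upsample_sizes(4, 7)        # generator: 4 x 4 seed, seven upsampling blocks
disc_sizes = downsample_sizes(512, 6)   # discriminator: six downsampling blocks
```

Under this assumption the generator reaches 512 × 512 after seven blocks, and the discriminator feature map is 8 × 8 after six blocks, which matches the 8 × 8 kernel of its final convolutional layer.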

2.3. TransUNet Segmentation Model

In recent years, the Vision Transformer (ViT) has demonstrated outstanding performance in image segmentation tasks. The TransUNet model adopted in this study combines the complementary strengths of CNNs and Transformers. Its encoder features a hybrid architecture that integrates CNN layers with ViT modules, enabling effective contextual modeling of both local and global features. The decoder is fully convolutional and employs skip connections along with cascaded upsampling to progressively restore spatial resolution, thereby achieving precise delineation of target regions. The architecture of the TransUNet model is illustrated in Figure 3.
In this network, the encoder first applies a CNN-based downsampling operation to extract high-level semantic feature maps from the input image. The resulting feature map is then divided into a series of non-overlapping patches of size P × P, and each patch undergoes a patch embedding process. The vectorized representation $x_p$ is projected into a D-dimensional latent space via a learnable linear projection. To retain positional information, positional encodings are added during the embedding process, as defined in Equation (3):
$$z_0 = \left[x_p^1 E;\ x_p^2 E;\ \cdots;\ x_p^N E\right] + E_{pos}$$
where $x_p^i$ denotes the vector representation of the i-th patch, $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is the learnable projection matrix, and $E_{pos} \in \mathbb{R}^{N \times D}$ is the positional encoding, with one row per patch. During tokenization, the input of size H × W × C, where H, W, and C denote the height, width, and number of channels, respectively, is divided into a total of $N = HW/P^2$ patches.
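The tokenization step can be sketched in plain Python. The example is illustrative only: the tiny feature map, its values, and the function name `patchify` are hypothetical, and the learnable projection and positional encoding are omitted.

```python
def patchify(feature_map, patch):
    """Split an H x W x C nested list into non-overlapping patch vectors.

    Each returned vector has length patch * patch * C, and the number of
    vectors is N = H * W / patch**2.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % patch or width % patch:
        raise ValueError("H and W must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    vec.extend(feature_map[r][c])  # append all C channel values
            tokens.append(vec)
    return tokens

H, W, C, P = 4, 6, 2, 2
# Hypothetical feature map whose values encode (row, column, channel).
fmap = [[[r * 100 + c * 10 + ch for ch in range(C)] for c in range(W)] for r in range(H)]
tokens = patchify(fmap, P)
```

With H = 4, W = 6, and P = 2, the sketch yields N = 6 tokens of length P²·C = 8, in agreement with N = HW/P².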
The Transformer encoder comprises multiple layers of Multi-Head Self-Attention (MSA) mechanisms and Multi-Layer Perceptron (MLP) modules. The output of the l-th Transformer layer is computed as follows:
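$$z'_l = \mathrm{MSA}\left(\mathrm{LN}(z_{l-1})\right) + z_{l-1}$$
$$z_l = \mathrm{MLP}\left(\mathrm{LN}(z'_l)\right) + z'_l$$
where LN denotes Layer Normalization, $z'_l$ is the intermediate output of the MSA sub-layer, and $z_l$ represents the encoded image features at the l-th layer. The structure of a Transformer layer is illustrated in Figure 4.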
In the Transformer encoder, the MSA module learns to propagate and route important features among the image patches generated during tokenization [22]. The inputs to the self-attention mechanism include the query (Q), key (K), and value (V) matrices. The attention computation is defined as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
where $1/\sqrt{d_k}$ is a scaling factor used for normalization, and $d_k$ is the dimensionality of the key vectors.
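A minimal single-head sketch of this computation is given below in plain Python. The query, key, and value matrices are hypothetical toy values, and the multi-head split and learned projections of the full MSA module are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs, weights = [], []
    for q in Q:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        w = softmax(scores)  # one weight per key, summing to 1
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Toy example: two queries attending over three key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = attention(Q, K, V)
```

Each output row is a weighted average of the value vectors, so every token can draw on information from all other tokens regardless of spatial distance.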
In the TransUNet architecture, the decoder employs a Cascaded Upsampler (CUP) to convert the model’s latent features into the final delineation output while preserving the original image resolution. After reconstructing and rearranging the hidden features $z_l \in \mathbb{R}^{\frac{HW}{P^2} \times D}$, their shape is transformed to $\frac{H}{P} \times \frac{W}{P} \times D$. These features are then progressively restored to the spatial dimensions of H × W × D through a sequence of upsampling modules, each consisting of a 2× upsampling operation, a 3 × 3 convolutional layer, and a ReLU activation function.
The encoder of TransUNet comprises four convolutional blocks designed to extract hierarchical feature representations within the U-shaped architecture. Each block includes two successive convolutional layers, each followed by batch normalization and a ReLU activation, and concludes with a max pooling operation. The numbers of convolutional filters in the four downsampling stages are 64, 128, 256, and 512, respectively. The CNN-extracted feature maps are subsequently partitioned into patches and embedded before being forwarded to a 12-layer Transformer module, which captures long-range dependencies among the embedded tokens.
The decoder consists of four convolutional blocks and a final output convolutional layer. Each decoding block includes a 2× upsampling operation followed by two 3 × 3 convolutional layers. To preserve fine-grained spatial information, skip connections are adopted to fuse encoder and decoder feature maps by concatenating them along the channel dimension. This design significantly enhances the decoder’s ability to recover structural details during upsampling.

3. Experiments and Results

3.1. Experimental Platform and Evaluation Methods

3.1.1. Hardware and Software Configuration

The experiments used the PyTorch (version 2.4) deep learning framework in a GPU-accelerated environment. The hardware setup included an NVIDIA GeForce RTX 3090 GPU (24 GB VRAM) (NVIDIA Corporation, Santa Clara, CA, USA) and an Intel Xeon E5-2680 V4 processor (Intel Corporation, Santa Clara, CA, USA) running at 2.40 GHz (14 cores, 12 threads, 32 GB RAM). The software environment comprised Ubuntu 22.04, CUDA 11.8, and Python 3.10. This configuration ensured efficient training and stable inference of deep learning models.

3.1.2. Data Acquisition and Annotation

In this study, a total of 200 cross-sectional images of 220 kV single-core high-voltage cables were collected, covering four categories: intact, worn, scratched, and damaged cables. The images were acquired under varying illumination intensities and viewing angles with a resolution of 512 × 512 pixels, aiming to enhance the adaptability of the model to real-world application scenarios. Due to the difficulty and high cost of data acquisition, only 200 real images were available for training, which limits the model’s generalization. To mitigate the limitations associated with the small dataset, the WGAN-GP method was employed for data augmentation, which effectively expanded the training set and improved the model’s robustness.
The annotation process was carried out using the Labelme tool (Kentaro Wada, Tokyo, Japan). The target regions in each image were manually labeled, generating annotation files in JSON format. These files were then batch-converted into PNG-format segmentation masks using custom scripts, which served as the ground truth for supervised training, as illustrated in Figure 5.

3.1.3. Evaluation Metrics for Insulation Layer Segmentation

This study formulates the segmentation task as a binary pixel-wise classification problem, where pixels belonging to the cable core are labeled as positive samples, and all other pixels are treated as negative. The predicted delineation mask is first normalized with a Sigmoid activation function and then binarized with a threshold of 0.5 to generate the final prediction. Let Y denote the ground truth mask and N the total number of pixels in the image. A binary confusion matrix is constructed at the pixel level with the following definitions:
  • True Positive (TP): Both the predicted and ground truth labels are positive, indicating correct identification of cable core pixels;
  • True Negative (TN): Both the predicted and ground truth labels are negative, indicating correct identification of background pixels;
  • False Positive (FP): A background pixel is incorrectly predicted as cable core;
  • False Negative (FN): A cable core pixel is incorrectly predicted as background.
Delineation performance was comprehensively evaluated using four pixel-level metrics. Final results are reported as the mean values across all test images:
  • Mean Dice Coefficient (mDice): This measures the overlap between the predicted and ground truth cable core regions. It is defined as
$$\mathrm{mDice} = \frac{1}{k} \sum_{i=1}^{k} \frac{2\,TP_i}{2\,TP_i + FP_i + FN_i}$$
  • Mean Intersection over Union (mIoU): This represents the ratio of the intersection to the union between predicted and ground truth regions:
$$\mathrm{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i + FN_i}$$
  • Mean Precision (MP): This indicates the proportion of correctly predicted core pixels among all predicted core pixels:
$$\mathrm{MP} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i}$$
  • Mean Recall (mRecall): This represents the proportion of correctly predicted core pixels among all actual core pixels:
$$\mathrm{mRecall} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FN_i}$$
where $k$ denotes the number of test images and $TP_i$, $FP_i$, and $FN_i$ are the pixel counts for the i-th image.
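The evaluation pipeline described in this subsection (Sigmoid, thresholding at 0.5, pixel-level confusion counts, and averaging over images) can be sketched as follows. The logit maps and masks are hypothetical toy data; treating a probability of exactly 0.5 as positive and returning 0 for an empty denominator are assumptions not specified in the text.

```python
import math

def binarize(logits, threshold=0.5):
    """Apply a Sigmoid to each logit and threshold the result (>= is an assumption)."""
    return [[1 if 1.0 / (1.0 + math.exp(-v)) >= threshold else 0 for v in row]
            for row in logits]

def confusion(pred, truth):
    """Pixel-level TP, FP, FN, TN counts for two binary masks of equal shape."""
    tp = fp = fn = tn = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def safe_div(num, den):
    """Return 0.0 for an empty denominator (assumed convention)."""
    return num / den if den else 0.0

def image_metrics(pred, truth):
    tp, fp, fn, _tn = confusion(pred, truth)
    return {
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "iou": safe_div(tp, tp + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
    }

def mean_metrics(logit_maps, truths):
    """Average each per-image metric over all k images."""
    per_image = [image_metrics(binarize(l), t) for l, t in zip(logit_maps, truths)]
    k = len(per_image)
    return {name: sum(m[name] for m in per_image) / k for name in per_image[0]}

# Two tiny hypothetical "images": one imperfect prediction, one perfect.
logit_maps = [[[2.0, -1.0, 3.0], [-2.0, 0.5, -3.0]], [[4.0, -4.0]]]
truths = [[[1, 0, 0], [0, 1, 1]], [[1, 0]]]
scores = mean_metrics(logit_maps, truths)
```

In the first toy image TP = 2, FP = 1, and FN = 1, giving a Dice of 2/3 and an IoU of 0.5; the second image is predicted perfectly, so the mean IoU is 0.75.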

3.1.4. Image Generation Quality Metrics

The consistency between generated and real images was evaluated based on visual quality and structural fidelity using four objective metrics commonly applied in image generation tasks: Fréchet Inception Distance (FID), Learned Perceptual Image Patch Similarity (LPIPS), Structural Similarity Index Measure (SSIM), and Peak Signal-to-Noise Ratio (PSNR). These metrics quantify image realism and perceptual quality from multiple perspectives, including perceptual similarity, structural consistency, and pixel-level accuracy. Definitions of these metrics are provided below.
  • Fréchet Inception Distance (FID): FID measures the difference in distribution between generated and real images in a high-dimensional feature space. It is computed based on features extracted by the Inception network, which are assumed to follow multivariate Gaussian distributions. The Fréchet distance between these two distributions is given by
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$
where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denote the mean and covariance of real and generated image features, respectively.
  • Learned Perceptual Image Patch Similarity (LPIPS): LPIPS is a perceptual similarity metric that leverages deep features to mimic human sensitivity to structural and textural differences. It is defined as
$$\mathrm{LPIPS}(x, y) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \left\lVert w_l \odot \left(\hat{f}^{\,l}_{x}(h, w) - \hat{f}^{\,l}_{y}(h, w)\right) \right\rVert_2^2$$
where $\hat{f}^{\,l}_{x}$ and $\hat{f}^{\,l}_{y}$ are the normalized features of images x and y at the l-th layer, $w_l$ denotes the learnable weights for each channel, and ⊙ represents element-wise multiplication.
  • Structural Similarity Index Measure (SSIM): SSIM evaluates the structural similarity between two images by comparing luminance, contrast, and structure. It is computed as
$$\mathrm{SSIM}(x, y) = \frac{\left(2 \mu_x \mu_y + C_1\right)\left(2 \sigma_{xy} + C_2\right)}{\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)}$$
where $\mu_x$ and $\mu_y$ represent the local means of the two images, $\sigma_x^2$ and $\sigma_y^2$ denote their local variances, and $\sigma_{xy}$ is the local covariance. $C_1$ and $C_2$ are constants used for numerical stability.
  • Peak Signal-to-Noise Ratio (PSNR): PSNR quantifies the reconstruction accuracy of generated images by measuring the pixel-level error with respect to the reference image. It is defined as
$$\mathrm{PSNR}(x, y) = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)$$
where MAX is the maximum possible pixel value, and MSE is the mean squared error between the generated and real images.
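The two pixel-level metrics can be sketched directly from their definitions. The example below is a simplification, not the evaluation code used in the study: SSIM is computed over a single window covering the whole image rather than over local windows, the constants follow the commonly used choice C1 = (0.01·MAX)² and C2 = (0.03·MAX)² (an assumption, since the text does not give their values), and the 2 × 2 grayscale images are hypothetical.

```python
import math

def flatten(img):
    """Flatten a 2D grayscale image (nested lists) into a list of floats."""
    return [float(v) for row in img for v in row]

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    a, b = flatten(x), flatten(y)
    mse = sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=255.0, k1=0.01, k2=0.03):
    """SSIM over one window spanning the whole image (simplified variant)."""
    a, b = flatten(x), flatten(y)
    n = len(a)
    mu_x, mu_y = sum(a) / n, sum(b) / n
    var_x = sum((p - mu_x) ** 2 for p in a) / n
    var_y = sum((q - mu_y) ** 2 for q in b) / n
    cov = sum((p - mu_x) * (q - mu_y) for p, q in zip(a, b)) / n
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

real = [[0, 255], [255, 0]]
generated = [[10, 245], [245, 10]]  # every pixel is off by 10 gray levels
psnr_value = psnr(real, generated)
ssim_value = ssim_global(real, generated)
```

For the toy pair, the MSE is 100, so the PSNR equals 10·log10(65025/100). Higher PSNR and SSIM values indicate closer agreement with the reference image, whereas lower FID and LPIPS values are better.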

3.2. Data Augmentation Strategies

This study applies two data augmentation strategies to enhance the generalization and robustness of the segmentation model: (1) conventional techniques such as horizontal and vertical flipping, brightness and contrast adjustment, and Gaussian blurring, and (2) WGAN-GP-based augmentation, which generates structurally and semantically consistent synthetic samples using a generative adversarial network to increase data diversity and coverage. The original dataset contains 200 images, and both methods expand it to a total of 800 images.
In this study, the generative framework augments four categories of cable cross-sectional images, each initially containing 50 samples with a resolution of 512 × 512. For each category, it generates 150 additional images, expanding each category to 200 images. This yields 600 synthetic samples and a combined dataset of 800 images. This augmentation strategy significantly improves data quantity and diversity, thereby enhancing the robustness and generalization of the delineation model.
Figure 6 illustrates the progressive refinement of normal cable images synthesized by the proposed augmentation across different training epochs. As training advances, the images evolve from blurry and unstructured forms into clear cable cross-sections with well-defined textures and boundaries. This progression demonstrates the model’s improved capacity to capture structural features and recover fine details, indicating both stable training and strong generative ability. The resulting high-quality synthetic images provide valuable data for downstream delineation tasks.
Preliminary experiments compared three commonly used optimizers in WGAN-GP (Adam, RMSprop, and SGD) to ensure training stability and image generation consistency. A range of learning rates was explored separately for the generator (G) and discriminator (D). Adam demonstrated superior performance in both training stability and image quality. When the β parameters were set to (0.5, 0.9), it effectively reduced gradient oscillations and improved boundary detail modeling. In addition, other hyperparameters—including the number of training epochs, n-critic, batch size, latent vector dimension, random seed, input resolution, and pixel normalization range—were determined through ablation studies and comparative analysis. These settings were selected by jointly considering adversarial training stability and downstream segmentation performance. Based on these findings, Adam was selected as the optimizer, and asymmetric learning rates were applied to G and D to enhance training stability.
The final configuration of the generative model is as follows: the learning rates of G and D were set to 0.0001 and 0.00005, respectively, with β parameters of (0.5, 0.9). The model was trained for 500 epochs using an n-critic setting of 3, meaning the discriminator was updated three times per generator update. The batch size was 8, the latent space dimension was 256, and the random seed was fixed at 42. Input and output images were resized to 512 × 512 and normalized to the range [−1, 1]. This configuration effectively mitigated discriminator overfitting and enhanced the generator’s capacity to capture structural and edge information. As a result, it improved the clarity of generated images and contributed to more stable and accurate training of the subsequent TransUNet delineation model.
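The reported settings can be collected into a small configuration object, together with the update order implied by the n-critic setting. This is a sketch under stated assumptions: the class and function names are hypothetical, and the schedule assumes one discriminator update per mini-batch with a generator update after every third one, which is one common way to realize n-critic = 3.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WGANGPConfig:
    """Hyperparameter values as reported in the text; field names are illustrative."""
    lr_generator: float = 1e-4
    lr_discriminator: float = 5e-5
    betas: tuple = (0.5, 0.9)
    epochs: int = 500
    n_critic: int = 3
    batch_size: int = 8
    latent_dim: int = 256
    seed: int = 42
    image_size: int = 512

def update_schedule(num_batches, n_critic):
    """Yield 'D' for every batch and 'G' after every n_critic discriminator updates."""
    for step in range(1, num_batches + 1):
        yield "D"
        if step % n_critic == 0:
            yield "G"

def normalize_pixel(value):
    """Map an 8-bit pixel value from [0, 255] to [-1, 1]."""
    return value / 127.5 - 1.0

cfg = WGANGPConfig()
schedule = list(update_schedule(num_batches=6, n_critic=cfg.n_critic))
```

For six mini-batches the schedule is D, D, D, G, D, D, D, G, so the discriminator receives three updates for each generator update.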
During the experiments, some intermediate images generated by the framework exhibited artifacts, including texture blurring, edge discontinuities, local noise (e.g., black spots), and partial detail distortions. These artifacts commonly occurred in the early training iterations, mainly due to the generator’s insufficient learning of the real data distribution. As training progressed and both the generator and discriminator improved, such artifacts largely disappeared in the final synthetic images, whose overall quality became stable and increasingly consistent with real data. To ensure the validity and reliability of the augmented dataset, the generated images were subjected to manual inspection and subsequent experimental evaluation. The results confirmed that the augmented data did not introduce noticeable distributional bias or negatively affect delineation performance; on the contrary, it contributed to enhancing the robustness and generalization capability of the delineation model.

3.3. Segmentation Model Training and Loss Function

The evaluation of different data augmentation strategies was conducted by dividing the original, conventionally augmented, and WGAN-GP augmented datasets into training, validation, and test sets with an 8:1:1 ratio. This configuration provided sufficient data for learning and ensured reliable performance assessment.
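An 8:1:1 split can be sketched as follows. The file names, the shuffling step, and the seed value used for the split are assumptions; the text specifies only the ratio.

```python
import random

def split_dataset(items, ratios=(8, 1, 1), seed=42):
    """Shuffle reproducibly and split into train/validation/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

names = [f"cable_{i:04d}.png" for i in range(800)]  # hypothetical file names
train, val, test = split_dataset(names)
```

For an 800-image dataset this gives 640 training, 80 validation, and 80 test images; for the original 200 images the same ratio gives 160, 20, and 20.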
The Dice loss function was used during TransUNet training to improve segmentation accuracy in foreground regions and mitigate class imbalance. It measures mask quality by evaluating the normalized overlap between the predicted mask and the ground truth label. The formulation is as follows:
$$L_{Dice} = 1 - \frac{2 \sum_i \hat{y}_i y_i + \varepsilon}{\sum_i \hat{y}_i + \sum_i y_i + \varepsilon}$$
where $\hat{y}_i$ denotes the predicted probability for pixel i, $y_i$ is the corresponding ground truth label, and $\varepsilon$ is a small constant to prevent division by zero. Compared with traditional binary cross-entropy loss, the Dice loss is more effective for imbalanced delineation tasks with small foreground regions, as it provides a more accurate measure of the overlap between the predicted and ground truth masks.
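A direct transcription of the Dice loss into plain Python is shown below for flattened masks. The value of ε and the toy inputs are assumptions chosen for illustration.

```python
def dice_loss(pred_probs, targets, eps=1e-6):
    """Soft Dice loss for flattened predictions (probabilities) and binary labels."""
    intersection = sum(p * t for p, t in zip(pred_probs, targets))
    denominator = sum(pred_probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

perfect = dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])     # full overlap
disjoint = dice_loss([1.0, 0.0], [0, 1])                    # no overlap
uncertain = dice_loss([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])   # uniform 0.5 prediction
```

The loss is 0 for a perfect prediction, approaches 1 when the masks do not overlap, and is about 0.5 for the uniformly uncertain prediction. Because the loss depends only on the overlap ratio, a large background region does not dominate it.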
A systematic search identified the optimal training configuration for TransUNet. This process evaluated three commonly used optimizers in pixel-wise labeling tasks: Stochastic Gradient Descent (SGD), Adam, and AdamW. The search also examined a range of learning rates and weight decay values, along with adjustments to training epochs and batch size. Convergence speed, final performance, and overfitting risk served as the main evaluation criteria. Experimental results showed that SGD offered better robustness in terms of training stability and boundary detail modeling compared with adaptive-moment optimizers such as Adam and AdamW. Based on these findings, SGD was selected as the final optimizer for TransUNet.
The finalized training configuration consisted of 500 epochs, a batch size of 4, an initial learning rate of 0.0001, and a weight decay coefficient of 0.0001. An early stopping strategy was applied to enhance training efficiency and reduce the risk of overfitting. Training was terminated automatically when the validation loss did not improve over 25 consecutive epochs. For comparative analysis of different data augmentation strategies, TransUNet was trained on the original dataset, a standard augmented dataset, and a dataset augmented using WGAN-GP. The corresponding training loss curves are illustrated in Figure 7.
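The early stopping rule can be sketched as a small helper that tracks the best validation loss. The class name, the `min_delta` parameter, and the synthetic loss sequence are hypothetical; only the patience of 25 epochs comes from the text.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=25, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def run(val_losses, patience=25):
    """Return the epoch at which training stops and the best loss seen."""
    stopper = EarlyStopping(patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch, stopper.best
    return len(val_losses), stopper.best

# Synthetic curve: improves for 10 epochs, then plateaus at a worse value.
losses = [1.0 / e for e in range(1, 11)] + [0.2] * 40
stopped_epoch, best = run(losses)
```

In this synthetic run the last improvement occurs at epoch 10, so training stops at epoch 35, after 25 consecutive epochs without improvement.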
The results demonstrate that the choice of data augmentation strategy significantly affects model performance in the cable image segmentation task. When trained on the original dataset, the minimum smoothed validation loss was 0.0152—considerably higher than that achieved with the augmented datasets. The loss curve showed a slower descent and noticeable fluctuations, indicating limited feature learning and generalization capability. With conventional augmentation, the validation loss decreased to 0.0064, representing a 57.9% reduction relative to the original dataset. This confirms the effectiveness of basic image transformations in enhancing model robustness and generalization. Building on this, the WGAN-GP augmentation strategy further reduced the validation loss to 0.0062—a 59.2% reduction compared to the original dataset. Although the final loss was slightly higher than that of the conventionally augmented dataset, this method achieved faster convergence, with an earlier and more stable decline in the loss curve. Particularly within the first 100 epochs, it significantly suppressed the validation loss, reflecting superior early-stage learning efficiency.
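The reported percentage reductions follow from the minimum validation losses quoted above, as this short check shows (the helper name is illustrative).

```python
def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline * 100.0

original, conventional, wgan_gp = 0.0152, 0.0064, 0.0062
reduction_conventional = relative_reduction(original, conventional)
reduction_wgan_gp = relative_reduction(original, wgan_gp)
```

Rounded to one decimal place, the two values are 57.9% and 59.2%, matching the figures in the text.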
Additionally, images generated by the proposed augmentation exhibited greater structural diversity and finer edge details, effectively compensating for the original dataset’s limitations in complex structural regions. This contributed to improved delineation performance on heterogeneous image distributions. Considering convergence speed, generalization performance, and training stability, the method demonstrates broader applicability and greater practical value within the context of this study.
Assessment of the impact of different data augmentation strategies on cable image delineation performance involved training the TransUNet model separately on three types of datasets, all under a consistent architecture and training configuration. Model performance was evaluated on the test set, and the results are presented in Figure 8.
The experimental results indicate that the proposed augmentation outperformed the other strategies across all evaluation metrics. Specifically, the mDice score reached 0.9835, representing a 1.32% improvement over the original dataset. The most significant improvement was observed in mIoU, which increased from 0.9516 to 0.9677, an absolute gain of 0.0161 (1.61 percentage points, or about 1.69% in relative terms). In addition, precision and recall improved to 0.9840 and 0.9831, respectively, demonstrating the model’s ability to accurately identify core pixels while minimizing false positives and false negatives.
By contrast, the conventional augmentation strategy led to moderate improvements over the original dataset but offered limited overall enhancement. Notably, its mRecall remained lower than that achieved with the proposed augmentation. These findings suggest that conventional augmentation techniques are less effective for complex structural segmentation tasks.
In comparison, the proposed strategy generated semantically consistent and structurally diverse training samples, effectively compensating for the original dataset’s limitations in complex regions. This led to substantial improvements in both delineation accuracy and model generalization.

3.4. Ablation Study

This study conducts two ablation experiments under identical training conditions to evaluate the contribution of key components in the WGAN-GP architecture to image generation quality and segmentation performance. In the first setting, both the Wasserstein distance and the gradient penalty (GP) were removed, yielding a standard GAN. In the second, the Wasserstein distance was retained while the GP was excluded, resulting in a WGAN. Comparing these variants clarifies the individual roles of each component. Table 1 summarizes the image quality metrics for the original framework and its ablated versions.
As shown in Table 1, the full WGAN-GP model achieves the best performance across all metrics. Its FID and LPIPS values of 33.5571 and 0.1796 are markedly lower (better) than those of GAN (47.8234 and 0.2371) and WGAN (42.3096 and 0.2214). Its SSIM and PSNR values of 0.6822 and 21.1608 are also the highest, indicating superior fidelity and sharpness and highlighting the critical role of the Wasserstein distance and gradient penalty in enhancing the quality of generated images.
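Of the four image quality metrics, PSNR is the simplest to compute by hand. The sketch below shows the standard definition in plain Python, assuming 8-bit grayscale images stored as nested lists; the sample pixel values are hypothetical. FID and LPIPS rely on pretrained networks and are not sketched here.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_errors = [
        (r - g) ** 2
        for ref_row, gen_row in zip(reference, generated)
        for r, g in zip(ref_row, gen_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 images whose pixels differ by 10 gray levels everywhere.
reference_image = [[100, 150], [200, 250]]
generated_image = [[110, 140], [190, 240]]

psnr_db = psnr(reference_image, generated_image)
print(f"PSNR = {psnr_db:.2f} dB")
```

A higher PSNR means a smaller mean squared error relative to the dynamic range, which is why the largest PSNR in Table 1 is read as the best result.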
The impact of generated data on downstream semantic segmentation was further examined by training a TransUNet-based model with augmented samples from each generative architecture. Delineation performance on the test set is reported in Table 2.
Table 2 summarizes the segmentation performance of different architectures. The model trained with GAN-generated samples performs the worst, with mDice and mIoU scores of 0.9632 and 0.9372, respectively. This result indicates that the standard GAN still struggles with structural inconsistencies in generated images. Adding the Wasserstein distance in WGAN improves delineation performance, increasing the mDice to 0.9764. With the gradient penalty further incorporated, the proposed model achieves the highest scores across all mask-based metrics. These results confirm the combined contribution of the Wasserstein distance and gradient penalty to both data augmentation quality and downstream delineation performance.
In summary, the Wasserstein distance and gradient penalty in WGAN-GP both play essential roles in enhancing the fidelity of generated images and improving downstream delineation performance. Their combined effect significantly promotes perceptual consistency and structural completeness, affirming the effectiveness and sound design of the architecture.
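To make the two ablated components concrete, the sketch below evaluates a WGAN-GP style critic loss in plain Python: the Wasserstein term (mean critic score on generated samples minus mean score on real samples) plus a gradient penalty on random interpolates. It is an illustration under stated assumptions, not the training code used in this study: the critic is a toy function tanh(w · x) with an analytic gradient standing in for a neural network, and the penalty weight of 10 is the default from Gulrajani et al. [20] rather than a value reported here.

```python
import math
import random


def critic(x, w):
    """Toy critic f(x) = tanh(w . x); a stand-in for a neural critic."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def critic_grad(x, w):
    """Analytic gradient of the toy critic with respect to its input x."""
    scale = 1.0 - critic(x, w) ** 2
    return [scale * wi for wi in w]


def gradient_penalty(real_batch, fake_batch, w, lam=10.0, rng=random):
    """Mean of lam * (||grad f(x_hat)||_2 - 1)^2 over random interpolates."""
    total = 0.0
    for real, fake in zip(real_batch, fake_batch):
        eps = rng.random()  # interpolation coefficient in [0, 1)
        x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]
        grad_norm = math.sqrt(sum(g * g for g in critic_grad(x_hat, w)))
        total += lam * (grad_norm - 1.0) ** 2
    return total / len(real_batch)


def critic_loss(real_batch, fake_batch, w, lam=10.0, rng=random):
    """Wasserstein term E[f(fake)] - E[f(real)] plus the gradient penalty."""
    mean_fake = sum(critic(x, w) for x in fake_batch) / len(fake_batch)
    mean_real = sum(critic(x, w) for x in real_batch) / len(real_batch)
    penalty = gradient_penalty(real_batch, fake_batch, w, lam, rng)
    return mean_fake - mean_real + penalty


# Hypothetical 2-D samples and critic weights, for illustration only.
real_samples = [[1.0, 2.0], [0.5, 1.5]]
fake_samples = [[0.0, 1.0], [0.2, 0.1]]
weights = [0.3, -0.2]

loss_with_gp = critic_loss(real_samples, fake_samples, weights, rng=random.Random(0))
loss_without_gp = critic_loss(real_samples, fake_samples, weights, lam=0.0)
print(loss_with_gp, loss_without_gp)
```

Setting `lam=0.0` drops the penalty and leaves only the Wasserstein term, which loosely mirrors the WGAN variant in the ablation; how that variant enforced the Lipschitz constraint is not specified in this section.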

3.5. Comparative Experiments

The effectiveness of the proposed delineation approach was evaluated by comparing it with three representative semantic segmentation models: UNet, Swin-UNet, and Attention-UNet. All models were trained and tested on the WGAN-GP augmented dataset using identical training configurations. Their delineation performance on the test set is illustrated in Figure 9.
Figure 9 presents the performance metrics and inference speed of different models on the test set. The TransUNet model adopted in this study achieved the best performance across all evaluation metrics, demonstrating high segmentation accuracy and stability. Specifically, mDice and mIoU reached 0.9835 and 0.9677, representing improvements of approximately 2.03% and 3.05% over UNet. MP and mRecall reached 0.9840 and 0.9831, indicating higher delineation precision and consistency. Attention-UNet, which incorporates attention mechanisms, and Swin-UNet, which employs a Transformer encoder, both achieved significant gains over UNet but still fell short of TransUNet in global context modeling and boundary detail preservation.
This study further assessed deployment suitability by comparing inference efficiency, including average inference time (ms/image), frames per second (FPS), and parameter count. TransUNet delivered the highest delineation accuracy, with 105.5 M parameters, an inference time of 38.6 ms per image, and 25.9 FPS. Although slower than the lightweight UNet, its superior accuracy and robustness highlight its value in high-precision applications. Swin-UNet and Attention-UNet achieved a more balanced trade-off between performance and complexity. Weighing accuracy against efficiency, this study adopts TransUNet as the primary model because it is the better fit for applications that require high delineation quality.
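The reported frame rate is consistent with the reported latency, since FPS is the reciprocal of the per-image inference time. The sketch below shows the conversion together with a simple timing loop; `mean_latency_ms` and its `infer` callable are hypothetical helpers, and the measurement protocol actually used in this study (hardware, batch size, warm-up) is not described in this section.

```python
import time


def fps_from_latency(ms_per_image: float) -> float:
    """Convert an average per-image latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_image


def mean_latency_ms(infer, images, warmup=2):
    """Average wall-clock time per call of `infer`, in milliseconds."""
    for image in images[:warmup]:
        infer(image)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for image in images:
        infer(image)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / len(images)


# Latency reported in the text for TransUNet.
transunet_fps = fps_from_latency(38.6)
print(f"{transunet_fps:.1f} FPS")

# Usage with a placeholder "model" (hypothetical): any callable can be timed.
dummy_latency = mean_latency_ms(lambda image: sum(image), [[1, 2, 3]] * 10)
```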
All three improved models outperformed the baseline UNet, further confirming the effectiveness of deep feature extraction structures and attention mechanisms in enhancing delineation performance. Representative delineation results on cable cross-sectional images are shown in Figure 10.
Compared to UNet, Swin-UNet, and Attention-UNet, the proposed TransUNet integrates a Transformer encoder into a conventional encoder–decoder framework, enabling efficient fusion of local convolutional features and global contextual information. By introducing a global self-attention mechanism, the model’s ability to capture long-range dependencies and preserve boundary continuity is significantly enhanced, thereby improving segmentation accuracy in regions with complex morphology or blurred edges. In addition, incorporating WGAN-GP–generated augmented samples effectively expanded the training dataset, further enhancing the model’s robustness and generalization capability. Representative delineation results of the proposed method on 220 kV cable insulation layers are presented in Figure 11.
Although the proposed method demonstrates strong segmentation performance on typical 220 kV cable insulation defect images, challenges remain under extreme conditions. Severe insulation damage, distorted textures, blurred edges, or strong background interference can impair the model’s structural perception, reducing mask accuracy. Prior studies have shown that deep learning models heavily rely on image structural integrity for cable fault detection. When image quality degrades or feature expression weakens, model performance tends to decline. Reference [23] reports that model generalization and stability are often limited under sample distribution shifts or complex defect patterns. Reference [24] further indicates that convolutional neural networks exhibit high sensitivity to image structures, with decreased accuracy in detecting blurry or aliased boundaries in medium-voltage cable images. To enhance robustness and applicability under abnormal conditions, future work may explore the integration of physical priors, incorporation of multimodal data (e.g., infrared thermography and electrical signal features), or the development of task-adaptive mechanisms tailored for complex defects. These strategies aim to improve the model’s adaptability and generalization under extreme scenarios.
During long-term operation of power systems, 220 kV cable insulation layers are frequently subjected to thermal aging, electrical stress concentration, and partial discharge. These factors often lead to nonlinear structural degradation and complex texture variations. In imaging, such deterioration manifests as boundary fractures, blurred textures, and increased background noise, which significantly complicate pixel-wise labeling tasks and demand higher structural perception and discrimination capabilities from deep learning models. Although the proposed method was validated primarily on typical defect samples, theoretical analysis suggests that severe structural degradation may limit the model’s ability to extract fine details and model global semantics, thereby affecting pixel-wise accuracy. Future research may consider incorporating simulated or measured partial discharge images, alongside multimodal data such as infrared thermograms and electrical signal features, to expand the training data coverage and enhance the model’s capacity to recognize complex fault patterns [25].

4. Conclusions

This study proposes an automated segmentation method for 220 kV cable insulation images by integrating WGAN-GP-based data augmentation with the TransUNet delineation model. The approach effectively addresses practical challenges such as limited sample availability, high data acquisition costs, and blurred boundaries. To mitigate the impact of insufficient training samples, three datasets (original, conventionally augmented, and WGAN-GP augmented) were constructed and evaluated under identical training configurations. The results demonstrate that the generated images exhibit high structural fidelity and semantic consistency, significantly enhancing delineation performance. Ablation studies further confirm the critical role of the Wasserstein distance and gradient penalty in improving image quality and training stability.
The delineation model combines the local detail extraction capability of CNNs with the long-range contextual modeling strength of Transformers, offering improved adaptability in complex structures and ambiguous boundary regions. Supported by WGAN-GP-augmented data, TransUNet achieves mDice, mIoU, MP, and mRecall scores of 0.9835, 0.9677, 0.9840, and 0.9831, respectively, on the test set, outperforming UNet, Swin-UNet, and Attention-UNet. These results validate the effectiveness of the proposed method in cable image segmentation tasks.
Beyond accuracy improvement, the method provides a quantifiable and deployable solution for insulation quality assessment during intermediate joint construction. Based on high-quality mask outcomes, critical diagnostic metrics were developed, including insulation thickness distribution and deviation, edge damage, and burr detection. These metrics support evaluation of processing quality, structural integrity, and potential breakdown risks, enabling quality control and early defect identification in field applications.
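As an illustration of how a thickness distribution and its deviation might be derived from a segmentation mask, the sketch below casts rays from an assumed centre point and counts insulation pixels along each ray. This is a hypothetical procedure written for this example, not the measurement method used in the study; the function names, the ray count, the sampling step, and the synthetic ring are all assumptions, and results are in pixels rather than physical units.

```python
import math


def radial_thickness(mask, center, n_angles=8, step=0.5):
    """Thickness (in pixels) of the foreground ring along rays from `center`."""
    height, width = len(mask), len(mask[0])
    cy, cx = center
    thicknesses = []
    for k in range(n_angles):
        theta = 2.0 * math.pi * k / n_angles
        r, hits = 0.0, 0
        while True:
            y = int(round(cy + r * math.sin(theta)))
            x = int(round(cx + r * math.cos(theta)))
            if not (0 <= y < height and 0 <= x < width):
                break  # the ray has left the image
            hits += mask[y][x]  # count samples that land on insulation pixels
            r += step
        thicknesses.append(hits * step)
    return thicknesses


def thickness_summary(thicknesses):
    """Mean thickness and largest absolute deviation from that mean."""
    mean = sum(thicknesses) / len(thicknesses)
    max_deviation = max(abs(t - mean) for t in thicknesses)
    return mean, max_deviation


# Synthetic concentric ring: inner radius 20 px, outer radius 35 px.
size, c = 101, 50
ring_mask = [
    [1 if 20 <= math.hypot(y - c, x - c) <= 35 else 0 for x in range(size)]
    for y in range(size)
]

ring_thickness = radial_thickness(ring_mask, (c, c))
mean_thickness, max_deviation = thickness_summary(ring_thickness)
print(ring_thickness, mean_thickness, max_deviation)
```

On a pixel grid the per-ray values carry a discretization error of roughly one pixel, so a real pipeline would need a pixel-to-millimetre calibration and a tolerance chosen with that error in mind.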
From an engineering perspective, the method demonstrates strong diagnostic sensitivity and deployment feasibility. The augmentation strategy effectively addresses sample scarcity and enhances model robustness in complex environments. By leveraging both CNN and Transformer architectures, the method shows superior performance in capturing fuzzy boundaries and multi-scale texture features. Experimental results indicate competitive image processing speed, frame rate, and parameter efficiency, highlighting the potential for real-time deployment on embedded edge-computing platforms. The approach can be integrated into intelligent inspection systems at construction sites to enable closed-loop fault identification and diagnosis.
Future work will focus on improving the method’s adaptability across various insulation structures and defect types. Efforts will also target lightweight model design and integration with edge computing platforms to promote real-time deployment and large-scale application in embedded systems. Overall, the proposed method exhibits strong potential in diagnostic accuracy, practical deployment, and engineering impact, offering significant value for intelligent inspection and fault prediction in power systems.

Author Contributions

Conceptualization, L.L., S.Q. and G.L.; methodology, L.L., S.Q. and G.L.; software, L.L. and F.W.; validation, Y.L., Z.Z. and Y.A.; formal analysis, S.Q., Y.L., Z.Z. and Y.X.; investigation, S.Q., Z.Z., F.W. and Y.A.; resources, L.L. and Y.X.; data curation, Y.L., S.Q., Z.Z. and Y.A.; writing—original draft preparation, L.L., S.Q. and G.L.; writing—review and editing, G.L., Y.X., S.Q. and X.C.; visualization, L.L. and F.W.; supervision, L.L., Y.X. and X.C.; project administration, L.L.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by State Grid Sichuan Electric Power Company of State Grid Corporation of China, project number 521904250004.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

The authors gratefully acknowledge the support provided by Chongqing University of Technology during the course of this research.

Conflicts of Interest

Authors Liang Luo, Song Qing, Yingjie Liu, Ziying Zhang, and Yuhang Xia were employed by Chengdu Power Supply Company of State Grid Sichuan Electric Power Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
WGAN-GP: Wasserstein Generative Adversarial Network with Gradient Penalty
CNNs: Convolutional Neural Networks
GANs: Generative Adversarial Networks
ViT: Vision Transformer
MSA: Multi-Head Self-Attention
MLP: Multi-Layer Perceptron
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative
mDice: Mean Dice Coefficient
mIoU: Mean Intersection over Union
MP: Mean Precision
mRecall: Mean Recall
FID: Fréchet Inception Distance
LPIPS: Learned Perceptual Image Patch Similarity
SSIM: Structural Similarity Index Measure
PSNR: Peak Signal-to-Noise Ratio
SGD: Stochastic Gradient Descent
GP: Gradient Penalty

References

  1. Qiu, W.; Li, C.; Chen, N.; Huang, Y.; Jiang, Z.; Cui, J.; Wang, P.; Liu, G. Review of Explosion Mechanism and Explosion-Proof Measures for High-Voltage Cable Intermediate Joints. Energies 2025, 18, 1552.
  2. Xu, Z.L.; Feng, Y.; Yang, Y.P.; Gou, Y.; Zhao, Q.; Zhou, K. Electrochemical Corrosion and Water Ingress Defect Diagnosis of Lead-Sealed Section in High-Voltage Cable Joints. Insul. Mater. 2022, 55, 118–126.
  3. Deng, L.; Liu, G.H.; Deng, H.; Huang, J.J.; Zhou, B.H. Parameter Measurement Algorithm of Reaction Force Cone in XLPE Cable Joint Based on 3D Point Cloud Processing. Chin. J. Lasers 2023, 50, 66–80.
  4. Hütten, N.; Alves Gomes, M.; Hölken, F.; Andricevic, K.; Meyes, R.; Meisen, T. Deep Learning for Automated Visual Inspection in Manufacturing and Maintenance: A Survey of Open-Access Papers. Appl. Syst. Innov. 2024, 7, 11.
  5. Zhu, W.; Dong, F.; Hou, B.P.; Gwatidzo, W.K.T.; Zhou, L.; Li, G. Segmenting the Semi-Conductive Shielding Layer of Cable Slice Images Using the Convolutional Neural Network. Polymers 2020, 12, 2085.
  6. Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3523–3542.
  7. Zhang, R.; Li, J.T. A Survey of Scene Segmentation Algorithms Based on Deep Learning. Comput. Res. Dev. 2020, 57, 859–875.
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015.
  9. Oktay, O.; Schlemper, J.; Le Folgoc, L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999.
  10. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. arXiv 2021, arXiv:2105.05537.
  11. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306.
  12. Shu, H.; Liu, H.; Tang, Y.; Su, X.; Han, Y.; Dai, Y. Fault Identification Method for Measured Travelling Wave of Transmission Line Based on CSCRFAM-Transformer. Prot. Control Mod. Power Syst. 2025, 10, 69–82.
  13. Fang, J.; Yang, C.; Shi, Y.; Wang, N.; Zhao, Y. External Attention Based TransUNet and Label Expansion Strategy for Crack Detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19054–19063.
  14. Gao, Z.J.; Wang, Z.M.; Li, Y. A Novel Intraretinal Layer Semantic Segmentation Method of Fundus OCT Images Based on the TransUNet Network Model. Photonics 2023, 10, 438.
  15. Ji, Z.L.; Sun, H.R.; Yuan, N.; Zhang, H.Y.; Sheng, J.X.; Zhang, X.J.; Ganchev, I. BGRD-TransUNet: A Novel TransUNet-Based Model for Ultrasound Breast Lesion Segmentation. IEEE Access 2024, 12, 31182–31196.
  16. Wu, J.Z.; Li, Z.J.; Cai, Y.H.; Liang, H.; Zhou, L.; Chen, M.; Guan, J. A Novel Tongue Coating Segmentation Method Based on Improved TransUNet. Sensors 2024, 24, 4455.
  17. Du, J.Z.; Li, S.Q. Low-Light Flower Image Enhancement Model for Kiwi Based on Improved GAN. Trans. Chin. Soc. Agric. Eng. 2024, 40, 165–171.
  18. Wen, P.Z.; Chen, J.M.; Xiao, Y.N.; Wen, Y.Y.; Huang, W.M. Underwater Image Enhancement Algorithm Based on Generative Adversarial Network and Multilevel Wavelet Packet Convolutional Network. J. Zhejiang Univ. Eng. Ed. 2022, 56, 213–224.
  19. Que, Y.; Ji, X.; Jiang, Z.P.; Dai, Y.; Wang, Y.F.; Chen, J. Semantic Segmentation Algorithm for Pavement Cracks Based on GAN Data Augmentation. J. Jilin Univ. Eng. Technol. Ed. 2023, 53, 3166–3175.
  20. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved Training of Wasserstein GANs. Adv. Neural Inf. Process. Syst. 2017, 30, 5767–5777.
  21. Std 525-2025; IEEE Guide for the Design and Installation of Cable Systems in Substations. Institute of Electrical and Electronics Engineers, IEEE: New York, NY, USA, 2025.
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5999–6009.
  23. Shao, Q.; Fan, S.; Zhang, Z.; Liu, F.; Fu, Z.; Lv, P.; Mu, Z. Artificial Intelligence in Cable Fault Detection and Localization: Recent Advances and Research Challenges. Energies 2025, 18, 3662.
  24. Uckol, H.I.; Ilhan, S.; Ozdemir, A. Workmanship defect classification in medium voltage cable terminations with convolutional neural network. Electr. Power Syst. Res. 2021, 194, 107105.
  25. Mantach, S.; Lutfi, A.; Tavasani, H.M.; Ashraf, A.; El-Hag, A.; Kordi, B. Deep Learning in High Voltage Engineering: A Literature Review. Energies 2022, 15, 5005.
Figure 1. Representative cross-sectional images of four cable types.
Figure 2. Schematic diagram of the WGAN-GP architecture.
Figure 3. Structural diagram of the TransUNet architecture.
Figure 4. Schematic diagram of a Transformer layer.
Figure 5. Sample images from the insulation layer segmentation dataset: (a) original image; (b) corresponding annotated mask.
Figure 6. Evolution of WGAN-GP-generated normal cable images across different training epochs.
Figure 7. Training loss curves for different datasets: (a) original dataset; (b) conventionally augmented dataset; (c) WGAN-GP augmented dataset.
Figure 8. Segmentation performance under different data augmentation strategies.
Figure 9. Quantitative comparison of segmentation performance among different models.
Figure 10. Segmentation results of various models on cable cross-sectional images.
Figure 11. Representative segmentation results of 220 kV cable insulation layers by the TransUNet model.
Table 1. Comparison of different generative architectures based on image quality metrics.
| Structure | FID | LPIPS | SSIM | PSNR |
|---|---|---|---|---|
| GAN | 47.8234 | 0.2371 | 0.6115 | 18.4472 |
| WGAN | 42.3096 | 0.2214 | 0.6337 | 19.2275 |
| WGAN-GP | 33.5571 | 0.1796 | 0.6822 | 21.1608 |
Table 2. Impact of augmented data from different generative architectures on segmentation performance.
| Structure | mDice | mIoU | MP | mRecall |
|---|---|---|---|---|
| GAN | 0.9632 | 0.9372 | 0.9632 | 0.9633 |
| WGAN | 0.9764 | 0.9574 | 0.9734 | 0.9763 |
| WGAN-GP | 0.9835 | 0.9677 | 0.9840 | 0.9831 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
