3.1. Experimental Platform and Evaluation Methods
3.1.1. Hardware and Software Configuration
The experiments used the PyTorch (version 2.4) deep learning framework in a GPU-accelerated environment. The hardware setup included an NVIDIA GeForce RTX 3090 GPU (24 GB VRAM) (NVIDIA Corporation, Santa Clara, CA, USA) and an Intel Xeon E5-2680 V4 processor (Intel Corporation, Santa Clara, CA, USA) running at 2.40 GHz (14 cores, 12 threads, 32 GB RAM). The software environment comprised Ubuntu 22.04, CUDA 11.8, and Python 3.10. This configuration ensured efficient training and stable inference of deep learning models.
3.1.2. Data Acquisition and Annotation
In this study, a total of 200 cross-sectional images of 220 kV single-core high-voltage cables were collected, covering four categories: intact, worn, scratched, and damaged cables. The images were acquired under varying illumination intensities and viewing angles with a resolution of 512 × 512 pixels, aiming to enhance the adaptability of the model to real-world application scenarios. Due to the difficulty and high cost of data acquisition, only 200 real images were available for training, which limits the model’s generalization. To mitigate the limitations associated with the small dataset, the WGAN-GP method was employed for data augmentation, which effectively expanded the training set and improved the model’s robustness.
The annotation process was carried out using the Labelme tool (Kentaro Wada, Tokyo, Japan). The target regions in each image were manually labeled, generating annotation files in JSON format. These files were then batch-converted into PNG-format segmentation masks using custom scripts; the masks served as the ground truth for supervised training, as illustrated in Figure 5.
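The conversion from polygon annotations to binary masks can be sketched as follows. This is a minimal illustration rather than the custom scripts used in the study: it assumes the usual Labelme JSON keys (`imageHeight`, `imageWidth`, `shapes`, `points`, `label`, `shape_type`), uses a hypothetical label name `core`, and returns an in-memory 0/1 mask instead of writing a PNG file.

```python
import json

def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test for a point against a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def labelme_to_mask(annotation, target_label):
    """Rasterize every polygon carrying `target_label` into a 0/1 mask (list of rows)."""
    height = annotation["imageHeight"]
    width = annotation["imageWidth"]
    mask = [[0] * width for _ in range(height)]
    for shape in annotation["shapes"]:
        if shape["label"] != target_label:
            continue
        if shape.get("shape_type", "polygon") != "polygon":
            continue
        polygon = [tuple(p) for p in shape["points"]]
        for row in range(height):
            for col in range(width):
                # Each pixel is sampled at its center.
                if point_in_polygon(col + 0.5, row + 0.5, polygon):
                    mask[row][col] = 1
    return mask

# Tiny hand-written annotation standing in for a real Labelme file (hypothetical data).
example_json = json.dumps({
    "imageHeight": 6,
    "imageWidth": 6,
    "shapes": [{"label": "core", "shape_type": "polygon",
                "points": [[1, 1], [5, 1], [5, 5], [1, 5]]}],
})
mask = labelme_to_mask(json.loads(example_json), target_label="core")
```

For 512 × 512 images, a per-pixel Python loop of this kind would be slow; an imaging library's polygon fill would normally replace it.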
3.1.3. Evaluation Metrics for Insulation Layer Segmentation
This study formulates the segmentation task as a binary pixel-wise classification problem, where pixels belonging to the cable core are labeled as positive samples, and all other pixels are treated as negative. The predicted delineation mask is first normalized with a Sigmoid activation function and then binarized with a threshold of 0.5 to generate the final prediction. Let Y denote the ground truth mask and N the total number of pixels in the image. A binary confusion matrix is constructed at the pixel level with the following definitions:
True Positive (TP): Both the predicted and ground truth labels are positive, indicating correct identification of cable core pixels;
True Negative (TN): Both the predicted and ground truth labels are negative, indicating correct identification of background pixels;
False Positive (FP): A background pixel is incorrectly predicted as cable core;
False Negative (FN): A cable core pixel is incorrectly predicted as background.
Delineation performance was comprehensively evaluated using four pixel-level metrics. Final results are reported as the mean values across all test images:
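1. Mean Dice coefficient (mDice): This measures the overlap between predicted and ground truth regions, counting correctly predicted core pixels twice:
$$\mathrm{Dice} = \frac{2TP}{2TP + FP + FN}$$
2. Mean Intersection over Union (mIoU): This represents the ratio of the intersection to the union between predicted and ground truth regions:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
3. Mean Precision (MP): This indicates the proportion of correctly predicted core pixels among all predicted core pixels:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
4. Mean Recall (mRecall): This represents the proportion of correctly predicted core pixels among all actual core pixels:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$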
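The evaluation pipeline described above (Sigmoid, thresholding at 0.5, pixel-level confusion counts, per-image metrics, averaging over images) can be sketched in plain Python. The function names, the small `eps` guard against empty masks, and the choice to treat a probability of exactly 0.5 as positive are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binarize(logits, threshold=0.5):
    """Apply Sigmoid, then threshold; ties at the threshold count as positive (assumption)."""
    return [[1 if sigmoid(z) >= threshold else 0 for z in row] for row in logits]

def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN over all pixels of one image."""
    tp = tn = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def image_metrics(pred, truth, eps=1e-7):
    """Dice, IoU, precision, and recall for a single image."""
    tp, tn, fp, fn = confusion_counts(pred, truth)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
    }

def mean_metrics(per_image):
    """Average each metric over a list of per-image metric dictionaries."""
    return {k: sum(m[k] for m in per_image) / len(per_image) for k in per_image[0]}

# Toy 2 x 2 example: one TP, one TN, one FP, one FN.
logits = [[2.0, -1.0], [0.5, -3.0]]
truth = [[1, 0], [0, 1]]
pred = binarize(logits)
scores = mean_metrics([image_metrics(pred, truth)])
```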
3.1.4. Image Generation Quality Metrics
The consistency between generated and real images was evaluated based on visual quality and structural fidelity using four objective metrics commonly applied in image generation tasks: Fréchet Inception Distance (FID), Learned Perceptual Image Patch Similarity (LPIPS), Structural Similarity Index Measure (SSIM), and Peak Signal-to-Noise Ratio (PSNR). These metrics quantify image realism and perceptual quality from multiple perspectives, including perceptual similarity, structural consistency, and pixel-level accuracy. Definitions of these metrics are provided below.
Fréchet Inception Distance (FID): FID measures the difference in distribution between generated and real images in a high-dimensional feature space. It is computed based on features extracted by the Inception network, which are assumed to follow multivariate Gaussian distributions. The Fréchet distance between these two distributions is given by
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$
where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denote the mean and covariance of real and generated image features, respectively.
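Learned Perceptual Image Patch Similarity (LPIPS): LPIPS measures the perceptual distance between two images as a weighted distance between their deep feature maps:
$$\mathrm{LPIPS}(x, y) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\lVert w_l \odot \left(\hat{f}^{\,l}_{x,hw} - \hat{f}^{\,l}_{y,hw}\right) \right\rVert_2^2$$
where $\hat{f}^{\,l}_{x}$ and $\hat{f}^{\,l}_{y}$ are the normalized features of images x and y at the l-th layer, $H_l$ and $W_l$ are the spatial dimensions of that layer, $w_l$ denotes the learnable weights for each channel, and ⊙ represents element-wise multiplication.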
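Structural Similarity Index Measure (SSIM): SSIM compares two images in terms of luminance, contrast, and structure:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
where $\mu_x$ and $\mu_y$ represent the local means of the two images, $\sigma_x^2$ and $\sigma_y^2$ denote their local variances, and $\sigma_{xy}$ is the local covariance. $C_1$ and $C_2$ are constants used for numerical stability.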
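Peak Signal-to-Noise Ratio (PSNR): PSNR quantifies pixel-level accuracy on a logarithmic scale:
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)$$
where MAX is the maximum possible pixel value, and MSE is the mean squared error between the generated and real images.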
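The two pixel-domain metrics can be illustrated with a short plain-Python sketch. Two simplifications are assumed here and are not taken from the study: images are small 8-bit grayscale arrays (MAX = 255), and SSIM is computed over a single global window with the commonly used constants C1 = (0.01 · MAX)² and C2 = (0.03 · MAX)², whereas practical implementations average SSIM over local sliding windows.

```python
import math

def flatten(image):
    return [p for row in image for p in row]

def mean(values):
    return sum(values) / len(values)

def mse(a, b):
    """Mean squared error between two equally sized images."""
    return mean([(x - y) ** 2 for x, y in zip(flatten(a), flatten(b))])

def psnr(a, b, max_value=255.0):
    """PSNR in dB; identical images give infinity."""
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

def global_ssim(a, b, max_value=255.0):
    """SSIM over one global window (simplification of the usual local-window form)."""
    xs, ys = flatten(a), flatten(b)
    mu_x, mu_y = mean(xs), mean(ys)
    var_x = mean([(x - mu_x) ** 2 for x in xs])
    var_y = mean([(y - mu_y) ** 2 for y in ys])
    cov_xy = mean([(x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)])
    c1 = (0.01 * max_value) ** 2  # assumed stability constants
    c2 = (0.03 * max_value) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Hypothetical 2 x 2 "real" and "generated" patches.
real = [[0, 255], [128, 64]]
generated = [[0, 250], [130, 64]]
patch_mse = mse(real, generated)
patch_psnr = psnr(real, generated)
patch_ssim = global_ssim(real, generated)
```

FID and LPIPS are not sketched here because both depend on pretrained network features.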
3.2. Data Augmentation Strategies
This study applies two data augmentation strategies to enhance the generalization and robustness of the segmentation model: (1) conventional techniques such as horizontal and vertical flipping, brightness and contrast adjustment, and Gaussian blurring, and (2) WGAN-GP-based augmentation, which generates structurally and semantically consistent synthetic samples using a generative adversarial network to increase data diversity and coverage. The original dataset contains 200 images, and both methods expand it to a total of 800 images.
In this study, the generative framework augments four categories of cable cross-sectional images, each initially containing 50 samples with a resolution of 512 × 512. For each category, it generates 150 additional images, expanding each category to 200 images. This yields 600 synthetic samples and a total of 800 images. This augmentation strategy significantly improves data quantity and diversity, thereby enhancing the robustness and generalization of the delineation model.
Figure 6 illustrates the progressive refinement of normal cable images synthesized by the proposed augmentation across different training epochs. As training advances, the images evolve from blurry and unstructured forms into clear cable cross-sections with well-defined textures and boundaries. This progression demonstrates the model’s improved capacity to capture structural features and recover fine details, indicating both stable training and strong generative ability. The resulting high-quality synthetic images provide valuable data for downstream delineation tasks.
Preliminary experiments compared three commonly used optimizers in WGAN-GP (Adam, RMSprop, and SGD) to ensure training stability and image generation consistency. A range of learning rates was explored separately for the generator (G) and discriminator (D). Adam demonstrated superior performance in both training stability and image quality. When the β parameters were set to (0.5, 0.9), it effectively reduced gradient oscillations and improved boundary detail modeling. In addition, other hyperparameters—including the number of training epochs, n-critic, batch size, latent vector dimension, random seed, input resolution, and pixel normalization range—were determined through ablation studies and comparative analysis. These settings were selected by jointly considering adversarial training stability and downstream segmentation performance. Based on these findings, Adam was selected as the optimizer, and asymmetric learning rates were applied to G and D to enhance training stability.
The final configuration of the generative model is as follows: the learning rates of G and D were set to 0.0001 and 0.00005, respectively, with β parameters of (0.5, 0.9). The model was trained for 500 epochs using an n-critic setting of 3, meaning the discriminator was updated three times per generator update. The batch size was 8, the latent space dimension was 256, and the random seed was fixed at 42. Input and output images were resized to 512 × 512 and normalized to the range [−1, 1]. This configuration effectively mitigated discriminator overfitting and enhanced the generator’s capacity to capture structural and edge information. As a result, it improved the clarity of generated images and contributed to more stable and accurate training of the subsequent TransUNet delineation model.
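The reported settings can be collected in a small configuration object, shown here together with the pixel normalization and the critic/generator update ordering. This is a sketch under stated assumptions: the class and function names are hypothetical, inputs are assumed to be 8-bit images, and the schedule shown (a generator step after every third critic step) is one common way to realize an n-critic of 3, not necessarily the exact loop used in the study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WGANGPConfig:
    # Values taken from the text; field names are illustrative.
    lr_generator: float = 1e-4
    lr_discriminator: float = 5e-5
    betas: tuple = (0.5, 0.9)
    epochs: int = 500
    n_critic: int = 3
    batch_size: int = 8
    latent_dim: int = 256
    seed: int = 42
    image_size: int = 512

def to_model_range(pixel):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1]."""
    return pixel / 127.5 - 1.0

def from_model_range(value):
    """Map a value in [-1, 1] back to an integer in [0, 255], with clipping."""
    return min(255, max(0, round((value + 1.0) * 127.5)))

def update_schedule(num_batches, n_critic):
    """List the update order: 'D' on every batch, 'G' after every n_critic-th batch."""
    steps = []
    for batch_index in range(1, num_batches + 1):
        steps.append("D")
        if batch_index % n_critic == 0:
            steps.append("G")
    return steps

config = WGANGPConfig()
schedule = update_schedule(num_batches=6, n_critic=config.n_critic)
```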
During the experiments, some intermediate images generated by the framework exhibited artifacts, including texture blurring, edge discontinuities, local noise (e.g., black spots), and partial detail distortions. These artifacts commonly occurred in the early training iterations, mainly due to the generator’s insufficient learning of the real data distribution. As training progressed and both the generator and discriminator improved, such artifacts largely disappeared in the final synthetic images, whose overall quality became stable and increasingly consistent with real data. To ensure the validity and reliability of the augmented dataset, the generated images were subjected to manual inspection and subsequent experimental evaluation. The results confirmed that the augmented data did not introduce noticeable distributional bias or negatively affect delineation performance; on the contrary, it contributed to enhancing the robustness and generalization capability of the delineation model.
3.3. Segmentation Model Training and Loss Function
The evaluation of different data augmentation strategies was conducted by dividing the original, conventionally augmented, and WGAN-GP augmented datasets into training, validation, and test sets with an 8:1:1 ratio. This configuration provided sufficient data for learning and ensured reliable performance assessment.
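An 8:1:1 partition can be sketched as below. The text does not state whether the split was stratified by category, so the per-class splitting, the file names, and the seed used here are assumptions made purely for illustration.

```python
import random

def stratified_split(samples_by_class, ratios=(8, 1, 1), seed=42):
    """Split each class into train/val/test by the given ratios (assumed stratified)."""
    rng = random.Random(seed)
    total = sum(ratios)
    splits = {"train": [], "val": [], "test": []}
    for label, samples in samples_by_class.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        n_train = len(shuffled) * ratios[0] // total
        n_val = len(shuffled) * ratios[1] // total
        splits["train"] += [(s, label) for s in shuffled[:n_train]]
        splits["val"] += [(s, label) for s in shuffled[n_train:n_train + n_val]]
        # Whatever remains after train and val goes to the test set.
        splits["test"] += [(s, label) for s in shuffled[n_train + n_val:]]
    return splits

# Hypothetical file names for an 800-image dataset with 200 images per category.
classes = ["intact", "worn", "scratched", "damaged"]
dataset = {c: [f"{c}_{i:03d}.png" for i in range(200)] for c in classes}
splits = stratified_split(dataset)
```

With 200 images per category this gives 640 training, 80 validation, and 80 test images; the same function applied to the original 200 images would give 160, 20, and 20.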
The Dice loss function was used during TransUNet training to improve segmentation accuracy in foreground regions and mitigate class imbalance. It measures mask quality by evaluating the normalized overlap between the predicted mask and the ground truth label. The formulation is as follows:
$$L_{\mathrm{Dice}} = 1 - \frac{2\sum_{i=1}^{N} p_i y_i + \epsilon}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} y_i + \epsilon}$$
where $p_i$ denotes the predicted probability for pixel i, $y_i$ is the corresponding ground truth label, and $\epsilon$ is a small constant to prevent division by zero. Compared with traditional binary cross-entropy loss, the Dice loss is more effective for imbalanced delineation tasks with small foreground regions, as it provides a more accurate measure of the overlap between the predicted and ground truth masks.
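A minimal numerical sketch of a soft Dice loss follows. It assumes the common form 1 − (2·Σ p·y + ε) / (Σ p + Σ y + ε) with the smoothing constant in both numerator and denominator; the function name and the value of `eps` are illustrative, and the implementation used for training may differ in detail.

```python
def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss over flattened per-pixel probabilities and 0/1 labels."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

# Four pixels: two foreground, two background.
targets = [1, 1, 0, 0]
confident = dice_loss([0.9, 0.8, 0.1, 0.2], targets)  # mostly correct prediction
perfect = dice_loss([1.0, 1.0, 0.0, 0.0], targets)    # exact match
inverted = dice_loss([0.0, 0.0, 1.0, 1.0], targets)   # no overlap at all
```

Because the loss depends only on overlap with the foreground, adding more correctly classified background pixels leaves it unchanged, which is why it is less sensitive to class imbalance than a per-pixel average.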
A systematic search identified the optimal training configuration for TransUNet. This process evaluated three commonly used optimizers in pixel-wise labeling tasks: Stochastic Gradient Descent (SGD), Adam, and AdamW. The search also examined a range of learning rates and weight decay values, along with adjustments to training epochs and batch size. Convergence speed, final performance, and overfitting risk served as the main evaluation criteria. Experimental results showed that SGD offered better robustness in terms of training stability and boundary detail modeling than adaptive moment-based optimizers such as Adam and AdamW. Based on these findings, SGD was selected as the final optimizer for TransUNet.
The finalized training configuration consisted of 500 epochs, a batch size of 4, an initial learning rate of 0.0001, and a weight decay coefficient of 0.0001. An early stopping strategy was applied to enhance training efficiency and reduce the risk of overfitting. Training was terminated automatically when the validation loss did not improve over 25 consecutive epochs. For comparative analysis of different data augmentation strategies, TransUNet was trained on the original dataset, a standard augmented dataset, and a dataset augmented using WGAN-GP. The corresponding training loss curves are illustrated in Figure 7.
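The early stopping rule (terminate once the validation loss has not improved for 25 consecutive epochs, within a 500-epoch budget) can be sketched as a small helper. The class name, the `min_delta` parameter, and the synthetic loss sequence are assumptions for illustration only.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=25, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

# Synthetic loss curve: improves for 10 epochs, then plateaus at a worse value.
stopper = EarlyStopping(patience=25)
stopped_epoch = None
for epoch in range(1, 501):
    val_loss = 1.0 / epoch if epoch <= 10 else 0.2
    if stopper.step(val_loss):
        stopped_epoch = epoch
        break
```

In this toy run the last improvement occurs at epoch 10, so the loop ends 25 epochs later; in practice the weights from the best epoch would also be checkpointed.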
The results demonstrate that the choice of data augmentation strategy significantly affects model performance in the cable image segmentation task. When trained on the original dataset, the minimum smoothed validation loss was 0.0152—considerably higher than that achieved with the augmented datasets. The loss curve showed a slower descent and noticeable fluctuations, indicating limited feature learning and generalization capability. With conventional augmentation, the validation loss decreased to 0.0064, representing a 57.9% reduction relative to the original dataset. This confirms the effectiveness of basic image transformations in enhancing model robustness and generalization. Building on this, the WGAN-GP augmentation strategy further reduced the validation loss to 0.0062—a 59.2% reduction compared to the original dataset. Although the final loss was slightly higher than that of the conventionally augmented dataset, this method achieved faster convergence, with an earlier and more stable decline in the loss curve. Particularly within the first 100 epochs, it significantly suppressed the validation loss, reflecting superior early-stage learning efficiency.
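The percentage reductions quoted above follow from the three minimum validation losses; the short check below reproduces them. The helper name is illustrative.

```python
def relative_reduction(baseline, value):
    """Percentage decrease of `value` relative to `baseline`."""
    return (baseline - value) / baseline * 100.0

baseline_loss = 0.0152  # original dataset
conventional = relative_reduction(baseline_loss, 0.0064)  # conventional augmentation
wgan_gp = relative_reduction(baseline_loss, 0.0062)       # WGAN-GP augmentation
```

Rounded to one decimal place, these give the 57.9% and 59.2% figures reported in the text.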
Additionally, images generated by the proposed augmentation exhibited greater structural diversity and finer edge details, effectively compensating for the original dataset’s limitations in complex structural regions. This contributed to improved delineation performance on heterogeneous image distributions. Considering convergence speed, generalization performance, and training stability, the method demonstrates broader applicability and greater practical value within the context of this study.
To assess the impact of different data augmentation strategies on cable image delineation performance, the TransUNet model was trained separately on the three datasets under a consistent architecture and training configuration. Model performance was evaluated on the test set, and the results are presented in Figure 8.
The experimental results indicate that the proposed augmentation outperformed all others across all evaluation metrics. Specifically, the mDice score reached 0.9835, representing a 1.32% improvement over the original dataset. The most significant improvement was observed in mIoU, which increased from 0.9516 to 0.9677, an absolute gain of 1.61 percentage points (about 1.69% in relative terms). In addition, precision and recall improved to 0.9840 and 0.9831, respectively, demonstrating the model’s ability to accurately identify core pixels while minimizing false positives and false negatives.
By contrast, the conventional augmentation strategy led to moderate improvements over the original dataset but offered limited overall enhancement. Notably, its mRecall remained lower than that achieved with the proposed augmentation. These findings suggest that conventional augmentation techniques are less effective for complex structural segmentation tasks.
In comparison, the proposed strategy generated semantically consistent and structurally diverse training samples, effectively compensating for the original dataset’s limitations in complex regions. This led to substantial improvements in both delineation accuracy and model generalization.
3.4. Ablation Study
This study conducts two ablation experiments under identical training conditions to evaluate the contribution of key components in the WGAN-GP architecture to image generation quality and segmentation performance. In the first setting, both the Wasserstein distance and the gradient penalty (GP) were removed, yielding a standard GAN. In the second, the Wasserstein distance was retained while the GP was excluded, resulting in a WGAN. Comparing these variants clarifies the individual roles of each component.
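For reference, the three objectives being compared can be written in their standard textbook forms. The notation is assumed rather than taken from this section: $D$ is the discriminator (or critic), $G$ the generator, $p_r$ and $p_z$ the real-data and latent distributions, and $\lambda$ the penalty weight, whose value is not stated here. The losses used in the experiments may differ in detail.

```latex
% Standard GAN (first ablation): minimax game with a probabilistic discriminator
\min_G \max_D \; \mathbb{E}_{x \sim p_r}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]

% WGAN (second ablation): Wasserstein-1 estimate with a 1-Lipschitz critic
\min_G \max_{\lVert D \rVert_L \le 1} \; \mathbb{E}_{x \sim p_r}\left[D(x)\right]
  - \mathbb{E}_{z \sim p_z}\left[D(G(z))\right]

% WGAN-GP (full model): critic loss with a gradient penalty on interpolated samples
L_D = \mathbb{E}_{z \sim p_z}\left[D(G(z))\right] - \mathbb{E}_{x \sim p_r}\left[D(x)\right]
  + \lambda \, \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right],
\qquad \hat{x} = \epsilon x + (1 - \epsilon)\, G(z), \quad \epsilon \sim U[0, 1]
```

The penalty term replaces a hard Lipschitz constraint with a soft one that pushes the critic's gradient norm toward 1 along lines between real and generated samples.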
Table 1 summarizes the image quality metrics for the original framework and its ablated versions.
The comparison results, as shown in Table 1, indicate that the full model achieves the best performance across all metrics, with FID and LPIPS values of 33.5571 and 0.1796, respectively, significantly outperforming both GAN and WGAN. In terms of SSIM and PSNR, the full model also demonstrates superior fidelity and sharpness, highlighting the critical role of the Wasserstein distance and gradient penalty in enhancing the quality of generated images.
The impact of generated data on downstream semantic segmentation was further examined by training a TransUNet-based model with augmented samples from each generative architecture. Delineation performance on the test set is reported in Table 2.
Table 2 summarizes the segmentation performance of different architectures. The model trained with GAN-generated samples performs the worst, with mDice and mIoU scores of 0.9632 and 0.9372, respectively. This result indicates that the standard GAN still struggles with structural inconsistencies in generated images. Adding the Wasserstein distance in WGAN improves delineation performance, increasing the mDice to 0.9764. With the gradient penalty further incorporated, the proposed model achieves the highest scores across all mask-based metrics. These results confirm the combined contribution of the Wasserstein distance and gradient penalty to both data augmentation quality and downstream delineation performance.
In summary, the Wasserstein distance and gradient penalty in WGAN-GP both play essential roles in enhancing the fidelity of generated images and improving downstream delineation performance. Their combined effect significantly promotes perceptual consistency and structural completeness, affirming the effectiveness and sound design of the architecture.
3.5. Comparative Experiments
The effectiveness of the proposed delineation approach was evaluated by comparing it with three representative semantic segmentation models: UNet, Swin-UNet, and Attention-UNet. All models were trained and tested on the WGAN-GP augmented dataset using identical training configurations. Their delineation performance on the test set is illustrated in Figure 9.
Figure 9 presents the performance metrics and inference speed of different models on the test set. The TransUNet model adopted in this study achieved the best performance across all evaluation metrics, demonstrating high segmentation accuracy and stability. Specifically, mDice and mIoU reached 0.9835 and 0.9677, representing improvements of approximately 2.03% and 3.05% over UNet. MP and mRecall reached 0.9840 and 0.9831, indicating higher delineation precision and consistency. Attention-UNet, which incorporates attention mechanisms, and Swin-UNet, which employs a Transformer encoder, both achieved significant gains over UNet but still fell short of TransUNet in global context modeling and boundary detail preservation.
This study further assessed deployment suitability by comparing inference efficiency, including average inference time (ms/image), frames per second (FPS), and parameter count. TransUNet delivered the highest delineation accuracy, with 105.5 M parameters, an inference time of 38.6 ms per image, and 25.9 FPS. Although slower than the lightweight UNet, its superior accuracy and robustness highlight its value in high-precision applications. Swin-UNet and Attention-UNet achieved a more balanced trade-off between performance and complexity. Considering both accuracy and efficiency, this study adopts TransUNet as the primary model because it is better suited to applications requiring high delineation quality.
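The relationship between the reported latency and throughput, together with a generic way of measuring average latency, can be sketched as follows. The helper names are hypothetical and the timed callable is a stand-in for a model's forward pass, not the benchmarking code used in the study.

```python
import time

def fps_from_latency(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms

def average_latency_ms(fn, inputs, warmup=2):
    """Average wall-clock time per call of `fn`, after a few untimed warm-up calls."""
    for x in inputs[:warmup]:
        fn(x)
    start = time.perf_counter()
    for x in inputs:
        fn(x)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / len(inputs)

# The reported 38.6 ms per image corresponds to roughly 25.9 images per second.
transunet_fps = fps_from_latency(38.6)

# Stand-in workload; on a GPU the device would need to be synchronized
# before each clock reading, because kernels are launched asynchronously.
dummy_latency = average_latency_ms(lambda x: sum(range(x)), [1000] * 10)
```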
All three improved models outperformed the baseline UNet, further confirming the effectiveness of deep feature extraction structures and attention mechanisms in enhancing delineation performance. Representative delineation results on cable cross-sectional images are shown in Figure 10.
Compared to UNet, Swin-UNet, and Attention-UNet, the proposed TransUNet integrates a Transformer encoder into a conventional encoder–decoder framework, enabling efficient fusion of local convolutional features and global contextual information. By introducing a global self-attention mechanism, the model’s ability to capture long-range dependencies and preserve boundary continuity is significantly enhanced, thereby improving segmentation accuracy in regions with complex morphology or blurred edges. In addition, incorporating WGAN-GP–generated augmented samples effectively expanded the training dataset, further enhancing the model’s robustness and generalization capability. Representative delineation results of the proposed method on 220 kV cable insulation layers are presented in Figure 11.
Although the proposed method demonstrates strong segmentation performance on typical 220 kV cable insulation defect images, challenges remain under extreme conditions. Severe insulation damage, distorted textures, blurred edges, or strong background interference can impair the model’s structural perception, reducing mask accuracy. Prior studies have shown that deep learning models heavily rely on image structural integrity for cable fault detection. When image quality degrades or feature expression weakens, model performance tends to decline. Reference [23] reports that model generalization and stability are often limited under sample distribution shifts or complex defect patterns. Reference [24] further indicates that convolutional neural networks exhibit high sensitivity to image structures, with decreased accuracy in detecting blurry or aliased boundaries in medium-voltage cable images. To enhance robustness and applicability under abnormal conditions, future work may explore the integration of physical priors, incorporation of multimodal data (e.g., infrared thermography and electrical signal features), or the development of task-adaptive mechanisms tailored for complex defects. These strategies aim to improve the model’s adaptability and generalization under extreme scenarios.
During long-term operation of power systems, 220 kV cable insulation layers are frequently subjected to thermal aging, electrical stress concentration, and partial discharge. These factors often lead to nonlinear structural degradation and complex texture variations. In imaging, such deterioration manifests as boundary fractures, blurred textures, and increased background noise, which significantly complicate pixel-wise labeling tasks and demand higher structural perception and discrimination capabilities from deep learning models. Although the proposed method was validated primarily on typical defect samples, theoretical analysis suggests that severe structural degradation may limit the model’s ability to extract fine details and model global semantics, thereby affecting pixel-wise accuracy. Future research may consider incorporating simulated or measured partial discharge images, alongside multimodal data such as infrared thermograms and electrical signal features, to expand the training data coverage and enhance the model’s capacity to recognize complex fault patterns [25].