1. Introduction
The laser diode chip, a fundamental component of laser systems, is distinguished by its compact size and high power output, enabling its widespread application in optical communication, medicine, and industrial manufacturing [
1]. However, its small dimensions make the emitting facet highly prone to both pronounced and subtle scratches during various manufacturing processes [
2]. Such scratches can compromise the integrity of the chip and, in severe cases, cause laser energy to concentrate at defect sites, resulting in chip burnout. Therefore, detecting scratches on the emitting facet is essential for ensuring laser quality and operational reliability.
The semiconductor industry predominantly employs image semantic segmentation to detect scratches on laser diode chips. As illustrated in
Figure 1, an image of a chip’s emitting facet under 20× magnification reveals a surface with approximate dimensions of 2 mm × 0.14 mm. The magnified region in
Figure 1 contains a deep scratch and a shallow scratch, each with a width of less than 10 pixels in the image, corresponding to sub-micrometer scales in reality. Deep scratches exhibit a high contrast against the background, facilitating their identification. In contrast, shallow scratches closely resemble the background, making them challenging to detect even with human vision. Such subtle scratches constitute approximately 20% of all scratch-related defects. In addition, the high noise levels introduced by high-magnification imaging further exacerbate the difficulty of identifying these shallow scratches.
Scratch segmentation mainly involves traditional methods and deep learning techniques [
3]. Traditional approaches often utilize topology and mathematical modeling. Peng et al. proposed an enhanced watershed algorithm with optimal labeling and edge constraints, using edge operators to refine boundaries for efficient segmentation of low-contrast, high-noise foam images [
4]. Kishorjit et al. combined simple linear iterative clustering superpixels with an adaptive Gaussian radial basis function kernel-based fuzzy C-means method to improve segmentation robustness [
5]. Cai et al. developed an adaptive variational level set model, integrating scale bias correction and denoising terms to improve noisy image segmentation [
6]. Although traditional methods are computationally efficient and fast, they struggle to detect scratches on the emitting facet due to high noise and indistinct features.
Deep learning approaches for scratch segmentation are data-driven, learning class features from datasets, and are categorized into supervised, semi-supervised, and self-supervised methods. Supervised methods utilize labeled datasets to train models capable of segmenting unseen data. Long et al. introduced fully convolutional networks (FCNs) for pixel-level segmentation, outperforming earlier region-classification approaches and enabling industrial inspection applications [
7]. Wang et al. proposed FCN-SFW, combining structured forests with wavelet transforms to detect minute cracks in steel beams [
8]. U-Net, which extends FCN with an encoder–decoder architecture and skip connections, captures both low- and high-level semantics for biomedical image segmentation [
9]. Li et al. improved U-Net with VGG16 and a hybrid attention mechanism for real-time PCB soldering defect detection [
10]. Wang et al. further improved U-Net with extended and offset convolutions for SAR image segmentation in aquaculture monitoring [
11]. Attention mechanisms have further improved segmentation performance [
12]. Yeung et al. proposed ABFormer, which employs a boundary-aware module and attention mechanisms to improve the accuracy of defect segmentation [
13]. Although supervised methods excel with sufficient labeled data, annotating shallow scratches on chips is challenging, and models trained on deep scratches often fail to generalize to shallow ones.
Semi-supervised methods utilize limited labeled data combined with techniques such as data augmentation and consistency regularization to learn semantic features from unlabeled data [
14]. Shi et al. enhanced pseudo-label reliability in semi-supervised learning through a dynamic threshold strategy, improving defect segmentation accuracy [
15]. Chen et al. proposed cross pseudo supervision, where two differently initialized networks mutually guide each other with pseudo-labels to improve consistency and segmentation performance [
16]. Zhang et al. developed a weakly supervised method that generates pixel-level annotations from image-level labels, integrating class activation maps with a dense energy loss function to optimize segmentation [
17]. Although semi-supervised methods effectively leverage unlabeled data through regularization, their performance may degrade when labeled and unlabeled data differ significantly.
Unsupervised segmentation often employs clustering techniques [
18] but struggles to segment scratches in chip images due to their subtle contrast with the background. Self-supervised learning through synthetic data generation provides an alternative. Li et al. used CutPaste data augmentation for self-supervised representation learning, creating a generative one-class classifier for annotation-free defect localization through expanded patch representations [
19]. Schlüter et al. utilized Poisson image editing to fuse multi-scale image patches, producing realistic synthetic anomalies to train models for robust defect segmentation [
20]. Advances in generative models have enhanced this approach. Zhang et al. introduced Strength-controllable Diffusion Anomaly Synthesis (SDAS), employing diffusion models to generate anomaly patterns superimposed on normal images, which enabled a residual detection model to achieve state-of-the-art defect segmentation [
21]. However, reliance on pattern splicing limits the generation of realistic data, impeding effective scratch segmentation in real-world environments.
In summary, the scratch detection task for laser diode chips encounters two main challenges: (1) Annotating shallow scratches is challenging, with current synthesis methods offering insufficient realism and control. (2) The subtle features of shallow scratches, combined with their low contrast against the background, place stringent requirements on the detection model. To address the above challenges, we propose the Segment Structure with Controllable Realistic Synthetic (SCRS). Drawing on extensive analysis of scratch images and their high similarity to normal patterns, we introduce Mask-Guided Local Mean-Shift Diffusion Data Synthesis (MSDS). This method achieves realistic and diverse scratch images through direct generation and mask-based depth control. To address the difficulty of distinguishing scratch patterns from normal ones due to their high similarity, we propose TransCNN, a model that employs ViT blocks for global feature encoding, enhances pattern differentiation through attention mechanisms, and extracts distinct scratch features. Skip connections and convolutional decoding further refine spatial features, improving scratch segmentation accuracy. Experimental results demonstrate that SCRS achieves mean Intersection over Union (mIoU) values of 74.4% for deep scratches and 75.8% for shallow scratches in production, highlighting its significant industrial application value.
3. Experiment
SCRS constructs a realistic scratch dataset using MSDS to train TransCNN for chip scratch segmentation. To validate the feasibility of this framework, we conducted the following experiments.
3.1. Dataset
SCRS consists of two parts: the first employs MSDS to generate a training dataset for scratch segmentation, while the second trains TransCNN for scratch segmentation. The diffusion model in MSDS is trained using normal data, and the effectiveness of MSDS is assessed through the segmentation performance of models trained on its output. Similarly, TransCNN is trained on the MSDS-generated dataset, and its effectiveness is evaluated using segmentation results on real-world scratches. To validate the SCRS framework, we utilize chip images collected from a production line to construct the MSDS training set and the TransCNN test set.
Given that scratches occupy a small proportion of chip images, directly using full images increases computational costs and exacerbates data imbalance, impairing model training. To address this, cropping is adopted during dataset construction, generating images of size 256 × 256 to increase the proportion of scratch regions and mitigate class imbalance.
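The effect of cropping on class balance can be illustrated with a minimal sketch. It assumes non-overlapping tiling and a hypothetical `min_scratch_fraction` filter; the function names, the synthetic image, and the tiling strategy are illustrative assumptions, not the procedure used to build the dataset.

```python
import numpy as np

def crop_patches(image, mask, size=256, min_scratch_fraction=0.0):
    """Tile an image and its binary scratch mask into size x size patches.

    Returns (image_patch, mask_patch, scratch_fraction) tuples for every
    patch whose scratch-pixel fraction is at least min_scratch_fraction.
    Non-overlapping tiling is an assumption made for this sketch.
    """
    patches = []
    height, width = image.shape[:2]
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            img_patch = image[top:top + size, left:left + size]
            mask_patch = mask[top:top + size, left:left + size]
            fraction = float(mask_patch.mean())
            if fraction >= min_scratch_fraction:
                patches.append((img_patch, mask_patch, fraction))
    return patches

def demo():
    """Build a synthetic 512 x 1024 image with one thin scratch and crop it."""
    rng = np.random.default_rng(0)
    image = rng.random((512, 1024))
    mask = np.zeros((512, 1024), dtype=np.uint8)
    mask[100:108, 300:500] = 1  # hypothetical scratch: 8 px wide, 200 px long
    full_fraction = float(mask.mean())
    kept = crop_patches(image, mask, size=256, min_scratch_fraction=1e-6)
    return full_fraction, kept

if __name__ == "__main__":
    full_fraction, kept = demo()
    print(f"scratch fraction in full image: {full_fraction:.4f}")
    for _, _, fraction in kept:
        print(f"scratch fraction in kept patch: {fraction:.4f}")
```

In this toy example the scratch covers about 0.3% of the full image but about 2.4% of the single retained 256 × 256 patch, which is the kind of imbalance reduction that cropping targets.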
Training set. We cropped 3600 normal images from scratch-free chip images to train the diffusion model in MSDS. Additionally, we sampled masks and used MSDS to generate the corresponding scratch images, creating a dataset of mask and scratch-image pairs. As the distinction between deep and shallow scratches in masks provides limited benefit for detection, we unified all scratches into a single category, further mitigating class imbalance and enhancing training performance.
To determine an optimal dataset size for scratch training, we generated datasets containing 200, 400, 600, 800, 1000, and 1200 samples and assessed their respective training outcomes, as presented in
Figure 4. The segmentation accuracy for deep and shallow scratches improves with larger datasets, but this effect plateaus beyond a size of 800. Consequently, we adopted a dataset size of
.
Test set. We selected 50 scratch-containing images from a production line for meticulous annotation and sampled 300 images with varying scratch depths through random cropping to form the test dataset. Scratches in the test set are qualitatively classified as deep or shallow to evaluate the effectiveness of different methods across scratch categories.
3.2. Implementation Details
All experiments were conducted on a high-performance server, which was equipped with an Intel(R) Xeon(R) Platinum 8368Q CPU @ 2.60 GHz and an NVIDIA A100 GPU with 80 GB of VRAM.
MSDS employs a U-Net as the backbone of its diffusion model, with 1000 diffusion timesteps. The model has 49.8M trainable parameters, and its loss converges to 0.015. TransCNN, with 60.8M trainable parameters, converges to a loss of 0.003. The training loss curves are presented in
Figure 5.
A hyperparameter controls the degree of mean shift in generating scratch images. Experiments indicate that a small value results in scratches too similar to normal images, lacking distinct features, while a large value causes severe distortion and unnatural transitions between scratches and background. The value was set to 0.05 following experimentation.
3.3. MSDS Sample Results
The synthetic images should closely resemble the real scratch images and include scratches of varying depths to the greatest extent possible. To demonstrate the effectiveness of the MSDS, we compared it with the representative CutPaste method and the state-of-the-art (SOTA) industrial defect detection data synthesis method, SDAS. To visually illustrate the data synthesis performance of these methods, we created a set of masks from the labels in the test set and used them to generate images, with the results shown in
Figure 6.
The MSDS employs a scratch fine-tuning process during mask generation to produce scratches that are deeper in the center and shallower at the edges, aligning with real-world scratch patterns. While CutPaste and SDAS excel in generating deep scratches in
Figure 6a, their pattern-adding approach falters in
Figure 6b,c for the shallow- and mixed-depth scratches, respectively. In these cases, the appearance of the scratch is largely dictated by pattern differences rather than mask control, resulting in shallow scratches that appear too deep in
Figure 6b and indistinct depth variations in
Figure 6c. In addition, this approach produces unrealistic anomaly patterns, as shown in boxes (1), (2), and (4) in
Figure 6. In contrast, MSDS leverages the mask-guided diffusion model to directly generate scratch images, controlling mean shift via mask depth values to achieve precise scratch depth variation. This direct generation improves image consistency and visual coherence. Consequently, MSDS enables the creation of a more balanced and comprehensive scratch dataset, which improves robustness and accuracy in the detection of scratches of varying depths.
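To make the idea of a depth-valued mask concrete, the following sketch builds a mask for a straight scratch whose values are largest along the centerline and decay toward the edges. The Gaussian cross-section, the parameter names (`half_width`, `peak_depth`), and the straight-line geometry are assumptions chosen for illustration; this is not the scratch fine-tuning process used in MSDS.

```python
import numpy as np

def soft_scratch_mask(shape, start, end, half_width=4.0, peak_depth=1.0):
    """Depth mask for a straight scratch from `start` to `end` (row, col).

    Depth equals `peak_depth` on the centerline, follows a Gaussian
    cross-section (an assumption of this sketch), and is zero for pixels
    farther than `half_width` from the segment.
    """
    p0 = np.asarray(start, dtype=float)
    p1 = np.asarray(end, dtype=float)
    seg = p1 - p0
    seg_len_sq = float(seg @ seg)
    if seg_len_sq == 0.0:
        raise ValueError("start and end must be different points")

    rows, cols = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    # Project every pixel onto the segment and clamp to its endpoints.
    t = ((rows - p0[0]) * seg[0] + (cols - p0[1]) * seg[1]) / seg_len_sq
    t = np.clip(t, 0.0, 1.0)
    nearest_r = p0[0] + t * seg[0]
    nearest_c = p0[1] + t * seg[1]
    dist = np.hypot(rows - nearest_r, cols - nearest_c)

    sigma = half_width / 2.0
    depth = peak_depth * np.exp(-0.5 * (dist / sigma) ** 2)
    depth[dist > half_width] = 0.0
    return depth

if __name__ == "__main__":
    # A shallow horizontal scratch in a 64 x 64 mask.
    mask = soft_scratch_mask((64, 64), (32, 10), (32, 50),
                             half_width=4.0, peak_depth=0.3)
    print(mask[32, 30], mask[34, 30], mask[36, 30], mask[40, 30])
```

Varying `peak_depth` per scratch would yield masks for scratches of different depths within one image, which is the kind of control that pattern-pasting methods lack.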
3.4. Scratch Segmentation Results with TransCNN
To validate the authenticity and effectiveness of the MSDS-generated data, we followed the methodology in
Section 3.2 to construct a training dataset using 1000 masks and trained TransCNN for scratch segmentation, evaluating its performance. To further assess TransCNN’s efficacy in scratch segmentation, we trained SegNet, U-Net, and SegFormer on the MSDS-generated dataset and compared their test results. We adopted the mean Intersection over Union (mIoU) and Dice coefficient as evaluation metrics, with IoU emphasizing scratch segmentation accuracy unaffected by background pixels and the Dice coefficient balancing precision and recall.
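For reference, the two metrics can be computed for binary scratch masks as in the sketch below. Averaging IoU over test images to obtain mIoU is an assumption made here, as are the function names and the smoothing constant `eps`; the averaging convention used for the reported results may differ.

```python
import numpy as np

def iou_and_dice(pred, target, eps=1e-7):
    """IoU and Dice for one pair of binary masks (scratch = 1).

    Only scratch pixels enter the intersection and union, so a large
    correctly predicted background does not inflate either score.
    `eps` avoids division by zero; two empty masks score 1.0.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    iou = (intersection + eps) / (union + eps)
    dice = (2 * intersection + eps) / (pred.sum() + target.sum() + eps)
    return float(iou), float(dice)

def mean_iou(preds, targets):
    """Mean IoU over a collection of mask pairs (assumed convention)."""
    return float(np.mean([iou_and_dice(p, t)[0] for p, t in zip(preds, targets)]))

if __name__ == "__main__":
    pred = [[1, 1, 0, 0]]
    target = [[0, 1, 1, 0]]
    print(iou_and_dice(pred, target))  # IoU close to 1/3, Dice close to 1/2
```

For a single mask pair the two scores are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU, and the two metrics rank a pair of predictions identically.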
The effectiveness of MSDS. All methods are capable of realistically synthesizing deep scratches due to their distinct features, allowing trained models to effectively detect deep scratches. As illustrated in
Figure 7a, all three methods successfully segment deep-scratch regions, achieving high mIoU and Dice coefficients, as detailed in
Table 1. For shallow scratches, the MSDS, by directly generating scratches, produces a dataset that closely mirrors real-world conditions, outperforming CutPaste and SDAS in segmentation performance. As shown in
Table 1, MSDS shows a greater advantage, surpassing SDAS by 19.8% in mIoU, while maintaining consistent performance relative to deep-scratch detection as illustrated in
Figure 8. This is mainly due to the mask-guided generation of images with various scratch depths, resulting in a more comprehensive dataset and a model with enhanced generalization.
The effectiveness of TransCNN. TransCNN exhibits robust scratch segmentation performance. Compared to mainstream segmentation networks, its advantage in deep-scratch detection is modest, but it significantly outperforms SegNet on shallow scratches. As shown in
Table 1, against the structurally similar U-Net, TransCNN achieves an 8.8% higher mIoU score, and it exceeds SegFormer, which shares a similar modular design, by 0.048 in mIoU. We attribute TransCNN’s superior performance over SegFormer and U-Net to its use of ViT blocks for feature encoding, which expands the receptive field and enhances global feature perception, particularly for smaller scratches. In addition, skip connections and CNN blocks enable spatial detail recovery for ViT-encoded features, resulting in improved segmentation accuracy, as shown in
Figure 7.
3.5. Ablation Study
U-Net can serve as a control for the ablation study of the ViT encoding block due to its convolutional encoder–decoder architecture with skip connections. SegFormer, a ViT-based network, serves as a control for the ablation study of the skip-connected convolutional decoding structure. The results of these experiments have already been presented in
Section 3.4 and are not repeated here. This section focuses on analyzing the factors affecting the performance of the MSDS, with results and metrics presented in
Figure 9 and
Table 2.
Effect of the mask sampling method. To study the effect of the mask sampling method on scratch image generation, we chose a completely random mask sampling method for comparison. A scratch dataset guided by randomly sampled masks was created to train TransCNN. We labeled the dataset sampled using
as
and the dataset sampled using random sampling as
. The results are shown in
Figure 9. The scratches are either too wide or too narrow in the
, while the sampled scratch width is more uniform and more in line with the distribution of
; thus, as shown in
Table 2, a better scratch segmentation model can be trained.
Effect of Scratch Fine-Tuning Process. To investigate the impact of the fine-tuning (FT) process on scratch generation, we removed it from the complete sampling pipeline and repeated the experimental steps outlined in the previous section. We labeled the dataset sampled without the FT process as
and present a selection of results in
Figure 9. Removing the FT process resulted in sharper edges in the generated images. Conversely, incorporating the FT process yielded more natural edge transitions that more closely resemble real-world scratches and, consequently, led to superior performance metrics.