4.1.1. Datasets and Annotation Settings
Experiments are conducted on the MVTec AD dataset and a self-constructed vial defect dataset, as shown in Figure 6. MVTec AD is one of the most widely used public benchmarks in industrial anomaly detection [28]. It contains 15 categories of industrial images covering both texture and object classes, and therefore reflects how well a method applies to standard industrial scenarios. All 15 categories are used in the quantitative comparison. In the analysis, representative categories such as grid, hazelnut, and wood are examined in more detail [29].
To verify the background blending and structural control capability of the proposed method under complex imaging conditions, a self-constructed vial defect dataset is established in addition to MVTec AD. To meet the practical requirements of high-precision, high-efficiency vial appearance inspection in pharmaceutical production, an online machine-vision image acquisition system was built to collect vial images under real quality inspection conditions. Multi-view photographs of the vial visual inspection system are shown in Figure 7, and the parameters of the acquisition equipment are listed in Table 1.
The system simulates the visual inspection workflow of an actual pharmaceutical production line, as shown in Figure 8. Vials to be inspected first enter the clamping station along the conveyor line, where the rotary clamping mechanism ensures continuous and stable automated handling and accurately transfers the vials into the illumination area. Under the combined illumination of a high-intensity ring light and auxiliary light sources, the industrial camera is synchronized with the rotation of the clamping mechanism to capture high-resolution images of key structural regions of the vials. The acquired image data are then uploaded to the host platform for subsequent defect annotation, anomalous sample generation, and downstream detection model training and evaluation.
Considering that vials are transparent glass containers with reflection, specular highlights, and edge refraction, high-quality labels were constructed through manual annotation. During annotation, professional image annotation tools were used to provide pixel-level masks for typical defects such as fine cracks, glass burrs, scratches, and stains on the vial surface. Meanwhile, tight object-detection bounding boxes enclosing the defect regions were generated based on the corresponding masks. This refined dual-annotation scheme provides accurate spatial references for anomaly-region guidance and structural constraints and also supports the training and evaluation of downstream detectors.
The self-constructed vial dataset contains six defect categories, with 500 anomalous images for each category and 3000 anomalous images in total. In addition, normal vial images were collected from the corresponding key inspection regions. Specifically, 1000 normal images were collected for each inspection region, resulting in 6000 normal images in total. Therefore, the normal/anomalous sample ratio of the self-constructed vial dataset is 2:1. The six defect categories correspond to six key-location anomalies: Bottle Expansion (BE), Bottle Bottom (BB), Cap Edge (CE), Cake Surface (CS), Cap Appearance (CA), and Bottle Appearance (BA). The normal images are used as target background images during defect generation, while the anomalous images and their masks provide reference defect information for anomaly synthesis.
The raw images acquired by the industrial camera are initially captured at 1920 × 1200 pixels. For subsequent experiments, the images are cropped around defect-related regions and resized to 640 × 640, which is used as the input resolution of the downstream detector. According to Table 2, the dataset is divided into training, validation, and test sets to ensure consistent training and evaluation. To avoid data leakage, the reference defect images and their corresponding masks used during defect generation are selected only from the training split, while the final test images are kept strictly independent. No reference defect image or reference mask used for generation is reused as a final test sample in downstream detection evaluation.
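The leakage rule above can be illustrated with a minimal sketch. The 70/15/15 split, the image identifiers, and the function names below are hypothetical and chosen only for illustration; the actual split sizes are those given in Table 2.

```python
import random

def split_ids(ids, percents=(70, 15, 15), seed=0):
    """Shuffle image ids and split them into train/val/test.

    The percentages are illustrative assumptions, not the paper's split.
    Integer arithmetic is used so the split sizes are exact.
    """
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * percents[0] // 100
    n_val = len(ids) * percents[1] // 100
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

def pick_references(split, k, seed=0):
    """Draw k reference defect images from the training split only."""
    return random.Random(seed).sample(split["train"], k)

def assert_no_leakage(references, split):
    """Raise if any generation reference also appears in the test split."""
    overlap = set(references) & set(split["test"])
    if overlap:
        raise ValueError(f"{len(overlap)} reference images leak into the test set")

# Hypothetical ids for the 500 anomalous images of one category (BE).
split = split_ids([f"BE_{i:04d}" for i in range(500)])
refs = pick_references(split, k=50)
assert_no_leakage(refs, split)
```

Checking the overlap explicitly, rather than relying on the sampling code alone, keeps the guarantee valid if the reference list is later edited by hand.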
To further illustrate the complexity of the self-constructed vial defect dataset, this paper conducts a statistical analysis of the morphological and scale characteristics of the six defect categories, as shown in Table 3. The statistical indicators include the average aspect ratio, the average area, and the dispersion degrees of the aspect ratio and area of defect regions. The average aspect ratio reflects the spatial extension direction of defect regions, the average area characterizes the defect scale, and the dispersion degree measures the stability of morphological and scale variations within the same category.
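As a rough illustration of these indicators, the sketch below computes them from bounding boxes. Three choices here are assumptions rather than definitions taken from the paper: the aspect ratio is taken as height divided by width (so that larger values mean vertical elongation), the area is measured on the bounding box rather than the mask, and the dispersion degree is taken as the coefficient of variation (population standard deviation divided by the mean).

```python
from statistics import mean, pstdev

def region_stats(boxes):
    """Summarize defect regions of one category.

    boxes: list of (x_min, y_min, x_max, y_max) in pixels, max exclusive.
    Aspect ratio = height / width (assumed convention).
    Dispersion = coefficient of variation (assumed definition).
    """
    ratios = [(y1 - y0) / (x1 - x0) for x0, y0, x1, y1 in boxes]
    areas = [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes]
    return {
        "mean_aspect_ratio": mean(ratios),
        "mean_area": mean(areas),
        "aspect_ratio_dispersion": pstdev(ratios) / mean(ratios),
        "area_dispersion": pstdev(areas) / mean(areas),
    }

# Two made-up boxes: 10 x 20 px and 10 x 40 px (width x height).
stats = region_stats([(0, 0, 10, 20), (5, 5, 15, 45)])
```

A normalized dispersion is used in this sketch so that categories with very different mean areas remain comparable.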
From the overall results, different defect categories exhibit obvious differences in scale range, morphological extension direction, and distribution stability. This indicates that vial defects are not conventional anomalies with a single scale or a single texture pattern, but instead include multiple cases such as tiny defects, structural defects, and large-area regional anomalies. Such complexity increases the difficulty of both defect generation and downstream detection tasks and also imposes higher requirements on the texture realism, morphological rationality, and spatial controllability of generated samples.
Specifically, Cap Appearance defects have a relatively large average aspect ratio, showing obvious vertical elongation characteristics. Bottle Appearance defects have a relatively small average aspect ratio and are closer to horizontal or block-like distributions. Cap Edge and Cake Surface defects exhibit moderate morphological elongation. In terms of area statistics, Cap Edge defects have a significantly larger average area than the other categories, indicating that this type of defect usually covers a wider region. By contrast, Bottle Bottom and Bottle Expansion defects have smaller average areas and are closer to fine-grained defects. Their generation and detection, therefore, rely more heavily on local texture details, edge information, and high-resolution feature representation.
The above statistical results show that the self-constructed vial dataset contains multiple defect categories with significant scale differences, complex structural morphologies, and imbalanced inter-category distributions. This suggests that relying solely on local anomalous appearance transfer is insufficient to cover the complex distribution of real industrial defects. Therefore, this paper introduces multi-domain consistency constraints into the defect generation process to enhance the realism of defect textures, boundary transitions, and background integration. Meanwhile, geometric-semantic constraints are introduced to improve the rationality of defect shape evolution and spatial placement.
Table 3 not only provides supplementary evidence for the morphological and scale characteristics of the self-constructed dataset, but also offers a data basis for the subsequent analysis of differences in generation quality and detection performance among defect categories.
The original MVTec AD dataset mainly provides image-level and pixel-level anomaly annotations, but it does not directly provide object-detection bounding boxes. To construct the downstream detection task, this paper converts the official pixel-level anomaly masks into bounding-box annotations. Specifically, connected-component analysis is first applied to each anomaly mask. Then the minimum enclosing rectangle of each connected anomalous region is extracted as the object-detection label. Small noisy regions with areas below a threshold are filtered out to reduce interference from pseudo-targets during detector training. For generated samples, the corresponding bounding-box labels are synchronously generated from anomaly masks in the same way, which ensures a consistent task definition for downstream detection experiments across different augmentation methods.
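The mask-to-box conversion described above can be sketched in plain Python as follows. The 8-connectivity, the inclusive pixel coordinates, the axis-aligned rectangle, and the `min_area` value are assumptions made for this example; the paper does not state the connectivity or the threshold it uses.

```python
from collections import deque

def mask_to_boxes(mask, min_area=4):
    """Convert a binary mask into axis-aligned bounding boxes.

    mask: list of rows containing 0/1 values.
    Returns (x_min, y_min, x_max, y_max) with inclusive pixel coordinates,
    one box per 8-connected region of at least min_area pixels.
    """
    h = len(mask)
    w = len(mask[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over one connected region.
            queue = deque([(y, x)])
            seen[y][x] = True
            area = 0
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                area += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Drop small noisy regions below the area threshold.
            if area >= min_area:
                boxes.append((x0, y0, x1, y1))
    return boxes

example_mask = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
]
boxes = mask_to_boxes(example_mask, min_area=4)
```

In this toy mask the isolated pixel in the second row is discarded as noise, while the two larger regions each yield one box. Applying the same function to real and generated masks is what keeps the label definition identical across augmentation methods.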
4.1.2. Experimental Environment
The experimental platform is configured with an Intel Core i7-13700F CPU, 32 GB of memory, and an NVIDIA GeForce RTX 4090 GPU. The software environment includes Windows 10, Python 3.9, PyTorch 2.0, and CUDA 11.8. To ensure fair comparison, YOLOv11 is uniformly selected as the downstream detector. The data split, number of training epochs, learning rate, optimizer, and input resolution are kept the same across all methods, and the only difference is the source of the training samples. To balance the preservation of real data distribution and the effectiveness of defect-sample augmentation, the ratio of original samples to synthetic samples is uniformly set to 1:1 for all methods. Detailed settings are listed in Table 2.
To further reduce bias caused by differences in sample scale, all augmentation methods generate the same number of synthetic samples, and only the generation strategy differs. Specifically, in the vial scenario, the algorithm generates 1000 pairs of anomalous samples with high-precision mask annotations for each defect category for generation-quality evaluation and downstream detection experiments. On the MVTec AD dataset, each category also uses a fixed number of augmented samples. The main purpose is to compare generation quality across methods in standard industrial scenarios.
In the data preprocessing stage, images used in the generation branch are adjusted to a fixed resolution of 1024 × 1024 using a unified padding strategy to meet the input requirements of the pretrained diffusion model. It should be noted that 640 × 640 is used as the standardized resolution for the processed dataset and the input size of the downstream detector, whereas 1024 × 1024 is used only in the generation branch to satisfy the input-resolution requirement of the diffusion model. To preserve fine details in vial images, the algorithm first performs local cropping and scale enlargement on potential defect regions with the help of annotation masks. This local-focusing strategy retains as many fine defect details as possible, such as cracks, under limited computational resources. Through this standardized preprocessing pipeline, the generated defect images maintain high semantic consistency with the original annotations and better match the practical requirements of high-precision industrial inspection.
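The text does not specify the padding strategy, so the following is only one plausible reading: an aspect-preserving resize followed by centered padding (often called letterboxing). The function names are hypothetical. The sketch computes the geometry and shows how an annotation box is mapped into the 1024 × 1024 frame so that masks and boxes stay aligned with the padded image.

```python
def letterbox_params(width, height, target=1024):
    """Geometry for an aspect-preserving resize plus centered padding.

    Returns (scale, new_w, new_h, pad_left, pad_top) for a square
    target canvas. This is an assumed strategy, not the paper's.
    """
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return scale, new_w, new_h, pad_left, pad_top

def map_box(box, scale, pad_left, pad_top):
    """Map (x_min, y_min, x_max, y_max) into the padded canvas."""
    x0, y0, x1, y1 = box
    return (x0 * scale + pad_left, y0 * scale + pad_top,
            x1 * scale + pad_left, y1 * scale + pad_top)

# A raw 1920 x 1200 frame and a 640 x 640 processed crop.
raw = letterbox_params(1920, 1200)
crop = letterbox_params(640, 640)
full_frame_box = map_box((0, 0, 1920, 1200), raw[0], raw[3], raw[4])
```

For a square 640 × 640 crop the padding is zero and the operation reduces to a plain 1.6× upscale; padding only matters for non-square inputs such as the raw camera frame.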
4.1.3. Implementation Details
The proposed method is built upon the TF-IDG framework; therefore, the overall diffusion backbone and inference pipeline follow the settings of the original framework. Specifically, pretrained Stable Diffusion and its corresponding ControlNet are adopted for image editing, without any parameter updates. The ROI is obtained before image generation and is used only as a spatial constraint for defect generation during the inference stage. Feature representations are extracted by the pretrained feature encoder in the TF-IDG framework, while the proposed method is used only to guide the diffusion sampling process. The sampling strategy follows the same DDIM scheduler configuration as the original framework. During the entire inference stage, the inputs include a normal image, a reference defective image, the corresponding defect mask, and the ROI prior, and no additional model needs to be trained.
For hyperparameter settings, the newly introduced constraint terms use fixed weights, and the detailed values are listed in Table 4. Here, MDC denotes the multi-domain consistency constraint mechanism, and SSC denotes the geometric-semantic constraint mechanism.
To prevent scale differences among the constraint terms from affecting the generation process, the weights are determined by combining empirical tuning with grid search. Specifically, the search ranges are first set by referring to the magnitudes of the energy guidance terms in the original TF-IDG framework. Grid search with a step size of 0.05 is then performed on the validation set, separately for the weights of the multi-domain consistency constraints and the geometric-semantic constraints. Local IS, IC LPIPS, and downstream detection mAP are used jointly as evaluation metrics. The search stops when further adjustment no longer brings a clear performance improvement, and the resulting configuration is adopted as the experimental setting of this paper. The same parameter configuration is used in all experiments, with no category-specific tuning, to ensure fairness. The parameters listed in Table 4 are therefore fixed values obtained after validation-set tuning.
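A grid search with a 0.05 step can be sketched as below. The search ranges, the joint search over both weights, and the toy `validation_score` function are assumptions for illustration only; in the paper the score would combine Local IS, IC LPIPS, and detection mAP on the validation set, and the two weight groups are searched separately.

```python
def grid(low, high, step=0.05):
    """Grid points from low to high inclusive, built from integer
    indices to avoid accumulating floating-point drift."""
    n = round((high - low) / step)
    return [round(low + i * step, 10) for i in range(n + 1)]

def grid_search(score_fn, mdc_range, ssc_range, step=0.05):
    """Return (best_score, w_mdc, w_ssc) over the weight grid."""
    best = None
    for w_mdc in grid(*mdc_range, step):
        for w_ssc in grid(*ssc_range, step):
            score = score_fn(w_mdc, w_ssc)
            if best is None or score > best[0]:
                best = (score, w_mdc, w_ssc)
    return best

def validation_score(w_mdc, w_ssc):
    """Toy stand-in for the combined validation metric (hypothetical).
    It simply peaks at w_mdc = 0.3, w_ssc = 0.15."""
    return -((w_mdc - 0.3) ** 2) - ((w_ssc - 0.15) ** 2)

best = grid_search(validation_score, mdc_range=(0.0, 0.5), ssc_range=(0.0, 0.5))
```

Because the selected weights are fixed afterwards and shared across all categories, the search is run once on the validation set rather than per category.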
Given the need for fine-grained structure preservation and local texture control in industrial defect generation, the denoising process adopts a DDIM scheduler with 50 sampling steps. The first 30 steps perform feature alignment and gradient guidance, and the last 20 steps perform texture enhancement and detail refinement. In addition, the ROI prior is obtained by a region localization module based on the positions of the reference defect images, and the parameters of this module are not used to update the defect-generation backbone. In the vial scenario, the ROI can be obtained directly from component-region annotations or from an existing localization module [28,30].
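The two-phase schedule can be written down as a simple step-to-phase mapping. The function name and phase labels below are hypothetical; the sketch only encodes the 30/20 division of the 50 DDIM steps and does not reproduce the guidance or refinement computations themselves.

```python
def phase_for_step(step_index, total_steps=50, guided_steps=30):
    """Return the phase of a zero-based sampling step.

    Steps 0..29 use feature alignment and gradient guidance;
    steps 30..49 use texture enhancement and detail refinement.
    """
    if not 0 <= step_index < total_steps:
        raise ValueError("step_index out of range")
    return "guidance" if step_index < guided_steps else "refinement"

schedule = [phase_for_step(i) for i in range(50)]
```

In a sampling loop, this mapping would decide at each step whether the guidance gradients or the refinement terms are applied to the denoising update.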