4.1. Experimental Setting
Diffusion model. The proposed method follows the DDPM [19] paradigm with a UNet [39] architecture trained from scratch. We modified the input and output channels accordingly to support our bounding-box encoding representation and the denoising of the segmentation map. Both during training and testing, the number of denoising iterations was set to 1000. We trained for 300 epochs using AdamW [40] as the optimizer with a learning rate of
and a batch size of 8.
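As a rough illustration of the DDPM forward (noising) process with 1000 steps, the following sketch draws a noisy sample at an arbitrary timestep. The linear beta schedule (1e-4 to 0.02) is the one from the original DDPM paper and is only assumed here; the tensor shape and the function name `q_sample` are hypothetical.

```python
import numpy as np

T = 1000  # number of denoising iterations stated in the text

# Assumption: linear beta schedule as in the original DDPM paper.
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)


def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) and return it together with the noise used."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise


rng = np.random.default_rng(0)
# Hypothetical input: stacked image and segmentation channels scaled to [-1, 1].
x0 = rng.uniform(-1.0, 1.0, size=(4, 64, 64))
x_t, eps = q_sample(x0, t=T - 1, rng=rng)
```

During training, the network would be asked to predict `eps` from `x_t`, the timestep, and the conditioning; that part is omitted here.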
Downstream task. For the semantic segmentation downstream task, we employed a UNet architecture with a ResNet-18 [41] backbone. We used a single network for each segmentation class to avoid class balancing problems and concentrate on the synthetic data assessment. The training lasted for 100 epochs using AdamW as the optimizer with a learning rate of
and a batch size of 64 on a single Nvidia RTX 4090.
Dataset. Although several open industrial defect datasets are available, most present significant practical limitations when used as a single training source. The first group of datasets is constrained by limited scale [42,43]. Such limited data volume is insufficient for training diffusion models. The second group includes datasets with single-class or very low class diversity, such as concrete crack benchmarks [44], which provide high-quality masks but model only one defect category, thereby preventing evaluation of multi-class segmentation capability. The third group comprises datasets with strong class imbalance or scarce defect-positive samples, such as KolektorSDD/SDD2 [45,46], where only a small fraction of images contain annotated defects, limiting effective supervised learning of defect regions. Additionally, some classical benchmarks such as DAGM 2007 either rely partially on synthetic data or lack the variability and scale typically encountered in real industrial production.
In contrast, the Wood Defect Detection dataset [47] combines large-scale data volume, multiple defect categories, real production-line acquisition, and pixel-level annotations for all defect instances. It contains 20,276 images with semantic segmentation and bounding-box annotations for 10 classes of wood defects. In our experiments, we aggregated the four knot classes into a single class and excluded the underrepresented blue stain and overgrown classes. This yields a dataset of 20,107 images with 5 defect classes (knot, crack, quartzite, resin, and marrow).
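The class aggregation described above can be sketched as a simple label remapping. The original class names used below are assumptions about how the dataset labels its ten categories and should be checked against the annotation files; `remap_label` and `remap_annotations` are hypothetical helpers.

```python
# Assumed names of the four knot subclasses in the original annotations.
KNOT_CLASSES = {"live_knot", "dead_knot", "knot_with_crack", "knot_missing"}
# Underrepresented classes that are excluded.
EXCLUDED_CLASSES = {"blue_stain", "overgrown"}
# Classes kept unchanged.
KEPT_CLASSES = {"crack", "quartzite", "resin", "marrow"}


def remap_label(name):
    """Return the merged class name, or None if the class is excluded."""
    if name in KNOT_CLASSES:
        return "knot"
    if name in EXCLUDED_CLASSES:
        return None
    if name in KEPT_CLASSES:
        return name
    raise ValueError(f"unknown class: {name}")


def remap_annotations(annotations):
    """Merge knot subclasses and drop excluded classes.

    `annotations` is a list of (class_name, bbox) pairs.
    """
    remapped = []
    for name, bbox in annotations:
        new_name = remap_label(name)
        if new_name is not None:
            remapped.append((new_name, bbox))
    return remapped
```

How images that are left without any annotation after this step are handled is not specified here and is not covered by the sketch.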
Moreover, we split the dataset into three subsets: 70% for training the diffusion model, 20% for training the segmentation model, and 10% as a fixed real test set. Additionally, the bounding-box annotations from the 20% real split are used to generate synthetic data for evaluating the semantic segmentation task.
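A fixed three-way split of this kind can be reproduced with a seeded shuffle, as in the sketch below. The 20% and 10% fractions follow the splits mentioned in the text; the 70% fraction is taken as the remainder and, like the seed and the shuffling procedure, is an assumption rather than the authors' exact protocol.

```python
import random


def split_indices(n, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split range(n) into diffusion-train, segmentation-train, and test indices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded, hence reproducible
    n_diffusion = int(fractions[0] * n)
    n_segmentation = int(fractions[1] * n)
    diffusion = indices[:n_diffusion]
    segmentation = indices[n_diffusion:n_diffusion + n_segmentation]
    test = indices[n_diffusion + n_segmentation:]  # remainder is the fixed test set
    return diffusion, segmentation, test


diffusion_idx, seg_idx, test_idx = split_indices(20107)
```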
Figure 3 illustrates some samples from the original dataset.
4.2. Data Synthesis Assessment
To assess the quality of synthetic data, we compare our approach with the current state-of-the-art layout-conditional diffusion model [11], using its original code implementation adapted to handle non-square images. Specifically, we focus on evaluating the consistency between the generated defects and their corresponding bounding-box constraints. To quantify this relationship, we introduce two metrics: the Segmentation Alignment Error (SAE) and the Empty Bounding-Box Rate (EBR).
Segmentation Alignment Error (SAE). With this measure, we quantify how many generated defect pixels fall outside their designated bounding boxes, indicating misalignment between the generated defects and their constraints. Formally, let:
Thus, we define the metric as follows:
where a lower value indicates that the model is more consistent with the generation condition.
As shown in Table 1, the method proposed in [11] struggles to maintain defect placement within the bounding boxes, resulting in a very high mean SAE of
across all the defects. In contrast, our approach, leveraging a dual bounding-box encoding strategy (BASD and C-BASD), significantly improves alignment, with only
of generated pixels falling outside the given regions.
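To make the idea behind SAE concrete, the sketch below computes the fraction of generated defect pixels that fall outside the conditioning boxes. The normalization by the total number of generated defect pixels, the class-agnostic treatment, and the `(x_min, y_min, x_max, y_max)` box format with exclusive maxima are assumptions made for illustration; a per-class variant would compare each class mask with the boxes of the same class.

```python
import numpy as np


def boxes_to_mask(boxes, shape):
    """Rasterize (x_min, y_min, x_max, y_max) boxes into a boolean mask."""
    inside = np.zeros(shape, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        inside[y0:y1, x0:x1] = True
    return inside


def segmentation_alignment_error(defect_mask, boxes):
    """Fraction of generated defect pixels lying outside every box."""
    defect_mask = np.asarray(defect_mask, dtype=bool)
    total = defect_mask.sum()
    if total == 0:
        return 0.0  # nothing generated, so nothing can be misaligned
    outside = defect_mask & ~boxes_to_mask(boxes, defect_mask.shape)
    return float(outside.sum() / total)


# Toy example: a 4x4 defect of which only the left half is covered by the box.
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
sae = segmentation_alignment_error(mask, [(2, 2, 4, 6)])
```

In the toy example, 8 of the 16 defect pixels lie outside the box, so `sae` is 0.5.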
Empty Bounding-Box Rate (EBR). To assess whether the generated defects correctly fall within their designated bounding boxes, we define the Empty Bounding-Box Rate (EBR). This metric quantifies how many bounding boxes remain empty, meaning no synthetic pixels are generated inside them. Formally, let:
$\mathcal{B}$ be the set of all bounding boxes;
$\mathcal{B}_{\varnothing} \subseteq \mathcal{B}$ be the subset of bounding boxes that contain no generated pixels.
Thus, we define the metric as follows:
$$\mathrm{EBR} = \frac{|\mathcal{B}_{\varnothing}|}{|\mathcal{B}|},$$
where higher values indicate that a larger number of bounding boxes are missed during generation, signifying a poorer retrieval of the provided conditioning.
As reported in Table 2, our proposal shows markedly better retrieval ability according to the EBR metric. Specifically, our average EBR is about 5.51% of all bounding boxes, improving on the competitor [11] by more than 20 percentage points.
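The EBR computation can be sketched in a few lines: a box counts as empty when no generated defect pixel lies inside it. The class-agnostic check and the `(x_min, y_min, x_max, y_max)` box format with exclusive maxima are assumptions for illustration.

```python
import numpy as np


def empty_bbox_rate(defect_mask, boxes):
    """Fraction of boxes that contain no generated defect pixel."""
    defect_mask = np.asarray(defect_mask, dtype=bool)
    if len(boxes) == 0:
        return 0.0  # no conditioning boxes, so none can be missed
    n_empty = sum(
        1 for x0, y0, x1, y1 in boxes if not defect_mask[y0:y1, x0:x1].any()
    )
    return n_empty / len(boxes)


# Toy example: one box is hit by the generated defect, the other stays empty.
mask = np.zeros((8, 8), dtype=bool)
mask[1:3, 1:3] = True
ebr = empty_bbox_rate(mask, [(0, 0, 4, 4), (5, 5, 8, 8)])
```

Here one of the two boxes is empty, so `ebr` is 0.5.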
Visual sample quality. To further analyze the quality of the generated synthetic images, we report the Fréchet Inception Distance (FID) [48], the Kernel Inception Distance (KID) [49], and LPIPS [50]. FID and KID are computed between real and synthetic images using features extracted from the InceptionV3 network [51]. Specifically, we evaluate the statistics at different intermediate feature layers (with feature dimensionalities of 2048, 768, 192, and 64), following standard practice to assess both high-level semantic alignment and lower-level texture fidelity.
Higher-level features (like the 2048-dimensional Inception embedding) capture broader semantic structures and distributional alignment, but are less sensitive to fine texture and local structural coherence than lower-level features. Therefore, a method that excels at texture/detail accuracy (which matters more for defect realism) can sometimes appear worse at the highest feature layer, because those layers emphasize global layout similarity rather than local perceptual fidelity.
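For reference, the Fréchet distance underlying FID can be computed from two feature matrices as sketched below. Feature extraction with InceptionV3 is omitted, so the inputs are assumed to be precomputed arrays of shape `(n_samples, n_features)` taken from whichever layer (64, 192, 768, or 2048 dimensions) is being evaluated. The trace of the matrix square root is obtained from eigenvalues, which is one of several equivalent ways to compute it.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (up to numerical error).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
real_feats = rng.standard_normal((256, 4))  # stand-in for real-image features
fake_feats = real_feats + 1.0               # same covariance, shifted mean
fd = frechet_distance(real_feats, fake_feats)
```

With identical covariances, the distance reduces to the squared mean shift, which is 4 for the four-dimensional toy features above.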
As shown in Table 3, our method consistently improves FID and KID at lower-level feature representations, indicating better local structural fidelity. Moreover, LPIPS computed across multiple backbones (AlexNet, VGG-16, and SqueezeNet) confirms improved perceptual similarity and robustness across architectures. All metrics were computed using the same number of real and synthetic samples for both methods to ensure a fair comparison: FID and KID are computed using the full real test split (10% of the dataset, N images) and an equal number (N) of synthetic samples generated from the same bounding-box annotations for each method. Each metric is evaluated over three independent sampling runs using the same trained model, and we report mean ± standard deviation.
Qualitative results. To further illustrate this comparison, Figure 4 and Figure 5 depict qualitative examples. These examples show that [11] not only fails to confine defects within the bounding boxes but also occasionally generates wrong segmentation labels.
4.3. Downstream Task Evaluation
To evaluate the effectiveness of our synthetic data, we conduct a semantic segmentation experiment using a UNet architecture trained on different data configurations.
Starting from the 20% split, we use the original bounding-box annotations as guidance to generate pairs of images and labels. We do so for both methods, ours and [11]. We then use this synthetic split to train the segmentation pipeline. Moreover, to ensure a fair comparison between approaches, we discard synthetic pixel labels generated outside the conditioning bounding boxes. This step is applied identically to all methods and does not modify the generated RGB images. Its purpose is to isolate the effect of bounding-box-guided supervision during downstream training, while global consistency and leakage are independently evaluated through SAE and EBR.
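The label-filtering step described above can be sketched as follows: labels outside all conditioning boxes are reset to background, while the image itself is left untouched. The background index `0`, the class-agnostic masking, and the `(x_min, y_min, x_max, y_max)` box format with exclusive maxima are assumptions for illustration.

```python
import numpy as np

BACKGROUND = 0  # assumed label index for defect-free pixels


def clip_labels_to_boxes(label_map, boxes):
    """Reset every label outside the conditioning boxes to background."""
    label_map = np.asarray(label_map)
    keep = np.zeros(label_map.shape, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        keep[y0:y1, x0:x1] = True
    return np.where(keep, label_map, BACKGROUND)


# Toy example: a label map filled with class 2, conditioned on one 2x2 box.
labels = np.full((6, 6), 2)
clipped = clip_labels_to_boxes(labels, [(1, 1, 3, 3)])
```

Only the four pixels inside the box keep their label in `clipped`; the input `labels` array is not modified.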
Table 4 presents the F1 scores computed on the 10% real test split, where we compare models trained on real data, synthetic data, and a combination of both. Notably, when training on synthetic data alone, our approach surpasses [11] by an impressive
, demonstrating its ability to generate more valid training samples. This highlights the superior quality and consistency of our synthetic segmentation maps, which provide a more reliable learning signal for the segmentation task.
When incorporating real data into the training process, the performance gap between the two methods narrows, as real samples provide a strong baseline. However, even in this hybrid setting, leveraging our synthetic data leads to the best overall F1 score, achieving a improvement over using only real data. This behavior suggests that the diffusion model captures the real data distribution reasonably well, producing synthetic samples whose visual and statistical properties are close to those of the real data. As a result, most of the performance gain on segmentation is achieved when replacing real data with synthetic data, while adding synthetic data on top of real data yields diminishing but still measurable returns. This result highlights that our method complements real-world annotations and can achieve strong performance with fewer labeled samples, potentially reducing the time and effort required for manual labeling in industrial scenarios.
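Since one binary network is trained per class, the F1 score can be computed per class from binary masks, as in the sketch below. Whether the scores are accumulated at pixel level over the whole test set or averaged per image is not stated here, so the function simply scores whatever pair of masks it is given; the `eps` term guarding against empty masks is an implementation choice.

```python
import numpy as np


def binary_f1(pred, target, eps=1e-8):
    """Pixel-level F1 score between a binary prediction and a binary target."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    return float(2 * tp / (2 * tp + fp + fn + eps))


# Toy example: one true positive, one false positive, one false negative.
f1 = binary_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

In the toy example, precision and recall are both 0.5, so `f1` is approximately 0.5.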
4.4. Ablation Study
To isolate the contribution of the proposed encoding strategies, we conduct an ablation study in which we remove the signed distance representation (BASD) while preserving the class-aware bounding-box encoding (C-BASD).
It is important to note that class information cannot be removed without fundamentally altering the task definition. The class label specifies which defect type must be generated inside each bounding box, and therefore constitutes a necessary conditioning signal rather than a design choice. For this reason, the only meaningful internal ablation consists of removing the geometric signed distance encoding while keeping the class-aware representation unchanged. This allows us to directly assess the contribution of the boundary-aware signal introduced by BASD.
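To give an intuition for the kind of geometric signal a signed distance encoding carries, the sketch below computes a generic signed distance map for an axis-aligned box. It is not the BASD definition used in this work, only a standard signed distance function; the sign convention (negative inside, positive outside) and the function name are assumptions.

```python
import numpy as np


def box_signed_distance(shape, box):
    """Signed Euclidean distance to the boundary of an axis-aligned box.

    Values are negative inside the box, zero on its boundary, and positive
    outside. `box` is (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    height, width = shape
    x0, y0, x1, y1 = box
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w, half_h = (x1 - x0) / 2.0, (y1 - y0) / 2.0
    dx = np.abs(xs - cx) - half_w
    dy = np.abs(ys - cy) - half_h
    outside = np.hypot(np.maximum(dx, 0.0), np.maximum(dy, 0.0))
    inside = np.minimum(np.maximum(dx, dy), 0.0)
    return outside + inside


sdf = box_signed_distance((9, 9), (2, 2, 6, 6))
```

Unlike a binary box mask, such a map tells every pixel how far it is from the box boundary, which is the type of boundary-aware information removed in this ablation.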
4.4.1. Impact on Retrieval Ability (EBR)
As shown in Table 5, removing the SDF component degrades the Empty Bounding-Box Rate across most classes. The average EBR increases from
in the full model to
without SDF.
The degradation is particularly visible for the crack class (from to ), while other categories remain relatively stable. Although the numerical differences may appear moderate, the consistent increase in miss rate indicates that the signed distance encoding contributes to more reliable activation of the conditioned regions. In other words, BASD improves the robustness of defect retrieval inside the prescribed bounding boxes.
4.4.2. Impact on Spatial Alignment (SAE)
The effect of removing BASD is more pronounced when analyzing spatial alignment. The overall SAE increases from to , indicating a clear deterioration in boundary consistency. In particular, knot and resin exhibit noticeable increases in misaligned pixels, and crack shows a degradation from to .
These results suggest that the geometric information encoded by the signed distance map provides a structural prior that guides the denoising trajectory toward spatially coherent defect shapes. Without this boundary-aware signal, the diffusion process still generates defects inside the boxes, but with weaker spatial precision and increased leakage or boundary irregularities.
4.4.3. Controlled Overlap Analysis
To further evaluate the robustness of our encoding in multi-class overlap scenarios, we conduct an additional controlled experiment by generating synthetic samples with predefined bounding-box overlaps of , , and IoU. We then compute EBR and SAE under these controlled conditions.
As reported in Table 6, the degradation remains limited as overlap increases. Both retrieval reliability (EBR) and spatial alignment (SAE) show only moderate variation across overlap levels, indicating that the proposed deterministic partition strategy provides stable conditioning even in structured intersection regions. These results confirm that the analog bit encoding combined with the geometric overlap partition does not introduce instability in multi-class overlapping configurations.
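One simple way to construct box pairs with a prescribed IoU is to shift a copy of a box horizontally. For two equal boxes of width w shifted by d, the IoU is (w - d) / (w + d), so d = w (1 - IoU) / (1 + IoU). The sketch below uses this relation; the construction itself and the IoU levels in the example are placeholders, not the settings used in the experiment.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shifted_box_with_iou(box, target_iou):
    """Return a same-size copy of `box`, shifted right to reach `target_iou`."""
    if not 0.0 < target_iou <= 1.0:
        raise ValueError("target_iou must be in (0, 1]")
    x0, y0, x1, y1 = box
    shift = (x1 - x0) * (1.0 - target_iou) / (1.0 + target_iou)
    return (x0 + shift, y0, x1 + shift, y1)


base_box = (10.0, 10.0, 50.0, 30.0)
# Placeholder overlap levels chosen only for this example.
pairs = {t: shifted_box_with_iou(base_box, t) for t in (0.25, 0.5, 0.75)}
```

The resulting coordinates are continuous; rounding them to the pixel grid changes the achieved IoU slightly.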
4.4.4. Discussion
Overall, this ablation confirms that the performance gains observed in Section 4 cannot be attributed solely to the class-enriched encoding. While class information determines what defect to generate, the signed distance representation strongly influences how the defect conforms to its spatial constraint.
The combination of BASD and C-BASD therefore proves essential for achieving both reliable bounding-box retrieval and accurate spatial alignment. In particular, BASD acts as a geometric regularizer that stabilizes conditioning and improves boundary fidelity during the diffusion process.