Better with Less: Efficient and Accurate Skin Lesion Segmentation Enabled by Diffusion Model Augmentation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Some weaknesses must be addressed before publication.
- The paper does not sufficiently justify why dilated convolutions in DDPM significantly outperform alternative designs or other state-of-the-art augmentation frameworks. No direct comparison with advanced GAN-based augmentation methods or diffusion-based conditional generation is provided.
- Several important parameters are missing or unclear, e.g., total number of generated images, computational cost of DDPM training, rationale for choosing 1000 synthetic samples, and details of filtering poor-quality images.
- Many sentences in related works and results sections are repetitive, reducing conciseness. The manuscript contains numerous grammatical errors, awkward phrasing, and inconsistent tense usage, reducing readability.
- Figures of generated images (Fig. 2, Fig. 3) are low-resolution and do not clearly demonstrate fidelity improvements. Tables lack statistical significance testing except for one instance (Fig. 5). The visual differences between baseline and augmented segmentation masks are not compelling.
- Some works about segmentation should be cited in this paper to make this submission more comprehensive, such as 10.1109/TPAMI.2024.3511621.
I recommend major revisions to enhance the quality of this manuscript. Additional details and explanations would greatly improve the manuscript.
Author Response
Comments 1: The paper does not sufficiently justify why dilated convolutions in DDPM significantly outperform alternative designs or other state-of-the-art augmentation frameworks. No direct comparison with advanced GAN-based augmentation methods or diffusion-based conditional generation is provided.
Response 1: We kindly thank the reviewer for their valuable comments. As clarified in the revised manuscript (Section 3.1, pages 5–6), the primary objective of our work is to investigate whether incorporating dilated convolutions into the DDPM backbone can effectively enlarge its receptive field and thereby improve the quality of generated dermoscopic images, rather than to achieve state-of-the-art performance across all augmentation frameworks. Our focus is on demonstrating the architectural impact within DDPM itself. Regarding GAN-based augmentation methods, we have discussed their characteristics in the Related Works section (Section 2.2, page 4), noting that while they have been widely used, such models often suffer from training instability and mode collapse, which we also encountered in our earlier studies. Our previous work explored GAN-based generation extensively, but we have since shifted to DDPM-based approaches due to their superior stability and fidelity in our domain. For conditional diffusion-based generation, we acknowledge its potential and view it as a valuable direction for future work; however, it is beyond the scope of this study, which focuses on architectural modification of the unconditional DDPM for data augmentation.
Comments 2: The paper does not sufficiently justify why dilated convolutions in DDPM significantly outperform alternative designs or other state-of-the-art augmentation frameworks. No direct comparison with advanced GAN-based augmentation methods or diffusion-based conditional generation is provided.
Response 2: Our study aims to examine the architectural effect of enlarging the DDPM receptive field via dilated convolutions, rather than to benchmark against all existing augmentation frameworks. Standard DDPMs employ fixed-size convolutions, which limit multi-scale context modeling for dermoscopic lesions; dilated convolutions address this without increasing parameters. GAN-based augmentation methods are discussed in Section 2.2, where we note their training instability and mode collapse, challenges also observed in our earlier work. Conditional diffusion is a valuable direction but beyond the scope of this study, which focuses on unconditional DDPM modification.
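To make this architectural change concrete, the following is a minimal PyTorch-style sketch of a residual block in which one 3x3 convolution is given a dilation rate. The channel count, normalization choice, dilation rate, and the omission of the timestep-embedding pathway are illustrative assumptions, not the exact block used in our network.

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """Illustrative residual block for a DDPM U-Net backbone.

    A dilated 3x3 convolution has exactly the same number of parameters as a
    standard 3x3 convolution with the same channel counts, but a larger
    receptive field. The timestep-embedding pathway of a real DDPM block is
    omitted here for brevity.
    """

    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.norm2 = nn.GroupNorm(8, channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv1(self.act(self.norm1(x)))
        h = self.conv2(self.act(self.norm2(h)))
        return x + h  # residual connection preserves the input resolution

# Shapes are preserved, so the block can drop into an existing U-Net stage.
block = DilatedResBlock(channels=64, dilation=2)
print(block(torch.randn(1, 64, 256, 256)).shape)  # torch.Size([1, 64, 256, 256])
```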
Comments 3: Several important parameters are missing or unclear, e.g., total number of generated images, computational cost of DDPM training, rationale for choosing 1000 synthetic samples, and details of filtering poor-quality images.
Response 3: We have added these details in Section 4.2. Specifically, we generated 5,000 dermoscopic images at 256 × 256 resolution and retained 1,000 after a two-stage quality control process: (1) automated screening using FID and LPIPS thresholds to remove unrealistic samples, and (2) manual annotation of the remaining images using Labelme. The choice of 1,000 samples reflects a balance between annotation cost and dataset diversity. Compute cost has also been reported: ~19.4 GPU-hours for DDPM training and ~20.39 GPU-hours for generating 5,000 images on a single RTX 3090 (16GB). All new information is highlighted in red in the revised manuscript.
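As an illustration of what the automated screening stage can look like, the sketch below keeps a synthetic image only if its LPIPS distance to the closest real reference image falls inside an acceptance band. The thresholds, the reference set, and the use of LPIPS alone (without the FID stage) are assumptions made for illustration, not our exact filtering protocol.

```python
# Minimal LPIPS-based screening sketch; assumes the `lpips` package.
# Thresholds and the random tensors below are placeholders.
import torch
import lpips

perceptual = lpips.LPIPS(net="alex")  # AlexNet-based perceptual distance

def screen(synthetic: torch.Tensor, reals: torch.Tensor,
           low: float = 0.05, high: float = 0.55) -> list[int]:
    """Return indices of synthetic images whose nearest-real LPIPS distance
    lies in [low, high]. Images are floats in [-1, 1], shape (N, 3, H, W).
    Very small distances flag near-duplicates of the training data;
    very large distances flag unrealistic samples."""
    kept = []
    with torch.no_grad():
        for i in range(synthetic.shape[0]):
            s = synthetic[i:i + 1]
            dists = torch.stack([perceptual(s, reals[j:j + 1]).squeeze()
                                 for j in range(reals.shape[0])])
            if low <= dists.min().item() <= high:
                kept.append(i)
    return kept

# Toy usage; in practice the tensors would be loaded dermoscopic images.
fake = torch.rand(4, 3, 256, 256) * 2 - 1
real = torch.rand(8, 3, 256, 256) * 2 - 1
print(screen(fake, real))
```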
Comments 4: Many sentences in related works and results sections are repetitive, reducing conciseness. The manuscript contains numerous grammatical errors, awkward phrasing, and inconsistent tense usage, reducing readability.
Response 4: We thank the reviewer for pointing out these issues. In the revised manuscript, we have thoroughly edited the Related Works and Results sections to remove repetitive sentences and improve conciseness. We have also performed a comprehensive language revision, including correction of grammatical errors, rephrasing of awkward expressions, and unification of tense usage across the paper. All modifications are highlighted in red in the revised version.
Comments 5: Figures of generated images (Fig. 2, Fig. 3) are low-resolution and do not clearly demonstrate fidelity improvements. Tables lack statistical significance testing except for one instance (Fig. 5). The visual differences between baseline and augmented segmentation masks are not compelling.
Response 5: We appreciate the reviewer's insightful comments. In the revised manuscript, Figures 2 and 3 have been updated with higher-resolution examples to more clearly demonstrate fidelity improvements. Furthermore, statistical significance testing has been added to all relevant tables to reinforce the robustness of our findings. In addition, we have included improved qualitative comparisons of segmentation masks, as shown in Fig. 1, which better highlight the differences between baseline and augmented results. All revisions are marked in red in the manuscript.
Figure 1: Qualitative comparison of segmentation results. (a) Original image, (b) Ground truth, (c) U-Net (R), (d) U-Net (R+S), (e) Dilated U-Net (R), (f) Dilated U-Net (R+S). Incorporating synthetic data (R+S) improves boundary sharpness and completeness, especially for ambiguous lesion regions.
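For clarity on the kind of test involved, the sketch below runs a Wilcoxon signed-rank test on paired per-image Dice scores. The score values are placeholders, and the choice of this particular non-parametric test is an assumption for illustration rather than a record of the exact procedure used for every table.

```python
# Paired significance test on per-image Dice scores (placeholder values).
import numpy as np
from scipy.stats import wilcoxon

dice_baseline  = np.array([0.81, 0.78, 0.85, 0.80, 0.77, 0.83, 0.79, 0.82])
dice_augmented = np.array([0.84, 0.80, 0.86, 0.83, 0.79, 0.85, 0.80, 0.84])

# One-sided test: are augmented scores systematically higher than baseline?
stat, p_value = wilcoxon(dice_augmented, dice_baseline, alternative="greater")
print(f"Wilcoxon statistic = {stat}, p = {p_value:.4f}")
```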
Comments 6: Some works about segmentation should be cited in this paper to make this submission more comprehensive, such as 10.1109/TPAMI.2024.3511621.
Response 6: We appreciate the reviewer's suggestion. The recommended work (DOI: 10.1109/TPAMI.2024.3511621), together with several additional related works, has been added to the References section and appropriately cited in the Related Works section to improve the completeness of the literature review.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The authors proposed an innovative two-stage model that combines a comprehensively augmented dataset with an improved denoising diffusion probabilistic model, achieving promising results in the automated segmentation of skin lesions. However, the paper still has areas that require improvement.
- In recent years, some deep feature fusion methods have been widely applied in various fields (e.g., MFST (IEEE GRSL), MCBTNet (IEEE JBHI)). These methods should also be incorporated into the related works.
- Why can’t conventional DDPM models be used for skin lesion segmentation tasks? What was the authors’ motivation for improving the DDPM?
- DDPM can be used to generate images, but how is the ground truth determined when these synthesized images are used for training?
- What similarities and differences exist between images generated by DDPM and real images? The authors should conduct an analysis on the realism of the synthesized images.
- In the related work, the authors identify class imbalance as a key challenge, yet their proposed method lacks targeted innovation and merely applies an existing loss function. The analysis of related work should instead focus on the innovations introduced in their proposed approach.
- The authors did not present a visual comparison of segmentation results between the proposed method and existing approaches in the experimental results. It is recommended to include such visualizations.
- Why does DeepLabV3-ResNet50 show a performance drop in Table 3? The authors should provide a theoretical analysis for this anomalous phenomenon.
- Why do DeepLabV3-ResNet50 and DeepLabV3-ResNet101 with the same structure exhibit completely opposite performance behaviors in Table 6?
Author Response
Comments 1: In recent years, some deep feature fusion methods have been widely applied in various fields (e.g., MFST (IEEE GRSL), MCBTNet (IEEE JBHI)). These methods should also be incorporated into the related works.
Response 1: We thank the reviewer for pointing out this important omission. In the revised manuscript, we have added a discussion of the MFST and MCBTNet frameworks in the Related Works section (Section 2.2, pages 4–5). The MFST integrates multi-level feature merging and adaptive feature compression to improve semantic consistency in remote sensing scene classification, while the MCBTNet combines CNN and Transformer modules within a U-shaped architecture for efficient and accurate medical image segmentation. We also emphasize in the revised text that, unlike these feature fusion approaches which operate in the feature space, our method performs data-level augmentation using an enhanced DDPM, thereby enriching the diversity of training samples prior to feature extraction. All newly added content in the manuscript is highlighted in red.
Comments 2: Why can't conventional DDPM models be used for skin lesion segmentation tasks? What was the authors' motivation for improving the DDPM?
Response 2: We agree that conventional DDPMs can perform image synthesis and often surpass GANs. However, as noted in the revised manuscript (Section 3.1, pages 5–6), their fixed-size convolutions limit the receptive field, reducing their ability to model multi-scale context and diffuse lesion borders. Our improvement integrates dilated convolutions to enlarge the receptive field without extra parameters, enabling the generation of higher-quality dermoscopic images that enhance downstream segmentation performance.
Comments 3: DDPM can be used to generate images, but how is the ground truth determined when these synthesized images are used for training?
Response 3: We appreciate the reviewer's comment. This was not clearly explained in the original manuscript, and we have now clarified it in Section 4.2 (pages 8–9). The original ISIC 2017 and ISIC 2018 datasets include ground truth masks for real images. However, the synthetic images generated by the enhanced DDPM did not have labels. We therefore manually annotated all synthetic images using the Labelme software, following the same annotation protocol as in the original datasets. The annotations were performed by experienced annotators and subsequently reviewed for accuracy. All newly added text is highlighted in red in the revised manuscript.
Comments 4: What similarities and differences exist between images generated by DDPM and authentic images? The authors should analyze the realism of the synthesized images.
Response 4: We have added a brief analysis in Section 4.4 (pages 9–10). The enhanced DDPM produces images that closely match authentic dermoscopic images in lesion boundaries and color patterns (Fig. 2), with low FID scores (Table 2) indicating high fidelity. Minor differences remain in fine textures and rare cases (Fig. 3), so we apply manual quality control before use.
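For readers unfamiliar with the metric, the sketch below shows how FID is computed from pre-extracted Inception features of real and synthetic image sets. The feature arrays are random placeholders, and their dimensionality is reduced here to keep the example fast; real FID evaluation uses 2048-dimensional Inception pool features extracted from the actual images.

```python
# FID computed from pre-extracted feature vectors (placeholder arrays).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feat_real: np.ndarray, feat_fake: np.ndarray) -> float:
    """FID = ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^{1/2})."""
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    sig_r = np.cov(feat_real, rowvar=False)
    sig_f = np.cov(feat_fake, rowvar=False)
    covmean = sqrtm(sig_r @ sig_f)
    if np.iscomplexobj(covmean):  # tiny imaginary parts arise from numerical noise
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(sig_r + sig_f - 2.0 * covmean))

# Toy example with 64-dimensional placeholder features for 100 images per set.
rng = np.random.default_rng(0)
real_feats = rng.normal(loc=0.0, size=(100, 64))
fake_feats = rng.normal(loc=0.1, size=(100, 64))
print(f"FID = {frechet_distance(real_feats, fake_feats):.2f}")  # lower is better
```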
Comments 5: In the related work, the authors identify class imbalance as a key challenge, yet their proposed method lacks targeted innovation and merely applies an existing loss function. The analysis of related work should instead focus on the innovations introduced in their proposed approach.
Response 5: We appreciate the reviewer’s observation. It is true that our framework does not introduce a new loss function specifically for class imbalance, and we have clarified this in the revised Related Works section (Section 2, page 5). Instead, our main strategy for alleviating imbalance is at the data level: the enhanced DDPM generates diverse synthetic samples that include more underrepresented lesion types, thereby improving the balance of the training set and indirectly mitigating the problem.
Comments 6: The experimental results section does not provide visual comparisons of segmentation results between the proposed method and existing approaches. It is recommended to include such visualizations.
Response 6: We thank the reviewer for this valuable suggestion. In the revised manuscript, we have added qualitative visual comparisons of segmentation results (Fig. 1). These examples illustrate that incorporating synthetic data (R+S) leads to sharper and more complete lesion boundaries, particularly in challenging and ambiguous regions. Together with the comprehensive quantitative evaluations (Tables 3–6), these visualizations provide clearer evidence of the effectiveness of our proposed framework.
Comments 7: Why does DeepLabV3-ResNet50 show a performance drop in Table 3? The authors should provide a theoretical analysis for this anomalous phenomenon.
Response 7: We thank the reviewer for pointing out this anomaly. As noted in the revised manuscript (Section 4.5, page 11), the small performance drop observed for DeepLabV3-ResNet50 is within the typical range of run-to-run stochastic variation in our experiments. While most architectures benefited from the diversity of the synthetic data, this particular configuration may have inherently less synergy with the type of augmentation used, resulting in negligible or slightly negative changes in some runs. This outcome does not contradict the overall trend that synthetic augmentation improves performance for the majority of tested architectures.
Comments 8: Why do DeepLabV3-ResNet50 and DeepLabV3-ResNet101 with the same structure exhibit completely opposite performance behaviors in Table 6?
Response 8: As noted for the previous comment on Table 3, this difference can be attributed to the stochastic variation inherent in the training process as well as differences in the architectures' sensitivity to our augmentation. While both models share the same segmentation head, the ResNet101 backbone has greater depth and capacity, which may allow it to exploit the additional diversity from synthetic data more effectively. In contrast, the ResNet50 variant appears less sensitive to such augmentation in our runs, leading to marginal or negative changes in some cases. The observed differences fall within the typical variability range of our experiments and do not alter the overall conclusion that most architectures benefit from the proposed augmentation.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
- Dataset diversity is not addressed. Experiments can be performed on multiple datasets to check the method's robustness.
- Figure 1 needs to be improved.
- How is a small number of epochs sufficient for the convergence of the model?
- The abstract needs to be restructured.
- An ablation study with multiple models can be performed to highlight the performance of each model.
- Clearly explain the type of modification you performed in the DDPM for data synthesis.
- If possible, compare the DDPM with other generative models to draw a more specific, result-oriented conclusion.
Author Response
Comments 1: Dataset diversity is not addressed. Experiments can be performed on multiple datasets to check its robustness.
Response 1: We thank the reviewer for pointing out this important aspect. In fact, our experiments were conducted on two widely used public datasets, ISIC 2017 and ISIC 2018, which already provide a certain degree of diversity in terms of lesion types and image acquisition conditions. To clarify this, we have explicitly emphasized the use of both datasets in the revised manuscript. We also agree that extending our study to additional datasets would further validate robustness, and we have included this as a future research direction.
Comments 2: Figure 1 needs to be improved.
Response 2: We thank the reviewer for the helpful suggestion. In the revised manuscript, Figure 1 has been updated to improve both clarity and readability. First, the figure was regenerated in vector format (PDF), ensuring higher resolution and sharper details. Second, additional annotations were included to make the workflow more intuitive, such as explicitly indicating that the blue arrows denote down-sampling operations and the orange arrows denote up-sampling operations. These refinements improve visual clarity and provide clearer guidance to readers in understanding the proposed framework.
Comments 3: How is a small number of epochs sufficient for the convergence of the model?
Response 3: We appreciate the reviewer's concern. In our method, the use of dilated convolutions enlarges the receptive field, allowing the network to capture lesion features more effectively at multiple scales. Combined with the optimized loss function, this improves feature discrimination and accelerates training convergence. As a result, fewer epochs are sufficient for the model to reach stable performance compared to conventional settings.
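To make the receptive-field effect concrete, the short calculation below compares three stacked 3x3 convolutions (stride 1) without dilation against the same stack with dilation rates 1, 2, and 4. The dilation schedule is an illustrative assumption, but in both cases the parameter count of the stack is identical.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Three 3x3 convolutions: standard vs. dilated (1, 2, 4) at equal parameter count.
print(receptive_field([3, 3, 3], [1, 1, 1]))  # 7
print(receptive_field([3, 3, 3], [1, 2, 4]))  # 15
```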
Comments 4: The abstract needs to be restructured.
Response 4: We thank the reviewer for this suggestion. In the revised manuscript, the Abstract has been restructured to improve clarity and conciseness. The revised version highlights the key motivation (data scarcity), the core methodological contribution (enhanced DDPM with dilated convolutions and self-attention), and the main findings (consistent DICE improvements and efficiency advantages). We believe this restructuring makes the Abstract more focused and easier to follow.
Comments 5: An ablation study with multiple models can be performed to highlight the performance of each model.
Response 5: We thank the reviewer for this valuable suggestion. The main contribution of our framework is the introduction of dilated convolutions to enlarge the receptive field. In the Experiments and Results section, particularly in the subsections Generalization of Data Augmentation to Other Segmentation Architectures and Impact of Synthetic Data on Segmentation Performance, we have already compared our method with and without dilated convolutions across different segmentation models. These comparisons can be regarded as an ablation analysis, highlighting the effect of the proposed modification. We appreciate the reviewer's rigorous perspective, which helps us clarify this point in the revision.
Comments 6: Clearly explain the type of modification you performed in the DDPM for data synthesis.
Response 6: We thank the reviewer for pointing this out. The modification has been described in the Methodology section, but we agree that the explanation was not sufficiently clear. In the revised manuscript, we have clarified that our main modification is the integration of dilated convolutions into the DDPM backbone. This enlarges the receptive field without adding parameters, enabling the model to capture multi-scale lesion context and generate higher-quality dermoscopic images. The relevant section has been revised for clarity, and the new text is highlighted in red.
Comments 7: If possible, compare the DDPM with other generative models to draw a more specific, result-oriented conclusion.
Response 7: We appreciate the reviewer's suggestion. The main focus of this work is to investigate whether incorporating dilated convolutions into the DDPM framework is feasible and effective for skin lesion data augmentation, rather than to provide an exhaustive benchmark of generative models. Nevertheless, we have previously conducted studies using GAN-based generative approaches and observed limitations such as unstable training and mode collapse, which motivated us to adopt diffusion models in this study. We have clarified this point in the revised manuscript to better explain our methodological choice.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
No more comments.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors answered my questions well, and I have no other suggestions.
Reviewer 3 Report
Comments and Suggestions for Authors
Improve the quality of Figure 1, part (a).