Article
Peer-Review Record

Deep Learning-Based Denoising for Interactive Realistic Rendering of Biomedical Volumes

Appl. Sci. 2025, 15(18), 9893; https://doi.org/10.3390/app15189893
by Elena Denisova, Leonardo Bocchi and Cosimo Nardi *
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 25 July 2025 / Revised: 5 September 2025 / Accepted: 8 September 2025 / Published: 9 September 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript presents a valuable investigation into deep learning-based denoising for interactive realistic rendering of biomedical volumes. However, several aspects require further refinement to enhance this research.

  1. The dataset with limited spiral CT and MRI samples is well-characterized but lacks diversity in pathological conditions and imaging modalities. Data augmentation can be used to enrich samples.
  2. The clinical survey with 51 experts is a strength, but the evaluation criteria lack structured specificity. Clarify why OIDN ADV* received higher subjective ratings despite mixed quantitative results.
  3. Clarifying how to balance quality and speed for clinical workflows is essential for practical adoption.
  4. Demonstrating whether deep learning models outperform the non-deep learning denoisers would be valuable, especially in resource-constrained settings where deep learning may be impractical.
  5. Evaluating whether HDR training is beneficial would provide valuable insights into data preprocessing choices.

Author Response

1. The dataset with limited spiral CT and MRI samples is well-characterized but lacks diversity in pathological conditions and imaging modalities. Data augmentation can be used to enrich samples.

We thank the reviewer for this observation. Our primary aim was to improve visualization within our MCPT framework, which currently supports only CT and MRI 3D data. Therefore, we used spiral CT, CBCT, and MRI for training. We acknowledge that the dataset could benefit from data augmentation techniques such as image rotation, and we plan to consider these in future work.
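For illustration, the kind of geometric augmentation we have in mind could look like the minimal sketch below (NumPy, hypothetical frame shape); in practice the same transform would need to be applied consistently to the noisy input, any auxiliary feature buffers, and the reference target.

```python
# Minimal augmentation sketch (hypothetical shapes): enrich a set of rendered
# training frames with 90-degree rotations and horizontal flips.
import numpy as np

def augment_frame(frame: np.ndarray) -> list:
    """Return rotated and flipped variants of a single H x W x C frame."""
    variants = []
    for k in range(4):                              # 0, 90, 180, 270 degrees
        rotated = np.rot90(frame, k, axes=(0, 1))
        variants.append(rotated)
        variants.append(np.flip(rotated, axis=1))   # horizontal mirror
    return variants

# Example: one 256x256 RGB frame expands to 8 training samples.
frame = np.random.rand(256, 256, 3).astype(np.float32)
print(len(augment_frame(frame)))                    # 8
```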

2. The clinical survey with 51 experts is a strength, but the evaluation criteria lack structured specificity. Clarify why OIDN ADV* received higher subjective ratings despite mixed quantitative results.

We acknowledge that this point was not clearly explained in the original manuscript. Our intention was to design an initial “first-impact” survey that evaluates overall perceived image quality rather than highly structured criteria. Upon analyzing the results, particularly the responses to open-ended questions (e.g., “please specify”), we observed that OIDN ADV* received higher subjective ratings due to perceptions such as: “Although it appears too ‘plastic’, the image processed by AI looks ‘cleaner’ than the original and, in some ways, better than the reference by enhancing the bone margins.” In other words, clinicians sometimes preferred these “false” images (“better than the reference”) because of additional qualities such as brightness, contrast, and sharpness. We note that in a subsequent study with an extended survey of 74 participants (presented in a separate publication), we analyzed the results in more detail. This complementary work provides a deeper perspective beyond the scope of the current manuscript.

We have included this clarification in the revised manuscript.

3. Clarifying how to balance quality and speed for clinical workflows is essential for practical adoption.

To assess practical adoption, we tested our framework on an NVIDIA RTX A4000 using CUDA–OpenGL interoperability to minimize CPU–GPU data transfers. With these optimizations, the system achieved an interactive frame rate of ~15 FPS (compared with 24 FPS without DL denoising), which is suitable for clinical use. Profiling revealed that the majority of the processing time is spent in denoising inference (27 ms for AE Full), whereas CPU–GPU transfer overhead was reduced to ~1 ms by the CUDA–OpenGL pipeline. These results demonstrate that, on modern hardware, our framework can maintain real-time interaction while delivering deep learning denoising. This discussion has been added to the revised manuscript.
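For completeness, these figures are mutually consistent: ~42 ms of rendering at 24 FPS plus ~27 ms of inference and ~1 ms of transfer gives roughly 70 ms per frame, i.e. about 14–15 FPS. One way such per-stage GPU timings can be measured is sketched below (PyTorch, with a hypothetical stand-in module in place of the trained AE Full network); the production pipeline itself is implemented in C++ with CUDA–OpenGL interoperability.

```python
# Sketch: timing denoiser inference with CUDA events (hypothetical stand-in model).
import torch

denoiser = torch.nn.Sequential(                     # placeholder, not the actual AE Full
    torch.nn.Conv2d(3, 32, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(32, 3, 3, padding=1),
).cuda().eval()

frame = torch.rand(1, 3, 1024, 1024, device="cuda") # one rendered frame (assumed size)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with torch.no_grad():
    denoiser(frame)                                 # warm-up pass
    torch.cuda.synchronize()
    start.record()
    denoiser(frame)
    end.record()
    torch.cuda.synchronize()

print(f"inference: {start.elapsed_time(end):.2f} ms")
```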

4. Demonstrating whether deep learning models outperform the non-deep learning denoisers would be valuable, especially in resource-constrained settings where deep learning may be impractical.

We thank the reviewer for this suggestion. Evaluating the performance of deep learning models relative to non-deep learning denoisers in resource-constrained settings is indeed important; however, this analysis was considered beyond the scope of the current study and is left for future work.

5. Evaluating whether HDR training is beneficial would provide valuable insights into data preprocessing choices.

In the current study, our model trained from scratch used LDR targets, which were sufficient to achieve higher quantitative metrics (PSNR, SSIM, LDR-FLIP, tPSNR). However, this choice may have influenced perceptual qualities valued by clinicians, such as natural brightness, contrast, and sharpness. Training with HDR images could potentially provide richer intensity information and improve perceptual realism, offering valuable insights into preprocessing choices and better alignment between quantitative metrics and subjective evaluations. We have added this discussion to the revised manuscript.
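As a point of reference, PSNR on 8-bit LDR targets reduces to a simple computation over tone-mapped frames, as in the minimal sketch below (NumPy, hypothetical frames); in an HDR setup the comparison would typically be performed on linear radiance values before tone mapping, which is one reason the two choices can diverge perceptually.

```python
# Sketch: PSNR between a denoised LDR frame and its reference (hypothetical data).
import numpy as np

def psnr_ldr(denoised: np.ndarray, reference: np.ndarray, peak: float = 255.0) -> float:
    """PSNR in dB for 8-bit LDR frames of identical shape."""
    mse = np.mean((denoised.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

reference = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
noisy = np.clip(reference + np.random.normal(0, 5, reference.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr_ldr(noisy, reference):.1f} dB")
```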

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors present a study on DL-based denoising for Monte Carlo Path Tracing (MCPT) applied to biomedical volume rendering. Following are my comments:

  1. The FPS drop from ~15 to ~6 during interactions on the tested hardware challenges the 'interactive' claim. Please clarify the minimum hardware requirements for clinically viable use and profile the time spent in inference vs. CPU-GPU transfers. Please present results for more modern GPUs to assess scalability.
  2. How does the model performance change when tested on other biomedical imaging modalities (e.g., ultrasound, PET) or varying CT acquisition settings? Could the model overfit to the limited set of anatomical regions and scanner types present in the training data?
  3. Only two denoisers (OIDN and OIDN ADV*) were tested for tPSNR, excluding the other trained-from-scratch models. Please extend the temporal stability analysis to all tested models or justify the restriction. Would a lightweight temporal feedback module significantly improve stability without harming FPS?
  4. The survey lacks a structured, task-based scoring system tied to specific anatomical landmarks or diagnostic tasks. Further, inter-rater reliability (e.g., Cohen’s kappa) is not reported. Please design an evaluation where experts score the visibility of predefined regions of interest (e.g., fracture lines and vessel lumen). A task-based assessment (e.g., fracture detection sensitivity/specificity) can provide a more objective link between denoising quality and clinical utility.
  5. As the training data is heavily weighted toward CBCT, with minimal MRI and spiral CT inclusion, please test on unseen modalities or acquisition parameters to assess robustness.
  6. Why were more recent architectures (e.g., transformer-based models, diffusion-based denoisers) excluded from experiments when speed was not the only criterion? Did the authors consider mixed-precision inference to accelerate larger models like Samsung/Tyan?
  7. The work does not experimentally evaluate state-of-the-art architectures (e.g., Restormer and SwinIR), despite noting them as future work. At least one additional benchmark with a modern lightweight transformer-based denoiser should be considered to strengthen the comparisons.
  8. While the authors acknowledge the PSNR/SSIM vs. clinical perception gap, they do not propose or test improved perceptual metrics. Please explore volumetric-specific perceptual metrics or present correlation analyses between expert ratings and existing metrics.
  9. Please clarify whether HDR training was attempted and if it meaningfully affected results. Also, explain whether mixed-precision inference (FP16) was tested for speed improvement.
  10. Provide exact hardware specs for all experiments (GPU VRAM, CPU threads and OS) for reproducibility. Also, include inference time breakdown for tile splitting, model inference, and tile merging.
  11. Finally, please make the training code, pretrained domain-specific models, along with preprocessing and tile-based inference scripts, publicly available for verification. Authors can choose GitHub/Zenodo-like repositories.

Author Response

1. The FPS drop from ~15 to ~6 during interactions on the tested hardware challenges the 'interactive' claim. Please clarify the minimum hardware requirements for clinically viable use and profile the time spent in inference vs. CPU-GPU transfers. Please present results for more modern GPUs to assess scalability.

We acknowledge the reviewer’s concern regarding the drop in FPS during interactions. To assess practical performance, we tested our framework on an NVIDIA RTX A4000 using CUDA–OpenGL interoperability to minimize CPU–GPU data transfers. With these optimizations, the system achieved an interactive frame rate of ~15 FPS (compared with 24 FPS without DL denoising), which is suitable for clinical use. Profiling revealed that the majority of the processing time is spent in denoising inference (27 ms for AE Full), whereas CPU–GPU transfer overhead was reduced to ~1 ms by the CUDA–OpenGL pipeline. These results demonstrate that, on modern hardware, our framework can maintain real-time interaction while delivering deep learning denoising. This discussion has been added to the revised manuscript.

2. How does the model performance change when tested on other biomedical imaging modalities (e.g., ultrasound, PET) or varying CT acquisition settings? Could the model overfit to the limited set of anatomical regions and scanner types present in the training data?

Our current framework supports CT and MRI 3D data only, so modalities such as ultrasound or PET are not included at this stage. Regarding the concern about overfitting, our training datasets include images acquired with four different scanners and cover different anatomical regions from both animals and humans. Moreover, our testing datasets were acquired with different acquisition parameters than the training sets. Therefore, we believe that the model generalizes well across the tested CT and MRI settings, and overfitting to a limited set of scanners or anatomical regions is unlikely.

3. Only two denoisers (OIDN and OIDN ADV*) were tested for tPSNR, excluding the other trained-from-scratch models. Please extend the temporal stability analysis to all tested models or justify the restriction. Would a lightweight temporal feedback module significantly improve stability without harming FPS?

The other trained-from-scratch models were excluded from the temporal stability analysis because their non-temporal metrics were consistently worse, and we aimed to focus on the best-performing denoisers (OIDN and OIDN ADV*) for clarity and conciseness. While incorporating a lightweight temporal feedback module could potentially improve temporal stability, it is also known to introduce artifacts and ghosting effects. We thank the reviewer for this insight; however, exploring temporal feedback mechanisms is beyond the scope of the current study.

4. The survey lacks a structured, task-based scoring system tied to specific anatomical landmarks or diagnostic tasks. Further, inter-rater reliability (e.g., Cohen’s kappa) is not reported. Please design an evaluation where experts score the visibility of predefined regions of interest (e.g., fracture lines and vessel lumen). A task-based assessment (e.g., fracture detection sensitivity/specificity) can provide a more objective link between denoising quality and clinical utility.

Our intention was to design an initial “first-impact” survey that evaluates overall perceived image quality rather than highly structured criteria. We recognize the limitations of the current survey, including the absence of task-based scoring and inter-rater reliability metrics. Conducting a more structured, task-based evaluation is challenging, as it requires participants to have access to the interactive visualization framework and sufficient training to perform specific diagnostic tasks. These issues are discussed in more detail in our complementary work, presented in a separate publication.

5. As the training data is heavily weighted toward CBCT, with minimal MRI and spiral CT inclusion, please test on unseen modalities or acquisition parameters to assess robustness.

Our testing dataset already included scans acquired with different acquisition parameters, ensuring that the evaluation assessed robustness beyond the training conditions. As for other modalities beyond CBCT, we currently do not have access to sufficient spiral CT and MRI datasets to perform such an evaluation. Expanding the training and testing to additional modalities remains an important direction for future work.

6. Why were more recent architectures (e.g., transformer-based models, diffusion-based denoisers) excluded from experiments when speed was not the only criterion? Did the authors consider mixed-precision inference to accelerate larger models like Samsung/Tyan?

We acknowledge the reviewer’s point regarding more recent architectures. At the time we initiated this work, our focus was on convolutional autoencoder–based approaches, and we were not yet aware of the more recent transformer- or diffusion-based denoisers. These are indeed promising directions that could be explored in future research. Regarding mixed-precision inference, we did not experiment with it in this study, but we agree that it could be a valuable approach to accelerate inference for larger models, and we will consider this in our future work.

7. The work does not experimentally evaluate state-of-the-art architectures (e.g., Restormer and SwinIR), despite noting them as future work. At least one additional benchmark with a modern lightweight transformer-based denoiser should be considered to strengthen the comparisons.

We thank the reviewer for this valuable suggestion. We agree that benchmarking against more recent transformer-based architectures such as Restormer and SwinIR would further strengthen the comparisons. However, conducting such experiments is beyond the current scope of this study, which was designed as a first step toward integrating deep learning denoising into our interactive MCPT framework. We plan to explore and benchmark lightweight transformer-based models as part of future work.

8. While the authors acknowledge the PSNR/SSIM vs. clinical perception gap, they do not propose or test improved perceptual metrics. Please explore volumetric-specific perceptual metrics or present correlation analyses between expert ratings and existing metrics.

In addition to PSNR and SSIM, we also included LDR-FLIP, which is a perceptual metric designed to capture human visual sensitivity. However, we agree that volumetric-specific perceptual metrics are needed to better reflect clinical perception, though developing or integrating such metrics is beyond the scope of the present study. As for the analysis of the survey data, a more detailed evaluation and discussion is provided in our complementary work (presented in a separate publication).

9. Please clarify whether HDR training was attempted and if it meaningfully affected results. Also, explain whether mixed-precision inference (FP16) was tested for speed improvement.

We did not consider mixed-precision inference in the current study, as also noted in Comment 6. However, we agree that this is an important direction and have added a note on mixed-precision inference (FP16) to the Discussion section as a potential way to accelerate inference on larger models. Regarding HDR training, we did not attempt it within the scope of this work. Nonetheless, we acknowledge that the observed differences between quantitative and qualitative metrics could be influenced by the choice of LDR vs. HDR training, and we identify this as a promising avenue for future investigation.
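To make the added note concrete, a minimal sketch of FP16 inference in PyTorch is shown below (with a hypothetical single-layer stand-in for our models); autocast runs most convolutions in half precision while keeping numerically sensitive operations in FP32.

```python
# Sketch: mixed-precision (FP16) inference via autocast (hypothetical stand-in model).
import torch

denoiser = torch.nn.Conv2d(3, 3, 3, padding=1).cuda().eval()   # placeholder model
frame = torch.rand(1, 3, 1024, 1024, device="cuda")            # assumed frame size

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    denoised = denoiser(frame)              # convolution executes in half precision

print(denoised.dtype)                       # torch.float16 under autocast
```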

10. Provide exact hardware specs for all experiments (GPU VRAM, CPU threads and OS) for reproducibility. Also, include inference time breakdown for tile splitting, model inference, and tile merging.

We thank the reviewer for this suggestion. We have added detailed hardware specifications for all experiments, including GPU VRAM, CPU threads, and operating system. Additionally, we included a breakdown of inference time for tile splitting, model inference, and tile merging in the revised manuscript.
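To illustrate what such a breakdown covers, the sketch below splits a frame into fixed-size tiles, denoises each tile, reassembles the result, and times the three stages separately (Python, with a placeholder identity denoiser and a hypothetical tile size); the actual implementation is C++ and may organize these stages differently.

```python
# Sketch: tile split -> per-tile inference -> merge, with a per-stage time breakdown.
# The denoiser here is a placeholder identity function; the tile size is hypothetical.
import time
import numpy as np

TILE = 256

def denoise_tile(tile: np.ndarray) -> np.ndarray:
    return tile.copy()                      # stand-in for the real model call

frame = np.random.rand(1024, 1024, 3).astype(np.float32)
h, w, _ = frame.shape

t0 = time.perf_counter()
tiles = [(y, x, frame[y:y + TILE, x:x + TILE])
         for y in range(0, h, TILE) for x in range(0, w, TILE)]
t1 = time.perf_counter()

results = [(y, x, denoise_tile(t)) for y, x, t in tiles]
t2 = time.perf_counter()

merged = np.empty_like(frame)
for y, x, t in results:
    merged[y:y + TILE, x:x + TILE] = t
t3 = time.perf_counter()

print(f"split : {(t1 - t0) * 1e3:.2f} ms")
print(f"infer : {(t2 - t1) * 1e3:.2f} ms")
print(f"merge : {(t3 - t2) * 1e3:.2f} ms")
```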

11. Finally, please make the training code, pretrained domain-specific models, along with preprocessing and tile-based inference scripts, publicly available for verification. Authors can choose GitHub/Zenodo-like repositories.

The training code and data, the pretrained domain-specific models, and the inference script have been made publicly available via Google Drive. Please note that the code for preprocessing and tile-based inference is implemented in C++ as part of the production code, so it is not publicly accessible.

 

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Authors,

This paper is a valuable and well-documented contribution to applying deep learning to MCPT denoising in biomedical visualization. However, I have a few comments and questions.

Comments:

- Limitations in the clinical evaluation. The evaluation was performed without clearly defined landmarks, making it difficult to compare results between participants. Are additional tests planned?

- The article describes limited tests on "unseen" parameters. With only 10 images per configuration, it is difficult to assess the models' generalisation outside the training distribution.

- There is no direct comparison with other temporal denoising methods, although they exist in the literature and could better maintain consistency in interaction.

Questions:
- Has contrastive or self-supervised learning been considered to increase the diversity of representations with a limited dataset?

- Has an analysis been conducted to determine which image elements (e.g., bone boundaries, vascular structures) are most often misinterpreted by the models as noise? How do you plan to reconcile diagnostic quality requirements with the need to maintain smooth interaction on mid-range hardware?

In summary, this work is a solid starting point for further research, including temporal denoising and the integration of transformer methods.

Sincerely.

Author Response

- Limitations in the clinical evaluation. The evaluation was performed without clearly defined landmarks, making it difficult to compare results between participants. Are additional tests planned?

We thank the reviewer for this observation. Our intention was to design an initial “first-impact” survey that evaluates overall perceived image quality rather than highly structured criteria. We recognize that the current survey did not define specific anatomical landmarks, which makes direct comparison between participants more challenging. Conducting a more structured, task-based evaluation is difficult, as it requires participants to have access to the interactive visualization framework and sufficient training to assess predefined regions consistently. These issues are discussed in more detail in our complementary work, presented in a separate publication.

- The article describes limited tests on "unseen" parameters. Only 10 images per configuration make assessing the models' generalisation outside the training distribution challenging.

At the time of writing this paper, the framework we used did not provide an interface for modifying light configuration. Additionally, we observed that users typically do not change transfer function parameters and instead rely on the available presets. Nevertheless, we agree that training and evaluating the model on a broader range of configurations would strengthen the assessment of generalization, and we have added this point to the Limitations section of the revised manuscript.             

- There is no direct comparison with other temporal denoising methods, although they exist in the literature and could better maintain consistency in interaction.

We thank the reviewer for this insight; however, the primary scope of the paper was to focus on improving single-frame denoising. Exploring mechanisms such as temporal feedback and comparisons with other temporal denoising methods was considered beyond the scope of the current study.

Questions:
- Has contrastive or self-supervised learning been considered to increase the diversity of representations with a limited dataset?

Contrastive or self-supervised learning approaches were not considered in the current study, but we acknowledge that they could be valuable for increasing representation diversity when training with a limited dataset and may be explored in future work.

- Has an analysis been conducted to determine which image elements (e.g., bone boundaries, vascular structures) are most often misinterpreted by the models as noise? How do you plan to reconcile diagnostic quality requirements with the need to maintain smooth interaction on mid-range hardware?

In the current study, we did not conduct a systematic analysis of which specific image elements (e.g., bone boundaries, vascular structures) are most often misinterpreted as noise by the models. This is an interesting direction for future work, as it could help tailor denoising strategies to preserve diagnostically relevant features. Regarding maintaining smooth interaction on mid-range hardware, our current optimizations using CUDA–OpenGL interoperability already reduce CPU–GPU transfer overhead and allow interactive frame rates on modern GPUs. Further strategies, such as model pruning or mixed-precision inference, could be explored in the future to balance diagnostic quality with real-time performance on a wider range of hardware. We have added this discussion to the revised manuscript.

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Accept in present form.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors were able to address most of my queries. Benchmarking the algorithm would have been nice to see; however, the authors deemed this beyond the current scope.
