Article

Interpretable and Performant Multimodal Nasopharyngeal Carcinoma GTV Segmentation with Clinical Priors Guided 3D-Gaussian-Prompted Diffusion Model (3DGS-PDM)

Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong SAR, China
*
Authors to whom correspondence should be addressed.
Cancers 2025, 17(22), 3660; https://doi.org/10.3390/cancers17223660
Submission received: 7 September 2025 / Revised: 13 October 2025 / Accepted: 29 October 2025 / Published: 14 November 2025

Simple Summary

This is the first study to utilize 3D Gaussian representations and a Gaussian-prompted diffusion model for performant and interpretable multimodal medical image segmentation. Our proposed 3D Gaussian-prompted diffusion model addresses two long-standing challenges in this area: (1) limited accuracy caused by heavy information redundancy and (2) poor interpretability caused by unreliable information extraction and integration from multimodal inputs. Extensive experiments demonstrate that our proposed method not only markedly boosts the multimodal segmentation performance for the gross tumor volume of nasopharyngeal carcinoma, but also performs segmentation in an interpretable, step-wise diffusion process with traceable contributions from the prior guidance applied to the multimodal imaging inputs.

Abstract

Background: Gross tumor volume (GTV) segmentation of nasopharyngeal carcinoma (NPC) crucially determines the precision of image-guided radiation therapy (IGRT) for NPC. Compared to other cancers, the clinical delineation of NPC is especially challenging due to its capricious infiltration of the adjacent soft tissues and bones, and it routinely requires multimodal information from CT and MRI series to identify its ambiguous tumor boundary. However, conventional deep learning-based multimodal segmentation methods suffer from limited prediction accuracy and frequently perform no better than, or even worse than, single-modality segmentation models. The limited multimodal prediction performance indicates defective information extraction and integration from the input channels. This study aims to develop a 3D-Gaussian-prompted diffusion model (3DGS-PDM) for more clinically targeted information extraction and effective multimodal information integration, thereby facilitating more accurate and clinically interpretable GTV segmentation for NPC. Methods: We propose a 3D-Gaussian-Prompted Diffusion Model (3DGS-PDM) that performs NPC tumor contouring under multimodal clinical priors through a guided stepwise process. The proposed model contains two modules: a Gaussian initialization module that utilizes a 3D-Gaussian-splatting technique to distill 3D Gaussian representations based on clinical priors from CT, MRI-t2 and MRI-t1-contrast-enhanced-fat-suppression (MRI-t1-cefs), respectively, and a diffusion segmentation module that generates the tumor segmentation step by step from the fused 3D Gaussian prompts. We retrospectively collected data on 600 NPC patients from four hospitals, with paired CT, MRI series and clinical GTV annotations, and divided the dataset into 480 training volumes and 120 testing volumes. Results: Our proposed method achieves a mean Dice similarity coefficient (DSC) of 84.29 ± 7.33, a mean average symmetric surface distance (ASSD) of 1.31 ± 0.63, and a 95th percentile Hausdorff distance (HD95) of 4.76 ± 1.98 on primary NPC tumor (GTVp) segmentation, and a DSC of 79.25 ± 10.01, an ASSD of 1.19 ± 0.72 and an HD95 of 4.76 ± 1.71 on metastatic nodal NPC tumor (GTVnd) segmentation. Comparative experiments further demonstrate that our method can significantly improve the multimodal segmentation performance on NPC tumors, with clear advantages over five state-of-the-art comparison methods. A visual evaluation of the segmentation prediction process and a three-step ablation study on the input channels further demonstrate the interpretability of our proposed method. Conclusions: This study proposes a performant and interpretable multimodal segmentation method for the GTV of NPC, contributing to improved precision in NPC radiation therapy.

1. Introduction

Nasopharyngeal carcinoma (NPC) is a malignant tumor arising in the roof and lateral walls of the nasopharyngeal cavity [1]. The majority of NPC tumors are cured with radiation therapy, and image-guided radiation therapy (IGRT) is the standard technique for NPC [2]. During IGRT treatment planning, the delineation accuracy of the NPC gross tumor volume (GTV) is crucial due to the proximity between the NPC tumor and critical brain tissues. However, GTV contouring for NPC is particularly error-prone because NPC can infiltrate diverse areas of the adjacent bones and soft tissues within the head and neck region, and the extent of this involvement is often reflected by subtle changes and varied contrasts in the CT, MRI-t1, MRI-t2 and MRI-t1-contrast-enhanced (MRI-t1-ce) imaging modalities. As a result, NPC tumors are often blended with the surrounding anatomy, and the tumor boundaries become visually ambiguous and difficult to recognize compared to tumors in other regions.
In clinical practice, to delineate accurate contours of the NPC tumor region, radiologists need to fuse the multimodal images (CT, MRI-t1, MRI-t2 and MRI-t1-ce) into the same coordinate system, identify anatomical structures by comparing different modalities and adjacent slices, and determine the tumor-contouring workflow based on empirical evidence such as bone infiltration in MRI-t1, lymph node enlargement in MRI-t2 and meningeal thickening in MRI-t1-cefs [3]. This clinical delineation process is extremely labor-intensive and highly reliant on the radiologists’ expertise.
To reduce the workload, fast and accurate deep learning-based NPC GTV segmentation methods have gained considerable popularity in recent years. Conventional deep learning-based models follow an annotation-targeting paradigm, formulating a deterministic mapping from the source image to the predicted segmentation masks. Segmentation models based on a single-modality source image, such as CT or MRI-t1-ce, have achieved moderate segmentation accuracy [4,5]. These methods rely on increased network complexity and the integration of clinical priors to improve segmentation accuracy. However, because NPC exhibits diverse information on CT and the different MRI series, the incomplete anatomical information available from a single modality has drawn serious clinical concern and potentially hampers segmentation performance.
Great efforts have been devoted to developing multimodal segmentation models to further improve segmentation accuracy, such as multi-scale sensitive U-Net [6], intuition-inspired hypergraph modeling [7], automatic weighted dilated convolutional network [8], enhancement learning [9], mask-guided self-knowledge distillation [10], uncertainty- and relation-based semi-supervised learning [11] and uncertainty-based end-to-end learning [12]. However, these attempts have achieved limited improvement or even performance degradation. Information redundancy may be the fundamental reason behind this degradation: although supplementary tumor-related information is integrated from multimodal inputs, a larger proportion of irrelevant information is introduced at the same time [13]. The high ratio of irrelevant information greatly increases training difficulty [14], and the models fail to efficiently extract tumor-focused information from multimodal inputs while comprehensively integrating pertinent information for more accurate tumor contouring. Most importantly, because the high-dimensional convolutional feature space and the end-to-end prediction process are largely implicit, there remains an urgent demand for methods that can perform clinically oriented information extraction and distribute contributions reasonably across multimodal inputs.
To address the above limitations, in this study, we propose a 3D-Gaussian-prompted diffusion model (3DGS-PDM) for accurate and interpretable multimodal NPC segmentation. The 3DGS-PDM contains two modules: a Gaussian initialization module that utilizes a 3D-Gaussian-splatting technique to distill 3D Gaussian representations based on clinical priors, for clinically oriented information extraction from CT, MRI-t1-contrast-enhanced (MRI-t1-ce) and MRI-t2, respectively, and a diffusion segmentation module that generates the tumor segmentation step by step, guided by step-wise 3D Gaussian prompts, for an interpretable segmentation process. Extensive experiments were conducted on 600 NPC patients from four hospitals to evaluate the segmentation performance of our model. The Dice similarity coefficient (DSC), average symmetric surface distance (ASSD) and 95th percentile Hausdorff distance (HD95) metrics are employed for quantitative comparison with other state-of-the-art multimodal segmentation models. A visual evaluation and an ablation study are also performed to demonstrate the performance and interpretability of our method.

2. Materials and Methods

2.1. Method Overview

Figure 1 shows the overall workflow of the proposed 3DGS-PDM model. A diffusion segmentation model is pretrained first to build a relation library encoding the segmentation mapping between source CT imaging and the paired clinical GTV annotations. Then, a Gaussian initialization module is pretrained on CT, MRI-t1-ce and MRI-t2 imaging to extract 3D Gaussian features (3DGF) that contain the key information from the multimodal inputs, based on the clinical focus of each modality. During the segmentation process, the multimodal imaging data (MRI-t1-ce, MRI-t2 and CT) are fed into the 3DGF encoders to extract clinically focused Gaussian features, and these Gaussian features are then used as prompts to guide the conditional sampling process of the diffusion segmentation module, generating stepwise coarse-to-fine segmentation predictions with precise anatomical localization, abnormality recognition and cancer-distinguishing ability.
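To make this workflow concrete, the following minimal Python sketch outlines the inference path described above. The function names (extract_gaussian_features, denoise_step), the tensor shapes and the thresholding of the final prediction are illustrative placeholders and assumptions, not the released implementation.

```python
import torch

def extract_gaussian_features(volume: torch.Tensor, modality: str) -> torch.Tensor:
    """Placeholder for a pretrained 3DGF encoder of one modality: returns an
    (N, 7) tensor of Gaussian parameters (mean xyz, scale xyz, intensity)."""
    return torch.randn(50_000, 7)  # 50,000 Gaussian points per modality (Section 2.5)

def denoise_step(seg_t: torch.Tensor, step: int, prompt: torch.Tensor) -> torch.Tensor:
    """Placeholder for one reverse step of the pretrained diffusion segmentation
    module; a real step would remove noise conditioned on the Gaussian prompt."""
    return seg_t

def segment_gtv(ct, mri_t1ce, mri_t2, n_steps: int = 1000) -> torch.Tensor:
    """Gaussian-prompted conditional sampling: coarse range from MRI-t1-ce,
    refinement from MRI-t2, bone-destruction expansion from CT (Section 2.3)."""
    prompts = [
        extract_gaussian_features(mri_t1ce, "MRI-t1-ce"),  # used in sampling steps 1-350
        extract_gaussian_features(mri_t2, "MRI-t2"),       # used in sampling steps 351-700
        extract_gaussian_features(ct, "CT"),               # used in sampling steps 701-1000
    ]
    seg = torch.randn_like(ct)  # start the reverse diffusion from pure Gaussian noise
    for step in range(1, n_steps + 1):
        prompt = prompts[0] if step <= 350 else prompts[1] if step <= 700 else prompts[2]
        seg = denoise_step(seg, step, prompt)
    return (seg > 0.5).float()  # binarize the final GTV prediction

# Example call with dummy volumes of the training size used in Section 2.5.
ct = torch.rand(1, 1, 20, 256, 256)
gtv = segment_gtv(ct, torch.rand_like(ct), torch.rand_like(ct))
```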

2.2. Gaussian Initialization Module

The Gaussian initialization module refines the key information from the multimodal imaging inputs under the guidance of clinical priors, which filters out unwanted information and facilitates effective multimodal information extraction.
Inspired by the success of the 3D Gaussian splatting technique in novel-view 3D scene reconstruction [15], a great deal of research has leveraged the unique advantages of 3D Gaussian features for segmentation [16,17,18]. Compared to conventional high-dimensional feature spaces encoded by convolutional neural network (CNN) layers or transformer tokens, features encoded in 3D Gaussian space have unique advantages: (1) 3D Gaussian features are highly refined from the input imaging owing to the pixel-to-Gaussian training process; (2) 3D Gaussian features can be extracted individually from each imaging modality and integrated across modalities without information loss, through mathematical addition of 3D Gaussian distributions; and (3) 3D Gaussian features are extracted and applied in a clinically interpretable process, because the information content of 3D Gaussian features is observable in a structured, low-dimensional Gaussian space. The explicitness of 3D Gaussian features enables effective information extraction guided by clinical priors.
Based on the above traits of 3D Gaussian features, we employ a 3D Gaussian feature extraction technique to refine clinically focused information from the multimodal inputs. To adapt the 3D Gaussian feature extraction process to medical imaging scenarios, we designed a Gaussian initialization module. Figure 2 shows the architecture of the proposed Gaussian initialization module. Regional 3D points are first located from a 3D image input by multiple CNN layers, namely multi-segmentors. An initialization block containing multiple 3D Gaussian encoders (3DGE) is then applied to transfer the pixel points into 3D Gaussian feature points. The 3D Gaussian feature points are then decoded into regional anatomy images by multiple CNN layers, namely decoders. Finally, a similarity loss is calculated by comparing the decoded regional anatomy images with the real regional anatomical images acquired from the clinical delineations. The network weights are updated by optimizing this similarity loss.
Specifically, the initialization block initializes the 3D Gaussian feature points from pixel points following the definitions of the original 3D Gaussian splatting technique [15]. Each 3D Gaussian feature point can be formulated as
G(x) = e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)} \, C(\theta, \phi)
where x represents a pixel point in 3D space, G(x) represents a 3D Gaussian point in 3D Gaussian space, μ represents the mean position of the Gaussian feature point in 3D Gaussian feature space, Σ denotes its 3D covariance matrix, and C(θ, ϕ) refers to the spherical harmonic coefficients, which represent color information.
The 3D Gaussian features are conventionally rendered to 2D images by a “splatting” technique; they are projected to 2D Gaussian features by transforming the covariance matrix according to the following transformation [19]:
\Sigma' = J W \Sigma W^{T} J^{T}
where Σ′ is the projected 2D covariance matrix, J represents the Jacobian of the affine approximation of the projective transformation, and W is the view transformation.
Since medical imaging is defined by pixel intensity rather than RGB color, we alter the 3D Gaussian feature formulation according to [20] as follows:
G(x) = e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)} \, i
where i refers to the pixel intensity value.
Additionally, instead of rendering the 3D Gaussian feature points into 2D images, we employ CNN decoders to transform the 3D Gaussian feature points directly into 3D image volumes.
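As a worked illustration of the intensity-weighted Gaussian above, the sketch below accumulates a few axis-aligned (diagonal-covariance) Gaussian points into a voxel volume and scores it against a regional anatomy target with an L1 similarity loss. The diagonal-covariance simplification, the direct analytic accumulation (the paper instead decodes the Gaussian points with CNN decoders) and the loss choice are assumptions made for brevity.

```python
import torch

def gaussians_to_volume(mu, sigma, intensity, shape):
    """Evaluate G(x) = exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu)) * i on a voxel grid
    and sum over all Gaussian points. A diagonal covariance is assumed, so the
    Mahalanobis term reduces to a per-axis ((x - mu) / sigma)^2 sum."""
    d, h, w = shape
    zz, yy, xx = torch.meshgrid(
        torch.arange(d, dtype=torch.float32),
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    grid = torch.stack([zz, yy, xx], dim=-1)                    # (D, H, W, 3)
    diff = grid[None] - mu[:, None, None, None, :]              # (N, D, H, W, 3)
    mahal = ((diff / sigma[:, None, None, None, :]) ** 2).sum(-1)
    return (torch.exp(-0.5 * mahal) * intensity[:, None, None, None]).sum(0)

# Toy example: two intensity-weighted Gaussian points rendered into a small
# volume, compared to a (here random) regional anatomy target by an L1 loss.
mu = torch.tensor([[10.0, 32.0, 32.0], [5.0, 16.0, 48.0]])      # means, ordered (z, y, x)
sigma = torch.tensor([[2.0, 4.0, 4.0], [1.0, 3.0, 3.0]])        # per-axis standard deviations
intensity = torch.tensor([0.8, 0.5])                            # pixel intensity i
decoded = gaussians_to_volume(mu, sigma, intensity, (20, 64, 64))
target = torch.rand(20, 64, 64)                                 # stands in for the regional anatomy image
similarity_loss = torch.nn.functional.l1_loss(decoded, target)  # optimized to train the encoders
```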

2.3. Diffusion Segmentation Module

The diffusion segmentation module integrates the multimodal information in a Gaussian-prompted, step-wise sampling process. The integration of Gaussian features in Gaussian space enables more effective information fusion, and the combination of Gaussian prompts with the diffusion sampling process allows segmentation to operate in an interpretable way. In terms of interpretability, the influence of the Gaussian prompts is explicitly exhibited as alterations of the sampling trajectory. These alterations further demonstrate that 3D Gaussian features can effectively capture clinical prior information, and that multimodal prior information can effectively steer the segmentation process toward different aims.
Figure 3 shows the architecture of the proposed diffusion segmentation module. Figure 3a shows the unconditional pre-training process that encodes the image-GTV relation library. A typical U-Net is employed for denoising, and special relation-encoding blocks (REB) are used to encode the correlations between CT imaging and GTV mask predictions, with reference to conventional designs of diffusion models for segmentation [21].
Figure 3b shows the key information selection process during 3D Gaussian feature extraction. According to an influential clinical NPC delineation guideline [22], doctors apply strong empirical focuses to the different imaging modalities during the clinical NPC delineation process. For NPC patients, due to mucosal inflammation, the water region appears dark in MRI-t1-ce imaging, resulting in intensity enhancement of the cancer region and the vessels. A combined region containing cancer and vessels can therefore be recognized from MRI-t1-ce by thresholding. This enhanced region, acquired automatically by thresholding, is used to guide 3D Gaussian feature extraction from MRI-t1-ce imaging so that it encodes a coarse cancer range combining the cancer and the vessels. Additionally, while the combined region of vessels and cancer remains intensity-enhanced in MRI-t2 imaging, the vessels exhibit sustained highlighting, whereas the intensity distribution of the cancer region is uneven and mixed. This distribution difference can be leveraged to distinguish cancer from vessels, allowing fine cancer contouring to be derived from the coarse cancer range. To guide the 3D Gaussian feature encoder to recognize the vessel regions, we set the training target as the difference between the intensity-enhanced region (acquired by threshold adjustment) and the clinically annotated GTV. In addition, because NPC involves bone destruction and CT exhibits high-contrast bone information, a 3D Gaussian feature encoder on CT imaging was developed to locate the bone-destruction region, targeted at the overlap between the GTV annotation and the bones (acquired by threshold adjustment).
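The three modality-specific training targets described above reduce to intensity thresholding plus mask arithmetic. The numpy sketch below illustrates this; the threshold values are hypothetical placeholders, not values reported in the paper.

```python
import numpy as np

def build_prior_targets(mri_t1ce, ct_hu, gtv_mask, enh_thresh=0.6, bone_hu=300.0):
    """Construct the clinically guided targets used to train the three 3DGF encoders.
    `mri_t1ce` is an intensity-normalized MRI-t1-ce volume, `ct_hu` is CT in HU and
    `gtv_mask` is the binary clinical GTV annotation. `enh_thresh` and `bone_hu`
    are illustrative assumptions only."""
    gtv = gtv_mask.astype(bool)
    # MRI-t1-ce target: intensity-enhanced region = coarse range of cancer plus vessels.
    enhanced = mri_t1ce > enh_thresh
    # MRI-t2 target: enhanced voxels outside the annotated GTV, i.e. the vessel
    # region the encoder must learn to separate from the tumor.
    vessels = enhanced & ~gtv
    # CT target: bone-destruction region = overlap between the GTV and thresholded bone.
    bone_destruction = gtv & (ct_hu > bone_hu)
    return enhanced, vessels, bone_destruction
```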
After pre-training of the diffusion segmentation module, which encodes the CT-GTV relation, and pre-training of the Gaussian initialization module, which encodes the multimodal clinically focused priors, a Gaussian-prompted step-wise segmentation is performed during the sampling process of the diffusion model. As can be seen from Figure 3c, as the GTV contour is generated step-wise by the denoising sampling process of the diffusion model, the 3D Gaussian features extracted from the multimodal inputs are used as prompts to perturb the step-wise segmentation process. During this conditional sampling process, a coarse cancer range is first located by the Gaussian features from MRI-t1-ce imaging, finer cancer contouring is then recognized by the Gaussian features from MRI-t2 imaging, and finally the generated GTV contour is expanded to include the bone-destruction regions through the Gaussian features from CT imaging. By combining the Gaussian initialization module and the diffusion segmentation module, the proposed 3DGS-PDM is able to perform interpretable and performant multimodal NPC GTV segmentation in a step-wise auto-contouring process guided by the 3D Gaussian feature-encoded clinical priors.
Owing to its mathematical properties, a 3D Gaussian distribution can be exactly marginalized along one axis into a 2D Gaussian distribution. During the conditional sampling process, we projected the extracted 3D Gaussian feature points along the z-axis into 2D Gaussian distributions and added them to the step-wise Gaussian noise of the conditional sampling process. Compared to the conventional approach of adding the image itself as a condition, directly adding the 2D Gaussian distributions preserves information better. To further enable the prompts from the multimodal inputs to guide the sampling process, we added the conditions from the three input modalities step-wise: 3D Gaussian features from MRI-t1-ce in steps 1–350, from MRI-t2 in steps 351–700 and from CT in steps 701–1000.
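The z-axis projection used here is an exact property of Gaussian distributions: marginalizing over z keeps the (x, y) mean and the 2 x 2 x-y block of the covariance. A minimal sketch follows; how strongly the rasterized prompt is scaled before being added to the step noise is an assumption (the `weight` parameter is hypothetical).

```python
import torch

def marginalize_over_z(mu3d: torch.Tensor, cov3d: torch.Tensor):
    """Marginalize (N, 3) means / (N, 3, 3) covariances over the z-axis, assuming
    means are ordered (x, y, z): keep the (x, y) mean and the 2x2 x-y covariance block."""
    return mu3d[:, :2], cov3d[:, :2, :2]

def gaussian_prompted_noise(mu2d, cov2d, intensity, hw, weight=0.1):
    """Rasterize the 2D Gaussians on an (H, W) grid and add them to the step noise."""
    h, w = hw
    yy, xx = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    grid = torch.stack([yy, xx], dim=-1)                       # (H, W, 2)
    inv = torch.linalg.inv(cov2d)                              # (N, 2, 2)
    diff = grid[None] - mu2d[:, None, None, :]                 # (N, H, W, 2)
    mahal = torch.einsum("nhwi,nij,nhwj->nhw", diff, inv, diff)
    prompt = (torch.exp(-0.5 * mahal) * intensity[:, None, None]).sum(0)
    return torch.randn(h, w) + weight * prompt                 # step noise plus Gaussian prompt
```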
The visibility of the input-to-feature and feature-to-prediction correspondences during the generation process clearly demonstrates the contribution of each imaging modality and inherently improves the segmentation performance through more refined use of the available information.

2.4. Dataset

As can be seen from Figure 4, data on 600 NPC patients, with CT, MRI series (t1, t1-contrast-enhanced and t2) and clinical annotations, were retrospectively collected from four hospitals: Queen Mary’s Hospital, Hong Kong (QMH); Queen Elizabeth Hospital, Hong Kong (QEH); Xijing Hospital, Xian, China (XJH); and Western War Zone General Hospital, Chengdu, China (WWH). Then, 480 cases were selected as the training set and 120 cases for independent testing. The training set was further separated randomly into 384 cases for training and 96 cases for validation. For each patient, to align the MRI series and CT imaging into the same spatial coordinate system, 3D registration was performed using the Elastix registration toolbox, with the MRI series as moving images and the CT imaging as the fixed image [23]. The voxels of both the CT and MRI series were resampled to a 1:1:1 spacing ratio along the x, y and z axes. The intensity values of all images were scaled to the range [0, 1], with the scaling factor calculated from the difference between the minimum and maximum intensity values, computed over all CT images and over each MRI imaging series, respectively. For CT imaging, Hounsfield unit (HU) values below −1000 or above 1000 were replaced with −1000 or 1000, respectively.
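The intensity preprocessing described above reduces to clipping and min-max scaling; a short numpy sketch is given below. Registration to the CT frame with the Elastix toolbox and resampling to the 1:1:1 spacing are assumed to have been performed beforehand, and the MRI scaling bounds are computed per imaging series as described.

```python
import numpy as np

def preprocess_ct(ct_hu: np.ndarray) -> np.ndarray:
    """Clip CT values to [-1000, 1000] HU, then scale to [0, 1] using the
    global CT intensity range (2000 HU after clipping)."""
    ct = np.clip(ct_hu, -1000.0, 1000.0)
    return (ct + 1000.0) / 2000.0

def preprocess_mri(mri: np.ndarray, series_min: float, series_max: float) -> np.ndarray:
    """Scale an MRI volume to [0, 1] with the minimum/maximum intensity
    computed over its whole imaging series (Section 2.4)."""
    return (mri - series_min) / (series_max - series_min)
```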

2.5. Implementation Details

All experiments were conducted on a 48 GB A6000 GPU. The number of 3D Gaussian feature points was initialized to 50,000 for each input imaging modality, similar to [20], occupying around 20 GB of GPU memory during the Gaussian feature initialization training stage and 1 GB at the Gaussian feature initialization inference stage. The diffusion model for segmentation was unconditionally trained on 3D volumes of size (256, 256, 20) along the x, y and z axes, occupying 16 GB of GPU memory per batch.

3. Results

3.1. Quantitative Evaluation

To demonstrate the superiority of our method, we compared it with five state-of-the-art multimodal segmentation methods: MsU-Net [6], AD-Net [8], Multi-resU-Net [24], nnformer [9] and nnU-Net [25]. The Dice similarity coefficient (DSC) [26], average symmetric surface distance (ASSD) and 95th percentile Hausdorff distance (HD95) were calculated between the predicted GTV segmentation and the real clinical annotations. The quantitative comparison results are listed in Table 1.
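For reference, the three reported metrics can be computed from binary masks as sketched below, with surface voxels extracted by morphological erosion and distances obtained from Euclidean distance transforms; isotropic 1 mm voxels (Section 2.4) and the "max of the two directed 95th percentiles" convention for HD95 are assumed.

```python
import numpy as np
from scipy import ndimage

def _surface_distances(pred: np.ndarray, gt: np.ndarray):
    """Distances from each surface voxel of one boolean mask to the surface of the other."""
    pred_surf = pred ^ ndimage.binary_erosion(pred)
    gt_surf = gt ^ ndimage.binary_erosion(gt)
    dist_to_gt = ndimage.distance_transform_edt(~gt_surf)      # distance map to the GT surface
    dist_to_pred = ndimage.distance_transform_edt(~pred_surf)  # distance map to the predicted surface
    return dist_to_gt[pred_surf], dist_to_pred[gt_surf]

def dsc(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice similarity coefficient of two boolean masks."""
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())

def assd(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average symmetric surface distance, in voxel units (= mm for 1 mm spacing)."""
    d_pg, d_gp = _surface_distances(pred, gt)
    return (d_pg.sum() + d_gp.sum()) / (len(d_pg) + len(d_gp))

def hd95(pred: np.ndarray, gt: np.ndarray) -> float:
    """95th percentile Hausdorff distance (one common symmetric convention)."""
    d_pg, d_gp = _surface_distances(pred, gt)
    return max(np.percentile(d_pg, 95), np.percentile(d_gp, 95))
```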
The table shows that our method achieved the highest segmentation accuracy among all comparative multimodal segmentation methods. Our method achieved a mean DSC of 84.29 ± 7.33 for GTVp segmentation and a mean DSC of 79.25 ± 10.01 for GTVnd segmentation, reaching an average segmentation accuracy of 81.77 ± 8.67. For GTVp segmentation, the mean DSC of comparative methods such as AD-Net, Multi-resU-Net, nnformer and nnU-Net commonly hits a bottleneck around 80. Compared to these methods, our method achieves a significant improvement to 84, and this clear gain indicates that our method breaks through the performance bottleneck thanks to more efficient multimodal information extraction and integration. For GTVnd segmentation, our method also achieved a prominent accuracy improvement. The ASSD and HD95 metrics show a similar tendency to the DSC, which further demonstrates the superiority of our method.

3.2. Qualitative Evaluation

For the qualitative evaluation, a representative case was selected to evaluate the visual performance of our model. Figure 5 shows the comparative visual evaluation, including the GTVp and GTVnd segmentation results; GTVp is marked in pink and GTVnd in blue. The figures show that the GTVp and GTVnd contours generated by our proposed method achieved the best similarity to the ground-truth clinical annotations. It is worth noting that our method is able to recognize subtle details within the cancer region and achieves stable segmentation accuracy on both GTVp and GTVnd. By contrast, the comparative methods have great difficulty in distinguishing subtle healthy regions from the cancer range, and most of them fail to maintain stable performance for both GTVp and GTVnd.

3.3. Ablation Study

To further demonstrate the effectiveness of multimodal information extraction and integration, we conducted an ablation study. As shown in Figure 6, the multimodal input channels are fed to the diffusion model gradually in three steps. Step 1 uses only the Gaussian features from MRI-t1-ce imaging to prompt the diffusion segmentation sampling process. Step 2 uses the Gaussian features from both MRI-t1-ce and MRI-t2 imaging. In step 3, the Gaussian features from CT are added, giving the complete three-step prompt from all MRI-t1-ce, MRI-t2 and CT source imaging data. Across the three steps we observe improvements in segmentation accuracy, from a coarse cancer range to progressively finer cancer contouring. The quantitative results of the ablation study, together with a comparison between our proposed method and the other multimodal segmentation methods in terms of DSC, ASSD and HD95, are summarized in Table 2. The ablation study clearly demonstrates the effectiveness of information extraction and integration from the multimodal imaging inputs.

4. Discussion

In this study, we proposed a 3D Gaussian-prompted diffusion model for performant and interpretable multimodal NPC segmentation. We designed a 3D Gaussian feature extraction strategy, namely the Gaussian initialization module, for selectively extracting clinically focused information from multimodal imaging. Additionally, we designed a Gaussian-prompted, segmentation-oriented denoising diffusion probabilistic model, namely the diffusion segmentation module, to efficiently integrate multimodal information and perform segmentation in an interpretable, step-wise generation process.
Extensive experiments have demonstrated that, compared to other methods, our proposed method achieves accuracy improvements on both GTVp and GTVnd segmentation in terms of DSC on a multi-institutional testing dataset built from four hospitals. While other methods reached their performance bottleneck at an average DSC of around 77, our method made a significant breakthrough by increasing the average DSC to 81. The ASSD and HD95 metrics show a similar tendency to the DSC. As has been mentioned in many works focusing on multimodal generation [13,27,28], information redundancy could be the main cause hampering model performance. We addressed this limitation by proposing a Gaussian initialization module for more refined and clinically biased information extraction from the multimodal inputs. The proposed Gaussian initialization module fully leverages the unique advantages of 3D Gaussian representations, including refinement of the key clinically focused information based on the highly refined nature of 3D Gaussian feature points, efficient multimodal information integration based on flexible mathematical addition in Gaussian space, and an interpretable coarse-to-fine segmentation prediction process based on the explicitness of the low-dimensional Gaussian features and the compatibility between Gaussian prompts and the conditional diffusion model. Additionally, the quantitative analysis summarized in Table 1 shows that, compared to other methods, our proposed method achieved consistent performance across the four hospitals with minimal center bias. This high multi-institutional consistency further demonstrates the generalization ability and clinical utility of the proposed method. Another unique advantage of the proposed method is that information is extracted separately from each input modality channel, which greatly alleviates potential registration errors and inherently increases the generalization ability across different centers.
Beyond more refined and clinically reliable information extraction, our proposed Gaussian-prompted diffusion segmentation module also endows the whole multimodal segmentation process with interpretability. As related works commonly point out, the contribution of the multimodal input channels tends to become uneven during training [29]; current deep learning-based methods may therefore fail to integrate information from multimodal inputs according to the real clinical focus. This ineffective information integration can greatly limit model performance, and adjusting the multimodal information contributions is challenging because of the implicit end-to-end training paradigm. Adding multichannel weighting factors [30] or designing loss functions for specific channels [31] are common solutions for adjusting multichannel contributions, yet these methods lack effectiveness and fail to utilize clinical priors to adjust the multimodal input contributions. Combining Gaussian prompts and diffusion models can ensure interpretable and effective multimodal information integration. On the one hand, clinical bias can be applied at the Gaussian feature extraction stage, for example, by guiding the Gaussian encoder to extract bone-destruction-related information from CT imaging. The clinically biased information can be further conveyed to the segmentation process by adding the Gaussian features to the intermediate images during the sampling process of the diffusion model, without information loss, owing to the flexible operational properties of Gaussian distributions. On the other hand, the clinically biased information from the different imaging modalities can be used as prompts at different stages of the sampling process, and the effect of each prompt on the segmentation trajectory is observable. Compared to a conventional end-to-end segmentation model, diffusion model-based segmentation follows an observable step-wise denoising trajectory [32], and the alterations of this trajectory prompted by the Gaussian features from the multimodal inputs demonstrate the effectiveness of multimodal information integration and, further, the interpretability of the proposed method. The conducted ablation study clearly reveals this effectiveness and interpretability.
Although our proposed method has achieved considerable accuracy improvement and demonstrated interpretability for multimodal segmentation, it still suffers from a slow generation speed due to the diffusion-based model design. Additionally, there is still a wealth of professional knowledge on the clinical NPC GTV delineation process that could be leveraged at the Gaussian feature extraction stage for a more clinically focused design. The multi-institutional robustness of the proposed method also requires further inter-observer assessments. In the future, we plan to employ more advanced denoising diffusion probabilistic approaches to accelerate the segmentation generation process, such as denoising diffusion implicit models (DDIM). We also plan to further explore effective clinical knowledge for recognizing the NPC GTV in different imaging modalities, for a more comprehensive Gaussian initialization module design, thereby further boosting the model performance.

5. Conclusions

This study introduces a 3D Gaussian-prompted diffusion model for multimodal gross tumor volume (GTV) segmentation in nasopharyngeal carcinoma (NPC) patients. The model is designed to enhance the clinical relevance of the feature extraction process and improve the efficiency of multimodal imaging information fusion. The proposed method addresses two long-standing challenges in multimodal segmentation: limited accuracy due to information redundancy, and poor interpretability caused by an unreliable data fusion process. Extensive experiments demonstrate that the proposed model achieves significant improvements in both accuracy and interpretability over state-of-the-art methods. These gains are attributed to its use of 3D Gaussian representations for more clinically targeted feature extraction and a Gaussian-prompted diffusion mechanism for more effective multimodal integration. Overall, this work offers a robust and interpretable tool for multimodal NPC GTV segmentation, which is critical for precision in NPC radiation therapy. The proposed interpretable and robust multimodal segmentation paradigm possesses significant clinical utility and may facilitate a more effective image-guided radiation therapy process.

Author Contributions

Conceptualization, methodology, writing—original draft preparation, investigation: J.Z.; validation, resources: Z.M.; writing—review and editing, supervision: G.R. and J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly supported by the Health and Medical Research Fund (11222456) of the Health Bureau, the Pneumoconiosis Compensation Fund Board in HKSAR, the Shenzhen Science and Technology Program (JCYJ20230807140403007), and the Guangdong Basic and Applied Basic Research Foundation (2025A1515012926).

Institutional Review Board Statement

This study complied with the Declaration of Helsinki and was approved by the Research Ethics Committee (Kowloon Central/Kowloon East), reference number KC/KE-18-0085/ER-1, 14 September 2018.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data that support the findings of this study are confidential.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wong, K.C.; Hui, E.P.; Lo, K.W.; Lam, W.K.J.; Johnson, D.; Li, L.; Tao, Q.; Chan, K.C.A.; To, K.F.; King, A.D.; et al. Nasopharyngeal carcinoma: An evolving paradigm. Nat. Rev. Clin. Oncol. 2021, 18, 679–695. [Google Scholar] [CrossRef]
  2. Lin, L.; Dou, Q.; Jin, Y.M.; Zhou, G.Q.; Tang, Y.Q.; Chen, W.L.; Su, B.A.; Liu, F.; Tao, C.J.; Jiang, N.; et al. Deep learning for automated contouring of primary tumor volumes by MRI for nasopharyngeal carcinoma. Radiology 2019, 291, 677–686. [Google Scholar] [CrossRef]
  3. Bossi, P.; Chan, A.T.; Licitra, L.; Trama, A.; Orlandi, E.; Hui, E.P.; Halámková, J.; Mattheis, S.; Baujat, B.; Hardillo, J.; et al. Nasopharyngeal carcinoma: ESMO-EURACAN Clinical Practice Guidelines for diagnosis, treatment and follow-up. Ann. Oncol. 2021, 32, 452–465. [Google Scholar] [CrossRef]
  4. Tang, P.; Zu, C.; Hong, M.; Yan, R.; Peng, X.; Xiao, J.; Wu, X.; Zhou, J.; Zhou, L.; Wang, Y. DA-DSUnet: Dual attention-based dense SU-net for automatic head-and-neck tumor segmentation in MRI images. Neurocomputing 2021, 435, 103–113. [Google Scholar] [CrossRef]
  5. Guo, Z.; Li, X.; Huang, H.; Guo, N.; Li, Q. Deep learning-based image segmentation on multimodal medical imaging. IEEE Trans. Radiat. Plasma Med. Sci. 2019, 3, 162–169. [Google Scholar] [CrossRef]
  6. Hao, Y.; Jiang, H.; Diao, Z.; Shi, T.; Liu, L.; Li, H.; Zhang, W. MSU-Net: Multi-scale Sensitive U-Net based on pixel-edge-region level collaborative loss for nasopharyngeal MRI segmentation. Comput. Biol. Med. 2023, 159, 106956. [Google Scholar] [CrossRef]
  7. He, Q.; Sun, X.; Diao, W.; Yan, Z.; Yao, F.; Fu, K. Multimodal remote sensing image segmentation with intuition-inspired hypergraph modeling. IEEE Trans. Image Process. 2023, 32, 1474–1487. [Google Scholar] [CrossRef]
  8. Peng, Y.; Sun, J. The multimodal MRI brain tumor segmentation based on AD-Net. Biomed. Signal Process. Control 2023, 80, 104336. [Google Scholar] [CrossRef]
  9. Tao, G.; Li, H.; Huang, J.; Han, C.; Chen, J.; Ruan, G.; Huang, W.; Hu, Y.; Dan, T.; Zhang, B.; et al. SeqSeg: A sequential method to achieve nasopharyngeal carcinoma segmentation free from background dominance. Med Image Anal. 2022, 78, 102381. [Google Scholar] [CrossRef]
  10. Zhang, J.; Li, B.; Qiu, Q.; Mo, H.; Tian, L. SICNet: Learning selective inter-slice context via Mask-Guided Self-knowledge distillation for NPC segmentation. J. Vis. Commun. Image Represent. 2024, 98, 104053. [Google Scholar] [CrossRef]
  11. Shi, Y.; Zu, C.; Yang, P.; Tan, S.; Ren, H.; Wu, X.; Zhou, J.; Wang, Y. Uncertainty-weighted and relation-driven consistency training for semi-supervised head-and-neck tumor segmentation. Knowl.-Based Syst. 2023, 272, 110598. [Google Scholar] [CrossRef]
  12. Tang, P.; Yang, P.; Nie, D.; Wu, X.; Zhou, J.; Wang, Y. Unified medical image segmentation by learning from uncertainty in an end-to-end manner. Knowl.-Based Syst. 2022, 241, 108215. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Sidibé, D.; Morel, O.; Mériaudeau, F. Deep multimodal fusion for semantic image segmentation: A survey. Image Vis. Comput. 2021, 105, 104042. [Google Scholar] [CrossRef]
  14. Pandey, S.; Chen, K.F.; Dam, E.B. Comprehensive multimodal segmentation in medical imaging: Combining yolov8 with sam and hq-sam models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2592–2598. [Google Scholar]
  15. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
  16. Jain, U.; Mirzaei, A.; Gilitschenski, I. Gaussiancut: Interactive segmentation via graph cut for 3D Gaussian splatting. Adv. Neural Inf. Process. Syst. 2024, 37, 89184–89212. [Google Scholar]
  17. Cen, J.; Fang, J.; Yang, C.; Xie, L.; Zhang, X.; Shen, W.; Tian, Q. Segment any 3D Gaussians. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 1971–1979. [Google Scholar]
  18. Kim, C.M.; Wu, M.; Kerr, J.; Goldberg, K.; Tancik, M.; Kanazawa, A. Garfield: Group anything with radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 21530–21539. [Google Scholar]
  19. Zwicker, M.; Pfister, H.; Van Baar, J.; Gross, M. EWA volume splatting. In Proceedings of Visualization (VIS’01), San Diego, CA, USA, 21–26 October 2001; IEEE: New York, NY, USA, 2001; pp. 29–538. [Google Scholar]
  20. Li, Y.; Fu, X.; Li, H.; Zhao, S.; Jin, R.; Zhou, S.K. 3DGR-CT: Sparse-view CT reconstruction with a 3D Gaussian representation. Med. Image Anal. 2025, 103, 103585. [Google Scholar] [CrossRef]
  21. Rahman, A.; Valanarasu, J.M.J.; Hacihaliloglu, I.; Patel, V.M. Ambiguous medical image segmentation using diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Denver, CO, USA, 3–7 June 2023; pp. 11536–11546. [Google Scholar]
  22. Lee, A.W.; Ng, W.T.; Pan, J.J.; Poh, S.S.; Ahn, Y.C.; AlHussain, H.; Corry, J.; Grau, C.; Grégoire, V.; Harrington, K.J.; et al. International guideline for the delineation of the clinical target volumes (CTV) for nasopharyngeal carcinoma. Radiother. Oncol. 2018, 126, 25–36. [Google Scholar] [CrossRef]
  23. Klein, S.; Staring, M.; Murphy, K.; Viergever, M.A.; Pluim, J.P. Elastix: A toolbox for intensity-based medical image registration. IEEE Trans. Med. Imaging 2009, 29, 196–205. [Google Scholar] [CrossRef]
  24. Ibtehaz, N.; Sohel Rahman, M.M. Rethinking the U-Net architecture for multimodal biomedical image segmentation. arXiv 2019, arXiv:1902.04049. [Google Scholar] [CrossRef]
  25. Isensee, F.; Jaeger, P.F.; Kohl, S.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef]
  26. Thada, V.; Jaglan, V. Comparison of jaccard, dice, cosine similarity coefficient to find best fitness value for web retrieved documents using genetic algorithm. Int. J. Innov. Eng. Technol. 2013, 2, 202–205. [Google Scholar]
  27. Zhan, F.; Yu, Y.; Wu, R.; Zhang, J.; Lu, S.; Liu, L.; Kortylewski, A.; Theobalt, C.; Xing, E. Multimodal image synthesis and editing: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 4, 15098–15119. [Google Scholar] [CrossRef]
  28. Zhou, T.; Fu, H.; Chen, G.; Shen, J.; Shao, L. Hi-net: Hybrid-fusion network for multi-modal MR image synthesis. IEEE Trans. Med Imaging 2020, 39, 2772–2781. [Google Scholar] [CrossRef]
  29. Yuan, X.; Lin, Z.; Kuen, J.; Zhang, J.; Wang, Y.; Maire, M.; Kale, A.; Faieta, B. Multimodal contrastive training for visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6995–7004. [Google Scholar]
  30. Vora, A.; Paunwala, C.N.; Paunwala, M. Improved weight assignment approach for multimodal fusion. In Proceedings of the 2014 International Conference on Circuits, Systems, Communication and Information Technology Applications (CSCITA), Mumbai, India, 4–5 April 2014; IEEE: New York, NY, USA, 2014; pp. 70–74. [Google Scholar]
  31. He, B.; Wang, J.; Qiu, J.; Bui, T.; Shrivastava, A.; Wang, Z. Align and attend: Multimodal summarization with dual contrastive losses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14867–14878. [Google Scholar]
  32. Wu, J.; Fu, R.; Fang, H.; Zhang, Y.; Yang, Y.; Xiong, H.; Liu, H.; Xu, Y. Medsegdiff: Medical image segmentation with diffusion probabilistic model. In Proceedings of the Medical Imaging with Deep Learning. PMLR, Paris, France, 3–5 July 2024; pp. 1623–1639. [Google Scholar]
Figure 1. The overall workflow of the proposed method.
Figure 2. Gaussian initialization module.
Figure 3. Diffusion segmentation module.
Figure 4. Dataset description flow.
Figure 5. Visual evaluation. GTVp is marked in pink and GTVnd is marked in blue.
Figure 6. Ablation study. GTVp is marked in pink and GTVnd is marked in blue.
Table 1. Comparison between our method and other multimodal segmentation methods. Values are Mean ± Dev per institution; "Average" rows give the per-method average across institutions.

| Method | Institution | DSC (%) GTVp | DSC (%) GTVnd | DSC (%) Average | ASSD (mm) GTVp | ASSD (mm) GTVnd | ASSD (mm) Average | HD95 (mm) GTVp | HD95 (mm) GTVnd | HD95 (mm) Average |
|---|---|---|---|---|---|---|---|---|---|---|
| MsU-Net | QEH | 75.10 ± 7.08 | 69.21 ± 7.99 | 72.16 ± 7.54 | 1.23 ± 0.35 | 0.96 ± 0.53 | 1.10 ± 0.44 | 3.87 ± 2.84 | 3.09 ± 2.64 | 3.48 ± 2.74 |
| MsU-Net | QMH | 78.27 ± 10.62 | 70.93 ± 7.74 | 74.60 ± 9.18 | 1.20 ± 0.49 | 0.87 ± 0.44 | 1.03 ± 0.47 | 3.46 ± 2.97 | 2.94 ± 3.22 | 3.20 ± 3.10 |
| MsU-Net | XJH | 75.31 ± 7.79 | 73.50 ± 7.34 | 74.41 ± 7.57 | 1.12 ± 0.61 | 1.18 ± 0.41 | 1.15 ± 0.51 | 3.75 ± 3.26 | 2.91 ± 2.65 | 3.33 ± 2.96 |
| MsU-Net | WWH | 79.24 ± 2.07 | 74.20 ± 7.17 | 76.72 ± 4.62 | 1.41 ± 0.43 | 1.19 ± 0.86 | 1.30 ± 0.65 | 3.40 ± 3.65 | 3.94 ± 3.45 | 3.67 ± 3.55 |
| MsU-Net | Average | 76.98 ± 6.89 | 71.96 ± 7.89 | 74.47 ± 7.39 | 1.24 ± 0.47 | 1.05 ± 0.56 | 1.15 ± 0.52 | 3.62 ± 3.18 | 3.22 ± 2.99 | 3.42 ± 3.09 |
| AD-Net | QEH | 80.88 ± 9.90 | 72.90 ± 9.67 | 76.89 ± 9.79 | 1.27 ± 0.56 | 1.20 ± 0.30 | 1.23 ± 0.43 | 3.74 ± 2.13 | 3.45 ± 1.62 | 3.60 ± 1.88 |
| AD-Net | QMH | 81.79 ± 4.24 | 71.78 ± 9.94 | 76.79 ± 7.09 | 1.11 ± 0.65 | 1.15 ± 0.45 | 1.13 ± 0.55 | 4.08 ± 1.96 | 3.40 ± 1.53 | 3.74 ± 1.75 |
| AD-Net | XJH | 82.03 ± 8.89 | 74.87 ± 13.02 | 78.45 ± 10.96 | 1.01 ± 0.57 | 0.92 ± 0.28 | 0.97 ± 0.43 | 3.82 ± 1.93 | 3.07 ± 1.66 | 3.45 ± 1.80 |
| AD-Net | WWH | 76.74 ± 13.81 | 79.53 ± 16.01 | 78.13 ± 14.91 | 1.65 ± 0.66 | 1.13 ± 1.09 | 1.39 ± 0.88 | 4.32 ± 3.02 | 3.52 ± 2.27 | 3.92 ± 2.64 |
| AD-Net | Average | 80.36 ± 11.66 | 74.77 ± 12.16 | 77.57 ± 11.91 | 1.26 ± 0.61 | 1.10 ± 0.53 | 1.18 ± 0.57 | 3.99 ± 2.26 | 3.36 ± 1.77 | 3.68 ± 2.01 |
| Multi-resU-Net | QEH | 80.57 ± 5.16 | 71.42 ± 9.59 | 76.00 ± 7.38 | 1.30 ± 0.42 | 1.01 ± 0.37 | 1.16 ± 0.40 | 3.84 ± 2.49 | 3.27 ± 2.28 | 3.56 ± 2.38 |
| Multi-resU-Net | QMH | 81.74 ± 5.85 | 73.00 ± 9.70 | 77.37 ± 7.78 | 1.37 ± 0.33 | 1.23 ± 0.47 | 1.30 ± 0.40 | 3.72 ± 2.78 | 3.02 ± 2.26 | 3.37 ± 2.52 |
| Multi-resU-Net | XJH | 84.00 ± 7.72 | 73.74 ± 8.05 | 78.87 ± 7.89 | 1.31 ± 0.39 | 1.20 ± 0.58 | 1.25 ± 0.49 | 3.82 ± 2.67 | 3.36 ± 2.06 | 3.59 ± 2.37 |
| Multi-resU-Net | WWH | 81.65 ± 19.83 | 74.72 ± 9.78 | 78.18 ± 14.81 | 1.02 ± 0.86 | 0.92 ± 0.66 | 0.97 ± 0.76 | 3.66 ± 2.74 | 2.87 ± 1.72 | 3.26 ± 2.23 |
| Multi-resU-Net | Average | 81.99 ± 9.64 | 73.22 ± 9.28 | 77.60 ± 9.46 | 1.25 ± 0.50 | 1.09 ± 0.52 | 1.17 ± 0.51 | 3.76 ± 2.67 | 3.13 ± 2.08 | 3.45 ± 2.38 |
| nnformer | QEH | 82.43 ± 6.12 | 75.82 ± 7.56 | 79.13 ± 6.84 | 1.33 ± 0.70 | 1.16 ± 0.41 | 1.25 ± 0.55 | 4.66 ± 1.82 | 4.30 ± 2.00 | 4.48 ± 1.91 |
| nnformer | QMH | 82.45 ± 9.09 | 74.33 ± 11.12 | 78.39 ± 10.11 | 1.07 ± 0.41 | 1.20 ± 0.25 | 1.14 ± 0.33 | 4.71 ± 1.70 | 4.00 ± 2.48 | 4.36 ± 2.09 |
| nnformer | XJH | 74.81 ± 7.58 | 76.48 ± 11.52 | 75.65 ± 9.55 | 1.32 ± 0.48 | 1.22 ± 0.46 | 1.27 ± 0.47 | 4.28 ± 1.56 | 4.26 ± 2.16 | 4.27 ± 1.86 |
| nnformer | WWH | 81.47 ± 12.37 | 72.93 ± 8.88 | 77.20 ± 10.62 | 1.36 ± 0.85 | 1.02 ± 0.84 | 1.19 ± 0.85 | 4.55 ± 1.16 | 4.16 ± 2.36 | 4.35 ± 1.76 |
| nnformer | Average | 80.29 ± 8.79 | 74.89 ± 9.77 | 77.59 ± 9.28 | 1.27 ± 0.61 | 1.15 ± 0.49 | 1.21 ± 0.55 | 4.55 ± 1.56 | 4.18 ± 2.25 | 4.37 ± 1.91 |
| nnU-Net | QEH | 80.50 ± 8.13 | 75.77 ± 11.43 | 78.13 ± 9.78 | 1.05 ± 0.42 | 1.00 ± 0.61 | 1.02 ± 0.52 | 4.24 ± 1.72 | 3.93 ± 1.40 | 4.09 ± 1.56 |
| nnU-Net | QMH | 83.30 ± 6.17 | 74.65 ± 10.20 | 78.97 ± 8.18 | 1.38 ± 0.39 | 1.25 ± 0.47 | 1.32 ± 0.43 | 4.27 ± 1.56 | 4.27 ± 1.50 | 4.27 ± 1.53 |
| nnU-Net | XJH | 79.07 ± 4.27 | 77.94 ± 13.48 | 78.51 ± 8.88 | 1.25 ± 0.68 | 1.09 ± 0.59 | 1.17 ± 0.64 | 4.36 ± 1.85 | 4.34 ± 1.78 | 4.35 ± 1.82 |
| nnU-Net | WWH | 83.61 ± 14.31 | 72.52 ± 6.33 | 78.07 ± 10.32 | 1.44 ± 0.87 | 1.10 ± 0.93 | 1.27 ± 0.90 | 5.29 ± 1.95 | 4.38 ± 1.56 | 4.84 ± 1.76 |
| nnU-Net | Average | 81.62 ± 8.22 | 75.22 ± 10.36 | 78.42 ± 9.29 | 1.28 ± 0.59 | 1.11 ± 0.65 | 1.20 ± 0.62 | 4.54 ± 1.77 | 4.23 ± 1.56 | 4.39 ± 1.67 |
| Proposed | QEH | 82.51 ± 4.89 | 81.94 ± 9.40 | 82.23 ± 7.15 | 1.34 ± 0.52 | 1.22 ± 0.66 | 1.28 ± 0.59 | 4.86 ± 1.81 | 4.48 ± 1.68 | 4.67 ± 1.75 |
| Proposed | QMH | 82.08 ± 5.06 | 78.05 ± 12.55 | 80.07 ± 8.81 | 1.32 ± 0.45 | 0.94 ± 0.51 | 1.13 ± 0.48 | 4.58 ± 1.85 | 4.31 ± 1.70 | 4.45 ± 1.78 |
| Proposed | XJH | 84.86 ± 4.40 | 80.70 ± 10.48 | 82.78 ± 7.44 | 1.09 ± 0.42 | 0.98 ± 0.74 | 1.04 ± 0.58 | 4.95 ± 1.73 | 4.49 ± 1.51 | 4.72 ± 1.62 |
| Proposed | WWH | 87.71 ± 14.97 | 76.31 ± 7.97 | 82.01 ± 11.47 | 1.49 ± 1.13 | 1.62 ± 0.97 | 1.56 ± 1.05 | 4.65 ± 2.53 | 5.16 ± 1.95 | 4.91 ± 2.24 |
| Proposed | Average | 84.29 ± 7.33 | 79.25 ± 10.10 | 81.77 ± 8.72 | 1.31 ± 0.63 | 1.19 ± 0.72 | 1.25 ± 0.68 | 4.76 ± 1.98 | 4.61 ± 1.71 | 4.69 ± 1.85 |
Table 2. Quantitative ablation study with comparison between the proposed method and other multimodal segmentation methods. Step 1: MRI-t1-ce only; Step 2: MRI-t1-ce and MRI-t2; Step 3: MRI-t1-ce, MRI-t2 and CT.

| Method | Institution | Step 1 DSC (%) | Step 1 ASSD (mm) | Step 1 HD95 (mm) | Step 2 DSC (%) | Step 2 ASSD (mm) | Step 2 HD95 (mm) | Step 3 DSC (%) | Step 3 ASSD (mm) | Step 3 HD95 (mm) |
|---|---|---|---|---|---|---|---|---|---|---|
| MsU-Net | QEH | 71.04 | 0.94 | 3.41 | 71.63 | 1.02 | 3.45 | 72.16 | 1.10 | 3.48 |
| MsU-Net | QMH | 73.32 | 0.85 | 3.18 | 74.01 | 0.95 | 3.19 | 74.60 | 1.03 | 3.20 |
| MsU-Net | XJH | 73.26 | 1.01 | 3.29 | 73.87 | 1.10 | 3.31 | 74.41 | 1.15 | 3.33 |
| MsU-Net | WWH | 75.36 | 1.12 | 3.62 | 76.05 | 1.21 | 3.63 | 76.72 | 1.30 | 3.67 |
| MsU-Net | Average | 73.18 | 1.02 | 3.38 | 73.83 | 1.09 | 3.42 | 74.47 | 1.15 | 3.42 |
| AD-Net | QEH | 75.54 | 1.07 | 3.56 | 76.15 | 1.15 | 3.57 | 76.79 | 1.23 | 3.60 |
| AD-Net | QMH | 77.16 | 0.99 | 3.70 | 77.75 | 1.04 | 3.72 | 78.45 | 1.13 | 3.74 |
| AD-Net | XJH | 76.80 | 0.82 | 3.42 | 77.46 | 0.91 | 3.45 | 78.13 | 0.97 | 3.45 |
| AD-Net | WWH | 77.02 | 1.23 | 3.89 | 77.54 | 1.32 | 3.92 | 78.13 | 1.39 | 3.92 |
| AD-Net | Average | 76.37 | 1.05 | 3.62 | 76.96 | 1.12 | 3.65 | 77.57 | 1.18 | 3.68 |
| Multi-resU-Net | QEH | 74.77 | 1.02 | 3.50 | 75.39 | 1.08 | 3.52 | 76.00 | 1.16 | 3.56 |
| Multi-resU-Net | QMH | 76.03 | 1.13 | 3.33 | 76.69 | 1.21 | 3.33 | 77.37 | 1.30 | 3.37 |
| Multi-resU-Net | XJH | 77.86 | 1.08 | 3.56 | 78.37 | 1.15 | 3.58 | 78.87 | 1.25 | 3.59 |
| Multi-resU-Net | WWH | 76.93 | 0.86 | 3.20 | 77.53 | 0.92 | 3.25 | 78.18 | 0.97 | 3.26 |
| Multi-resU-Net | Average | 76.41 | 1.04 | 3.37 | 76.99 | 1.11 | 3.41 | 77.60 | 1.17 | 3.45 |
| nnformer | QEH | 78.09 | 1.06 | 4.46 | 78.62 | 1.15 | 4.46 | 79.13 | 1.25 | 4.48 |
| nnformer | QMH | 77.24 | 0.98 | 4.35 | 77.74 | 1.05 | 4.36 | 78.39 | 1.14 | 4.36 |
| nnformer | XJH | 74.37 | 1.12 | 7.26 | 74.96 | 1.18 | 7.27 | 75.65 | 1.27 | 7.27 |
| nnformer | WWH | 75.94 | 1.04 | 4.27 | 76.64 | 1.12 | 4.31 | 77.20 | 1.19 | 4.35 |
| nnformer | Average | 76.20 | 1.05 | 4.29 | 76.89 | 1.14 | 4.33 | 77.59 | 1.21 | 4.37 |
| nnU-Net | QEH | 76.92 | 0.87 | 4.06 | 77.49 | 0.97 | 4.07 | 78.13 | 1.02 | 4.09 |
| nnU-Net | QMH | 77.60 | 1.15 | 4.22 | 78.29 | 1.24 | 4.26 | 78.97 | 1.32 | 4.27 |
| nnU-Net | XJH | 77.17 | 1.02 | 4.30 | 77.85 | 1.10 | 4.34 | 78.51 | 1.17 | 4.35 |
| nnU-Net | WWH | 76.78 | 1.09 | 4.80 | 77.47 | 1.18 | 4.82 | 78.07 | 1.27 | 4.84 |
| nnU-Net | Average | 77.12 | 1.07 | 4.32 | 77.72 | 1.14 | 4.37 | 78.42 | 1.20 | 4.39 |
| Proposed | QEH | 79.85 | 1.07 | 3.73 | 81.68 | 1.20 | 4.34 | 82.23 | 1.28 | 4.67 |
| Proposed | QMH | 76.91 | 0.90 | 3.39 | 79.19 | 1.04 | 4.05 | 80.07 | 1.13 | 4.45 |
| Proposed | XJH | 79.60 | 0.83 | 3.67 | 81.78 | 0.94 | 4.36 | 82.78 | 1.04 | 4.72 |
| Proposed | WWH | 79.38 | 1.32 | 3.85 | 81.34 | 1.46 | 4.41 | 82.01 | 1.56 | 4.91 |
| Proposed | Average | 78.99 | 1.02 | 3.81 | 80.97 | 1.16 | 4.34 | 81.77 | 1.25 | 4.69 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
