Splatting the Cat: Efficient Free-Viewpoint 3D Virtual Try-On via View-Decomposed LoRA and Gaussian Splatting
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper introduces a 3D virtual try-on framework that combines Gaussian Splatting, SMPL-X priors, and LoRA fine-tuning for efficient free-viewpoint synthesis. An innovative aspect of the paper is the decomposition of the 3D try-on process into manageable stages that improve both efficiency and realism. The integration of CatVTON within a lightweight architecture and the innovative multi-LoRA fine-tuning approach are clear strengths.
However, the manuscript presents critical issues that the authors will need to address with adequate responses and additions to the paper. The observations are reported below.
1. The authors should broaden the scope of the study to include biomedical applications and those related to virtual modelling detection. Recent studies have explored the topic of monitoring SAR and temperature variations using smart electronic devices, which provide a useful parallel for discussing how computationally efficient modelling can balance performance and safety in human-centred applications. These references could be added at the end of the introduction, where the relevance of VTON beyond e-commerce is discussed.
2. Could geometric consistency also be adapted to detect structural inconsistencies in materials? Defect detection through hybrid modelling could strengthen the “Related Work” section when discussing geometric precedents.
3. The proposed pipeline could benefit from temporal modelling in dynamic scenarios. The authors should consider extending their approach with sequential sensor data. I am not asking you to implement a model from scratch, but only to supplement the Methods section with this study doi:10.1109/SAS60918.2024.10636531 because it strengthens the justification for modular and hybrid approaches.
4. Why did the authors not compare computational efficiency with physics-based simulations? In this regard, I suggest a very interesting study that should be integrated into Section 4 doi: 10.3390/electronics14112268. The study highlights how FEM-AI integration reduces computational demands in biomedical modelling, thus aligning with your emphasis on lightweight design.
5. Since your approach is hardware-constrained, how might the design of the electronic interface affect scalability?
6. Could the view-decomposed LoRA strategy be extended to subsurface garment modelling?
7. Do the authors envision applications of their framework in the healthcare sector?
8. Did the authors consider developing and extending the system to other bio-signal domains?
9. Since your approach reduces computational requirements, could it be adapted for low-resource telemedicine environments?
10. Have the authors tested the robustness in more extreme configurations (e.g., rapid pose changes)?
11. How do you intend to align the lightweight strategy with sustainability in large-scale implementations?
12. Can the method be optimised using systematic experimental design approaches?
13. Considering remote wearable and medical devices, did the authors think about adapting the approach for integrated patient monitoring?
Comments on the Quality of English Language
The manuscript features excessively long sentences that contain redundant expressions and lack clarity in the description. Specifically, the transitions between methodological phases are sometimes verbose, and some terminology is used inconsistently across sections (for example, "rendering," "editing," and "fine-tuning" are occasionally used interchangeably without a clear distinction).
Author Response
Dear Reviewer,
Thank you very much for your valuable comments and insightful perspectives on our manuscript, "Splatting the Cat: Efficient Free-Viewpoint 3D Virtual Try-On via View-Decomposed LoRA and Gaussian Splatting." We have carefully considered each of your suggestions, which have been crucial in enhancing the quality of our paper. Below are our point-by-point responses and a description of the corresponding revisions.
Comments 1: The authors should broaden the scope of the study to include biomedical applications and those related to virtual modelling detection. Recent studies have explored the topic of monitoring SAR and temperature variations using smart electronic devices, which provide a useful parallel for discussing how computationally efficient modelling can balance performance and safety in human-centred applications. These references could be added at the end of the introduction, where the relevance of VTON beyond e-commerce is discussed.
Response 1: Thank you for highlighting such a broad and socially valuable application prospect for our research. We fully agree that efficient 3D human modeling technology holds immense potential in fields like biomedicine and telemedicine.
However, the core scope of our current study is focused on addressing the 3D virtual try-on problem within the fashion and e-commerce domains, where the technical objective is to achieve realistic appearance and texture rendering. The biomedical applications you mentioned, such as structural defect detection or patient monitoring, demand medical-grade precision and often involve different data modalities (e.g., sensor data, bio-signals), which is fundamentally different from our goal of visual realism.
Forcibly including these applications in the introduction might confuse readers regarding the core contributions of this paper. Nevertheless, we are very inspired by your suggestion. In the revised manuscript, we will briefly mention in the Discussion and Future Work section (Section 4.4) that our lightweight framework has the potential to be adapted for medical assistance and related fields as a future prospect. Thank you for this valuable suggestion.
Comments 2: Could geometric consistency also be adapted to detect structural inconsistencies in materials? Defect detection through hybrid modelling could strengthen the “Related Work” section when discussing geometric precedents.
Response 2: Thank you for this insightful question. The "geometric consistency" mentioned in our paper aims to ensure that the 3D human model remains structurally stable and free from deformation across different views. This is achieved by comparing it against the SMPL-X parametric model, which guarantees the macroscopic anatomical structure of the human body.
The material structural inconsistency or defect detection you refer to, to the best of our knowledge, typically pertains to the examination of microscopic physical structures in materials science or industrial manufacturing. The two concepts differ significantly in scale and objective. Therefore, our method is not directly applicable to material defect detection at present. We will clarify this point in our response. Thank you for your question.
Comments 3: The proposed pipeline could benefit from temporal modelling in dynamic scenarios. The authors should consider extending their approach with sequential sensor data. I am not asking you to implement a model from scratch, but only to supplement the Methods section with this study doi:10.1109/SAS60918.2024.10636531 because it strengthens the justification for modular and hybrid approaches.
Response 3: Thank you for the suggestion. Introducing temporal modeling to handle dynamic scenes is indeed a significant research direction in the 3D vision field. However, our work focuses on reconstructing and editing a static 3D scene from a set of static multi-view images. This is a problem of scene representation and editing, rather than an analysis of dynamic or temporal behavior. Extending our method to process video or sequential sensor data would constitute an entirely new research topic requiring a different methodology. Therefore, we believe this is beyond the scope of the current paper. We will continue to monitor the development of dynamic 3D VTON in our future work.
Comments 4: Why did the authors not compare computational efficiency with physics-based simulations? In this regard, I suggest a very interesting study that should be integrated into Section 4 doi: 10.3390/electronics14112268. The study highlights how FEM-AI integration reduces computational demands in biomedical modelling, thus aligning with your emphasis on lightweight design.
Response 4: This is an excellent suggestion. We thank you for pointing this out. A comparison with physics-based simulations is highly valuable and would indeed provide readers with a more comprehensive understanding of our method's positioning. We will therefore cite the suggested paper (doi: 10.3390/electronics14112268) in Section 4.4 of our manuscript.
Comments 5: Since your approach is hardware-constrained, how might the design of the electronic interface affect scalability?
Response 5: Thank you for your question. As we are not entirely certain of its meaning, we interpret the "design of the electronic interface" as the computing hardware (e.g., the GPU); under this interpretation, the question is indeed highly relevant to the core of our work. A primary contribution of our paper is lowering the hardware barrier: while traditional methods often require professional-grade GPUs, our framework can complete all experiments on a single consumer-grade RTX 4090 with 24 GB of VRAM.
Our lightweight design (e.g., adopting CatVTON) makes the method more adaptable to different hardware, which in itself enhances scalability. In the future, this framework has the potential to be further optimized to run on even lower-resource platforms.
Comments 6: Could the view-decomposed LoRA strategy be extended to subsurface garment modelling?
Response 6: This is a very inspiring suggestion! In fact, we have already tested a semi-transparent sleeveless top in our experiments and achieved promising results (please see Figure 3 in the paper), which provides preliminary validation of our model's potential to handle complex materials. We strongly agree that this is a direction worth exploring. Theoretically, the challenge of achieving cross-view consistency for garment modeling is the same as that for the human body and clothing in a virtual try-on task. Therefore, the extension you propose is likely highly feasible.
Comments 7: Do the authors envision applications of their framework in the healthcare sector?
Response 7: Thank you for the question. As we responded in Point 1, the healthcare sector is a highly specialized and important research direction. However, the level of precision and the focus required for its outcomes differ from our current application in the fashion and e-commerce industries. Therefore, extending this framework to healthcare applications would likely require targeted enhancements to the method. As it stands, the current framework cannot be directly applied to the healthcare domain, and this falls outside the scope of our research.
Comments 8: Did the authors consider developing and extending the system to other bio-signal domains?
Response 8: Thank you for your question. As mentioned in our response to Point 1, the field of bio-signals is highly specialized and important, but the data types and analytical methods it employs are fundamentally different from our computer vision task, which is based on RGB images. Consequently, directly extending our framework to the bio-signal domain is not feasible and is beyond the scope of this study.
Comments 9: Since your approach reduces computational requirements, could it be adapted for low-resource telemedicine environments?
Response 9: As the primary objective of our method is virtual try-on, that is, altering the appearance, style, and texture of a person's clothing, we believe our method would be a strong candidate if a telemedicine environment had such a specific requirement. However, if a low-resource telemedicine environment has application needs that are not focused on try-on, we believe other non-VTON research would be better suited to the required tasks and could achieve even lower computational demands.
Comments 10: Have the authors tested the robustness in more extreme configurations (e.g., rapid pose changes)?
Response 10: This is a very critical and reasonable question. To address it, we have collected a human dataset with self-occlusion and used our method to perform a virtual try-on task to test the robustness improvement brought by the SMPL-X model. The results will be presented in Appendix 1. As for the extreme configuration of rapid pose changes, the initial 3DGS scene itself would fail to generate accurately, leading to predictably poor results. To implement try-on functionality under rapid pose changes, specialized research would be needed, starting from the initial 3D scene reconstruction. Therefore, we have not tested such overly extreme poses in this work. We appreciate your understanding.
Comments 11: How do you intend to align the lightweight strategy with sustainability in large-scale implementations?
Response 11: Thank you for this question, which takes a global perspective. We mentioned in our introduction that VTON technology helps "promote sustainable consumption," for instance, by reducing physical product returns due to improper fit. We believe that our lightweight strategy will increase the adoption of virtual try-on. This will not only reduce unnecessary carbon emissions from shipping clothes but also lower the energy consumption of model training and inference, which is in itself a contribution to sustainable development.
Comments 12: Can the method be optimised using systematic experimental design approaches?
Response 12: Thank you for the suggestion. In this study, our method's design follows the established practices of the field. For example, key hyperparameters (such as loss weights and the number of iterations) were determined based on relevant prior works (e.g., GS-VTON, Gaussian Editor) combined with extensive empirical tuning to achieve an optimal balance between generation quality and computational efficiency. We believe our current experimental design is reasonable and robust.
Comments 13: Considering remote wearable and medical devices, did the authors think about adapting the approach for integrated patient monitoring?
Response 13: As stated in our response to Point 9, the method proposed in this study focuses on virtual try-on, and the majority of the methods and strategies were designed and introduced for this specific application. If the task were patient monitoring, we believe it would not necessarily be optimal to use our framework, as it could lead to an unnecessary waste of computational resources.
Comments on the Quality of English Language: The manuscript features excessively long sentences that contain redundant expressions and lack clarity in the description. Specifically, the transitions between methodological phases are sometimes verbose, and some terminology is used inconsistently across sections (for example, "rendering," "editing," and "fine-tuning" are occasionally used interchangeably without a clear distinction).
Response: Thank you very much for meticulously pointing out the deficiencies in our manuscript's language. We will conduct a thorough professional proofreading of the entire paper, revising all sentences we find to be overly long or complex into more concise and clear descriptions.
However, regarding the inconsistent use of terminology, we have referred to the terms used in previous studies. To ensure our vocabulary aligns with the consensus in the academic community of this field, we will retain the current usage of "rendering," "editing," and "fine-tuning." We will add a terminology table with detailed descriptions at the end of the manuscript to prevent any misunderstanding by the reader.
We sincerely thank you again for these observations. Your review has been immensely helpful, allowing us to consider the development of our work in a broader range of fields and strengthening the conciseness of our descriptions.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes an efficient 3D virtual try-on (VTON) framework, "Splatting the Cat," that aims to address high computational cost, large memory footprint, and cross-view inconsistencies in existing methods. The authors decompose the 3D editing task into a four-stage process. The core innovations include a view-decomposed LoRA strategy that enhances detail clarity and the use of a lightweight CatVTON together with an SMPL-X geometric prior to reduce resource consumption while preserving human-body structural consistency. The concept is clear, the methodology is well designed, and all experiments were conducted on consumer-grade hardware, substantially lowering the adoption barrier for 3D VTON and showing strong practical relevance. To further improve the paper’s quality and rigor, the following questions and suggestions are offered:
- The virtual try-on task is inherently image-based, yet the comparisons are against two text-prompt-based 3D editing methods (EditSplat, Instruct-NeRF2NeRF). This may not be a fair comparison. Please consider adding experimental comparisons with image-based methods to strengthen the paper’s rigor.
- The paper states that the “corresponding expert LoRA module” is selected based on the azimuth angle, but the selection mechanism and boundary-handling strategy are not described. If this approach follows prior work, please cite it; otherwise, elaborate the principle and the workflow in detail, including how tie-breaking near bin edges is handled.
- In the ablation study in Table 1, the “w/o View-Decomposed LoRA” configuration attains a higher PSNR in the frontal view. Please provide a deeper analysis of why a component designed to promote global stability can reduce pixel-level similarity to the ground truth (GT) in certain views.
- In Section 3.4.2, clarify the specific workflow of the multi-view information fusion within a single iteration, preferably with a diagram that illustrates the data flow and update steps.
- Section 4.2 notes that introducing the SMPL-X prior yields limited gains due to the dataset’s simple poses. Please add, in the supplementary material, a case with more challenging poses (e.g., self-occlusion or twisting) to demonstrate the role of the SMPL-X prior in complex scenarios.
Author Response
Dear Reviewer,
Thank you for your valuable comments and insightful perspectives on our manuscript, "Splatting the Cat: Efficient Free-Viewpoint 3D Virtual Try-On via View-Decomposed LoRA and Gaussian Splatting." We agree with all of your points, and your suggestions have been crucial for enhancing the rigor and completeness of our study. We have revised the paper according to your guidance. Below are our point-by-point responses and a description of the corresponding revisions.
Comments 1: The virtual try-on task is inherently image-based, yet the comparisons are against two text-prompt-based 3D editing methods (EditSplat, Instruct-NeRF2NeRF). This may not be a fair comparison. Please consider adding experimental comparisons with image-based methods to strengthen the paper’s rigor.
Response 1: We completely agree with your assessment. The virtual try-on task is inherently image-guided, and a comparison against state-of-the-art image-based methods is essential for accurately evaluating the performance of our framework. Thank you for pointing this out, as it prompted us to conduct more comprehensive experiments.
In response to your suggestion and to strengthen the paper's rigor, we have successfully deployed GS-VTON on higher-specification hardware and conducted detailed qualitative and quantitative comparisons against this image-guided 3D VTON study. The new comparison results have been added to the experiments section (Section 4) of our paper. We are confident that this new comparative experiment more effectively highlights the significant advantages of our method in terms of computational efficiency and garment reconstruction fidelity, all while maintaining high-quality results.
Comments 2: The paper states that the “corresponding expert LoRA module” is selected based on the azimuth angle, but the selection mechanism and boundary-handling strategy are not described. If this approach follows prior work, please cite it; otherwise, elaborate the principle and the workflow in detail, including how tie-breaking near bin edges is handled.
Response 2: Thank you for pointing out this omission in our method description. A clear explanation of the LoRA module selection mechanism is crucial for the reproducibility of our method.
We have now added a detailed explanation to the methods section of the paper (Section 3.3). Our partitioning criterion is based on a simple and explicit rule: the azimuth angle of the camera relative to the subject. The specific ranges are as follows (an illustrative sketch follows the list):
- Front: Azimuth angle between -60° and +60°.
- Side: Azimuth angles between 60° and 120° and between -120° and -60°. These two regions share the same "side expert LoRA."
- Back: Azimuth angle between 120° and 240°.
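To make the rule concrete, the following minimal sketch shows how such azimuth-based expert selection can be implemented. The function name, the angle normalization, and the assignment of the exact bin edges are illustrative assumptions rather than our exact implementation; as noted below, no special tie-breaking is applied at the boundaries.

```python
def select_expert_lora(azimuth_deg: float) -> str:
    """Map a subject-relative camera azimuth (degrees) to an expert LoRA.

    Bins follow the ranges above: front for |a| <= 60, side for
    60 < |a| <= 120 (both flanks share one expert), back otherwise
    (i.e., 120-240 degrees in unwrapped terms). Edge assignment at
    exactly 60 and 120 degrees is an assumption, since no tie-breaking
    rule is specified.
    """
    a = ((azimuth_deg + 180.0) % 360.0) - 180.0  # wrap into [-180, 180)
    if abs(a) <= 60.0:
        return "front"
    if abs(a) <= 120.0:
        return "side"
    return "back"

# Usage: select_expert_lora(90.0) -> "side"; select_expert_lora(200.0) -> "back"
```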
Regarding the boundary handling, after implementing a series of designs for cross-view consistency, such as our reference attention mechanism (as shown in Equation 10), the results already exhibit a degree of smooth transition as the viewpoint changes. Therefore, we did not design any additional strategies to handle boundary issues.
Comments 3: In the ablation study in Table 1, the “w/o View-Decomposed LoRA” configuration attains a higher PSNR in the frontal view. Please provide a deeper analysis of why a component designed to promote global stability can reduce pixel-level similarity to the ground truth (GT) in certain views.
Response 3: This is a very sharp observation, and we thank you for prompting us to think more deeply about this phenomenon. We have carefully analyzed this counter-intuitive result and propose the following explanation:
We hypothesize that this is because a single LoRA method, in its effort to learn a 360-degree appearance, is exposed to images from all angles, allowing it to develop a more coherent and unified understanding of features like the arms. In contrast, while our view-decomposed LoRA approach reduces the occurrence of feature blurring and artifacts, each module specializes in a distinct perspective. Consequently, during the iterative editing of the 3DGS scene, different expert LoRAs can influence the results generated by others. This can cause the outcome of a frontal edit to be slightly altered during a subsequent back-view edit, which in turn leads to a decrease in PSNR values for certain views.
Thank you again for this comment, which allowed us to notice this phenomenon and add this finding to the discussion in Section 4.4 of our study, inspiring future work to address this direction.
Comments 4: In Section 3.4.2, clarify the specific workflow of the multi-view information fusion within a single iteration, preferably with a diagram that illustrates the data flow and update steps.
Response 4: Your suggestion is very pertinent. We acknowledge that we did not provide a complete description of the implementation details, which could hinder the reader's understanding of our method. To address this, we have added a more detailed description of the process in Section 3.4.2, explaining how the attention mechanism is fused. We hope this description is now sufficiently clear.
Comments 5: Section 4.2 notes that introducing the SMPL-X prior yields limited gains due to the dataset’s simple poses. Please add, in the supplementary material, a case with more challenging poses (e.g., self-occlusion or twisting) to demonstrate the role of the SMPL-X prior in complex scenarios.
Response 5: This is an excellent suggestion. Directly demonstrating the role of SMPL-X in challenging scenarios is indeed the most compelling proof of its value.
To this end, we have collected a dataset featuring self-occlusion (hands covering the abdomen), where the arms are not even visible from the back. We then conducted an experiment on this dataset using two configurations: our full method (View-Decomposed LoRA + SMPL-X) and a version with only View-Decomposed LoRA (without SMPL-X). We chose a sleeveless sheer garment for the try-on to more clearly reveal the human body parts, including the arms. The results are now presented in Appendix 1. It is clearly visible that the results generated without SMPL-X are inferior in the rendering of the wrists, arms, and face compared to the full method. This provides strong evidence of the robustness introduced by incorporating SMPL-X.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript proposed methods to address some challenges in VTON. Although the manuscript presents some details of the proposed method, there are still some concerns.
1. The challenges are not clear. In the abstract, the description of the challenges is “Existing 3D VTON approaches commonly face challenges such as high computational costs, substantial memory requirements, and cross-view inconsistencies”. According to the content of the entire manuscript, it seems that the memory requirement is the challenge being addressed, not the computational cost. Alternatively, a more precise expression is needed here.
2. The lack of clarity in the proposed improvement ideas raises doubts about feasibility. Although CatVTON is chosen as a lightweight model, a reference-driven strategy is adopted to address the shortcomings of diffusion. However, this approach requires generating a series of images, which increases the complexity of the model again. There is no explanation of the complexity of the proposed approach compared with other VTON models.
3. In Section 3.3, the image set is manually partitioned into three independent subsets. The criteria for this manual partitioning are not explained, which casts doubt on the reproducibility of the method.
4. In the experiment, GS-VTON could not be deployed on an RTX 4090 with 24 GB of memory. This is not a good reason to omit a comparison with this method. It is suggested to deploy GS-VTON on a better configuration and compare the proposed method against its performance, not just its complexity.
5. In the experiments, basic indicators such as FLOPs and parameter counts were missing from the comparison. These are key evaluation indicators for demonstrating the effectiveness of the proposed method in addressing computational complexity.
6. In the experiments that compared with other methods, the results presented were visual in nature. The lack of quantifiable evaluation indicators makes it impossible to assess the performance improvement of the proposed method precisely. Therefore, it is recommended to present the comparative results with specific indicators.
Author Response
Dear Reviewer,
Thank you for your insightful and constructive comments on our manuscript. Your concerns and suggestions have pinpointed key areas of our paper, prompting us to conduct deeper experiments and provide more rigorous discussions. We have made substantial additions and revisions to the paper based on your guidance, and we are confident that the current version more clearly presents our contributions. Below are our point-by-point responses to your comments.
Comments 1: The challenges are not clear. In the abstract, the description of the challenges is “Existing 3D VTON approaches commonly face challenges such as high computational costs, substantial memory requirements, and cross-view inconsistencies”. According to the content of the entire manuscript, it seems that the memory requirement is the challenge being addressed, not the computational cost. Alternatively, a more precise expression is needed here.
Response 1: Thank you for your feedback. This point has made us realize the need for greater precision in defining the challenges. The term we intended to convey was actually "equipment cost." Since memory requirements directly drive up hardware acquisition costs, which in turn hinder the widespread adoption of 3D VTON technology, we will revise "computational costs" in the abstract to the more accurate term "equipment cost." Furthermore, at the end of the abstract, we will change "computational costs" to "memory requirements" to more precisely define the primary challenge we aim to address. Thank you for pointing this out, as it helps to clarify the contributions of our research.
Comments 2: The lack of clarity in the proposed improvement ideas raises doubts about feasibility. Although CatVTON is chosen as a lightweight model, a reference-driven strategy is adopted to address the shortcomings of diffusion. However, this approach requires generating a series of images, which increases the complexity of the model again. There is no explanation of the complexity of the proposed approach compared with other VTON models.
Response 2: Thank you for giving us the opportunity to elaborate on our design philosophy in more detail. In the current mainstream of 3DGS-based scene editing for 3D VTON, the workflow invariably includes a necessary step: using a 2D VTON model to generate try-on reference images for multiple views. This means that the efficiency of the 2D VTON model directly impacts the overall complexity of the entire pipeline.
Our core improvement idea targets this very point. By employing the extremely lightweight CatVTON model at this stage, its advantages of low parameter count and high efficiency are amplified with every single view-image generation. Compared to heavyweight 2D VTON models that use complex encoders and cross-attention mechanisms, our choice can reduce computational resource consumption and memory occupancy by several fold, or even orders of magnitude, in this essential step. This is not only key to our lightweight design but also the foundation that allows the entire framework to run on consumer-grade hardware. Furthermore, this approach also significantly reduces computational complexity compared to previous physics-based 3D VTON methods.
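To illustrate the concatenation-style conditioning that underlies CatVTON's lightweight design, consider the minimal sketch below: the garment latent is simply concatenated with the person latent so that an ordinary self-attention denoiser sees both, removing the need for a dedicated garment encoder or cross-attention branch. The tensor shapes and variable names are illustrative assumptions, not the model's actual configuration.

```python
import torch

# Illustrative VAE-latent shapes: (batch, channels, height, width).
person_latent  = torch.randn(1, 4, 64, 48)  # masked try-on target, encoded
garment_latent = torch.randn(1, 4, 64, 48)  # reference garment, encoded

# Concatenation-style conditioning: garment and person share one canvas,
# so a plain self-attention UNet can relate them without any extra
# encoder or cross-attention parameters.
unet_input = torch.cat([person_latent, garment_latent], dim=-1)
print(unet_input.shape)  # torch.Size([1, 4, 64, 96])
```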
Comments 3: In Section 3.3, the image set is manually partitioned into three independent subsets. The criteria for this manual partitioning are not explained, which casts doubt on the reproducibility of the method.
Response 3: Thank you for pointing out this omission in our method description. A clear explanation of the LoRA module selection mechanism is crucial for the reproducibility of our method.
We have now added a detailed explanation to the methods section of the paper (Section 3). Our partitioning criterion is based on a simple and explicit rule: the azimuth angle of the camera relative to the subject. The specific ranges are as follows:
- Front: Azimuth angle between -60° and +60°.
- Side: Azimuth angles between 60° and 120° and between -120° and -60°. These two regions share the same "side expert LoRA."
- Back: Azimuth angle between 120° and 240°.
Comments 4: In the experiment, GS-VTON could not be deployed on an RTX 4090 with 24 GB of memory. This is not a good reason to omit a comparison with this method. It is suggested to deploy GS-VTON on a better configuration and compare the proposed method against its performance, not just its complexity.
Response 4: You have raised a critical experimental requirement, and we fully agree. An incomplete comparison cannot sufficiently demonstrate the superiority of our method.
Therefore, during the revision period, we have made a special effort to successfully deploy and reproduce GS-VTON on higher-specification hardware (a V100 GPU). We have now included in the new experiments section a comprehensive comparison with GS-VTON on the same tasks, covering both qualitative visual results and quantitative performance metrics. We are confident that these new experimental results will provide readers with a more complete perspective for evaluation.
Comments 5: In the experiments, basic indicators such as FLOPs and parameter counts were missing from the comparison. These are key evaluation indicators for demonstrating the effectiveness of the proposed method in addressing computational complexity.
Response 5: You are absolutely correct. For a paper that emphasizes efficiency, these basic metrics are indispensable. This was a significant oversight on our part.
In the revised manuscript, we have added a model complexity comparison table to the experiments section. This table details the specific data for our method and GS-VTON in terms of GFLOPs and the number of parameters. The data clearly show that our framework substantially reduces computational complexity, providing strong quantitative support for our claims of "high efficiency" and "lightweight." We only compare with GS-VTON because the other two studies we reproduced are fundamentally 3D scene editing research, making their overall pipeline design and methodology less comparable.
Comments 6: In the experiments that compared with other methods, the results presented were visual in nature. The lack of quantifiable evaluation indicators makes it impossible to assess the performance improvement of the proposed method precisely. Therefore, it is recommended to present the comparative results with specific indicators.
Response 6: We completely agree with your view that visual comparisons alone are insufficient and that objective, quantitative metrics are needed to evaluate performance. The reason we did not provide them in the initial draft is that the 3D VTON field currently lacks a widely accepted, unified standard for quantitative evaluation.
To address your concern in the most rigorous way possible, we have designed and implemented a scoring system. This system calculates a quantitative score for each result from multiple perspectives, including using CLIP to obtain text descriptions of the generated image and the original garment for similarity computation, using Canny edge detection to acquire edge features, and analyzing the dominant color to judge color fidelity. These results are now presented in Section 4.2. We believe this rigorous evaluation system can more fairly and comprehensively reflect the true performance of different methods.
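As an illustration only, a plausible skeleton of such a scoring system is sketched below, combining CLIP-based similarity, Canny edge overlap, and dominant-color comparison. The model checkpoint, function names, thresholds, and combination weights are illustrative stand-ins, not the exact configuration reported in Section 4.2.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    # Cosine similarity of CLIP image embeddings as a semantic-fidelity proxy.
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])

def edge_overlap(img_a: np.ndarray, img_b: np.ndarray) -> float:
    # IoU of Canny edge maps as a crude structural-fidelity proxy.
    ea = cv2.Canny(cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY), 100, 200) > 0
    eb = cv2.Canny(cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY), 100, 200) > 0
    return float(np.logical_and(ea, eb).sum()) / max(np.logical_or(ea, eb).sum(), 1)

def dominant_color(img: np.ndarray, k: int = 3) -> np.ndarray:
    # k-means over pixels; the largest cluster's center is the dominant color.
    pixels = img.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 3,
                                    cv2.KMEANS_PP_CENTERS)
    return centers[np.bincount(labels.ravel(), minlength=k).argmax()]

def garment_score(gen_bgr, ref_bgr, gen_pil, ref_pil) -> float:
    # Hypothetical weighted combination of the three fidelity terms.
    color_dist = np.linalg.norm(dominant_color(gen_bgr) - dominant_color(ref_bgr))
    color_term = 1.0 - min(color_dist / 441.7, 1.0)  # 441.7 ~ max RGB distance
    return (0.5 * clip_similarity(gen_pil, ref_pil)
            + 0.3 * edge_overlap(gen_bgr, ref_bgr)
            + 0.2 * color_term)
```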
Once again, we sincerely thank you for your invaluable feedback. Your rigorous standards have been the greatest motivation for improving the quality of our research.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I thank the authors for their replies to the comments. I have no further comments to make.
Author Response
Thank you for the reviewer's comments.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have revised the paper according to the reviewers' comments, and I have no further comments.
Author Response
Thank you for the reviewer's comments.
Reviewer 3 Report
Comments and Suggestions for Authors
Thanks for the revision of the manuscript, especially for conducting the experiments on better equipment, which makes the results more persuasive.
There are still some issues that should be considered.
1. The expression "equipment cost" is not common. It is recommended to use wording related to deployment instead.
2. It is suggested to provide a more fundamental explanation for the reduction in deployment difficulty, rather than simply the adoption of lightweight models, because improvements on a lightweight model might introduce additional complexity.
Author Response
Dear Reviewer,
Thank you once again for your insightful feedback on our revised manuscript. We completely agree that strengthening the precision and completeness of the descriptions for these points significantly enhances the quality of our paper. We sincerely appreciate your diligence and the opportunity to further clarify our work.
Below are our point-by-point responses to your comments:
Comments 1: The expression "equipment cost" is not common. It is recommended to use wording related to deployment instead.
Response 1: Thank you for pointing out this imprecision. The term you suggested, "deployment difficulty," perfectly captures the challenge we aim to address—namely, that high hardware barriers hinder the widespread adoption and practical application of the technology. We have amended the terminology in both the Abstract and the Conclusions to "barriers to practical deployment" to more accurately reflect our research motivation.
Comments 2: It is suggested to provide a more fundamental explanation for the reduction in deployment difficulty, rather than simply the adoption of lightweight models, because improvements on a lightweight model might introduce additional complexity.
Response 2: This is a very reasonable and necessary point of clarification, and we thank you for raising it.
Our core design philosophy is to reduce deployment difficulty by decoupling the complex, monolithic 3D virtual try-on task into four manageable and computationally optimized sub-processes. Our contribution lies not in a single disruptive algorithm, but in a "divide and conquer" strategy where optimizations are strategically applied at each stage.
Your concern about the CatVTON architecture is particularly crucial. Its lightweight nature is not achieved by shifting computational complexity to other stages; rather, it fundamentally reduces complexity by removing modules—such as dedicated clothing encoders and cross-attention layers—that were widely considered essential in previous works. As our results clearly demonstrate, this radical simplification does not lead to a sacrifice in generation quality.
For the other stages of our pipeline, our strategies were designed to enhance quality after carefully considering multiple factors:
- 3D Representation: Our choice of 3DGS over NeRF (even though NeRF representations can have a smaller file size) was based on 3DGS's superior performance across multiple aspects, including rendering speed, training efficiency, and lower computational complexity for editing. These factors collectively contribute to a better development and user experience.
- Additional Modules: The integration of SMPL-X and LoRA was guided not only by their extremely lightweight nature (their parameter overhead is almost negligible in this task) but, more importantly, by their ability to significantly boost quality at a minimal cost. They provide crucial improvements in geometric stability and texture clarity, respectively (a back-of-the-envelope illustration of the LoRA overhead follows this list).
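To give a sense of why the LoRA parameter overhead is negligible, the calculation below uses illustrative numbers (a 768-wide projection adapted at rank 4); these are not the actual layer widths and ranks used in our framework.

```python
# LoRA adds two thin factors A (d x r) and B (r x d) beside a frozen
# d x d projection, so the trainable overhead per layer is 2*d*r / d**2.
d, r = 768, 4                     # illustrative layer width and LoRA rank
base_params = d * d               # frozen projection weight: 589,824
lora_params = 2 * d * r           # trainable adapter weights: 6,144
print(lora_params / base_params)  # ~0.0104, roughly 1% per adapted layer
```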
We acknowledge that our previous manuscript was too brief in explaining this design philosophy. We are grateful for your feedback, which has prompted us to add more detailed descriptions in the Abstract, Introduction, and Method (Section 3.2) to better articulate the rationale behind our framework's efficiency.
We hope this response has fully addressed your concerns. We are truly grateful for your thoughtful engagement and your commitment to improving the quality of our research.
Author Response File: Author Response.pdf