Review Reports
- Weinan Cai 1,2,3,
- Zongji Wang 1,2 and
- Yuanben Zhang 1,2,*
- et al.
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Han Wu
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The article addresses the issue of improving the diversity and efficiency of 3D shape generation through generative adversarial networks (GANs), with a particular emphasis on cross-category training strategies.
Despite these strengths, the text has a number of shortcomings that limit its scientific value. The language and style of the article are inconsistent and contain typos and grammatical errors in places, which reduces the readability and professional level of the text.
The methodological section lacks sufficient justification for some key choices, such as the use of a specific subset of the ShapeNet dataset or the limitation to four categories of objects.
The authors do not indicate whether statistical tests were performed to verify the significance of the differences between the models, which complicates the interpretation of the results.
The discussion section pays insufficient attention to comparisons with diffusion models and does not sufficiently analyze computational complexity, including time and memory consumption.
There is no in-depth discussion of the practical implications of the proposed method for VR/AR systems, CAD tools, or other applications. Similarly, no potential directions for future research are suggested, such as extending the method to texture generation or dynamic scenes.
Author Response
Comment 1: Inconsistent language and style with typos and grammatical errors reduce readability and professionalism.
Response 1: We sincerely thank the reviewer for highlighting the language and stylistic issues in our manuscript. In this revision, we have conducted a comprehensive proofreading of the entire text to address all concerns raised. Specifically, we have:
- Corrected all typographical errors and grammatical mistakes;
- Revised ambiguous or poorly structured sentences for clarity;
- Eliminated redundant expressions to enhance conciseness;
- Standardized technical terminology and resolved inconsistencies in key concept definitions;
- Ensured consistent style and professional tone throughout the manuscript.
These improvements enhance the readability and professionalism of the paper. We appreciate the reviewer’s valuable feedback, which has strengthened the overall quality of our work.
Comment 2: Methodology lacks sufficient justification for key choices (e.g., specific ShapeNet subset selection and limitation to four object categories).
Response 2: We thank the reviewer for highlighting the need for stronger justification of our methodological choices. In response, we have added a dedicated paragraph in Section 4.1 to explicitly clarify our ShapeNet subset selection strategy. Specifically:
- We selected six categories (Rifle, Chair, Airplane, Motorbike, Car, Table) from ShapeNet’s 55 synsets based on two criteria: (1) data availability (enabling balanced subsampling to exactly 1,000 instances per category to eliminate dataset-size bias when comparing with GET3D and EG3D) and (2) geometric diversity (ensuring significant topological and structural variation across categories). Note: since ShapeNet’s Motorbike synset contains only 812 instances, we supplemented it with 188 high-quality Motorbike models from the OmniObject3D dataset to reach the required 1,000 samples, maintaining consistent mesh quality standards through our preprocessing pipeline (a brief illustrative sketch of this balancing step is given at the end of this response).
- The primary four categories (Rifle, Chair, Airplane, Motorbike) were prioritized for quantitative evaluation and visual comparisons as they represent the most challenging cross-category generalization scenario.
- Car and Table serve as supplementary categories to validate scalability beyond the core experiments, with results included in Section 4.
Additionally, the entire Methodology section (Section 3) has been substantially revised to provide deeper technical explanations.
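For transparency, the balanced subsampling described above can be illustrated with a short sketch. This is a simplified, hypothetical version (the directory layout, file extensions, and the sample_category helper are illustrative), not the exact preprocessing pipeline used in the manuscript:

```python
import random
from pathlib import Path

random.seed(0)          # fixed seed for reproducible subsampling
TARGET = 1000           # balanced instance count per category

def sample_category(mesh_paths, n=TARGET):
    # Randomly subsample a category to exactly n meshes.
    if len(mesh_paths) < n:
        raise ValueError("category is short of data and needs supplementation")
    return random.sample(mesh_paths, n)

# Hypothetical layout: one folder of preprocessed meshes per category.
categories = {c: sorted(Path(f"data/shapenet/{c}").glob("*.obj"))
              for c in ["rifle", "chair", "airplane", "motorbike", "car", "table"]}

balanced = {}
for name, paths in categories.items():
    if name == "motorbike":
        # ShapeNet provides only 812 motorbikes; top up with OmniObject3D models
        # (path is illustrative) before balancing to TARGET.
        extra = sorted(Path("data/omniobject3d/motorcycle").glob("*.obj"))
        paths = paths + random.sample(extra, TARGET - len(paths))
    balanced[name] = sample_category(paths, TARGET)
```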
Comment 3: Absence of statistical tests to verify significance of model differences complicates result interpretation.
Response 3: We thank the reviewer for this insightful comment regarding statistical rigor. We agree that statistical tests generally provide a stronger basis for comparison in experimental sciences. However, in the field of deep generative modeling—specifically 3D GANs and neural rendering—performing such statistical tests (e.g., repeating the full training process multiple times to calculate standard deviations and p-values) is computationally prohibitive and arguably uncommon in standard literature. We kindly ask the reviewer to consider the validity of our results based on the following justifications:
- High Computational Cost: As detailed in Section 4.1, our model training involves significant computational resources (using four VGPU-48GB GPUs). Retraining the model and all baselines multiple times to generate a distribution for statistical testing would require weeks of GPU time, which is beyond the scope of typical 3D generation studies.
- Substantial Performance Gap: We respectfully submit that the magnitude of improvement achieved by our method suggests the results are unlikely to stem from random variance. As shown in Figure 3:
- Our method generates 3D models with accurate geometric structures, overall smoothness, and clarity when processing cross-category datasets. Even for complex composite categories, the model maintains strong performance, as reported in Table 3.
- Other 3D GAN methods suffer from structural distortions, such as uneven surfaces, broken chair legs, or deformed airplane wings.
- The latent space of these models fails to effectively disentangle the features of different categories; for instance, a generated chair may exhibit features of a rifle. Such mutual interference leads to blurry and implausible geometric structures.
- Standard Evaluation Protocol: We have strictly followed the evaluation protocols established by previous SOTA 3D GAN works (EG3D, GET3D). These foundational works also report the performance of the best-converged model without statistical significance tests. By aligning our evaluation method with theirs, we ensure a fair and direct comparison.
Comment 4: Discussion inadequately compares with diffusion models and fails to analyze computational complexity (time/memory consumption).
Response 4: Thank you for this valuable feedback. We have addressed this limitation by adding a computational complexity analysis based on new experiments. As shown in the newly added Table 4, we compare parameter scale, inference time, and VRAM usage across methods. The results show that GAN-based approaches (including ours) typically require fewer parameters and less computation than diffusion models such as DiffTF. Our method performs inference in 139 ms with 4.5 GB of VRAM, compared to DiffTF's approximately 2 s and 8.02 GB.
While our method demonstrates modest efficiency advantages over diffusion-based approaches in terms of inference speed and memory requirements, we acknowledge that diffusion models often achieve superior generation quality and diversity. We have updated Section 4.3 to include this balanced comparison, helping readers understand the practical trade-offs between these methodological approaches based on their specific application needs.
Comment 5: No in-depth exploration of practical applications (VR/AR, CAD tools) or suggested future directions (e.g., texture generation, dynamic scenes).
Response 5: We sincerely thank the reviewer for this valuable suggestion. In response, we have made two significant additions to our paper:
- We added a "Practical Applications" paragraph in the Conclusions section that specifically examines how our cross-category generation capability can transform VR/AR scene construction. Rather than requiring separate models for each object category, our framework enables a single model to generate diverse objects (vehicles, furniture, buildings) within a unified pipeline, substantially reducing computational overhead and development time for virtual environment creation.
- We expanded the "Limitations and Future Directions" paragraph to provide concrete pathways for extending our work. Specifically, we propose: (1) hybrid 3D representations combining tri-plane structures with explicit surface models to enhance geometric detail; (2) unified geometry-appearance modeling via a single implicit function or conditioned texture diffusion for high-fidelity texturing; and (3) domain-specific semantic distillation to mitigate CLIP bias and enhance generation diversity across underrepresented categories.
Reviewer 2 Report
Comments and Suggestions for Authors
The idea you are exploring is interesting, and the results show that the approach has potential. The comparisons with existing methods are helpful, and the improvement in diversity across categories is clear. At the same time, there are a few areas where the paper would benefit from clearer explanation so that readers can fully understand how the method works.
The main place that needs more detail is the methodology. Some parts of the model are described in a way that leaves questions unanswered. For example, the way the semantic feature is aligned with the latent code is not entirely clear, and it would help if you explained whether you use any projection layers or other adjustments to make the features compatible. The separation of the latent vector into semantic and category components is also mentioned but not described in a straightforward way. Adding a short, simple explanation here would make the method easier to follow.
The architecture figures are useful, but some symbols appear without definition in the text. Clarifying these points would help readers who are not already familiar with this type of model. A few of the captions could also be expanded slightly so the figures stand on their own.
Your experiments are generally well designed, but they would be stronger with a few additional details. Basic statistics about the dataset, especially the number of samples per category and the splits, would help support your discussion about balancing categories. Since you mention the differences in training time between models, it would also be helpful to summarise this in the results rather than only mentioning it briefly in the text.
Comments on the Quality of English Language
The manuscript is generally understandable, but some sentences are longer than necessary and a few typos and missing cross-references interrupt the flow. A light language edit would help make the ideas clearer and ensure the technical points are easier to follow.
Author Response
Comments 1: Methodology requires clearer explanation, particularly regarding semantic feature alignment with latent codes (e.g., projection layers/adjustments) and the separation of latent vectors into semantic/category components.
Response 1: We appreciate the reviewer’s feedback regarding the clarity of our methodology. We have revised Section 3.1 (Semantic-Aware Generation) and Section 3.2 to explicitly detail the latent space decomposition and feature alignment process:
- Latent Vector Separation: We clarified that the mapped latent code ω is explicitly sliced into two orthogonal 512-dimensional components: ω_s for geometric shape modulation and ω_c for categorical semantic control. This disentanglement allows the network to optimize geometry and semantics independently.
- Semantic Feature Alignment: We expanded the explanation of the projection mechanism. The generator synthesizes a specific semantic feature vector f_s alongside the tri-plane features. This vector is aligned with the target CLIP text embedding E_T(l_gt) via a cosine similarity-based directional loss (L_semantic), forcing the generated shape to adhere to the semantic constraints of the category.
We have also updated references to Figure 1 to better illustrate this dual-path flow.
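To make the dual-path design more concrete, the following is a minimal, hypothetical sketch of the two operations described above, assuming a 1024-dimensional mapped latent and CLIP text embeddings of the same dimension as the generated semantic feature; it illustrates the idea rather than the exact implementation in the paper:

```python
import torch
import torch.nn.functional as F

def split_latent(w: torch.Tensor):
    """Slice the mapped latent w of shape (B, 1024) into two orthogonal 512-dim
    parts: w_s for geometric shape modulation, w_c for categorical control."""
    return w[:, :512], w[:, 512:]

def semantic_directional_loss(f_s: torch.Tensor, e_t: torch.Tensor):
    """Cosine-similarity-based alignment between the generated semantic feature
    f_s and the CLIP text embedding e_t = E_T(l_gt) of the target category."""
    f_s = F.normalize(f_s, dim=-1)
    e_t = F.normalize(e_t, dim=-1)
    return (1.0 - (f_s * e_t).sum(dim=-1)).mean()

# Usage with random tensors standing in for network outputs:
w = torch.randn(4, 1024)
w_s, w_c = split_latent(w)
loss = semantic_directional_loss(torch.randn(4, 512), torch.randn(4, 512))
```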
Comments 2: Architecture figures need defined symbols and expanded captions to improve self-containment and accessibility for readers unfamiliar with the model type.
Response 2: We thank the reviewer for this valuable suggestion. In response, we have revised Figure 1 to enhance its clarity and self-containment. Specifically, we have explicitly defined the key symbols (e.g., the noise z and the latent code ω) in the caption and added a dedicated legend for the core components, including the generator G, the discriminator D, the latent code ω (with explanatory notes), and the CLIP encoder. Additionally, we have depicted the entire workflow in greater detail to improve readability.
Comments 3: Experiments would benefit from additional details: dataset statistics (samples per category, splits) to support balancing discussions, and summarized training time comparisons in results rather than brief textual mentions.
Response 3: We appreciate the reviewer's valuable suggestion to enhance experimental transparency. Regarding dataset statistics: We selected six categories (Rifle, Chair, Airplane, Motorbike, Car, Table) from ShapeNet’s 55 synsets based on two criteria:
(1) data availability (enabling balanced subsampling to exactly 1,000 instances per category to eliminate dataset-size bias when comparing with GET3D and EG3D)
(2) geometric diversity (ensuring significant topological and structural variation across categories). Note: since ShapeNet's Motorbike synset contains only 812 instances, we supplemented it with 188 high-quality Motorbike models from the OmniObject3D dataset to reach the required 1,000 samples, maintaining consistent mesh quality standards through our preprocessing pipeline.
- The primary four categories (Rifle, Chair, Airplane, Motorbike) were prioritized for quantitative evaluation and visual comparisons as they represent the most challenging cross-category generalization scenario.
- Car and Table serve as supplementary categories to validate scalability beyond core experiments, with results included in Section 4.
For training efficiency comparisons: we have added Table 4 in Section 4.3 summarizing the model parameter counts and inference times across all compared methods (Ours, GET3D, DiffTF). The table explicitly reports parameter counts, average inference latency, and GPU memory consumption. Our method achieves the best efficiency, with 0.32B parameters (comparable to GET3D's 0.34B and significantly smaller than DiffTF's 1.2B), the fastest inference time (139 ms versus GET3D's 151 ms and DiffTF's 2 s), and the lowest VRAM usage (4.5 GB). These results demonstrate our approach's computational advantages while maintaining cross-category generation capabilities.
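The efficiency figures above can in principle be reproduced with standard PyTorch profiling utilities. The sketch below is a generic, assumed measurement procedure (the model handle and input batch are placeholders), not the exact benchmarking script used for Table 4:

```python
import torch

def profile_generator(model, z, runs=50):
    """Report parameter count (billions), mean inference latency (ms), and
    peak VRAM (GB) for repeated forward passes on a CUDA device."""
    params_b = sum(p.numel() for p in model.parameters()) / 1e9
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        model(z)                      # warm-up pass
        torch.cuda.synchronize()
        start.record()
        for _ in range(runs):
            model(z)
        end.record()
        torch.cuda.synchronize()
    latency_ms = start.elapsed_time(end) / runs
    vram_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return params_b, latency_ms, vram_gb
```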
Reviewer 3 Report
Comments and Suggestions for Authors
This work expands the capabilities of 3D GANs by introducing a new framework that enables diverse and high-quality 3D shape generation across categories. Its key strength is combining style-based control with CLIP guidance to achieve both diversity and efficiency without sacrificing quality.
The Introduction effectively organizes related research and clearly presents the background based on numerous references. Additionally, the Related Work and Method sections concisely explain the challenges of 3D GANs and the positioning of the proposed approach, resulting in an overall structure that is very easy to follow. This study is significant as it contributes to enhancing the generative and generalization capabilities of 3D GANs and advancing efficient 3D shape generation techniques.
My overall recommendation for this manuscript is to publish it in its current form. Minor revisions are suggested.
(1) Page 7, Lines 205–207, “This progression helps…”
Regarding the sentence, it may be helpful to further explain how this progression enables such an assessment. Adding a brief description of the evaluation criteria or observations used to judge robustness and generalization would make the claim clearer.
(2) Page 7, Line 232, “Section 4.3 Quantitative Results”
The results presented in Table 2 are informative; however, the accompanying text is relatively brief. It would be helpful for readers if the results section more explicitly highlighted the key findings from Table 2.
1) “Ours” in Table 2 likely refers to the method proposed by the authors, but this is not clearly described in either the table or the text. Including a brief explanation of the “Ours” method would help clarify the results.
2) “COV (%)” is included in Table 2, but its full name and meaning are not explained. Providing this information in the text would be helpful.
3) The description of Table 2 clearly presents the magnitude of improvements—for instance, a 23.7% gain in FID and an 18.2% reduction in MMD—which is very helpful. However, for the
(3) Page 10, Line 270, “5. Conclusion”
The current Conclusion is concise and clearly summarizes the main ideas; however, slightly expanding it could help reinforce the significance of your contributions. For example, you might briefly highlight:
1) the key achievements of the work
2) the aspects where your method clearly outperforms prior approaches
3) the specific improvements confirmed in the experiments.
Including these points would strengthen the impact and clarity of the concluding remarks.
(4) Page 10, Line 277, “5.1. Limitations and Future Study”
It may be better to remove the “5.1” heading and include this discussion within the Conclusion section, since there is no “5.2.” The Future Research section already provides useful points regarding the remaining limitations. However, it could benefit from a bit more structure, for example by indicating:
1) which limitations are the most critical
2) what potential directions or strategies could address them.
Adding this structure would give readers a clearer sense of the next steps for advancing this line of work.
Comments for author File: Comments.pdf
Author Response
Comments 1: Page 7, Lines 205–207: Clarify how the described progression enables robustness/generalization assessment by explicitly stating evaluation criteria or observational metrics.
Response 1: We appreciate the request for clearer articulation of our evaluation criteria for robustness and generalization. We have revised the 'Experiment Configuration' section to explicitly state the quantitative metrics used in each of the three stages:
- Single-Category Generation Quality: This is assessed using standard metrics, namely Fréchet Inception Distance (FID), Maximum Mean Discrepancy (MMD), and Coverage Percentage (COV), which are discussed in detail in the metrics paragraph. These metrics measure the fidelity and realism of generated shapes relative to ground-truth data.
- Scalability/Robustness Limits: This is measured by tracking the degradation of generation quality (FID/MMD) as we progressively increase semantic complexity (pairwise → three-way → full-set). The minimum level of semantic complexity that causes a significant performance drop-off defines the model's robustness limit and its generalization capability across mixed semantic domains.
- Efficiency Analysis: We investigate scalability and practical feasibility by comparing key resource consumption metrics as presented in Table 4 (model parameter scale, inference time, and VRAM usage). This quantitative comparison helps identify the model's generalization capabilities and robustness in real-world deployment scenarios.
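For completeness, the metrics referenced above are usually defined as follows in the 3D generation literature (the manuscript's metrics paragraph fixes the exact conventions; the feature extractor for FID and the shape distance D, e.g., Chamfer distance, are choices of the evaluation protocol):

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\bigl(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\bigr),$$

$$\mathrm{MMD}(S_g, S_r) = \frac{1}{|S_r|} \sum_{Y \in S_r} \min_{X \in S_g} D(X, Y), \qquad \mathrm{COV}(S_g, S_r) = \frac{\bigl|\{\arg\min_{Y \in S_r} D(X, Y) \,:\, X \in S_g\}\bigr|}{|S_r|},$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature means and covariances of real and generated samples and $S_g$, $S_r$ are the generated and reference shape sets; lower FID/MMD indicates better fidelity, while higher COV indicates better diversity.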
Comments 2: Table 2 presentation requires improvements: (a) define "Ours" methodology explicitly, (b) explain "COV (%)" full name and significance, and (c) expand textual discussion to highlight key quantitative findings beyond brief mentions.
Response 2: We appreciate the feedback regarding the presentation of Table 2 and have made the following improvements to enhance clarity and robustness:
- Define "Ours": The methodology is now explicitly defined. The row previously labeled "Ours" has been renamed to "Ours (Semantically-Guided 3D GAN)" to clearly indicate the model configuration being evaluated.
- Explain "COV (%)": We have added a detailed explanation of the Coverage Percentage (COV) metric, including its full name and significance as a measure of generation diversity, within the metrics paragraph of Section 4.
- Expand Textual Discussion: We have expanded the textual discussion to better highlight the key quantitative findings from Table 2, specifically incorporating a critical analysis of these results within the first paragraph of the Conclusion section to better frame our overall contribution.
Comments 3: Conclusion section (Page 10) needs expansion: (a) integrate Limitations/Future Study by removing "5.1" heading, (b) explicitly summarize key achievements and advantages over prior methods, and (c) structure limitations by criticality with concrete solution strategies.
Response 3: Thank you for your valuable feedback. We have comprehensively revised the conclusion section to address all three points:
- We integrated limitations and future work directly into the main conclusion text by removing the subsection heading;
- We explicitly summarize our key achievements at the beginning, highlighting our CLIP-driven semantic awareness and relativistic pairing difference loss that enable superior cross-category sampling and robustness compared to existing 3D GAN methods;
- We structured limitations by criticality (geometric detail, texture generation, CLIP bias) with specific solution strategies for each, including hybrid 3D representations, unified geometry-appearance modeling, and domain-specific semantic distillation.
The revised conclusion now provides a cohesive narrative that balances achievements, applications, limitations, and future directions.
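For reference, the relativistic pairing objective mentioned in the summary above, as popularized by R3GAN, is commonly written in the following textbook form (the manuscript's Section 3 specifies the exact modified loss used in this work):

$$\mathcal{L}_{G} = \mathbb{E}_{z \sim p_z,\, x \sim p_{\mathrm{data}}}\bigl[f\bigl(D(x) - D(G(z))\bigr)\bigr], \qquad \mathcal{L}_{D} = \mathbb{E}_{z \sim p_z,\, x \sim p_{\mathrm{data}}}\bigl[f\bigl(D(G(z)) - D(x)\bigr)\bigr],$$

with $f(t) = \log(1 + e^{t})$, so that each generated sample is paired against a real sample and judged by the difference of the discriminator's scores rather than by its absolute value.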
Reviewer 4 Report
Comments and Suggestions for Authors
Review Report: "Enhancing 3D Shape Generation Diversity and Efficiency: A Cross-Category Trained Generative Model"
The manuscript "Enhancing 3D Shape Generation Diversity and Efficiency: A Cross-Category Trained Generative Model" by Weinan Cai et al. proposed a novel 3D GAN-based framework aimed at improving the diversity, efficiency, and generalization of 3D shape generation, particularly in cross-category scenarios. The authors claim a LIP embeddings based semantic-aware generator and modified the loss function from R3GAN. The manuscript is well-structured and the data are solid, by studying the experiments on ShapeNet demonstrate the advantages of the model over existing GAN and other models like diffusion-based methods in terms of FID, MMD, and category coverage. Overall, this paper addresses a pressing challenge in 3D generative modeling that the cross-category generalization,which is highly relevant for applications in VR/AR, robotics, and gaming. This would give broad interests and has promising applications. Also, the manuscript has a comprehensive experiments, both evaluated across across single, dual, triple, and full-category datasets. And compared the results with multiple different models. Based on the comments above, I would recommend this manuscript to be published and here are more detailed suggestions.
First, the reliance on CLIP embeddings is central to the method. A discussion on potential limitations or biases introduced by CLIP would strengthen the paper.
The figure captions are too short and lack detailed descriptions of the figures (e.g., Fig. 1).
The study is based only on Rifle, Chair, and Airplane. Additional datasets could be employed in this study to strengthen the claims.
Author Response
Comments 1: Discuss potential limitations or biases introduced by reliance on CLIP embeddings to strengthen the paper's critical analysis and robustness.
Response 1: We agree that a critical analysis of the biases inherent in pre-trained models like CLIP is essential for robust research. We have addressed this concern by adding a third limitation to the Discussion section of the manuscript. In this revision, we explicitly discuss the potential for societal and domain-specific biases inherited from CLIP's training data and detail how they may constrain the semantic space, potentially limiting the novelty and diversity of generated shapes, particularly for less common categories. We also discuss potential approaches to mitigating these CLIP-related biases and enhancing diversity in the final paragraph.
Comments 2: Expand figure captions (e.g., Fig. 1) with detailed descriptions to improve clarity and self-containment of visual content.
Response 2: We thank the reviewer for this valuable suggestion. In response, we have revised Figure 1 to enhance its clarity and self-containment. Specifically, we have explicitly defined the key symbols (e.g., the noise z and the latent code ω) in the caption and added a dedicated legend for the core components, including the generator G, the discriminator D, the latent code ω (with explanatory notes), and the CLIP encoder. Additionally, we have depicted the entire workflow in greater detail to improve readability.
Comments 3: Extend validation beyond the current limited categories (Table, Chair, Airplane) to additional datasets to enhance the generalizability and credibility of the claims.
Response 3: We sincerely thank the reviewer for this insightful suggestion to strengthen the generalizability of our claims. While our core evaluation focused on the four most challenging categories (Rifle, Chair, Airplane, Motorbike) selected via strict criteria—(1) data availability (balanced to 1,000 instances/category to eliminate size bias) and (2) geometric diversity (maximizing topological variation)—we fully agree that broader validation enhances credibility.
Following this recommendation, we have expanded our experiments to include two additional categories: Car and Table. The qualitative results have been added to Section 4. Our method successfully generates geometrically coherent vehicles and tables on this car-table cross-category dataset.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The article has been edited accordingly.