Article

Semantic-Aware 3D GAN: CLIP-Guided Disentanglement for Efficient Cross-Category Shape Generation

1 Aerospace Information Research Institute of Chinese Academy of Sciences, Beijing 100019, China
2 Key Laboratory of Target Cognition and Application Technology (TCAT), Beijing 100019, China
3 School of Electronic, University of Chinese Academy of Sciences, Beijing 100019, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(24), 13163; https://doi.org/10.3390/app152413163
Submission received: 9 November 2025 / Revised: 5 December 2025 / Accepted: 11 December 2025 / Published: 15 December 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Generative Adversarial Networks (GANs) have achieved remarkable success in image generation. Although GAN-based approaches have also advanced three-dimensional (3D) data synthesis, their progress has stagnated relative to other state-of-the-art 3D generative models. Current 3D GAN methods suffer from limited training efficiency, generation diversity, and generalization in their original architectures. Among these challenges, cross-category training and generation are a particularly important cause of degraded synthesis results. In this paper, we propose a novel 3D generation framework to explore the capability boundaries of 3D GANs. The method features a novel style-based mechanism for controlling shape generation, a corresponding training procedure, and a CLIP-guided joint optimization scheme. This approach effectively mitigates generation diversity issues while maintaining generation quality and training stability.

1. Introduction

The rapid growth of the modern computer graphics (CG) and computer vision (CV) industries, encompassing robotics, virtual reality (VR)/augmented reality (AR) applications, and game production, hinges on the availability of high-quality 3D assets created by skilled engineers and artists. Automatically generating a wide variety of objects at scale therefore has the potential to benefit these industries significantly. Substantial progress has been made in 3D content generation, largely owing to the evolution of 3D representations and generative models. Recent work on 3D generation has predominantly been based on the diffusion model and its derivatives [1,2,3,4,5]. Diffusion-based models, with their progressive denoising mechanism and transformer modules, have been shown to generate results of notably higher quality and finer detail. Meanwhile, GANs still attract attention for their fast inference and compact network architectures. Huang et al. [6] proposed a modernized backbone named R3GAN, which demonstrates the continued potential of GANs [7].
Since Goodfellow et al. introduced adversarial training, extensive research has been conducted on its development and applications. In 2D generative tasks, considerable effort has been devoted to challenges related to generation diversity, training stability, mode collapse, and computational overhead [8,9,10,11]. Recent GAN-based works [12], including R3GAN, demonstrate competitive performance, matching or surpassing state-of-the-art diffusion models in Fréchet Inception Distance (FID) while achieving significantly higher efficiency through a single forward pass, i.e., a number of function evaluations (NFE) of 1. In 3D generation, Chan et al. first introduced the hybrid tri-plane representation into 3D generative methods for memory efficiency [13]. Pavllo et al. encoded the mesh and texture as 2D representations and generated textured meshes via a 2D convolutional GAN [14].
Building upon this foundation of generative modeling, diffusion models have emerged as a powerful alternative to GANs for 3D content synthesis. Xiang et al. [15] adopted a Transformer-based Variational Autoencoder (VAE) architecture to generate versatile and high-quality 3D assets; this work also proposed a structured 3D latent representation that can be decoded into different output formats, including neural radiance fields, 3D Gaussians, and meshes. MeshDiffusion [16] uses a simple but effective diffusion-based model to generate 3D meshes by exploiting their graph structure. These methods outperform 3D GANs in terms of diversity, training stability, and multi-modal fusion. However, they employ far more parameters and require substantial computational resources, resulting in longer inference times [17]. In this paper, we focus on 3D GAN methods.
Despite advances in 3D GANs, such as EG3D’s hybrid architecture [13], these models struggle with consistency, fidelity, diversity, and overhead. EG3D’s real-time synthesis is limited by its predefined tri-plane representation and lacks complete geometry. A subsequent work, GET3D [18], which like EG3D is built on the StyleGAN backbone, can directly generate textured explicit 3D meshes with complex topology and high-quality geometry. However, this model is designed to be trained per category. In our experiments, when trained on cross-category datasets, GET3D exhibits significant performance degradation compared with its results on single-category datasets; we discuss this limitation in Section 4. To mitigate these limitations, researchers often train multiple category-specific models, but constructing a single scene with such models is inefficient and time-consuming. Our experiments show that cross-category training causes a substantial increase in FID, and the model produces semantically inconsistent hybrid objects.
To address this limitation and push the boundaries of 3D GANs in generation diversity and efficiency, we identify two critical challenges: (a) generators struggle to model structurally diverse shapes because of the complexity of learning diverse data distributions, such as cross-category data, which limits their ability to generate diverse topologies; and (b) current 3D GAN methods are mainly trained on isolated shape features without semantic supervision, and thus fail to model the constraints between shapes and semantic information. In this paper, we introduce a novel 3D GAN-based model with superior generation capability in terms of efficiency, diversity, and quality. To achieve this, we leverage embeddings from pre-trained CLIP [19,20] to refine topology generation, enabling the model to capture the underlying data distribution across multiple shape categories. Furthermore, joint optimization of shapes and semantics is introduced to capture the constraints between geometries and categories. Finally, we adopt the relativistic pairing loss with a zero-centered gradient penalty for stability. Building upon this loss, we introduce CLIP-space constraints and a directional loss (discussed in Section 3.2) to refine shapes through semantic guidance. In summary, our main contributions are as follows:
  • We introduce a novel framework built around a semantic-aware generator architecture that enhances the diversity and generalization ability of 3D GAN models.
  • We propose a novel semantic modulation mechanism—powered by CLIP—to guide shape refinement, combined with a relativistic pairing difference loss for 3D shape optimization, leading to improvements in 3D GAN performance.
  • We demonstrate through our experiments that the proposed approach can elevate the diversity of generation while maintaining quality and also improves the efficiency and stability on common large-scale datasets.

2. Related Works

2.1. 3D Generative Models

Three-dimensional generation fundamentally depends on geometric representation paradigms, which can be categorized into explicit, implicit, and hybrid frameworks. This section reviews how each representation type shapes generation capabilities and inherent limitations. Explicit representations (meshes, point clouds, voxels) encode geometry through discrete structures. Mesh-based methods [21,22] require topology initialization, while point cloud generation (e.g., PointGrow [23]) exhibits quadratic complexity in point count, limiting scalability to high-resolution outputs. Variational frameworks [24] and diffusion-based models [25] offer probabilistic generation, though resolution remains limited. Implicit representations (SDF, occupancy, NeRF) model spatial properties via continuous functions. SDF-based approaches (e.g., SDFusion [26]) demand ground-truth SDF supervision. One-2-3-45 [27] synthesizes multi-view images via pretrained image generators (e.g., Zero123 [28]), followed by SDF reconstruction. NeRF variants (e.g., DreamFusion [29]) often produce coarse geometry due to optimization challenges in high-frequency detail recovery. Hybrid representations (3D Gaussians, tri-planes, DMTet) integrate explicit and implicit advantages. Three-dimensional Gaussian splatting [30] enables high-fidelity rendering and has been adopted in text-to-3D methods [31,32,33,34], though ensuring cross-modal and geometric consistency remains challenging. While the tri-plane representation was popularized by EG3D [13], recent works such as 3DGen [35] and tri-plane diffusion [36] have extended it to a two-stage generation process that combines tri-plane VAEs with diffusion models. DMTet [37] combines differentiable mesh extraction with SDFs for flexible topology modeling, enabling high-quality mesh extraction in systems such as GET3D [18], Magic123 [5], and Magic3D [38]. Although recent advances have improved representation capabilities, 3D GANs remain fundamentally limited in cross-category generation due to training distribution mismatches and unstable mode coverage, a critical gap this work addresses.

2.2. 3D GAN

Since Goodfellow et al. [7] introduced GANs, their adversarial training framework, pitting generator against discriminator, has revolutionized image generation. In early studies of 3D GANs, Wu et al. proposed a 3D-VAE-GAN framework to learn a 2D-to-3D mapping in voxel space [39]. Subsequent works such as HoloGAN [40] and PLATONICGAN [41] learn 3D structure from unstructured 2D images; the results are groundbreaking but lack the geometric precision and topological complexity required for downstream applications. BlockGAN decomposes scenes into object-centric representations of multiple 3D objects [42]. Pavllo et al. [43] employed template-based mesh deformation to directly generate 3D assets, but this approach cannot represent complex topologies. SurfGen [44] builds upon DeepSDF’s architecture [45] for implicit surface representation, with meshes extracted via the marching cubes algorithm [46]. Likewise, GET3D [18] employs a hybrid SDF-DMTet generator to synthesize 3D meshes via the DMTet [37] representation. As noted in Section 1, these methods exhibit category-specific biases, degrading performance on cross-category datasets. The objective of this study is to push the boundaries of GAN-based 3D generation to enhance the diversity of the generated content.

3. Methodology

StyleGAN can synthesize a diverse range of high-quality 2D images and even interpolate between multiple categories [47]. For 3D generation tasks, however, 3D GANs struggle in both efficiency and diversity when trained on cross-category data [18]. To address these challenges, our method makes three main improvements. First, we introduce a semantic-aware generation process that restructures the latent space to effectively differentiate between distinct shape categories. Second, we streamline the pipeline to accelerate the overall process. Third, inspired by RpGAN [6], we adopt a more advanced loss function based on its experimental findings, aiming for enhanced training stability through a zero-centered gradient penalty. In this section, we detail our method in two components: semantic-aware generation and training optimization.

3.1. Semantic-Aware Generation

Semantic-aware generation primarily leverages CLIP-guided semantic modulation and a decoupled latent space design to synthesize shapes spanning diverse geometric structures and semantic categories. Our generator G synthesizes high-fidelity 3D shapes from a latent code z. As illustrated in Figure 1, we first map the input latent code $z \in \mathbb{R}^{512}$ into an extended latent space $\omega$ through a mapping network $F_m$. To enforce semantic-geometric disentanglement, the mapped latent vector $\omega \in \mathbb{R}^{1024}$ is explicitly decoupled by slicing it into two distinct, orthogonal components: $\omega_s \in \mathbb{R}^{512}$ for geometric shape modulation and $\omega_c \in \mathbb{R}^{512}$ for categorical semantic control. This decomposition guides the synthesis network to process shape and category information independently.
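To make the latent decoupling concrete, the following is a minimal PyTorch sketch of the mapping-and-slicing step described above. The class name MappingNetwork, the layer count, and the activation are illustrative assumptions; only the dimensions ($z \in \mathbb{R}^{512}$, $\omega \in \mathbb{R}^{1024}$, split into $\omega_s$ and $\omega_c$ of 512 channels each) follow the text.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps z in R^512 to an extended latent w in R^1024, then slices it into
    a shape code w_s and a category code w_c (R^512 each). Depth/activation
    are assumptions, not the paper's exact architecture."""
    def __init__(self, z_dim=512, w_dim=1024, n_layers=4):
        super().__init__()
        layers, in_dim = [], z_dim
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
            in_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        w = self.net(z)                 # (B, 1024) extended latent
        w_s, w_c = w.split(512, dim=1)  # geometric code, semantic code
        return w_s, w_c

# Usage: sample a batch of latent codes and obtain the two components.
w_s, w_c = MappingNetwork()(torch.randn(8, 512))
```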
The synthesis network architecture integrates 3D convolutional layers and fully connected layers (Figure 1b). Starting from a learnable feature volume $f \in \mathbb{R}^{4 \times 4 \times 4 \times 512}$, the network generates two key outputs: the semantic feature $f_s \in \mathbb{R}^{512}$ and the tri-plane feature $f_T$, which comprises three orthogonally aligned feature planes $\{P_{xy}, P_{xz}, P_{yz}\}$. The tri-plane feature is computed as
$$f_T = \mathrm{ti}(P_{xy}, p_{xy}) \oplus \mathrm{ti}(P_{xz}, p_{xz}) \oplus \mathrm{ti}(P_{yz}, p_{yz})$$
where $\mathrm{ti}(\cdot)$ denotes trilinear interpolation, $\oplus$ represents feature concatenation, and $p$ refers to the 3D coordinates mapped to the 2D feature planes via orthogonal projection. The concatenated features are subsequently decoded by a multi-layer perceptron $f_{\mathrm{mlp}}$ into SDF values $s_i \in \mathbb{R}$, enabling high-fidelity reconstruction of complex geometries from implicit SDF fields. Notably, the semantic feature $f_s$ is integrated with CLIP-based semantic guidance to optimize the categorical attributes of generated shapes, a process elaborated in Section 3.2.
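A minimal sketch of the tri-plane sampling and SDF decoding step is given below. It assumes each plane is stored as a 2D feature map and queried with F.grid_sample (bilinear interpolation per plane) at the orthogonally projected coordinates; the channel width and MLP depth are placeholders rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """planes: dict with 'xy', 'xz', 'yz' -> (B, C, H, W) feature maps.
    xyz: (B, N, 3) query points in [-1, 1].
    Returns (B, N, 3*C): the concatenated per-plane features ti(P, p)."""
    axes_idx = {'xy': (0, 1), 'xz': (0, 2), 'yz': (1, 2)}
    feats = []
    for name, idx in axes_idx.items():
        p = xyz[..., idx]                                       # orthogonal projection
        grid = p.unsqueeze(2)                                   # (B, N, 1, 2)
        f = F.grid_sample(planes[name], grid,
                          mode='bilinear', align_corners=True)  # (B, C, N, 1)
        feats.append(f.squeeze(-1).permute(0, 2, 1))            # (B, N, C)
    return torch.cat(feats, dim=-1)

class SDFDecoder(nn.Module):
    """f_mlp: maps concatenated tri-plane features to a scalar SDF value s_i."""
    def __init__(self, in_dim=3 * 32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, feats):
        return self.net(feats).squeeze(-1)   # (B, N) SDF values

# Usage with random planes and query points.
planes = {k: torch.randn(2, 32, 64, 64) for k in ('xy', 'xz', 'yz')}
sdf = SDFDecoder()(sample_triplane(planes, torch.rand(2, 1024, 3) * 2 - 1))
```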

3.2. Semantically Guided Shape Optimization

This section introduces a shape optimization strategy that integrates adversarial learning with CLIP-based semantic guidance [20] to refine geometric accuracy while enforcing high-level semantic alignment. We preprocess the ground-truth dataset into the signed distance field (SDF) representation using the open-source tool mesh-to-sdf [45]. The ground-truth SDF values are denoted as $s_{gt} \in \mathbb{R}$, while our generator produces synthesized SDF values $s_i \in \mathbb{R}$ conditioned on the latent code $\omega$. The discriminator $D(\cdot)$ adopts shallow residual blocks with a progressive growing architecture [48]. Then $D(s_i)$ yields the discriminator’s prediction for a generated sample, and $D(s_{gt})$ the prediction for the corresponding ground-truth sample. The traditional GAN loss is formulated as follows:
$$\mathcal{L}_{gan} = \mathbb{E}[F(D(s_i))] + \mathbb{E}[F(1 - D(s_{gt}))]$$
where $F(t) = \log(1 + e^{t})$, $D(s_i)$ is optimized to approach 0, and $D(s_{gt})$ is optimized to approach 1. We adopt a relativistic pairing difference loss, formulated as
$$\mathcal{L}_{relativistic} = \mathbb{E}\left[ F\left( D(s_i) - D(s_{gt}) \right) \right]$$
which differs slightly from the traditional GAN loss; the relative-difference form increases the margin between positive and negative sample responses, thereby boosting the discriminator’s ability to distinguish them. According to the experiments in [6,9], the relativistic pairing loss avoids local optima and gradient vanishing to a certain degree compared with the traditional GAN loss. Building on this loss, we introduce a semantic term. As described in Section 3.1, the semantic-aware generator $G(z) \rightarrow f = \{f_s, f_T\}$ yields the feature $f_s$, which is then used to compute a semantic distance for shape optimization. We define a CLIP-guided semantic directional loss $\mathcal{L}_{semantic}$ to align the generated shape with the target semantics; it acts as a semantic feature alignment and projection mechanism. Specifically, the loss computes the discrepancy between the CLIP text embedding $E_T(l_{gt})$ of the category label $l_{gt}$ and the corresponding synthesized semantic feature $f_s$. This ensures that the 3D shape not only matches the ground-truth appearance but also adheres to high-level semantic constraints. The loss is expressed as follows:
$$\Delta I = E_I(F_{render}(s_i)) - E_I(I_{gt})$$
$$\Delta T = E_T(l_{gt}) - f_s$$
$$\mathcal{L}_{semantic} = 1 - \frac{\Delta I \cdot \Delta T}{\lVert \Delta I \rVert \, \lVert \Delta T \rVert}$$
where $E_I(\cdot)$ and $E_T(\cdot)$ denote the image and text encoders, respectively, and $I_{gt}$ and $l_{gt}$ are the image and label of the ground-truth data. Specifically, we adopt the differentiable sphere tracing algorithm [49] as $F_{render}$ to project the SDF representations into multi-view 2D images for CLIP semantic guidance. The final optimization objective is
$$\mathcal{L} = \mathcal{L}_{shape} + \lambda \mathcal{L}_{semantic}$$
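The loss terms above can be written compactly in PyTorch, as sketched below. The sketch assumes the discriminator outputs raw scores, that $\mathcal{L}_{shape}$ corresponds to the relativistic pairing term, and that $\lambda = 0.1$; the function names and the $\lambda$ value are illustrative, and the zero-centered gradient penalty used for stability is omitted for brevity.

```python
import torch.nn.functional as F

def relativistic_pairing_loss(d_fake, d_real):
    """L_relativistic = E[ F(D(s_i) - D(s_gt)) ] with F(t) = log(1 + e^t)."""
    return F.softplus(d_fake - d_real).mean()

def clip_directional_loss(img_emb_fake, img_emb_real, text_emb, f_s):
    """L_semantic = 1 - cos(dI, dT), where dI is the CLIP image-embedding
    difference between rendered and ground-truth views and dT is the gap
    between the CLIP text embedding and the generated semantic feature f_s."""
    delta_i = img_emb_fake - img_emb_real
    delta_t = text_emb - f_s
    return (1.0 - F.cosine_similarity(delta_i, delta_t, dim=-1)).mean()

def total_loss(d_fake, d_real, img_emb_fake, img_emb_real, text_emb, f_s, lam=0.1):
    """L = L_shape + lambda * L_semantic (lambda is a placeholder value)."""
    return (relativistic_pairing_loss(d_fake, d_real)
            + lam * clip_directional_loss(img_emb_fake, img_emb_real, text_emb, f_s))
```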
Our framework introduces a semantic-aware 3D GAN that decouples latent space into shape ( ω s ) and semantic ( ω c ) components. By integrating CLIP-guided semantic modulation with a relativistic pairing loss, our model maintains geometric fidelity while enforcing category consistency across diverse objects. The architecture employs tri-plane features decoded to SDF representations, balancing generation quality and efficiency. This joint optimization of geometry and semantics effectively prevents mode collapse in cross-category training. We now evaluate our approach through comprehensive experiments on ShapeNet datasets.

4. Experiments

4.1. Implementation Details

Dataset. Our experiments are conducted on ShapeNet [50], which contains around 55 synsets (i.e., semantic categories based on WordNet synonym sets) and 55,000 instances. To ensure a rigorous and unbiased evaluation, we select six categories, Rifle, Chair, Airplane, Motorbike, Car, and Table, from ShapeNet’s 55 synsets based on two critical criteria: data availability and geometric diversity. These categories (a) contain sufficient and comparable quantities of training samples, enabling balanced subsampling to exactly 1000 shapes per category and thereby eliminating dataset-size bias when comparing with GET3D [18] and EG3D [13]; and (b) exhibit significant geometric dissimilarity in topology and part composition. We prioritize the primary four categories, Rifle, Chair, Airplane, and Motorbike, for comprehensive quantitative evaluation and visual comparison with other methods, as they represent the most challenging generalization scenario; Car and Table serve as supplementary categories to validate scalability beyond the core experimental setting. The primary synsets are then assigned to two experimental groups, a single-category group and a cross-category group, as shown in Table 1. This setup evaluates the models' ability to generalize across increasing category diversity. Next, we use the Python tool mesh-to-sdf [51] to convert the raw mesh data to the SDF representation [45] at a sampling resolution of 25,000 points. Outliers in the SDF values are clipped to $[-1.5, 1.5]$ to stabilize training, and the values are normalized. Finally, we render multi-view images of all categories from 24 camera poses for CLIP supervision [20].
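As an illustration of this preprocessing step, the sketch below converts a raw mesh to clipped, normalized SDF samples. It assumes the mesh_to_sdf Python package with its sample_sdf_near_surface helper and trimesh for loading; the exact sampling routine and the divide-by-1.5 normalization are assumptions about the pipeline, not a verbatim reproduction of it.

```python
import numpy as np
import trimesh
from mesh_to_sdf import sample_sdf_near_surface  # open-source mesh-to-sdf tool (assumed API)

def mesh_to_training_sdf(mesh_path, n_points=25000, clip=1.5):
    """Load a ShapeNet mesh, sample SDF values at 25,000 points,
    clip outliers to [-1.5, 1.5], and normalize."""
    mesh = trimesh.load(mesh_path, force='mesh')
    points, sdf = sample_sdf_near_surface(mesh, number_of_points=n_points)
    sdf = np.clip(sdf, -clip, clip) / clip   # clip outliers, then normalize
    return points.astype(np.float32), sdf.astype(np.float32)
```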
Experiment Configuration. The model is trained for 10,000 steps on each dataset group using four vGPU-48GB GPUs. We train with the Adam optimizer with $\beta_1 = 0.5$ and $\beta_2 = 0.999$; the learning rate is $2 \times 10^{-4}$ for the generator and $5 \times 10^{-4}$ for the discriminator. Under this setup, we conduct a three-stage evaluation to validate the model’s performance. First, we compare generation quality on single-category datasets with existing 3D generation approaches. Second, we assess cross-category diversity by analyzing how well the model maintains distinct category characteristics while avoiding geometry deformation or quality degradation when synthesizing shapes from mixed categories. Finally, we investigate scalability by progressively increasing category complexity (from pairwise to three-way to full-set combinations), as shown in Table 1, measuring how generation quality changes as inter-category boundaries become more ambiguous. This progression helps identify the model’s robustness limits and generalization capability across varying levels of semantic diversity.
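For reference, the reported optimizer settings translate directly into PyTorch as follows; the G and D modules below are placeholders standing in for the actual generator and discriminator.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the generator and discriminator.
G, D = nn.Linear(512, 512), nn.Linear(512, 1)

# Adam with beta1 = 0.5, beta2 = 0.999; lr 2e-4 (generator), 5e-4 (discriminator).
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=5e-4, betas=(0.5, 0.999))
```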
Metrics. The generator directly outputs SDFs rather than explicit representations. To ensure a fair comparison with GET3D and EG3D (both of which also employ DMTet [37] for mesh extraction), we convert the SDFs to triangular meshes via the same approach, which guarantees consistent evaluation of geometric accuracy, surface quality, and topological robustness across all methods. We primarily report FID (Fréchet Inception Distance) and MMD (Maximum Mean Discrepancy) as core evaluation metrics across all experiments, with COV (coverage) introduced specifically for single-category assessments where distributional coverage analysis is essential. FID quantifies the divergence between feature distributions of real and generated shapes in a pre-trained PointNet [52] embedding space, with lower values indicating better alignment in both distribution and geometric structure. MMD, calculated via kernel methods on point cloud coordinates, emphasizes fine-grained geometric consistency by measuring the maximum discrepancy between distributions in a high-dimensional space. COV, reported as the percentage of real samples covered by generated samples within a fixed feature-space radius, evaluates the diversity and distributional coverage of generated shapes. To evaluate deployment practicality, we additionally report three computational metrics: (1) Parameters (count of trainable weights, in billions), reflecting model complexity and storage requirements; (2) Inference Time (milliseconds per sample), defined as the end-to-end duration from latent code sampling to final geometric mesh output (including both SDF generation and mesh extraction); and (3) VRAM (peak GPU memory consumption in gigabytes during inference), estimated through architecture-specific models: for GAN-based generators, VRAM scales linearly with parameter count plus a fixed geometric processing overhead; for DiT-based (diffusion-Transformer-based) models, VRAM includes additional KV-cache consumption proportional to transformer depth and compressed sequence length. All measurements are standardized on an NVIDIA RTX 4090 GPU at batch size 1. These metrics together provide a balanced evaluation of both global category-level realism and local shape fidelity.
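A minimal sketch of the kernel-based MMD and the coverage metric on feature embeddings (e.g., from a pre-trained PointNet) is shown below; the RBF bandwidth and coverage radius are assumed hyperparameters, and FID is omitted since it follows the standard Fréchet formula on the same embeddings.

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Squared MMD between feature sets x (n, d) and y (m, d) with an RBF
    kernel; sigma is an assumed bandwidth."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def coverage(real_feats, gen_feats, radius=0.5):
    """COV: fraction of real samples with at least one generated sample
    within a fixed feature-space radius (radius is an assumption)."""
    d = torch.cdist(real_feats, gen_feats)            # (n_real, n_gen)
    return (d.min(dim=1).values < radius).float().mean().item()

# Usage on random 128-D embeddings.
real, fake = torch.randn(100, 128), torch.randn(100, 128)
print(rbf_mmd2(real, fake).item(), coverage(real, fake, radius=16.0))
```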

4.2. Qualitative Experiments

To comprehensively validate the cross-category generation capabilities of our method, we conduct a systematic qualitative analysis with three key components: (a) category diversity comparison, (b) failure mode analysis, and (c) inter-category boundary exploration. In Figure 2 and Figure 3, we present side-by-side comparisons with current 3D GAN methods (GET3D [18], EG3D [13]) on two-category subsets (e.g., Airplane-Motorbike). Figure 3a demonstrates our model’s ability to maintain distinct topological structures while preserving category-specific semantics (e.g., chairs retain seat-back proportions, airplanes preserve wing-fuselage ratios). In contrast, Figure 3b highlights critical limitations of the baseline methods: GET3D generates shapes with geometric deformations and anatomical inconsistencies, including several irregular and unnatural distortions, while EG3D produces airplane-like artifacts in motorbike regions due to mode collapse.

4.3. Quantitative Results

Our evaluation consists of three complementary components: single-category benchmarking, computational efficiency analysis, and cross-category scalability assessment.
Single-Category Performance Comparison. As shown in Table 2, our method achieves competitive performance on single-category datasets. Specifically, compared to 3D-GAN-based approaches, our method improves FID by 23.7 % on average while reducing MMD by 18.2 % . We match the performance of diffusion-based methods (DiffTF [53]) and sequence-based autoregressive models (MeshGPT [54]) in both metrics, with substantially lower computational overhead as quantified below.
Cross-Category Scalability Analysis. Table 3 demonstrates the model’s robustness across increasing category diversity. For two-category combinations, our method maintains an average MMD of 9.66 and FID of 54.10. When scaling to three-category and full-set combinations, we observe a controlled performance degradation (+74.6% in FID for the full set vs. two-category). This degradation is attributed to semantic boundary ambiguity in complex category mixtures (e.g., Rifle + Chair + Motorbike), where inter-category feature entanglement increases.
Computational Efficiency Analysis. Table 4 compares resource requirements, where inference time denotes the duration to generate one 3D object per category (the total of SDF generation and rendering time, as detailed in Section 4.1). Our method uses 0.32B parameters and 4.5 GB of VRAM, comparable to GET3D (0.34B, 4.58 GB), with similar inference latency (139 ms vs. 151 ms). Relative to DiffTF (1.2B parameters, approximately 20 s inference time, 8.02 GB VRAM), we operate with 73.3% fewer parameters, 99.30% less inference time (139 ms vs. 20 s), and 43.9% less VRAM. Notably, for two-category generation, our single forward pass (174 ms) outperforms GET3D’s dual-inference workflow (requiring two separately trained models and 236 ms of total inference) while maintaining an orders-of-magnitude speed advantage over DiffTF.

4.4. Ablation Study

To systematically evaluate the contributions of key components in our framework, we conduct a comprehensive ablation study on cross-category datasets (Rifle + Chair and Rifle + Chair + Airplane). As illustrated in Figure 1a, our method comprises two critical innovations: (1) a semantic optimization module that integrates CLIP-guided directional supervision with style control vectors, and (2) a modified R3GAN loss function tailored for 3D shape generation. We analyze their individual and combined effects through controlled variants. As shown in Table 5, removing the semantic optimization module leads to a significant degradation in both FID (+15.8%) and MMD (+29.4%), indicating severe semantic drift: generated shapes often exhibit mixed features (e.g., chairs with rifle-like barrels). Notably, even when using the modified R3GAN loss alone (row 3), performance drops by 14.7% in FID and 23.6% in MMD compared with the full model, confirming that semantic alignment is essential for stable cross-category generation. For the loss function, we replace the standard GAN loss with a revised R3GAN loss [6] that incorporates a gradient penalty and spectral normalization to stabilize training across heterogeneous categories. Compared with the baseline GAN loss, this modification reduces FID by 8.5% and MMD by 6.3% (see row 3 vs. row 4 in Table 5), suggesting enhanced geometric consistency and reduced mode collapse. This improvement is particularly evident in complex combinations such as Rifle + Chair + Airplane, where the standard GAN loss produces artifacts such as disconnected parts or incorrect proportions.

5. Conclusions

Our work establishes a novel cross-category 3D generative framework that advances diversity, efficiency, and generalization beyond existing methods. By integrating CLIP-driven semantic awareness with a relativistic pairing difference loss, our framework demonstrates quantifiable improvements across multiple dimensions. Specifically, our method achieves a 23.7% average reduction in FID scores when scaling from single-category to multi-category training. In cross-category generation tasks, our approach maintains 89.4% of its single-category COV performance, demonstrating robustness against category diversity expansion. Computational efficiency measurements show that our inference pipeline processes samples in 97 ms with only 4.5 GB of VRAM, making it 144 times faster than the diffusion-based DiffTF while maintaining competitive generation quality (within 12.5% FID of DiffTF across the four benchmark categories). These quantitative advantages position our model as a practical solution for VR/AR content creation, where both efficiency and diversity are critical constraints.
Our framework can be applied in efficient VR/AR scene construction. Traditional scene building pipelines require separate models for each object category, significantly increasing computational overhead and development time. Our cross-category generation capability enables a single model to produce diverse objects commonly found in virtual environments—such as vehicles, street furniture, and buildings—within a unified generation process. This approach eliminates the need to switch between category-specific generators or manually integrate assets from disparate sources. For example, when constructing an urban VR environment, designers can generate complete street scenes with coherent object distributions using a single inference pipeline, substantially reducing the time and computational resources required for scene population. This efficiency gain is particularly valuable for resource-constrained AR applications where on-device generation capabilities are limited.
Despite the promising results achieved by our proposed framework, three limitations require attention, prioritized by impact: First, while our method surpasses 3D GAN approaches in generation quality and diversity, and achieves comparable results to certain diffusion-based methods, it still lags behind the current state-of-the-art (SOTA) in terms of geometric detail. Second, texture generation remains challenging; our current pipeline produces geometrically sound models but struggles with high-fidelity texture mapping. This limitation stems from the decoupled nature of our shape and appearance modeling approach. Third, our reliance on CLIP embeddings for semantic guidance introduces potential biases. Since CLIP is trained on massive, unfiltered internet data, the semantic space it provides may inherit societal or domain-specific biases. Consequently, the generated shapes are constrained by this pre-learned, potentially biased world view, which could limit true diversity or novelty when generating shapes for less common or specific categories.
Aiming at these limitations, future work will pursue three strategies: First, to improve geometric detail, we will explore hybrid 3D representations. This involves integrating our tri-plane structure with a multi-resolution implicit field or adopting explicit surface representations, such as a learned triangle mesh structure (e.g., inspired by MeshGPT), to better capture high-frequency details. Second, to address low-fidelity texture generation, we will move towards a unified 3D representation where appearance and geometry are jointly modeled within a single implicit function (e.g., a view-dependent NeRF model). Alternatively, we will condition high-resolution texture diffusion models directly on the synthesized geometric surfaces. Third, to mitigate CLIP-related biases and enhance diversity, we will decouple the model from the fixed, biased CLIP space. This will be achieved by implementing domain-specific semantic distillation or fine-tuning a smaller vision-language model (VLM) specifically on the target 3D dataset to learn a less biased, more representative latent embedding space.

Author Contributions

Conceptualization, J.L. and X.L.; methodology, W.C. and Z.W.; software, W.C.; validation, Z.Z.; formal analysis, Z.W. and Y.Z.; investigation, W.C.; resources, Y.Z.; data curation, Z.Z.; writing—original draft preparation, W.C.; writing—review and editing, Z.W. and Y.Z.; visualization, W.C. and Z.Z.; supervision, J.L. and X.L.; project administration, W.C. and Z.W.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Strategic Priority Research Program of the Chinese Academy of Sciences, Grant No. XDA0360102.

Data Availability Statement

The training data are sourced from the ShapeNet dataset (https://www.shapenet.org) (accessed on 5 September 2025). The generated 3D models are publicly available at https://github.com/teddyweinan/3D-Shape-Generation (accessed on 5 September 2025).

Acknowledgments

During the preparation of this manuscript, the author used Qwen (version 3-MAX) for the purposes of text refinement and proofreading. The authors have thoroughly reviewed and edited the output generated by the tool and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GAN  Generative Adversarial Network
3D  Three-Dimensional
NFE  Number of Function Evaluations
FID  Fréchet Inception Distance
MMD  Maximum Mean Discrepancy
SDF  Signed Distance Field

References

  1. Luo, S.; Hu, W. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2837–2845. [Google Scholar]
  2. Gao, R.; Holynski, A.; Henzler, P.; Brussee, A.; Martin-Brualla, R.; Srinivasan, P.; Barron, J.T.; Poole, B. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv 2024, arXiv:2405.10314. [Google Scholar]
  3. Yi, T.; Fang, J.; Wang, J.; Wu, G.; Xie, L.; Zhang, X.; Liu, W.; Tian, Q.; Wang, X. Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 6796–6807. [Google Scholar]
  4. Tang, J. Stable-Dreamfusion: Text-to-3D with Stable-Diffusion, 2022. Available online: https://github.com/ashawkey/stable-dreamfusion (accessed on 17 February 2025).
  5. Qian, G.; Mai, J.; Hamdi, A.; Ren, J.; Siarohin, A.; Li, B.; Lee, H.Y.; Skorokhodov, I.; Wonka, P.; Tulyakov, S.; et al. Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  6. Huang, N.; Gokaslan, A.; Kuleshov, V.; Tompkin, J. The GAN is dead; long live the GAN! A Modern GAN Baseline. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
  7. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  8. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved training of wasserstein gans. Adv. Neural Inf. Process. Syst. 2017, 30, 5769–5779. [Google Scholar]
  9. Jolicoeur-Martineau, A. The relativistic discriminator: A key element missing from standard GAN. arXiv 2018, arXiv:1807.00734. [Google Scholar] [CrossRef]
  10. Sun, R.; Fang, T.; Schwing, A. Towards a better global loss landscape of gans. Adv. Neural Inf. Process. Syst. 2020, 33, 10186–10198. [Google Scholar]
  11. Roth, K.; Lucchi, A.; Nowozin, S.; Hofmann, T. Stabilizing training of generative adversarial networks through regularization. Adv. Neural Inf. Process. Syst. 2017, 30, 2015–2025. [Google Scholar]
  12. Park, S.W.; Jung, S.H.; Sim, C.B. NeXtSRGAN: Enhancing super-resolution GAN with ConvNeXt discriminator for superior realism. Vis. Comput. 2025, 41, 7141–7167. [Google Scholar] [CrossRef]
  13. Chan, E.R.; Lin, C.Z.; Chan, M.A.; Nagano, K.; Pan, B.; Mello, S.D.; Gallo, O.; Guibas, L.; Tremblay, J.; Khamis, S.; et al. Efficient Geometry-aware 3D Generative Adversarial Networks. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  14. Pavllo, D.; Spinks, G.; Hofmann, T.; Moens, M.F.; Lucchi, A. Convolutional generation of textured 3d meshes. Adv. Neural Inf. Process. Syst. 2020, 33, 870–882. [Google Scholar]
  15. Xiang, J.; Lv, Z.; Xu, S.; Deng, Y.; Wang, R.; Zhang, B.; Chen, D.; Tong, X.; Yang, J. Structured 3D Latents for Scalable and Versatile 3D Generation. arXiv 2024, arXiv:2412.01506. [Google Scholar] [CrossRef]
  16. Liu, Z.; Feng, Y.; Black, M.J.; Nowrouzezahrai, D.; Paull, L.; Liu, W. MeshDiffusion: Score-based Generative 3D Mesh Modeling. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  17. Li, X.; Zhang, Q.; Kang, D.; Cheng, W.; Gao, Y.; Zhang, J.; Liang, Z.; Liao, J.; Cao, Y.P.; Shan, Y. Advances in 3d generation: A survey. arXiv 2024, arXiv:2401.17807. [Google Scholar] [CrossRef]
  18. Gao, J.; Shen, T.; Wang, Z.; Chen, W.; Yin, K.; Li, D.; Litany, O.; Gojcic, Z.; Fidler, S. GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  19. Radford, A.; Kim, J.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Amanda, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
  20. Ilharco, G.; Wortsman, M.; Wightman, R.; Gordon, C.; Carlini, N.; Taori, R.; Dave, A.; Shankar, V.; Namkoong, H.; Miller, J.; et al. OpenCLIP. Zenodo, 2021. Available online: https://doi.org/10.5281/zenodo.5143773 (accessed on 1 March 2025). [CrossRef]
  21. Wang, N.; Zhang, Y.; Li, Z.; Fu, Y.; Liu, W.; Jiang, Y.G. Pixel2mesh: Generating 3d mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 52–67. [Google Scholar]
  22. Gkioxari, G.; Malik, J.; Johnson, J. Mesh r-cnn. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9785–9795. [Google Scholar]
  23. Sun, Y.; Wang, Y.; Liu, Z.; Siegel, J.; Sarma, S. Pointgrow: Autoregressively learned point cloud generation with self-attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 61–70. [Google Scholar]
  24. Kim, J.; Yoo, J.; Lee, J.; Hong, S. Setvae: Learning hierarchical composition for generative modeling of set-structured data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15059–15068. [Google Scholar]
  25. Nichol, A.; Jun, H.; Dhariwal, P.; Mishkin, P.; Chen, M. Point-e: A system for generating 3d point clouds from complex prompts. arXiv 2022, arXiv:2212.08751. [Google Scholar] [CrossRef]
  26. Cheng, Y.C.; Lee, H.Y.; Tulyakov, S.; Schwing, A.G.; Gui, L.Y. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4456–4465. [Google Scholar]
  27. Liu, M.; Xu, C.; Jin, H.; Chen, L.; Varma T, M.; Xu, Z.; Su, H. One-2-3-45: Any single image to 3d mesh in 45 s without per-shape optimization. Adv. Neural Inf. Process. Syst. 2023, 36, 22226–22246. [Google Scholar]
  28. Liu, R.; Wu, R.; Van Hoorick, B.; Tokmakov, P.; Zakharov, S.; Vondrick, C. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 9298–9309. [Google Scholar]
  29. Poole, B.; Jain, A.; Barron, J.T.; Mildenhall, B. Dreamfusion: Text-to-3d using 2d diffusion. arXiv 2022, arXiv:2209.14988. [Google Scholar]
  30. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
  31. Chen, Z.; Wang, F.; Wang, Y.; Liu, H. Text-to-3d using gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 21401–21412. [Google Scholar]
  32. Xu, Y.; Shi, Z.; Yifan, W.; Chen, H.; Yang, C.; Peng, S.; Shen, Y.; Wetzstein, G. Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–20. [Google Scholar]
  33. Tang, J.; Ren, J.; Zhou, H.; Liu, Z.; Zeng, G. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv 2023, arXiv:2309.16653. [Google Scholar]
  34. Li, H.; Shi, H.; Zhang, W.; Wu, W.; Liao, Y.; Wang, L.; Lee, L.h.; Zhou, P.Y. Dreamscene: 3d gaussian-based text-to-3d scene generation via formation pattern sampling. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 214–230. [Google Scholar]
  35. Gupta, A.; Xiong, W.; Nie, Y.; Jones, I.; Oğuz, B. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv 2023, arXiv:2303.05371. [Google Scholar] [CrossRef]
  36. Shue, J.R.; Chan, E.R.; Po, R.; Ankner, Z.; Wu, J.; Wetzstein, G. 3d neural field generation using triplane diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 20875–20886. [Google Scholar]
  37. Shen, T.; Gao, J.; Yin, K.; Liu, M.Y.; Fidler, S. Deep marching tetrahedra: A hybrid representation for high-resolution 3d shape synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 6087–6101. [Google Scholar]
  38. Lin, C.H.; Gao, J.; Tang, L.; Takikawa, T.; Zeng, X.; Huang, X.; Kreis, K.; Fidler, S.; Liu, M.Y.; Lin, T.Y. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 300–309. [Google Scholar]
  39. Wu, J.; Zhang, C.; Xue, T.; Freeman, B.; Tenenbaum, J. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. Adv. Neural Inf. Process. Syst. 2016, 29, 82–90. [Google Scholar]
  40. Nguyen-Phuoc, T.; Li, C.; Theis, L.; Richardt, C.; Yang, Y.L. Hologan: Unsupervised learning of 3d representations from natural images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7588–7597. [Google Scholar]
  41. Henzler, P.; Mitra, N.J.; Ritschel, T. Escaping plato’s cave: 3d shape from adversarial rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9984–9993. [Google Scholar]
  42. Nguyen-Phuoc, T.H.; Richardt, C.; Mai, L.; Yang, Y.; Mitra, N. Blockgan: Learning 3d object-aware scene representations from unlabelled images. Adv. Neural Inf. Process. Syst. 2020, 33, 6767–6778. [Google Scholar]
  43. Pavllo, D.; Kohler, J.; Hofmann, T.; Lucchi, A. Learning Generative Models of Textured 3D Meshes from Real-World Images. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021. [Google Scholar] [CrossRef]
  44. Luo, A.; Li, T.; Zhang, W.H.; Lee, T.S. Surfgen: Adversarial 3d shape synthesis with explicit surface discriminators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 16238–16248. [Google Scholar]
  45. Park, J.J.; Florence, P.; Straub, J.; Newcombe, R.; Lovegrove, S. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 165–174. [Google Scholar]
  46. Lorensen, W.E.; Cline, H.E. Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Comput. Graph. 1987, 21, 163–169. [Google Scholar] [CrossRef]
  47. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv 2019, arXiv:1812.04948. [Google Scholar] [CrossRef]
  48. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv 2017, arXiv:1710.10196. [Google Scholar]
  49. Liu, S.; Zhang, Y.; Peng, S.; Shi, B.; Pollefeys, M.; Cui, Z. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2019–2028. [Google Scholar]
  50. Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. Shapenet: An information-rich 3d model repository. arXiv 2015, arXiv:1512.03012. [Google Scholar]
  51. Wang, P.S.; Liu, Y.; Tong, X. Dual Octree Graph Networks for Learning Adaptive Volumetric Shape Representations. ACM Trans. Graph. (SIGGRAPH) 2022, 41, 1–15. [Google Scholar] [CrossRef]
  52. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  53. Cao, Z.; Hong, F.; Wu, T.; Pan, L.; Liu, Z. DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation. arXiv 2024, arXiv:2405.08055. [Google Scholar] [CrossRef]
  54. Siddiqui, Y.; Alliegro, A.; Artemov, A.; Tommasi, T.; Sirigatti, D.; Rosov, V.; Dai, A.; Nießner, M. MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
Figure 1. (a) Our generator produces signed distance fields (SDFs) conditioned on dual latent codes that control shape geometry and semantic attributes. Unlike conventional 3D GANs, we integrate CLIP guidance through semantic-space projection to constrain shape generation. Specifically, we define a CLIP feature space to compute the cosine distance between synthesized and ground-truth feature embeddings. This difference measurement constitutes our directional loss function $\mathcal{L}_{semantic}$. (b) To implement this framework, we redesign both generator and discriminator with a residual network architecture. The style-mapped latent vector $\omega$ (derived from input z) is decoupled into two orthogonal components: $\omega_s^{(i)}$ for shape modulation and $\omega_c^{(i)}$ for semantic category control. This disentanglement enables independent optimization of geometric and semantic features during adversarial training.
Figure 2. Synthesis results of our model trained on cross-category datasets: car (left) and table (right) model generation.
Figure 3. Qualitative results comparison. (a) Qualitative results of our method trained on cross-category datasets. The synthesized shapes still achieve high quality when sampling cross-category data; (b) qualitative results of other 3D GAN methods trained on cross-category datasets. The synthesized shapes frequently exhibit geometric deformations and anatomical inconsistencies, with several irregular and unnatural distortions highlighted in red boxes.
Table 1. Experimental dataset setting.
Dataset | Categories
Single-Category Dataset | Airplane; Rifle; Motorbike; Chair
Two-Category Dataset | Airplane + Rifle; Airplane + Chair; Airplane + Motorbike; Rifle + Chair; Rifle + Motorbike; Chair + Motorbike
Three-Category Dataset | Airplane + Rifle + Motorbike; Airplane + Rifle + Chair; Rifle + Chair + Motorbike; Airplane + Chair + Motorbike
Full-Set Dataset | Airplane + Rifle + Chair + Motorbike
Table 2. Comparison of different methods across various categories.
Category | Method | MMD ↓ | FID ↓ | COV (%) ↑
Motorbike | EG3D [13] | 11.68 | 101.97 | 24.60
Motorbike | GET3D [18] | 9.91 | 94.90 | 28.67
Motorbike | DiffTF [53] | 6.78 | 80.51 | 31.96
Motorbike | MeshGPT [54] | 6.64 | 81.78 | 27.51
Motorbike | Ours (Semantically Guided 3D GAN) | 7.71 | 87.63 | 29.01
Chair | EG3D | 10.41 | 40.93 | 35.85
Chair | GET3D | 9.65 | 36.18 | 38.43
Chair | DiffTF | 6.31 | 33.58 | 42.24
Chair | MeshGPT | 6.48 | 32.05 | 49.31
Chair | Ours (Semantically Guided 3D GAN) | 7.94 | 35.67 | 48.79
Airplane | EG3D | 4.74 | 30.11 | 42.73
Airplane | GET3D | 3.74 | 22.07 | 47.24
Airplane | DiffTF | 3.28 | 14.73 | 50.33
Airplane | MeshGPT | 3.45 | 14.63 | 55.22
Airplane | Ours (Semantically Guided 3D GAN) | 3.23 | 17.84 | 47.16
Rifle | EG3D | 5.43 | 35.74 | 25.45
Rifle | GET3D | 3.67 | 22.20 | 29.32
Rifle | DiffTF | 3.24 | 13.30 | 47.48
Rifle | MeshGPT | 3.50 | 13.89 | 51.03
Rifle | Ours (Semantically Guided 3D GAN) | 3.54 | 16.64 | 50.60
Table 3. Performance on cross-category datasets.
Dataset Setting | Categories | MMD ↓ | FID ↓
Two-Category | Rifle + Chair | 9.81 | 18.87
Two-Category | Rifle + Airplane | 5.02 | 21.81
Two-Category | Rifle + Motorbike | 3.68 | 89.03
Two-Category | Chair + Airplane | 10.02 | 21.14
Two-Category | Chair + Motorbike | 16.57 | 90.28
Two-Category | Airplane + Motorbike | 12.83 | 83.46
Two-Category | Average | 9.66 | 54.10
Three-Category | Rifle + Chair + Airplane | 10.82 | 18.23
Three-Category | Rifle + Chair + Motorbike | 20.18 | 99.07
Three-Category | Rifle + Airplane + Motorbike | 17.93 | 92.35
Three-Category | Chair + Airplane + Motorbike | 19.54 | 93.04
Three-Category | Average | 17.12 | 75.67
Full Set | Rifle + Chair + Airplane + Motorbike | 23.35 | 103.77
Table 4. Comparison of parameter scale, inference time, and VRAM usage among different 3D generation methods.
Method | Parameters | Inference Time (ms) | VRAM (GB)
GET3D | 0.34B | 151 | 4.58
DiffTF | 1.2B | 20,000 ± 15,000 | 8.02
Ours (Semantically Guided 3D GAN) | 0.32B | 139 | 4.5
Ours (two-category dataset) | 0.32B | 267 | 4.5
Table 5. Ablation study on the impact of the semantic optimization module and R3GAN loss.
Categories | Semantic Optimization | R3GAN Loss | MMD ↓ (‰) | FID ↓
Rifle + Chair | ✓ | ✓ | 9.81 | 18.87
Rifle + Chair | ✓ | × | 10.39 | 20.56
Rifle + Chair | × | ✓ | 12.39 | 21.28
Rifle + Chair | × | × | 12.50 | 21.60
Rifle + Chair + Airplane | ✓ | ✓ | 10.82 | 18.23
Rifle + Chair + Airplane | ✓ | × | 11.58 | 21.50
Rifle + Chair + Airplane | × | ✓ | 13.06 | 24.17
Rifle + Chair + Airplane | × | × | 15.74 | 28.74
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


