Abstract
Personalized face image generation is essential for Artificial Intelligence-Generated Content (AIGC) applications such as personalized digital avatars and user-customized media creation. However, existing diffusion-based approaches still suffer from insufficient identity consistency and limited text editability. In this work, we propose CoFaDiff, a diffusion-based face generation framework designed to coordinate identity consistency and text-driven editability. To enhance identity consistency, our method integrates a dual-encoder scheme that jointly leverages CLIP and ArcFace to capture both semantic and discriminative facial features, incorporates a progressive curriculum learning strategy to obtain more robust identity representations, and adopts a hybrid IdentityNet–IPAdapter architecture that explicitly models facial location, pose, and the corresponding identity embeddings in a unified manner. To enhance text-driven editability, we introduce three complementary optimization strategies. First, long-prompt fine-tuning reduces the model’s dependence on identity conditions. Second, a semantic alignment loss regularizes the influence of identity embeddings within the semantic space of the pretrained diffusion model. Third, during classifier-free guided sampling, we modulate the strength of the identity condition by stacking different numbers of zero-valued identity tokens, enabling users to flexibly balance identity consistency and text editability according to their needs. Experiments on FFHQ and IMDB-WIKI demonstrate that CoFaDiff achieves superior identity consistency and text editability compared to recent baselines.
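
To make the third strategy concrete, the minimal sketch below illustrates one plausible way of appending zero-valued identity tokens to the identity embedding sequence before it conditions the denoiser via cross-attention; the function name `pad_identity_tokens`, the tensor shapes, and the choice of token counts are our own assumptions, since the abstract does not specify implementation details.

```python
import torch


def pad_identity_tokens(id_tokens: torch.Tensor, num_zero_tokens: int) -> torch.Tensor:
    """Append zero-valued identity tokens to the identity condition sequence.

    id_tokens: (batch, n_id, dim) identity embeddings (e.g., from a CLIP/ArcFace
        dual encoder); shapes here are assumed for illustration.
    num_zero_tokens: number of zero tokens to stack; more zero tokens are
        intended to weaken the identity condition relative to the text prompt.
    """
    b, _, d = id_tokens.shape
    zeros = torch.zeros(b, num_zero_tokens, d,
                        device=id_tokens.device, dtype=id_tokens.dtype)
    return torch.cat([id_tokens, zeros], dim=1)


# Illustrative usage: 4 identity tokens of dim 768, padded with 4 zero tokens
# before being fed to the cross-attention layers during guided sampling.
id_tokens = torch.randn(2, 4, 768)
weakened = pad_identity_tokens(id_tokens, num_zero_tokens=4)  # shape (2, 8, 768)
```

Intuitively, because cross-attention normalizes over all key tokens, padding with uninformative zero tokens may dilute the attention mass assigned to the informative identity tokens, giving the user a simple dial between identity consistency and text editability at sampling time.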