Abstract
Personalized face image generation is essential for Artificial Intelligence-Generated Content (AIGC) applications such as personalized digital avatars and user-customized media creation. However, existing diffusion-based approaches still suffer from insufficient identity consistency and limited text editability. In this work, we propose CoFaDiff, a diffusion-based face generation framework designed to coordinate identity consistency and text-driven editability. To enhance identity consistency, our method integrates a dual-encoder scheme that jointly leverages CLIP and ArcFace to capture both semantic and discriminative facial features, incorporates a progressive curriculum learning strategy to obtain more robust identity representations, and adopts a hybrid IdentityNet–IPAdapter architecture that explicitly models facial location, pose, and the corresponding identity embeddings in a unified manner. To enhance text-driven editability, we introduce three complementary optimization strategies. First, long-prompt fine-tuning reduces the model’s dependence on identity conditions. Second, a semantic alignment loss regularizes the influence of identity embeddings within the semantic space of the pretrained diffusion model. Third, during classifier-free guided sampling, we modulate the strength of the identity condition by stacking different numbers of zero-valued identity tokens, enabling users to flexibly balance identity consistency and text editability according to their needs. Experiments on FFHQ and IMDB-WIKI demonstrate that CoFaDiff achieves superior identity consistency and text editability compared to recent baselines.
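
To make the third strategy concrete, the minimal sketch below illustrates one plausible way of appending zero-valued identity tokens to the identity embedding sequence before it conditions the denoiser via cross-attention; the function name `pad_identity_tokens`, the tensor shapes, and the choice of token counts are our own assumptions, since the abstract does not specify implementation details.

```python
import torch


def pad_identity_tokens(id_tokens: torch.Tensor, num_zero_tokens: int) -> torch.Tensor:
    """Append zero-valued identity tokens to the identity condition sequence.

    id_tokens: (batch, n_id, dim) identity embeddings (e.g., from a CLIP/ArcFace
        dual encoder); shapes here are assumed for illustration.
    num_zero_tokens: number of zero tokens to stack; more zero tokens are
        intended to weaken the identity condition relative to the text prompt.
    """
    b, _, d = id_tokens.shape
    zeros = torch.zeros(b, num_zero_tokens, d,
                        device=id_tokens.device, dtype=id_tokens.dtype)
    return torch.cat([id_tokens, zeros], dim=1)


# Illustrative usage: 4 identity tokens of dim 768, padded with 4 zero tokens
# before being fed to the cross-attention layers during guided sampling.
id_tokens = torch.randn(2, 4, 768)
weakened = pad_identity_tokens(id_tokens, num_zero_tokens=4)  # shape (2, 8, 768)
```

Intuitively, because cross-attention normalizes over all key tokens, padding with uninformative zero tokens may dilute the attention mass assigned to the informative identity tokens, giving the user a simple dial between identity consistency and text editability at sampling time.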