Recent advances in large pretrained text-to-image models have shown unprecedented capabilities for high-quality human-centric generation; however, customizing face identity remains an intractable problem. Existing methods cannot ensure stable identity preservation and flexible editability, even when trained with several images per subject. In this work, we propose StableIdentity, which allows identity-consistent recontextualization with just one face image. More specifically, we employ a face encoder with an identity prior to encode the input face, and then land the face representation into a space with an editable prior, which is constructed from celeb names. By incorporating the identity prior and the editability prior, the learned identity can be injected anywhere with various contexts. In addition, we design a masked two-phase diffusion loss to boost pixel-level perception of the input face and maintain the diversity of generation. Extensive experiments demonstrate that our method outperforms previous customization methods. Moreover, the learned identity can be flexibly combined with off-the-shelf modules such as ControlNet. Notably, to the best of our knowledge, we are the first to directly inject the identity learned from a single image into video/3D generation without finetuning. We believe the proposed StableIdentity is an important step toward unifying image, video, and 3D customized generation models.
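A minimal sketch of the identity-landing idea described above, assuming a PyTorch-style implementation: face-recognition features are mapped by MLPs to two pseudo-word embeddings, which are then rescaled with statistics of celeb-name embeddings (an AdaIN-style "landing" into the editable prior space). The dimensions, layer sizes, and the exact rescaling rule are our assumptions for illustration, not the official code.

```python
import torch
import torch.nn as nn

class IdentityMapper(nn.Module):
    """Illustrative module: map face features to two word embeddings and
    land them in the celeb embedding space (details assumed)."""
    def __init__(self, face_dim=512, word_dim=768, celeb_mean=None, celeb_std=None):
        super().__init__()
        # One MLP producing two pseudo-words v*_1 and v*_2 (assumed layout).
        self.mlp = nn.Sequential(
            nn.Linear(face_dim, 1024), nn.GELU(),
            nn.Linear(1024, 2 * word_dim),
        )
        # Statistics of celeb-name embeddings define the editable prior space.
        self.register_buffer("celeb_mean", celeb_mean if celeb_mean is not None else torch.zeros(word_dim))
        self.register_buffer("celeb_std", celeb_std if celeb_std is not None else torch.ones(word_dim))

    def forward(self, face_feat):                      # face_feat: (B, face_dim)
        w = self.mlp(face_feat).view(-1, 2, self.celeb_mean.shape[-1])
        # AdaIN-style landing (our assumption): normalize each pseudo-word,
        # then rescale with the celeb embedding statistics.
        w = (w - w.mean(dim=-1, keepdim=True)) / (w.std(dim=-1, keepdim=True) + 1e-6)
        return w * self.celeb_std + self.celeb_mean    # (B, 2, word_dim)
```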
Overview of the proposed StableIdentity. Given a single face image, we first employ an FR-ViT encoder and MLPs to capture the identity representation, and then land it into our constructed celeb embedding space to achieve identity-consistent editability. In addition, we design a masked two-phase diffusion loss, consisting of \(\mathcal{L}_{noise}\) and \(\mathcal{L}_{rec}\), for training.
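A rough sketch of how the masked two-phase objective could be computed, assuming an epsilon-prediction latent diffusion model in the diffusers style. The phase schedule, the source of the face mask, and the loss weighting are placeholders, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_two_phase_loss(unet, x0, face_mask, cond, scheduler, phase_two: bool):
    """Illustrative objective: L_noise in the first phase, and a face-masked
    L_rec on the reconstructed latent in the second phase (schedule assumed)."""
    noise = torch.randn_like(x0)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (x0.shape[0],), device=x0.device)
    xt = scheduler.add_noise(x0, noise, t)
    eps_pred = unet(xt, t, encoder_hidden_states=cond).sample

    if not phase_two:
        return F.mse_loss(eps_pred, noise)                      # L_noise
    # Recover x0 from the epsilon prediction and penalize only the face region.
    alpha_bar = scheduler.alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)
    x0_pred = (xt - (1 - alpha_bar).sqrt() * eps_pred) / alpha_bar.sqrt()
    return F.mse_loss(x0_pred * face_mask, x0 * face_mask)      # L_rec
```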
Generated results with the proposed StableIdentity for different identities (including various races) under various contexts (covering decorations, actions, and attributes).
Additional customized results with StableIdentity for diverse artistic styles.
1. Pose-controlled customized image generation (StableIdentity & ControlNet) and zero-shot identity-driven customized video generation (StableIdentity & ModelScopeT2V); a usage sketch for plugging the learned identity into ControlNet follows this list.
2. Zero-shot identity-driven customized 3D generation (StableIdentity & LucidDreamer). Here, we use "\(v^*_1~v^*_2\)" as the input prompt to show the 3D reconstruction for the learned identities.
3. More image/video/3D customized generation results with celeb photos as input.
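As referenced in item 1, a hedged usage sketch of combining a learned identity with ControlNet via diffusers, assuming the identity is stored as two word embeddings that can be registered as new tokens in the pipeline's text encoder. The checkpoint name "learned_identity.pt", the placeholder tokens "<v1>"/"<v2>", and the pose image path are hypothetical.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Hypothetical file holding the two learned word embeddings, shape (2, 768).
identity = torch.load("learned_identity.pt")

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
)

# Register the pseudo-words and copy in the learned embeddings.
pipe.tokenizer.add_tokens(["<v1>", "<v2>"])
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
ids = pipe.tokenizer.convert_tokens_to_ids(["<v1>", "<v2>"])
with torch.no_grad():
    pipe.text_encoder.get_input_embeddings().weight[ids] = identity

# Placeholder path for an OpenPose conditioning image prepared separately.
pose_image = load_image("pose.png")
image = pipe("<v1> <v2> wearing a suit, dancing", image=pose_image).images[0]
```

Because the identity lives entirely in the word embedding space, the same two embeddings can in principle be registered into the text encoder of a video or 3D pipeline without finetuning, which is the zero-shot reuse highlighted in items 1 and 2.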
@article{wang2024stableidentity,
  title={StableIdentity: Inserting Anybody into Anywhere at First Sight},
  author={Wang, Qinghe and Jia, Xu and Li, Xiaomin and Li, Taiqing and Ma, Liqian and Zhuge, Yunzhi and Lu, Huchuan},
  journal={arXiv preprint arXiv:2401.15975},
  year={2024}
}