StableIdentity: Inserting Anybody into Anywhere at First Sight

1Dalian University of Technology, 2ZMO AI Inc.
* Corresponding Author
Inserting anybody's identity anywhere for customized image/video/3D generation.

Given a single input image, the proposed StableIdentity can generate diverse customized images in various contexts. Notably, the learned identity can be combined with ControlNet and even injected into video (ModelScopeT2V) and 3D (LucidDreamer) generation. For brevity, we omit the placeholders \(v^*_1~v^*_2\) from prompts such as "\(v^*_1~v^*_2\) wearing glasses".


Recent advances in large pretrained text-to-image models have shown unprecedented capabilities for high-quality human-centric generation; however, customizing face identity remains an intractable problem. Existing methods cannot ensure stable identity preservation and flexible editability, even with several images per subject during training. In this work, we propose StableIdentity, which allows identity-consistent recontextualization with just one face image. More specifically, we employ a face encoder with an identity prior to encode the input face, and then land the face representation into a space with an editability prior, which is constructed from celeb names. By incorporating the identity prior and the editability prior, the learned identity can be injected anywhere with various contexts. In addition, we design a masked two-phase diffusion loss to boost the pixel-level perception of the input face and maintain the diversity of generation. Extensive experiments demonstrate that our method outperforms previous customization methods. Moreover, the learned identity can be flexibly combined with off-the-shelf modules such as ControlNet. Notably, to the best of our knowledge, we are the first to directly inject an identity learned from a single image into video/3D generation without finetuning. We believe the proposed StableIdentity is an important step toward unifying image, video, and 3D customized generation models.
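The masked two-phase diffusion loss described above can be sketched as follows. This is an illustrative NumPy toy, not the authors' code: the tensor shapes, the seed, and the assumption that the masked reconstruction term \(\mathcal{L}_{rec}\) is simply added to the standard noise-prediction term \(\mathcal{L}_{noise}\) in the second phase are all ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def masked_two_phase_loss(noise_pred, noise, x0_pred, x0, face_mask, phase_two):
    """Sketch of a masked two-phase diffusion loss (illustrative only).

    Phase one uses the usual noise-prediction objective L_noise; phase two
    (assumed here to be the later part of training) adds a face-masked
    reconstruction term L_rec, sharpening pixel-level identity in the face
    region while leaving the background free for diverse contexts.
    """
    l_noise = np.mean((noise_pred - noise) ** 2)
    if not phase_two:
        return l_noise
    l_rec = np.mean(face_mask * (x0_pred - x0) ** 2)
    return l_noise + l_rec

# Toy 8x8 "latents" and a binary face mask covering the central region.
noise_pred, noise = rng.standard_normal((2, 8, 8))
x0_pred, x0 = rng.standard_normal((2, 8, 8))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0

loss = masked_two_phase_loss(noise_pred, noise, x0_pred, x0, mask, phase_two=True)
print(float(loss) > 0)  # True
```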


Overview of the proposed StableIdentity. Given a single face image, we first employ a FR-ViT encoder and MLPs to capture the identity representation, and then land it into our constructed celeb embedding space to achieve identity-consistent editability. In addition, we design a masked two-phase diffusion loss, comprising \(\mathcal{L}_{noise}\) and \(\mathcal{L}_{rec}\), for training.
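The encoding-and-landing step in the overview can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation: the random stand-ins for the FR-ViT encoder, the 512/768 dimensions, and the AdaIN-style alignment to celeb-embedding statistics (one plausible reading of "landing" the representation into the celeb space) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 768  # illustrative: CLIP text-embedding width in Stable Diffusion v1.x

def face_encoder(face_image):
    """Stand-in for the FR-ViT face encoder (returns a random feature here)."""
    return rng.standard_normal(512)  # hypothetical 512-d identity feature

# Two small (linear) MLP stand-ins mapping the identity feature to two
# word embeddings, serving as the placeholders v1* and v2*.
W1 = rng.standard_normal((512, EMB_DIM)) * 0.02
W2 = rng.standard_normal((512, EMB_DIM)) * 0.02

# Celeb embedding space: token embeddings of celebrity names act as an
# editability prior.  Random placeholders here instead of real name embeddings.
celeb_embeddings = rng.standard_normal((300, EMB_DIM))
celeb_mean = celeb_embeddings.mean(axis=0)
celeb_std = celeb_embeddings.std(axis=0)

def land_in_celeb_space(e):
    """Align a raw embedding to celeb-space statistics (AdaIN-style; assumed)."""
    return (e - e.mean()) / (e.std() + 1e-8) * celeb_std + celeb_mean

feat = face_encoder(None)
v1 = land_in_celeb_space(feat @ W1)  # placeholder token v1*
v2 = land_in_celeb_space(feat @ W2)  # placeholder token v2*
print(v1.shape, v2.shape)  # (768,) (768,)
```

The landed embeddings can then be used like ordinary word embeddings in prompts (e.g., "\(v^*_1~v^*_2\) wearing glasses"), which is what allows the identity to be recontextualized by the frozen text-to-image model.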



Generated results with the proposed StableIdentity for different identities (including various races) under various contexts (covering decorations, actions, and attributes).



Additional customized results with StableIdentity for diverse artistic styles.


StableIdentity & Image/Video/3D Models

1. Pose-controlled customized image generation (StableIdentity & ControlNet) and zero-shot identity-driven customized video generation (StableIdentity & ModelScopeT2V).


2. Zero-shot identity-driven customized 3D generation (StableIdentity & LucidDreamer). Here, we use "\(v^*_1~v^*_2\)" as the input prompt to show the 3D reconstruction for the learned identities.


3. More image/video/3D customized generation results with celeb photos as input.



@article{wang2024stableidentity,
  title={StableIdentity: Inserting Anybody into Anywhere at First Sight},
  author={Wang, Qinghe and Jia, Xu and Li, Xiaomin and Li, Taiqing and Ma, Liqian and Zhuge, Yunzhi and Lu, Huchuan},
  journal={arXiv preprint arXiv:2401.15975},
  year={2024}
}