Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models



Published on




Generative models have been widely studied in computer vision. Recently, diffusion models have drawn substantial attention due to the high quality of their generated images. A key desired property of image generative models is the ability to disentangle different attributes, which should enable modification towards a style without changing the semantic content, and the modification parameters should generalize to different images. Previous studies have found that generative adversarial networks (GANs) are inherently endowed with such disentanglement capability, so they can perform disentangled image editing without re-training or fine-tuning the network. In this work, we explore whether diffusion models are also inherently equipped with such a capability. Our finding is that for stable diffusion models, by partially changing the input text embedding from a neutral description (e.g., “a photo of person”) to one with style (e.g., “a photo of person with smile”) while fixing all the Gaussian random noises introduced during the denoising process, the generated images can be modified towards the target style without changing the semantic content. Based on this finding, we further propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation. This entire process only involves optimizing over around 50 parameters and does not fine-tune the diffusion model itself. Experiments show that the proposed method can modify a wide range of attributes, with the performance outperforming diffusion-model-based imageediting algorithms that require fine-tuning. The optimized weights generalize well to different images. Our code is publicly available at

This work was presented at CVPR 2023.

Please cite our work using the BibTeX below.

    author    = {Wu, Qiucheng and Liu, Yujian and Zhao, Handong and Kale, Ajinkya and Bui, Trung and Yu, Tong and Lin, Zhe and Zhang, Yang and Chang, Shiyu},
    title     = {Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {1900-1910}
Close Modal