Given a factorization of an image into various components, we present a method to independently control these components through diffusion model sampling. For example, decomposing an image into low and high spatial frequencies allows us to produce images whose low frequencies align with one prompt and whose high frequencies align with another prompt. That is, we are able to produce hybrid images. We also explore a decomposition into {\it Lab} color space, allowing us to produce images that appear to be one thing when viewed in greyscale, but change appearance when color is added back. Our method is simple and only modifies the sampling procedure of a pretrained text-conditional image diffusion model. It works by denoising with a composite noise estimate, where each component of the estimate comes from a noise estimate conditioned on a different prompt. We provide qualitative results showing that this method is effective, give intuition for why this approach succeeds, and derive conditions on the image decomposition for the method to work. In addition, we provide quantitative evaluations demonstrating that our method outperforms prior work on hybrid image generation, and we generate hybrid images with three different contents.
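As a minimal sketch of the composite noise estimate for the hybrid-image case, the snippet below combines the low-frequency part of one conditional noise estimate with the high-frequency part of another. The callable \verb|eps_model(x_t, t, prompt)|, the Gaussian-blur low-pass filter, and the kernel size and sigma are illustrative assumptions, not the paper's exact implementation.

\begin{verbatim}
import torch
from torchvision.transforms.functional import gaussian_blur

def composite_noise_estimate(eps_model, x_t, t, prompt_low, prompt_high,
                             kernel_size=33, sigma=3.0):
    """Hybrid-image noise estimate (sketch): low-frequency component taken
    from the estimate conditioned on `prompt_low`, high-frequency component
    from the estimate conditioned on `prompt_high`."""
    eps_low = eps_model(x_t, t, prompt_low)    # estimate for the low-frequency prompt
    eps_high = eps_model(x_t, t, prompt_high)  # estimate for the high-frequency prompt

    # Low-pass via Gaussian blur; high-pass as the residual of the blur.
    low = gaussian_blur(eps_low, [kernel_size, kernel_size], [sigma, sigma])
    high = eps_high - gaussian_blur(eps_high, [kernel_size, kernel_size], [sigma, sigma])

    # The two components sum to a full noise estimate used by the sampler.
    return low + high
\end{verbatim}

At each sampling step the standard denoising update is applied unchanged, but with this composite estimate substituted for the usual single conditional noise prediction.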