An Image is Worth More Than a Thousand Words:
Towards Disentanglement in the Wild

Aviv Gabbay, Niv Cohen, Yedid Hoshen
[paper] [code]

Abstract

Unsupervised disentanglement has been shown to be theoretically impossible without inductive biases on the models and the data. As an alternative, recent methods rely on limited supervision to disentangle the factors of variation and ensure their identifiability. Although the true generative factors need to be annotated for only a limited number of observations, we argue that it is infeasible to enumerate all the factors of variation that describe a real-world image distribution. To this end, we propose a method for disentangling a set of factors that are only partially labeled, while separating the complementary set of residual factors that are never explicitly specified. Our success in this challenging setting, demonstrated on synthetic benchmarks, motivates leveraging off-the-shelf image descriptors to partially annotate a subset of attributes in real image domains (e.g., human faces) with minimal manual effort. Specifically, we use a recent language-image embedding model (CLIP) to annotate a set of attributes of interest in a zero-shot manner and demonstrate state-of-the-art disentangled image manipulation results.

Neural Information Processing Systems (NeurIPS), 2021
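
Zero-Shot Attribute Annotation with CLIP

As described in the abstract, the attributes of interest are annotated in a zero-shot manner using CLIP. The snippet below is a minimal sketch of how such annotation can be performed with the public CLIP package; the image path, the glasses attribute, and the prompt wording are illustrative assumptions rather than the exact configuration used in the paper.

import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical attribute with illustrative prompts; the paper's exact
# prompt wording and attribute set may differ.
prompts = [
    "a photo of a person wearing glasses",
    "a photo of a person without glasses",
]

image = preprocess(Image.open("face.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    # Similarities between the image and each text prompt,
    # scaled by CLIP's learned temperature.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

label = int(probs.argmax())  # 0 -> glasses, 1 -> no glasses

Since the method only requires partial labels, images for which no prompt receives a confident score can simply be left unannotated.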

Human Face Manipulation

[Figure: input face images manipulated along the attributes Kid, Asian, Gender, Glasses, Shades, Beard, Blond Hair and Red Hair.]

Animal Species Translation

[Figure: input animal images translated to the species Boerboel, Labradoodle, Shiba-Inu, Husky, Chihuahua, Cheetah, Jaguar, Bombay Cat and Arctic Fox.]

Car Type and Color Manipulation

[Figure: input car images manipulated to the types Jeep, Sports and Family, and the colors Black, White, Blue, Red and Yellow.]

BibTeX

@inproceedings{gabbay2021zerodim,
  author    = {Aviv Gabbay and Niv Cohen and Yedid Hoshen},
  title     = {An Image is Worth More Than a Thousand Words: Towards Disentanglement in the Wild},
  booktitle = {Neural Information Processing Systems (NeurIPS)},
  year      = {2021}
}