The Contextual Loss
Technion – Israel Institute of Technology
ECCV 2018 (Oral) [paper 1]
arXiv 2018 [paper 2]
Our Contextual loss is effective for many image transformation tasks:
(top) when the ground-truth targets cannot be compared pixel-to-pixel to the generated images:
It can make a Trump cartoon imitate Ray Kurzweil,
give Obama some of Hillary's features,
and turn women more masculine or men more feminine.
(bottom) when statistical properties are incorporated into the CNN training process: single-image super-resolution, and high-resolution surface normal estimation from texture images.
Our approach is easy to train and yields networks that produce images that exhibit natural internal statistics.
Feed-forward CNNs trained for image transformation problems rely on loss functions that measure the similarity between the generated image and a target image.
Most of the common loss functions assume that these images are spatially aligned and compare pixels at corresponding locations.
However, for many tasks, aligned training pairs of images will not be available.
We present an alternative loss function that does not require alignment, thus providing an effective and simple solution for a new space of problems.
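To see why the alignment assumption matters, consider a standard pixel-wise L2 loss: shifting an image by even a couple of pixels leaves its content unchanged, yet the loss becomes large. A minimal illustrative sketch (image sizes and the shift amount are arbitrary choices, not from the papers):

```python
import numpy as np

rng = np.random.default_rng(1)
target = rng.random((64, 64))               # stand-in for a target image
shifted = np.roll(target, shift=2, axis=1)  # identical content, shifted 2px

# Pixel-wise L2 loss: zero for a perfect copy, large for a tiny shift.
l2_aligned = np.mean((target - target) ** 2)
l2_shifted = np.mean((target - shifted) ** 2)
```

Here `l2_aligned` is exactly zero while `l2_shifted` is large, even though the two images are perceptually identical; a loss that matches features by similarity rather than by location avoids this failure mode.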
Our loss is based on both context and semantics -- it compares regions with similar semantic meaning, while considering the context of the entire image.
Hence, for example, when transferring the style of one face to another, it will translate eyes-to-eyes and mouth-to-mouth.
We also show that with the contextual loss it is possible to train a CNN to maintain natural image statistics.
Maintaining natural image statistics is a crucial factor in restoration and generation of realistic looking images.
When training CNNs, photorealism is usually pursued via adversarial training (GANs), which pushes the output images to lie on the manifold of natural images.
GANs are very powerful, but not perfect.
They are hard to train and the results still often suffer from artifacts.
The contextual loss is a complementary approach, whose goal is to train a feed-forward CNN to maintain natural internal statistics.
We look explicitly at the distribution of features in an image and train the network to generate images with natural feature distributions.
Our approach reduces by orders of magnitude the number of images required for training and achieves state-of-the-art results on both single-image super-resolution,
and high-resolution surface normal estimation.
To train a generator network G we compare the output image G(s)=x with the corresponding target image y via a statistical loss
--- the Contextual loss --- that compares their feature distributions.
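The feature-distribution comparison can be sketched roughly as follows. This is a NumPy approximation of the contextual similarity described in the papers, not the authors' implementation; the bandwidth `h`, the epsilon, and the feature shapes are illustrative assumptions:

```python
import numpy as np

def contextual_loss(x_feats, y_feats, h=0.5, eps=1e-5):
    """Sketch of a Contextual (CX) loss between two feature sets.

    x_feats: (N, C) features from the generated image x = G(s)
    y_feats: (M, C) features from the target image y
    (e.g. activations of a pretrained CNN, flattened over space)
    """
    # Center by the target mean, then normalize to unit length so that
    # the dot product below gives cosine similarity.
    mu_y = y_feats.mean(axis=0, keepdims=True)
    x = x_feats - mu_y
    y = y_feats - mu_y
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    y = y / (np.linalg.norm(y, axis=1, keepdims=True) + eps)

    # Pairwise cosine distances d_ij between every x_i and y_j.
    d = 1.0 - x @ y.T                                  # (N, M)

    # Normalize each row by its minimum distance and convert to
    # similarities, so matches are judged relative to the best match.
    d_norm = d / (d.min(axis=1, keepdims=True) + eps)
    w = np.exp((1.0 - d_norm) / h)
    cx = w / w.sum(axis=1, keepdims=True)              # row-stochastic

    # For each target feature, keep its best-matching generated feature;
    # average over targets and take the negative log.
    return -np.log(cx.max(axis=0).mean() + eps)
```

Because every feature is matched to its most similar counterpart rather than to the feature at the same spatial location, the loss stays meaningful even when x and y are not aligned.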
Semantic Style Transfer:
[faces and more]
[Puppet Control Video]
Gender Translation (no GAN):
Domain Transfer (with GAN):
Perceptual Super-Resolution Results:
In many image translation tasks the desired output images are not spatially aligned with any of the available targets:
(a) In semantic style transfer objects in the output image should share the style of corresponding objects in the target, e.g., the palm leaves, ocean water and sand should be styled like those in the target painting.
(b) In single-image animation we animate a single target image according to input animation images.
(c) In puppet control we animate a target "puppet" according to an input "driver", but we do not have available training pairs of driver-puppet images.
(d) In domain transfer, e.g., gender translation, the training images are not even paired; hence, clearly the outputs and targets are not aligned.
“The Contextual Loss for Image Transformation with Non-Aligned Data”
“Maintaining Natural Image Statistics with the Contextual Loss”
Try Our Code
Code to reproduce the experiments described in our paper is available in
Recent Related Work
Template Matching with Deformable Diversity Similarity
Itamar Talmi*, Roey Mechrez*, Lihi Zelnik-Manor. In IEEE CVPR, 2017.