Why Unsupervised Learning?
Transfer learning from networks pre-trained on ImageNet has become the de facto standard for improving performance on an impressively large variety of image tasks. The fine-grained and large-scale nature of ImageNet (1000 classes, ~1200 images per class) seemingly allows a network to learn robust features that generalize across a wide array of domains. But does that mean classification features are the be-all and end-all of representation learning?
One view holding that the answer is no arises from the field of unsupervised representation learning. Unsupervised learning refers to the case where we do not have access to labels. How, then, do we achieve representation learning without any labels? The key is to use the information found in unlabeled data in clever and semantically informative ways. If unsupervised learning is successful, it can potentially harvest information from an effectively unlimited source of unlabeled data.
In this post, we will explore a few of the major avenues of research in unsupervised representation learning for images. The methods are organized into three categories: Context-based methods, Channel-based methods, and recent methods which use simpler self-supervision objectives but achieve the best performance.
Result tables comparing all methods can be found at the bottom of the page, so feel free to skip ahead if you would like to see the result comparisons first.
Context Encoding Methods
A natural analogue for these tasks is Word2Vec. Word2Vec learns meaningful word representations by solving either a [context (surrounding words) -> word] prediction task (called continuous bag-of-words) or a [word -> context] prediction task (called skip-gram). Numerous experimental results demonstrated that in the text domain, word context can provide a powerful source of automatic supervisory signals. Context encoding, then, constitutes a good “self-supervision task” that allows us to learn representations in an unsupervised way.
Unsupervised Visual Representation Learning by Context Prediction (Doersch 2015)
Self-supervision task description: The proposed task is to train the network to predict the relative location of a context patch with respect to a central patch. For example, in the image below, the task is to look at the pair of patches and predict the location index 3.
The authors hypothesized that a good visual representation for this task will need to extract objects and their parts in order to reason about their relative spatial location.
Network Architecture: The architecture employed for the task is a late-fusion siamese AlexNet that processes two patches at once, letting the bulk of the representation encoding for each patch happen individually, then finally fusing the representations with a few fully-connected layers to solve the relative location prediction task.
Avoiding trivial solutions
Low-level cues like boundary patterns and texture continuation between patches can provide trivial solutions to the task. To address these issues, the authors included a gap between patches (half the patch width) and randomly jittered each patch location by up to 7 pixels.
The authors also note a more surprising trivial solution. In nearest-neighbor retrieval experiments with the encoded representation, they found that some patches retrieved matches from the same absolute location in the image, regardless of content, because those patches displayed similar chromatic aberration. Chromatic aberration arises from differences in the way the lens focuses light at different wavelengths: in some cameras, one color channel (commonly green) is shrunk toward the image center relative to the others. A ConvNet, it turns out, can learn to localize a patch relative to the lens itself simply by detecting the separation between green and magenta (red + blue). Once the network learns absolute location on the lens, solving the relative location task becomes trivial. The authors resolved this issue with two pre-processing approaches: 1) shifting green and magenta toward gray, and 2) randomly dropping 2 of the 3 color channels from each patch, replacing the dropped channels with Gaussian noise. Both methods were similarly effective.
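The patch-pair sampling and the channel-dropping countermeasure can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: the patch size, noise parameters, and image layout (a numpy array with values in [0, 1]) are assumptions, with the gap set to half the patch width as described above.

```python
import numpy as np

def sample_patch_pair(image, patch_size=64, gap=32, jitter=7, rng=None):
    """Sample a (central, neighbor) patch pair plus the neighbor's
    location index in {0..7} on the 3x3 grid (center excluded).
    A gap between patches and random jitter discourage trivial
    boundary/texture-continuation solutions."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = image.shape
    step = patch_size + gap  # grid spacing, including the gap
    # top-left corner of the central patch, leaving room for neighbors
    cy = rng.integers(step, h - step - patch_size + 1)
    cx = rng.integers(step, w - step - patch_size + 1)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    idx = int(rng.integers(0, 8))
    dy, dx = offsets[idx]
    # jitter the neighbor's location by up to `jitter` pixels
    ny = int(np.clip(cy + dy * step + rng.integers(-jitter, jitter + 1),
                     0, h - patch_size))
    nx = int(np.clip(cx + dx * step + rng.integers(-jitter, jitter + 1),
                     0, w - patch_size))
    central = image[cy:cy + patch_size, cx:cx + patch_size]
    neighbor = image[ny:ny + patch_size, nx:nx + patch_size]
    return central, neighbor, idx

def drop_color_channels(patch, rng=None):
    """Keep one random color channel and replace the other two with
    Gaussian noise (the second chromatic-aberration countermeasure)."""
    if rng is None:
        rng = np.random.default_rng()
    keep = int(rng.integers(0, 3))
    out = rng.normal(0.5, 0.1, patch.shape)
    out[..., keep] = patch[..., keep]
    return out
```

The jitter is applied after placing the neighbor on the gapped grid, so patch edges never align exactly across examples.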
Context Encoders: Feature Learning by Inpainting (Pathak 2016)
Self-supervision task description: In this work, an encoder-decoder is used to encode a patch with a significant central region omitted, so the model has to decode the central pixels using the encoding of the surrounding context. This in-painting task is illustrated below.
This differs from the previous method in that it is now a prediction task (of pixels) rather than a classification task. The authors suggest that raw pixels provide a richer signal for learning and make the network less susceptible to the problematic trivial solutions outlined above. On the other hand, it should be pointed out that predicting pixels is much harder than predicting words, and comes with its own set of problems widely addressed in image-generation work.
Network Architecture: The architecture consists of an AlexNet-based encoder, an up-sampling + convolution decoder, and a channel-wise fully connected layer that connects the encoder and decoder. The channel-wise fully connected layer is employed at the end of the encoder to aggregate information globally within each channel, since there is no longer a standard fully connected classification layer. It is immediately followed by a 1×1 convolution layer that propagates information across channels.
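The masking step and a reconstruction loss restricted to the dropped region can be sketched as below. This is a simplified illustration assuming a square central mask and a plain L2 loss; the paper's exact region shapes and loss weighting (e.g. its adversarial term) are not reproduced here.

```python
import numpy as np

def mask_center(image, frac=0.5):
    """Zero out a central square covering `frac` of each side.
    Returns the masked input, the boolean mask, and the ground-truth
    center region the model must predict."""
    h, w, _ = image.shape
    mh, mw = int(h * frac), int(w * frac)
    y0, x0 = (h - mh) // 2, (w - mw) // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[y0:y0 + mh, x0:x0 + mw] = True
    masked = image.copy()
    masked[mask] = 0.0  # drop the central pixels
    return masked, mask, image[y0:y0 + mh, x0:x0 + mw]

def reconstruction_loss(pred, target, mask):
    """L2 reconstruction loss computed over the masked region only,
    since the surrounding context is given to the encoder as input."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))
```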
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles (Noroozi 2016)
Self-supervision task description: Taking the context method one step further, the proposed task is a jigsaw puzzle, made by turning input images into shuffled patches. The authors argue that solving Jigsaw puzzles can be used to teach a system that an object is made of parts and what these parts are, making it a good task for self-supervised representation learning.
Network Architecture: The same architecture from Doersch 2015 is used.
The task creates 9 jigsaw patches from an image, so a siamese AlexNet with 9 weight-sharing copies (up until the fully connected layers) processes the 9 patches at once. Each patch is thus processed separately, with only the final fully connected layers aggregating contextual information across patches.
Puzzle Task: The puzzle task is created by permuting the tile configuration, e.g. (3, 1, 2, 9, 5, 4, 8, 7, 6), and assigning an index to each permutation. There are 9! = 362,880 possible permutations. The authors discovered that the choice of this permutation set controls the ambiguity of the task: if the permutations are close to each other, the jigsaw puzzle task is more challenging and ambiguous. For example, if two permutations differ only in the position of two tiles and the image contains two similar tiles, predicting the right solution becomes impossible. They found that the optimal setting was a subset of 1000 permutations, selected to maximize their pairwise Hamming distance.
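A greedy version of this selection can be sketched as follows: repeatedly add the candidate whose minimum Hamming distance to the already-selected permutations is largest. This is an illustration of the idea, sampling a candidate pool rather than enumerating all 9! permutations; the paper's exact procedure may differ in its details.

```python
import random
import numpy as np

def select_permutations(n_select, n_tiles=9, pool_size=1000, seed=0):
    """Greedy max-min-Hamming-distance selection of a permutation
    subset, so that the chosen puzzle configurations are maximally
    distinguishable from one another."""
    rng = random.Random(seed)
    # sample a candidate pool of random tile permutations (deduplicated)
    pool = {tuple(rng.sample(range(n_tiles), n_tiles))
            for _ in range(pool_size)}
    pool = np.array(sorted(pool))
    selected = pool[:1]
    pool = pool[1:]
    while len(selected) < n_select:
        # Hamming distance of every candidate to every selected permutation
        dists = (pool[:, None, :] != selected[None, :, :]).sum(-1)
        # pick the candidate farthest from its nearest selected neighbor
        best = dists.min(axis=1).argmax()
        selected = np.vstack([selected, pool[best]])
        pool = np.delete(pool, best, axis=0)
    return selected
```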
Cross Channel Encoding Methods
These methods exploit the fact that images come with multiple channels of semantically correlated information, e.g. the Lab colorspace or RGB-D. By predicting the a and b color channels from the lightness channel L, or the depth channel D from RGB, we can create a pretext task for unlabeled images.
Colorful Image Colorization (Zhang 2016)
Self-supervision task description: The proposed task is simply to colorize grayscale images by predicting the a and b color channels from the L channel.
Network Architecture: The network architecture is a standard CNN.
Objective Function: A natural objective for the colorization task is the Euclidean loss between ground-truth and predicted colors. However, this loss is not robust to the inherent ambiguity and multimodal nature of the colorization problem: if an object can take on a set of distinct ab values, the optimal solution under the Euclidean loss is the mean of that set. In color prediction, this averaging effect favors grayish, desaturated results.
Instead, the authors turn the problem into a multinomial classification problem by quantizing the color space: they quantized the ab output space into bins with grid size 10 and kept the 313 bins that are in-gamut.
As shown in the network diagram, the network predicts a probability distribution over 313 quantized ab values, allowing for training with multinomial cross entropy.
Finally, to produce images, the softmax probabilities are converted to point estimates by re-adjusting the temperature of the softmax distribution and taking the mean of the result, which they found to interpolate well between simply taking the mode and simply taking the mean of the distribution.
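This annealed-mean step can be sketched as below: sharpening the distribution with a temperature T and then taking its expectation over the bin centers. T = 1 recovers the plain mean and T -> 0 approaches the mode; the default of 0.38 reflects the paper's reported choice.

```python
import numpy as np

def annealed_mean(probs, bin_values, T=0.38):
    """Convert a predicted distribution over quantized ab bins into a
    point estimate. `probs` has shape (n_bins,) and `bin_values` has
    shape (n_bins, 2), holding the ab value of each bin center."""
    p = probs ** (1.0 / T)   # equivalent to exp(log p / T)
    p = p / p.sum()          # renormalize the sharpened distribution
    return p @ bin_values    # expectation over bin centers, shape (2,)
```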
Split-brain Autoencoders (Zhang 2017)
Self-supervision task description: This work extends the colorization task to include a color->grayscale task as well. More generally, the idea is to split images by channel (applied to Lab or RGB-D), and cross-predict the channels, hence the name split-brain autoencoder.
Network Architecture: The network is an AlexNet which is split in half along the channel dimension. The final model uses the layer-wise concatenation of features F1 and F2 from each split network as the final representation.
Loss function and representation aggregation: The authors perform experiments on various combinations of losses for training the split representations F1 and F2. They experiment with classification/regression losses for the channel prediction tasks, as well as using a single model instead of a split model to perform both prediction tasks. They found that using classification losses for both colorization and grayscale prediction worked best, consistent with the results from the colorization work. They also found that a split architecture worked best, rather than a single architecture, which they attributed to the split architectures learning complementary representations.
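The cross-prediction structure can be sketched as below. The `net_L` and `net_ab` callables are hypothetical stand-ins for the two AlexNet halves (each returning a feature vector and a prediction); the actual losses and architectures follow the experiments described above.

```python
import numpy as np

def split_brain_forward(image_lab, net_L, net_ab):
    """Split-brain sketch: one subnetwork sees the L channel and
    predicts ab, the other sees ab and predicts L. The final
    representation is the concatenation of the two subnetworks'
    features F1 and F2."""
    L, ab = image_lab[..., :1], image_lab[..., 1:]
    f1, ab_pred = net_L(L)   # F1 is trained with a loss on ab_pred vs ab
    f2, L_pred = net_ab(ab)  # F2 is trained with a loss on L_pred vs L
    representation = np.concatenate([f1, f2], axis=-1)
    return representation, ab_pred, L_pred
```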
Recent Methods
Recent methods have focused on finding simple but useful facts about the structure of the data. They achieve better results than the context-based and channel-based methods introduced above, indicating that the design of simple but semantically powerful self-supervision objectives is key to unsupervised learning.
Representation Learning by Learning to Count (Noroozi 2017)
Self-supervision task description: The paper proposes to learn image representations by counting visual primitives. Crucially, they exploit the equivariance of visual primitives to transformations, that is, if an image of a dog is transformed (down-sampled, split into patches), the features from the transformed image should still reflect, in totality, the fact that the image contains a dog’s two eyes, a nose, two ears, and so on.
For instance, in the image above, the number of visual primitives in the whole image should match the sum of the number of visual primitives in each tile (dashed red boxes).
Objective Function: The counting objective is defined as follows: the feature vector (or counting vector) of an image down-sampled by a factor of 2 should be the same as the sum of the feature vectors of the 4 sub-patches of the same image. This is reflected as an l2 loss term between the down-sampled image feature d and the sum of the transformed (subdivided into patches) image features t. To avoid the trivial solution of the network making all feature vectors equal to 0, a contrastive loss term is added, enforcing that the counting features of two randomly chosen different images should differ. The final loss term is shown at the bottom of the diagram below.
Network Architecture: The network architecture is a 6-way siamese AlexNet with shared weights.
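The two loss terms can be sketched as below: an l2 term tying the downsampled-image feature to the sum of the four tile features, plus a hinge-style contrastive term pushing the feature of a different image at least a margin M away. The margin value here is illustrative, not the paper's exact hyperparameter.

```python
import numpy as np

def counting_loss(down_feat, tile_feats, other_feat, M=10.0):
    """Counting objective sketch. `down_feat` is the feature of the
    2x-downsampled image, `tile_feats` has shape (4, dim) with one
    feature per tile, and `other_feat` comes from a different image."""
    summed = tile_feats.sum(axis=0)                       # sum over 4 tiles
    l2 = ((down_feat - summed) ** 2).sum()                # matching term
    # contrastive term: keep a different image's feature at least M away
    contrastive = max(0.0, M - ((other_feat - summed) ** 2).sum())
    return l2 + contrastive
```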
Unsupervised Representation Learning by Predicting Image Rotations (Gidaris 2018)
Self-supervision task description: This paper proposes an incredibly simple task: the network must perform a 4-way classification to predict which of four rotations (0°, 90°, 180°, 270°) was applied to the input. Surprisingly, this simple task provides a strong self-supervisory signal that puts this method ahead of all previous methods.
Network Architecture: A standard network architecture for classification is employed in a 4-way classification setting.
But how does it work?: The task appears somewhat unintuitive at first. But consider that, without successfully localizing the object of interest and determining its orientation, it is essentially impossible for a ConvNet to perform the rotation recognition task effectively. The task therefore induces a strong supervisory signal toward localizing the object of interest, classifying it (to an extent), and figuring out its rotation. Although it relies on the fact that ImageNet images have a ‘default’ orientation, and therefore may not be applicable to some other datasets, this is an excellent example of a self-supervision task that is simple but rich in the information it provides.
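Generating the training data for this task is a one-liner per rotation; a minimal sketch, assuming images as numpy arrays of shape (H, W, C):

```python
import numpy as np

def rotation_batch(image):
    """Build the four rotated copies of an image and their class labels
    (0: 0 deg, 1: 90 deg, 2: 180 deg, 3: 270 deg) for the 4-way
    rotation prediction task."""
    rots = np.stack([np.rot90(image, k, axes=(0, 1)) for k in range(4)])
    labels = np.arange(4)
    return rots, labels
```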
Deep Clustering for Unsupervised Learning of Visual Features (Caron 2018)
Self-supervision task description: The paper proposes a similarly simple task: use ConvNet output features as inputs to a clustering algorithm, and use the resulting cluster assignments as pseudo-labels to train the network. This process is repeated each epoch. Although the process is straightforward, it achieves excellent results, even slightly surpassing the rotation-based approach above and setting the state of the art as of this writing.
Network Architecture: The network is a standard classification AlexNet. K-means is run on PCA-reduced network features to obtain cluster assignments. The best value of K was found to be 10,000, despite ImageNet having 1000 classes, suggesting that some amount of over-segmentation is beneficial.
Why does it work? The authors cite an interesting observation for why bootstrapping a classification network in this way might work: putting a multilayer perceptron classifier on top of the last convolutional layer of a randomly initialized AlexNet achieves 12% accuracy on ImageNet, while chance is 0.1%. This implies that the structure of a ConvNet itself provides a strong semantic prior on the input signal. Essentially, you can gain some information, albeit a weak signal, simply by forward-propagating an image through a ConvNet, which makes the pseudo-labels obtained by clustering ConvNet outputs feasible training targets.
Sobel filtering: The paper reports that without using Sobel filtering to remove color information, the algorithm performs significantly worse. This is likely because the pseudo-label bootstrapping process can be misled by cluster assignments biased towards color. As a solution, the authors apply a fixed linear transformation based on Sobel filters to remove color and increase local contrast.
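The per-epoch pseudo-label step can be sketched as below. This uses a minimal numpy k-means with a deterministic farthest-point initialization for illustration; the paper clusters PCA-reduced features with a large-scale k-means implementation, which this does not reproduce.

```python
import numpy as np

def kmeans_pseudo_labels(features, k, n_iter=10):
    """One epoch's pseudo-label step: cluster the network's features
    with k-means and return the cluster assignments, which then serve
    as classification targets for training the network."""
    # deterministic farthest-point initialization of the k centers
    centers = [features[0]]
    for _ in range(k - 1):
        d = np.min([((features - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    centers = np.stack(centers)
    for _ in range(n_iter):
        # assign each feature to its nearest center
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # update centers, keeping the old center if a cluster is empty
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(0)
    return labels
```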
Transfer Learning Results
The ImageNet results are obtained by transferring the learned weights from unsupervised pre-training, freezing them, and training the layers above the indicated layer on ImageNet labels. For example, for Conv5, the layers above Conv5 (the fully connected layers) are trained. The ImageNet labels entry transfers weights learned via standard supervised ImageNet classification training and, in the non-linear case, recovers its full performance when fine-tuned.
ImageNet Top-1 Classification (linear fully connected layers)
ImageNet Top-1 Classification (non-linear fully connected layers)
PASCAL VOC 2007 Results
Approaching supervised representations
We can see that RotNet and DeepCluster substantially reduce the gap between unsupervised and supervised representation learning. On PASCAL VOC, the two methods come very close to transfer learning from supervised ImageNet. These results point towards unsupervised learning being a promising direction for representation learning.