• Introduction

Many, if not all, of the near-human or super-human results achieved by deep learning algorithms are powered by equally impressive annotation efforts. While there is nothing inherently wrong with the supervised learning approach, the question remains whether we can either A) move towards requiring far less supervision (semi-supervised learning) or B) go completely label-free (unsupervised learning).

In this post, we will discuss how research has progressed in the latter approach of unsupervised learning. In a previous post, we introduced several unsupervised learning methods which attempt to induce meaningful features in the network from unlabeled data. They are characterized by auxiliary objectives such as colorization, auto-encoding, solving jigsaw puzzles, or in-painting. The learned features are then used to solve a separate, “downstream task”.

Today we will introduce ‘Revisiting Self-Supervised Visual Representation Learning’ (Kolesnikov et al., 2019), a paper which thoroughly examines a set of four pretext tasks on ResNet variants. These tasks are: exemplar networks, jigsaw puzzle solving, relative patch location prediction, and rotation prediction.

Since they are not the focus of this post, we link the papers for the details of the architectures and pretext tasks explored:

Architectures

  1. RevNet
  2. ResNet v1
  3. ResNet v2

Pretext Tasks

  1. Unsupervised Representation Learning By Predicting Image Rotations
  2. Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks
  3. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
  4. Unsupervised Visual Representation Learning by Context Prediction

  • Evaluating Self-supervised Features on ImageNet

In this post, we will focus on the efficacy of unsupervised learning on the ImageNet classification task. The following two tables show the main results. Table 1 is from the DeepCluster paper, and all of the reported accuracies were produced with AlexNet variants on the ImageNet validation split. Table 2 is from the ‘Revisiting’ paper. The values under the ‘Ours’ column are reported for ResNet variants, and they are compared with the previous results from other works under the ‘Prev’ column. Let’s dive right in:

Table 1

[Image: Table 1, from the DeepCluster paper]

Table 2

[Image: Table 2, from the ‘Revisiting’ paper]

All of the accuracy values are calculated on the validation split of ImageNet. The authors employ the standard procedure for evaluating learned features on ImageNet, which is widely adopted in many self-supervised learning works. The procedure is as follows: First, features are learned via self-supervision on the train split of ImageNet, without labels. Then, a linear classifier is trained on the train split, using fixed features from these networks. Finally, the linear classifier is evaluated on the ImageNet validation split.

This means that the row titled “ImageNet labels” in Table 1 indicates taking an AlexNet which has been trained on ImageNet classification, removing the fully connected layers, fixing the convolution layers, then training a linear classifier on ImageNet classification again.
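
The three steps of the evaluation procedure can be sketched in a few lines of NumPy. This is only a minimal illustration under our own assumptions: `frozen_features` is a hypothetical stand-in for the fixed layers of a pretrained network (here a random projection followed by a ReLU), the data is synthetic, and the classifier is trained with plain gradient descent rather than whatever optimizer the papers use.

```python
import numpy as np

def frozen_features(x, w_frozen):
    """Stand-in for the fixed layers of a pretrained network (never updated here)."""
    return np.maximum(x @ w_frozen, 0.0)

def train_linear_probe(feats, labels, num_classes, lr=0.1, epochs=200):
    """Fit a softmax linear classifier on fixed features with plain gradient descent."""
    n, d = feats.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = feats @ w + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of the mean cross-entropy w.r.t. logits
        w -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b

def top1_accuracy(feats, labels, w, b):
    """Fraction of examples whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(feats @ w + b, axis=1) == labels))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Step 1 (not shown): self-supervised training would produce these weights.
    w_frozen = rng.normal(size=(8, 32)) / np.sqrt(8)
    centers = rng.normal(scale=2.0, size=(3, 8))  # synthetic 3-class toy data

    def toy_split(n):
        y = rng.integers(0, 3, size=n)
        return centers[y] + rng.normal(scale=0.5, size=(n, 8)), y

    x_train, y_train = toy_split(300)
    x_val, y_val = toy_split(100)
    # Step 2: train only the linear classifier on the train split.
    w, b = train_linear_probe(frozen_features(x_train, w_frozen), y_train, num_classes=3)
    # Step 3: evaluate the classifier on the held-out split.
    print(top1_accuracy(frozen_features(x_val, w_frozen), y_val, w, b))
```

The point to notice is that `w_frozen` is never touched during step 2; only the classifier weights `w` and `b` are learned from labels.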

A Note: You may have noticed that there’s something artificial about this procedure, because although it assumes we have no access to ImageNet labels during the unsupervised training phase, the downstream evaluation procedure requires labels for exactly those same images to train the linear classifier. This raises the question: if you have the labels, why not just use them? We will address this concern in a later section.

  • So how good are self-supervised features?

First, let’s go over the AlexNet-based results in Table 1. For reference, the top-1 accuracy of an AlexNet trained end-to-end on ImageNet classification is around 58%, which is about 7.5 percentage points (roughly 15% in relative terms) above the “ImageNet labels” entry at 50.5%. The best unsupervised features are at around the 40% accuracy mark, which puts them, relatively speaking, at ~70% of the end-to-end supervised result and ~80% of the “ImageNet labels” entry.

Now onto Table 2: At 55.4%, the Rotation pretext task with RevNet50 reaches 75% of the performance of the supervised end-to-end RevNet50. In relative terms, this number is in line with the Table 1 results on AlexNet (70~80% relative to end-to-end supervised). Nevertheless, the absolute numbers are impressive in themselves. For example, the best RotNet architecture above almost matches the performance of a supervised AlexNet (~58%).
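
The relative figures quoted above are simple ratios, and it can help to see them spelled out. The snippet below only re-derives them from the approximate accuracies stated in the text (58%, 50.5%, ~40%, 55.4%); the helper name `relative_performance` is our own.

```python
def relative_performance(score, reference):
    """Return `score` expressed as a percentage of `reference`."""
    return 100.0 * score / reference

alexnet_end_to_end = 58.0      # supervised AlexNet, "around 58%" in the text
imagenet_labels_probe = 50.5   # "ImageNet labels" row of Table 1
best_unsupervised = 40.0       # "around the 40% accuracy mark"
rotation_revnet50 = 55.4       # Rotation task with RevNet50, Table 2

# Best AlexNet-era unsupervised features vs. the two supervised references.
print(round(relative_performance(best_unsupervised, alexnet_end_to_end)))     # 69
print(round(relative_performance(best_unsupervised, imagenet_labels_probe)))  # 79
# Rotation + RevNet50 vs. a supervised AlexNet.
print(round(relative_performance(rotation_revnet50, alexnet_end_to_end)))     # 96
```

Since the inputs are themselves rounded in the text, the outputs should be read as ballpark values (roughly 70%, 80%, and "almost matches"), not precise measurements.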

So what does that mean? For one, it means that in the past 3 years no one has figured out how to push unsupervised learning results beyond roughly 80% of end-to-end supervised performance. This may or may not mean that 80% is an upper bound on the performance of unsupervised learning. In either case, improving purely unsupervised methods beyond the 80% mark remains an open research direction. Achieving this would be a significant step forward, but will probably require a major breakthrough. Secondly, and perhaps more practically, some of these pretext tasks may constitute useful auxiliary tasks in settings outside of unsupervised learning, since they seem to capture important information about the data without any annotation cost.

  • Self-supervision is highly sensitive to Architectural Details

One of the key questions in unsupervised learning research is the relationship between pretext task design and network architecture design. The paper delineates this relationship by repeating the experiments on all four pretext tasks with both ResNet v1 and v2 variants, which differ only in the ordering of batch-norm and ReLU operations in residual blocks. An additional variant, marked by “(-)”, omits the ReLU preceding the global average pooling.

Table 3

[Image: Table 3, from the ‘Revisiting’ paper]

The results in Table 3 show that architectural details are indeed quite important: That is, certain architectures seem to work particularly well with certain pretext tasks. Let’s first compare ResNet v2 and v1, which differ only by the ordering of their components as shown below (v1 corresponds to (a) and v2 corresponds to (e)). In the Rotation task, v2 shows a clear advantage in performance. In the Exemplar task, the difference between v2 and v1 is negligible. In the RelPatchLoc task, v1 is clearly better than v2. Likewise, on the Jigsaw task, v1 is significantly better than v2.

[Image: residual block variants, labeled (a) to (e)]

Additionally, the results show that the “(-)” variants have significantly worse performance across the board. However, we already know that RevNet, ResNet v1, and ResNet v2 all have similar performance on supervised ImageNet. In contrast, we do not know what effect removing the ReLU preceding the global average pooling has on the network in general, so we probably shouldn’t read too much into the results on the “(-)” variants.

Overall, the relationship between architectural details and pretext task feature quality seems close to arbitrary. It’s certainly clear that classification performance has little correlation with pretext task feature quality, since RevNet50, ResNet50 v1/v2, and VGG19-BN all have similar performance on the ImageNet classification task yet differ widely in the quality of their self-supervised features. Also, ResNet v2 is known to provide marginal but consistent performance benefits on various classification tasks, but we see no such pattern here.

  • Self-supervision tasks benefit from large network capacity

The paper also conducts a study of the effect of network depth and capacity on the quality of representations learned. The authors experiment with wider networks and bigger feature vector dimensions.

To make networks wider, a simple multiplier k is applied to the number of channels in all convolution layers of the network. That is, the first layer of the network has 16*k channels, and the last layer has 512*k channels. The “Pre-logits” layer refers to the final dense layer (after global average pooling but before the classification softmax).

Naturally, the pre-logits layer size depends on the number of channels of the final conv layer and is therefore dependent on k. For k values of {4, 8, 12, 16} this results in pre-logits layers of size {2048, 4096, 6144, 8192}. To see whether the final pre-logits layer size, or “representation size”, affects the quality of the representations, they add another dense layer before the classification softmax layer to control the representation size. For example, with a k value of 16, the natural final representation size would be 8192, but another dense layer {8192->2048} can transform the vector into a 2048-dimensional one.
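
To make the bookkeeping concrete, here is a small sketch of how the widening factor determines the representation size, and how an extra dense layer can decouple the two. The function names and the random weight initialisation are our own illustrative assumptions, not the paper’s implementation; in a real model the dense layer’s weights would be learned.

```python
import numpy as np

def prelogits_size(k, base_channels=512):
    """Channels of the last conv layer, and hence the natural representation size."""
    return base_channels * k

def dense_projection(features, out_dim, rng):
    """Extra dense layer that maps features to a chosen representation size.

    Weights are randomly initialised here; in practice they would be learned.
    """
    in_dim = features.shape[-1]
    weight = rng.standard_normal((in_dim, out_dim), dtype=np.float32) * in_dim ** -0.5
    bias = np.zeros(out_dim, dtype=np.float32)
    return features @ weight + bias

if __name__ == "__main__":
    for k in (4, 8, 12, 16):
        print(k, prelogits_size(k))  # sizes 2048, 4096, 6144, 8192
    rng = np.random.default_rng(0)
    # One globally pooled feature vector from a k=16 network.
    pooled = np.ones((1, prelogits_size(16)), dtype=np.float32)
    print(dense_projection(pooled, 2048, rng).shape)  # (1, 2048)
```

With this separation, network width (k) and representation size can be varied independently, which is what allows the two axes to be studied separately.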

[Image: color-coded matrix of downstream accuracy by network width and representation size]

The colorized matrix shows that wider (higher-capacity) networks and bigger representation sizes are almost unequivocally better. The redder a cell, the higher its downstream accuracy, and we see that in the directions of increasing network capacity or representation size, accuracy is almost always better (the cells are redder). Also useful to note about the color: there is a yellow boundary at the [2x width/1024 representation size] row/column. This establishes a baseline of sorts at the 2x width/1024 setting. Although this is pretext task-specific, future self-supervised learning experiments may be well-advised to start with network capacities of k=4.


  • VGG is unsuited for feature learning

[Image: layer-wise feature quality of VGG19-BN and ResNet-based models]

In Table 3 you may have noticed that VGG19-BN’s feature quality is significantly lacking compared to the other networks, even though it achieves equivalent performance on supervised classification. The authors investigate why, by examining the features of VGG19-BN layer-wise. This reveals the likely source of the low performance of VGG nets: Their best features are found in Block3, but beyond Block3, the feature quality successively degrades. In contrast, for ResNet based models, feature quality continues to increase beyond Block3, right up to the final pre-logits layer. In other words, for feature learning, ResNets are able to utilize network depth beyond Block3 while VGG nets cannot.

It is interesting to consider that two networks of equivalent performance on supervised classification can exhibit wildly different performance in feature learning. This reinforces the earlier finding that self-supervision performance is very sensitive to the specifics of network architectures. Here, residual connections seem to be the key component in enabling network depth to be utilized effectively in unsupervised learning. We can further posit that methods that improve the learning dynamics or gradient flow of convnets, such as DenseNets and attention-based modules, may see meaningful gains in feature learning.

  • Pretext Task Accuracy isn’t Reliable Across Architectures 

An important component in unsupervised learning is model selection: We must select a model from the pretext training process and then evaluate it on the downstream task. Since evaluating all models on the downstream task is computationally expensive, we may opt to select only the model that achieves the best performance on the pretext task. However, the paper shows that pretext task accuracy isn’t necessarily a reliable metric for judging a model’s feature quality when multiple architectures are involved.

[Image: pretext task accuracy versus downstream ImageNet accuracy, per architecture and pretext task]

Let’s look at the figure above: On pretext task accuracy alone, the best network on the Rotation task is VGG19-BN, but as we’ve already seen, VGG19-BN in fact produces the worst features on all pretext tasks. Similar results can be observed in the RelPatchLoc and Jigsaw tasks: In both cases, VGG is the second best network in terms of pretext task accuracy, but the worst in terms of feature quality.

The effect is somewhat less pronounced when excluding VGG – On RelPatchLoc and Jigsaw tasks, there is a linear correlation between pretext task accuracy and downstream ImageNet accuracy. However, this is untrue for the Rotation task. It is possible that the Rotation task is an outlier here, but it’s also possible that we would have observed similar results if more pretext tasks beyond these four had been considered.

Finally, when comparing within an architecture, we do see a linear relationship between pretext task accuracy and downstream accuracy.
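
The model selection problem described in this section is easy to state in code. The sketch below picks a model by pretext accuracy and by downstream accuracy and measures how well the two metrics agree. Note that the accuracies in the demo are made-up placeholders chosen to show a disagreement; they are not results from the paper, and the model names are hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    return float(np.corrcoef(x, y)[0, 1])

def select_by(scores):
    """Return the name of the model with the highest score."""
    return max(scores, key=scores.get)

if __name__ == "__main__":
    # Made-up numbers for illustration only; NOT results from the paper.
    pretext_acc = {"model_a": 0.90, "model_b": 0.80, "model_c": 0.70}
    downstream_acc = {"model_a": 0.30, "model_b": 0.45, "model_c": 0.40}

    names = sorted(pretext_acc)
    r = pearson([pretext_acc[n] for n in names], [downstream_acc[n] for n in names])
    print("correlation:", r)
    print("picked by pretext accuracy:   ", select_by(pretext_acc))
    print("picked by downstream accuracy:", select_by(downstream_acc))
```

When the correlation is high, as the post reports within a single architecture, the cheap pretext metric picks the same model as the expensive downstream one. When it is low or negative, as it can be across architectures, the two selections diverge.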

  • Comparison with Semi-supervised Learning

If you recall the earlier description of the ImageNet evaluation procedure, you may have noticed that we use ImageNet as both the pretext task and the downstream task. The fact that ImageNet labels are used during the downstream phase places the procedure somewhere between pure unsupervised learning and semi-supervised learning. In fact, if fewer labels are used during the downstream phase, the unsupervised procedure becomes identical to semi-supervised learning except for two key differences: 1) In unsupervised learning, the convolutional layers are fixed when fine-tuning on the labeled data. 2) The unsupervised objective and the supervised objective are not jointly trained. To illustrate this, we include the following diagrams:

The first figure shows the pretext & downstream phases of the unsupervised learning & evaluation procedure, and the second figure shows a standard semi-supervised learning procedure. Overall, the data that the two procedures consume is the same except that in the semi-supervised case, fewer labeled examples are used.

1. Unsupervised pre-training and Evaluation

[Image: diagram of unsupervised pre-training and evaluation]

2. Semi-supervised learning

[Image: diagram of semi-supervised learning]
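
One way to make the two differences concrete is to write both procedures as objectives. The notation here is our own shorthand and is not taken from the paper: f_θ is the convolutional network with parameters θ, W is the classifier on top, D is the full image set, D_L is its labeled subset, and λ is a hypothetical weight balancing the two losses.

```latex
% 1. Unsupervised pre-training, then evaluation with the features held fixed
\theta^{*} = \arg\min_{\theta} \sum_{x \in D} \mathcal{L}_{\text{pretext}}\big(f_{\theta}(x)\big),
\qquad
W^{*} = \arg\min_{W} \sum_{(x, y) \in D_L} \mathcal{L}_{\text{sup}}\big(W f_{\theta^{*}}(x),\, y\big)

% 2. Semi-supervised learning: both objectives trained jointly, \theta is not fixed
(\theta^{*}, W^{*}) = \arg\min_{\theta,\, W}
  \sum_{(x, y) \in D_L} \mathcal{L}_{\text{sup}}\big(W f_{\theta}(x),\, y\big)
  + \lambda \sum_{x \in D} \mathcal{L}_{\text{unsup}}\big(f_{\theta}(x)\big)
```

In the first procedure the labels in D_L never influence θ, whereas in the second they do. In the standard ImageNet evaluation D_L covers all of D; the low-data regime corresponds to shrinking D_L.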

In light of this, it appears worthwhile to ask how the two procedures would compare against each other in the low-data regime. Unfortunately, we don’t have a direct comparison because the paper was not originally intended to study semi-supervised learning, and so does not conduct experiments to explicitly compare the effects of 1) fixing the conv layers when fine-tuning on the labeled data, and 2) jointly training the unsupervised and supervised objectives.

We can only make a tentative comparison with the available results. The 10% label setting in Table 3 allows us to make this comparison.

The semi-supervised model we compare against is DGDN + BSVM from Variational Autoencoder for Deep Learning of Images, Labels and Captions (Yunchen Pu et al., 2016). It jointly models an image decoder p(X|s), an image encoder p(s|X), and a label distribution conditioned on features, p(Y|s), where s denotes the features of a convolutional network. In terms of its components, it’s very similar to pre-training on an auto-encoding task and then fine-tuning, but with the two steps jointly optimized. Its results are no longer state-of-the-art, and yet it still shows superior performance to the unsupervised method. Note that architecturally it is similar to AlexNet, and therefore much smaller than the ResNet50 variants used for unsupervised learning.

The best unsupervised top-1 accuracy using 10% of ImageNet labels is the 38.4% of a RevNet50 with a 16x multiplier. In comparison, the best semi-supervised result from DGDN is ~50%.

Although this is not a direct comparison, the semi-supervised performance is clearly superior, and it may be worthwhile to explore the gap between the unsupervised setting and the semi-supervised setting, especially in the context of different pretext objectives.

  • Looking Forward

The results of the paper offer some useful insights for unsupervised learning going forward. They offer strong evidence for designing network architectures specific to a pretext task, rather than simply recycling the ones from the supervised classification task. They also show that downstream task performance is the one truly reliable way to measure feature quality, which supports the possibility of applying something like Neural Architecture Search to the goal of exploring unsupervised learning architectures.

They also show that some of the limitations apparent in previous works persist even when applying unsupervised learning to modern architectures. Considered in light of the strong performance possible with semi-supervised models in similar settings, they make the case for exploring pretext tasks in settings outside of unsupervised learning as well.


Posted by:cerberusd
