Have a look at this object (try to forget that it's a cup for a moment =p),

cup.png

And a series of other objects (credit to Google's image search for the word "cup"),

multiple_cups.png

Would you agree that the sky-blue object above and the whole series of other objects in the second figure belong to the same category?

The answer (I hope) is "yes". Although the problem above may be "easy" and "obvious" to us, it is much less so for machines.

We humans are able to generalize from a single example or a handful of them. However, current state-of-the-art methods require a vast amount of data for a machine to learn a concept. This may be viable in many domains, but it is a difficult requirement to uphold, and it may not reflect how intelligence is actually acquired.

The relatively small body of research in this area hints that new concepts are learned on top of past experience, which is then reused to take in novel concepts. One way to think of this is as a form of "transfer learning", where the main idea is to take information learned in one domain and use it to learn from new, novel data.

The machine learning community has seen enormous improvements in object/class recognition, but under the requirement that the model being taught has seen a large and diverse set of examples per class. This is a very strong requirement, since it is realistically infeasible to obtain a large number of diverse examples for every class. Hence, it becomes a big obstacle when a machine needs to learn a very large library of classes. To me, this sounds like a fun and important problem to solve!

One-Shot Learning:

The act of learning to generalize from one or a few training examples per class.

There are multiple ways to directly address one-shot learning. The two presented here are, first, learning robust features that transfer well to a new class, and second, a meta-learning approach, where we learn how to do one-shot learning and apply it upon seeing a new class.

Note that one-shot learning is often defined as a discriminative task (e.g. classification), but it does not necessarily have to be!

The first paper trains a robust feature-learning model so that it learns generic image features useful for making predictions about unknown class distributions, even when very few examples are available.

The full details can be found in the paper by Gregory Koch et al. from the University of Toronto, "Siamese Neural Networks for One-shot Image Recognition", ICML 2015 Deep Learning Workshop.

The authors' contribution is to do one-shot learning with a trained deep convolutional Siamese neural network. A Siamese neural network is a twin network that feeds a pair of inputs through a parameter-shared network. The outputs of the network on both inputs are then compared at a higher, more abstract layer to calculate the loss. The details follow below.

Siamese_NN.png

This is the architecture of the model (the twin branch is not shown in this diagram; imagine a copy of the network alongside it, up to the fully connected + sigmoid layer).
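To make the construction concrete, here is a minimal PyTorch sketch of a convolutional Siamese network. The layer sizes are only illustrative approximations of the paper's architecture, not an exact reproduction; the important part is that both inputs share the same feature extractor and are compared through a weighted L1 distance.

import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared ("twin") feature extractor: both inputs go through the same weights.
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=10), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=7), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, kernel_size=4), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(4096), nn.Sigmoid(),  # flattened high-level feature vector h
        )
        self.alpha = nn.Linear(4096, 1)  # learned per-component weights on |h1 - h2|

    def forward(self, x1, x2):
        h1, h2 = self.features(x1), self.features(x2)
        # p = sigmoid( sum_j alpha_j * |h1_j - h2_j| )
        return torch.sigmoid(self.alpha(torch.abs(h1 - h2)))

# Example: a pair of 105x105 grayscale images -> similarity score p in (0, 1).
net = SiameseNet()
p = net(torch.randn(4, 1, 105, 105), torch.randn(4, 1, 105, 105))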

Let’s take a look at how the loss is formulated to train this network,

l1_sigmoid.png
First, p, an important piece of the loss defined below, is the sigmoid of the weighted L1 distance between the flattened high-level feature vectors (h). It scores the similarity between the two high-dimensional feature vectors, and \alpha is a per-component weighting factor that is learned.
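In symbols (this is what the figure above depicts, if I read it correctly),

p = \sigma\left( \sum_{j} \alpha_j \left| h_{1,j} - h_{2,j} \right| \right),

where h_1 and h_2 are the twin network's feature vectors and \sigma is the sigmoid function.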

siamese_loss.png
Using p, the similarity score, we can understand the loss (a regularized binary cross-entropy) in the two cases below:
y(x_1,x_2) = 1: both x's are from the same class, so the update learns to minimize the distance between x_1 and x_2.
y(x_1,x_2) = 0: the x's are from different classes, so the update learns to maximize the distance between x_1 and x_2.
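Spelled out, the regularized binary cross-entropy has roughly the form

L(x_1, x_2) = -\, y(x_1,x_2) \log p(x_1,x_2) - \left(1 - y(x_1,x_2)\right) \log\left(1 - p(x_1,x_2)\right) + \lambda^\top |w|^2,

where the last term is an L2 penalty on the network weights.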

A quick note on the Omniglot dataset by Lake et al,

This is one of the baseline datasets used for one-shot, zero-shot, etc. learning tasks, and I think it's worth mentioning some of the details here.
– Contains examples from 50 alphabets, some fictitious, some not.
– The number of letters in each alphabet varies from 15 to 40.
– All letters are drawn once by each of 20 drawers.
– The split can vary: 30 alphabets used for training (12 drawers), 20 alphabets used for testing/validation (8 drawers).

The training setup for the Siamese model on the Omniglot dataset is as follows:

Train Set: 30 alphabets with 12 unique drawers.
– Each alphabet is sampled uniformly for fairness.
– Eight kinds of affine distortions were added to each training sample (for a sample, see the figure below).
– Data pairs are sampled at random, balancing same-class and different-class pairs (see the sketch after the figure below).

Val Set: 10 alphabets with 4 unique drawers,
– used to create one-shot recognition trials that determine the termination criterion.

Test Set: 10 alphabets with 4 unique drawers.

affine_distortion.png
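As a rough illustration of the pair sampling mentioned above, here is a small Python sketch. The dataset structure (alphabet -> character -> list of drawings) is a hypothetical stand-in rather than the authors' actual pipeline, and I am glossing over details such as whether different-class pairs are restricted to the same alphabet.

import random

# Hypothetical stand-in for Omniglot: dataset[alphabet][character] -> list of drawings.
# (Placeholder strings here; in practice these would be image arrays.)
dataset = {
    "Latin": {"a": ["a_d1", "a_d2"], "b": ["b_d1", "b_d2"]},
    "Greek": {"alpha": ["al_d1", "al_d2"], "beta": ["be_d1", "be_d2"]},
}

def sample_pair(same):
    """Sample one (x1, x2, y) pair: y = 1 for a same-class pair, y = 0 otherwise."""
    alphabet = random.choice(list(dataset))        # alphabets sampled uniformly
    char1 = random.choice(list(dataset[alphabet]))
    char2 = char1 if same else random.choice(
        [c for c in dataset[alphabet] if c != char1])
    return (random.choice(dataset[alphabet][char1]),
            random.choice(dataset[alphabet][char2]),
            1 if same else 0)

# Keep the labels balanced: half same-class pairs, half different-class pairs.
pairs = [sample_pair(same=(i % 2 == 0)) for i in range(8)]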

Now, during the actual training phase, the model takes a pair of inputs and learns until a sufficiently good verification result is reached. The verification task here measures the progress of training by checking whether the model can tell if an input pair belongs to the same class. Once the model performs well on this verification task, the authors assume it has learned features that should be sufficient to perform a task on a novel class (a.k.a. do one-shot learning!).
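In code, the verification check could look roughly like this, using the SiameseNet sketch from above as a stand-in; thresholding p at 0.5 to turn the similarity score into a same/different decision is my assumption.

def verification_accuracy(net, pairs):
    """pairs: iterable of (x1, x2, y) with y = 1 for same-class, y = 0 otherwise."""
    correct = 0
    for x1, x2, y in pairs:
        p = net(x1, x2).item()          # similarity score in (0, 1)
        correct += int((p > 0.5) == bool(y))
    return correct / len(pairs)

# e.g. stop training once verification_accuracy(net, val_pairs) is high enough.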

verification_one_shot_task.png


At test time, the one-shot task (known here as "20-way within-alphabet" classification) uses the robust features learned by the Siamese net. Below is my attempt at pseudocode for how it can be done.

correct, total = 0, 0
for alph in alphabets:  # recall, 10 evaluation alphabets in total
    letters = sample(alph, k=20)                 # 20 candidate character classes
    drawer1 = [draw(c, "first_drawer") for c in letters]   # query drawings
    drawer2 = [draw(c, "second_drawer") for c in letters]  # one candidate per class
    for i, x1 in enumerate(drawer1):
        # f(x1, x2) is the Siamese net's sigmoid-weighted L1 similarity p
        P = [f(x1, x2) for x2 in drawer2]        # compare x1 against all 20 candidates
        pred = argmax(P)                         # index of the most similar candidate
        correct += (pred == i)                   # the true class of x1 is the i-th letter
        total += 1

Finally, the one-shot classification accuracy is simply the fraction of queries for which the predicted class is correct (correct / total above).

For both papers discussed here, we will focus only on the results on the Omniglot dataset.

verification_task.png

The first table shows accuracy on the Omniglot verification task. Recall that this measures performance during the training phase. Notice how, with affine distortions and an increased training set size, results improve all the way up to 93.42% accuracy.

one_shot_results.png

The second table reports one-shot test performance (the task described above) for several models. Notice that the convolutional Siamese net performs second best after Hierarchical Bayesian Program Learning (HBPL), which makes use of extra prior knowledge about characters and strokes. Another downside of HBPL is its computationally expensive inference at test time.

Koch et al. presented a strong method for one-shot learning by way of feature extraction, but more recently a different approach based on meta-learning was presented by Bertinetto et al. (2016), "Learning feed-forward one-shot learners".

The idea is quite straightforward. The approach learns another network, called a learnet, that predicts the parameters of a pupil network (a binary classifier) from a single example; the learnet is trained by the loss generated from the pupil's attempt to do one-shot learning.

ff_oneshot_learner.png

Each training triplet is (x, z, l), where z is the input to the learnet, which predicts the pupil's weights. The pupil uses those weights and takes input x to predict the label l. If x and z belong to the same class, l = 1; if they are from different classes, l = 0. (In the actual paper the labels are positive and negative, but I keep things in line with Koch's notation.)

The contributions are two-fold:
1. This is the first paper to explore methods that learn the parameters of complete discriminative models in one shot.
2. It demonstrates that deep neural networks can learn at the 'meta-level' to predict the weights of a second network.

The Objective Function:

The network learns to minimize a loss that depends on the learnet's parameters, W^\prime:

\min_{W^\prime}\frac{1}{n} \sum_{i=1}^{n} L(\phi(x_i;w(z_i;W^\prime)), l_i).

where \phi(\cdot) is the pupil network and w(\cdot) is the learnet.
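To ground the objective, here is a heavily simplified PyTorch sketch: the learnet is a small MLP that predicts the weight vector and bias of a one-layer pupil classifier, and only the learnet's parameters W^\prime receive gradient updates. The sizes, the linear pupil, and the random toy triplets are my own simplifications, not the architecture from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim = 256  # illustrative feature dimensionality

# Learnet w(z; W'): maps the exemplar z to the pupil's parameters.
learnet = nn.Sequential(
    nn.Linear(feat_dim, 128), nn.ReLU(),
    nn.Linear(128, feat_dim + 1),             # pupil weight vector + bias
)

def pupil(x, params):
    """phi(x; w(z)): a one-layer pupil whose weights were predicted by the learnet."""
    w, b = params[:, :feat_dim], params[:, feat_dim:]
    return (x * w).sum(dim=1, keepdim=True) + b   # per-example linear classifier logit

opt = torch.optim.Adam(learnet.parameters(), lr=1e-3)
for step in range(100):
    # Toy triplets (x_i, z_i, l_i); in practice these come from Omniglot pairs.
    x = torch.randn(32, feat_dim)
    z = torch.randn(32, feat_dim)
    l = torch.randint(0, 2, (32, 1)).float()
    params = learnet(z)                        # w(z_i; W')
    logits = pupil(x, params)                  # phi(x_i; w(z_i; W'))
    loss = F.binary_cross_entropy_with_logits(logits, l)  # L(., l_i)
    opt.zero_grad()
    loss.backward()
    opt.step()                                 # only W' is learned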

For comparison, in a Siamese construction the network takes both z_i and x_i as inputs and uses a chosen metric to compute a similarity score:

\min_{W}\frac{1}{n} \sum_{i=1}^{n} L([\phi(x_i,W), \phi(z_i, W)], l_i)

While not so salient in this formulation, an important point is that the output of w(z;W^\prime) is used to parameterize layers that determine the intermediate representations in the pupil network. Take a look at the siamese learnet in the architecture figure below to get a better sense of what that means.

Difficulty in training a learnet:
Without any special tricks, the learnet has to predict parameters naively, which makes it difficult to train.
If we consider a simple feed-forward pupil layer, y = w(z)x + b(z), the learnet needs a very large output space, w: \mathbb{R}^m \rightarrow \mathbb{R}^{d \times k}, and this output space grows quadratically with the number of units.

The authors propose a solution: factorized linear layers.
Following the idea of the singular value decomposition, they learn a matrix M that projects x into a space where w(z) acts as a disentangled factor of variation, and a matrix M^\prime that maps the result back out of that space. Because M and M^\prime are learned once and shared, only the diagonal w(z) needs to be predicted, greatly reducing the number of parameters the learnet must output. The output then becomes y = M^\prime \mathrm{diag}(w(z)) M x + b(z).
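A concrete sketch of the factorization is below; the dimensions are illustrative. M and M^\prime are ordinary matrices learned once during training, while the diagonal w(z) and the bias b(z) would come from the learnet, so here they are just passed in as tensors.

import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Sketch of y = M' diag(w(z)) M x + b(z): only the diagonal (and bias)
    need to be predicted per exemplar, so the learnet's output stays small."""
    def __init__(self, d_in, d_out, d_mid):
        super().__init__()
        self.M = nn.Linear(d_in, d_mid, bias=False)          # project x into the factorized space
        self.M_prime = nn.Linear(d_mid, d_out, bias=False)    # map the result back out

    def forward(self, x, wz, bz):
        # wz: (batch, d_mid) predicted diagonal; bz: (batch, d_out) predicted bias
        return self.M_prime(wz * self.M(x)) + bz

layer = FactorizedLinear(d_in=256, d_out=128, d_mid=64)
x = torch.randn(32, 256)
wz = torch.randn(32, 64)     # in the real model these come from the learnet w(z)
bz = torch.randn(32, 128)
y = layer(x, wz, bz)         # shape (32, 128)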

For convolutional layers this becomes slightly more complicated.

svm_on_convolution.png

The output has a similar form, y = M^\prime \ast w(z) \ast_d M \ast x + b(z), where \ast denotes convolution and \ast_d acts independently on each feature channel. M and M^\prime disentangle the feature channels, allowing the predicted w(z) to act as a set of 'basis filters'. This greatly reduces the number of parameters that the learnet needs to predict. The authors explain in the appendix of the paper why w(z) can be interpreted as basis filters.
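Here is my reading of the convolutional analogue as a sketch: M and M^\prime become 1x1 convolutions that mix channels, and the predicted w(z) acts as a bank of per-channel filters applied with a grouped convolution. The sizes are illustrative, and for simplicity this handles a single exemplar's predicted filters at a time.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedConv(nn.Module):
    """Sketch of y = M' * w(z) *_d M * x + b(z): the 1x1 convs M, M' are learned once,
    only the per-channel filters w(z) (and bias) are predicted by the learnet."""
    def __init__(self, c_in, c_out, c_mid, k=3):
        super().__init__()
        self.M = nn.Conv2d(c_in, c_mid, kernel_size=1, bias=False)
        self.M_prime = nn.Conv2d(c_mid, c_out, kernel_size=1, bias=False)
        self.k = k

    def forward(self, x, wz, bz):
        # wz: (c_mid, 1, k, k) predicted filters, one per channel; bz: (c_out,) predicted bias
        h = self.M(x)
        h = F.conv2d(h, wz, padding=self.k // 2, groups=h.shape[1])  # channel-wise conv
        return self.M_prime(h) + bz.view(1, -1, 1, 1)

layer = FactorizedConv(c_in=32, c_out=32, c_mid=16)
x = torch.randn(1, 32, 28, 28)
wz = torch.randn(16, 1, 3, 3)   # would come from the learnet w(z) in the real model
bz = torch.randn(32)
y = layer(x, wz, bz)            # shape (1, 32, 28, 28)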

The Dataset:
As shown in the figure above, each input is a triplet (x, z, l). The pair x, z is chosen at random; if the two share a class, l = 1, otherwise l = 0. The training set contains an equal number of l = 1 and l = 0 triplets.

Train data: 30 alphabets, 20 sample images per character, with an average of 32 characters per alphabet.
Evaluation data: 20 alphabets.
Images are 28 \times 28, and the network is much shallower than Koch's.


ff_one_shot_architectures.png

Experiment Architecture:

Siamese learnet: uses a learnet to predict only some of the parameters of the shared stream rather than all of them. The learnet takes input z and predicts part of the weights of a Siamese network, which then takes the pair (z, x) and performs binary classification on whether the two are from the same or different classes.

Learnet: a single-stream learnet. The authors keep only a single stream of the Siamese structure, and the learnet predicts that other stream's parameters as well as the final comparison parameters (playing a role similar to h in Koch's model).

The experiment we will look at is exactly the same as Koch's: the Omniglot 20-way within-alphabet task is used to evaluate the models.

ff_one_shot_experiment_results.png

The authors use three different comparison functions: the dot/inner product, the Euclidean distance, and the weighted L1 norm between the two high-dimensional vectors of the input pair. As the results show, the learnet (the single-stream architecture described above) performs best, at a 28.6% error rate.
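For reference, the three comparison functions are easy to write down. The sign conventions below (larger means more similar) and the learned weight vector in the L1 case are my assumptions, analogous to \alpha in Koch's model.

import torch

def dot_product(a, b):
    return (a * b).sum(dim=-1)

def neg_euclidean(a, b):
    return -torch.norm(a - b, dim=-1)            # negated so larger = more similar

def weighted_l1(a, b, alpha):
    return -(alpha * (a - b).abs()).sum(dim=-1)  # alpha: learned per-dimension weights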

Conclusion:

In this post, we have gone through an overview of two distinct approaches to one-shot learning. To the best of my understanding, Koch et al.'s extraction of robust features with a Siamese neural network for classifying novel classes and Bertinetto et al.'s meta-learning method are simply two different attempts at this problem, and it is at the moment difficult to state with confidence which approach is definitely better.

 

References

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, 2015.

Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.

Lake, B. M., Salakhutdinov, R., Gross, J., and Tenenbaum, J. B. One shot learning of simple visual concepts. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 2011.

Bertinetto, L., Henriques, J. F., Valmadre, J., Torr, P. H. S., and Vedaldi, A. Learning feed-forward one-shot learners. In NIPS, 2016.

Posted by: Chris Dongjoo Kim
