Introduction

This post aims to explain the concept of uncertainty in deep learning. More often than not, when people speak of uncertainty or probability in deep learning, several distinct concepts are used interchangeably, muddling the subject altogether. To see this, consider the following questions.

– Is my network’s classification output a probability of getting it right?

– How much is the model certain about the output? And if the model is certain, does that mean it is going to be more likely to be correct?

– What about the 2016 incident in which an autonomous vehicle mistook a tractor-trailer for the sky? How can this be prevented?

– What is model uncertainty?

– How much can you be sure that the model you have is not going to change dramatically if you have a slightly different set of training data?

– Have you heard the following terms? Aleatoric, Epistemic?

– What is a Bayesian Neural Network?

The questions above touch on different topics, all under the umbrella term "uncertainty." This post will try to answer them by scratching the surface of the following topics: calibration, uncertainty within a model, and Bayesian neural networks.

Calibration

What is calibration? A good paper for learning about calibration in depth is Guo, Chuan, et al, 2017. Calibration refers to the following statement:

the probability associated with the predicted class label should reflect its ground truth correctness likelihood.

In a classification network, the model is usually equipped with a sigmoid or softmax function that maps the logit output to a value between 0 and 1. Because this output lies between 0 and 1, many people assume it is a probability. However, it is not necessarily true that the model's output has the properties of a probability; that is, the output's value need not be indicative of the probability of being correct.

Consider the following diagram from Guo, Chuan, et al, 2017, which compares LeNet and ResNet on the CIFAR-100 dataset. The first row shows a histogram of the output values; ResNet has most of its values around 0.9–1.0.

Before looking at the second row, consider the following scenario, which we use to define two concepts.

Suppose we have 100 outputs from a model. Some will be 0.6, some 0.9. We can bin the outputs into, for example, [0, 0.1], [0.1, 0.2], …, [0.9, 1.0], and define the following terms for each bin $B_m$.

accuracy = $\frac{1}{|B_m|} \sum_{i \in B_m}\mathbf{1}(\hat{y_i} = y_i)$

confidence = $\frac{1}{|B_m|} \sum_{i \in B_m}\hat{p_i}$

Accuracy is the proportion of correct predictions among the outputs in a bin. Confidence is the mean of the outputs in a bin. The second row of the figure above plots sample accuracy as a function of confidence. The more the blue bars deviate from the red, the less calibrated the model is. The meaning of the plot is simple: it shows whether the model's output reflects how likely the prediction is to be correct.
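The bin-wise accuracy and confidence above can be computed in a few lines. Below is a minimal numpy sketch (the function and variable names are my own, not from the paper), applied to a toy binary classifier that is perfectly calibrated by construction:

```python
import numpy as np

def reliability_bins(probs, labels, n_bins=10):
    """Per-bin accuracy and confidence for binary predictions.

    probs:  predicted probability of class 1, shape (n,)
    labels: true labels in {0, 1}, shape (n,)
    Returns a list of (bin_lo, bin_hi, count, accuracy, confidence).
    """
    preds = (probs >= 0.5).astype(int)
    conf = np.where(preds == 1, probs, 1.0 - probs)  # confidence of the predicted class
    correct = (preds == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # accuracy = (1/|B_m|) sum 1(y_hat_i == y_i); confidence = (1/|B_m|) sum p_hat_i
            rows.append((lo, hi, int(in_bin.sum()),
                         correct[in_bin].mean(), conf[in_bin].mean()))
    return rows

# Toy model that is perfectly calibrated by construction:
# each label is drawn with exactly the stated probability.
rng = np.random.default_rng(0)
p = rng.uniform(0.5, 1.0, size=100_000)
y = (rng.uniform(size=p.size) < p).astype(int)
rows = reliability_bins(p, y)  # accuracy tracks confidence in every bin
```

On this toy data the blue and red parts of the reliability diagram would coincide; a miscalibrated model would show per-bin accuracies drifting away from the per-bin confidences.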

But can we take this at face value? Can we expect the model's output to mean the probability of getting it right if the model is calibrated like the model on the left of the figure? The answer is "maybe": it only applies to data that is "similar in distribution" to the data used to create the calibration graph. So the correct interpretation of calibration would be:

For a perfectly calibrated model, the value of the output for a given input reflects how likely the prediction is to be correct, provided that the distribution of the data used in calibration represents the distribution of the data in question.

For the calibration to be reliable, the model must, for one thing, be calibrated perfectly (otherwise we must also account for the calibration error), and secondly, the data must be similar. But when can we be sure that the data we are testing will be "similar" (in itself a vague term) to the data used in calibration? Maybe never.

Therefore, one should be careful when using calibration and mindful of the interpretation and the underlying assumptions. For a detailed explanation of how to calibrate a model, read Guo, Chuan, et al, 2017.

Uncertainty within a model

Moving on to the next concept: uncertainty within a model. Anyone who has taken a linear regression course in statistics will be somewhat familiar with this concept. It is about how we model the noise in the data; yes, I am going to talk about the variance in the model.

Before we proceed to a simple example, keep these definitions from Kendall and Gal (2017) in mind.

1. Data uncertainty (aleatoric): randomness that arises from the nature of the data. It depends on what you decide to "not explain" with the model (i.e., treat as noise).
2. Model uncertainty (epistemic): uncertainty that arises from the model's complexity and the amount of data.

It will become clearer once we look at an example.

Simple model

Consider this data-generating mechanism as the true distribution of samples:

$Y = X \mathbf{\beta} + \epsilon, \quad \epsilon \sim N(0,1)$

with $\mathbf{\beta} = [1,1]$, which is a simple line with noise.

Below are a few datasets that can be generated this way. Suppose we decide to fit this data with a linear model: $Y = X \mathbf{\beta}$

Recall the good old regression: find the $\hat{\beta}$ that minimizes the L2 norm of the difference between $Y$ and our prediction $\hat{Y} = X\hat{\beta}$:

$\hat{\beta} = \arg\min_{\hat{\beta}} \| Y - \hat{Y}\|_2$

Setting the derivative to zero,

$0 = \frac{d}{d\hat{\beta}}(Y - \hat{Y})^T(Y - \hat{Y}) = \frac{d}{d\hat{\beta}}(Y - X\hat{\beta})^T(Y - X\hat{\beta}) = \frac{d}{d\hat{\beta}}\left(Y^TY - 2Y^TX\hat{\beta} + \hat{\beta}^TX^TX\hat{\beta}\right)$

which gives $\hat{\beta} = (X^TX)^{-1}X^TY$.

See that depending on which samples are drawn, we get a slightly different fit. But we rarely get to see multiple datasets; we only see one. Linear regression allows you to infer the variance of the line's parameters without having to see the data-generating distribution. It is, in essence, a systematic way to quantify variance given the data and the model's complexity. It is this simple yet powerful concept that we are going to extend to neural networks. Substituting $Y = X\beta + \epsilon$ into the solution gives $\hat{\beta} - \beta = (X^TX)^{-1}X^T\epsilon$, so

$var(\hat{\beta}) = E[(\hat{\beta}- \beta)(\hat{\beta}- \beta)^T] = E[((X^TX)^{-1}X^T\epsilon)((X^TX)^{-1}X^T\epsilon)^T] = E[(X^TX)^{-1}X^T \epsilon \epsilon^T X(X^TX)^{-1}] = (X^TX)^{-1}X^T E[\epsilon \epsilon^T] X(X^TX)^{-1}$

Since we assumed $\epsilon \sim N(0,1)$, $E[\epsilon \epsilon^T] = \mathbf{I}$.

More generally, if $\epsilon \sim N(0,\sigma^2 \mathbf{I})$, then $E[\epsilon \epsilon^T] = \sigma^2\mathbf{I}$, so $var(\hat{\beta}) = \sigma^2(X^TX)^{-1}$.

Estimate $\sigma^2$ with the unbiased sample variance of the residuals, $\frac{(y-\hat{y})^T(y-\hat{y})}{n-k}$, where $k$ is the number of parameters.

Interpreting our $\hat{\beta}$ and $var(\hat{\beta})$ is tricky.

• Suppose our calculated $\hat{\beta}$ is $[1.254, 0.731]$.
• Suppose the diagonal of our calculated $var(\hat{\beta})$ is $[0.167, 0.0357]$.

Then, under some assumptions about normality and independence, the 95 percent confidence interval for $\hat{\beta_1}$ is $\hat{\beta_1} \pm 1.96\sqrt{var(\hat{\beta_1})} = [0.36, 1.10]$.
If we repeat this procedure many times, such intervals will contain the true $\beta_1$ (= 1) about 95 percent of the time.
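The coverage claim can be checked empirically. The following sketch (my own illustration, assuming a data-generating setup matching the line-plus-noise mechanism above) repeatedly draws a dataset, fits OLS, and counts how often the 95 percent interval for the slope contains the true value of 1:

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 100, 2000
beta_true = np.array([1.0, 1.0])  # intercept and slope, both equal to 1

covered = 0
for _ in range(trials):
    # Draw one dataset from Y = X beta + eps, eps ~ N(0, 1)
    x = rng.uniform(-3, 3, size=n)
    X = np.column_stack([np.ones(n), x])
    y = X @ beta_true + rng.normal(size=n)

    # OLS fit and variance estimate var(beta_hat) = sigma^2 (X^T X)^{-1}
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - 2)        # unbiased estimate of sigma^2 (k = 2 parameters)
    se = np.sqrt(s2 * XtX_inv[1, 1])    # standard error of the slope

    # Does the 95% interval contain the true slope?
    covered += (b[1] - 1.96 * se) <= 1.0 <= (b[1] + 1.96 * se)

coverage = covered / trials  # lands near 0.95
```

Because we know the generating mechanism here, we can verify the frequentist guarantee directly; with real data we only ever get to run one "trial."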

When a new data point $X^*$ comes in, we can expect the prediction to have variance as well, by the following decomposition: $var(y^*) = var(X^*\hat{\beta} + \epsilon) = X^* var(\hat{\beta}) X^{*T} + var(\epsilon) = \text{epistemic} + \text{aleatoric}$

Here we see that epistemic uncertainty is due to the variance of our parameters and aleatoric uncertainty is due to the noise not accounted for by the model.
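Putting the pieces together, here is a hedged numpy sketch of the whole pipeline on one simulated dataset: fit $\hat{\beta}$, estimate $\sigma^2$ from the residuals, form $var(\hat{\beta}) = \sigma^2(X^TX)^{-1}$, and decompose the predictive variance at a new point into epistemic and aleatoric parts (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
beta_true = np.array([1.0, 1.0])

# One observed dataset from Y = X beta + eps, eps ~ N(0, 1)
x = rng.uniform(-3, 3, size=n)
X = np.column_stack([np.ones(n), x])   # design matrix with intercept column
y = X @ beta_true + rng.normal(size=n)

# OLS: beta_hat = (X^T X)^{-1} X^T y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Unbiased estimate of sigma^2 from the residuals; k = number of parameters
k = X.shape[1]
resid = y - X @ beta_hat
sigma2_hat = (resid @ resid) / (n - k)

# var(beta_hat) = sigma^2 (X^T X)^{-1}
var_beta = sigma2_hat * XtX_inv

# Predictive variance at a new point x* = 2.0: epistemic + aleatoric
x_new = np.array([1.0, 2.0])                 # [intercept, x*]
epistemic = x_new @ var_beta @ x_new         # X* var(beta_hat) X*^T
aleatoric = sigma2_hat                       # estimate of var(eps)
var_pred = epistemic + aleatoric
```

Note that as $n$ grows, the epistemic term shrinks toward zero while the aleatoric term stays near $\sigma^2 = 1$; more data reduces parameter uncertainty but not the inherent noise.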

Non-Linear models

We can extend this concept of uncertainty to non-linear models such as neural networks. But unlike in the simple case, we cannot obtain closed-form solutions for the variance.

In non-linear models, such as NN, we cannot do the following:

• Find a closed form solution for $\hat{\beta} = \arg\min_{\hat{\beta}} || Y - \hat{Y}||_2$
• Therefore, we cannot calculate $var(\hat{\beta})$ in closed form.

With too many parameters, we cannot do the following:

• Even if a closed-form solution for the variance were known, we would have to assume that the covariance structure is diagonal (independence)

With $X$ being high-dimensional data such as images, we cannot easily do the following:

• Estimate $\sigma$, if there is one. (Classification may not have a $\sigma$.)

As an example of a non-linear model, I will point out one thing about the logistic regression model and then move on to the Bayesian Neural Network. There is a striking similarity between logistic regression and a deep neural network with binary cross-entropy loss.

Logistic Regression: $y_i \sim binomial(y = 1, prob = sigmoid(x_i B) )$
When $p(Y|X)$ follows a binomial distribution, we are solving $\arg\max_B p(Y|X) = \arg\max_B \prod_{i=1}^n sigmoid(x_iB)^{y_i} (1-sigmoid(x_iB))^{(1-y_i)}$

• Binomial distribution is the random component.
• $sigmoid(XB)$ part is the systematic component.
• What is $\arg\max_B p(y|x)$? It is the best conditional-probability fit of the data using the logistic regression model.
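The link between maximizing this likelihood and minimizing binary cross entropy can be verified numerically. A small self-contained sketch, using toy logits and labels of my own choosing:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy logits and binary labels; in a deep network the logit would come from
# NN(x_i, W) instead of x_i @ B -- the loss is the same either way.
logits = [-1.2, 0.4, 2.0, -0.3]
labels = [0, 1, 1, 0]
probs = [sigmoid(z) for z in logits]

# Bernoulli likelihood: prod p_i^{y_i} (1 - p_i)^{1 - y_i}
likelihood = 1.0
for y, p in zip(labels, probs):
    likelihood *= p ** y * (1 - p) ** (1 - y)

# Total binary cross-entropy loss over the same data
bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for y, p in zip(labels, probs))

# exp(-BCE) equals the likelihood, so minimizing BCE maximizes the likelihood
assert abs(math.exp(-bce) - likelihood) < 1e-12
```

Since BCE is exactly the negative log of the Bernoulli likelihood, the two optimization problems have the same solution.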

Deep neural network with binary cross entropy loss:

Deep learning with a sigmoid activation and cross-entropy loss is very similar to logistic regression: $y_i \sim binomial(y = 1, prob = NN(x_i, W) )$

where NN is the deep neural network.

• If the model is fitted correctly, the conditional probability $p(y \mid x)$ should have probability-like properties.
• Namely, a sigmoid output close to 1 should have a higher chance of being the designated class. This is the calibration problem discussed above.

Bayesian Neural Network

So far, everything we have discussed applies to neural networks in general, not only to Bayesian neural networks. But when talking about uncertainty, we cannot forego the discussion of Bayesian neural nets. What are they? What can Bayesian neural nets do that conventional neural nets can't?

Consider the following diagram, which summarizes what I have elaborated so far. All models can theoretically calculate the confidence interval (variance) of their predictions, but because of intractability and non-linearity, deep neural networks cannot calculate the variance of their output in closed form.

Here the Bayesian neural network comes to the rescue. To put it simply, it approximates the distribution of the output by generating multiple sample outputs, which is possible because Bayesians treat all the parameters as random variables. I have stated this in simple terms; for a more in-depth study of Bayesian regression and Bayesian neural networks, I recommend an introductory book on Bayesian statistics, A First Course in Bayesian Statistical Methods, and papers on Bayesian neural networks: Gal (2016) and Kendall and Gal (2017).

To illustrate the drawbacks of the standard neural network, consider two scenarios. In the first, the output for the data in question does not change much even when the training data and the model's initialization differ slightly. In the second, a slightly different dataset and a different initialization change the model's output heavily. The conventional model gives only one output, leaving us clueless as to which scenario we are in.

But Bayesian neural networks try to estimate the variance around the output, which serves as a proxy answer to our question.

Example and Experiment

In this section, I will show an example of a Bayesian neural network trained on the CUB-200 dataset using the dropout Bayesian neural network of Gal and Ghahramani (2016).

To introduce the model briefly,

• Gal and Ghahramani (2016) showed that a CNN with dropout applied after every weight layer is equivalent to a variational approximation of a Bayesian neural network with Bernoulli variational distributions.
• At test time, it averages over T forward passes through the network (as opposed to scaling the weights by the dropout ratio, as conventional dropout does at test time).
• It is the Monte Carlo estimation of the predictive distribution.
• Scalable and shown to have good generalization performance

Some notations:

• Let $W_i$ be the weights in ith layer of dimension $K_i \times K_{i-1}$
• $p_i$ be the dropout probability for ith layer.
• Define the variational distribution $q(w)$ to be the following:
• $W_i = M_i * diag((z_{i,j})^{K_i}_{j=1})$
• $z_{i,j} \sim Bernoulli(p_i) \text{ for } i=1,...,L,\quad j=1,...,K_{i-1}$
• We will optimize with respect to $M_i$
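To make the mechanics concrete, here is a minimal numpy sketch of Monte Carlo dropout at test time: keep the Bernoulli masks $z_{i,j}$ on during prediction and average T stochastic forward passes. This is my own toy illustration with random stand-in weights, not the implementation from Gal and Ghahramani (2016):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer network with fixed stand-in weights
# (in practice these would come from training)
W1 = rng.normal(size=(16, 4))
W2 = rng.normal(size=(1, 16))
p_keep = 0.9  # probability a unit is kept, i.e. 1 - dropout rate

def forward(x, stochastic=True):
    """One forward pass; the Bernoulli mask plays the role of z_{i,j}."""
    h = np.maximum(0.0, W1 @ x)                    # ReLU hidden layer
    if stochastic:
        z = rng.binomial(1, p_keep, size=h.shape)  # z_j ~ Bernoulli(p_keep)
        h = h * z / p_keep                         # inverted-dropout scaling
    return (W2 @ h)[0]

def mc_dropout_predict(x, T=200):
    """Monte Carlo estimate of the predictive mean and variance:
    dropout stays ON at test time; average T forward passes."""
    samples = np.array([forward(x) for _ in range(T)])
    return samples.mean(), samples.var()

x = np.array([0.5, -1.0, 2.0, 0.1])
mean, var = mc_dropout_predict(x)
```

The variance of the T samples serves as the Monte Carlo proxy for the spread of the predictive distribution; larger T gives a less noisy estimate at the cost of more forward passes.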

Despite its simplicity, there is one thing we must keep in mind. The variational distribution is Bernoulli, which has a single parameter controlling both the mean and the variance. Therefore, when the dropout probability is treated as a fixed hyperparameter, the posterior approximation is not expressive enough to reduce the variance to zero. In other words, the shape of our prior and of the variational distribution we are optimizing is not fully expressive.

Experiment dataset

CUB 200 data

• Number of categories: 200
• Number of images: 6,033
• train/test split of 0.9:0.1, stratified by class

Model:

GoogLeNet (not LeNet) with a global average pooling head. The Bayesian network with the same architecture shows a better validation score.

Some visualizations using Class Activation Map

The first picture is the input. The second picture is the class activation map from the standard neural network. The third picture is the multiple outputs from the Bayesian neural network.

Conclusion

In this post, we have distinguished the concept of calibration from model uncertainty. We also saw that while all models have uncertainty within them, not all models can calculate it, because of non-linearity. Bayesian neural networks are one way to circumvent this problem and estimate the output's variance. Bayesian neural networks remain a field not fully explored; furthermore, agreement on how to assess uncertainty quality in classification tasks might stimulate more research on this topic.

###### Posted by: Minchul David Kim

Researcher @Lunit