This post introduces the tool TensorFlow Probability. The content is heavily borrowed from the following sources.
- https://www.youtube.com/watch?v=CkD4PKwn9Dk (seminar)
A high-level description of TensorFlow Probability (TFP) is that it is a tool for chaining probability distributions together to perform probabilistic inference. Probabilistic modeling is quite popular in settings where domain knowledge is deeply embedded in the problem definition.
In the seminar above, TFP is described as
An open source Python library built using TF which makes it easy to combine deep learning with probabilistic models on modern hardware.
It is for:
- Statisticians/data scientists. R-like capabilities that run out-of-the-box on TPUs + GPUs.
- ML researchers/practitioners. Build deep models which capture uncertainty.
In order to create TFP models, we need to use distributions and bijectors.
TFP distributions: a collection of probability distributions
- Ex) Normal, Binomial, Poisson, Gamma, Multivariate Normal, Dirichlet, etc
- Python class which encodes some useful properties of a random variable.
- Bijectors transform inputs to outputs and back again.
Ex) Real -> (0,1), or (0,1) -> Real
- They are bijective, differentiable maps that keep track of how they change volume (via the log-determinant of the Jacobian).
- They are useful because sometimes it is faster to do inference on a transformation of a distribution than on the original distribution.
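As a concrete sketch of the Real -> (0,1) example above, here is the sigmoid map written out in plain NumPy (TFP ships this as `tfp.bijectors.Sigmoid`). The function names here are mine for illustration; the round trip and the log-det-Jacobian are exactly the pieces the Bijector class automates:

```python
import numpy as np

# A bijector is an invertible, differentiable map. The sigmoid maps the
# whole real line to (0, 1); its inverse (the logit) maps back.
def forward(x):            # Real -> (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def inverse(y):            # (0, 1) -> Real
    return np.log(y) - np.log1p(-y)

def inverse_log_det_jacobian(y):
    # log |d inverse / dy| -- how the map rescales volume (probability mass).
    return -np.log(y) - np.log1p(-y)

x = np.array([-2.0, 0.0, 3.0])
y = forward(x)
assert np.allclose(inverse(y), x)   # the round trip recovers the input
```

The log-det-Jacobian term is what lets TFP turn a density on one space into a valid density on the transformed space.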
TFP Use Case Example
The book “Bayesian Methods for Hackers” linked above provides a text message count data example.
The data consist of a person’s text message counts for 74 days.
How can we model these data? Assume that the person’s daily text message count follows a Poisson distribution.
The Poisson distribution expresses count data with a single rate parameter, lambda.
The higher the lambda, the more likely we are to sample large counts.
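To make the “higher lambda” intuition concrete, here is a small self-contained check using the standard Poisson pmf formula (no TFP required):

```python
import math

def poisson_pmf(k, lam):
    # P(K = k) = lam^k * exp(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

# With a small rate, small counts dominate; with a large rate, large counts do.
assert poisson_pmf(2, lam=1.5) > poisson_pmf(20, lam=1.5)
assert poisson_pmf(20, lam=18.0) > poisson_pmf(2, lam=18.0)

# The mean of a Poisson(lam) distribution is lam itself.
assert abs(sum(k * poisson_pmf(k, 5.0) for k in range(100)) - 5.0) < 1e-9
```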
Looking at the count data, the number of text messages appears to grow larger in the later period. Therefore, we can let lambda change according to the following logic.
This is called a switch point. Before a certain time period tau, lambda is equal to lambda1. And after tau, lambda is equal to lambda2.
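In symbols, the switch-point logic reads:

```latex
\lambda_t =
\begin{cases}
\lambda_1 & \text{if } t < \tau \\
\lambda_2 & \text{if } t \ge \tau
\end{cases}
```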
We are going to infer lambda1, lambda2, and tau. Each of them is treated as a random variable with its own probability distribution.
If, in reality, no sudden change occurred and lambda1 is indeed equal to lambda2, then the two posterior distributions should look about equal.
1. Prior distribution setting
To use Bayesian inference, we need to assign prior probabilities to the different possible values of lambda1, lambda2, and tau.
Lambda1 and lambda2 can only be positive. Therefore, it is suitable to say that lambda1 and lambda2 follow an exponential distribution.
Alpha is a hyperparameter which controls the exponential distribution. Since the mean of an exponential distribution is equal to 1/alpha, we can set our prior distribution’s alpha to 1/(mean of the total count).
For tau, we can say that tau ~ Uniform(1, 74), since we do not know when the breakpoint is; a priori, it could be any day with equal probability.
So to put all of our distributions together, we have,
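With alpha set to one over the mean count, and X_t denoting the count on day t, the model can be summarized as:

```latex
\begin{aligned}
\lambda_1 &\sim \text{Exponential}(\alpha) \\
\lambda_2 &\sim \text{Exponential}(\alpha) \\
\tau &\sim \text{Uniform}(1, 74) \\
X_t &\sim \text{Poisson}(\lambda_t)
\end{aligned}
```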
2. Model definition
The randomness in our model is in lambda1, lambda2, and tau.
We are interested in knowing the following distribution,
By Bayes rule we have,
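That is, the posterior over the parameters is proportional to the Poisson likelihood times the three priors:

```latex
P(\lambda_1, \lambda_2, \tau \mid X)
  \propto P(X \mid \lambda_1, \lambda_2, \tau)\,
          P(\lambda_1)\, P(\lambda_2)\, P(\tau)
```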
Doing inference with this model in TFP requires creating a joint log-probability function, which takes a sample of the parameters as input and returns the log probability of that sample under the model.
This can be done easily with the TFP distributions.
To aid understanding of the pipeline, here I provide an example of an input which goes into the joint_log_prob function.
lambda_ is an array gathered by the boolean mask of whether each day is smaller than the sampled tau.
Therefore, the output of the joint_log_prob function is the sum of the log probabilities of all the individual parts.
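The book’s TFP implementation builds this out of `tfd.Exponential`, `tfd.Uniform`, and `tfd.Poisson` log_prob terms. As a self-contained sketch of the same computation in NumPy (the illustrative counts and variable names below are mine, not the book’s data):

```python
import numpy as np
from math import lgamma

def joint_log_prob(count_data, lambda_1, lambda_2, tau):
    """NumPy sketch of the model's joint log probability.

    The TFP version sums the log_prob of each distribution object;
    here each term is written out explicitly.
    """
    n_days = len(count_data)
    alpha = 1.0 / np.mean(count_data)          # prior hyperparameter

    # log p(lambda_i) under Exponential(alpha): log(alpha) - alpha * x
    lp = np.log(alpha) - alpha * lambda_1
    lp += np.log(alpha) - alpha * lambda_2

    # log p(tau) under Uniform over the n_days: a constant, -log(n_days)
    lp += -np.log(n_days)

    # Gather the per-day rate: lambda_1 before the switch point, lambda_2 after.
    days = np.arange(n_days)
    lambda_ = np.where(days < tau, lambda_1, lambda_2)

    # Log-likelihood of the counts under Poisson(lambda_)
    log_fact = np.array([lgamma(k + 1.0) for k in count_data])
    lp += np.sum(count_data * np.log(lambda_) - lambda_ - log_fact)
    return lp

# Illustrative synthetic counts, NOT the book's dataset.
counts = np.array([13, 24, 8, 24, 7, 35, 14, 11, 15, 11], dtype=float)
print(joint_log_prob(counts, lambda_1=15.0, lambda_2=20.0, tau=5))
```

Parameter assignments that match the data pattern score a higher joint log probability, which is exactly what the sampler exploits.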
We are going to use MCMC to generate posterior samples using the model defined above. MCMC can be used with different kinds of kernels, and in our example, we are going to use HMC, which is known to be quite efficient.
HMC samples live in real number space, but our exponential distribution samples live in R+ and our uniform distribution samples live in (0,1). So we add bijectors that map them to unconstrained real space, which gives faster convergence.
In the order of lambda1, lambda2, and tau, we set
For the last step, we set the initial starting points for our sampler.
The following code puts together all our building blocks and runs the MCMC algorithm. The unnormalized_log_posterior function is the joint_log_prob function with count_data closed over, so it is no longer part of the input.
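The post’s actual code uses `tfp.mcmc.HamiltonianMonteCarlo` wrapped in a `tfp.mcmc.TransformedTransitionKernel`. To show the same end-to-end inference without a TFP install, here is a minimal random-walk Metropolis sketch on synthetic data with a known switch point (all settings below are illustrative assumptions, not the book’s):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic counts with an obvious switch: rate 5 for 40 days, rate 15 afterwards.
counts = np.concatenate([rng.poisson(5.0, 40), rng.poisson(15.0, 34)]).astype(float)
n_days = len(counts)
alpha = 1.0 / counts.mean()
days = np.arange(n_days)

def log_post(lam1, lam2, tau):
    # Unnormalized log posterior: exponential priors + uniform prior + Poisson likelihood.
    if lam1 <= 0 or lam2 <= 0 or not (0 < tau < n_days):
        return -np.inf                               # outside the support
    lp = -alpha * (lam1 + lam2)                      # exponential priors (up to constants)
    lam = np.where(days < tau, lam1, lam2)           # per-day rate
    return lp + np.sum(counts * np.log(lam) - lam)   # Poisson log-likelihood (up to constants)

# Random-walk Metropolis: propose a jittered state, accept with prob min(1, ratio).
state = np.array([counts.mean(), counts.mean(), n_days / 2.0])
current_lp = log_post(*state)
samples = []
for _ in range(20000):
    proposal = state + rng.normal(scale=[0.5, 0.5, 2.0])
    proposal_lp = log_post(*proposal)
    if np.log(rng.uniform()) < proposal_lp - current_lp:
        state, current_lp = proposal, proposal_lp
    samples.append(state)
samples = np.array(samples)[5000:]                   # drop burn-in

lam1_hat, lam2_hat, tau_hat = samples.mean(axis=0)
print(lam1_hat, lam2_hat, tau_hat)  # posterior means, should be near 5, 15, and 40
```

HMC with unconstraining bijectors plays the same role as this sampler but mixes far more efficiently, which is why the post prefers it.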
The posterior distributions of lambda1 and lambda2 are far apart from each other, meaning that the effect of the change is significant. The posterior distribution of tau suggests that the change most likely occurred between day 42 and day 44.
Here is the expected number of the text message received.
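One common way to compute this quantity is to average each day’s rate over the posterior samples. A sketch with made-up posterior samples (in the real analysis these arrays come from the MCMC run):

```python
import numpy as np

# Hypothetical posterior samples, for illustration only.
lambda_1_samples = np.array([4.8, 5.1, 5.0, 4.9])
lambda_2_samples = np.array([15.2, 14.8, 15.1, 14.9])
tau_samples = np.array([40, 41, 40, 40])

n_days = 74
days = np.arange(n_days)

# E[count on day t] ~= mean over samples of (lambda_1 if t < tau else lambda_2).
rates = np.where(days[None, :] < tau_samples[:, None],
                 lambda_1_samples[:, None],
                 lambda_2_samples[:, None])
expected_messages = rates.mean(axis=0)   # one expected count per day
```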
Our analysis shows
- strong support for believing the user’s behavior did change (lambda_1 would have been close in value to lambda_2 had this not been true)
- the change was sudden rather than gradual (as demonstrated by tau’s strongly peaked posterior distribution).
For an interactive tutorial on this example with the complete code, check out the tutorial from Bayesian Methods for Hackers linked above.
When the data of interest are not big enough to train a neural network, or when the question at hand is highly structured and domain-specific, we can use a probabilistic model to draw meaningful insights out of a small dataset.