Stochastic Variational Inference

Suggested reading

  • Murphy: Chapter 18

Overview

  • Introduce TrueSkill model as an example problem.
  • Review Variational Inference
  • Derive the variational objective
  • ELBO intuition
  • Stochastic optimization

The TrueSkill latent variable model

The model we'll be examining is the TrueSkill model, a player ranking system for competitive games originally developed for Halo 2. It is a generalization of the Elo rating system used in chess. For the curious, see the original paper introducing TrueSkill. The next assignment will be based on doing variational inference in this model.

The problem that this model is supposed to solve is inferring the skill of a set of players in a competitive game, based only on observing who beats whom when they play against each other. In the TrueSkill model, each player has a fixed level of skill, denoted $z_i$. We initially don't know anything about anyone's skill, but for simplicity we'll assume everyone's skill is a priori independent. Perhaps we start with an independent Gaussian prior.

We never get to observe the players' skills directly, which makes this a latent variable model. Instead, we observe the outcome of a series of matches between different players. For each game, the probability that player $i$ beats player $j$ is given by $\sigma(z_i - z_j)$, where $\sigma$ is the logistic function $\sigma(y) = \frac{1}{1 + \exp(-y)}$. So we can write:

$$p(\textnormal{i beats j} | z_i, z_j) = \frac{1}{1 + \exp(-(z_i - z_j))}$$

The exact forms of the prior and likelihood aren't particularly important, as long as the probability that player $i$ wins is nondecreasing in $z_i$ and nonincreasing in $z_j$. So the higher the skill level, the higher the chance of winning each game.

We can write the entire joint likelihood of a set of players and games as:

$$p(z_1, z_2, \dots z_N, \textnormal{game 1, game 2, \dots game T}) = \left[ \prod_{i=1}^N p(z_i) \right] \left[ \prod_{i,j \in \textnormal{games}} p(\textnormal{i beats j} | z_i, z_j) \right]$$
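To make this concrete, here is a minimal sketch of this log joint in Python with NumPy. The standard normal prior, the `log_joint` name, and the two-column `games` array encoding who beat whom are illustrative assumptions, not part of the original model specification:

```python
import numpy as np

def log_joint(zs, games):
    """Log of the TrueSkill joint density sketched above.

    zs:    array of shape (N,), one latent skill per player
    games: integer array of shape (T, 2); row (i, j) means player i beat player j
    """
    # Independent standard normal prior over skills: sum_i log N(z_i | 0, 1).
    log_prior = np.sum(-0.5 * zs**2 - 0.5 * np.log(2.0 * np.pi))
    # Logistic likelihood per game: log sigma(z_i - z_j), computed stably
    # as -log(1 + exp(-(z_i - z_j))).
    winners, losers = games[:, 0], games[:, 1]
    log_lik = np.sum(-np.logaddexp(0.0, -(zs[winners] - zs[losers])))
    return log_prior + log_lik

# Three players, two observed games: player 0 beat player 1, player 1 beat player 2.
print(log_joint(np.zeros(3), np.array([[0, 1], [1, 2]])))
```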

Let's draw the prior, likelihood, and posterior for two players. What happens if we see two players both beat each other? Is that different from not seeing any matches at all?

It's hard to draw a graphical model for this joint distribution, but we can draw the graphical model for 3 players that have two games between them, and see that given the outcome of some matches, the players' skills are no longer independent, even if they've never played each other.

Even computing the posterior over just two players' skills requires integrating over all the other players' skills:

$$p(z_1, z_2 | \textnormal{game 1, game 2, \dots game T}) = p(z_1, z_2 | x) = \int \dots \int p(z_1, z_2, z_3 \dots z_N | x) \, dz_3 \dots dz_N$$

Posterior Inference in Latent Variable Models

Consider the probabilistic model $p(x, z)$ where

  • $x_{1:T}$ are the observations
  • $z_{1:N}$ are the unobserved latent variables

The conditional distribution of the unobserved variables given the observed variables (the posterior) is

$$p(z | x) = \frac{p(x | z) p(z)}{p(x)} = \frac{p(x | z) p(z)}{\int p(x, z) \, dz}$$

which we will denote as $p_\theta(z | x)$.

Whenever the number of values that $z$ can take is large, the computation of $\int p(x, z) \, dz$ is intractable, making the computation of the conditional distribution itself intractable. Thus we have to use approximate inference. In today's lecture, we'll learn about variational inference.

Approximating the Posterior with Variational Methods

Approximating the posterior with variational methods works as follows:

  1. Introduce a variational family $q_\phi(z)$ with parameters $\phi$.
  2. Encode some notion of "distance" between $p(z | x)$ and $q_\phi(z)$.
  3. Minimize this distance.

This turns Bayesian inference into an optimization problem, and if enough parts of our model are differentiable and well-approximated by simple Monte Carlo, we can use stochastic gradient descent to solve this optimization problem scalably.

Note that whatever parametric family of functions we choose for $q_\phi(z)$, it is unlikely that this variational family will contain the true posterior $p(z|x)$.

The earliest versions of variational inference optimized directly in function space, and required the calculus of variations, but now we use a simpler method that only requires standard finite-dimensional calculus.

Show autograd demo:
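A minimal sketch of the kind of demo meant here, assuming the HIPS autograd package, which differentiates ordinary NumPy code:

```python
import autograd.numpy as np
from autograd import grad

def tanh(x):
    # Plain NumPy code; autograd traces through it.
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

dtanh = grad(tanh)            # derivative function, built automatically
print(dtanh(1.0))             # ~0.419974
print(1.0 - tanh(1.0)**2)     # analytic derivative, for comparison
```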

Answering questions with an approximate posterior

Once we have this approximate posterior, what can we do with it?

We can find approximate answers to questions about the true posterior. In the TrueSkill setting, the sorts of questions we might want to answer are:

  1. Is player A better than player B?
  2. Which player's skill is closest to player B's? (for matchmaking)

We can express question 1 as an expectation:

$$p(z_A > z_B | x) = \mathbb{E}_{p(z_A, z_B | x)} [ I(z_A > z_B) ]$$

which we could easily approximate with simple Monte Carlo if we could sample from $p(z_A, z_B | x)$. However, we can't. But if we had a good approximate posterior, we could! For example, we could sample from $q(z|x)$:

$$p(z_A > z_B | x) \approxeq \mathbb{E}_{q(z_A, z_B | x)} [ I(z_A > z_B) ]$$

We could answer question 2 by simply examining the mean of $q(z_i | x)$ for each $i$, or estimating it by simple Monte Carlo if it's not available in closed form.
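As a sketch of both estimates, suppose the fitted approximate posterior factorizes into independent Gaussians; the means and standard deviations below are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fitted q(z | x): independent Gaussians, one per player.
mu = np.array([1.2, 0.8, -0.3])       # approximate posterior means
sigma = np.array([0.5, 0.6, 0.4])     # approximate posterior std devs

# Question 1: simple Monte Carlo estimate of p(z_A > z_B | x), with A=0, B=1.
samples = rng.normal(mu, sigma, size=(100000, 3))
print(np.mean(samples[:, 0] > samples[:, 1]))

# Question 2: the player whose approximate mean skill is closest to player B's.
candidates = [i for i in range(len(mu)) if i != 1]
print(min(candidates, key=lambda i: abs(mu[i] - mu[1])))
```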

Kullback-Leibler Divergence

We will measure the distance between $q_\phi$ and $p$ using the Kullback-Leibler divergence.

Note

Kullback–Leibler divergence has lots of names; we will stick to "KL divergence".

We compute $D_{KL}$ as follows:

$$\begin{aligned} D_{KL}(q_\phi(z | x) \,||\, p(z | x)) &= \int q_\phi(z | x) \log \frac{q_\phi(z | x)}{p(z | x)} \, dz \\ &= \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \log \frac{q_\phi(z | x)}{p(z | x)} \end{aligned}$$
Properties of the KL Divergence
  1. $D_{KL}(q_\phi \,||\, p) \ge 0$
  2. $D_{KL}(q_\phi \,||\, p) = 0 \Leftrightarrow q_\phi = p$
  3. $D_{KL}(q_\phi \,||\, p) \neq D_{KL}(p \,||\, q_\phi)$

The significance of the last property is that $D_{KL}$ is not a true distance measure, since it is not symmetric in its arguments.
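Although it isn't symmetric, the KL divergence is still just an expectation under $q_\phi$, so we can estimate it by simple Monte Carlo whenever we can sample from $q_\phi$ and evaluate both log densities. A sketch for two univariate Gaussians, where the exact KL is available as a check:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu_q, sig_q = 0.0, 1.0      # q
mu_p, sig_p = 1.0, 2.0      # p

# Simple Monte Carlo estimate of KL(q || p) = E_{z~q}[log q(z) - log p(z)].
z = rng.normal(mu_q, sig_q, size=100000)
kl_mc = np.mean(norm.logpdf(z, mu_q, sig_q) - norm.logpdf(z, mu_p, sig_p))

# Closed form for two Gaussians, as a check (~0.443 here).
kl_exact = (np.log(sig_p / sig_q)
            + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5)
print(kl_mc, kl_exact)
```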

Variational Objective

We want to approximate $p$ by finding a $q_\phi$ such that

$$q_\phi \approx p, \quad \textnormal{i.e.} \quad D_{KL}(q_\phi \,||\, p) \approx 0,$$

but the computation of $D_{KL}(q_\phi \,||\, p)$ is intractable (as discussed above).

Note

$D_{KL}(q_\phi \,||\, p)$ is intractable because it contains the term $p(z | x)$, which, as we have already established, is intractable.

To circumvent this issue of intractability, we will derive the evidence lower bound (ELBO) and show that maximizing the ELBO is equivalent to minimizing $D_{KL}(q_\phi \,||\, p)$.

$$\begin{aligned} D_{KL}(q_\phi (z | x) \,||\, p (z | x)) &= \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \log \frac{q_\phi(z | x)}{p(z | x)} \\ &= \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \left[ \log \left( q_\phi(z | x) \cdot \frac{p(x)}{p(z, x)} \right) \right] \\ &= \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \log \frac{q_\phi(z | x)}{p(z, x)} + \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \log p(x) \\ &= -\mathcal{L}(\phi ; x) + \log p(x) \end{aligned}$$

where $\mathcal{L}(\phi ; x)$ is the ELBO.

Note

Notice that $\log p(x)$ does not depend on $z$, so $\underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \log p(x) = \log p(x)$.

Rearranging, we get

$$\begin{aligned} D_{KL}(q_\phi (z | x) \,||\, p (z | x)) &= -\mathcal{L}(\phi ; x) + \log p(x) \\ \Rightarrow \mathcal{L}(\phi ; x) + D_{KL}(q_\phi (z | x) \,||\, p (z | x)) &= \log p(x) \end{aligned}$$

Because $D_{KL}(q_\phi (z | x) \,||\, p (z | x)) \ge 0$,

$$\mathcal{L}(\phi ; x) \le \log p(x)$$

$\therefore$ since the two terms sum to the fixed quantity $\log p(x)$, maximizing the ELBO is equivalent to minimizing $D_{KL}(q_\phi (z | x) \,||\, p (z | x))$.

Optimizing the ELBO

We have that

$$\begin{aligned} \mathcal{L}(\phi) &= - \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \log \frac{q_\phi(z | x)}{p(x, z)} \\ &= \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \big[ \log p(x, z) - \log q_\phi(z | x) \big] \end{aligned}$$

If we want to optimize this with gradient methods, we will need to compute $\nabla_\phi \mathcal{L}(\phi)$:

$$\nabla_\phi \mathcal{L}(\phi) = \nabla_\phi \, \mathbb{E}_{z \sim q_\phi(z | x)} \big[ \log p(x, z) - \log q_\phi(z|x) \big]$$

We know from a previous lecture that we can get an unbiased estimate of any expectation using simple Monte Carlo, as long as we can sample from the distribution and evaluate the function.

Pathwise Gradient

Here, we have the gradient of an expectation. We'd like to turn it into an expectation by swapping the derivative and expectation operators. However, in general, we can't swap them when the distribution we're taking the expectation over depends on the parameter. So let's factor the randomness out of $q$ and put it into a parameterless, fixed source of noise $p(\epsilon)$. To do this, we need to find a function $T(\epsilon, \phi)$ such that

$$\epsilon \sim p(\epsilon), \quad z = T(\epsilon, \phi) \quad \textnormal{implies} \quad z \sim q_\phi(z)$$

Usually we start with $p(\epsilon)$ being uniform or standard normal. It's not always easy to find these functions (here's a list), and we already know one for transforming standard normals into Gaussians with a given mean and variance:

$$\epsilon \sim \mathcal{N}(\epsilon|0, 1), \quad z = \sigma \epsilon + \mu \quad \textnormal{implies} \quad z \sim \mathcal{N}(z|\mu, \sigma)$$
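A quick NumPy check that this transform produces samples with the right mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

eps = rng.standard_normal(100000)   # parameterless noise, eps ~ N(0, 1)
z = sigma * eps + mu                # z = T(eps, phi) with phi = (mu, sigma)
print(z.mean(), z.std())            # ~2.0 and ~0.5, i.e. z ~ N(mu, sigma)
```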

Using this trick, we can write our expectation, and then move the gradient inside:

$$\begin{aligned} \nabla_\phi \mathcal{L}(\phi) &= \nabla_\phi \, \mathbb{E}_{z \sim q_\phi(z | x)} \big[ \log p(x, z) - \log q_\phi(z|x) \big] \\ &= \nabla_\phi \, \mathbb{E}_{\epsilon \sim p(\epsilon)} \big[ \log p(x, T(\phi, \epsilon)) - \log q_\phi(T(\phi, \epsilon)|x) \big] \\ &= \mathbb{E}_{\epsilon \sim p(\epsilon)} \nabla_\phi \big[ \log p(x, T(\phi, \epsilon)) - \log q_\phi(T(\phi, \epsilon)|x) \big] \end{aligned}$$

And that's it! The gradient is now itself an expectation, which we can estimate with simple Monte Carlo. Additionally, if $p(x|z)$ factorizes over the points in the dataset, we can speed up computation by randomly subsampling the data at each iteration as well. These two tricks together mean that SVI can be used to fit models with millions of parameters on millions of datapoints.

We simply need to code up $T(\epsilon, \phi)$, $\log p(x|z)$, $\log p(z)$, and $\log q_\phi(z|x)$ in a language or package that supports automatic differentiation, and we will get the gradient for free.
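Here is a minimal sketch of that recipe using the HIPS autograd package, with a diagonal Gaussian $q_\phi$ and a toy two-player stand-in for the TrueSkill joint; all names, constants, and the learning rate are illustrative assumptions:

```python
import autograd.numpy as np
import numpy.random as npr
from autograd import grad
from autograd.scipy.stats import norm

def log_joint(z):
    # Toy target: standard normal prior over two skills, plus one observed
    # game in which player 0 beat player 1.
    log_prior = np.sum(norm.logpdf(z, 0.0, 1.0), axis=-1)
    log_lik = -np.log1p(np.exp(-(z[..., 0] - z[..., 1])))
    return log_prior + log_lik

def elbo_estimate(phi, eps):
    mu, log_sigma = phi
    z = np.exp(log_sigma) * eps + mu                     # z = T(eps, phi)
    logq = np.sum(norm.logpdf(z, mu, np.exp(log_sigma)), axis=-1)
    return np.mean(log_joint(z) - logq)                  # simple Monte Carlo

elbo_grad = grad(elbo_estimate)                          # pathwise gradient
phi = (np.zeros(2), np.zeros(2))                         # mu, log sigma
for step in range(2000):
    eps = npr.randn(64, 2)                               # fresh noise each step
    g = elbo_grad(phi, eps)
    phi = tuple(p + 0.05 * gi for p, gi in zip(phi, g))  # gradient ascent

print(phi[0], np.exp(phi[1]))   # approximate posterior means and std devs
```

Because the noise is drawn outside the differentiated function, autograd only ever differentiates a deterministic map, which is exactly the point of the reparameterization.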

There are some notable extensions we'll cover or mention in this course:

  1. We can make $q_\phi$ arbitrarily expressive, for instance using a mixture of Gaussians or a normalizing flow.
  2. We can fit both $\phi$ and the parameters of $p(x|z)$ at the same time. For instance, we might want to learn the shape of the function that determines the probability of a win given both players' skills.

Appendix

Extra Demos:

Bayesian neural network demo (or YouTube video of the demo)


Alternative Derivation

Starting with Jensen's inequality,

$$f(E[X]) \le E[f(X)]$$

if $X$ is a random variable and $f$ is a convex function.

Given that $\log$ is a concave function (so the inequality is reversed), we have

$$\begin{aligned} \log p(x) &= \log \int p(x, z) \, dz \\ &= \log \int p(x, z) \frac{q_\phi(z | x)}{q_\phi(z | x)} \, dz \\ &= \log \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \frac{p(x, z)}{q_\phi(z | x)} \\ &\ge \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \log \frac{p(x, z)}{q_\phi(z | x)} \\ &= - \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \log \frac{q_\phi(z | x)}{p(x, z)} \\ &= \mathcal{L}(\phi ; x) \end{aligned}$$

Alternative Forms of ELBO and Intuitions

We have that

$$\mathcal{L}(\phi ; x) = \text{ELBO} = - \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \log \frac{q_\phi(z | x)}{p(x, z)}$$

1) The most general interpretation of the ELBO is given by

$$\begin{aligned} \mathcal{L}(\phi ; x) &= - \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \log \frac{q_\phi(z | x)}{p(x, z)} \\ &= \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \log \frac{p(x, z)}{q_\phi(z | x)} \\ &= \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \log \frac{p(z) p(x | z)}{q_\phi(z | x)} \\ &= \underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \big[ \log p(x | z) + \log p(z) - \log q_\phi(z | x) \big] \end{aligned}$$

2) We can also rewrite 1) using entropy:

$$\underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \big[ \log p(x | z) + \log p(z) \big] + \mathbb{H} \big[ q_\phi(z | x) \big]$$

3) Another rewrite and we arrive at

$$\underset{z \sim q_\phi}{\operatorname{\mathbb{E}}} \big[ \log p(x | z) \big] - D_{KL}(q_\phi(z | x) \,||\, p(z))$$

This frames the ELBO as a tradeoff. The first term can be thought of as a "reconstruction likelihood", i.e. how probable $x$ is given $z$, which encourages the model to choose the distribution that best reconstructs the data. The second term acts as regularization, penalizing approximate posteriors that stray too far from the prior $p(z)$.
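For a Gaussian prior and a Gaussian $q$, the KL term in this tradeoff has a closed form, so the decomposition is easy to compute. A sketch with a univariate toy model, assuming $p(z) = \mathcal{N}(z | 0, 1)$ and $p(x|z) = \mathcal{N}(x | z, 1)$ (all values illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

x = 1.0                        # one observed data point
mu_q, sig_q = 0.8, 0.5         # variational parameters (illustrative)

# Reconstruction term: E_{z~q}[log p(x | z)] by simple Monte Carlo.
z = rng.normal(mu_q, sig_q, size=100000)
recon = np.mean(norm.logpdf(x, z, 1.0))

# Regularization term: KL(q(z|x) || p(z)) in closed form for Gaussians.
kl = np.log(1.0 / sig_q) + (sig_q**2 + mu_q**2) / 2.0 - 0.5

print(recon - kl)              # the ELBO, reconstruction minus KL
```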

Mean Field Variational Inference

In mean field variational inference, we restrict ourselves to variational families $q$ that fully factorize (note that there is no $x$!). That is, we approximate $p(z|x)$ with a $q_\phi(z)$ of the form

$$q_\phi(z) = \prod_{i=1}^N q(z_i | \phi_i)$$

Traditional Variational Inference (ASIDE)

If the $q$s are in the same family as the $p$s, we can optimize via coordinate ascent:

  1. Fix all other variables → optimize the local variational parameters.
  2. Aggregate the local parameters → optimize the global parameters.
  3. Repeat until the KL divergence converges.

Score Function Gradient Estimator

Also called the likelihood ratio or REINFORCE estimator, it was independently developed in 1990, 1992, 2013, and 2014 (twice). It is given by

$$\nabla_\phi \, \mathbb{E}_{z \sim q_\phi(z)} f(z) = \nabla_\phi \int f(z) q_\phi(z) \, dz$$

If we assume that $q_\phi(z)$ is a continuous function of $\phi$, then

$$\begin{aligned} &= \int \nabla_\phi \big[ f(z) q_\phi(z) \big] \, dz \\ &= \int f(z) \nabla_\phi q_\phi(z) \, dz \end{aligned}$$

Using the log-derivative trick $\big( \nabla_\phi \log q_\phi = \frac{\nabla_\phi q_\phi}{q_\phi} \big)$:

$$\begin{aligned} &= \int f(z) q_\phi(z | x) \nabla_\phi \big[ \log q_\phi(z | x) \big] \, dz \\ &= \mathbb{E}_{z \sim q_\phi(z)} \Big[ f(z) \nabla_\phi \big[ \log q_\phi(z | x) \big] \Big] \end{aligned}$$

where $\nabla_\phi \log q_\phi(z | x)$ is the score function. Finally, applying this estimator to the ELBO, we have

$$\nabla_\phi \mathcal{L}(\phi ; x) = \mathbb{E}_{z \sim q_\phi(z)} \Big[ \big( \log p(x, z) - \log q_\phi(z | x) \big) \nabla_\phi \big[ \log q_\phi(z | x) \big] \Big]$$

which is unbiased, just like the pathwise gradient, but usually has much higher variance. One of the main reasons variational inference wasn't used much until recently was that it wasn't widely understood how large the pathwise estimator's improvement in variance is in high dimensions.
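To see the variance gap concretely, here is a sketch comparing the two estimators of $\nabla_\mu \, \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[z^2]$, whose true value is $2\mu$; the toy integrand is an illustrative assumption, not from the original notes:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.0           # q(z) = N(z | mu, sigma); differentiate w.r.t. mu
n = 100000

# Score-function estimator: f(z) * d/dmu log q(z) = z^2 * (z - mu) / sigma^2.
z = rng.normal(mu, sigma, size=n)
score = z**2 * (z - mu) / sigma**2

# Pathwise estimator: z = mu + sigma*eps, so d/dmu f(z) = 2 * (mu + sigma*eps).
eps = rng.standard_normal(n)
pathwise = 2.0 * (mu + sigma * eps)

print(score.mean(), score.var())        # mean ~1.0, variance ~18.6
print(pathwise.mean(), pathwise.var())  # mean ~1.0, variance ~4.0
```

Even on this one-dimensional problem the score-function estimator's variance is several times larger, and the gap typically grows with dimension, which is the point made above.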