# Week 10: Stochastic Variational Inference / Automatic Differentiation Variational Inference (SAD VI)¶

### Assigned Reading¶

- Murphy: Chapter 18

### Overview¶

- Review Variational Inference
- Derive the variational objective
- ELBO intuition
- Stochastic optimization

## Posterior Inference for Latent Variable Models¶

Imagine we had a latent variable model representing the joint distribution \(p(x, z, \theta)\), where

- \(x_{1:N}\) are the observations
- \(z_{1:N}\) are the unobserved local latent variables
- \(\theta\) are the global latent variables (i.e. the parameters)
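For a model of this form, a typical factorization (assuming i.i.d. observations with local latents, as in stochastic variational inference) is

\[p(x_{1:N}, z_{1:N}, \theta) = p(\theta) \prod_{i=1}^{N} p(z_i | \theta)\, p(x_i | z_i, \theta)\]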

The conditional distribution of the unobserved variables given the observed variables (the posterior inference) is
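\[p(z, \theta | x) = \frac{p(x, z, \theta)}{\int\int p(x, z, \theta)\, dz\, d\theta}\]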

which we will denote as \(p_{\theta}(z | x)\).

Because the computation of \(\int\int p(x, z, \theta)\, dz\, d\theta\) (the evidence) is intractable, the conditional distribution itself is intractable, so we must turn to variational methods.

### Approximating the Posterior Inference with Variational Methods¶

Approximation of the posterior inference with variational methods works as follows:

- Introduce a variational family, \(q_\phi(z | x)\) with parameters \(\phi\).
- Encode some notion of "distance" between \(p_\theta\) and \(q_\phi\).
- Minimize this distance.

This process effectively turns Bayesian Inference into an optimization problem (and we *love* optimization problems in machine learning).

It is important to note that, whatever family we choose for \(q_\phi\), it is unlikely that the true posterior \(p_\theta(z|x)\) will actually lie in it.

#### Kullback-Leibler Divergence¶

We will measure the distance between \(q_\phi\) and \(p_\theta\) using the **Kullback-Leibler divergence**.

Note

Kullback–Leibler divergence goes by many names; we will stick to *"KL divergence"*.

We compute \(D_{KL}\) as follows:
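\[D_{KL}(q_\phi \,||\, p_\theta) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right] = \int q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(z|x)}\, dz\]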

##### Properties of the KL Divergence¶

- \(D_{KL}(q_\phi || p_\theta) \ge 0\)
- \(D_{KL}(q_\phi || p_\theta) = 0 \Leftrightarrow q_\phi = p_\theta\)
- \(D_{KL}(q_\phi || p_\theta) \not = D_{KL}(p_\theta || q_\phi)\)

The significance of the last property is that \(D_{KL}\) is *not* a true distance measure.

### Variational Objective¶

We want to approximate \(p_\theta\) by finding a \(q_\phi\) such that
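\[\phi^* = \arg\min_\phi D_{KL}(q_\phi(z|x) \,||\, p_\theta(z|x))\]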

but the computation of \(D_{KL}(q_\phi || p_\theta)\) is intractable (as discussed above).

Note

\(D_{KL}(q_\phi || p_\theta)\) is intractable because it contains the term \(p_\theta(z | x)\), which, as we have already established, is intractable.

To circumvent this issue of intractability, we will derive the **evidence lower bound (ELBO)**, and show that maximizing the ELBO \(\Rightarrow\) minimizing \(D_{KL}(q_\phi || p_\theta)\).
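\[\begin{aligned}
D_{KL}(q_\phi(z|x) \,||\, p_\theta(z|x)) &= \mathbb{E}_{q_\phi(z|x)}\left[\log q_\phi(z|x) - \log p_\theta(z|x)\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\log q_\phi(z|x) - \log p_\theta(x, z)\right] + \log p_\theta(x) \\
&= -\underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right]}_{\mathcal L(\theta, \phi ; x)} + \log p_\theta(x)
\end{aligned}\]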

Where \(\mathcal L(\theta, \phi ; x)\) is the **ELBO**.

Note

Notice that \(\log p_\theta(x)\) is *not* dependent on \(z\).

Rearranging, we get
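\[\log p_\theta(x) = \mathcal L(\theta, \phi ; x) + D_{KL}(q_\phi(z|x) \,||\, p_\theta(z|x))\]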

Because \(D_{KL}(q_\phi (z | x) || p_\theta (z | x)) \ge 0\), the ELBO is a lower bound on the log evidence, \(\mathcal L(\theta, \phi ; x) \le \log p_\theta(x)\) (hence the name), and because \(\log p_\theta(x)\) does not depend on \(\phi\),

\(\therefore\) maximizing the ELBO \(\Rightarrow\) minimizing \(D_{KL}(q_\phi (z | x) || p_\theta (z | x))\).

#### Alternative Derivation¶

Starting with **Jensen's inequality**,
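\[f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]\]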

if \(X\) is a random variable and \(f\) is a convex function.

Given that \(\log\) is a concave function, we have
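\[\log p_\theta(x) = \log \int p_\theta(x, z)\, dz = \log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \ge \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathcal L(\theta, \phi ; x)\]

where the inequality follows because Jensen's inequality reverses direction for concave functions. This recovers the same lower bound on the evidence as before.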

### Alternative Forms of ELBO and Intuitions¶

We have that
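\[\log p_\theta(x) \ge \mathcal L(\theta, \phi ; x)\]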

1) The most general interpretation of the ELBO is given by
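\[\mathcal L(\theta, \phi ; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right]\]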

2) We can also re-write 1) using entropy
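\[\mathcal L(\theta, \phi ; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x, z)\right] + \mathbb{H}\left[q_\phi(z|x)\right]\]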

3) Another re-write and we arrive at
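\[\mathcal L(\theta, \phi ; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x | z)\right] - D_{KL}(q_\phi(z|x) \,||\, p_\theta(z))\]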

Tip

The instructor suggests that this form will be useful for Assignment 3.

This frames the ELBO as a tradeoff. The first term can be thought of as a "reconstruction likelihood", i.e. how probable \(x\) is given \(z\), which encourages the model to choose the distribution that best reconstructs the data. The second term acts as a regularizer, penalizing approximate posteriors \(q_\phi(z|x)\) that stray too far from the prior \(p_\theta(z)\).

Note

The instructor recommends we read *Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference* (Roeder et al., 2017).

### Mean Field Variational Inference¶

In mean field variational inference, we restrict ourselves to variational families \(q\) whose gradients we can compute, and assume the approximate distribution fully factorizes as \(q_\phi(z)\) (note: no dependence on \(x\)!). I.e., we approximate \(p_\theta(z|x)\) with \(q_\phi(z)\)
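\[q_\phi(\theta, z_{1:N}) = q_{\phi_\theta}(\theta) \prod_{i=1}^{N} q_{\phi_i}(z_i)\]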

where \(\phi = (\phi_\theta, \phi_{1:N})\).

If the \(q\) factors are in the same (conjugate) family as the corresponding factors of \(p\), we can optimize via coordinate ascent.

#### Traditional Variational Inference (ASIDE)¶

- Fix all other variables → optimize local
- Aggregate local → optimize global
- Repeat until the KL divergence (equivalently, the ELBO) converges

Warning

I think this was meant to be an aside.

#### Optimizing ELBO¶

We have that
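\[\mathcal L(\phi ; x) = \mathbb{E}_{q_\phi(z)}\left[\log p(x, z) - \log q_\phi(z)\right]\]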

If we want to optimize this with gradient methods, we will need to compute \(\nabla_\phi \mathcal L(\phi ; x)\). Nowadays, we have automatic differentiation (AD). We can optimize with gradient methods if:

- \(z\) is continuous
- dependence on \(\phi\) is exposed to AD

If these are both true, then
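\[\nabla_\phi \mathcal L(\phi ; x) = \nabla_\phi\, \mathbb{E}_{q_\phi(z)}\left[\log p(x, z) - \log q_\phi(z)\right]\]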

but this is difficult because we are taking the gradient of an expectation, and we are trying to compute this gradient from samples. This brings us to our big idea: instead of taking the gradient *of an expectation*, we compute the gradient *as an expectation*.

##### Score Gradient¶

Also called the likelihood-ratio or REINFORCE gradient, the score gradient was independently developed in 1990, 1992, 2013, and 2014 (twice). It is given by
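\[\nabla_\phi \mathcal L(\phi ; x) = \mathbb{E}_{q_\phi(z)}\Big[\nabla_\phi \log q_\phi(z)\, \big(\log p(x, z) - \log q_\phi(z)\big)\Big]\]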

if we assume that \(q_\phi(z)\) is a continuous function of \(\phi\), then
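\[\nabla_\phi \mathcal L(\phi ; x) = \nabla_\phi \int q_\phi(z)\, \big(\log p(x, z) - \log q_\phi(z)\big)\, dz = \int \nabla_\phi \Big[ q_\phi(z)\, \big(\log p(x, z) - \log q_\phi(z)\big) \Big]\, dz\]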

using the log-derivative trick \(\big ( \nabla_\phi \log q_\phi = \frac{\nabla_\phi q_\phi}{q_\phi} \big )\):
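\[\begin{aligned}
\nabla_\phi \mathcal L(\phi ; x) &= \int \nabla_\phi q_\phi(z)\, \big(\log p(x, z) - \log q_\phi(z)\big)\, dz - \int q_\phi(z)\, \nabla_\phi \log q_\phi(z)\, dz \\
&= \int q_\phi(z)\, \nabla_\phi \log q_\phi(z)\, \big(\log p(x, z) - \log q_\phi(z)\big)\, dz \\
&= \mathbb{E}_{q_\phi(z)}\Big[\nabla_\phi \log q_\phi(z)\, \big(\log p(x, z) - \log q_\phi(z)\big)\Big]
\end{aligned}\]

(The second integral vanishes because \(\int q_\phi(z)\, \nabla_\phi \log q_\phi(z)\, dz = \int \nabla_\phi q_\phi(z)\, dz = \nabla_\phi \int q_\phi(z)\, dz = 0\).)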

where \(\nabla_\phi \log q_\phi(z)\) is the score function. Finally, we have
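\[\nabla_\phi \mathcal L(\phi ; x) \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_\phi \log q_\phi(z^{(s)})\, \big(\log p(x, z^{(s)}) - \log q_\phi(z^{(s)})\big), \qquad z^{(s)} \sim q_\phi(z)\]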

which is *unbiased*, but *high variance*.
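As a concrete illustration (not from the lecture), here is a minimal NumPy sketch of this estimator, assuming a fully factorized Gaussian \(q_\phi(z)\) with \(\phi = (\mu, \log \sigma)\); `log_joint(x, z)` is a hypothetical user-supplied function returning \(\log p(x, z)\):

```python
# Minimal sketch of the score-function (REINFORCE) gradient of the ELBO,
# assuming a diagonal-Gaussian q_phi(z) with phi = (mu, log_sigma).
# `log_joint` is a hypothetical user-supplied function returning log p(x, z).
import numpy as np


def log_q(z, mu, log_sigma):
    """log q_phi(z) for a diagonal Gaussian, summed over dimensions."""
    var = np.exp(2 * log_sigma)
    return np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * (z - mu) ** 2 / var)


def score(z, mu, log_sigma):
    """The score function: gradient of log q_phi(z) w.r.t. (mu, log_sigma)."""
    var = np.exp(2 * log_sigma)
    d_mu = (z - mu) / var
    d_log_sigma = (z - mu) ** 2 / var - 1.0
    return d_mu, d_log_sigma


def score_gradient_elbo(log_joint, x, mu, log_sigma, num_samples=100, seed=0):
    """Unbiased (but high-variance) Monte Carlo estimate of grad_phi ELBO(phi; x)."""
    rng = np.random.default_rng(seed)
    g_mu = np.zeros_like(mu)
    g_log_sigma = np.zeros_like(log_sigma)
    for _ in range(num_samples):
        z = mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)  # z ~ q_phi(z)
        f = log_joint(x, z) - log_q(z, mu, log_sigma)               # ELBO integrand
        d_mu, d_log_sigma = score(z, mu, log_sigma)
        g_mu += f * d_mu                  # f(z) * score, averaged over samples
        g_log_sigma += f * d_log_sigma
    return g_mu / num_samples, g_log_sigma / num_samples
```

In practice these gradient estimates would be fed to a stochastic optimizer (e.g. SGD or Adam), with many samples or a control variate used to tame the variance.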

##### Pathwise Gradient¶
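The pathwise (reparameterization) gradient is the alternative estimator: if a sample from \(q_\phi\) can be written as a deterministic, differentiable transformation \(z = T(\epsilon, \phi)\) of fixed noise \(\epsilon \sim p(\epsilon)\) (e.g. \(z = \mu + \sigma \epsilon\) with \(\epsilon \sim \mathcal N(0, 1)\)), then

\[\nabla_\phi\, \mathbb{E}_{q_\phi(z)}\left[f(z)\right] = \mathbb{E}_{p(\epsilon)}\left[\nabla_\phi f(T(\epsilon, \phi))\right]\]

which moves the \(\phi\)-dependence inside the expectation and typically has much lower variance than the score gradient, at the cost of requiring \(z\) to be continuous and \(f\) to be differentiable.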

## Appendix¶

### Useful Resources¶

- High level overview on variational inference.