# Week 6: Kullback-Leibler Divergence

In the past lectures we discussed *exact inference* methods, like Variable Elimination.
We also discussed the issues associated with these methods in practice,
in particular the computational complexity of marginalizing over high-dimensional or continuous random variables.

To work with probabilistic models in practice, we develop methods to approximate the distributions of interest for inference.

## Approximate Inference

Given the joint distribution

\[p(X_F, X_E, X_R)\]

over query variables \(X_F\), evidence variables \(X_E\), and remaining variables \(X_R\), we will want to perform inference involving the distribution

\[p(X_F \mid X_E) = \frac{p(X_F, X_E)}{p(X_E)} = \frac{\sum_{x_R} p(X_F, X_E, x_R)}{\sum_{x_F, x_R} p(x_F, X_E, x_R)}.\]

Computing these sums exactly, i.e. marginalizing out \(X_R\) and evaluating the evidence \(p(X_E)\), is practically difficult.
Instead we approximate either by *sampling* or *variational* methods.
For now we will focus on developing the prerequisites for variational methods.

### Goal of Approximate Inference

Approximate the intractable distribution

\[p(X_F \mid X_E)\]

with a simpler distribution from a parameterized family

\[q_\phi(X_F)\]

and adjust \(\phi\) so the distributions are close, \(p \approx q\). That is,

\[\phi^* = \arg\min_\phi \; d\big(p(X_F \mid X_E),\, q_\phi(X_F)\big)\]

for some measure \(d\) of closeness between distributions.

**Question: How to measure "close" between distributions?**

### Bad Idea: Euclidean Distance in Parameter Space

Suppose we have two Gaussians with the same variance \(\sigma^2\) but different means \(\mu_1\) and \(\mu_2\). We could measure the closeness of these distributions by the Euclidean distance between their parameters, \(d(\mu_1, \mu_2) = \vert \mu_1 - \mu_2 \vert\).

However, for small \(\sigma\) these distributions may share very little support, despite having similar means. So we need a better idea!
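
To make this concrete, here is a minimal sketch (pure Python; the `overlap` helper is our own construction, not from the lecture) computing the shared probability mass of two equal-variance Gaussians. The parameter-space distance \(\vert \mu_1 - \mu_2 \vert\) is identical in both cases, yet the distributions go from near-identical to essentially disjoint as \(\sigma\) shrinks:

```python
import math

def overlap(mu1, mu2, sigma):
    """Overlapping probability mass of N(mu1, sigma^2) and N(mu2, sigma^2).

    The two densities cross at the midpoint of the means, so the
    overlap equals 2 * Phi(-|mu1 - mu2| / (2 * sigma)).
    """
    z = abs(mu1 - mu2) / (2 * sigma)
    return 1 + math.erf(-z / math.sqrt(2))  # 2 * standard normal CDF at -z

# Same parameter-space distance d(mu1, mu2) = 0.1 in both cases:
broad = overlap(0.0, 0.1, 1.0)     # wide Gaussians: almost total overlap
narrow = overlap(0.0, 0.1, 0.001)  # narrow Gaussians: essentially disjoint
print(broad, narrow)
```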

### Kullback-Leibler (KL) Divergence

A way to compare two distributions \(q\) and \(p\), defined as

\[D_{KL}(q \vert \vert p) = \mathbb{E}_{x \sim q(x)} \log \frac{q(x)}{p(x)},\]

with the properties:

- \(D_{KL}(q \vert \vert p) \geq 0\)
- \(D_{KL}(q \vert \vert p) = 0 \iff q=p\)
- \(D_{KL}(q \vert \vert p) \neq D_{KL}(p \vert \vert q)\)

**KL-Divergence is Not a Distance Metric!**
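
A quick numeric check of these properties on small discrete distributions (a sketch; the `kl` helper is our own):

```python
import math

def kl(q, p):
    """D_KL(q || p) for discrete distributions, with the convention 0 log 0 = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.6, 0.3, 0.1]
p = [0.2, 0.5, 0.3]

print(kl(q, p))  # non-negative
print(kl(p, q))  # also non-negative, but a different value: KL is asymmetric
print(kl(q, q))  # exactly zero, since the distributions coincide
```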

### \(D_{KL}(q \vert \vert p)\) vs \(D_{KL}(p \vert \vert q)\)

To build intuition for the difference between \(D_{KL}(q \vert \vert p)\) and \(D_{KL}(p \vert \vert q)\), we consider the difference between \(\log \frac{q(x)}{p(x)}\) and \(\log \frac{p(x)}{q(x)}\).

#### Reverse-KL Information Projection: \(D_{KL}(q \vert \vert p) = \mathbb{E}_{x \sim q(x)} \log \frac{q(x)}{p(x)}\)

- \(p \approx q \implies \; D_{KL}\;\text{small}\)
- \(p\; \text{large},\; q\; \text{small} \implies \; D_{KL}\;\text{small}\)
- \(p\; \text{small},\; q\; \text{large} \implies \; D_{KL}\;\text{large}\)

\(\therefore\; D_{KL}(q \vert \vert p)\) penalizes \(q\) having mass where \(p\) has none.
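
This zero-forcing behaviour is easy to exhibit numerically (a sketch; the `kl` helper and toy distributions are our own): give \(p\) an empty outcome and let \(q\) put mass there.

```python
import math

def kl(a, b):
    """D_KL(a || b), with the conventions 0 log 0 = 0 and x log(x/0) = inf for x > 0."""
    total = 0.0
    for ai, bi in zip(a, b):
        if ai == 0:
            continue          # 0 log 0 contributes nothing
        if bi == 0:
            return math.inf   # a has mass where b has none
        total += ai * math.log(ai / bi)
    return total

p = [0.5, 0.5, 0.0]  # p puts no mass on the third outcome
q = [0.4, 0.4, 0.2]  # q does

print(kl(q, p))  # infinite: q has mass where p has none
print(kl(p, q))  # finite: every outcome p uses is covered by q
```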

#### Forward-KL Moment Projection: \(D_{KL}(p \vert \vert q) = \mathbb{E}_{x \sim p(x)} \log \frac{p(x)}{q(x)}\)

- \(p \approx q \implies \; D_{KL}\;\text{small}\)
- \(p\; \text{large},\; q\; \text{small} \implies \; D_{KL}\;\text{large}\)
- \(p\; \text{small},\; q\; \text{large} \implies \; D_{KL}\;\text{small}\)

\(\therefore\; D_{KL}(p \vert \vert q)\) penalizes \(q\) missing mass where \(p\) has some.
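
The two behaviours can be seen by fitting a single Gaussian \(q_\phi\), \(\phi = (\mu, \sigma)\), to a bimodal target \(p\) under each direction. The sketch below is our own construction, not from the lecture: it discretizes everything on a grid and does a brute-force search over \(\phi\). The reverse-KL fit locks onto one mode (avoiding the low-probability valley), while the forward-KL fit covers both modes with one broad Gaussian.

```python
import math

xs = [i * 0.05 for i in range(-100, 101)]  # evaluation grid on [-5, 5]

def normal(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normalize(raw):
    Z = sum(raw)
    return [r / Z for r in raw]

# Bimodal target p: equal mixture of N(-2, 0.5^2) and N(+2, 0.5^2) on the grid.
p = normalize([0.5 * normal(x, -2, 0.5) + 0.5 * normal(x, 2, 0.5) for x in xs])

def q_dist(mu, sigma):
    return normalize([normal(x, mu, sigma) for x in xs])

def kl(a, b):
    """D_KL(a || b), with 0 log 0 = 0 and x log(x/0) = inf."""
    total = 0.0
    for ai, bi in zip(a, b):
        if ai == 0:
            continue
        if bi == 0:
            return math.inf
        total += ai * math.log(ai / bi)
    return total

# Brute-force search over phi = (mu, sigma).
candidates = [(m * 0.1, s * 0.1) for m in range(-30, 31) for s in range(2, 40)]
rev = min(candidates, key=lambda t: kl(q_dist(*t), p))  # information projection
fwd = min(candidates, key=lambda t: kl(p, q_dist(*t)))  # moment projection

print("reverse-KL fit (mu, sigma):", rev)  # locks onto one of the modes
print("forward-KL fit (mu, sigma):", fwd)  # covers both modes with a broad Gaussian
```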

### Which Direction of KL Do We Use in Practice?

In practice, the choice of direction does not appeal to any notion of whether the information projection or the moment projection is a better idea of closeness between distributions; it is usually dictated by which direction we can actually compute.