Week 6: Kullback-Leibler Divergence

In past lectures we discussed exact inference methods, like Variable Elimination. We also discussed the issues associated with these methods in practice, in particular the computational complexity of marginalizing over high-dimensional or continuous random variables.

To work with probabilistic models in practice, we develop methods to approximate the distributions of interest for inference.

Approximate Inference

Given the joint

\[p(X) = p(X_1, \dots, X_N) = p(X_{E}, X_{F}, X_{R})\]

we will want to perform inference involving the distribution

\[p(X_{F} \vert X_{E}).\]
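
Computing this conditional exactly means summing (or integrating) the joint over the remaining variables,

\[p(X_{F} \vert X_{E}) = \frac{\sum_{x_R} p(X_{E}, X_{F}, x_R)}{\sum_{x_F, x_R} p(X_{E}, x_F, x_R)},\]

where each sum ranges over every configuration of the variables being marginalized out.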

Exact methods for marginalizing out the required variables \(X_R\) and \(X_E\) are practically difficult. Instead, we approximate, either by sampling or by variational methods. For now we will focus on developing the prerequisites for variational methods.

Goal of Approximate Inference

Approximate

\[p(X) = \frac{1}{Z} \tilde p(X)\]

with a simpler distribution

\[q(X; \phi)\]

and adjust \(\phi\) so the distributions are close, \(p \approx q\). That is,

\[ \begin{aligned} \sum_x f(x) p(x) &\approx \sum_x f(x) q(x; \phi)\\ \mathbb{E}_{x \sim p(x)}\left[ f(x)\right] &\approx \mathbb{E}_{x \sim q(x; \phi)} \left[f(x) \right] \end{aligned} \]
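
As a minimal Monte Carlo sketch of this idea (the Gamma target, the moment-matched Gaussian \(q\), and the tail-probability test function are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target p(x): a Gamma(shape=5, scale=1) distribution, standing in
# for a distribution we would rather not work with directly.
shape, scale = 5.0, 1.0

# Illustrative approximation q(x; phi): a Gaussian whose parameters phi = (mu, sigma)
# are chosen to match p's mean and variance.
mu, sigma = shape * scale, np.sqrt(shape) * scale

def f(x):
    """Test function: the tail indicator 1[x > 7]."""
    return (x > 7.0).astype(float)

x_p = rng.gamma(shape, scale, size=200_000)   # samples from the target p
x_q = rng.normal(mu, sigma, size=200_000)     # samples from the approximation q

print("E_p[f(x)] ~", f(x_p).mean())   # Monte Carlo estimate under p
print("E_q[f(x)] ~", f(x_q).mean())   # close, but not equal: q is only an approximation
```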

Question: How do we measure how "close" two distributions are?

Bad Idea: Euclidean Distance in Parameter Space

Suppose we have two Gaussians with the same variance \(\sigma^2\) but different means \(\mu_1\) and \(\mu_2\), and suppose we measure the closeness of these distributions by the distance between their parameters, \(d(\mu_1, \mu_2)\).

However, for small \(\sigma\) these distributions may share very little support despite having similar means, so \(d(\mu_1, \mu_2)\) can be small even though the distributions barely overlap. We need a better idea!
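
A quick numerical check (with illustrative values \(\sigma = 0.01\), \(\mu_1 = 0\), \(\mu_2 = 0.1\)): the parameter-space distance looks small, while the two densities share almost no probability mass.

```python
import numpy as np
from scipy.stats import norm

# Two Gaussians with the same small standard deviation but different means
# (illustrative values).
sigma = 0.01
mu1, mu2 = 0.0, 0.1

# Closeness measured in parameter space: a small number.
print("d(mu1, mu2) =", abs(mu1 - mu2))

# Closeness measured between the distributions themselves: the shared mass
# int min(p(x), q(x)) dx is essentially zero, because neither density places
# any appreciable mass near the other's mean.
x = np.linspace(-0.5, 0.6, 200_001)
dx = x[1] - x[0]
p = norm.pdf(x, mu1, sigma)
q = norm.pdf(x, mu2, sigma)
print("overlapping mass =", np.sum(np.minimum(p, q)) * dx)
```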

Kullback-Leibler (KL) Divergence

A way to compare two distributions, defined as

\[ \begin{aligned} D_{KL}(q \vert \vert p) &= \mathbb{E}_{x \sim q(x)} \log \frac{q(x)}{p(x)}\\ &=\sum_{x}q(x)\log \frac{q(x)}{p(x)} \end{aligned} \]

With the properties

  1. \(D_{KL}(q \vert \vert p) \geq 0\)
  2. \(D_{KL}(q \vert \vert p) = 0 \iff q=p\)
  3. \(D_{KL}(q \vert \vert p) \neq D_{KL}(p \vert \vert q)\) KL-Divergence is Not a Distance Metric!
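
A minimal numerical check of these properties, using two arbitrary distributions over four states:

```python
import numpy as np

def kl(q, p):
    """Discrete KL divergence D_KL(q || p) = sum_x q(x) log(q(x) / p(x))."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0   # convention: terms with q(x) = 0 contribute 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

p = np.array([0.1, 0.2, 0.3, 0.4])       # two arbitrary distributions
q = np.array([0.25, 0.25, 0.25, 0.25])

print(kl(q, p))             # property 1: nonnegative
print(kl(p, p))             # property 2: zero exactly when the two arguments are equal
print(kl(q, p), kl(p, q))   # property 3: the two directions differ
```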

\(D_{KL}(q \vert \vert p)\) vs \(D_{KL}(p \vert \vert q)\)

To build intuition for the difference between \(D_{KL}(q \vert \vert p)\) and \(D_{KL}(p \vert \vert q)\), we consider the difference between \(\log \frac{q(x)}{p(x)}\) and \(\log \frac{p(x)}{q(x)}\), and which distribution the expectation of the log-ratio is taken under.

Reverse-KL Information Projection: \(D_{KL}(q \vert \vert p) = \mathbb{E}_{x \sim q(x)} \log \frac{q(x)}{p(x)}\):

  • \(p \approx q \implies \; D_{KL}\;\text{small}\)
  • \(p\; \text{large},\; q\; \text{small} \implies \; D_{KL}\;\text{small}\)
  • \(p\; \text{small},\; q\; \text{large} \implies \; D_{KL}\;\text{large}\)

\(\therefore\; D_{KL}(q \vert \vert p)\) penalizes \(q\) having mass where \(p\) has none.
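
A small numerical illustration of this behaviour, using an illustrative bimodal target \(p\), one candidate \(q\) sitting on a mode of \(p\), and one sitting in the valley where \(p\) has essentially no mass:

```python
import numpy as np
from scipy.stats import norm

# Illustrative bimodal target: an equal mixture of two narrow Gaussians.
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3.0, 0.5) + 0.5 * norm.pdf(x, 3.0, 0.5)

def reverse_kl(q, p):
    """Grid approximation of D_KL(q || p) = sum_x q(x) log(q(x) / p(x)) dx."""
    return np.sum(q * np.log(q / p)) * dx

q_mode   = norm.pdf(x, -3.0, 0.5)   # all of q's mass sits on one mode of p
q_valley = norm.pdf(x,  0.0, 0.5)   # q's mass sits where p is essentially zero

print(reverse_kl(q_mode, p))     # small (roughly log 2): q stays inside p's support
print(reverse_kl(q_valley, p))   # large: q puts mass where p has (almost) none
```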

Forward-KL Moment Projection: \(D_{KL}(p \vert \vert q) = \mathbb{E}_{x \sim p(x)} \log \frac{p(x)}{q(x)}\):

  • \(p \approx q \implies \; D_{KL}\;\text{small}\)
  • \(p\; \text{large},\; q\; \text{small} \implies \; D_{KL}\;\text{large}\)
  • \(p\; \text{small},\; q\; \text{large} \implies \; D_{KL}\;\text{small}\)

\(\therefore\; D_{KL}(p \vert \vert q)\) penalizes \(q\) missing mass where \(p\) has some.
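
The two directions can therefore prefer very different approximations. A sketch comparing both on the same illustrative bimodal target, with a mode-seeking candidate and a mass-covering candidate for \(q\):

```python
import numpy as np
from scipy.stats import norm

# Same illustrative bimodal target as in the previous sketch.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3.0, 0.5) + 0.5 * norm.pdf(x, 3.0, 0.5)

def kl(a, b):
    """Grid approximation of D_KL(a || b)."""
    return np.sum(a * np.log(a / b)) * dx

q_mode  = norm.pdf(x, -3.0, 0.5)   # covers one mode of p, misses the other
q_broad = norm.pdf(x,  0.0, 3.0)   # covers both modes (and the valley between them)

# Reverse KL prefers the mode-seeking q; forward KL prefers the mass-covering q.
print("D_KL(q || p):", kl(q_mode, p), kl(q_broad, p))
print("D_KL(p || q):", kl(p, q_mode), kl(p, q_broad))
```

This mode-seeking versus mass-covering behaviour is exactly the asymmetry described by the two bullet lists above.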

Which direction of KL do we use in practice?

In practice, the choice of direction does not appeal to any notion of whether the information projection or the moment projection is a better measure of closeness between distributions.
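
One short derivation worth keeping in mind (using the unnormalized form \(p(X) = \frac{1}{Z} \tilde p(X)\) from earlier): the reverse KL can be evaluated up to the constant \(\log Z\) using only \(\tilde p\) and expectations under \(q\),

\[D_{KL}(q \vert \vert p) = \mathbb{E}_{x \sim q(x)} \left[\log q(x) - \log \tilde p(x)\right] + \log Z,\]

so adjusting \(\phi\) to minimize it never requires the normalizing constant \(Z\).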