Introduction to Probabilistic Models

Assigned Reading


  • Overview of probabilistic models
  • Sufficient statistics
  • Likelihood
  • Maximum likelihood estimation (MLE)
  • Classification

Overview of probabilistic models

In general, we have random variables \(X = (\Xvars)\) that are either observed or unobserved. We want a model that captures the relationship between these variables. The approach of probabilistic generative models is to relate all variables by a learned joint probability distribution \(\pt(\Xvars)\).

The assumption here is that the variables, e.g. supervised learning data in the form of (input, label) pairs, were generated by some distribution \((\Xvars) \sim \ptrue(X)\). "Learning" the joint probability distribution, also called density estimation, is the process choosing the parameters \(\theta\) of a specified parametric joint distribution \(\pt(X)\) to "best match" the underlying data distribution \(\ptrue(X)\). If this is too abstract, for now we can think about \(\pt\) as a black-box machine learning model, like a neural network with "weights" \(\t\).

This course will investigate

  • How should we specify \(\pt\).
  • What it means for \(\pt\) to "best match" \(\ptrue\).
  • How can we find the best parameters \(\t\).

So, with generative models the joint probability distribution is the central object of interest. We assume there is a true joint \(\ptrue\) which we are trying to learn with a model \(\pt\).

A Probabilistic Perspective on ML Tasks

With this perspective, we can think about common machine learning tasks differently, where random variables represent:

  • input data \(X\),
  • discrete outputs ("labels") \(C\),
  • or continuous outputs \(Y\).


It is sometimes nice to talk about all random variables together, \(X\), or individually only through subscripts \(\Xvars\). Here, we see that it is helpful to easily distinguish groups of random variables from each other. Like data, \(X\), from discrete labels, \(C\). Nothing profound is happening here, we're just conveniently naming variables to help us understand their meaning.

If we have the joint probability over these random variables, e.g. \(p(X,C,Y)\), we will see later that we can use it for familiar ML tasks:

  • Regression: \(p(Y \vert X) = p(X,Y) / p(X) = p(X,Y) / \int p(X,Y) dY\)
  • Classification / Clustering: \(p(C \vert X) = p(X,C) / \sum_C p(X,C)\)

Example: Classification

For example, in a supervised classification problem we observe (in our dataset of size \(N\)) pairs of "input data" and "class labels",

\[ \{x_i,c_i\}_{i=1}^N \sim p(X,C) \]

The classification problem will be to learn a distribution over class labels given new input data:

\[p(C \vert X) = p(X,C) / \sum_C p(X,C)\]

We can see above that this can be obtained from an expression involving only the joint distribution \(p(X,C)\). This is why we care about it!

But wait, in a previous machine learning course we might have learned that the goal of classification tasks is to actually assign a class label, \(c^*\). Here we only have a distribution over class labels conditioned on input data, \(p(C \vert X)\). How does that help us?

Now that we have a distribution over class labels we have a choice for how to assign the class label.

  • We can choose the most likely class label under our distribution, or the class with the "maximum likelihood", \(c^* = \argmax_{c} p(C=c \vert X)\).
  • We can sample the class assignment from our distribution, \(c^* \sim p(C \vert X)\).
  • We can output the class assignment (however we chose it) along with its density under our distribution, \((c^*, p(C=c^* \vert X))\).

That last point is very powerful, our model doesn't just give us a prediction but can inform us of the model's uncertainty or confidence of the prediction. This will be discussed more later, but for now know that a primary benefit of developing probabilistic perspective on machine learning is that it allows us to systematically work with uncertainty.

Observed vs Unobserved Random Variables

The distinction between classification vs clustering, or in general supervised vs unsupervised learning, in this probabilistic perspective is given by whether a random variable is observed or unobserved.

In the previous example for supervised classification, we assumed our dataset included input data and class labels

Supervised Dataset: \(\{x_i,c_i\}_{i=1}^N \sim p(X,C)\)

In this case, the class labels are "observed", and finding the conditional distribution \(p(C \vert X)\) satisfies the supervised classification problem.

However, what if our dataset only included the "input data":

Unsupervised Dataset: \(\{x_i\}_{i=1}^N \sim p(X,C)\)

Notice that we did not change the generative assumption, our data \(x_i\) is still distributed according to a class label \(C=c_i\), even though it is unobserved in the dataset. The common way to refer to an unobserved discrete class label is "cluster".

However, whether the label is observed or not, it does not ultimately change our goal, which is to have a model of the conditional distribution over the labels/clusters given the input data \(p(C \vert X)\).

This view allows us to easily accommodate semi-supervised learning, when variables are observed for some, but not all, of the examples in the dataset.

Latent Variables

Further, like clusters, introducing assumptions about unobserved variables is a powerful modelling tool. We will make use of this by modelling variables which are never observed in the dataset, called latent or hidden variables. By introducing and modelling latent variables, we will be able to naturally describe and capture abstract features of our input data.

Operations on Probabilistic Models

The fundamental operations we will perform on a probabilistic model are:

  • Generate Data: For this we will need to know how to sample from the model.
  • Estimate Likelihood: When all variables are either observed or marginalized the result is a single real number which is the probability of the all variables taking on those specific values.
  • Inference: Compute expected value of some variables given others which are either observed or marginalized.
  • Learning: Set the parameters of the joint distribution given some observed data to maximize the probability the observed data.

Desiderata of Probabilistic Models

Given the above operations we have two main goals for our joint distributions:

  1. That they facilitate efficient computation of marginal and conditional distributions.
  2. That they have compact representation so the size of the parameterization scales well for joint distributions over many variables.

To achieve these desiderata we will make modelling assumptions. The most important will be assumptions on the independence between variables, which corresponds to factorizations of the joint distribution. Much more on this later.

Parameterized Distributions

Before we continue with learning the parameters of a joint distribution, let's consider a simple parameterized distribution and develop some intuition.

Distribution over discrete random variables

Let's take a toy example of discrete random variables.

\[ T: \text{Temperature} \; ; \; h = \text{"hot" or } c = \text{"cold"} \] \[ W: \text{Weather} \; ; \; s = \text{"sunny" or } r = \text{"raining"} \]

To parameterize a distribution over these discrete variables, we can assign probabilities for each possible state:

Parameters for marginal distribution over temperature \(P(T)\)

h 0.40
c 0.60

Parameters for marginal distribution over weather \(P(W)\)

s 0.70
r 0.30

\[ P(T=h) = 0.40 \] \[ P(T=c) = 0.60 \]

\[ P(W=s) = 0.70 \] \[ P(W=r) = 0.30 \]

and that these states define a valid probability distribution, so

\[P(X) \ge 0 \\\\ \sum_x P(X = x) = 1\]

We could create a parameterized, probabilistic model, \(P(T, W)\) over the states

\[P(T | \theta_T) ; \theta_T = \begin{bmatrix} 0.4 \\ 0.6 \end{bmatrix}\] \[P(W | \theta_W) ; \theta_W = \begin{bmatrix} 0.7 \\ 0.3 \end{bmatrix}\]

Notice that \(\theta_T\) and \(\theta_W\) are the probability distributions of our random variables. Our parameters define the probability of the data explicitly and store it in a vector.

We can represent the joint distribution \(P(T, W)\), our model, as:

h s 0.28
c r 0.18
h r 0.12
c s 0.42

from the joint distribution we can compute the marginals

\[ P(T=h) = \sum_w P(T=h, W=w) = 0.40 \]
\[ P(T=c) = \sum_w P(T=c, W=w) = 0.60 \]
\[ P(W=s) = \sum_t P(W=s, T=t) = 0.70 \]
\[ P(W=r) = \sum_t P(W=r, T=t) = 0.30 \]

we could also ask questions about conditional probabilities, like

\[P(W = s | T = c) = \frac{P(W=s,T=c)}{P(T=c)} = \frac{P(W=s,T=c)}{\sum_w P(T=c, W=w)} = \frac{0.42}{0.60} = 0.64\]


  • Given a joint distribution, we can compute both marginal and conditional distributions
  • We'll consider distributions as equivalent to their parameters
  • We can represent distributions by arrays of their parameters
  • Operations like marginalizing and conditioning variables can be interpreted as operations on arrays of parameters.

Dimensionality of the Joint Distribution

The size of array representing a joint distribution can be huge if there are many variables and if the variables take on many states.

To see this, it is helpful to think of the joint distribution over a set of random variables as a grid with \(k^n\) cubes, where \(n\) is the number of variables and \(k\) the number of states each variable can take. For simplicity, this assumes each variable takes on the same number of states, \(k\). Though it is not difficult to extend this to the more general case.


For example, the joint distribution over weather, \(W = \{sun,rain\}\), and temperature \(T = \{cold,hot\}\) where each variable can take 2 states can be represented by a \(2 \times 2\) grid of parameters.

If we extend the possible weather states to include snow and fog, \(W = \{sun,rain, snow, fog\}\) then the parameterization of the joint would require a \(2 \times 4\) grid of parameters.

If we also want our model to capture the mode of transportation for how we will get to class, \({M = \{walk, bike, ttc\}}\), then the parameterization would now require a $ 2 \times 4 \times 3$ cube of parameters.


Actually, this is not quite accurate. In all cases we would require 1 fewer parameters to fully specify these distributions. This is due to the requirement that \(\sum_x P(X = x) = 1\). If we know all but 1 parameter, then we can always solve for the remaining parameter.

Reducing the dimensionality of the joint

Most of this course will be concerned with addressing the huge number of parameters required to fully specify a joint distribution over many variables.

The primary way we will achieve this is to make assumptions about the independence between variables.

The above discussion of the dimensionality of the joint is in the case where we assume all random variables are indpenednent. This is maximally expressive, but requires us to parameterize every possible state. Making independence assumptions will restrict the expressive capabilities of our model, but will allow us to significantly reduce the number of parameters to specify the joint distribution.

For example, we can assume that \(T\) and \(W\) are independent (this is maybe a bad assumption!), but that our mode of transportation depends on temperature and weather.

\[P(T, W, M) = P(T)P(W)P(M|T, W)\]

This joint factorization encodes the independence assumptions, and requires fewer parameters to represent.

Much more on this idea throughout the course.

Likelihood function

So far, we have focused on the probability density function \(p(x|\theta)\) which assigns a probability to any joint configuration of variables \(x\) given fixed parameters \(\theta\). But our goal is to learn \(\theta\), which is not fixed.

We are asking, "for a given \(x\) what is the best \(\theta\)?" For this reason, it is helpful to think of the (log) probability as a function of \(\theta\) for a fixed \(x\)!

To do this, we define the log likelihood function of \(\theta\) for a fixed \(x\):

\[\ell(\theta ; x) = \log p(x|\theta)\]


The likelihood function is essentially a notational trick in order to make it easy to talk about our data as a function of our parameters.

For I.I.D. data

\[p(\mathcal D | \theta) = \prod_m p(x^{(m)} | \theta)\] \[\ell (\theta ; D) = \sum_m \log p(x^{(m)} | \theta)\]

The IID assumption turns the likelihood into a product over the individual observations.

The \(\log\) turns the product into a sum, making the derivative easy to compute term by term.


The negative log likelihood, \(NLL(\theta ; D)\), simply introduces a negative sign so our optimization problem becomes a minimization, that is, maximizing \(\ell (\theta ; D)\) is equivalent to minimizing \(NLL(\theta ; D)\).

The process of learning is choosing \(\theta\) to minimize some cost or loss function, \(L(\theta)\) which includes \(\ell (\theta)\). This can be done in a couple of ways, including:

Maximum Likelihood Estimation

Very intuitive idea: pick parameter values which were most likely to have generated the data

\[\hat \theta_{MLE} = \underset{\theta}{\operatorname{argmax}} \ell(\theta ; \mathcal D)\]

Sufficient statistics

A statistic is a (possibly vector valued) deterministic function of a (set of) random variable(s).

A sufficient statistic is a statistic that conveys exactly the same information about the generative distribution as the entire data itself. I.e., Inferences made from sufficient statistic, \(T(x)\), are the same as those obtained from the entire data.

More formally, we say that \(T(X)\) is a sufficient statistic for \(X\) if

\[T(x^{(1)}) = T(x^{(2)}) \Rightarrow L(\theta ; x^{(1)}) = L(\theta; x^{(2)}) \ \forall \theta\]

Put another way

\[P(\theta | T(X)) = P(\theta | X)\]


Why is this useful? Well, if we have a particular large data sample, a lot of the data may be redundant. If we knew the sufficient statistic for that sample, we could use it in place of the full data sample.

Equivalently (by the Neyman factorization theorem) we can write

\[P(\theta | T(X)) = h(x, T(x))g(T(x), \theta)\]

An example is the exponential family

\[p(x | \eta) = h(x)\exp\{\eta^TT(x)-A(\eta)\}\]

or, equivalently

\[p(x | \eta) = h(x)A(\eta)\exp\{\eta^TT(x)\}\]

Example: Sufficient Statistics and MLE for Bernoulli

Let us take the example of flipping an unfair coin, with unfairness favouring heads is captured by parameter \(\theta\). This process that generates our data can be modeled as a Bernoulli distribution

\[X \backsim \text{Ber}(\theta)\]

where \(X\) is a random variable and \(x_i\) represents the result of the ith coin flip

\[x_i = 1 \text{ , if heads with probability} \; \theta \] \[x_i = 0 \text{ , if tails with probability} \; (1 - \theta)\]o

The dataset are \(N\) observations \(X= \{ 1, 0, 0, 1, 0, 1, \dots\}\)

the likelihood (assuming independence between flips of the coin) is

\[ \begin{aligned} l(\theta ; D) &= \log p( D \vert \theta)\\ & =\log \prod_{i=1}^N \theta^{x_i} (1 - \theta)^{(1-x_i)}\\ & =\sum_{i=1}^N \log \theta^{x_i} + \log (1 - \theta)^{(1-x_i)}\\ & =\sum_{i=1}^N \log (x_i)\theta + \log (1-x_i)(1 - \theta)\\ & =\log \theta \sum_{i=1}^N (x_i) + \log (1 - \theta)\sum_{i=1}^N (1-x_i) \\ & =\log \theta N_H + \log (1 - \theta) N_T \\ \end{aligned} \]

So we notice here that our likelihood depends on \(N_H = \sum_{i=1}^N x_i\) and \(N_T\). In other words, the only aspect of our data that affects the likelihood is the counts. This tells us that if we know this summary statistic, which we will call \(T(x) = \sum_{i=1}^N x_i\) then essentially we know everything that is useful from our sample to do inference.


The sufficient statistics are not unique. For example, we could express as normalize counts \(\frac{N_H}{N}\)

Example: Bernoulli MLE

We have just seen that the data affects the likelihood through the sufficient statistics, \(T(X) = N_H\):

\[l(\theta ; D) =\log \theta T(X) + \log (1 - \theta) (N-T(X))\]

We also saw that the maximum likelihood estimate is given by

\[\hat \theta_{MLE} = \underset{\theta}{\operatorname{argmax}} \ell(\theta ; \mathcal D)\]

To obtain the maximum we take the derivative, set equal to zero, and solve:

\[ \frac{\partial l}{\partial \theta} = \frac{t(x)}{\theta} - \frac{n - t(x)}{1-\theta} = 0\\ \rightarrow \hat \theta_{mle} = \frac{t(x)}{n} \\ \]

Therefore, our Maximum Likelihood Estimate for the parameters of the Bernoulli distribution is just the normalized count!

Example: Sufficient Statistics and MLE for Multinomial

A random variable which takes on \(i \in \left[1, \dots, K\right]\) discrete states each with probability \(\theta_i\). E.g. K-bit pixels, classes, unfair dice.

\(D = \{ 1, 3, K, 2, \dots\}\) for \(N\) observations.

The model is \(p(x_n = i \vert \theta) = \theta_i)\) with the constraint \(\sum_i \theta_i = 1\)

\[ \begin{aligned} l(\theta ; D) &= \log p(D \vert \theta)\\ & = \log \prod_n \theta_1^{[x_n = 1]}\cdots \theta_K^{[x_n = K]}\\ & = \log \prod_n \prod_i \theta_i^{[x_n = i]}\\ & = \sum_n \sum_i {[x_n = i]}\log \theta_i\\ & = \sum_i \log \theta_i \sum_n {[x_n = i]}\\ & = \sum_i \log \theta_i N_i \end{aligned} \]

Therefore, the sufficient statistics for the multinomial distribution are the counts \(N_i\).

The MLE for this distribution will be found by taking the derivative, setting equal to zero, and solving:

\[ \frac{\partial l}{\partial \theta_i} = \frac{N_i}{\theta_i} - N = 0\\ \Rightarrow \hat \theta_i^\star = \frac{N_i}{N} \\ \]

Therefore, the maximum likelihood estimate for the class parameters in a multivariate distribution are the normalized counts for each class.


This derivative is tricky to derive manually, because it requires enforcing the constraint that \(\sum_k \theta_k = 1\)

Sufficient Statistics and MLE for general Exponential Family

The result of the previous example distributions show that the MLE are just normalized counts. However, the simplicity of the sufficient statistics and MLE are due to those being members of the exponential family.

In general, exponential family members have simple sufficient statistics and MLE for the natural statistics \(\eta\):

\[p(x | \eta) = h(x)\exp\{\eta^TT(x)-A(\eta)\}\]

with log-likelihood:

\[ \begin{aligned} l(\eta ; D) &= \log p(D ; \eta)\\ &= (\sum_n \log h(x_n)) - N A(\eta) + (\eta^T \sum_n T(x_n)) \end{aligned} \]

And finding the derivative and setting to zero for the MLE derivation:

\[ \frac{\partial l}{\partial \eta} = -N \frac{\partial A(\eta)}{\partial \eta} + \sum_n T(x_n) \]

And the derivatives of the log-normalizer \(A(\eta)\) is

\[ \frac{\partial A(\eta)}{\partial \eta} = \frac{1}{N} \sum_n T(x_n) \]

The MLE for the natural parameters \(\eta\) of a general exponential family is

\[ \eta_\text{MLE} = \frac{1}{N} \sum_n T(x_n) \]

so, normalized counts of the data.

Example: Sufficient Statistics and MLE for Univariate Normal

The data will be \(N\) i.i.d. samples \(\{x_i\}_1^N \in \mathbb{R}\)

The model is

\[ \begin{aligned} p(x \vert \theta) &= p(x | \mu, \sigma) \\ &= \mathcal{N}(x \vert \mu, \theta)\\ &= \frac{1}{\sqrt{2\pi}\sigma}\exp\{-\frac{1}{2\sigma^2}(x-\mu)^2\} \\ \end{aligned} \]

Gaussian distribution is a member of the exponential family, so we can put it into a natural form

\[ \begin{aligned} p(x \vert \theta)&= \frac{1}{\sqrt{2\pi}\sigma}\exp\{-\frac{1}{2\sigma^2}(x-\mu)^2\} \\ &= \frac{1}{\sqrt{2 \pi} \sigma} \exp \{-\frac{1}{2\sigma^2}x^2 + \frac{1}{\sigma^2}x\mu - \frac{1}{2\sigma^2}\mu^2\} \\ &= \frac{1}{\sqrt{2 \pi} \sigma} \exp\{- \frac{1}{2\sigma^2}\mu^2\} \exp \{ \begin{bmatrix}\frac{\mu}{\sigma^2} \frac{-1}{2\sigma^2}\end{bmatrix} \begin{bmatrix}x \\\ x^2\end{bmatrix} \} \end{aligned} \]

from here, it is clear that the natural parameters and the sufficient statistics are

  • \(\eta = \begin{bmatrix}\frac{\mu}{\sigma^2}\\ \frac{-1}{2\sigma^2}\end{bmatrix}\)
  • \(T(x) = \begin{bmatrix}x \\\ x^2\end{bmatrix}\)

re-writing in terms of \(\eta\)

\[ p(x \vert \eta) = (\sqrt{2 \pi})^{-\frac{1}{2}} \cdot (-2\eta_2)^{\frac{1}{2}} \cdot \exp\{\frac{\eta_1^2}{4\eta_2}\} \cdot \exp \{ \eta^T T(x) \} \\ \]

noting that

  • \(h(x) = (\sqrt{2 \pi})^{-\frac{1}{2}}\)
  • \(A(\eta) = (-2\eta_2)^{\frac{1}{2}} \cdot \exp\{\frac{\eta_1^2}{4\eta_2}\}\)

At this point we can use the general result from the exponential family, or take the derivatives in this form to find the MLE for those natural statistics \(\eta\).

However, we often prefer to work with the parameterization by \(\theta = [\mu, \sigma]\), so let's see this instead.

\[ \begin{aligned} l(\theta; D) = \log p(D \vert \theta) &= \log \prod_n p(x_n \vert \theta)\\ &= \log \prod_n \frac{1}{\sqrt{2\pi \sigma^2}}\exp\{-\frac{1}{2\sigma^2}(x_n-\mu)^2\}\\ &= \sum_n -\frac{1}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2}(x_n-\mu)^2\}\\ &= -\frac{N}{2} \log(2\pi \sigma^2) - \frac{1}{2} \sum_n \frac{(x_n-\mu)^2}{\sigma^2}\}\\ \end{aligned} \]

Solving for the derivatives of \(\theta\):

\[ \frac{\partial l}{\partial \mu} = 0 + \frac{1}{\sigma^2}\sum_n {x_n - \mu} = 0\\ \Rightarrow \mu_\text{MLE}= (\frac{1}{N} \sum_n x_n) \]

So the MLE for the mean of a Gaussian is the mean of the data, intuitive!

\[ \frac{\partial l}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2 \sigma^4} \sum_n (x_n^2 - \mu_\text{MLE}^2)=0\\ \Rightarrow \sigma_\text{MLE}^2 = \frac{1}{N} \sum_n (x_n^2 - \mu_\text{MLE}^2) \]

Also the MLE for the variance looks like the variance of the data.

That these are functions of the counts of the data is consistent with our general result for members of the exponential family.

Example: MLE for Linear Regression

Consider a dataset that looks like \(N\) observations of data, target pairs:

\[ D = \{ (x_i, y_i) \}_{i=1}^N \]

and our linear regression model is a Gaussian with mean given by linear combination of the parameters and the data, \(\mu = \sigma^T x\) with some noise \(\sigma^2\):

\[ p(y \vert x; \theta) = \mathcal{N}(y \vert \theta^Tx, \sigma^2)\]

So the log-likelihood of the dataset under the model is

\[ l(\theta; D) = -\frac{1}{2\sigma^2} \sum_n^N (y_n - \theta^Tx_n)^2\]

and the MLE is given by finding the derivative, setting to zero, and solving:

\[\frac{\partial l}{\partial \theta} = - \sum_n (y_n - \theta^T x_n) x_n\\ \Rightarrow \theta_\text{MLE} = (x^Tx)^{-1} X^TY\]

These sufficient statistics are also a kind of count:

  • \(X^TX\) is the input correlation matrix
  • \(X^TY\) is the input-output cross correlation matrix


See Lecture 2 slides 10-13 for more examples.

Summary of Probabilistic Models

In general, learning the parameters of a probabilistic model depends on whether our variables are observed or partially observed, continuous or discrete

Continuous Discrete
Fully observed variables Bespoke estimates from calculus Normalized counts
Partially observed variables Variational inference, recognition networks, MCMC Message passing, variable elimination, junction tree


Useful Resources

  • Helpful video on sufficient statistics.

Glossary of Terms