MLE and MAP Estimation

In this short tutorial we review how to do Maximum Likelihood (MLE) and Maximum a Posteriori (MAP) estimation in Pyro.

import torch
from torch.distributions import constraints
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO

We consider the simple “fair coin” example covered in a previous tutorial.

data = torch.zeros(10)
data[0:6] = 1.0

def original_model(data):
    f = pyro.sample("latent_fairness", dist.Beta(10.0, 10.0))
    with pyro.plate("data", data.size(0)):
        pyro.sample("obs", dist.Bernoulli(f), obs=data)

To facilitate comparison between different inference techniques, we construct a training helper:

def train(model, guide, lr=0.01):
    # start each run from freshly initialized parameters
    pyro.clear_param_store()
    adam = pyro.optim.Adam({"lr": lr})
    svi = SVI(model, guide, adam, loss=Trace_ELBO())

    n_steps = 101
    for step in range(n_steps):
        loss = svi.step(data)
        if step % 50 == 0:
            print('[iter {}]  loss: {:.4f}'.format(step, loss))


Our model has a single latent variable, latent_fairness. To do Maximum Likelihood Estimation, we simply “demote” this latent variable to a Pyro parameter.

def model_mle(data):
    # note that we need to include the interval constraint;
    # in original_model() this constraint appears implicitly in
    # the support of the Beta distribution.
    f = pyro.param("latent_fairness", torch.tensor(0.5),
                   constraint=constraints.unit_interval)
    with pyro.plate("data", data.size(0)):
        pyro.sample("obs", dist.Bernoulli(f), obs=data)

Since we no longer have any latent variables, our guide can be empty:

def guide_mle(data):
    pass

Let’s see what result we get.

train(model_mle, guide_mle)
[iter 0]  loss: 6.9315
[iter 50]  loss: 6.7310
[iter 100]  loss: 6.7301
print("Our MLE estimate of the latent fairness is {:.3f}".format(
      pyro.param("latent_fairness").item()))
Our MLE estimate of the latent fairness is 0.601

Thus with MLE we get a point estimate of latent_fairness.
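
As a sanity check, note that the Bernoulli likelihood has a closed-form maximizer: the fraction of heads in the data. The sketch below uses only the Python standard library (no Pyro), and the helper name bernoulli_nll is introduced here purely for illustration. It assumes that, with an empty guide, the Trace_ELBO loss reduces to the negative log-likelihood of the data, which is consistent with the loss values printed above.

```python
import math

# Same data as in the tutorial: 6 heads (1.0) followed by 4 tails (0.0).
data = [1.0] * 6 + [0.0] * 4

def bernoulli_nll(f, data):
    """Negative log-likelihood of i.i.d. Bernoulli(f) observations."""
    return -sum(math.log(f) if x == 1.0 else math.log(1.0 - f) for x in data)

# Closed-form MLE: the empirical fraction of heads.
f_mle = sum(data) / len(data)

print("closed-form MLE: {:.3f}".format(f_mle))
print("NLL at f = 0.5:  {:.4f}".format(bernoulli_nll(0.5, data)))
print("NLL at the MLE:  {:.4f}".format(bernoulli_nll(f_mle, data)))
```

The closed-form estimate is 0.600, and the two negative log-likelihoods (6.9315 and 6.7301) agree with the losses reported at iterations 0 and 100. The gradient-based estimate of 0.601 is close to, but not exactly, 0.6, presumably because the optimizer ran for only 101 steps.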


With Maximum a Posteriori estimation, we also get a point estimate of our latent variables. Unlike in MLE, these estimates are regularized by the prior.

To do MAP in Pyro we use a Delta distribution for the guide. Recall that the Delta distribution puts all its probability mass at a single value. The Delta distribution will be parameterized by a learnable parameter.

def guide_map(data):
    f_map = pyro.param("f_map", torch.tensor(0.5),
                       constraint=constraints.unit_interval)
    pyro.sample("latent_fairness", dist.Delta(f_map))
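
To see why a Delta guide yields a MAP estimate, it may help to write out the ELBO. The following is a sketch in notation that does not appear in the original: $x$ denotes the data, $f$ denotes latent_fairness, and $\hat{f}$ denotes the learnable parameter f_map. It assumes that the Delta distribution contributes zero log-density at its point mass, which is the default in Pyro.

```latex
\begin{aligned}
q(f) &= \delta(f - \hat{f}), \\
\mathrm{ELBO}(\hat{f}) &= \mathbb{E}_{q(f)}\big[\log p(x, f) - \log q(f)\big]
  = \log p(x \mid \hat{f}) + \log p(\hat{f}), \\
\arg\max_{\hat{f}} \mathrm{ELBO}(\hat{f}) &= \arg\max_{\hat{f}} \, p(\hat{f} \mid x).
\end{aligned}
```

Under this reading, the loss that SVI minimizes is the negative log joint, $-\log p(x \mid \hat{f}) - \log p(\hat{f})$, so the optimal f_map is the mode of the posterior.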

Let’s see how this result differs from MLE.

train(original_model, guide_map)
[iter 0]  loss: 5.6719
[iter 50]  loss: 5.6006
[iter 100]  loss: 5.6004
print("Our MAP estimate of the latent fairness is {:.3f}".format(
      pyro.param("f_map").item()))
Our MAP estimate of the latent fairness is 0.536

To understand what’s going on, note that the prior mean of latent_fairness in our model is 0.5, since that is the mean of Beta(10.0, 10.0). The MLE estimate (which ignores the prior) is entirely determined by the raw counts (6 heads and 4 tails). In contrast, the MAP estimate is regularized towards the prior mean, which is why it lies between 0.5 and 0.6.
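
Because the Beta prior is conjugate to the Bernoulli likelihood, the MAP estimate can also be checked in closed form. The sketch below uses only the Python standard library (no Pyro). The helper name neg_log_joint is introduced here for illustration, and the code assumes that the loss for a Delta guide equals the negative log joint density of the data and the latent value.

```python
import math

heads, tails = 6, 4          # counts from the tutorial's data
alpha, beta = 10.0, 10.0     # Beta(10.0, 10.0) prior from original_model

# Conjugacy: the posterior is Beta(alpha + heads, beta + tails).
post_alpha, post_beta = alpha + heads, beta + tails

# The mode of Beta(a, b) with a, b > 1 is (a - 1) / (a + b - 2).
f_map = (post_alpha - 1.0) / (post_alpha + post_beta - 2.0)

def neg_log_joint(f):
    """-log p(data, f): Bernoulli log-likelihood plus Beta log-prior, negated."""
    log_lik = heads * math.log(f) + tails * math.log(1.0 - f)
    # log of the Beta function B(alpha, beta), the prior's normalizer
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    log_prior = ((alpha - 1.0) * math.log(f)
                 + (beta - 1.0) * math.log(1.0 - f) - log_norm)
    return -(log_lik + log_prior)

print("closed-form MAP: {:.3f}".format(f_map))
print("loss at f = 0.5: {:.4f}".format(neg_log_joint(0.5)))
print("loss at the MAP: {:.4f}".format(neg_log_joint(f_map)))
```

The posterior is Beta(16, 14), whose mode is 15/28, or about 0.536, matching the estimate found by SVI. The two values of the negative log joint (5.6719 and 5.6004) also agree with the losses reported at iterations 0 and 100 of the MAP run.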