Bayesian Regression - Introduction (Part 1)

Regression is one of the most common and basic supervised learning tasks in machine learning. Suppose we’re given a dataset \(\mathcal{D}\) of the form

\[\mathcal{D} = \{ (X_i, y_i) \} \qquad \text{for}\qquad i=1,2,...,N\]

The goal of linear regression is to fit a function to the data of the form:

\[y = w X + b + \epsilon\]

where \(w\) and \(b\) are learnable parameters and \(\epsilon\) represents observation noise. Specifically \(w\) is a matrix of weights and \(b\) is a bias vector.

Let’s first implement linear regression in PyTorch and learn point estimates for the parameters \(w\) and \(b\). Then we’ll see how to incorporate uncertainty into our estimates by using Pyro to implement Bayesian regression.


Let’s begin by importing the modules we’ll need.

In [1]:
import os
from functools import partial
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.nn as nn

import matplotlib.pyplot as plt

import pyro
from pyro.distributions import Normal, Uniform, Delta
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam
from pyro.distributions.util import logsumexp
from pyro.infer import EmpiricalMarginal, SVI, Trace_ELBO, TracePredictive
from pyro.infer.mcmc import MCMC, NUTS
import pyro.optim as optim
import pyro.poutine as poutine

# for CI testing
smoke_test = ('CI' in os.environ)
assert pyro.__version__.startswith('0.3.0')


The following example is adapted from [1]. We would like to explore the relationship between topographic heterogeneity of a nation as measured by the Terrain Ruggedness Index (variable rugged in the dataset) and its GDP per capita. In particular, it was noted by the authors in [1] that terrain ruggedness or bad geography is related to poorer economic performance outside of Africa, but rugged terrains have had a reverse effect on income for African nations. Let us look at the data [2] and investigate this relationship. We will be focusing on three features from the dataset: - rugged: quantifies the Terrain Ruggedness Index - cont_africa: whether the given nation is in Africa - rgdppc_2000: Real GDP per capita for the year 2000

We will take the logarithm for the response variable GDP as it tends to vary exponentially.

In [2]:
data = pd.read_csv(DATA_URL, encoding="ISO-8859-1")
df = data[["cont_africa", "rugged", "rgdppc_2000"]]
df = df[np.isfinite(df.rgdppc_2000)]
df["rgdppc_2000"] = np.log(df["rgdppc_2000"])
In [3]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 6), sharey=True)
african_nations = data[data["cont_africa"] == 1]
non_african_nations = data[data["cont_africa"] == 0]
ax[0].set(xlabel="Terrain Ruggedness Index",
          ylabel="log GDP (2000)",
          title="Non African Nations")
ax[1].set(xlabel="Terrain Ruggedness Index",
          ylabel="log GDP (2000)",
          title="African Nations")
[Text(0, 0.5, 'log GDP (2000)'),
 Text(0.5, 0, 'Terrain Ruggedness Index'),
 Text(0.5, 1.0, 'African Nations')]

Linear Regression

We would like to predict log GDP per capita of a nation as a function of two features from the dataset - whether the nation is in Africa, and its Terrain Ruggedness Index. Let’s define our regression model. We’ll use PyTorch’s nn.Module for this. Our input \(X\) is a matrix of size \(N \times 2\) and our output \(y\) is a vector of size \(2 \times 1\). The function nn.Linear(p, 1) defines a linear transformation of the form \(Xw + b\) where \(w\) is the weight matrix and \(b\) is the additive bias. We include an extra self.factor term meant to capture the correlation between ruggedness and whether a country is in Africa.

Note that we can easily make this a logistic regression by adding a non-linearity in the forward() method.

In [4]:
class RegressionModel(nn.Module):
    def __init__(self, p):
        # p = number of features
        super(RegressionModel, self).__init__()
        self.linear = nn.Linear(p, 1)
        self.factor = nn.Parameter(torch.tensor(1.))

    def forward(self, x):
        return self.linear(x) + (self.factor * x[:, 0] * x[:, 1]).unsqueeze(1)

p = 2  # number of features
regression_model = RegressionModel(p)


We will use the mean squared error (MSE) as our loss and Adam as our optimizer. We would like to optimize the parameters of the regression_model neural net above. We will use a somewhat large learning rate of 0.05 and run for 500 iterations.

In [5]:
loss_fn = torch.nn.MSELoss(reduction='sum')
optim = torch.optim.Adam(regression_model.parameters(), lr=0.05)
num_iterations = 1000 if not smoke_test else 2
data = torch.tensor(df.values, dtype=torch.float)
x_data, y_data = data[:, :-1], data[:, -1]

def main():
    x_data = data[:, :-1]
    y_data = data[:, -1]
    for j in range(num_iterations):
        # run the model forward on the data
        y_pred = regression_model(x_data).squeeze(-1)
        # calculate the mse loss
        loss = loss_fn(y_pred, y_data)
        # initialize gradients to zero
        # backpropagate
        # take a gradient step
        if (j + 1) % 50 == 0:
            print("[iteration %04d] loss: %.4f" % (j + 1, loss.item()))
    # Inspect learned parameters
    print("Learned parameters:")
    for name, param in regression_model.named_parameters():

[iteration 0050] loss: 3723.7849
[iteration 0100] loss: 1777.9851
[iteration 0150] loss: 1226.7543
[iteration 0200] loss: 927.1514
[iteration 0250] loss: 706.6473
[iteration 0300] loss: 536.0911
[iteration 0350] loss: 408.0940
[iteration 0400] loss: 316.1810
[iteration 0450] loss: 252.9711
[iteration 0500] loss: 211.2545
[iteration 0550] loss: 184.7950
[iteration 0600] loss: 168.6502
[iteration 0650] loss: 159.1673
[iteration 0700] loss: 153.8036
[iteration 0750] loss: 150.8815
[iteration 0800] loss: 149.3482
[iteration 0850] loss: 148.5732
[iteration 0900] loss: 148.1960
[iteration 0950] loss: 148.0193
[iteration 1000] loss: 147.9397
Learned parameters:
factor 0.37248382
linear.weight [[-1.90511    -0.18619268]]
linear.bias [9.188872]

Bayesian modeling offers a systematic framework for reasoning about model uncertainty. Instead of just learning point estimates, we’re going to learn a distribution over variables that are consistent with the observed data.

Bayesian Regression

In order to make our linear regression Bayesian, we need to put priors on the parameters \(w\) and \(b\). These are distributions that represent our prior belief about reasonable values for \(w\) and \(b\) (before observing any data).


In order to do this, we’ll ‘lift’ the parameters of our existing model to random variables. We can do this in Pyro via random_module(), which effectively takes a given nn.Module and turns it into a distribution over the same module; in our case, this will be a distribution over regressors. Specifically, each parameter in the original regression model is sampled from the provided prior. This allows us to repurpose vanilla regression models for use in the Bayesian setting. For example:

loc = torch.zeros(1, 1)
scale = torch.ones(1, 1)
# define a unit normal prior
prior = Normal(loc, scale)
# overload the parameters in the regression module with samples from the prior
lifted_module = pyro.random_module("regression_module", nn, prior)
# sample a nn from the prior
sampled_reg_model = lifted_module()


We now have all the ingredients needed to specify our model. First we define priors over weights, biases, and factor. Note the priors that we are using for the different latent variables in the model. The prior on the intercept parameter is very flat as we would like this to be learnt from the data. We are using a weakly regularizing prior on the regression coefficients to avoid overfitting to the data.

We wrap regression_model with random_module and sample an instance of the regressor, lifted_reg_model. We then run the regressor on x_data. Finally we use the obs argument to the pyro.sample statement to condition on the observed data y_data with a learned observation noise sigma.

In [6]:
def model(x_data, y_data):
    # weight and bias priors
    w_prior = Normal(torch.zeros(1, 2), torch.ones(1, 2)).to_event(1)
    b_prior = Normal(torch.tensor([[8.]]), torch.tensor([[1000.]])).to_event(1)
    f_prior = Normal(0., 1.)
    priors = {'linear.weight': w_prior, 'linear.bias': b_prior, 'factor': f_prior}
    scale = pyro.sample("sigma", Uniform(0., 10.))
    # lift module parameters to random variables sampled from the priors
    lifted_module = pyro.random_module("module", regression_model, priors)
    # sample a nn (which also samples w and b)
    lifted_reg_model = lifted_module()
    with pyro.plate("map", len(x_data)):
        # run the nn forward on data
        prediction_mean = lifted_reg_model(x_data).squeeze(-1)
        # condition on the observed data
                    Normal(prediction_mean, scale),
        return prediction_mean


In order to do inference we’re going to need a guide, i.e. a variational family of distributions. We will use Pyro’s autoguide library to automatically place Gaussians with diagonal covariance on all of the distributions in the model. Under the hood, this defines a guide function with Normal distributions with learnable parameters corresponding to each sample() in the model. Autoguide also supports learning MAP estimates with AutoDelta or composing guides with AutoGuideList (see the docs for more information). In Part II we will explore how to write guides by hand.

In [7]:
from pyro.contrib.autoguide import AutoDiagonalNormal
guide = AutoDiagonalNormal(model)


To do inference we’ll use stochastic variational inference (SVI) (for an introduction to SVI, see SVI Part I). Just like in the non-Bayesian linear regression, each iteration of our training loop will take a gradient step, with the difference that in this case, we’ll use the ELBO objective instead of the MSE loss by constructing a Trace_ELBO object that we pass to SVI.

In [8]:
optim = Adam({"lr": 0.03})
svi = SVI(model, guide, optim, loss=Trace_ELBO(), num_samples=1000)

Here Adam is a thin wrapper around torch.optim.Adam (see here for a discussion). To take an ELBO gradient step we simply call the step method of SVI. Notice that the data argument we pass to step will be passed to both model() and guide(). The complete training loop is as follows:

In [9]:
def train():
    for j in range(num_iterations):
        # calculate the loss and take a gradient step
        loss = svi.step(x_data, y_data)
        if j % 100 == 0:
            print("[iteration %04d] loss: %.4f" % (j + 1, loss / len(data)))

[iteration 0001] loss: 18.3907
[iteration 0101] loss: 3.1146
[iteration 0201] loss: 3.1067
[iteration 0301] loss: 2.8602
[iteration 0401] loss: 2.7770
[iteration 0501] loss: 2.6181
[iteration 0601] loss: 2.4298
[iteration 0701] loss: 2.0160
[iteration 0801] loss: 1.7814
[iteration 0901] loss: 1.5811
In [10]:
for name, value in pyro.get_param_store().items():
    print(name, pyro.param(name))
auto_loc tensor([-2.2026,  0.2936, -1.8873, -0.1607,  9.1753], requires_grad=True)
auto_scale tensor([0.2285, 0.0954, 0.1376, 0.0600, 0.1042], grad_fn=<AddBackward0>)

As you can see, instead of just point estimates, we now have uncertainty estimates (auto_scale) for our learned parameters. Note that Autoguide packs the latent variables into a tensor, in this case, one entry per variable sampled in our model.

Model Evaluation

To evaluate our model, we’ll generate some predictive samples and look at the posteriors. Since our variational distribution is fully parameterized, we can just run the lifted model forward. We wrap the model with a Delta distribution in order to register the values with Pyro. We then store the execution traces in the posterior object with

In [11]:
get_marginal = lambda traces, sites:EmpiricalMarginal(traces, sites)._get_samples_and_weights()[0].detach().cpu().numpy()

def summary(traces, sites):
    marginal = get_marginal(traces, sites)
    site_stats = {}
    for i in range(marginal.shape[1]):
        site_name = sites[i]
        marginal_site = pd.DataFrame(marginal[:, i]).transpose()
        describe = partial(pd.Series.describe, percentiles=[.05, 0.25, 0.5, 0.75, 0.95])
        site_stats[site_name] = marginal_site.apply(describe, axis=1) \
            [["mean", "std", "5%", "25%", "50%", "75%", "95%"]]
    return site_stats

def wrapped_model(x_data, y_data):
    pyro.sample("prediction", Delta(model(x_data, y_data)))

posterior =, y_data)

In [12]:
# posterior predictive distribution we can get samples from
trace_pred = TracePredictive(wrapped_model,
post_pred =, None)
post_summary = summary(post_pred, sites= ['prediction', 'obs'])
mu = post_summary["prediction"]
y = post_summary["obs"]
predictions = pd.DataFrame({
    "cont_africa": x_data[:, 0],
    "rugged": x_data[:, 1],
    "mu_mean": mu["mean"],
    "mu_perc_5": mu["5%"],
    "mu_perc_95": mu["95%"],
    "y_mean": y["mean"],
    "y_perc_5": y["5%"],
    "y_perc_95": y["95%"],
    "true_gdp": y_data,
In [13]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 6), sharey=True)
african_nations = predictions[predictions["cont_africa"] == 1]
non_african_nations = predictions[predictions["cont_africa"] == 0]
african_nations = african_nations.sort_values(by=["rugged"])
non_african_nations = non_african_nations.sort_values(by=["rugged"])
fig.suptitle("Regression line 90% CI", fontsize=16)
ax[0].set(xlabel="Terrain Ruggedness Index",
          ylabel="log GDP (2000)",
          title="Non African Nations")
idx = np.argsort(african_nations["rugged"])
ax[1].set(xlabel="Terrain Ruggedness Index",
          ylabel="log GDP (2000)",
          title="African Nations")
[Text(0, 0.5, 'log GDP (2000)'),
 Text(0.5, 0, 'Terrain Ruggedness Index'),
 Text(0.5, 1.0, 'African Nations')]

The above figure shows the uncertainty in our estimate of the regression line. Note that for lower values of ruggedness there are many more data points, and as such, lesser wiggle room for the line of best fit. This is reflected in the 90% CI around the mean. We can also see that most of the data points actually lie outside the 90% CI, and this is expected because we have not plotted the outcome variable which will be affected by sigma! Let us do so next.

In [14]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 6), sharey=True)
fig.suptitle("Posterior predictive distribution with 90% CI", fontsize=16)
ax[0].set(xlabel="Terrain Ruggedness Index",
          ylabel="log GDP (2000)",
          title="Non African Nations")
idx = np.argsort(african_nations["rugged"])

ax[1].set(xlabel="Terrain Ruggedness Index",
          ylabel="log GDP (2000)",
          title="African Nations")
[Text(0, 0.5, 'log GDP (2000)'),
 Text(0.5, 0, 'Terrain Ruggedness Index'),
 Text(0.5, 1.0, 'African Nations')]

We observe that the outcome from our model and the 90% CI accounts for the majority of the data points that we observe in practice. It is usually a good idea to do such posterior predictive checks to see if our model gives valid predictions.

In [15]:
# we need to prepend `module$$$` to all parameters of nn.Modules since
# that is how they are stored in the ParamStore
weight = get_marginal(posterior, ['module$$$linear.weight']).squeeze(1).squeeze(1)
factor = get_marginal(posterior, ['module$$$factor'])
gamma_within_africa = weight[:, 1] + factor.squeeze(1)
gamma_outside_africa = weight[:, 1]
fig = plt.figure(figsize=(10, 6))
sns.distplot(gamma_within_africa, kde_kws={"label": "African nations"},)
sns.distplot(gamma_outside_africa, kde_kws={"label": "Non-African nations"})
fig.suptitle("Density of Slope : log(GDP) vs. Terrain Ruggedness", fontsize=16)
Text(0.5, 0.98, 'Density of Slope : log(GDP) vs. Terrain Ruggedness')

In the next section, we’ll look at how to write guides for variational inference as well as compare the results with inference via HMC.

See an example with a toy dataset on Github.


  1. McElreath, D., Statistical Rethinking, Chapter 7, 2016
  2. Nunn, N. & Puga, D., Ruggedness: The blessing of bad geography in Africa”, Review of Economics and Statistics 94(1), Feb. 2012