# Bayesian Regression¶

Regression is one of the most common and basic supervised learning tasks in machine learning. Suppose we’re given a dataset \(\mathcal{D}\) of the form

The goal of linear regression is to fit a function to the data of the form:

where \(w\) and \(b\) are learnable parameters and \(\epsilon\) represents observation noise. Specifically \(w\) is a matrix of weights and \(b\) is a bias vector.

Let’s first implement linear regression in PyTorch and learn point estimates for the parameters \(w\) and \(b\). Then we’ll see how to incorporate uncertainty into our estimates by using Pyro to implement Bayesian linear regression.

## Setup¶

As always, let’s begin by importing the modules we’ll need.

```
In [ ]:
```

```
import os
import numpy as np
import torch
import torch.nn as nn
import pyro
from pyro.distributions import Normal
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam
# for CI testing
smoke_test = ('CI' in os.environ)
pyro.enable_validation(True)
```

## Data¶

We’ll generate a toy dataset with one feature and \(w = 3\) and \(b = 1\) as follows:

```
In [ ]:
```

```
N = 100 # size of toy data
def build_linear_dataset(N, p=1, noise_std=0.01):
X = np.random.rand(N, p)
# w = 3
w = 3 * np.ones(p)
# b = 1
y = np.matmul(X, w) + np.repeat(1, N) + np.random.normal(0, noise_std, size=N)
y = y.reshape(N, 1)
X, y = torch.tensor(X).type(torch.Tensor), torch.tensor(y).type(torch.Tensor)
data = torch.cat((X, y), 1)
assert data.shape == (N, p + 1)
return data
```

Note that we generate the data with a fixed observation noise \(\sigma = 0.1\).

## Regression¶

Now let’s define our regression model. We’ll use PyTorch’s `nn.Module`

for this. Our input \(X\) is a matrix of size \(N \times p\) and
our output \(y\) is a vector of size \(p \times 1\). The
function `nn.Linear(p, 1)`

defines a linear transformation of the form
\(Xw + b\) where \(w\) is the weight matrix and \(b\) is the
additive bias. As you can see, we can easily make this a logistic
regression by adding a non-linearity in the `forward()`

method.

```
In [ ]:
```

```
class RegressionModel(nn.Module):
def __init__(self, p):
# p = number of features
super(RegressionModel, self).__init__()
self.linear = nn.Linear(p, 1)
def forward(self, x):
return self.linear(x)
regression_model = RegressionModel(1)
```

## Training¶

We will use the mean squared error (MSE) as our loss and Adam as our
optimizer. We would like to optimize the parameters of the
`regression_model`

neural net above. We will use a somewhat large
learning rate of `0.01`

and run for 500 iterations.

```
In [ ]:
```

```
loss_fn = torch.nn.MSELoss(size_average=False)
optim = torch.optim.Adam(regression_model.parameters(), lr=0.05)
num_iterations = 1000 if not smoke_test else 2
def main():
data = build_linear_dataset(N)
x_data = data[:, :-1]
y_data = data[:, -1]
for j in range(num_iterations):
# run the model forward on the data
y_pred = regression_model(x_data).squeeze(-1)
# calculate the mse loss
loss = loss_fn(y_pred, y_data)
# initialize gradients to zero
optim.zero_grad()
# backpropagate
loss.backward()
# take a gradient step
optim.step()
if (j + 1) % 50 == 0:
print("[iteration %04d] loss: %.4f" % (j + 1, loss.item()))
# Inspect learned parameters
print("Learned parameters:")
for name, param in regression_model.named_parameters():
print("%s: %.3f" % (name, param.data.numpy()))
if __name__ == '__main__':
main()
```

**Sample Output**:

```
[iteration 0400] loss: 0.0105
[iteration 0450] loss: 0.0096
[iteration 0500] loss: 0.0095
[iteration 0550] loss: 0.0095
[iteration 0600] loss: 0.0095
[iteration 0650] loss: 0.0095
[iteration 0700] loss: 0.0095
[iteration 0750] loss: 0.0095
[iteration 0800] loss: 0.0095
[iteration 0850] loss: 0.0095
[iteration 0900] loss: 0.0095
[iteration 0950] loss: 0.0095
[iteration 1000] loss: 0.0095
Learned parameters:
linear.weight: 3.004
linear.bias: 0.997
```

Not too bad - you can see that the regressor learned parameters that were pretty close to the ground truth of \(w = 3, b = 1\). But how confident should we be in these point estimates?

Bayesian
modeling
offers a systematic framework for reasoning about model uncertainty.
Instead of just learning point estimates, we’re going to learn a
*distribution* over values of the parameters \(w\) and \(b\)
that are consistent with the observed data.

## Bayesian Regression¶

In order to make our linear regression Bayesian, we need to put priors on the parameters \(w\) and \(b\). These are distributions that represent our prior belief about reasonable values for \(w\) and \(b\) (before observing any data).

`random_module()`

¶

In order to do this, we’ll ‘lift’ the parameters \(w\) and \(b\)
to random variables. We can do this in Pyro via `random_module()`

,
which effectively takes a given `nn.Module`

and turns it into a
distribution over the same module; in our case, this will be a
distribution over regressors. Specifically, each parameter in the
original regression model is sampled from the provided prior. This
allows us to repurpose vanilla regression models for use in the Bayesian
setting. For example:

```
In [ ]:
```

```
loc = torch.zeros(1, 1)
scale = torch.ones(1, 1)
# define a unit normal prior
prior = Normal(loc, scale)
# overload the parameters in the regression module with samples from the prior
lifted_module = pyro.random_module("regression_module", regression_model, prior)
# sample a regressor from the prior
sampled_reg_model = lifted_module()
```

### Model¶

We now have all the ingredients needed to specify our model. First we
define priors over \(w\) and \(b\). Because we’re uncertain
about the parameters a priori, we’ll use relatively wide priors
\(\mathcal{N}(\mu = 0, \sigma = 10)\). Then we wrap
`regression_model`

with `random_module`

and sample an instance of
the regressor, `lifted_reg_model`

. We then run the regressor forward
on the inputs `x_data`

. Finally we use the `obs`

argument to the
`pyro.sample`

statement to condition on the observed data `y_data`

.
Note that we use the same fixed observation noise that was used to
generate the data.

```
In [ ]:
```

```
def model(data):
# Create unit normal priors over the parameters
loc, scale = torch.zeros(1, 1), 10 * torch.ones(1, 1)
bias_loc, bias_scale = torch.zeros(1), 10 * torch.ones(1)
w_prior = Normal(loc, scale).independent(1)
b_prior = Normal(bias_loc, bias_scale).independent(1)
priors = {'linear.weight': w_prior, 'linear.bias': b_prior}
# lift module parameters to random variables sampled from the priors
lifted_module = pyro.random_module("module", regression_model, priors)
# sample a regressor (which also samples w and b)
lifted_reg_model = lifted_module()
with pyro.iarange("map", N):
x_data = data[:, :-1]
y_data = data[:, -1]
# run the regressor forward conditioned on data
prediction_mean = lifted_reg_model(x_data).squeeze(-1)
# condition on the observed data
pyro.sample("obs",
Normal(prediction_mean, 0.1 * torch.ones(data.size(0))),
obs=y_data)
```

### Guide¶

In order to do inference we’re going to need a guide, i.e. a
parameterized family of distributions over \(w\) and \(b\).
Writing down a guide will proceed in close analogy to the construction
of our model, with the key difference that the guide parameters need to
be trainable. To do this we register the guide parameters in the
ParamStore using `pyro.param()`

.

```
In [ ]:
```

```
softplus = torch.nn.Softplus()
def guide(data):
# define our variational parameters
w_loc = torch.randn(1, 1)
# note that we initialize our scales to be pretty narrow
w_log_sig = torch.tensor(-3.0 * torch.ones(1, 1) + 0.05 * torch.randn(1, 1))
b_loc = torch.randn(1)
b_log_sig = torch.tensor(-3.0 * torch.ones(1) + 0.05 * torch.randn(1))
# register learnable params in the param store
mw_param = pyro.param("guide_mean_weight", w_loc)
sw_param = softplus(pyro.param("guide_log_scale_weight", w_log_sig))
mb_param = pyro.param("guide_mean_bias", b_loc)
sb_param = softplus(pyro.param("guide_log_scale_bias", b_log_sig))
# guide distributions for w and b
w_dist = Normal(mw_param, sw_param).independent(1)
b_dist = Normal(mb_param, sb_param).independent(1)
dists = {'linear.weight': w_dist, 'linear.bias': b_dist}
# overload the parameters in the module with random samples
# from the guide distributions
lifted_module = pyro.random_module("module", regression_model, dists)
# sample a regressor (which also samples w and b)
return lifted_module()
```

Note that we choose Gaussians for both guide distributions. Also, to
ensure positivity, we pass each log scale through a `softplus()`

transformation (an alternative to ensure positivity would be an
`exp()`

-transformation).

## Inference¶

To do inference we’ll use stochastic variational inference (SVI) (for an
introduction to SVI, see SVI Part I). Just like
in the non-Bayesian linear regression, each iteration of our training
loop will take a gradient step, with the difference that in this case,
we’ll use the ELBO objective instead of the MSE loss by constructing a
`Trace_ELBO`

object that we pass to `SVI`

.

```
In [ ]:
```

```
optim = Adam({"lr": 0.05})
svi = SVI(model, guide, optim, loss=Trace_ELBO())
```

Here `Adam`

is a thin wrapper around `torch.optim.Adam`

(see
here for a discussion). The complete
training loop is as follows:

```
In [ ]:
```

```
def main():
pyro.clear_param_store()
data = build_linear_dataset(N)
for j in range(num_iterations):
# calculate the loss and take a gradient step
loss = svi.step(data)
if j % 100 == 0:
print("[iteration %04d] loss: %.4f" % (j + 1, loss / float(N)))
if __name__ == '__main__':
main()
```

To take an ELBO gradient step we simply call the `step`

method of
`SVI`

. Notice that the `data`

argument we pass to `step`

will be
passed to both `model()`

and `guide()`

.

## Validating Results¶

Let’s compare the variational parameters we learned to our previous result:

```
In [ ]:
```

```
for name in pyro.get_param_store().get_all_param_names():
print("[%s]: %.3f" % (name, pyro.param(name).data.numpy()))
```

**Sample Output**:

```
[guide_log_scale_weight]: -3.217
[guide_log_scale_bias]: -3.164
[guide_mean_weight]: 2.966
[guide_mean_bias]: 0.941
```

As you can see, the means of our parameter estimates are pretty close to
the values we previously learned. Now, however, instead of just point
estimates, the parameters `guide_log_scale_weight`

and
`guide_log_scale_bias`

provide us with uncertainty estimates. (Note
that the scales are in log-space here, so the more negative the value,
the narrower the width).

Finally, let’s evaluate our model by checking its predictive accuracy on
new test data. This is known as *point evaluation*. We’ll sample 20
neural nets from our posterior, run them on the test data, then average
across their predictions and calculate the MSE of the predicted values
compared to the ground truth.

```
In [ ]:
```

```
X = np.linspace(6, 7, num=20)
y = 3 * X + 1
X, y = X.reshape((20, 1)), y.reshape((20, 1))
x_data, y_data = torch.tensor(X).type(torch.Tensor), torch.tensor(y).type(torch.Tensor)
loss = nn.MSELoss()
y_preds = torch.zeros(20, 1)
for i in range(20):
# guide does not require the data
sampled_reg_model = guide(None)
# run the regression model and add prediction to total
y_preds = y_preds + sampled_reg_model(x_data)
# take the average of the predictions
y_preds = y_preds / 20
print ("Loss: ", loss(y_preds, y_data).item())
```

**Sample Output**:

```
Loss: 0.00025596367777325213
```

See the full code on Github.