Bayesian Imputation

Real-world datasets often contain many missing values. In those situations, we have to either remove the rows with missing data (also known as “complete case” analysis) or replace the missing entries with some values. Though complete case analysis is straightforward, it is only applicable when the number of missing entries is so small that discarding them does not much affect the power of the analysis we are conducting on the data. The second strategy, known as imputation, is more broadly applicable and will be our focus in this tutorial.

Probably the most popular way to perform imputation is to fill a missing value with the mean, median, or mode of its corresponding feature. In doing so, we implicitly assume that the feature containing missing values has no correlation with the remaining features of our dataset. This is a pretty strong assumption and might not hold in general. In addition, it does not encode any uncertainty about those filled-in values. Below, we will construct a Bayesian setting to resolve those issues. In particular, given a model on the dataset, we will

  • create a generative model for the feature with missing value

  • and consider missing values as unobserved latent variables.
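To see concretely what the mean-filling strategy loses, here is a small plain-Python sketch (the numbers are made up): filling missing entries with the feature mean shrinks the feature's variance, i.e. it understates our uncertainty about the missing values.

```python
import math
import statistics

# A toy feature with two missing entries.
values = [22.0, 38.0, 26.0, 35.0, math.nan, 54.0, math.nan, 27.0]

observed = [v for v in values if not math.isnan(v)]
fill = statistics.mean(observed)

# Mean imputation: every missing entry gets the same point value.
imputed = [fill if math.isnan(v) else v for v in values]

# The filled-in values carry zero spread of their own, so the
# imputed column has strictly smaller sample variance.
print(statistics.variance(imputed) < statistics.variance(observed))  # True
```

Treating the missing entries as latent variables instead, as we do below, keeps that uncertainty in the model.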

[1]:
# first, we need some imports
import os

import pandas as pd
from IPython.display import set_matplotlib_formats
from matplotlib import pyplot as plt

import numpyro
from jax import numpy as jnp
from jax import ops, random
from jax.scipy.special import expit
from numpyro import distributions as dist
from numpyro.distributions import constraints
from numpyro.infer import MCMC, NUTS, Predictive

plt.style.use("seaborn")
if "NUMPYRO_SPHINXBUILD" in os.environ:
    set_matplotlib_formats("svg")

assert numpyro.__version__.startswith("0.3.0")

Dataset

The data is taken from the competition Titanic: Machine Learning from Disaster hosted on kaggle. It contains information about the passengers of the Titanic, such as name, age, and gender. Our goal is to predict whether a person survived the accident.

[2]:
train_df = pd.read_csv(
    "https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv"
)
train_df.info()
train_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Looking at the data info, we see that the Age, Cabin, and Embarked columns contain missing values. Although Cabin is an important feature (the position of a cabin on the ship can affect the survival chance of its occupants), we will skip it in this tutorial for simplicity. The dataset contains many categorical columns and two numerical columns, Age and Fare. Let’s first look at the distribution of the categorical columns:

[3]:
for col in ["Survived", "Pclass", "Sex", "SibSp", "Parch", "Embarked"]:
    print(train_df[col].value_counts(), end="\n\n")
0    549
1    342
Name: Survived, dtype: int64

3    491
1    216
2    184
Name: Pclass, dtype: int64

male      577
female    314
Name: Sex, dtype: int64

0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64

0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64

S    644
C    168
Q     77
Name: Embarked, dtype: int64

Prepare data

First, we will merge the rare groups in the SibSp and Parch columns. In addition, we’ll fill the 2 missing entries in Embarked with the mode S. Note that we could also build a generative model for those missing entries in Embarked, but we skip doing so for simplicity.

[4]:
train_df.SibSp.clip(0, 1, inplace=True)
train_df.Parch.clip(0, 2, inplace=True)
train_df.Embarked.fillna("S", inplace=True)

Looking closer at the data, we can observe that each name contains a title. We know that age is correlated with this title: e.g. passengers with the title Mrs. will be older than those with the title Miss. on average, so it might be useful to create this feature. The distribution of titles is:

[5]:
train_df.Name.str.split(", ").str.get(1).str.split(" ").str.get(0).value_counts()
[5]:
Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Mlle.          2
Major.         2
Col.           2
Jonkheer.      1
Don.           1
Lady.          1
Mme.           1
Sir.           1
Ms.            1
the            1
Capt.          1
Name: Name, dtype: int64

We will make a new column Title, where rare titles are merged into one group Misc..

[6]:
train_df["Title"] = (
    train_df.Name.str.split(", ")
    .str.get(1)
    .str.split(" ")
    .str.get(0)
    .apply(lambda x: x if x in ["Mr.", "Miss.", "Mrs.", "Master."] else "Misc.")
)
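The pandas chain above is just string splitting plus a merge of rare titles; the same logic on a single raw name looks like this in plain Python:

```python
# Extract the title from one raw Name value.
name = "Braund, Mr. Owen Harris"
title = name.split(", ")[1].split(" ")[0]
print(title)  # Mr.

# Rare titles get merged into one "Misc." group.
common_titles = ["Mr.", "Miss.", "Mrs.", "Master."]
merge_title = lambda t: t if t in common_titles else "Misc."
print(merge_title("Jonkheer."))  # Misc.
```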

Now, we are ready to turn the dataframe, which includes categorical values, into numpy arrays. We also standardize the Age column (a good practice for regression models).

[7]:
title_cat = pd.CategoricalDtype(
    categories=["Mr.", "Miss.", "Mrs.", "Master.", "Misc."], ordered=True
)
embarked_cat = pd.CategoricalDtype(categories=["S", "C", "Q"], ordered=True)
age_mean, age_std = train_df.Age.mean(), train_df.Age.std()
data = dict(
    age=train_df.Age.pipe(lambda x: (x - age_mean) / age_std).values,
    pclass=train_df.Pclass.values - 1,
    title=train_df.Title.astype(title_cat).cat.codes.values,
    sex=(train_df.Sex == "male").astype(int).values,
    sibsp=train_df.SibSp.values,
    parch=train_df.Parch.values,
    embarked=train_df.Embarked.astype(embarked_cat).cat.codes.values,
)
survived = train_df.Survived.values
# compute the age mean for each title
age_notnan = data["age"][jnp.isfinite(data["age"])]
title_notnan = data["title"][jnp.isfinite(data["age"])]
age_mean_by_title = jnp.stack([age_notnan[title_notnan == i].mean() for i in range(5)])
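The per-title mean in the last two lines can be mirrored in plain Python; here is a toy sketch with made-up numbers (the title codes and standardized ages below are hypothetical):

```python
import math

# Toy standardized ages and title codes (e.g. 0 = "Mr.", 3 = "Master.");
# one age is missing.
age = [0.2, -1.7, math.nan, 0.4, -1.8]
title = [0, 3, 0, 0, 3]

def title_mean(t):
    # Mean of the non-NaN ages whose title code equals t,
    # mirroring the age_mean_by_title computation above.
    vals = [a for a, k in zip(age, title) if k == t and not math.isnan(a)]
    return sum(vals) / len(vals)

print(round(title_mean(0), 6))  # 0.3
print(round(title_mean(3), 6))  # -1.75
```

These per-title means are used later as the non-Bayesian fallback for filling missing ages.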

Modelling

First, we want to note that in NumPyro, the following models

def model1a():
    x = numpyro.sample("x", dist.Normal(0, 1).expand([10]))

and

def model1b():
    x = numpyro.sample("x", dist.Normal(0, 1).expand([10]).mask(False))
    numpyro.sample("x_obs", dist.Normal(0, 1).expand([10]), obs=x)

are equivalent in the sense that both of them have

  • the same latent sites x drawn from a dist.Normal(0, 1) prior,

  • and the same log densities dist.Normal(0, 1).log_prob(x).

Now, assume that we observe the last 6 values of x (the unobserved entries take value NaN); a typical model would be

def model2a(x):
    x_impute = numpyro.sample("x_impute", dist.Normal(0, 1).expand([4]))
    x_obs = numpyro.sample("x_obs", dist.Normal(0, 1).expand([6]), obs=x[4:])
    x_imputed = jnp.concatenate([x_impute, x_obs])

or with the usage of mask,

def model2b(x):
    x_impute = numpyro.sample("x_impute", dist.Normal(0, 1).expand([4]).mask(False))
    x_imputed = jnp.concatenate([x_impute, x[4:]])
    numpyro.sample("x", dist.Normal(0, 1).expand([10]), obs=x_imputed)

Both approaches to modeling the partially observed data x are equivalent. For the model below, we will use the latter method.
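To see why the two factorizations agree, we can sum the same Normal log densities by hand. Below is a plain-Python sketch with made-up values for x: model2a adds the log densities of 4 latent sites and 6 observed sites, while model2b adds 0 (the masked latent) plus the log density of the fully observed 10-vector.

```python
import math

def normal_logpdf(x, mu=0.0, sigma=1.0):
    # Log density of Normal(mu, sigma) evaluated at x.
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

# Hypothetical values: 4 imputed entries followed by 6 observed entries.
x_impute = [0.3, -1.2, 0.7, 0.1]
x_obs = [1.5, -0.4, 0.9, 0.0, -2.1, 0.6]

# model2a: 4 latent sites plus 6 observed sites.
logp_2a = sum(normal_logpdf(v) for v in x_impute) + sum(normal_logpdf(v) for v in x_obs)

# model2b: the masked latent site contributes 0 to the log density;
# the concatenated 10-vector is then fully observed.
logp_2b = 0.0 + sum(normal_logpdf(v) for v in x_impute + x_obs)

print(abs(logp_2a - logp_2b) < 1e-12)  # True
```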

[8]:
def model(age, pclass, title, sex, sibsp, parch, embarked, survived=None, bayesian_impute=True):
    b_pclass = numpyro.sample("b_Pclass", dist.Normal(0, 1).expand([3]))
    b_title = numpyro.sample("b_Title", dist.Normal(0, 1).expand([5]))
    b_sex = numpyro.sample("b_Sex", dist.Normal(0, 1).expand([2]))
    b_sibsp = numpyro.sample("b_SibSp", dist.Normal(0, 1).expand([2]))
    b_parch = numpyro.sample("b_Parch", dist.Normal(0, 1).expand([3]))
    b_embarked = numpyro.sample("b_Embarked", dist.Normal(0, 1).expand([3]))

    # impute age by Title
    isnan = jnp.isnan(age)
    age_nanidx = jnp.nonzero(isnan)[0]
    if bayesian_impute:
        age_mu = numpyro.sample("age_mu", dist.Normal(0, 1).expand([5]))
        age_mu = age_mu[title]
        age_sigma = numpyro.sample("age_sigma", dist.Normal(0, 1).expand([5]))
        age_sigma = age_sigma[title]
        age_impute = numpyro.sample(
            "age_impute", dist.Normal(age_mu[age_nanidx], age_sigma[age_nanidx]).mask(False)
        )
        age = ops.index_update(age, age_nanidx, age_impute)
        numpyro.sample("age", dist.Normal(age_mu, age_sigma), obs=age)
    else:
        # fill missing data by the mean of ages for each title
        age_impute = age_mean_by_title[title][age_nanidx]
        age = ops.index_update(age, age_nanidx, age_impute)

    a = numpyro.sample("a", dist.Normal(0, 1))
    b_age = numpyro.sample("b_Age", dist.Normal(0, 1))
    logits = a + b_age * age
    logits = logits + b_title[title] + b_pclass[pclass] + b_sex[sex]
    logits = logits + b_sibsp[sibsp] + b_parch[parch] + b_embarked[embarked]
    numpyro.sample("survived", dist.Bernoulli(logits=logits), obs=survived)

Note that in the model, the prior for age is dist.Normal(age_mu, age_sigma), where the values of age_mu and age_sigma depend on title. Because there are missing values in age, we encode them in the latent parameter age_impute and then replace the NaN entries of age with age_impute.
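The ops.index_update call returns a new array with the latent draws slotted into the NaN positions; a plain-Python emulation of that step (with made-up values) looks like:

```python
import math

age = [0.5, math.nan, -0.2, math.nan, 1.1]  # standardized ages with gaps
age_impute = [0.05, -0.4]                   # hypothetical sampled latent values

# Indices of the missing entries, in order
# (mirrors jnp.nonzero(jnp.isnan(age))[0]).
nanidx = [i for i, v in enumerate(age) if math.isnan(v)]

# Functional update: build a new list with the latent draws slotted in,
# leaving the original untouched, like ops.index_update does for arrays.
filled = list(age)
for i, v in zip(nanidx, age_impute):
    filled[i] = v

print(filled)  # [0.5, 0.05, -0.2, -0.4, 1.1]
```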

Sampling

We will use MCMC with NUTS kernel to sample both regression coefficients and imputed values.

[9]:
mcmc = MCMC(NUTS(model), num_warmup=1000, num_samples=1000)
mcmc.run(random.PRNGKey(0), **data, survived=survived)
mcmc.print_summary()
sample: 100%|██████████| 2000/2000 [00:58<00:00, 34.39it/s, 63 steps of size 6.32e-02. acc. prob=0.94]

                     mean       std    median      5.0%     95.0%     n_eff     r_hat
              a      0.18      0.80      0.19     -1.09      1.53   1092.35      1.00
  age_impute[0]      0.22      0.85      0.25     -1.14      1.63   1812.56      1.00
  age_impute[1]     -0.10      0.86     -0.08     -1.46      1.36   1517.43      1.00
  age_impute[2]      0.36      0.80      0.34     -0.86      1.74   1255.78      1.00
  age_impute[3]      0.22      0.86      0.22     -1.18      1.68   1676.17      1.00
  age_impute[4]     -0.65      0.90     -0.61     -2.12      0.85   2134.96      1.00
  age_impute[5]      0.24      0.87      0.24     -1.24      1.52   1615.08      1.00
  age_impute[6]      0.45      0.78      0.45     -0.78      1.68   1486.47      1.00
  age_impute[7]     -0.65      0.90     -0.63     -2.04      0.85   1936.54      1.00
  age_impute[8]     -0.08      0.87     -0.08     -1.43      1.42   1640.22      1.00
  age_impute[9]      0.22      0.88      0.25     -1.33      1.62   1333.55      1.00
 age_impute[10]      0.20      0.87      0.20     -1.26      1.66   1888.15      1.00
 age_impute[11]      0.16      0.85      0.18     -1.25      1.56   1609.64      1.00
 age_impute[12]     -0.65      0.89     -0.62     -2.20      0.70   1712.38      1.00
 age_impute[13]      0.20      0.91      0.18     -1.30      1.71   2097.22      1.00
 age_impute[14]     -0.01      0.85     -0.00     -1.44      1.37   1421.92      1.00
 age_impute[15]      0.37      0.84      0.38     -1.02      1.67   1478.53      1.00
 age_impute[16]     -1.73      0.26     -1.74     -2.15     -1.31   1851.03      1.00
 age_impute[17]      0.22      0.86      0.22     -1.25      1.59   1558.59      1.00
 age_impute[18]      0.22      0.90      0.22     -1.31      1.68   1598.04      1.00
 age_impute[19]     -0.67      0.85     -0.63     -2.19      0.63   2018.16      1.00
 age_impute[20]      0.22      0.90      0.28     -1.44      1.57   1777.55      1.00
 age_impute[21]      0.18      0.90      0.21     -1.33      1.51   2182.77      1.00
 age_impute[22]      0.20      0.92      0.22     -1.27      1.75   1468.75      1.00
 age_impute[23]     -0.14      0.87     -0.13     -1.52      1.33   2195.16      1.00
 age_impute[24]     -0.67      0.89     -0.67     -2.14      0.73   1605.36      1.00
 age_impute[25]      0.17      0.89      0.19     -1.21      1.66   1429.10      1.00
 age_impute[26]      0.19      0.83      0.22     -1.16      1.54   1504.51      1.00
 age_impute[27]     -0.69      0.90     -0.67     -2.02      0.94   1890.80      1.00
 age_impute[28]      0.60      0.77      0.63     -0.70      1.81   1482.31      1.00
 age_impute[29]      0.24      0.84      0.26     -1.11      1.56   2220.98      1.00
 age_impute[30]      0.22      0.84      0.24     -1.00      1.73   2421.73      1.00
 age_impute[31]     -1.72      0.27     -1.72     -2.16     -1.27   2519.05      1.00
 age_impute[32]      0.43      0.86      0.44     -0.93      1.83   1493.55      1.00
 age_impute[33]      0.32      0.87      0.30     -1.02      1.84   1987.36      1.00
 age_impute[34]     -1.73      0.27     -1.74     -2.16     -1.27   2071.75      1.00
 age_impute[35]     -0.42      0.87     -0.42     -1.84      1.02   1434.28      1.00
 age_impute[36]      0.29      0.86      0.29     -1.14      1.64   2013.99      1.00
 age_impute[37]      0.30      0.83      0.34     -0.99      1.73   1863.36      1.00
 age_impute[38]      0.35      0.78      0.33     -1.03      1.55   1626.38      1.00
 age_impute[39]      0.19      0.91      0.19     -1.21      1.84   1117.75      1.00
 age_impute[40]     -0.64      0.90     -0.65     -2.14      0.77   1465.70      1.00
 age_impute[41]      0.19      0.87      0.19     -1.27      1.54   2150.55      1.00
 age_impute[42]      0.22      0.88      0.21     -1.23      1.70   1653.29      1.00
 age_impute[43]      0.21      0.81      0.24     -1.17      1.52   1652.92      1.00
 age_impute[44]     -0.40      0.95     -0.41     -1.78      1.34   1513.71      1.00
 age_impute[45]     -0.34      0.87     -0.34     -1.75      1.05   1518.48      1.00
 age_impute[46]     -0.30      0.91     -0.29     -1.87      1.13   1384.13      1.00
 age_impute[47]     -0.72      0.91     -0.74     -2.22      0.69   1765.85      1.00
 age_impute[48]      0.21      0.89      0.21     -1.23      1.79   2224.53      1.00
 age_impute[49]      0.43      0.77      0.42     -0.68      1.86   1550.53      1.00
 age_impute[50]      0.25      0.86      0.24     -1.12      1.58   2367.45      1.00
 age_impute[51]     -0.29      0.88     -0.34     -1.58      1.23   2111.28      1.00
 age_impute[52]      0.36      0.83      0.38     -1.16      1.59   1686.45      1.00
 age_impute[53]     -0.67      0.93     -0.68     -2.17      0.75   1326.22      1.00
 age_impute[54]      0.25      0.88      0.23     -1.12      1.64   2505.29      1.00
 age_impute[55]      0.33      0.87      0.34     -0.99      1.80   1301.28      1.00
 age_impute[56]      0.38      0.78      0.38     -0.78      1.67   1368.58      1.00
 age_impute[57]     -0.01      0.81     -0.03     -1.27      1.35   1365.21      1.00
 age_impute[58]     -0.69      0.90     -0.66     -2.20      0.71   1543.11      1.00
 age_impute[59]     -0.13      0.88     -0.14     -1.54      1.30   1305.27      1.00
 age_impute[60]     -0.62      0.91     -0.63     -2.05      0.87   1756.78      1.00
 age_impute[61]      0.22      0.83      0.22     -1.06      1.67   1384.66      1.00
 age_impute[62]     -0.59      0.93     -0.59     -2.02      1.05   1503.88      1.00
 age_impute[63]      0.20      0.85      0.19     -1.30      1.49   1312.19      1.00
 age_impute[64]     -0.69      0.86     -0.67     -2.03      0.82   1587.37      1.00
 age_impute[65]      0.41      0.73      0.42     -0.77      1.63   1335.75      1.00
 age_impute[66]      0.24      0.94      0.24     -1.24      1.79   2523.18      1.00
 age_impute[67]      0.33      0.73      0.34     -0.90      1.54   1487.26      1.00
 age_impute[68]      0.34      0.88      0.36     -0.96      1.92   1842.49      1.00
 age_impute[69]      0.24      0.93      0.24     -1.14      1.92   1151.23      1.00
 age_impute[70]     -0.65      0.87     -0.67     -2.13      0.73   1915.60      1.00
 age_impute[71]     -0.67      0.86     -0.67     -2.11      0.66   1482.63      1.00
 age_impute[72]      0.18      0.84      0.17     -1.18      1.51   1792.53      1.00
 age_impute[73]      0.38      0.76      0.38     -0.88      1.69   1806.11      1.00
 age_impute[74]     -0.67      0.92     -0.68     -2.07      0.88   1388.18      1.00
 age_impute[75]      0.43      0.85      0.43     -0.96      1.81   1394.37      1.00
 age_impute[76]      0.20      0.85      0.20     -1.29      1.56   1871.87      1.00
 age_impute[77]      0.18      0.87      0.16     -1.24      1.54   1310.45      1.00
 age_impute[78]     -0.42      0.87     -0.43     -1.84      1.01   2150.02      1.00
 age_impute[79]      0.20      0.84      0.20     -1.26      1.51   2044.15      1.00
 age_impute[80]      0.25      0.85      0.24     -1.22      1.58   1584.69      1.00
 age_impute[81]      0.26      0.87      0.29     -1.28      1.57   2090.98      1.00
 age_impute[82]      0.62      0.83      0.62     -0.71      1.98   1446.20      1.00
 age_impute[83]      0.20      0.85      0.20     -1.11      1.57   2026.06      1.00
 age_impute[84]      0.20      0.86      0.19     -1.17      1.65   1409.70      1.00
 age_impute[85]      0.25      0.89      0.22     -1.21      1.78   2319.33      1.00
 age_impute[86]      0.32      0.76      0.30     -1.04      1.52   1647.68      1.00
 age_impute[87]     -0.12      0.88     -0.10     -1.54      1.42   1929.81      1.00
 age_impute[88]      0.22      0.89      0.25     -1.38      1.52   1794.13      1.00
 age_impute[89]      0.23      0.94      0.22     -1.36      1.72   2988.98      1.00
 age_impute[90]      0.40      0.82      0.39     -0.83      1.91   1332.12      1.00
 age_impute[91]      0.24      0.92      0.25     -1.31      1.74   1964.70      1.00
 age_impute[92]      0.20      0.82      0.18     -1.08      1.49   1748.21      1.00
 age_impute[93]      0.24      0.92      0.26     -1.25      1.67   1815.09      1.00
 age_impute[94]      0.21      0.90      0.17     -1.21      1.65   1969.40      1.00
 age_impute[95]      0.21      0.89      0.22     -1.15      1.77   2072.73      1.00
 age_impute[96]      0.34      0.89      0.30     -1.06      1.79   1955.59      1.00
 age_impute[97]      0.27      0.89      0.25     -1.13      1.85   1855.70      1.00
 age_impute[98]     -0.39      0.93     -0.41     -1.91      1.09   1596.45      1.00
 age_impute[99]      0.16      0.91      0.14     -1.29      1.66   1376.45      1.00
age_impute[100]      0.23      0.86      0.22     -1.11      1.63   1911.15      1.00
age_impute[101]      0.20      0.89      0.19     -1.20      1.76   1928.03      1.00
age_impute[102]     -0.30      0.89     -0.29     -1.81      1.10   1560.14      1.00
age_impute[103]      0.01      0.84      0.00     -1.45      1.31   1473.75      1.00
age_impute[104]      0.25      0.90      0.26     -1.13      1.83   1548.72      1.00
age_impute[105]      0.25      0.88      0.27     -1.22      1.71   1504.16      1.00
age_impute[106]      0.25      0.87      0.23     -1.28      1.63   2309.06      1.00
age_impute[107]      0.21      0.88      0.22     -1.14      1.66   1815.74      1.00
age_impute[108]      0.34      0.86      0.36     -1.04      1.76   1907.62      1.00
age_impute[109]      0.28      0.90      0.26     -1.11      1.76   1199.46      1.00
age_impute[110]      0.34      0.75      0.36     -1.12      1.33   1212.96      1.00
age_impute[111]      0.21      0.89      0.21     -1.38      1.51   2188.74      1.00
age_impute[112]     -0.03      0.87     -0.03     -1.45      1.41   1753.71      1.00
age_impute[113]      0.22      0.87      0.22     -1.22      1.62   1516.32      1.00
age_impute[114]      0.39      0.81      0.38     -0.94      1.68   1343.44      1.00
age_impute[115]      0.20      0.88      0.19     -1.25      1.65   1622.06      1.00
age_impute[116]      0.25      0.87      0.18     -1.16      1.60   1397.06      1.00
age_impute[117]     -0.35      0.91     -0.37     -1.80      1.24   1928.67      1.00
age_impute[118]      0.23      0.93      0.22     -1.13      1.97   2220.75      1.00
age_impute[119]     -0.64      0.93     -0.66     -2.14      0.90   1848.79      1.00
age_impute[120]      0.60      0.80      0.59     -0.72      1.88   1405.44      1.00
age_impute[121]      0.21      0.88      0.19     -1.27      1.53   2130.06      1.00
age_impute[122]      0.21      0.81      0.18     -1.11      1.61   2025.07      1.00
age_impute[123]     -0.37      0.94     -0.39     -1.97      1.06   1499.93      1.00
age_impute[124]     -0.61      0.95     -0.63     -2.29      0.81   1475.74      1.00
age_impute[125]      0.23      0.90      0.22     -1.28      1.63   2277.35      1.00
age_impute[126]      0.22      0.82      0.23     -1.21      1.53   2545.04      1.00
age_impute[127]      0.36      0.87      0.36     -0.95      1.87   1684.15      1.00
age_impute[128]      0.24      0.90      0.26     -1.33      1.61   2068.39      1.00
age_impute[129]     -0.72      0.88     -0.73     -2.30      0.52   1666.76      1.00
age_impute[130]      0.19      0.87      0.18     -1.19      1.72   1505.42      1.00
age_impute[131]      0.27      0.91      0.25     -1.23      1.71   1671.42      1.00
age_impute[132]      0.33      0.87      0.33     -1.15      1.71   2034.05      1.00
age_impute[133]      0.23      0.91      0.22     -1.22      1.71   1598.77      1.00
age_impute[134]     -0.11      0.93     -0.14     -1.55      1.46   2134.48      1.00
age_impute[135]      0.22      0.85      0.24     -1.04      1.77   1861.50      1.00
age_impute[136]      0.18      0.85      0.19     -1.16      1.62   2066.48      1.00
age_impute[137]     -0.68      0.89     -0.65     -2.09      0.79   2300.73      1.00
age_impute[138]      0.19      0.95      0.19     -1.44      1.58   2151.50      1.00
age_impute[139]      0.20      0.84      0.18     -1.23      1.48   1357.01      1.00
age_impute[140]      0.40      0.78      0.41     -0.92      1.61   2210.60      1.00
age_impute[141]      0.25      0.89      0.23     -1.26      1.58   1660.37      1.00
age_impute[142]     -0.32      0.91     -0.32     -1.81      1.19   2112.23      1.00
age_impute[143]     -0.14      0.86     -0.14     -1.58      1.24   1843.93      1.00
age_impute[144]     -0.66      0.93     -0.68     -2.23      0.83   1546.19      1.00
age_impute[145]     -1.75      0.25     -1.74     -2.16     -1.34   1906.18      1.00
age_impute[146]      0.35      0.83      0.34     -1.08      1.62   1852.24      1.00
age_impute[147]      0.26      0.87      0.28     -1.20      1.61   1756.64      1.00
age_impute[148]     -0.67      0.89     -0.65     -2.14      0.71   1865.09      1.00
age_impute[149]      0.29      0.83      0.28     -1.07      1.62   2306.97      1.00
age_impute[150]      0.23      0.87      0.25     -1.15      1.65   2109.41      1.00
age_impute[151]      0.20      0.87      0.20     -1.22      1.60   1568.23      1.00
age_impute[152]      0.03      0.86      0.01     -1.52      1.27   1969.49      1.00
age_impute[153]      0.18      0.92      0.21     -1.38      1.66   2447.20      1.00
age_impute[154]      1.06      0.95      1.09     -0.69      2.43   1360.80      1.00
age_impute[155]      0.21      0.82      0.20     -1.16      1.46   1569.90      1.00
age_impute[156]      0.27      0.93      0.26     -1.29      1.82   1153.90      1.00
age_impute[157]      0.22      0.93      0.21     -1.34      1.70   1789.16      1.00
age_impute[158]      0.22      0.87      0.18     -1.22      1.68   1633.06      1.00
age_impute[159]      0.20      0.90      0.21     -1.17      1.70   1562.19      1.00
age_impute[160]      0.17      0.85      0.13     -1.16      1.51   1331.14      1.00
age_impute[161]     -0.49      0.95     -0.53     -1.97      1.08   1670.27      1.00
age_impute[162]      0.36      0.85      0.35     -0.96      1.77   1983.56      1.00
age_impute[163]      0.31      0.85      0.29     -1.16      1.60   1256.49      1.00
age_impute[164]      0.22      0.86      0.22     -1.13      1.77   1962.87      1.00
age_impute[165]      0.22      0.90      0.22     -1.17      1.67   1889.23      1.00
age_impute[166]     -0.11      0.87     -0.11     -1.39      1.43   1869.36      1.00
age_impute[167]      0.21      0.90      0.21     -1.07      1.79   1814.75      1.00
age_impute[168]      0.21      0.83      0.22     -1.08      1.66   1540.16      1.00
age_impute[169]      0.02      0.87      0.02     -1.35      1.43   2144.81      1.00
age_impute[170]      0.16      0.87      0.18     -1.21      1.60   1963.28      1.00
age_impute[171]      0.41      0.83      0.40     -0.84      1.88   1846.21      1.00
age_impute[172]      0.24      0.89      0.23     -1.23      1.58   1893.32      1.00
age_impute[173]     -0.44      0.88     -0.47     -1.93      0.95   2233.44      1.00
age_impute[174]      0.23      0.83      0.24     -1.14      1.54   2005.25      1.00
age_impute[175]      0.22      0.93      0.23     -1.43      1.66   1919.89      1.00
age_impute[176]     -0.44      0.89     -0.44     -1.90      1.01   2368.38      1.00
      age_mu[0]      0.19      0.04      0.19      0.11      0.26   1487.32      1.00
      age_mu[1]     -0.55      0.08     -0.55     -0.67     -0.43   1108.95      1.00
      age_mu[2]      0.42      0.08      0.43      0.29      0.56   1014.29      1.00
      age_mu[3]     -1.73      0.04     -1.73     -1.80     -1.65   1196.97      1.00
      age_mu[4]      0.85      0.19      0.85      0.55      1.17   1603.23      1.00
   age_sigma[0]      0.88      0.03      0.87      0.83      0.93    559.99      1.00
   age_sigma[1]      0.90      0.05      0.90      0.82      0.99   1035.24      1.00
   age_sigma[2]      0.79      0.05      0.79      0.71      0.88   1110.48      1.00
   age_sigma[3]      0.26      0.03      0.26      0.21      0.31   1371.55      1.00
   age_sigma[4]      0.94      0.13      0.92      0.74      1.16   1093.91      1.00
          b_Age     -0.44      0.13     -0.44     -0.64     -0.23    941.26      1.00
  b_Embarked[0]     -0.29      0.53     -0.29     -1.18      0.53    516.88      1.00
  b_Embarked[1]      0.28      0.54      0.29     -0.67      1.08    568.63      1.00
  b_Embarked[2]      0.01      0.55     -0.01     -0.98      0.80    512.83      1.00
     b_Parch[0]      0.44      0.57      0.45     -0.47      1.41    490.79      1.00
     b_Parch[1]      0.10      0.57      0.10     -0.71      1.14    507.22      1.00
     b_Parch[2]     -0.49      0.56     -0.48     -1.30      0.50    517.89      1.00
    b_Pclass[0]      1.16      0.56      1.17      0.28      2.11    479.10      1.00
    b_Pclass[1]      0.02      0.55      0.03     -0.88      0.92    476.86      1.00
    b_Pclass[2]     -1.23      0.56     -1.20     -2.16     -0.33    465.27      1.00
       b_Sex[0]      1.17      0.71      1.15     -0.03      2.28    690.74      1.00
       b_Sex[1]     -1.00      0.69     -1.02     -2.09      0.15    857.27      1.00
     b_SibSp[0]      0.27      0.63      0.29     -0.77      1.27    764.87      1.00
     b_SibSp[1]     -0.19      0.64     -0.18     -1.19      0.86    753.55      1.00
     b_Title[0]     -0.96      0.57     -0.96     -1.92     -0.06    591.63      1.00
     b_Title[1]     -0.34      0.62     -0.33     -1.33      0.67    686.03      1.00
     b_Title[2]      0.54      0.63      0.54     -0.44      1.58    644.49      1.00
     b_Title[3]      1.46      0.64      1.47      0.35      2.43    786.96      1.00
     b_Title[4]     -0.66      0.62     -0.69     -1.72      0.34    697.13      1.00

Number of divergences: 0

To double-check that the assumption “age is correlated with title” is reasonable, let’s look at the inferred age by title. Recall that we standardized age, so here we need to scale back to the original domain.

[10]:
age_by_title = age_mean + age_std * mcmc.get_samples()["age_mu"].mean(axis=0)
dict(zip(title_cat.categories, age_by_title))
[10]:
{'Mr.': 32.433903,
 'Miss.': 21.75385,
 'Mrs.': 35.85388,
 'Master.': 4.628565,
 'Misc.': 42.070465}
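Scaling back is just the inverse of the standardization, age = age_mean + age_std * age_mu. With hypothetical statistics (the age_mean = 29.7, age_std = 14.5 below are made up for illustration), the arithmetic for a single entry looks like:

```python
# Hypothetical training statistics and one inferred standardized mean.
age_mean, age_std = 29.7, 14.5
z = -1.73  # e.g. a plausible age_mu value for the "Master." group

# Invert z = (age - age_mean) / age_std.
recovered = age_mean + age_std * z
print(round(recovered, 3))  # 4.615
```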

The inferred result confirms our assumption that Age is correlated with Title:

  • passengers with the Master. title are quite young (in other words, they are the children on the ship) compared to the other groups,

  • passengers with the Mrs. title are older than those with the Miss. title (on average).

We can also see that the result is similar to the empirical mean of Age given Title in our training dataset:

[11]:
train_df.groupby("Title")["Age"].mean()
[11]:
Title
Master.     4.574167
Misc.      42.384615
Miss.      21.773973
Mr.        32.368090
Mrs.       35.898148
Name: Age, dtype: float64

So far so good. We now have posterior information about the regression coefficients together with the imputed values and their uncertainties. Let’s inspect those results a bit:

  • The mean value -0.44 of b_Age implies that younger passengers have a better chance to survive.

  • The mean values (1.17, -1.00) of b_Sex imply that female passengers have a higher chance to survive than male passengers.

Prediction

In NumPyro, we can use the Predictive utility to make predictions from posterior samples. Let’s check how well the model performs on the training dataset. For simplicity, we will get a survived prediction for each posterior sample and apply the majority rule to the predictions.
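The majority rule amounts to thresholding the per-passenger mean of the posterior draws at 0.5; a small plain-Python sketch with made-up 0/1 draws:

```python
# Hypothetical survived draws for 3 passengers across 4 posterior samples.
samples = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
]

n_samples = len(samples)
# Mean over samples for each passenger, then threshold at 0.5.
means = [sum(row[j] for row in samples) / n_samples for j in range(3)]
pred = [int(m >= 0.5) for m in means]
print(pred)  # [1, 0, 1]
```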

[12]:
posterior = mcmc.get_samples()
survived_pred = Predictive(model, posterior)(random.PRNGKey(1), **data)["survived"]
survived_pred = (survived_pred.mean(axis=0) >= 0.5).astype(jnp.uint8)
print("Accuracy:", (survived_pred == survived).sum() / survived.shape[0])
confusion_matrix = pd.crosstab(
    pd.Series(survived, name="actual"), pd.Series(survived_pred, name="predict")
)
confusion_matrix / confusion_matrix.sum(axis=1)
Accuracy: 0.8260382
[12]:
predict 0 1
actual
0 0.874317 0.201754
1 0.156648 0.748538

This is a pretty good result for a simple logistic regression model. Let’s see how the model performs if we don’t use Bayesian imputation.

[13]:
mcmc.run(random.PRNGKey(2), **data, survived=survived, bayesian_impute=False)
posterior_1 = mcmc.get_samples()
survived_pred_1 = Predictive(model, posterior_1)(random.PRNGKey(2), **data)["survived"]
survived_pred_1 = (survived_pred_1.mean(axis=0) >= 0.5).astype(jnp.uint8)
print("Accuracy:", (survived_pred_1 == survived).sum() / survived.shape[0])
confusion_matrix = pd.crosstab(
    pd.Series(survived, name="actual"), pd.Series(survived_pred_1, name="predict")
)
confusion_matrix / confusion_matrix.sum(axis=1)
sample: 100%|██████████| 2000/2000 [00:38<00:00, 52.15it/s, 63 steps of size 6.48e-02. acc. prob=0.94]
Accuracy: 0.82042646
[13]:
predict 0 1
actual
0 0.872495 0.204678
1 0.163934 0.736842

We can see that Bayesian imputation does a little bit better here.

Remark. When using posterior samples to make predictions on new data, we need to marginalize out age_impute because those imputed values are specific to the training data:

posterior.pop("age_impute")
survived_pred = Predictive(model, posterior)(random.PRNGKey(3), **new_data)

References

  1. McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan.

  2. Kaggle competition: Titanic: Machine Learning from Disaster