{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Probabilistic Topic Modeling\n", "\n", "This tutorial implements the ProdLDA topic model from [Autoencoding Variational Inference For Topic Models](https://arxiv.org/abs/1703.01488) by Akash Srivastava and Charles Sutton. This model returns consistently better topics than vanilla LDA and trains much more quickly. Furthermore, it does not require a custom inference algorithm that relies on complex mathematical derivations. This tutorial also serves as an introduction to probabilistic modeling with Pyro, and is heavily inspired by David Blei's [Probabilistic topic models](http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf).\n", "\n", "## Introduction\n", "Topic models are a suite of unsupervised learning algorithms that aim to discover and annotate large archives of documents with thematic information. Probabilistic topic models use statistical methods to analyze the words in each text to discover common themes, how those themes are connected to each other, and how they change over time. They enable us to organize and summarize electronic archives at a scale that would be impossible with human annotation alone. The most popular topic model is called latent Dirichlet allocation, or LDA.\n", "\n", "## Latent Dirichlet Allocation: Intuition\n", "LDA is a statistical model of document collections that encodes the intuition that documents exhibit multiple topics. It is most easily described by its generative process, the idealized random process from which the model assumes the documents were generated. The figure below illustrates the intuition:\n", "\n", "\n", "