1 Bayesian Inference with Sampling

Introduction

In Bayesian inference with sampling, you learn about the model parameters through a sample taken from their posterior distribution: the joint probability distribution of observing the data with the model, as a function of the model parameter values.

This post presents the Bayesian inference concepts through a practical example: solving “by hand” a simple regression model in a fully Bayesian way, and then solving it with stan, a tool for Bayesian inference with sampling. By matching the steps we do by hand with what stan does, you will gain a deep understanding of how modeling works with stan and with packages that use stan in the background, like brms or rstanarm.

Context

Except for relatively simple models, there is no straightforward way to obtain Bayesian estimates analytically. This limitation has led to the development of various approximation methods. When precision is essential, simulation-based methods offer a valuable alternative. These methods aim to obtain a sample from the joint posterior distribution of the model parameters, which you can use to approximate almost any quantity relevant to Bayesian inference, including expected values, variances, quantiles, and marginal densities (Wang and Park 2020).

In the R language, to fit a model to the data using Bayesian inference with sampling, you can use packages based on BUGS (Bayesian inference Using Gibbs Sampling), JAGS (Just Another Gibbs Sampler), or Stan (Carpenter et al. 2017). These tools are samplers in the sense that they return a sample from the posterior distribution of the model parameters. The term posterior distribution refers to the distribution of the model parameter values that fit the model to the data. Each of these tools defines its own programming language for describing the model to be solved.

These tools are based on Markov chain Monte Carlo (MCMC) sampling methods (Geyer 2011). At the opposite end are point estimation methods, where each parameter is estimated by a single value and the uncertainty about it by another single value, its standard error. A Bayesian inference method, instead, describes the parameters by a sample from their joint posterior distribution.

Stan is a software platform at the core of several cutting-edge algorithms for statistical modeling. Around this core, several interfaces expose Stan services in different programming languages, such as Python, R, and Julia, among others. To use Stan from the R language, you install its official interface, the rstan package (Guo et al. 2020), made by the creators of Stan. So Stan is a platform, a language, and an R interface, all of which we will refer to as stan; the context will indicate which meaning applies.

With stan, it is possible to define and solve models as complicated and elaborate as desired. However, in the case of known models, such as those in the family of Generalized Linear Models and their extensions in Multilevel models, the rstanarm (Gabry and Goodrich 2020) or brms (Bürkner 2018) packages allow us to do the modeling much as it is usually done in R. These packages build the equivalent model in the stan language behind the scenes and use rstan to solve it. This automatic translation eases the learning curve by avoiding the need to program directly in the stan language. They also let you easily use and combine model families that would otherwise be very time-consuming to create correctly by hand, from scratch, in the stan language.
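
For instance, the model we solve by hand in the rest of this post could be fit with a single rstanarm call. The following is only an illustrative sketch (not part of the hand computation that follows); it assumes rstanarm is installed, uses the ECOMPO_DATA data frame defined in the next section, and anticipates the priors adopted in section 1.2.1.

library(rstanarm)

# Sketch only: the model y = beta * x (no intercept), with priors comparable
# to the ones we adopt by hand later in this post.
fit <- stan_glm(
  y ~ 0 + x,
  data = ECOMPO_DATA,
  prior = normal(13, 6),                           # prior for beta
  prior_aux = exponential(1 / sd(ECOMPO_DATA$y))   # prior for sigma_epsilon
)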

Data set eCompo

We will use a dataset named “eCompo” as the running example. These are a series of readings of the response “\(y\)” emitted by an electronic component after being subjected to different stimulus values “\(x\)” in a test. We want to know the two parameters \(\beta\) and \(\sigma_\epsilon\) in the model shown below: \[\begin{align} y &= \beta \cdot x + \epsilon \\ \epsilon &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \tag{1.1} \\[3pt] \implies y - (\beta \cdot x) &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \tag{1.2} \end{align}\]

ECOMPO_DATA <- data.frame(
  x = c(23.54, 22.97, 11.78, 18.64, 14.36, 18.93, 19.04, 10.768, 27.39, 
    17.28, 22.63, 14.32, 22.95, 27.82, 28.34, 23.71, 29.73, 25.41, 
    27.60, 15.54 ),
  y = c(247.35, 180.47, 86.78, 178.12, 168.72, 135.10, 160.17, 172.48, 
    274.35, 130.32, 224.71, 144.31, 245.39, 314.81, 256.71, 187.07, 
    251.90, 253.75, 181.58, 179.68)
)

The expected distribution of the residuals (1.1) takes its variance from the residuals themselves, but its center is fixed at zero. In the following examples, we will estimate the variance \(\sigma_{\epsilon}^2\) with that of the residuals in the data.
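
For example, given a candidate value of \(\beta\), the residuals and the estimate of \(\sigma_\epsilon\) from the data are computed as in the minimal sketch below (the value \(9.3\) anticipates the exercise in the next section).

beta_try <- 9.3
residuals_try <- ECOMPO_DATA$y - beta_try * ECOMPO_DATA$x   # epsilon = y - beta * x
sigma_try <- sd(residuals_try)                              # estimate of sigma_epsilon
c(center = mean(residuals_try), sigma = sigma_try)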

1.1 Bayesian Inference

1.1.1 The Notion of Joint Probability

In the graph “Beta vs. Residuals” 1.1, the expected distribution of the residuals is shown in green, and their actual distribution in the data, \(\epsilon = y - (\beta \cdot x)\), in purple. With \(\beta = 9.3\) in (1), the residuals approach their expected distribution. As \(\beta\) increases, for example up to \(\beta = 15\) in (2), the residuals spread out and their center moves to the left, away from its expected position at zero. As their variance increases, their expected distribution flattens out and the match gets worse: the actual distribution becomes more and more different from the expected one. In the latter case, we say that \(\beta = 15\) “makes it unlikely to observe the data we have.”

The phrase “makes it unlikely to observe the data” is an abbreviated way to say that some parameter values make it more or less probable for the model to produce the observed data; in this case, less likely.


Figure 1.1: Beta vs. Residuals. The expected distribution of the residuals is shown in green; it is centered at zero and has the variance of the residuals. The actual distribution is shown in purple. At moment (1), the value of \(\beta\) produces residuals whose distribution is relatively in line with what is expected. At moment (2), increasing values of \(\beta\) cause increasingly negative residuals and increasing variance. The expected distribution flattens out, as if “trying to follow” the residuals, but the fit gets worse.

This exercise aims to build the intuition that a model’s parameters have a joint distribution: the probability of observing the data, according to the model, as a function of its parameter values. When we move one or more parameters, we use the data to see how “the likelihood of seeing it” varies accordingly.
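
The same comparison can be done numerically; here is a short sketch using the two values of \(\beta\) from the figure.

residual_summary <- function(beta) {
  eps <- ECOMPO_DATA$y - beta * ECOMPO_DATA$x   # actual residuals for this beta
  c(beta = beta, center = mean(eps), sd = sd(eps))
}
# At beta = 15 the center moves away from zero and the spread grows.
rbind(residual_summary(9.3), residual_summary(15))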

1.1.2 The Probability of Observing the Data

In our eCompo model the joint probability of observing the data (\(y, X\)) with the model \(\mathcal{M}\), and parameters \(\beta, \sigma_{\epsilon}\) is: \[\begin{align*} \pi(\beta, \sigma_{\epsilon} \mid \mathcal{M}, y, X) \end{align*}\]

Simplified Notation

For simplicity, we will use the vector \(\theta\) to refer to all the model parameters. In turn, \(y\) will represent all the data, so \(X\) will be implicit, as will the model, which is constant. Finally, in these expressions, \(\pi(\cdot)\) is the probability density function of \((\cdot)\), meaning that it is a different function for each different \((\cdot)\); it corresponds to a mass function in the discrete case. We will use the term distribution in the same sense as probability density function.

With this simplified notation, the joint probability of observing the data (\(y, X\)) with the model with the parameters \(\theta\) is: \[\begin{align*} \pi(\theta \mid y) \end{align*}\]

By the Bayes rule we know: \[\begin{align} \pi(\theta \mid y) &= \frac{\pi(y, \theta)}{\pi(y)} = \frac{\pi(y \mid \theta) \cdot \pi(\theta)}{\pi(y)} \tag{1.3} \\[3pt] \pi(\theta \mid y) &\propto \pi(y \mid \theta) \cdot \pi(\theta) \tag{1.4} \end{align}\]

In the equation (1.3), the numerator is the joint probability of the parameters and the data in the sense we explored in the previous exercise. The denominator \(\pi(y)\) is a constant that does not depend on \(\theta\), but only on the data, so \(\pi (\theta \mid y)\) in (1.4) is proportional to the numerator in (1.3). We can consider the right-hand side of (1.4) as the non-normalized distribution of \(\pi(\theta \mid y)\), that is, a distribution whose integral is not equal to \(1\).

The equation (1.4) is fundamental in Bayesian inference, and each term has a name that we will refer to frequently.

  • \(\pi(\theta)\) is the prior distribution: the one we assume the parameters have before seeing the data.
  • \(\pi(y \mid \theta)\) is the sampling distribution of the data. Since the data “\(y\)” remain constant, it is a function of \(\theta\), and in this sense it is called the likelihood of our model.
  • \(\pi(\theta \mid y)\) is the posterior distribution: the distribution of the parameters conditioned on the data. We will refer to this when we talk about the probability of seeing the data we have, with the model and parameters we have.

Since \(\theta\) is a vector, all these terms correspond to joint probability distributions. Notice how the likelihood term captures the notion of “the probability of observing the data” from the previous exercise: with some parameter values, it is more likely for the model to produce the data.
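
To make (1.4) concrete for the eCompo model (1.2), with \(\theta = (\beta, \sigma_\epsilon)\), independent priors for \(\beta\) and \(\sigma_\epsilon\) (as we will assume in the next section), and \(n\) observations, it reads: \[\begin{align*} \underbrace{\pi(\beta, \sigma_\epsilon \mid y)}_{\text{posterior}} \;\propto\; \underbrace{\prod_{i=1}^{n} \mathcal{N}\!\left(y_i \mid \beta \cdot x_i,\ \sigma_{\epsilon}^2\right)}_{\text{likelihood } \pi(y \mid \theta)} \cdot \underbrace{\pi(\beta)\,\pi(\sigma_\epsilon)}_{\text{prior } \pi(\theta)} \end{align*}\]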

1.2 An Example of Bayesian Inference with Sampling

In this section, we will see a practical example of the concepts introduced in the previous section, analyzing numerically (without using stan in any way) the eCompo model.

1.2.1 The Prior Distribution

The experts we have consulted tell us that the value of \(\beta\) must be between \(1\) and \(25\), with an approximately Normal distribution. Its center is then around \((25+1)/2=13\), with a standard deviation of about \((25-1)/4=6\), so we assume as prior of \(\beta\) a Normal distribution with those parameters. The experts also tell us that the Exponential distribution is suitable as a prior for the parameter \(\sigma_\epsilon\), using the reciprocal of the standard deviation of “\(y\)” as its \(\lambda\) parameter, so we adopt as prior of \(\sigma_\epsilon\) an Exponential distribution with \(\lambda=1/\textrm{sd}(y)\).

PRIOR_BETA_SD <- 6
PRIOR_BETA_MEAN <- 13
PRIOR_SIGMA_LAMBDA <- 1 / sd(ECOMPO_DATA$y)

The prior distribution \(\pi(\theta)\) of our model is:

prior_density <- function(beta, sigma, log = TRUE) {
  # Log prior density of each parameter; their sum is the log of the joint prior
  # because the priors are independent.
  log_prior_beta <- dnorm(beta, mean = PRIOR_BETA_MEAN, sd = PRIOR_BETA_SD, log = TRUE)
  log_prior_sigma <- dexp(sigma, rate = PRIOR_SIGMA_LAMBDA, log = TRUE)
  result <- log_prior_beta + log_prior_sigma

  if (log != TRUE) result <- exp(result)
  return(result)
}

Where the beta and sigma function parameters correspond to \(\beta\) and \(\sigma_\epsilon\) in our model (1.2). Since these parameters are independent, their joint prior distribution is the product of their individual priors. As expected, this distribution does not depend on the data, only on \(\theta\). Hyper-parameters like PRIOR_BETA_SD or PRIOR_SIGMA_LAMBDA, used to specify each individual prior, are model constants. We do the calculations on the logarithmic scale as a purely technical matter (numerical stability and accuracy).


Figure 1.2: Prior Distribution. We have a weakly-informative prior distribution, which means it supports all plausible model parameter values.

Our weakly-informative prior distribution supports all plausible values of our model’s parameters, as we see in the graph “Prior Distribution” 1.2.
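
A simple way to see what this prior implies is to draw a sample directly from it; a minimal sketch:

set.seed(1)
prior_sample <- data.frame(
  beta  = rnorm(5000, mean = PRIOR_BETA_MEAN, sd = PRIOR_BETA_SD),
  sigma = rexp(5000, rate = PRIOR_SIGMA_LAMBDA)
)
summary(prior_sample)   # wide ranges: all plausible parameter values are supported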

1.2.2 The Likelihood Distribution

The likelihood distribution \(\pi(y \mid \theta)\) of our model is:

likelihood_density <- function(beta, sigma, y, x, log = TRUE) {
  # Accumulate the log density of each observation under model (1.1).
  result <- 0
  for (i in seq_along(y)) {
    mean <- beta * x[i]
    result <- result + dnorm(y[i], mean = mean, sd = sigma, log = TRUE)
  }

  if (log != TRUE) result <- exp(result)
  return(result)
}

In our model, equation (1.1) defines the probability of seeing each observation in the data. Since the observations are independent, their joint probability is the product of their individual likelihoods; on the logarithmic scale, it is the sum of the logarithms of their probabilities.
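
Since dnorm() is vectorized over its first argument, the same computation can be written without the explicit loop. A sketch of an equivalent version, valid for a single pair of beta and sigma values:

likelihood_density_vec <- function(beta, sigma, y, x, log = TRUE) {
  # Sum of the per-observation log densities, as in likelihood_density().
  result <- sum(dnorm(y, mean = beta * x, sd = sigma, log = TRUE))

  if (log != TRUE) result <- exp(result)
  return(result)
}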

Note that, strictly speaking, a function is a normalized density in \(\theta\) when its integral with respect to \(\theta\) equals \(1\); for the prior this is \(\int_{-\infty}^{+\infty} \pi(\theta)~\textrm{d}\theta = 1\), which prior_density() does satisfy, being the product of two proper densities. The likelihood, however, is a density in \(y\), not in \(\theta\): the integral of likelihood_density() over \(\theta\) is not \(1\), and neither is that of the product \(\pi(y \mid \theta) \cdot \pi(\theta)\). To obtain a normalized posterior density in \(\theta\), we need to divide that product by the value of its integral over \(\theta\). We will deal with this later.
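
As a quick numerical sanity check (a sketch using R’s integrate() twice, anticipating the integration we do below for \(\pi(y)\)), we can verify that the prior does integrate to approximately \(1\) over \(\theta\):

# Double integral of the prior over theta = (beta, sigma); the result is close to 1.
integrate(function(sigmas) {
  sapply(sigmas, function(sigma) {
    integrate(function(beta) prior_density(beta, sigma, log = FALSE), -Inf, +Inf)$value
  })
}, 0, +Inf)$value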

1.2.3 The Joint Distribution

The name Joint distribution can be a little confusing because all these distributions (prior, likelihood, posterior) are joint distributions of the model parameters. Here we are referring to \(\pi(y, \theta)\), the numerator in (1.3), which is the joint density of the parameters and the data.

With what we got so far, \(\pi(y, \theta) = \pi(y \mid \theta) \cdot \pi(\theta)\) is easy:

joint_density <- function(beta, sigma, data, log = TRUE) {
  # log pi(y, theta) = log prior + log likelihood
  result <- prior_density(beta, sigma) + likelihood_density(beta, sigma, data$y, data$x)

  if (log != TRUE) result <- exp(result)
  return(result)
}

Like the likelihood viewed as a function of \(\theta\), this is also a non-normalized function of \(\theta\); we shall correct this in the posterior distribution.
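
For illustration, we can already evaluate the unnormalized log joint density at the two moments of the opening exercise (the \(\sigma_\epsilon\) values are the ones used again in section 1.2.4):

joint_density(9.3, 36.6, ECOMPO_DATA)    # moment (1)
joint_density(15.0, 55.1, ECOMPO_DATA)   # moment (2): a much more negative log density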

1.2.4 The Posterior Distribution

The joint distribution \(\pi(y, \theta) = \pi(y \mid \theta) \, \pi(\theta)\), computed by the function joint_density() above, is all we need in (1.4). However, we will develop the normalized posterior density \(\pi(\theta \mid y)\) according to (1.3), which is the joint distribution joint_density() divided by \(\pi(y)\), where \(\pi(y)\) is the value of the integral of joint_density() over \(\theta\): \[\begin{align} \pi(y) &= \int_{-\infty}^{+\infty} \pi(y \mid \theta) \cdot \pi(\theta)~\textrm{d}\theta \\[3pt] \pi(y) &= \int_0^{+\infty} \int_{-\infty}^{+\infty} \texttt{joint\_density}(\beta, \sigma_\epsilon \mid \text{data})~\textrm{d}\beta~\textrm{d}\sigma_\epsilon \end{align}\]

The posterior distribution \(\pi(\theta \mid y)\) has the form:

posterior_density <- function(beta, sigma, data, log = TRUE) {
  # log posterior = log joint density - log pi(y); pi_y is computed below
  result <- joint_density(beta, sigma, data, log = TRUE) - log(pi_y)

  if (log != TRUE) result <- exp(result)
  return(result)
}

Where pi_y is the value of \(\pi(y)\), the integral we must calculate. The difficulty is that, with pi_y = 1, the posterior_density() function returns values smaller than \(3.4 \times 10^{-47}\), and the integration is numerically unstable. To deal with this, we assign pi_y a temporary value while we integrate, and then adjust it according to the integral.

pi_y <- 3.4e-47   # temporary value, to keep the integrand in a numerically safe range
int_pi_y <- integrate(function(sigmas) {
  sapply(sigmas, function(sigma) {
    # inner integral over beta, for each fixed sigma
    integrate(function(beta) posterior_density(beta, sigma, ECOMPO_DATA, log = FALSE), -Inf, +Inf)$value
  })
}, 0, +Inf)$value   # outer integral over sigma

pi_y <- pi_y * int_pi_y   # adjust pi_y so that posterior_density() integrates to 1

The final value is \(\pi(y)=2.8\times 10^{-46}\), and posterior_density() is now a proper density. We can go back to the starting exercise and compute, using posterior_density(), the probability of observing the data at (1), \(\pi(\beta=9.3, \sigma_\epsilon=36.6 \mid y)=0.071\), and at (2), \(\pi(\beta=15, \sigma_\epsilon=55.1 \mid y)=1.6 \times 10^{-25}\), where we see that it is practically impossible to observe the data with \(\beta=15\).
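
These two values come from calls like the following (a usage sketch of our normalized density):

posterior_density(9.3, 36.6, ECOMPO_DATA, log = FALSE)    # point (1): about 0.071
posterior_density(15.0, 55.1, ECOMPO_DATA, log = FALSE)   # point (2): about 1.6e-25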

The Graph of the Posterior Density

The Posterior Distribution graph 1.3 shows the surface of our just-developed posterior_density() function. To build this graph, we computed the dist_mat matrix with the code below. The dist_mat matrix contains the posterior density for a grid of equidistant points in the beta and sigma plane. The functions row_to_beta() and col_to_sigma(), also in the code below, allow us to obtain the values of beta and sigma from the row and column position of any cell. We will use this matrix a few times later.

betas <- seq(7.5, 11, by = 0.02)
sigmas <- seq(19, 79, by = 0.02)

# Posterior density over a regular grid: rows follow betas, columns follow sigmas.
dist_mat <- outer(
  betas, sigmas,
  function(beta, sigma) posterior_density(beta, sigma, ECOMPO_DATA, log = FALSE)
)
row_to_beta <- function(row_nbr) (row_nbr - 1) * 0.02 + 7.5   # row index -> beta value
col_to_sigma <- function(col_nbr) (col_nbr - 1) * 0.02 + 19   # column index -> sigma value
str(dist_mat)

num [1:176, 1:3001] 4.01e-20 8.73e-20 1.88e-19 4.02e-19 8.49e-19 ...

The Posterior Distribution Mode

In the Posterior Distribution graph 1.3, we see that in our model \(\pi(\theta \mid y)\) has a prominent modal point. Estimation algorithms of the maximum-likelihood type report the position of the likelihood’s mode as the solution for the model parameters. In full Bayesian inference, we instead estimate the whole distribution of \(\theta\), which allows us to compute any measure we want, such as the mean, mode, quartiles, standard deviation, mean absolute deviation, and so on.
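
As a first use of the dist_mat matrix, the following sketch approximates the position of that modal point by locating the grid cell with the highest posterior density (the accuracy is limited by the 0.02 grid step):

mode_cell <- which(dist_mat == max(dist_mat), arr.ind = TRUE)   # row/column of the maximum
c(
  beta_mode  = row_to_beta(mode_cell[1, "row"]),
  sigma_mode = col_to_sigma(mode_cell[1, "col"])
)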