1 Introduction to Bayesian Inference

Introduction

In the Bayesian paradigm, a model’s parameters are random variables. Different possible values of those parameters make the model more or less likely to produce data like the data we actually observed.

This post develops the core concepts of Bayesian inference through a practical example: solving a simple regression model numerically, in a fully Bayesian way. We will match each theoretical concept with its practical implementation, helping you to fully grasp its meaning.

The eCompo dataset

We will use a dataset named “eCompo” for our running example. These are a series of readings of the response “\(y\)” emitted by an electronic component after being subjected to different stimulus values “\(x\)” in a test. We want to know the two parameters \(\beta\) and \(\sigma_\epsilon\) in the model shown below: \[\begin{align} y &= \beta \cdot x + \epsilon \\ \epsilon &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \tag{1.1} \\[3pt] \implies y - (\beta \cdot x) &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \tag{1.2} \end{align}\]

ECOMPO_DATA <- data.frame(
  x = c(23.54, 22.97, 11.78, 18.64, 14.36, 18.93, 19.04, 10.768, 27.39, 
    17.28, 22.63, 14.32, 22.95, 27.82, 28.34, 23.71, 29.73, 25.41, 
    27.60, 15.54 ),
  y = c(247.35, 180.47, 86.78, 178.12, 168.72, 135.10, 160.17, 172.48, 
    274.35, 130.32, 224.71, 144.31, 245.39, 314.81, 256.71, 187.07, 
    251.90, 253.75, 181.58, 179.68)
)

In the model equation (1.2), notice that the expected distribution of the residuals has variance \(\sigma_{\epsilon}^2\), but its center is fixed at zero. In the following examples, we will estimate that variance with the variance of the residuals in the data.
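For example, given a candidate value of \(\beta\) (we use \(9.3\), which also appears in the next section, purely as an illustration), the residuals and their standard deviation can be computed directly from the data; a minimal sketch:

beta_candidate <- 9.3                                        # illustrative value only
residuals <- ECOMPO_DATA$y - beta_candidate * ECOMPO_DATA$x

mean(residuals)  # how far the residuals' center is from zero
sd(residuals)    # a simple estimate of sigma_epsilon for this beta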

1.1 Bayesian Inference

1.1.1 The Notion of Joint Probability

In the “Beta vs. Residuals” graph 1.1, the residuals’ expected distribution is shown in green, and their actual distribution \(\epsilon = y - (\beta \cdot x)\) in purple.

With \(\beta = 9.3\) in (1), the residuals approach their expected, zero-centered distribution. In contrast, when \(\beta\) increases to \(15\) in (2), the residuals spread out, and their center moves away from its expected zero value. We say that \(\beta = 15\) “makes it unlikely to observe the data we have.”

The phrase “makes it unlikely to observe the data” is an abbreviated way of saying that some parameter values make it more or less probable for the model to produce the observed data; in this case, less probable.


Figure 1.1: Beta vs. Residuals. The residuals’ expected distribution is colored green; it is centered at zero and has the residuals’ actual variance. The residuals’ real distribution is colored purple. At moment (1), \(\beta=9.3\) produces residuals whose distribution roughly matches the expected one, so the green and purple blobs overlap. At moment (2), \(\beta=15\) causes more spread and pushes the residuals toward negative values; the expected and actual distributions become more different from each other.

This exercise aims to give you the intuition that a model’s parameters have a joint distribution. Changing any of them changes the likelihood of observing the data according to the model.
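To reproduce this intuition numerically, we can summarize the residuals for the two \(\beta\) values of Figure 1.1; a small sketch (not the code that produced the figure):

residual_summary <- function(beta) {
  eps <- ECOMPO_DATA$y - beta * ECOMPO_DATA$x   # residuals for this beta
  c(beta = beta, center = mean(eps), spread = sd(eps))
}

# Compare the two beta values shown in Figure 1.1
rbind(residual_summary(9.3), residual_summary(15))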

1.2 The Bayesian Inference Components

In our eCompo model, the joint distribution of the parameters \((\beta, \sigma_{\epsilon})\), given the model \(\mathcal{M}\) and the data \((y, X)\), can be denoted as: \[\begin{align*} \pi(\beta, \sigma_{\epsilon} \mid \mathcal{M}, y, X) \end{align*}\]

Where \(\pi()\) denotes the distribution, which depends on \(\beta\) and \(\sigma_{\epsilon}\), given the model and the data.

1.2.1 Notational Convention

For continuous random variables, we use the term distribution in the same sense as probability density function (PDF), which corresponds to a probability mass function (PMF) for discrete variables. We use \(\pi(\cdot)\) to denote a PDF or a PMF. From the context, it will be evident which is the case.

Since \(\pi(\cdot)\) is the distribution of its argument \((\cdot)\), it is a different function for each different \((\cdot)\). Again, it refers to a PDF in the continuous case or a PMF otherwise. On some occasions, to avoid confusion, we use \(Pr(\cdot)\) to denote an event’s probability.

1.2.2 Simplifying the Notation

For simplicity, we will use the vector \(\theta\) to refer to all the model parameters. In turn, \(y\) will represent all the data, so \(X\) will be implicit. Since the model is constant, we will also omit it.

With this simplified notation, the joint distribution of the model’s parameters \(\theta\) given the data is: \[\begin{align*} \pi(\theta \mid y) \end{align*}\]

By the Bayes rule we know: \[\begin{align} \pi(\theta \mid y) &= \frac{\pi(y, \theta)}{\pi(y)} = \frac{\pi(y \mid \theta) \cdot \pi(\theta)}{\pi(y)} \tag{1.3} \end{align}\]

The equation (1.3) is fundamental in Bayesian inference, and each term has a name that we will refer to frequently.

  • \(\pi(\theta)\) is the prior distribution: the distribution we assume for the parameters before seeing the data.
  • \(\pi(y \mid \theta)\) is the data sampling distribution. When the data \(y\) is considered fixed, this is a function only of \(\theta\), and the term is then called the model likelihood.
  • \(\pi(\theta \mid y)\) is the posterior distribution: the distribution of the parameters conditioned on the data.
  • \(\pi(y)\) is sometimes called the evidence or the marginal likelihood.

The names prior and posterior refer to the fact that they are the distributions modeled before and after looking at the data, respectively.

1.2.3 \(\pi(y)\)

\(\pi(y)\), the denominator in (1.3), is a value that depends only on the data \(y\); it is the normalization constant that makes the distribution \(\pi(\theta \mid y)\) integrate to \(1\) (or sum to \(1\) in the discrete case).

For continuous \(\theta\), \(\pi(y)\) is: \[\begin{align} \pi(y) &= \int \pi(y \mid \theta) \cdot \pi(\theta) \, \text{d}\theta \tag{1.4} \end{align}\]

For the discrete case, the previous equation is equivalent to:

\[ \pi(y) = \sum_{\theta} \pi(y \mid \theta) \cdot \pi(\theta) \]

In these expressions, we are averaging \(\pi(y \mid \theta)\) (the likelihood) over all possible \(\theta\) values. This operation is called marginalization and is the reason why \(\pi(y)\) is called the marginal likelihood.
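As a toy illustration of this sum, with made-up numbers unrelated to the eCompo model, suppose \(\theta\) could take only three values:

# Hypothetical discrete example: theta takes only three values
prior_theta <- c(0.2, 0.5, 0.3)                       # pi(theta), sums to 1
lik_theta   <- c(0.010, 0.040, 0.002)                 # pi(y | theta) for the observed y

marginal_y  <- sum(lik_theta * prior_theta)           # pi(y), the marginal likelihood
post_theta  <- lik_theta * prior_theta / marginal_y   # posterior, sums to 1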

1.3 Building the Posterior Distribution

We will develop, for the eCompo model, the distributions we just defined in the previous section. Later we will use them to solve that model.

1.3.1 The Prior Distribution

The prior distribution is modeled with the information about the model parameters known before seeing the data, or with general expectations about their range. A prior that only loosely constrains the parameters is called a weakly-informative prior. In our example, we ask the company experts what the domain of the model’s parameters is.

The experts tell us the value of \(\beta\) must be between \(1\) and \(25\), with an approximately Normal distribution. With this, we can say that the \(\beta\) prior distribution is Normal, with its center around \((25+1)/2=13\) and a standard deviation of about \((25-1)/4=6\).

The experts also tell us that the Exponential distribution is suitable as a prior for the parameter \(\sigma_\epsilon\), using the reciprocal of the standard deviation of “\(y\)” as its \(\lambda\) parameter, so we adopt as the prior of \(\sigma_\epsilon\) an Exponential distribution with \(\lambda=1/\textrm{sd}(y)\).

PRIOR_BETA_SD <- 6
PRIOR_BETA_MEAN <- 13
PRIOR_SIGMA_LAMBDA <- 1 / sd(ECOMPO_DATA$y)

The prior distribution \(\pi(\theta)\) of our model is:

prior_density <- function(beta, sigma, log = TRUE) {
  # Independent priors: Normal on beta, Exponential on sigma_epsilon
  log_prior_beta <- dnorm(beta, mean = PRIOR_BETA_MEAN, sd = PRIOR_BETA_SD, log = TRUE)
  log_prior_sigma <- dexp(sigma, rate = PRIOR_SIGMA_LAMBDA, log = TRUE)
  # On the log scale, the joint prior is the sum of the individual log-densities
  result <- log_prior_beta + log_prior_sigma

  if (log != TRUE) result <- exp(result)
  return(result)
}

Where the beta and sigma function parameters correspond to \(\beta\) and \(\sigma_\epsilon\) in our model (1.2). Since these parameters are independent, their joint prior distribution is the product of their individual priors. As expected, this distribution does not depend on the data, only on \(\theta\). Hyper-parameters like PRIOR_BETA_SD or PRIOR_SIGMA_LAMBDA, used to model each individual prior, are model constants. We use the logarithmic scale for the calculations as a purely technical matter (numerical stability and accuracy).
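One simple way to get a picture like the Prior Distribution graph below (though not necessarily the code that produced it) is to draw samples from the prior:

# Sample from the prior to visualize it (illustrative sketch only)
set.seed(1)
prior_samples <- data.frame(
  beta  = rnorm(5000, mean = PRIOR_BETA_MEAN, sd = PRIOR_BETA_SD),
  sigma = rexp(5000, rate = PRIOR_SIGMA_LAMBDA)
)
plot(prior_samples$beta, prior_samples$sigma,
  pch = ".", xlab = "beta", ylab = "sigma_epsilon")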


Figure 1.2: Prior Distribution. We have a weakly-informative prior distribution, supporting all plausible model parameter values.

Our weakly-informative prior distribution supports all plausible parameter values. It is shown in the “Prior Distribution” graph 1.2.

1.3.2 The Likelihood Distribution

The likelihood distribution \(\pi(y \mid \theta)\) of our model is:

likelihood_density <- function(beta, sigma, y, x, log = TRUE) {
  result <- 0
  for (i in seq_along(y)) {
    # Under (1.1), observation y[i] is Normal with mean beta * x[i] and sd sigma
    mean <- beta * x[i]
    result <- result + dnorm(y[i], mean = mean, sd = sigma, log = TRUE)
  }

  if (log != TRUE) result <- exp(result)
  return(result)
}

In our model, the equation (1.1) defines the probability of observing each data point. Since the observations are independent, their joint probability is the product of their individual likelihoods; on the logarithmic scale, it is the sum of their log-densities.
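Since dnorm() is vectorized over its first and second arguments, the loop above can also be written more compactly; this alternative should return the same values:

# Vectorized equivalent of likelihood_density() (same result, no loop)
likelihood_density_vec <- function(beta, sigma, y, x, log = TRUE) {
  result <- sum(dnorm(y, mean = beta * x, sd = sigma, log = TRUE))

  if (log != TRUE) result <- exp(result)
  return(result)
}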

Note that, viewed as a function of \(\theta\), likelihood_density() is not a normalized density: its integral with respect to \(\theta\) is not \(1\).

The prior_density() function, being a product of proper densities, does satisfy \(\int_{-\infty}^{+\infty} \pi(\theta)~\textrm{d}\theta = 1\); but as soon as we multiply it by the likelihood, the result is no longer normalized in \(\theta\), and we will need to divide it by the value of its integral over \(\theta\). We will deal with this later.
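We can verify the prior numerically with the same nested-integration pattern we will use later for \(\pi(y)\); a quick check, not part of the model code itself:

# Numerical check: the prior integrates to (approximately) 1 over theta
integrate(function(sigmas) {       # outer integral over sigma
  sapply(sigmas, function(sigma) {
    integrate(function(beta)       # inner integral over beta
      prior_density(beta, sigma, log = FALSE), -Inf, +Inf)$value
  })
}, 0, +Inf)$value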

1.3.3 The Joint Distribution

The name Joint distribution can be confusing because all these distributions (prior, likelihood, posterior) are joint distributions on the model parameters. Here we are referring to \(\pi(y, \theta)\), the numerator in (1.3), which is the joint density of the parameters and the data.

As this function is the product of the prior and likelihood densities \(\pi(y, \theta) = \pi(y \mid \theta) \cdot \pi(\theta)\), this is easy:

joint_density <- function(beta, sigma, data, log = TRUE) {
  result <- prior_density(beta, sigma) + likelihood_density(beta, sigma, data$y, data$x)

  if (log != TRUE) result <- exp(result)
  return(result)
}

As in previous cases, this is also a non-normalized distribution, and we shall correct this in the posterior distribution.

1.3.4 The Posterior Distribution

The joint distribution \(\pi(y, \theta) = \pi(y \mid \theta) \cdot \pi(\theta)\), computed by the function joint_density() above, is the non-normalized version of \(\pi(\theta \mid y)\), in the sense that its integral is not equal to \(1\). We must divide joint_density() by the normalization constant \(\pi(y)\) to get the normalized posterior density \(\pi(\theta \mid y)\), according to (1.3).

We will get \(\pi(y)\) by marginalizing out \(\theta\) from \(\pi(y, \theta)\). Mathematically, this corresponds to integrating \(\pi(y, \theta)\) over \(\theta\). \[\begin{align} \pi(y) &= \int \pi(y , \theta) \text{d}\theta \end{align}\]

In turn, we can replace \(\pi(y, \theta) = \pi(y \mid \theta) \cdot \pi(\theta)\), which gives us the equation (1.4). In our case, \(\theta\) contains two parameters: \[\begin{align} \pi(y) &= \int_0^{+\infty} \int_{-\infty}^{+\infty} \textrm{joint}\_\textrm{density}(\beta, \sigma_\epsilon ,\text{data}) ~\textrm{d}\beta~\textrm{d}\sigma_\epsilon \end{align}\]

The posterior distribution \(\pi(\theta \mid y)\) has the form:

posterior_density <- function(beta, sigma, data, log = TRUE) {
  result <- joint_density(beta, sigma, data, log = TRUE) - log(pi_y)

  if (log != TRUE) result <- exp(result)
  return(result)
}

Where pi_y is the value of \(\pi(y)\), which we still have to calculate. In the code above, we subtract on the logarithmic scale, which is equivalent to dividing on the linear one.

There is a difficulty: when we integrate with pi_y = 1, the posterior_density() function returns values smaller than \(3.4 \times 10^{-47}\), which makes the numerical integration unstable. To deal with this, we assign pi_y a temporary value of that order of magnitude while we integrate, and then adjust it according to the result of the integral.

pi_y <- 3.4e-47
int_pi_y <- integrate(function(sigmas) { # Integrate over sigma
  sapply(sigmas, function(sigma) {
    integrate(function(beta) # Integrate over beta
      posterior_density(beta, sigma, ECOMPO_DATA, log = FALSE), -Inf, +Inf)$value
  })
}, 0, +Inf)$value

pi_y <- pi_y * int_pi_y

In the code above, as in the mathematical expression, the inner integration is the integrand of the outer one. We wrap the inner integration in a call to sapply() because integrate() passes a vector of sigma values to its integrand, so we need to compute the inner integral over beta for each of those sigmas.

The final value is \(\pi(y)= 2.80 \times 10^{-46}\), and now posterior_density() is a properly normalized density.

Posterior Distribution Graph

The Posterior Distribution graph 1.3 shows the surface of the posterior_density() function we just developed. Notice that it maps each possible set of parameters (the horizontal plane) to its density (the vertical axis), which in turn tells us that some combinations of parameters explain the data better than others.

To build this graph, we compute the dist_mat matrix with the code below. This matrix contains the posterior density for a grid of equidistant points in the beta and sigma plane. The functions row_to_beta() and col_to_sigma(), also defined in the code below, let us recover the values of beta and sigma from the row and column position of any cell in the dist_mat matrix.

beta_start <- 7.5
sigma_start <- 19

betas <- seq(beta_start, 11, by = 0.02)
sigmas <- seq(sigma_start, 79, by = 0.02)

dist_mat <- outer(
  betas, sigmas,
  function(beta, sigma) posterior_density(beta, sigma, ECOMPO_DATA, log = FALSE)
)
row_to_beta <- function(row_nbr) (row_nbr - 1) * 0.02 + beta_start
col_to_sigma <- function(col_nbr) (col_nbr - 1) * 0.02 + sigma_start
str(dist_mat)

num [1:165, 1:2931] 4.01e-20 8.73e-20 1.88e-19 4.02e-19 8.49e-19 ...
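As a rough sanity check (assuming the grid above covers essentially all of the posterior mass), the Riemann sum of dist_mat times the area of each grid cell should be close to \(1\):

# Riemann-sum approximation of the posterior integral over the grid
sum(dist_mat) * 0.02 * 0.02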

The Posterior Distribution Mode

In the Posterior Distribution graph 1.3, we see that it has a prominent mode. Maximum-likelihood estimation algorithms take the position of the likelihood distribution’s mode as the solution for the model parameters. They do not use a prior distribution (equivalently, they use a flat one), so their posterior is proportional to the likelihood, hence the name maximum likelihood.
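If we wanted to locate the posterior mode numerically, one simple way (a sketch using the grid we just computed, not necessarily how the graph does it) is to find the cell of dist_mat with the highest density:

# Locate the grid cell with the highest posterior density
mode_cell <- which(dist_mat == max(dist_mat), arr.ind = TRUE)
c(beta  = row_to_beta(mode_cell[1, "row"]),
  sigma = col_to_sigma(mode_cell[1, "col"]))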