- 1 Bayesian Inference with Sampling
- Introduction
- 1.1 Bayesian Inference
- 1.2 An Example of Bayesian Inference with Sampling
- 1.3 Bayesian Inference using Stan
- 1.3.1 Jacobian Correction for Change of Variable
- 1.3.2 The Posterior Density According to Stan vs. posterior_density()
- 1.3.3 The Posterior Distribution from Stan
- 1.3.4 Autocorrelation and Sample Thinning
- 1.3.5 Marginal Distribution Integrating the Posterior
- 1.3.6 Expected Value of Parameters
- 1.3.7 Predictive Check of Posterior and Prior Distribution

- References

In Bayesian inference with sampling, you learn about the model parameters through a sample taken from their *posterior distribution*: the joint probability distribution of *observing the data* with the model, as a function of the model parameter values.

This post presents the Bayesian inference concepts through their application in a practical example, solving “by hand” a simple regression model in a fully Bayesian way and then using `stan`, a tool for Bayesian inference with sampling. By matching the steps we do by hand with what `stan` does, you will deeply understand how modeling works with `stan` and other packages that use `stan` in the background, like `brms` or `rstanarm`.

Except for relatively simple models, there are no analytical solutions for obtaining Bayesian estimates. This limitation has led to the development of various approximation methods. When precision is essential, simulation-based methods offer a valuable alternative. These methods aim to obtain a sample from the joint *posterior distribution* of the model parameters, which you can use to approximate almost any quantity relevant to Bayesian inference, including expected values, variances, quantiles, and marginal densities (Wang and Park 2020).

In the R language, to fit a model to the data using Bayesian inference with sampling, you can use packages based on *BUGS* (**B**ayesian inference **U**sing **G**ibbs **S**ampling), *JAGS* (**J**ust **A**nother **G**ibbs **S**ampler), or *Stan*.

These tools are based on *Markov Chain Monte Carlo (MCMC)* sampling methods (Geyer 2011). At the opposite end from the MCMC methods are the point estimation methods, where each parameter is estimated by a single value, and the uncertainty about it by another single value, its standard error. A Bayesian inference method instead describes the parameters with a sample from their joint *posterior distribution*.

*Stan* is a software platform at the core of several cutting-edge algorithms for statistical modeling. Around this core, several interfaces expose *Stan* services in different programming languages, such as Python, R, and Julia, among others. To use *Stan* in the R language, you install its official interface, the package `rstan` (Guo et al. 2020), made by the creators of `stan`. So *Stan* is a platform, a language, and an interface in R, all of which we will refer to as `stan`; the context will indicate which of these meanings we are referring to.

With `stan`, it is possible to define and solve models as complicated and elaborate as desired. However, in the case of well-known models, such as those in the family of Generalized Linear Models and their extensions in multilevel models, the `rstanarm` (Gabry and Goodrich 2020) or `brms` (Bürkner 2018) packages allow us to do the modeling similarly to how it is usually done in R. Those packages build the equivalent model in the `stan` language behind the scenes and use `rstan` to solve it. This automatic translation eases the learning curve by avoiding the need to program directly in the `stan` language. They also let you easily use and combine model families that would otherwise be very time-consuming to create correctly by hand from scratch in the `stan` language.

We will use a dataset named *“eCompo”* as the running example. These are a series of readings of the response “\(y\)” emitted by an electronic component after being subjected to different stimulus values “\(x\)” in a test. We want to know the two parameters \(\beta\) and \(\sigma_\epsilon\) in the model shown below:
\[\begin{align}
y &= \beta \cdot x + \epsilon
\\
\epsilon &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \tag{1.1}
\\[3pt]
\implies y - (\beta \cdot x) &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \tag{1.2}
\end{align}\]
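
Before looking at the real readings, here is a minimal sketch, not part of the original post, of how model (1.1) would generate responses; the parameter values \(\beta = 9\) and \(\sigma_\epsilon = 35\) are purely illustrative:

```
# A minimal sketch (illustrative values, not the eCompo data): simulate
# responses from model (1.1) with beta = 9 and sigma_eps = 35.
set.seed(1)
x_sim <- runif(20, min = 10, max = 30)             # stimulus values
y_sim <- 9 * x_sim + rnorm(20, mean = 0, sd = 35)  # y = beta * x + epsilon
```

The actual *eCompo* readings are: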

```
ECOMPO_DATA <- data.frame(
  x = c(23.54, 22.97, 11.78, 18.64, 14.36, 18.93, 19.04, 10.768, 27.39,
        17.28, 22.63, 14.32, 22.95, 27.82, 28.34, 23.71, 29.73, 25.41,
        27.60, 15.54),
  y = c(247.35, 180.47, 86.78, 178.12, 168.72, 135.10, 160.17, 172.48,
        274.35, 130.32, 224.71, 144.31, 245.39, 314.81, 256.71, 187.07,
        251.90, 253.75, 181.58, 179.68)
)
```

The expected distribution of the residuals (1.1) has a variance of its own, but its center is fixed at zero. In the following examples, we will estimate the variance \(\sigma_{\epsilon}^2\) with that of the residuals in the data.

In graph *“Beta vs. Residuals”* 1.1, the expected distribution of the residuals is shown in green, and their actual distribution in the data, \(\epsilon = y - (\beta \cdot x)\), in purple. With \(\beta = 9.3\) in (1), the residuals approach their expected distribution. As \(\beta\) increases, for example up to \(\beta = 15\) in (2), the residuals spread out and their center moves to the left, away from the expected center at zero. As their variance increases, their expected distribution flattens out, and the match gets worse: the actual distribution is more and more different from the expected one. In the latter case, we say that \(\beta = 15\) “makes it unlikely to *observe* the data we have.”
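
The following sketch reproduces the spirit of that comparison, assuming the `ECOMPO_DATA` defined above; the helper `compare_residuals()` is illustrative and is not the code that produced graph 1.1.

```
# Compare the actual residuals for a given beta (purple in graph 1.1) with the
# Normal distribution they are expected to follow (green in graph 1.1).
compare_residuals <- function(beta, data = ECOMPO_DATA) {
  resid     <- data$y - beta * data$x  # actual residuals for this beta
  sigma_hat <- sd(resid)               # estimate sigma_eps from the residuals
  plot(density(resid), col = "purple",
       main = sprintf("beta = %.1f, estimated sd = %.1f", beta, sigma_hat))
  curve(dnorm(x, mean = 0, sd = sigma_hat), col = "darkgreen", add = TRUE)
}

compare_residuals(9.3)  # (1) residuals close to their expected distribution
compare_residuals(15)   # (2) residuals spread out and shifted away from zero
```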

The phrase *“makes it unlikely to observe the data”* is an abbreviated way to say that some parameter values make it more or less probable for the model to produce the observed data; in this case, less likely.

This exercise aims to build the intuition that a model’s parameters have *a joint distribution* of the probability of observing the data according to the model, depending on the parameter values. When we move one or more parameters, we use the data to see how *“the likelihood of seeing it”* varies accordingly.

In our eCompo model the joint probability of observing the data (\(y, X\)) with the model \(\mathcal{M}\), and parameters \(\beta, \sigma_{\epsilon}\) is: \[\begin{align*} \pi(\beta, \sigma_{\epsilon} \mid \mathcal{M}, y, X) \end{align*}\]

For simplicity, we will refer to all the model parameters with the vector \(\theta\). In turn, with \(y\) we will represent all the data, so \(X\) will be implicit, as will the model, which is constant. Finally, in these expressions, \(\pi(\cdot)\) is the *probability density function* of \((\cdot)\), meaning that it is a different function for each different \((\cdot)\); it corresponds to a *mass function* in the discrete case. We will use the term *distribution* in the same sense as *probability density function*.

With this simplified notation, the joint probability of observing the data with the model and parameters \(\theta\) is: \[\begin{align*} \pi(\theta \mid y) \end{align*}\]

By the Bayes rule we know: \[\begin{align} \pi(\theta \mid y) &= \frac{\pi(y, \theta)}{\pi(y)} = \frac{\pi(y \mid \theta) \cdot \pi(\theta)}{\pi(y)} \tag{1.3} \\[3pt] \pi(\theta \mid y) &\propto \pi(y \mid \theta) \cdot \pi(\theta) \tag{1.4} \end{align}\]

In the equation (1.3), the numerator is the joint probability of the parameters and the data in the sense we explored in the previous exercise. The denominator \(\pi(y)\) is a constant that does not depend on \(\theta\), but only on the data, so \(\pi (\theta \mid y)\) in (1.4) is proportional to the numerator in (1.3). We can consider the right-hand side of (1.4) as the non-normalized distribution of \(\pi(\theta \mid y)\), that is, a distribution whose integral is not equal to \(1\).

The equation (1.4) is fundamental in Bayesian inference, and each term has a name that we will refer to frequently.

- \(\pi(\theta)\) is the **prior distribution**: the one we assume the parameters have before seeing the data.
- \(\pi(y \mid \theta)\) is the data sampling distribution. Since the data “\(y\)” remains constant, it is a function of \(\theta\), and in this sense, it is called the **likelihood** of our model.
- \(\pi(\theta \mid y)\) is the **posterior distribution**: it is the distribution of the parameters conditioned on the data. We will refer to this when we talk about the probability of seeing the data we have, with the model and parameters we have.

Since \(\theta\) is a vector, all these terms correspond to *joint* probability distributions. Notice how the *likelihood* term refers to the notion of *“possibility to observe the data”* in the previous exercise: with some parameters, it is more *likely* for the model to produce the data.

In this section, we will see a practical example of the concepts introduced in the previous section, analyzing the eCompo model numerically (without using `stan` in any way).

The experts we have consulted tell us the value of \(\beta\) must be between \(1\) and \(25\), with an approximately Normal distribution. Its center is then around \((25+1)/2=13\), with a standard deviation of about \((25-1)/4=6\), so we assume as *prior* of \(\beta\) a Normal distribution with those parameters. The experts also tell us that the Exponential distribution is suitable as a *prior* for the parameter \(\sigma_\epsilon\), using the reciprocal of the standard deviation of “\(y\)” as its \(\lambda\) parameter, so we adopt as *prior* of \(\sigma_\epsilon\) an Exponential distribution with \(\lambda=1/\textrm{sd}(y)\).

```
PRIOR_BETA_SD      <- 6
PRIOR_BETA_MEAN    <- 13
PRIOR_SIGMA_LAMBDA <- 1 / sd(ECOMPO_DATA$y)
```

The *prior* distribution \(\pi(\theta)\) of our model is:

```
prior_density <- function(beta, sigma, log = TRUE) {
  log_prior_beta  <- dnorm(beta, mean = PRIOR_BETA_MEAN, sd = PRIOR_BETA_SD, log = TRUE)
  log_prior_sigma <- dexp(sigma, rate = PRIOR_SIGMA_LAMBDA, log = TRUE)
  result <- log_prior_beta + log_prior_sigma
  if (log != TRUE) result <- exp(result)
  return(result)
}
```

Where the `beta` and `sigma` function parameters correspond to \(\beta\) and \(\sigma_\epsilon\) in our model (1.2). Since these parameters are independent, their joint *prior* distribution is the product of the parameters’ individual *priors*. As expected, this distribution does not depend on the data, only on \(\theta\). The *hyper-parameters* like `PRIOR_BETA_SD` or `PRIOR_SIGMA_LAMBDA` used to model each individual prior are model constants. We do the calculations on the logarithmic scale as a purely technical matter (numerical stability and accuracy).

Our weakly-informative prior distribution supports all plausible parameter values, as we see in the graph *“Prior Distribution”* 1.2 of our model.
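
A sketch of how such a surface can be evaluated, assuming the `prior_density()` function above; the grid limits and the plotting call are illustrative and are not the code behind graph 1.2.

```
# Evaluate the joint prior on a (beta, sigma) grid; dnorm() and dexp()
# vectorize, so outer() can apply prior_density() to the whole grid at once.
beta_grid  <- seq(-5, 31, by = 0.25)
sigma_grid <- seq(1, 300, by = 2)
prior_mat  <- outer(beta_grid, sigma_grid,
                    function(b, s) prior_density(b, s, log = FALSE))
filled.contour(beta_grid, sigma_grid, prior_mat, xlab = "beta", ylab = "sigma")
```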

The likelihood distribution \(\pi(y \mid \theta)\) of our model is:

```
likelihood_density <- function(beta, sigma, y, x, log = TRUE) {
  result <- 0
  for (i in seq_along(y)) {
    mean   <- beta * x[i]
    result <- result + dnorm(y[i], mean = mean, sd = sigma, log = TRUE)
  }
  if (log != TRUE) result <- exp(result)
  return(result)
}
```

In our model, equation (1.1) defines the probability of seeing each observation in the data. Since these are independent, their joint probability is the product of their individual *likelihoods*. On the logarithmic scale, it is the sum of the logarithms of their probabilities.
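
The loop above can also be written in a vectorized form (a sketch, not from the original post), which makes the sum of per-observation log densities explicit:

```
# Vectorized equivalent of likelihood_density() on the log scale: dnorm() is
# applied to all observations at once and the log densities are summed.
log_likelihood <- function(beta, sigma, y, x) {
  sum(dnorm(y, mean = beta * x, sd = sigma, log = TRUE))
}
log_likelihood(9.3, 36.6, ECOMPO_DATA$y, ECOMPO_DATA$x)
```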

Note that, strictly speaking, `likelihood_density()` and `prior_density()` are non-normalized density functions. For them to be normalized, the integral with respect to \(\theta\) of each of these functions is required to be \(1\). For example, in the case of the *prior* distribution, this is \(\int_{-\infty}^{+\infty} \pi(\theta)~\textrm{d}\theta = 1\), which is not the case in those functions. We need to divide each function’s result by the value of its integral over \(\theta\). We will deal with this later.

The name *Joint* distribution can be a little confusing because all these distributions (prior, likelihood, posterior) are joint distributions of the model parameters. Here we are referring to \(\pi(y, \theta)\), the numerator in (1.3), which is the joint density of the parameters and the data.

With what we got so far, \(\pi(y, \theta) = \pi(y \mid \theta) \cdot \pi(\theta)\) is easy:

```
joint_density <- function(beta, sigma, data, log = TRUE) {
  result <- prior_density(beta, sigma) + likelihood_density(beta, sigma, data$y, data$x)
  if (log != TRUE) result <- exp(result)
  return(result)
}
```

As in previous cases, this is also a non-normalized distribution, and we shall correct this in the posterior distribution.

The joint distribution \(\pi(y, \theta) = \pi(y \mid \theta) \, \pi(\theta)\) computed by the function `joint_density()` above is all we need in (1.4). However, we will develop the *normalized posterior* density \(\pi(\theta \mid y)\) according to (1.3), which is the joint distribution `joint_density()` divided by \(\pi(y)\). Where, in turn, \(\pi(y)\) is the value of the integral of `joint_density()` over \(\theta\).
\[\begin{align}
\pi(y) &= \int_{-\infty}^{+\infty} \pi(y \mid \theta) \cdot \pi(\theta)~\textrm{d}\theta
\\[3pt]
\pi(y) &= \int_0^{+\infty} \int_{-\infty}^{+\infty} \textrm{joint\_density}(\beta, \sigma_\epsilon \mid \text{data})~\textrm{d}\beta~\textrm{d}\sigma_\epsilon
\end{align}\]

The *posterior* distribution \(\pi(\theta \mid y)\) has the form:

```
posterior_density <- function(beta, sigma, data, log = TRUE) {
  result <- joint_density(beta, sigma, data, log = TRUE) - log(pi_y)
  if (log != TRUE) result <- exp(result)
  return(result)
}
```

Where `pi_y` is the value of \(\pi(y)\), the integral value we must calculate. We have the difficulty that when we integrate with `pi_y = 1`, the `posterior_density()` function returns values smaller than \(3.4 \times 10^{-47}\), and the integration is numerically unstable. To deal with this, we assign a temporary value to `pi_y` while we integrate, and then we adjust it according to the value of the integral.

```
pi_y <- 3.4e-47
int_pi_y <- integrate(function(sigmas) {
  sapply(sigmas, function(sigma) {
    integrate(function(beta) posterior_density(beta, sigma, ECOMPO_DATA, log = F), -Inf, +Inf)$value
  })
}, 0, +Inf)$value
pi_y <- pi_y * int_pi_y
```

The final value is \(\pi(y)=2.8\times 10^{-46}\), and now the `posterior_density()` function is a proper density distribution. We can now go back to the starting exercise and compute, using `posterior_density()`, the probability of observing the data at (1) \(\pi(\beta=9.3, \sigma_\epsilon=36.6 \mid y)=0.071\) and at (2) \(\pi(\beta=15, \sigma_\epsilon=55.1 \mid y)=1.6 \times 10^{-25}\), where we see that it is practically impossible to *observe* the data with \(\beta=15\).
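
As a usage sketch, assuming the functions defined above, these two evaluations look like this:

```
# Posterior density at the two points from the "Beta vs. Residuals" exercise.
posterior_density(beta = 9.3, sigma = 36.6, ECOMPO_DATA, log = FALSE)  # ~ 0.071
posterior_density(beta = 15,  sigma = 55.1, ECOMPO_DATA, log = FALSE)  # ~ 1.6e-25
```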

The *Posterior Distribution* graph 1.3 shows the surface of our just-developed `posterior_density()` function. To build this graph, we have computed the `dist_mat` matrix with the code below. The `dist_mat` matrix contains the *posterior* density for a grid of equidistant points in the `beta` and `sigma` plane. The functions `row_to_beta()` and `col_to_sigma()`, also in the code below, allow us to obtain the values of `beta` and `sigma` from the row and column position of any cell. Later we will use this matrix a few times.

```
betas  <- seq(7.5, 11, by = 0.02)
sigmas <- seq(19, 79, by = 0.02)

dist_mat <- outer(
  betas, sigmas,
  function(beta, sigma) posterior_density(beta, sigma, ECOMPO_DATA, log = FALSE)
)
row_to_beta  <- function(row_nbr) (row_nbr - 1) * 0.02 + 7.5
col_to_sigma <- function(col_nbr) (col_nbr - 1) * 0.02 + 19
str(dist_mat)
```

`num [1:165, 1:2931] 4.01e-20 8.73e-20 1.88e-19 4.02e-19 8.49e-19 ...`

In the *Posterior Distribution* graph 1.3, we see that in our model \(\pi(\theta \mid y)\) has a prominent modal point. Estimation algorithms of the *maximum-likelihood* type give the position of the *likelihood* distribution’s mode as the solution for the model parameters. In a full Bayesian inference, we instead estimate the distribution of \(\theta\), which allows us to compute all the measures we want, such as the mean, mode, quartiles, standard deviation, mean absolute deviation, and so on.
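
For example, here is a sketch of how the `dist_mat` grid could approximate the posterior means of \(\beta\) and \(\sigma_\epsilon\); this marginalization code is illustrative and not part of the original post.

```
# Treat each grid cell as a probability mass (density times the 0.02 x 0.02
# cell area), renormalize, and compute the marginal means of beta and sigma.
cell_mass <- dist_mat * 0.02 * 0.02
cell_mass <- cell_mass / sum(cell_mass)
beta_mean  <- sum(rowSums(cell_mass) * betas)   # approx. E[beta | y]
sigma_mean <- sum(colSums(cell_mass) * sigmas)  # approx. E[sigma_eps | y]
```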