In the Bayesian paradigm, the model’s parameters are random variables whose different possible values make it more or less likely that the model produces data resembling the observed data.
This post develops Bayesian inference concepts through their application in a practical example: solving a simple regression model numerically, in a fully Bayesian way. We will match the theoretical concepts with their practical implementation, to help you fully grasp their meaning.
We will use a dataset named “eCompo” for our running example. These are a series of readings of the response “\(y\)” emitted by an electronic component after being subjected to different stimulus values “\(x\)” in a test. We want to know the two parameters \(\beta\) and \(\sigma_\epsilon\) in the model shown below: \[\begin{align} y &= \beta \cdot x + \epsilon \\ \epsilon &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \tag{1.1} \\[3pt] \implies y - (\beta \cdot x) &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \tag{1.2} \end{align}\]
ECOMPO_DATA <- data.frame(
  x = c(23.54, 22.97, 11.78, 18.64, 14.36, 18.93, 19.04, 10.768, 27.39,
        17.28, 22.63, 14.32, 22.95, 27.82, 28.34, 23.71, 29.73, 25.41,
        27.60, 15.54),
  y = c(247.35, 180.47, 86.78, 178.12, 168.72, 135.10, 160.17, 172.48,
        274.35, 130.32, 224.71, 144.31, 245.39, 314.81, 256.71, 187.07,
        251.90, 253.75, 181.58, 179.68)
)
In the model equation (1.2), notice that the expected distribution of the residuals has variance \(\sigma_{\epsilon}^2\), but its center is fixed at zero. In the following examples, we will estimate the variance \(\sigma_{\epsilon}^2\) with that of the residuals in the data.
In the “Beta vs. Residuals” graph 1.1, the residuals’ expected distribution is shown in green, and their actual distribution \(\epsilon = y - (\beta \cdot x)\) in purple.
With \(\beta = 9.3\) in (1), the residuals approach their expected –zero centered– distribution. In contrast, when \(\beta\) increases to \(15\) in (2), the residuals spread out, and their center moves away from its expected zero value. We say that \(\beta = 15\) “makes it unlikely to observe the data we have.”
The phrase “makes it unlikely to observe the data” is an abbreviated way to say that some parameter values make it more or less probable for the model to produce the observed data; in this case, less likely.
Figure 1.1: Beta vs. Residuals. The residuals’ expected distribution is colored green; it is centered at zero and has the residuals’ actual variance. The residuals’ real distribution is colored purple. At moment (1), \(\beta=9.3\) produces residuals whose distribution roughly matches the expected one, where the green and purple blobs overlap. At moment (2), \(\beta=15\) causes more spread and more negative residuals, and the expected and actual residuals become more different from each other.
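As a quick way to see this in numbers (a minimal sketch, not part of the original code), we can compare the residuals produced by the two \(\beta\) values discussed above:
residuals_for <- function(beta) ECOMPO_DATA$y - beta * ECOMPO_DATA$x
summary(residuals_for(9.3))   # roughly centered around zero
summary(residuals_for(15))    # center shifted well below zero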
This exercise aims to give you the intuition that a model’s parameters have a joint distribution. Changing any of them changes the likelihood of observing the data according to the model.
In our eCompo model, the joint distribution of the parameters \((\beta, \sigma_{\epsilon})\), given the model \(\mathcal{M}\) and the data (\(y, X\)), can be denoted as: \[\begin{align*} \pi(\beta, \sigma_{\epsilon} \mid \mathcal{M}, y, X) \end{align*}\]
Where \(\pi()\) is the distribution, which depends on \(\beta, \sigma_{\epsilon}\) given the model and the data.
For continuous random variables, we use the term distribution in the same sense as probability density function (PDF), which corresponds to a probability mass function (PMF) for discrete variables. We use \(\pi(\cdot)\) to denote a PDF or a PMF. From the context, it will be evident which is the case.
Being \(\pi(\cdot)\) the distribution of \((\cdot)\), it is a different function for each different \((\cdot)\). Again, it refers to a PDF in the continuous case or a PMF otherwise. On some occasions, to avoid confusion, we use \(Pr(\cdot)\) to denote an event’s probability.
For simplicity, we will use the vector \(\theta\) to refer to all the model parameters. In turn, \(y\) will represent all the data, so \(X\) will be implicit. Since the model is constant, we will also omit it.
With this simplified notation, the joint distribution of the model’s parameters \(\theta\) given the data is: \[\begin{align*} \pi(\theta \mid y) \end{align*}\]
By Bayes’ rule we know: \[\begin{align} \pi(\theta \mid y) &= \frac{\pi(y, \theta)}{\pi(y)} = \frac{\pi(y \mid \theta) \cdot \pi(\theta)}{\pi(y)} \tag{1.3} \end{align}\]
The equation (1.3) is fundamental in Bayesian inference, and each term has a name that we will refer to frequently.
The names prior and posterior refer to the fact that they are distributions modeled before and after considering the data, respectively.
\(\pi(y)\), the denominator in (1.3), is a value that depends only on the data \(y\); it is the normalization constant that makes the distribution \(\pi(\theta \mid y)\) integrate to \(1\) (or sum to \(1\) in the discrete case).
For continuous \(\theta\), \(\pi(y)\) is: \[\begin{align} \pi(y) &= \int \pi(y \mid \theta) \cdot \pi(\theta) \, \text{d}\theta \tag{1.4} \end{align}\]
For the discrete case, the previous equation is equivalent to:
\[ \pi(y) = \sum_{\theta} \pi(y \mid \theta) \cdot \pi(\theta) \]
In these expressions, we are averaging \(\pi(y \mid \theta)\) (the likelihood) over all possible \(\theta\) values. This operation is called marginalization and is the reason why \(\pi(y)\) is called the marginal likelihood.
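For instance, with only two possible parameter values (hypothetical numbers, unrelated to the eCompo model), say priors \(\pi(\theta_1) = \pi(\theta_2) = 0.5\) and likelihoods \(\pi(y \mid \theta_1) = 0.2\) and \(\pi(y \mid \theta_2) = 0.6\), the marginal likelihood would be \(\pi(y) = 0.2 \cdot 0.5 + 0.6 \cdot 0.5 = 0.4\).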
We will develop, for the eCompo model, the distributions we just defined in the previous section. Later we will use them to solve that model.
The prior distribution is modeled with the previously known information about the model parameters, or with general expectations about them; this is called a weakly-informative prior. In our example, we ask the company experts what the domain of the model’s parameters is.
The experts tell us the value of \(\beta\) must be between \(1\) and \(25\), with an approximately Normal distribution. Now we can say that the \(\beta\) prior distribution is Normal, with its center around \((25+1)/2=13\) and a standard deviation of about \((25-1)/4=6\).
The experts also tell us that the Exponential distribution is suitable as a prior for the parameter \(\sigma_\epsilon\), using the reciprocal of the standard deviation of “\(y\)” as its \(\lambda\) parameter, so we adopt as prior of \(\sigma_\epsilon\) an Exponential distribution with \(\lambda=1/sd(y)\).
PRIOR_BETA_SD      <- 6
PRIOR_BETA_MEAN    <- 13
PRIOR_SIGMA_LAMBDA <- 1 / sd(ECOMPO_DATA$y)
The prior distribution \(\pi(\theta)\) of our model is:
prior_density <- function(beta, sigma, log = TRUE) {
  log_prior_beta  <- dnorm(beta, mean = PRIOR_BETA_MEAN, sd = PRIOR_BETA_SD, log = TRUE)
  log_prior_sigma <- dexp(sigma, rate = PRIOR_SIGMA_LAMBDA, log = TRUE)
  result          <- log_prior_beta + log_prior_sigma

  if (log != TRUE) result <- exp(result)
  return(result)
}
Where the beta and sigma function parameters correspond to \(\beta\) and \(\sigma_\epsilon\) in our model (1.2). Since these parameters are independent, their joint prior distribution is the product of their individual priors. As expected, this distribution does not depend on the data, only on \(\theta\). The hyper-parameters like PRIOR_BETA_SD or PRIOR_SIGMA_LAMBDA used to model each individual prior are model constants. We use the logarithmic scale for the calculations as a purely technical matter (stability and numerical accuracy).
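For instance (a quick sanity check, not part of the original post), we can evaluate the prior at an arbitrary point on both scales:
prior_density(beta = 13, sigma = 50)               # log prior density at beta = 13, sigma_eps = 50
prior_density(beta = 13, sigma = 50, log = FALSE)  # the same density on the linear scale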
Figure 1.2: Prior Distribution. We have a weakly-informative prior distribution, supporting all plausible model parameter values.
Our weakly-informative prior distribution supports all plausible parameter values. It is shown in the “Prior Distribution” graph 1.2.
The likelihood distribution \(\pi(y \mid \theta)\) of our model is:
likelihood_density <- function(beta, sigma, y, x, log = TRUE) {
  result <- 0
  for (i in seq_along(y)) {
    mean   <- beta * x[i]
    result <- result + dnorm(y[i], mean = mean, sd = sigma, log = TRUE)
  }
  if (log != TRUE) result <- exp(result)
  return(result)
}
In our model, the equation (1.1) defines the probability of seeing each observation in the data. Since the observations are independent, their joint probability is the product of their individual likelihoods; in the logarithmic scale, it is the sum of the logarithms of their probabilities.
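Since dnorm() is vectorized, the loop can also be written as a single sum (an equivalent sketch, not the implementation used in the rest of the post):
likelihood_density_vec <- function(beta, sigma, y, x, log = TRUE) {
  result <- sum(dnorm(y, mean = beta * x, sd = sigma, log = TRUE))  # sum of per-observation log densities
  if (log != TRUE) result <- exp(result)
  return(result)
}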
Note that likelihood_density() is not a normalized density function with respect to \(\theta\), and neither is the product of the prior and the likelihood that we will compute next. For the posterior density built from them to be normalized, its integral with respect to \(\theta\) must equal \(1\), that is, \(\int_{-\infty}^{+\infty} \pi(\theta \mid y)~\textrm{d}\theta = 1\), which is not yet the case. We need to divide by the value of that integral over \(\theta\). We will deal with this later.
The name Joint distribution can be confusing because all these distributions (prior, likelihood, posterior) are joint distributions on the model parameters. Here we are referring to \(\pi(y, \theta)\), the numerator in (1.3), which is the joint density of the parameters and the data.
As this function is the product of the prior and likelihood densities \(\pi(y, \theta) = \pi(y \mid \theta) \cdot \pi(\theta)\), this is easy:
joint_density <- function(beta, sigma, data, log = TRUE) {
  result <- prior_density(beta, sigma) + likelihood_density(beta, sigma, data$y, data$x)

  if (log != TRUE) result <- exp(result)
  return(result)
}
As in previous cases, this is also a non-normalized distribution, and we shall correct this in the posterior distribution.
The joint distribution \(\pi(y, \theta) = \pi(y \mid \theta) \cdot \pi(\theta)\) computed by the function joint_density() above is the non-normalized version of \(\pi(\theta \mid y)\), in the sense that its integral is not equal to \(1\). We must divide joint_density() by the normalization constant \(\pi(y)\) to get the normalized posterior density \(\pi(\theta \mid y)\) according to (1.3).
We will get \(\pi(y)\) by marginalizing out \(\theta\) from \(\pi(y, \theta)\). Mathematically, this corresponds to integrating \(\pi(y, \theta)\) over \(\theta\). \[\begin{align} \pi(y) &= \int \pi(y , \theta) \text{d}\theta \end{align}\]
In turn, we can replace \(\pi(y, \theta) = \pi(y \mid \theta) \cdot \pi(\theta)\), which gives us the equation (1.4). In our case, \(\theta\) contains two parameters: \[\begin{align} \pi(y) &= \int_0^{+\infty} \int_{-\infty}^{+\infty} \textrm{joint}\_\textrm{density}(\beta, \sigma_\epsilon ,\text{data}) ~\textrm{d}\beta~\textrm{d}\sigma_\epsilon \end{align}\]
The posterior distribution \(\pi(\theta \mid y)\) has the form:
posterior_density <- function(beta, sigma, data, log = TRUE) {
  result <- joint_density(beta, sigma, data, log = TRUE) - log(pi_y)

  if (log != TRUE) result <- exp(result)
  return(result)
}
Where pi_y is the value of \(\pi(y)\), which we must calculate. In the code above, we are subtracting in the logarithmic scale, which is equivalent to dividing in the linear one. We have the difficulty that, when we integrate with pi_y = 1, the posterior_density() function returns values smaller than \(3.4 \times 10^{-47}\), causing numerically unstable integration. To deal with this, we shall assign to pi_y a temporary value while we integrate, and then adjust it according to the integral.
pi_y <- 3.4e-47

int_pi_y <- integrate(function(sigmas) {          # Integrate over sigma
  sapply(sigmas, function(sigma) {
    integrate(function(beta)                      # Integrate over beta
      posterior_density(beta, sigma, ECOMPO_DATA, log = F), -Inf, +Inf)$value
  })
}, 0, +Inf)$value

pi_y <- pi_y * int_pi_y
In the code above, like in the mathematical expression, the inner integration is the integrand of the outer one. We wrap the inner integration with a call to sapply() to compute the inner integral over beta for the many sigma values coming from the outer integration.
The final value is \(\pi(y)= 2.80 \times 10^{-46}\), and now the posterior_density() function is a proper density distribution.
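As a quick sanity check (a sketch, not part of the original post), we can verify that, with pi_y in place, the posterior now integrates to approximately \(1\):
integrate(function(sigmas) {                      # Integrate over sigma
  sapply(sigmas, function(sigma) {
    integrate(function(beta)                      # Integrate over beta
      posterior_density(beta, sigma, ECOMPO_DATA, log = F), -Inf, +Inf)$value
  })
}, 0, +Inf)$value                                 # expected to be close to 1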
The Posterior Distribution graph 1.3 shows the surface of our just developed posterior_density() function. Notice that it is a map from each possible set of parameters (the horizontal plane) to its density (the vertical axis), which in turn tells us that some combinations of parameters explain the data better than others.
To build this graph, we compute the dist_mat matrix with the code below. This matrix contains the posterior density for a grid of equidistant points in the beta and sigma plane. The functions row_to_beta() and col_to_sigma() –also in the code below– allow us to obtain the values of beta and sigma from the row and column position of any cell in the dist_mat matrix.
beta_start  <- 7.5
sigma_start <- 19

betas  <- seq(beta_start, 11, by = 0.02)
sigmas <- seq(sigma_start, 79, by = 0.02)

dist_mat <- outer(
  betas, sigmas,
  function(beta, sigma) posterior_density(beta, sigma, ECOMPO_DATA, log = FALSE)
)

row_to_beta  <- function(row_nbr) (row_nbr - 1) * 0.02 + beta_start
col_to_sigma <- function(col_nbr) (col_nbr - 1) * 0.02 + sigma_start

str(dist_mat)
num [1:165, 1:2931] 4.01e-20 8.73e-20 1.88e-19 4.02e-19 8.49e-19 ...
In the Posterior Distribution graph 1.3, we see it has a prominent mode. Maximum-likelihood estimation algorithms take the position of the likelihood distribution’s mode as the solution for the model parameters. They usually do not use a prior distribution, so their posterior is just the likelihood, hence the maximum-likelihood name.
Figure 1.3: Posterior Distribution. Numerically calculated posterior distribution, with some contour lines to better notice its shape, like \(\sigma_\epsilon\) having a longer tail on its right end.
We can estimate the mode position using the dist_mat matrix we computed in the previous section. For this, we find the cell with the maximum value in the matrix, and with that cell position, we get the values of beta and sigma with a precision of \(0.02\), the spacing of the grid.
mode_pos <- which(dist_mat == max(dist_mat), arr.ind = T)
c(row_to_beta(mode_pos[1]), col_to_sigma(mode_pos[2]))
[1] 9.302 35.260
In our example, we can also find it with a numerical optimizer in R. This time we will find the position of the likelihood mode.
nlm(
  function(theta)
    -likelihood_density(theta[1], theta[2], ECOMPO_DATA$y, ECOMPO_DATA$x),
  c(7, 11)
)

...
$estimate
[1] 9.303106 35.262023
...
Since the nlm() (Non-Linear Minimization) function finds minima, and we are looking for the maximum, we invert the sign of the likelihood_density() function. Also, because log is a monotonic function, we can optimize on that scale for numerical convenience.
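As a cross-check (a sketch, not part of the original post), for this model the likelihood mode of \(\beta\) should coincide with the ordinary least-squares fit of y on x without an intercept:
coef(lm(y ~ 0 + x, data = ECOMPO_DATA))   # expected to be close to 9.30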
We can get the posterior mode in non-trivial models by using the optimizing() function in rstan.
Suppose we are interested in a scalar estimand that is a function \(g(\theta)\) of the model parameters. Its expected value, by definition, is:
\[
\mathbb{E}[g(\theta)] =
\int g(\theta) \cdot \pi(\theta \mid y) ~\textrm{d}\theta \tag{1.5}
\]
Where the integration is over all the parameters in \(\theta\); in our example, it is over \(\beta\) and \(\sigma_{\epsilon}\). On the other hand, \(\pi(\theta \mid y)\) in our model is the posterior_density() function.
The posterior distribution we have is the parameters’ joint density \(\pi(\beta, \sigma_{\epsilon})\) conditioned on the data. We need values for both \(\beta\) and \(\sigma_{\epsilon}\) to evaluate this distribution. However, we can get the marginal distribution of each parameter by integrating out all the parameters except the one for which we want the marginal. For example, the marginal distribution of \(\beta\) is: \[\begin{align} \pi(\beta) &= \int \pi(\beta, \sigma_\epsilon) ~\textrm{d}\sigma_\epsilon \\ \pi(\beta) &= \int \pi(\beta \mid \sigma_\epsilon) \cdot \pi(\sigma_\epsilon) ~\textrm{d}\sigma_\epsilon \end{align}\]
We can see, in the last equation above, that the marginal \(\pi(\beta)\) is the average of \(\pi(\beta \mid \sigma_\epsilon)\) over the possible \(\sigma_\epsilon\) values, weighted by \(\pi(\sigma_\epsilon)\). For clarity, we are omitting in the notation that all these distributions are also conditioned on the data \(y\).
beta_marginal <- function(betas) {
  sapply(betas, function(beta) {    # Compute the integral for each `beta` in `betas`
    integrate(function(sigma) {     # Integrate over sigma
      posterior_density(beta, sigma, ECOMPO_DATA, log = F)
    }, 0, +Inf)$value
  })
}
The code above implements the marginal \(\pi(\beta)\) density. We wrap the integration with a call to sapply() to make the function able to compute, in a single call, the integral for many beta values in the argument betas.
Figure 1.4: Marginal Distribution of Beta and Sigma. These are the marginal distributions of our model parameters, plotted on the same scales. There we can see that we have a lot more uncertainty about sigma than about beta.
These marginal distributions will help us build easier-to-understand equations and code.
The Expected Value of a Single Parameter Estimand
For the particular case where we are interested in an estimand \(g(\alpha)\) that depends on a single parameter \(\alpha\), the equation above can be written using the parameter’s marginal distribution \(\pi(\alpha)\).
\[ \mathbb{E}[g(\alpha)] = \int g(\alpha) \cdot \pi(\alpha) ~\textrm{d}\alpha \tag{1.6} \] This is the case –for example– for the expected value of a parameter or its standard deviation.
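For instance (a sketch, not code from the post), equation (1.6) applied to \(\beta\), using the beta_marginal() function defined above, gives its posterior mean and standard deviation:
beta_mean <- integrate(function(beta) beta * beta_marginal(beta),
                       -Inf, +Inf)$value
beta_sd   <- sqrt(integrate(function(beta) (beta - beta_mean)^2 * beta_marginal(beta),
                            -Inf, +Inf)$value)
c(mean = beta_mean, sd = beta_sd)   # expected to be close to 9.306 and 0.399
If integrate() struggles with the infinite limits, a finite range covering the posterior mass (for example, 7.5 to 11, as in our grid) can be used instead.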
Now that we have the posterior distribution, we can learn about the model parameters, which is what we call solving the model. For a more compelling case, dealing with an estimand involving more than one parameter, we will study the Ratio of beta to sigma \((\beta / \sigma_{\epsilon})\).
Computing the expected value of estimands depending on a single parameter is pretty easy with the equation (1.6) and the marginal distributions we developed in the previous section. We will leave them as exercises for the reader. You can check how we compute expectations for the Ratio beta to sigma as guidance to solve single parameter expectations. In the end, you should get:
mean sd median mad Q5 Q95
beta 9.30638 0.39920 9.3053 0.3139 8.6547 9.9607
sigma 38.59072 6.65043 37.7209 5.1139 29.4171 50.7350
Notice that here we are computing the actual values of those estimands. Later we will use them as their ground-truth values to assess the accuracy of other methods.
For this estimand we need to use the equation (1.5), with \(g(\beta, \sigma_{\epsilon})= (\beta / \sigma_{\epsilon})\):
\[ \mathbb{E}(\beta / \sigma_{\epsilon}) = \int_0^{+\infty} \int_{-\infty}^{+\infty} (\beta / \sigma_{\epsilon}) \cdot \textrm{posterior}\_\textrm{density}(\beta, \sigma_\epsilon ,\textrm{data}) ~\textrm{d}\beta~\textrm{d}\sigma_\epsilon \]
Which in R code is:
beta_sigma_ratio_ev <- integrate(function(sigmas) {  # Integrate over sigma
  sapply(sigmas, function(sigma) {
    integrate(function(beta)                         # Integrate over beta
      beta / sigma * posterior_density(beta, sigma, ECOMPO_DATA, log = F), -Inf, +Inf)$value
  })
}, 0, +Inf)$value %>% print()

0.2476815
We will use again the equation (1.5), except that this time \(g(\beta, \sigma_{\epsilon})= (\beta / \sigma_{\epsilon} - \mu_r)^2\), where \(\mu_r\) is the Ratio’s expected value computed in the previous section. This expectation is the Ratio’s variance; its square root is the standard deviation shown below.
0.04153125
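The code for this expectation is not listed in the post; a sketch following the same pattern as the expected value above could look like this:
ratio_var <- integrate(function(sigmas) {            # Integrate over sigma
  sapply(sigmas, function(sigma) {
    integrate(function(beta)                         # Integrate over beta
      (beta / sigma - beta_sigma_ratio_ev)^2 *
        posterior_density(beta, sigma, ECOMPO_DATA, log = F), -Inf, +Inf)$value
  })
}, 0, +Inf)$value

sqrt(ratio_var)   # standard deviation of the Ratio, ~ 0.0415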
The Median is a little bit tricky, so we will first compute the Parameters’ Median, to get a better sense of how to calculate it. In the next section, we will add a little bit of complexity to compute the Ratio’s Median.
The median is the point on the horizontal axis of the parameter’s distribution (graph 1.4) where the area under the curve is split exactly into two halves. This condition is easier to state using the Cumulative Distribution Function (CDF). For example, let beta_cdf() be the CDF of beta:
\[
\textrm{beta}\_\textrm{cdf}(\beta_0) =
\textrm{Pr}(\beta<\beta_0) =
\int_{-\infty}^{\beta_0} \pi(\beta) \, \textrm{d}\beta
\]
Then \(\tilde{\beta}\), the median of \(\beta\), is the value that satisfies the condition:
\[
\textrm{beta}\_\textrm{cdf}(\tilde{\beta}) = \textrm{Pr}(\beta<\tilde{\beta}) = 0.5
\]
In R, the beta_cdf() function is:
beta_cdf <- function(beta_limit) {
  integrate(function(beta) beta_marginal(beta), -Inf, beta_limit)$value
}
Therefore, beta_median \(=\tilde{\beta}\) must satisfy beta_cdf(beta_median) = 0.5, which means beta_median = beta_cdf_inverse(0.5). That is, the median is the inverse of the beta CDF evaluated at \(0.5\). We compute beta_cdf_inverse(p) by solving the equation beta_cdf(beta) - p = 0, where p is a given value, like \(0.5\) for the median.
beta_cdf_inverse <- function(p) {
  to_solve <- function(beta) {
    beta_cdf(beta) - p
  }
  uniroot(to_solve, c(0, 11))$root
}
Now it is easy to get the median by evaluating beta_cdf_inverse(0.5).
median_beta <- beta_cdf_inverse(0.5)
median
beta 9.3053
sigma 37.7209
Similar to what we did to compute the parameters’ median, we will first compute the Ratio’s Cumulative Distribution Function (CDF), then we will solve it to get its inverse.
The Ratio’s CDF computes \(Pr(\beta / \sigma_{\epsilon} < r)\), where \(r\) is a given value, which implies \(Pr(\beta < r \cdot \sigma_{\epsilon})\). Therefore, we compute the median using: \[ \textrm{ratio}\_\textrm{cdf}(r) = \textrm{Pr}(\beta / \sigma_{\epsilon} < r) = \int_0^{+\infty} \int_{-\infty}^{r \cdot \sigma_{\epsilon}} \textrm{posterior}\_\textrm{density}(\beta, \sigma_\epsilon ,\text{data}) ~\textrm{d}\beta~\textrm{d}\sigma_\epsilon \]
In the code below, to avoid precision issues, we use the limits containing more than 99% of the distribution mass, rather than from/to infinity.
beta_sigma_ratio_cdf <- function(ratio) {
  integrate(
    function(sigmas) {                 # Integrate over sigma
      sapply(sigmas, function(sigma) {
        integrate(function(beta) {     # Integrate over beta
          posterior_density(beta, sigma, ECOMPO_DATA, log = F)
        }, 7.2, sigma * ratio)$value
      })
    }, 18, 79
  )$value
}

c(Q50 = beta_sigma_ratio_cdf_inverse(0.50))
Q50
0.2467930
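The call above uses beta_sigma_ratio_cdf_inverse(), which is not listed in the post. A sketch mirroring beta_cdf_inverse() could be (the search interval c(0.1, 0.5) is an assumption that brackets the Ratio’s plausible values):
beta_sigma_ratio_cdf_inverse <- function(p) {
  uniroot(function(ratio) beta_sigma_ratio_cdf(ratio) - p, c(0.1, 0.5))$root
}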
Having the inverse of the Ratio’s CDF, we can easily compute its credible interval at \(90\%\). For example, the lower \(5\%\) limit is beta_sigma_ratio_cdf_inverse(0.05).
Q5 Q95
0.1815486 0.3182257
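A sketch of how both limits above could be obtained with that inverse:
c(Q5  = beta_sigma_ratio_cdf_inverse(0.05),
  Q95 = beta_sigma_ratio_cdf_inverse(0.95))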
Figure 1.5: Probability density function of the Ratio beta/sigma. We use the Ratio’s CDF to compute this PDF.
We can use the Ratio’s CDF to compute its PDF, shown in the Probability density function of the Ratio beta/sigma graph 1.5.
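One way to do that (an assumption on our part; the post does not show the code) is to differentiate the CDF numerically with a central difference:
ratio_pdf <- function(r, h = 1e-3) {
  (beta_sigma_ratio_cdf(r + h) - beta_sigma_ratio_cdf(r - h)) / (2 * h)
}
ratio_pdf(0.25)   # density of the Ratio near the center of its distribution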
We have computed the following Ratio’s Statistics.
mean sd median Q5 Q95
ratio 0.2476 0.04153 0.2467 0.1815 0.3182
We have developed the posterior distribution for the eCompo model and then used it to solve it. We have done this to make tangible what exactly the posterior distribution is and hopefully refresh or clarify some other concepts.
To solve the model, we computed expectations, which led us to the actual values of estimands for any function of the model parameters, like the Ratio of beta to sigma.
We have also seen how to use R for many different tasks, like integrating, finding a function’s minimum, or solving non-linear equations.
However, it should be clear that this method is not viable for problems with more than just a few parameters. Imagine a twelve-parameter problem: the mathematical acrobatics of twelve nested integrals, or a posterior-distribution grid with a dozen dimensions. The required complexity and computational resources for such cases are unmanageable.
When dealing with multilevel regressions, it is not unusual to have tens of parameters or more. For example, for the “Junior School Project,” we had more than one thousand parameters.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.