1 Multilevel Regression

In a classical linear regression, the regression coefficients are the same for all observations. \[\begin{align} y^{(i)} &= \beta_0 + \beta_1 x_1^{(i)} + \beta_2 x_2^{(i)} + \dots + \beta_p x_p^{(i)} + \epsilon^{(i)}, ~\text{for}~i=1, ..., N \tag{1.1} \\ \epsilon^{(i)} &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \tag{1.2} \end{align}\]

In equation (1.1), “\(y\)” is the vector with the values of the outcome variable (also called the response variable), and the notation for its value at position \(i\) is \(y^{(i)}\). All vectors are column vectors, that is, one-column matrices. The variables \(\{x_1^{(i)}, x_2^{(i)}, \dots, x_p^{(i)}\}\) are the predictors of \(y^{(i)}\). In a more compact notation, we use \[\begin{align} y^{(i)} = {X}^{(i)} {\beta} + \epsilon^{(i)} \tag{1.3} \end{align}\]

Where \({X}\) is an \(N \times p\) matrix, that is, with \(N\) rows, each containing the \(p\) predictors of one observation, including a column of \(1\)s for the intercept; thus \(X^{(i)}\) holds the predictors of the observation in the \(i\)th row. In turn, \(\beta\) is the vector of regression coefficients, so \({X}^{(i)}{\beta}\) is the inner product of the \(i\)th row of predictors with the coefficients. Finally, the standard assumption (1.2) is that the errors \(\epsilon^{(i)}\) are independent and identically distributed, for example, with a Normal distribution and variance \(\sigma_{\epsilon}^2\).
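As a concrete reference, here is a minimal sketch of the classical model (1.1)-(1.2) in R, using simulated data with a single predictor; all names and numbers here are purely illustrative:

set.seed(1)

# Simulate N observations from y = beta0 + beta1 * x1 + epsilon,
# with epsilon ~ Normal(0, sigma_eps^2)
N  <- 200
x1 <- runif(N, 0, 5)
y  <- 10 + 3 * x1 + rnorm(N, sd = 2)

# Classical linear regression: one set of coefficients for all observations
lm(y ~ x1)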

In multilevel regressions (MLR), the data are split into groups, and the regression between \(y\) and its predictors varies from group to group. For example, consider the performance of employees in a company, which varies between offices and regions.

Just as classical regression is extended by the generalized linear model (GLM), multilevel regression is also extended into that context, and even into others such as the generalized additive model (GAM).

A useful property of MLR is that we do not need to know in advance whether the between-group variance matters. Since the MLR model estimates the between-group variance, it tells us whether the groups are essential to the model or not. When the groups are not needed because there is no variance between them, a multilevel regression automatically reverts to one without groups.

The Toy Data Set

We will practice with a running example using a simulated Toy dataset. This data set contains 504 observations of the variables (x, y, g), where we will use “x” as a predictor, “y” as the outcome, and “g” is a categorical variable with the number of the group to which the observation belongs. You can download the data from here, and the model from which we sampled the data is here.

In this dataset, we have 49 groups, so g varies between 1 and 49. The number of observations per group \(n_j\) varies between 1 and 21, with an average of 10 observations per group.
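As a quick sketch, these figures can be verified in R, assuming the data have already been read into a data frame named toy_data (the name we use in the model-fitting code below):

# Number of groups and observations per group
length(unique(toy_data$g))  # 49 groups
n_j <- table(toy_data$g)    # observations in each group
range(n_j)                  # between 1 and 21
mean(n_j)                   # about 10 on average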

Table 1.1: Some example rows from the Toy data set.

      x        y   g
  0.273    8.876   1
  3.309   15.598   1
  2.720   15.150   1
  0.618   11.607   1
  0.252    8.001   1
  5.337   21.263   1

1.1 Toy | Varying Intercept

To introduce the MLR notation, we start by noting that the data can be separated into \(J\) groups. Initially, we will study the case in which only the intercept \(\alpha_0\) varies with the group: \[\begin{align} y^{(i)} = \alpha_0^{(j)} + {X}^{(i)} {\beta} +\epsilon^{(i)} ,~\text{for}~i=1, \dots ,N \tag{1.4} \end{align}\]

Since on the left side of equation (1.4) we have \(y^{(i)}\), it is implicit that the group number \(j\) in \(\alpha_0^{(j)}\) is obtained from \(i\) through the input data. In the data matrix, there is a column with the group number of each observation. In our example, that is the variable g, which contains the group number, so \(j = g[i]\). To denote this, we could have used \(\alpha_0^{(g[i])}\) or maybe \(\alpha_0^{(i\,j)}\), but that makes the equation harder to read, so, having made this clarification, we simply use \(\alpha_0^{(j)}\).

To fit the model to the data, we could use the R package lme4 (Bates et al. 2015), which produces maximum-likelihood point estimates. For full Bayesian inference, we can use brms (Bürkner 2020), rstanarm (Gabry and Goodrich 2020), or even rstan (Guo et al. 2020). The advantage of lme4 is that it is fast and can be used to explore the data, but for the final model we recommend brms or rstanarm, which behind the scenes call stan (Carpenter et al. 2017). In our examples, we will use brms. Fortunately, lme4, brms, and rstanarm all accept the notation we use here.

In the notation style of (1.4), our Toy model is: \[\begin{align} y^{(i)} = \alpha_0^{(j)} + \beta_0 + {\beta_1}{x}^{(i)} +\epsilon^{(i)} ,~for~i=1, \dots ,N \tag{1.5} \end{align}\]

while in the brms formula notation, it is:

y ~ 1 + x + (1 | g)

Where “1 + x” indicates we want a regression with an intercept and with “x” as a predictor, both common to the entire population, whereas “(1 | g)” denotes that we also want the intercept to vary for each group in g. The global 1 can be omitted since, by default, these formulas in R assume that we want such an intercept, although it is better to be explicit. Conversely, when the intercept is not desired, we must use -1 or 0, as in y ~ -1 + x + (1 | g).
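Since lme4 accepts the same formula, a fast exploratory fit could look like this sketch (point estimates, REML by default; again assuming the data frame is named toy_data):

library(lme4)

# Quick point-estimate fit of the varying-intercept model, useful for exploring the data
lme4_m0 <- lmer(y ~ 1 + x + (1 | g), data = toy_data)
summary(lme4_m0)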

Before calculating the regression for our example, we will introduce the equations depicting this multilevel model. To give more generality to the equation (1.4) instead of \(\alpha_0^{(j)}\), we will use \(\pmb{\alpha}^{(j)}\), adding that \(\alpha_0^{(j)}\) is its only component in this model: \[\begin{align} y^{(i)} &= \pmb{\alpha}^{(j)} + {X}^{(i)} {\beta} +\epsilon^{(i)} ,~\text{for}~i=1, \dots ,\text{N} \tag{1.6} \\ \pmb{\alpha}^{(j)} &= \alpha_0^{(j)} + \eta^{(j)}, ~\text{for}~j=1, \dots ,\text{J} \tag{1.7} \\ \eta^{(j)} &\sim \mathcal{N}(0, \sigma_\alpha^2) \tag{1.8} \\ \epsilon^{(i)} &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \tag{1.9} \end{align}\]

The \(\eta^{(j)}\) term in equations (1.7) and (1.8) tells us that all \(J\) intercepts \(\alpha_0^{(j)}\) come from the same Normal distribution with variance \(\sigma_\alpha^2\). Another way to express these two equations, condensed but more general, is: \[\begin{align} \pmb{\alpha}^{(j)} \sim \mathcal{N}(\mu_{\alpha}, \sigma_\alpha^2) \tag{1.10} \end{align}\]

Note that \(\mu_{\alpha}\) and \(\sigma_\alpha\) do not depend on \(j\), which indicates that the group intercepts are modeled as draws from the same distribution. This will bring benefits that we will discuss later.

The intercepts and coefficients that vary between groups are sometimes called random effects. In our notation, they are the components of \(\pmb{\alpha}^{(j)}\), although, at the moment, we only have the intercept. In the same terminology, the intercept and variables common to the whole population (in \({X}^{(i)}\)) are called fixed effects, and when a model has both, it is said to have mixed effects. However, we will refer to them as being at the group level or at the population (global) level, respectively.

With Intercept at Group and Population Level

Note that our Toy model (1.5) includes an intercept at both the population and the group level (\(\beta_0\) and \(\alpha_0^{(j)}\)), which is somewhat redundant. To resolve this, the multilevel model estimates the population intercept as the mean of the group intercepts \((\beta_0 = \mu_\alpha)\). The group-level intercepts \(\alpha_0^{(j)}\) are then deviations relative to the global one, \(\beta_0\). That is, in our Toy model, \(\mu_{\alpha} = 0\) in equation (1.10).

1.1.1 Fitting the Model to the Data

We will fit the model to the data using the brm() function in the brms package (from Bayesian Regression Models using Stan) (Bürkner 2018), with the default prior distributions. After fitting the model, we print a summary of the fit.

library(brms)

brm_m0 <- brm(y ~ 1 + x + (1 | g), data = toy_data, iter = 2500)
summary(brm_m0, priors = TRUE)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ 1 + x + (1 | g) 
   Data: data (Number of observations: 504) 
Samples: 4 chains, each with iter = 2500; warmup = 1250; thin = 1;
         total post-warmup samples = 5000

Priors: 
Intercept ~ student_t(3, 20.2, 8.6)
sd ~ student_t(3, 0, 8.6)
sigma ~ student_t(3, 0, 8.6)

Group-Level Effects: 
~g (Number of levels: 49) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     2.95      0.34     2.35     3.69 1.01      634     1034

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     9.42      0.47     8.51    10.36 1.00      389      903
x             3.70      0.06     3.59     3.81 1.00     4933     3703

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     2.07      0.07     1.95     2.21 1.00     4844     3774

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Please note these four blocks in the above summary:

  1. Priors: Because we called the summary function with the priors = TRUE argument, we get this section with the prior distributions used to fit the model. In this case, they are the brms defaults.
  2. Group-Level Effects: The group-level intercepts and coefficients. Our only grouping variable, g, appears in the ~g section.
  3. Population-Level Effects: Population-level (global) intercept and coefficients.
  4. Family Specific Parameters: Auxiliary model parameters, such as \(\sigma_\epsilon\), called sigma here.

In the summary, each Estimate is the mean of the corresponding parameter values in the sample from the posterior distribution. With the estimates in this summary, the fitted version of equations (1.6)-(1.9) for our model is: \[\begin{align*} y^{(i)} &= \pmb{\alpha}^{(j)} + 9.42 + 3.70 x^{(i)} +\epsilon^{(i)} ,~\text{for}~i=1, \dots ,504 \\ \pmb{\alpha}^{(j)} &= \alpha_0^{(j)} + \eta^{(j)}, ~\text{for}~j=1, \dots ,49 \\ \eta^{(j)} &\sim \mathcal{N}(0, 2.95^2) \\ \epsilon^{(i)} &\sim \mathcal{N}(0, 2.07^2) \end{align*}\]

To have all the point-estimated coefficients we still need the group-varying intercepts \(\alpha_0^{(j)}\), which we will obtain later. First, let’s see what just happened.

1.1.2 Posterior Distribution obtained with Stan

When we fit the model with the brm() function, it transformed our formula y ~ 1 + x + (1 | g) into a model coded in stan, including its default prior distributions. With that code and our data, stan fitted the model. The brm() function then received from stan \(5000\) samples from the posterior distribution of our Toy model (1.5) and, based on them, estimated the parameters reported in the fitting summary.


Figure 1.1: Histograms of the \(\beta_0\) and \(\beta_1\) coefficients for the Toy model, in the samples received by brms from stan. They are approximations to their posterior marginal distributions.
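As a sketch of how the histograms in Figure 1.1 can be reproduced, we can extract the draws with as_draws_df() (available in recent brms versions); b_Intercept and b_x are the default brms names for \(\beta_0\) and \(\beta_1\):

# Extract the 5000 post-warmup draws as a data frame
draws <- as_draws_df(brm_m0)

# Approximate posterior marginals of the population-level coefficients
hist(draws$b_Intercept, breaks = 40, main = expression(beta[0]))
hist(draws$b_x, breaks = 40, main = expression(beta[1]))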

The number of samples we get depends on the chains, iter, and warmup arguments in the call to brm(): the sample size is \(chains \times (iter-warmup)\). In our call that is 4 * (2500 - 1250) = 5000. We used the default warmup, which is half the number of iterations iter, and the default number of chains, 4.

There is always the chance that stan cannot satisfactorily traverse the parameters’ entire sample space, in which case the returned samples won’t be representative of the actual posterior distribution. To keep track of this, stan issues statistics such as Rhat and Bulk_ESS that appear in the summary (for more details see Brief Guide to Stan’s Warnings). Ideally, Rhat is less than 1.01, with 1 being its best possible value. Tail_ESS and Bulk_ESS are expected to be greater than \(100 \times chains\). We used iter = 2500 because with the default value of 2000 we did not pass these checks; with 2500 everything is fine.
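For a numeric check of those diagnostics, one option is the posterior package (a sketch; the threshold 400 is simply \(100 \times chains\) with our 4 chains):

library(posterior)

# rhat, ess_bulk and ess_tail correspond to Rhat, Bulk_ESS and Tail_ESS
# in the brms summary; list the parameters that fail the rules of thumb
diag <- summarise_draws(as_draws_df(brm_m0))
subset(diag, rhat > 1.01 | ess_bulk < 400 | ess_tail < 400)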

1.1.3 Parameter Estimates

The brms package allows us to obtain both samples from the posterior distribution of the parameters and point estimates calculated from those samples. For simplicity, here we are going to use the point estimates.

With the ranef() (random effects) function we can obtain the group-varying coefficients \(\alpha_0^{(j)}\), with fixef() (fixed effects) the global ones, and the consolidation of both is obtained with coef().

In our example, the population-level intercept \(\beta_0\) from fixef() is \(9.42\), and the intercepts for groups 1 and 2, \(\alpha_0^{(1)}\) and \(\alpha_0^{(2)}\), from ranef() are \(-3.41\) and \(-2.37\). Then the consolidated intercepts, \(\alpha_0^{(j)} + \beta_0\) from coef(), for these two groups are \(6.01\) and \(7.05\), since \(6.01 = -3.41 + 9.42\) and \(7.05 = -2.37 + 9.42\).

cat("--- fixef:\n"); fixef(brm_m0)
cat("--- coef:\n"); coef(brm_m0)

--- fixef:
          Estimate Est.Error     Q2.5     Q97.5
Intercept 9.420005 0.4740923 8.511883 10.355825
x         3.696783 0.0568952 3.585757  3.807682
--- coef:
$g
, , Intercept
    Estimate Est.Error      Q2.5     Q97.5
1   6.009167 0.8352279  4.409063  7.671721
2   7.046681 0.4624261  6.146496  7.931515
3  10.660634 0.5802230  9.518355 11.818091
:
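As a sketch, this consolidation can be verified from the fitted object; ranef() returns, for each grouping variable, an array indexed by group, statistic, and coefficient:

# Group-level intercept deviations for groups 1 and 2
alpha_dev <- ranef(brm_m0)$g[1:2, "Estimate", "Intercept"]
alpha_dev

# Adding the population-level intercept recovers the coef() values above
fixef(brm_m0)["Intercept", "Estimate"] + alpha_dev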

1.1.4 Goodness of Fit

There are multiple, complementary ways to assess the quality of a model fitted with stan. We’ll look at a couple of them to get an idea of how our example is progressing.

Toy | Posterior Predictive Checking

We can interpret each parameter set in a posterior sample as a potential solution of the model. With a set of these draws, together with the observed predictors \(X\), we can replicate the values of \(y\), which we will call \(y_{rep}\), by predicting their values according to these potential solutions. In this way, we obtain replicas that describe both the uncertainty we have about the parameters and that coming from the sampling distribution. We should expect the distribution of the observed \(y\) in the data to be a plausible case among these replicas \(y_{rep}\). Checking to what extent this holds is called a posterior predictive check.
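In brms, these replicas can be generated directly with posterior_predict(); a minimal sketch:

# One row per posterior draw, one column per observation:
# each row is a complete replica y_rep of the outcome vector y
yrep <- posterior_predict(brm_m0)
dim(yrep)  # about 5000 x 504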

The bayesplot package, from the same team of stan developers, includes dozens of chart types to compare the replicas with the observed data \(y\) (see more details here). Fortunately, brms makes those charts easy to access with its pp_check() function.

With the pp_check() (posterior predictive checking) function, many plot types can be obtained by varying the type parameter. The default plot shows the \(y\) values on the horizontal axis and their density on the vertical axis.

library(ggplot2)  # theme() comes from ggplot2

pp_check(brm_m0, nsamples = 100) +
  theme(legend.position = c(0.5, 0.4))
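Other values of type map to the corresponding bayesplot functions. For example, a sketch of a per-group check (the choice of the mean as the test statistic is just illustrative; this draws one panel per group):

# Compare, within each group, the observed mean of y with the means of the replicas
pp_check(brm_m0, type = "stat_grouped", group = "g", stat = "mean")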