1 Multilevel Regression

In a classical linear regression, the regression coefficients are the same for all observations. \[\begin{align} y^{(i)} &= \beta_0 + \beta_1 x_1^{(i)} + \beta_2 x_2^{(i)} + \dots + \beta_p x_p^{(i)} + \epsilon^{(i)}, ~\text{for}~i=1, ..., N \tag{1.1} \\ \epsilon^{(i)} &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \tag{1.2} \end{align}\]

In equation (1.1), \(y\) is a vector with the values of the outcome variable (also called the response variable), and \(y^{(i)}\) denotes its value at position \(i\). All vectors are column vectors (one-column matrices). The variables \(\{x_1^{(i)}, x_2^{(i)}, \dots, x_p^{(i)}\}\) are the predictors of \(y^{(i)}\). In a more compact notation, we use \[\begin{align} y^{(i)} = {X}^{(i)} {\beta} + \epsilon^{(i)} \tag{1.3} \end{align}\]

Here \({X}\) is a matrix with \(N\) rows, one per observation, each containing the \(p\) predictors of that observation plus a leading \(1\) for the intercept; \(X^{(i)}\) denotes the predictors of the observation in the \(i\)th row. In turn, \(\beta\) is the vector of regression coefficients, so \({X}^{(i)}{\beta}\) is the inner product of the \(i\)th row of predictors with the coefficients. Finally, the standard assumption (1.2) is that the errors \(\epsilon^{(i)}\) are independent and identically distributed, here with a Normal distribution with variance \(\sigma_{\epsilon}^2\).

In multilevel regressions (MLR), the data is split into groups, and the regression between \(y\) and its predictors varies across them. For example, consider the performance of employees in a company, which varies between offices and regions.

Just as classical regression is extended by the generalized linear model (GLM), multilevel regression extends to that context as well, and even to others such as the generalized additive model (GAM).

A useful property of MLR is that we do not need to know in advance whether the between-group variance matters. Since the MLR model estimates the between-group variance, it tells us whether the groups are essential to the model or not. When the groups are not needed because there is no variance between them, a multilevel regression automatically reverts to one without groups.

The Toy Data Set

We will practice with a running example using a simulated Toy dataset. This data set contains 504 observations of the variables (x, y, g), where we will use x as a predictor, y as the outcome, and g as a categorical variable holding the number of the group to which the observation belongs. You can download the data from here; the model from which the data was sampled is described in section 1.6.

In this dataset, we have 49 groups, so g varies between 1 and 49. The number of observations per group \(n_j\) varies between 1 and 21, with an average of 10 observations per group.

Table 1.1: Some example rows from the Toy data set.

     x        y   g
 0.273    8.876   1
 3.309   15.598   1
 2.720   15.150   1
 0.618   11.607   1
 0.252    8.001   1
 5.337   21.263   1

1.1 Toy | Varying Intercept

To introduce the MLR notation, we start by noting that the data can be separated into \(J\) groups. Initially, we will study the case in which only the intercept \(\alpha_0\) varies with the group: \[\begin{align} y^{(i)} = \alpha_0^{(j)} + {X}^{(i)} {\beta} +\epsilon^{(i)} ,~\text{for}~i=1, \dots ,N \tag{1.4} \end{align}\]

Since the left-hand side of equation (1.4) is \(y^{(i)}\), it is implicit that the group number \(j\) in \(\alpha_0^{(j)}\) is obtained from \(i\) through the input data. In the data matrix there is a column with the group number of each observation; in our example that is the variable g, so \(j = g[i]\). To denote this we could have used \(\alpha_0^{(g[i])}\) or perhaps \(\alpha_0^{(i\,j)}\), but that makes the equation harder to read, so, having made this clarification, we simply use \(\alpha_0^{(j)}\).

To fit the model to the data, we could use the R package lme4 (Bates et al. 2015), which produces maximum likelihood point estimates. For full Bayesian inference, we can use brms (Bürkner 2020), rstanarm (Gabry and Goodrich 2020), or even rstan (Guo et al. 2020). The advantage of lme4 is that it is fast and can be used to explore the data, but for the final model we recommend brms or rstanarm, which behind the scenes call stan (Carpenter et al. 2017). In our examples we will use brms. Fortunately, lme4, brms, and rstanarm all accept the formula notation we use here.

In the notation style of (1.4), our Toy model is: \[\begin{align} y^{(i)} = \alpha_0^{(j)} + \beta_0 + {\beta_1}{x}^{(i)} +\epsilon^{(i)} ,~for~i=1, \dots ,N \tag{1.5} \end{align}\]

while in the brms formula notation, it is:

y ~ 1 + x + (1 | g)

Here “1 + x” indicates that we want a regression with an intercept and with x as a predictor, both common to the entire population, while “(1 | g)” indicates that we also want the intercept to vary with each group g. The global 1 can be omitted since, by default, these R formulas assume we want such an intercept, although it is better to be explicit. Conversely, when no global intercept is desired, we must use -1 or 0, as in y ~ -1 + x + (1 | g).
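
Since lme4 accepts the same formula, a quick exploratory fit can be obtained with lmer() before the slower full Bayesian run. A minimal sketch, assuming the data frame is named toy_data as in the rest of the chapter:

library(lme4)

# Fast maximum-likelihood point estimates, useful to explore the data
# before fitting the full Bayesian model with brms.
lmer_m0 <- lmer(y ~ 1 + x + (1 | g), data = toy_data)
summary(lmer_m0)  # fixed effects plus the group and residual standard deviations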

Before fitting the regression for our example, we will introduce the equations depicting this multilevel model. To give more generality to equation (1.4), instead of \(\alpha_0^{(j)}\) we will use \(\pmb{\alpha}^{(j)}\), noting that \(\alpha_0^{(j)}\) is its only component in this model: \[\begin{align} y^{(i)} &= \pmb{\alpha}^{(j)} + {X}^{(i)} {\beta} +\epsilon^{(i)} ,~\text{for}~i=1, \dots ,\text{N} \tag{1.6} \\ \pmb{\alpha}^{(j)} &= \alpha_0^{(j)} + \eta^{(j)}, ~\text{for}~j=1, \dots ,\text{J} \tag{1.7} \\ \eta^{(j)} &\sim \mathcal{N}(0, \sigma_\alpha^2) \tag{1.8} \\ \epsilon^{(i)} &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \tag{1.9} \end{align}\]

The \(\eta^{(j)}\) term in equations (1.7) and (1.8) tells us that all \(J\) intercepts \(\alpha_0^{(j)}\) come from the same Normal distribution with variance \(\sigma_\alpha^2\). Another, more condensed and general way to express these two equations is: \[\begin{align} \pmb{\alpha}^{(j)} \sim \mathcal{N}(\mu_{\alpha}, \sigma_\alpha^2) \tag{1.10} \end{align}\]

Note that \(\mu_{\alpha}\) and \(\sigma_\alpha\) do not depend on \(j\), which indicates that the group intercepts are modeled as draws from the same distribution. This will bring benefits that we will discuss later.

The intercepts and coefficients that vary between groups are sometimes called random effects. In our notation they are components of \(\pmb{\alpha}^{(j)}\), although at the moment we only have the intercept. In the same terminology, the intercept and variables common to the whole population (in \({X}^{(i)}\)) are called fixed effects, and when a model has both, it is said to have mixed effects. However, we will refer to them as group-level or population-level (global) terms, respectively.

With Intercept at Group and Population Level

Note that our Toy model (1.5) includes an intercept at both the population and group level (\(\beta_0\) and \(\alpha_0^{(j)}\)), which is somewhat redundant. To resolve this, the multilevel model lets the population intercept absorb the mean of the group intercepts \((\beta_0 = \mu_\alpha)\), so the group-level intercepts \(\alpha_0^{(j)}\) become deviations relative to the global one \(\beta_0\). That is, in our Toy model \(\mu_{\alpha} = 0\) in equation (1.10).

1.1.1 Fitting the Model to the Data

We will fit the model to the data using the brm() function in the brms package (from Bayesian Regression Models using Stan) (Bürkner 2018), with the default prior distributions. After fitting the model, we print the fitting summary.

brm_m0 <- brm(y ~ 1 + x + (1 | g), toy_data, iter = 2500)
summary(brm_m0, priors = T)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ 1 + x + (1 | g) 
   Data: data (Number of observations: 504) 
Samples: 4 chains, each with iter = 2500; warmup = 1250; thin = 1;
         total post-warmup samples = 5000

Priors: 
Intercept ~ student_t(3, 20.2, 8.6)
sd ~ student_t(3, 0, 8.6)
sigma ~ student_t(3, 0, 8.6)

Group-Level Effects: 
~g (Number of levels: 49) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     2.95      0.34     2.35     3.69 1.01      634     1034

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     9.42      0.47     8.51    10.36 1.00      389      903
x             3.70      0.06     3.59     3.81 1.00     4933     3703

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     2.07      0.07     1.95     2.21 1.00     4844     3774

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Please note these four blocks in the above summary:

  1. Priors: Because we call the summary function with the priors=T argument, we got this section with the prior distributions used to fit the model. In this case, they are the brms default ones.
  2. Group-Level Effects: Intercepts and coefficients at the group level; our only grouping variable, g, appears in the ~g section.
  3. Population-Level Effects: Population-level (global) intercept and coefficients.
  4. Family Specific Parameters: Auxiliary model parameters, such as \(\sigma_\epsilon\), called sigma here.

In the summary, each Estimate is the mean of the corresponding parameter values in the sample from the posterior distribution. With the data in this summary, we can say the realization of equations (1.6)-(1.9) for our model is: \[\begin{align*} y^{(i)} &= \pmb{\alpha}^{(j)} + 9.42 + 3.70 x^{(i)} +\epsilon^{(i)} ,~\text{for}~i=1, \dots ,504 \\ \pmb{\alpha}^{(j)} &= \alpha_0^{(j)} + \eta^{(j)}, ~\text{for}~j=1, \dots ,49 \\ \eta^{(j)} &\sim \mathcal{N}(0, 2.95^2) \\ \epsilon^{(i)} &\sim \mathcal{N}(0, 2.07^2) \end{align*}\]

To have all the point-estimated coefficients, we still need the group-varying intercepts \(\alpha_0^{(j)}\), which we will obtain later. First, let's see what just happened.

1.1.2 Posterior Distribution obtained with Stan

When we fit the model with the brm() function, it transformed our formula y ~ 1 + x + (1 | g) into a model coded in stan, including its default prior distributions. With that code and our data, stan fitted the model. The brm() function then received from stan \(5000\) samples from the posterior distribution of our Toy model (1.5), and based on them it estimated the parameters reported in the fitting summary.


Figure 1.1: Histograms of the \(\beta_0\) and \(\beta_1\) coefficients, for the Toy model, in the samples received by brms from stan. They are approximations to their posterior marginal distribution.
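
As a sketch of how figure 1.1 can be reproduced, we can extract the posterior draws and plot the marginal distributions of the global coefficients. The parameter names b_Intercept and b_x are the brms defaults; as_draws_df() assumes a recent enough brms (older versions use posterior_samples() instead):

draws_m0 <- as_draws_df(brm_m0)  # one row per post-warmup draw (5000 here)

par(mfrow = c(1, 2))
hist(draws_m0$b_Intercept, breaks = 40, main = expression(beta[0]), xlab = "")
hist(draws_m0$b_x, breaks = 40, main = expression(beta[1]), xlab = "")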

The number of samples we get depends on the chains, iter, and warmup arguments in the call to brm(), where the sample size is \(chains \times (iter-warmup)\); in our call, 4 * (2500 - 1250) = 5000. We used the default warmup, which is half the number of iterations iter, and the default value of chains, 4.

There is always the chance that stan cannot satisfactorily traverse the parameters' entire sample space, in which case the returned samples won't be representative of the actual posterior distribution. To keep track of this, stan issues statistics such as Rhat and Bulk_ESS that appear in the summary (for more details see Brief Guide to Stan's Warnings). Ideally, Rhat is less than 1.01, with 1 being its best possible value, and Tail_ESS and Bulk_ESS are expected to be greater than \(100 \times chains\). We used iter = 2500 because with the default value of 2000 we did not pass these checks; now everything is fine.

1.1.3 Parameter Estimates

The brms package allows us to obtain both a sample from the posterior distribution of the parameters and their point estimates, calculated from the sampled values. For simplicity, here we are going to use the point values.

With the ranef() (random effects) function we can obtain the group-varying coefficients \(\alpha_0^{(j)}\), with fixef() (fixed effects) the global ones, and with coef() the consolidation of both.

In our example the population-level intercept \(\beta_0\) from fixef() is \(9.42\), and the intercepts for groups 1 and 2, \(\alpha_0^{(1)}\) and \(\alpha_0^{(2)}\) from ranef(), are \(-3.41\) and \(-2.37\). Then the consolidated intercepts, \(\alpha_0^{(j)} + \beta_0\) from coef(), for these two groups are \(6.01\) and \(7.05\), since \(6.01 = -3.41+9.42\) and \(7.05 = -2.37+9.42\).

cat("--- fixef:\n"); fixef(brm_m0)
cat("--- coef:\n"); coef(brm_m0)

--- fixef:
          Estimate Est.Error     Q2.5     Q97.5
Intercept 9.420005 0.4740923 8.511883 10.355825
x         3.696783 0.0568952 3.585757  3.807682
--- coef:
$g
, , Intercept
    Estimate Est.Error      Q2.5     Q97.5
1   6.009167 0.8352279  4.409063  7.671721
2   7.046681 0.4624261  6.146496  7.931515
3  10.660634 0.5802230  9.518355 11.818091
:
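
As a quick consistency check, here is a sketch verifying that the coef() estimates equal the global fixef() intercept plus the ranef() group deviations:

# Sketch: the consolidated intercepts should equal beta_0 + alpha_0^(j).
global_int   <- fixef(brm_m0)["Intercept", "Estimate"]
group_dev    <- ranef(brm_m0)$g[, "Estimate", "Intercept"]
consolidated <- coef(brm_m0)$g[, "Estimate", "Intercept"]
head(cbind(global_int + group_dev, consolidated))  # both columns should match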

1.1.4 Goodness of Fit

There are multiple, complementary ways to assess the quality of a model fitted with stan. We'll look at a couple of them to get an idea of how our example progresses.

Toy | Posterior Predictive Checking

We can interpret each parameter set in the posterior sample as a potential solution of the model. With a set of these draws, together with the observed predictors \(X\), we can replicate the values of \(y\), which we will call \(y_{rep}\), by predicting them according to these potential solutions. In this way we obtain replicas that describe both the uncertainty we have about the parameters and that coming from the sampling distribution. We should expect the distribution of the observed \(y\) in the data to be a plausible case among these replicas \(y_{rep}\). Checking to what extent this holds is called a posterior predictive check.

The bayesplot package, from the same team of stan developers, includes dozens of chart types to compare the replicas with the observed data \(y\) (see more details here). Fortunately, brms makes those charts easy to access with its pp_check() function.

With the pp_check() (posterior predictive checking) function, many graph types can be obtained by varying the type parameter. The default plot shows the \(y\) values on the horizontal axis and their density on the vertical one.

pp_check(brm_m0, nsamples = 100) +
  theme(legend.position = c(0.5, 0.4))

Figure 1.2: Posterior Predictive Distribution of the Toy Model | Varying only Intercept by Group. Our model is not close enough to the observed data. On the horizontal axis are the values of \(y\) and on the vertical their density.

While we do not want to overfit the model to the data, we expect the model to follow its general pattern. In graph 1.2 we see that our model is deficient in several sections, such as the left tail and the region of \(y\) between \(10\) and \(15\).

Loo Information Criterion

Sometimes an index is required to quantify the goodness of fit and compare two models. In our examples we'll use loo. loo() and WAIC are similar methods, with the same assumptions, to estimate the model's out-of-sample predictive capability. This type of measure contrasts with performance estimates based on how close the model replicas are to the actual data, such as R2-type indices.

It is advisable to prefer loo over WAIC because it is able to tell when its calculation has produced an unreliable result. Both give similar and asymptotically equal values; however, only loo will tell you when outliers in the data have dominated the estimation and it is not trustworthy.

loo estimates the model's out-of-sample predictive ability from its log-likelihood, computed using leave-one-out cross-validation (LOO-CV), and is more useful when the number of observations is large (Vehtari, Gelman, and Gabry 2016).

loo(brm_m0)

  Computed from 5000 by 504 log-likelihood matrix

           Estimate   SE
  elpd_loo  -1111.3 19.2
  p_loo        50.8  4.3
  looic      2222.6 38.4
  ------
  Monte Carlo SE of elpd_loo is 0.2.
  Pareto k diagnostic values:
                           Count Pct.    Min. n_eff
  (-Inf, 0.5]   (good)     495   98.2%   533       
   (0.5, 0.7]   (ok)         9    1.8%   103       
     (0.7, 1]   (bad)        0    0.0%   <NA>      
     (1, Inf)   (very bad)   0    0.0%   <NA>      

  All Pareto k estimates are ok (k < 0.7).
  See help('pareto-k-diagnostic') for details.

The loo index is stated in two ways (see more details here):

  • elpd_loo is a value we want to maximize (higher is better).
  • looic is a value we want to minimize (lower is better).

elpd_loo can be positive or negative. In turn, looic is easily calculated from elpd_loo, since \(looic \equiv -2 \times elpd\_loo\).

p_loo is an estimate of how much more difficult it is to predict future data than the observed \(y\). Asymptotically (in the number of observations) and under certain conditions, p_loo can be interpreted as the effective number of parameters. When the model performs well, \(p\_loo<N\) and \(p\_loo<p\), where \(p\) is the total number of parameters in the model and \(N\) is the number of observations. When this does not hold, it indicates that the model has weak predictive ability and may have a severe specification error.

The loo() values for our model

The values of elpd_loo and looic, compared to their SE (standard error), look reliable. They are the reference against which we will compare changes to this model. The Pareto diagnostic also indicates that we can trust these estimates, because there are no outliers with excessive influence over the loo estimate.

On the other hand, p_loo = 50.8 does not indicate problems in our model: we have \(p = 53\) parameters (49 group intercepts + two population-level coefficients + two variances) and \(N = 504\), so \(p\_loo <53\) and \(p\_loo <504\).

1.2 Intra-Class Correlation

A relevant question in MLR is how large the between-group variance is compared to the total variance. This proportion is measured by the intraclass correlation (ICC): \[\begin{equation} ICC = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_{\epsilon}^2} \tag{1.11} \end{equation}\]

In our model we have \(ICC=2.95^2/(2.95^2+2.07^2)=67\%\). The higher the ICC, the more relevant the use of groups in the model.
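
Instead of plugging in point estimates, we can also compute the ICC for every posterior draw and obtain its full posterior distribution. A minimal sketch, assuming the default brms parameter names sd_g__Intercept (our \(\sigma_\alpha\)) and sigma (our \(\sigma_\epsilon\)):

draws_m0  <- as_draws_df(brm_m0)
icc_draws <- draws_m0$sd_g__Intercept^2 /
  (draws_m0$sd_g__Intercept^2 + draws_m0$sigma^2)

mean(icc_draws)                        # point estimate, about 0.67
quantile(icc_draws, c(0.025, 0.975))   # 95% credible interval of the ICC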


Figure 1.3: ICC levels shown by the positions of points grouped by colors. The higher the ICC, the greater the between-group variance relative to the within-group variance.

Graph 1.3 contains point clouds of 6 different colors, representing each point's membership in a group. The positions of the points and of the cloud centers have variances corresponding to \(\sigma_{\epsilon}^2\) and \(\sigma_\alpha^2\), respectively, in the ICC formula. The ICC levels (left to right) are 20%, 50%, and 90%, illustrating that the higher the ICC, the greater the variance of the group centers relative to that of the points within the same group.

1.3 A Multilevel Regression is Required

Intuitively, a higher value of \(\sigma_\alpha\) relative to \(\sigma_{\epsilon}\) corresponds to a greater relevance of the groups, but above what value of this ratio is an MLR recommended? The use of an MLR is advised when the design effect index \(\text{deff}\) is greater than \(1.1\) (Lai 2015). The design effect is defined as: \[\begin{equation} \text{deff} = 1+(n_j-1) \cdot \text{ICC} \tag{1.12} \end{equation}\]

Where \({n}_j\) is the average number of observations per group, and ICC is the intra-class correlation. In our example \(\text{deff} = 1+(10.3-1) \times 0.67= 7.2\) which justifies the use of a multilevel regression with groups of g.
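
As a quick check in R, here is a sketch of this calculation with the values reported above:

# Design effect, equation (1.12), for the Toy data.
n_j <- mean(table(toy_data$g))   # average observations per group, about 10.3
icc <- 0.67                      # intra-class correlation estimated above
1 + (n_j - 1) * icc              # about 7.2, well above the 1.1 threshold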

If we simplify equation (1.12), conditional on \(\text{deff}>1.1\), we find that the requirement is: \[ \frac{\sigma_{\alpha}}{\sigma_\epsilon} > \frac{1}{\sqrt{10 \cdot n_j-11}} \]
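
Concretely, writing the ICC in terms of the variances, the simplification proceeds as follows: \[\begin{align*} 1+(n_j-1)\,\frac{\sigma_\alpha^2}{\sigma_\alpha^2+\sigma_\epsilon^2} &> 1.1 \\ 10\,(n_j-1)\,\sigma_\alpha^2 &> \sigma_\alpha^2+\sigma_\epsilon^2 \\ \sigma_\alpha^2\,(10\,n_j-11) &> \sigma_\epsilon^2 \\ \frac{\sigma_\alpha}{\sigma_\epsilon} &> \frac{1}{\sqrt{10\,n_j-11}} \end{align*}\]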


Figure 1.4: When is an MLR required? An MLR is required when the ratio \(\sigma_\alpha/\sigma_\epsilon\) is above the curve, corresponding to \(\textrm{deff}> 1.1\), the green area on the graph.

For \(n_j\) between 5 and 10, this requires \(\sigma_\alpha\) to be greater than 0.16 and 0.11 times the value of \(\sigma_{\epsilon}\), respectively, which is not a very hard bar to pass.

1.4 Reversal to a Regression without Groups

The equations (1.6)-(1.9), in the notation of our model in (1.5), correspond to \[\begin{align} y^{(i)} &= \alpha_0^{(j)} + \beta_0 + {\beta_1}{x}^{(i)} +\epsilon^{(i)} ,~\text{for}~i=1, \dots ,N \tag{1.13} \\ \alpha_0^{(j)} &\sim \mathcal{N}(0, \sigma_\alpha^2) \nonumber \\ \epsilon^{(i)} &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \nonumber \end{align}\]

From the distribution of \(\alpha_0^{(j)}\) we can see that as \(\sigma_\alpha\) approaches zero, \(\alpha_0^{(j)}\) also approaches zero: \[ \alpha_0^{(j)} \to 0 \quad \text{as} \quad \sigma_\alpha \to 0\]

That is, when there is no variance between groups, \(\alpha_0^{(j)}\) in equation (1.13) becomes zero, and the model reverts to \[\begin{align*} y^{(i)} &= \beta_0 + {\beta_1}{x}^{(i)} +\epsilon^{(i)} ,~\text{for}~i=1, \dots ,N \\ \epsilon^{(i)} &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \end{align*}\]

a regression without groups.

1.5 Toy | Varying Intercept and Slope

When the intercept and slope vary by group, the model includes the \(\alpha_1^{(j)} x\) term as a between-group varying term and, most notably, now states that \(\alpha_0^{(j)}\) and \(\alpha_1^{(j)}\) come from a bivariate Normal distribution with correlation \(\rho\). This last addition is a natural extension of the case we had before, with only the intercept varying between groups. The added \(\rho\) parameter will help us understand the relationship between \(\alpha_0\) and \(\alpha_1\); it is not a restriction, since it can be zero. \[\begin{align} y^{(i)} &= (\alpha_0^{(j)} + \alpha_1^{(j)} x_1^{(i)}) + {X}^{(i)} {\beta} + \epsilon^{(i)} ,~\text{for}~i=1, \dots ,\text{N} \nonumber \\ \begin{pmatrix} \alpha_0^{(j)} \\[5pt] \alpha_1^{(j)} \end{pmatrix} &\sim \mathcal{N} \begin{pmatrix} \begin{pmatrix} 0 \\[5pt] 0 \end{pmatrix} , \begin{pmatrix} \sigma_{\alpha_0}^2 & \rho~ \sigma_{\alpha_0} \sigma_{\alpha_1} \\[5pt] \rho~ \sigma_{\alpha_0} \sigma_{\alpha_1} & \sigma_{\alpha_1}^2 \end{pmatrix} \end{pmatrix} \tag{1.14} \\[4pt] \epsilon^{(i)} &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \nonumber \end{align}\]

When \(\beta\) also includes a global intercept and slope \(\beta_0+\beta_1 x+\cdots\), like the group-level terms, the group estimates are zero-centered, that is, \(\mu_{\alpha_0} = 0\) and \(\mu_{\alpha_1} = 0\), just as in the case with an intercept at both the global and group level.

It is important to note that all pairs of coefficients \((\alpha_0^{(j)}, \alpha_1^{(j)})\) will come from the same distribution. That is, the model learns from all the groups to fit the bivariate distribution, and with this, it estimates the coefficients \(\alpha_0^{(j)}\) and \(\alpha_1^{(j)}\) considering the patterns in each group.

It is easy to intuit that if we had another predictor varying per group, such as \(\alpha_2^{(j)} x_2\), then \((\alpha_0, \alpha_1, \alpha_2)\) would have a trivariate Normal distribution with a \(3 \times 3\) covariance matrix and 3 correlation coefficients: one for each pair of coefficients.
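
In the brms formula notation, such a model would be written as follows; a sketch only, assuming a hypothetical data set (here called some_data) with two group-varying predictors x1 and x2:

# Sketch (illustrative only): two hypothetical predictors x1 and x2 varying
# by group; brms then estimates a 3 x 3 covariance matrix for
# (Intercept, x1, x2) within each group of g.
brm_m_3 <- brm(y ~ 1 + x1 + x2 + (1 + x1 + x2 | g), data = some_data)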

1.5.1 Model Fitting

This time we will fit a model with intercept and slope at both the global and the group g level. We modify the formula to include x as a predictor at both levels.

brm_m1 <- brm(y ~ 1 + x + (1 + x | g), toy_data, seed = 1234)
summary(brm_m1)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ 1 + x + (1 + x | g) 
   Data: data (Number of observations: 504) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~g (Number of levels: 49) 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)        1.33      0.25     0.89     1.88 1.00     2344     3127
sd(x)                0.93      0.12     0.72     1.19 1.00     1389     2143
cor(Intercept,x)     0.18      0.20    -0.21     0.57 1.01      496     1020

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     9.13      0.26     8.61     9.64 1.00     3111     3020
x             3.72      0.15     3.44     4.02 1.00     1188     1955

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.55      0.06     1.45     1.66 1.00     5266     3072
:

The resulting model is \[\begin{align*} y^{(i)} &= \alpha_0^{(j)} + \alpha_1^{(j)} x^{(i)} + 9.13 + 3.72 x^{(i)} +\epsilon^{(i)} ,~\text{for}~i=1, \dots ,504 \\ \begin{pmatrix} \alpha_0^{(j)} \\[5pt] \alpha_1^{(j)} \end{pmatrix} &\sim \mathcal{N} \begin{pmatrix} \begin{pmatrix} 0 \\[5pt] 0 \end{pmatrix} , \begin{pmatrix} 1.33^2 & 0.18 \times 1.33 \times 0.93 \\[5pt] 0.18 \times 1.33 \times 0.93 & 0.93^2 \end{pmatrix} \end{pmatrix} \\ \epsilon^{(i)} &\sim \mathcal{N}(0, 1.55^2) \end{align*}\]

print(ranef(brm_m1), digits = 4)

$g
, , Intercept
   Estimate Est.Error  Q2.5  Q97.5
1   -0.7395    0.7702 -2.2919  0.7728
2   -0.9254    0.5796 -2.0871  0.2139
3   -0.6877    0.8318 -2.3608  0.8645
:
, , x
   Estimate Est.Error  Q2.5 Q97.5
1  -1.25787    0.3178 -1.87344 -0.65652
2  -0.44439    0.2108 -0.86385 -0.03554
3   0.65459    0.2634  0.14985  1.18126
:

The range between l-95% CI and u-95% CI gives each parameter's credible interval limits. Each credible interval is computed from the parameter values in the posterior sample, and it states that the parameter lies in this interval with a 95% probability, which is a fact about the sample's parameter values. On the other hand, a (frequentist) confidence interval says that if we obtained repeated samples from the same population the data comes from and estimated the interval on each sample, we would get different intervals, but 95% of them would contain the actual parameter value. That is not a statement about the certainty that a given interval includes the real value, as it is in the Bayesian case.

In general, these intervals get narrower with a greater volume of data and broader with higher variance. In the case of \(\rho\), which by its nature is small in magnitude, a very narrow interval is required to be precise. The actual value of \(\rho\), with which we generated the data, is \(0.4\).
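
A sketch of how the posterior of \(\rho\) can be inspected directly, assuming the default brms parameter name cor_g__Intercept__x:

draws_m1  <- as_draws_df(brm_m1)
rho_draws <- draws_m1$cor_g__Intercept__x

mean(rho_draws)                        # about 0.18, the Estimate in the summary
quantile(rho_draws, c(0.025, 0.975))   # wide interval; the actual value 0.4 is inside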

1.5.2 Goodness of Fit

pp_check()

Graph 1.5 shows this model's posterior predictive plot.

pp_check(brm_m1, nsamples = 100) +
  theme(legend.position = c(0.4, 0.6))

Figure 1.5: Posterior Predictive Distribution of the Toy Model | Varying Intercept and Slope. The observed data now looks plausible among the replicas.

The fit is now much better at capturing the actual pattern observed in the data; the observed data now looks like a plausible case among the replicas.

loo()

Now the loo index is:

Computed from 4000 by 504 log-likelihood matrix

        Estimate   SE
elpd_loo   -977.9 16.3
p_loo        71.3  4.8
looic      1955.8 32.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     481   95.4%   502       
 (0.5, 0.7]   (ok)        20    4.0%   100       
   (0.7, 1]   (bad)        3    0.6%   73        
   (1, Inf)   (very bad)   0    0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

In the Pareto k diagnostic above, we have three poorly classified observations, so the loo estimate might not be reliable. Therefore, we do a k-fold validation on our latest and previous models for greater confidence.

brm_m1 <- add_criterion(brm_m1, "loo")
brm_kfold_m1 <- kfold(brm_m1, K = 12, chains = 5)
brm_kfold_m0 <- kfold(brm_m0, K = 12, chains = 5)
compare_ic(brm_kfold_m1, brm_kfold_m0)

                KFOLDIC    SE
brm_m1          1967.43 33.18
brm_m0          2227.86 38.84
brm_m1 - brm_m0 -260.44 38.04

This output tells us that we could have trusted the previous loo estimate: the looic with the unsatisfactory Pareto diagnostic was \(1955\), and now we got \(1967\).

When comparing our two models, we see that this last one is better than the previous one: there is a large difference in the criterion, \(-260.44\), whose magnitude is much greater than its standard error of \(38.04\).
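
In more recent versions of brms, compare_ic() has been superseded by loo_compare(), which should give an equivalent comparison; a sketch:

# Sketch: loo_compare() reports elpd differences relative to the best model.
loo_compare(brm_kfold_m1, brm_kfold_m0)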

1.6 The Toy Dataset Source Model

The Toy dataset was produced by drawing samples from the following model: \[\begin{align*} y^{(i)} &= \alpha_0^{(j)} + \alpha_1^{(j)} x^{(i)} + 2.7 + 1.3 x^{(i)} +\epsilon^{(i)} ,~\text{for}~i=1, \dots ,504 \\ \begin{pmatrix} \alpha_0^{(j)} \\[5pt] \alpha_1^{(j)} \end{pmatrix} &\sim \mathcal{N} \begin{pmatrix} \begin{pmatrix} 6.2 \\[5pt] 2.3 \end{pmatrix} , \begin{pmatrix} 1.4^2 & 0.4 \times 1.4 \times 0.90 \\[5pt] 0.4 \times 1.4 \times 0.90 & 0.90^2 \end{pmatrix} \end{pmatrix} \\ \epsilon^{(i)} &\sim \mathcal{N}(0, 1.6^2) \end{align*}\]

As we have already commented, since the model has an intercept and slope at both the population and the group level, the population-level coefficients \(\beta_0, \beta_1\) absorb the means of (\(\alpha_0, \alpha_1\)), while the group-level coefficients (\(\alpha_0^{(j)}, \alpha_1^{(j)}\)) become deviations around the global mean; they are centered at zero.

In our data, the true global means are \(\mu_{\alpha_0}=2.7+6.2=8.9\) and \(\mu_{\alpha_1}=1.3+2.3=3.6\), while the model has estimated \(\beta_0=\mu_{\alpha_0}=9.1\) and \(\beta_1=\mu_{\alpha_1}=3.7\); considering the model's sampling variance, the estimates are very close to their true values.

Note that the model from which the data came is not identifiable, because of the arbitrary way the mean of \(\alpha_0^{(j)}, \alpha_1^{(j)}\) has been split. For example, breaking \(\beta_0+\mu_{\alpha_0}=8.9\) into \(\beta_0=2.7\) for the population and \(\mu_{\alpha_0}=6.2\) for the groups, as happens in this model, is one of infinitely many arbitrary partitions that produce the same data. The MLR answer to this, placing the mean at the population level and the deviations at the group level, is simple and elegant.

If we fit a model without an intercept or slope at the population level, y ~ -1 + (1 + x | g), we should obtain \(\alpha_0^{(j)}\) and \(\alpha_1^{(j)}\) centered not at zero but at their means:

brm_m2 <- brm(y ~ -1 + (1 + x | g), toy_data, seed=1234)
cat("Average Intercept    : \n"); ranef(brm_m2)$g[,"Estimate","Intercept"] %>% mean()
cat("Average x Coefficient: \n"); ranef(brm_m2)$g[,"Estimate","x"] %>% mean()

Average Intercept    : [1] 9.128348
Average x Coefficient: [1] 3.707252

Indeed, the means of \(\alpha_0\) and \(\alpha_1\), estimated with the averages of \(\alpha_0^{(j)}\) and \(\alpha_1^{(j)}\) (\(9.13\) and \(3.71\)), approximately match what we obtained before (\(9.13\) and \(3.72\)).

1.7 Regression Pooling Types

Pooling: Complete \[\begin{align*} \text{for}~i &= 1, \ldots, \text{N}: \\ \bar{y}^{(i)} &= \beta_0 + \beta_1 x_1^{(i)} + \beta_2 x_2^{(i)} + \dots + \beta_p x_p^{(i)} \end{align*}\]

A regression without groups, which assumes that the observations do not vary between groups, is said to use complete pooling. In this case, a single regression over all the data points shows their global trend (gray dotted line in 1.6). As the model is not informed of each group's trend, it underfits the data: see groups 23 and 6 in 1.6.

Pooling: None \[\begin{align*} \text{for}~j &= 1, \ldots, \text{J}, ~i \in \text{group}~j: \\ \bar{y}^{(i)} &= \beta_0^{(j)} + \beta_1^{(j)} x_1^{(i)} + \beta_2^{(j)} x_2^{(i)} + \dots + \beta_p^{(j)} x_p^{(i)} \end{align*}\]

A separate regression on each group of points, with each group completely ignoring the others, is said to use no pooling. In this case there are as many regressions as there are groups (with more than one point) in the data (dashed line in 1.6). As each regression is only informed by the group's pattern and ignores the global trend, it overfits groups with few observations (see groups 12, 22, 30, and 42). It is also unable to handle groups with a single data point.

When a regression is fitted with a group-varying slope and intercept without using the multilevel model (for example with lm(), treating g as a factor), the resulting regressions also show this no-pooling behavior.

lm_fit <- lm(y ~ 1 + x + g + g:x, toy_data)

Pooling: Partial \[\begin{align*} \text{for}~i &= 1, \ldots, \text{N}, ~j = g[i]: \\ \bar{y}^{(i)} &= \alpha_0^{(j)} + \alpha_1^{(j)} x_1^{(i)} + \dots + \alpha_q^{(j)} x_q^{(i)} + \beta_0 + \beta_1 x_1^{(i)} + \beta_2 x_2^{(i)} + \dots + \beta_p x_p^{(i)} \end{align*}\]

A multilevel regression can be understood as a generalization of the two pooling types above, in which there is partial pooling. Each group's slope and intercept come from the same distribution, whose mean approximates the complete-pooling estimate. Nevertheless, each group's coefficients are estimated taking into account the trend of the group's observations. That is, the model is informed by both the global trend and the local trend in each group (solid line in 1.6).

When there are few observations, the group regression tends toward the global pattern (groups 22, 42, and 30), and when there are more observations, the regression fits more closely to the group's points (group 23).


Figure 1.6: Regressions by Pooling Type.   (A) Complete Pooling: The gray dotted line is from a regression on all points. By ignoring the trend of each group, it tends to underfit: groups 23 and 6.   (B) No Pooling: The dashed line is a regression on each separated group of points. It tends to overfit when there are few group observations: groups 22 and 42.   (C) Partial Pooling: The solid line is from a Multilevel regression where the slope and intercept vary between the two previous extremes. The fitting on the group trend is greater when it has more observations: group 23; while it is more towards the global trend when there are few points in the group: groups 22, 30, and 42.

1.8 Shrinking in Partial Pooling

To see in more detail the effect of partial pooling in an MLR for selected groups in the Toy model, we will compare the slope obtained with the three types of regressions we saw in the previous section. All the models are fitted using stan.

In graph 1.7, the thick gray horizontal line is the regression slope with complete pooling, calculated with stan. The dots represent the estimated slope in each group and the vertical lines their 95% credible intervals. The groups are arranged on the horizontal axis according to their size (number of observations). Thus group 49, on the extreme left, has only one observation, and group 9, to the right, has 21. Since we created the data, we know the actual value of each group's slope, shown with a green diamond.


Figure 1.7: Slopes by Regression Pooling type. Each group is arranged along the horizontal axis according to the number of observations in each group. In most cases, Multilevel estimates (blue point) have a value between the estimate with no pooling (red point) and complete pooling (gray line). The multilevel model increases the estimate precision (narrower intervals) and improves the slope estimate by bringing it closer to the actual slope (with which we generated the data), shown with a green diamond.

The blue lines are intervals from the multilevel regression (calculated with brms) and the red ones from the regression with no pooling (one regression per group, estimated with stan). For groups with three or fewer points, stan could not estimate the no-pooling regression due to diagnostic errors caused by the scarce data. That is why the first two groups on the left show only blue estimates, from the multilevel regression, which is capable of solving these cases.

In most groups, the MLR estimate (blue point) has a value between the estimate with no pooling (red point) and the estimate with complete pooling (gray trace), which is called shrinkage. This shrinkage (towards the mean) tends to be larger when there are fewer observations in the group: the greater uncertainty is compensated by giving more weight to the global pattern. The shrinkage increases the certainty (narrower intervals) and, in most cases, brings the estimate closer to the actual value.
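
A sketch of how the per-group slopes behind graph 1.7 can be contrasted, using lm() as a stand-in for the per-group stan fits and coef() for the partially pooled estimates:

# Sketch: no-pooling slope (one lm per group) versus the partially pooled
# slope from the multilevel model, for groups with enough observations.
groups <- names(which(table(toy_data$g) > 3))
slope_no_pool <- sapply(groups, function(j) {
  coef(lm(y ~ x, data = subset(toy_data, g == j)))[["x"]]
})
slope_partial <- coef(brm_m1)$g[groups, "Estimate", "x"]
head(cbind(slope_no_pool, slope_partial))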

Gelman (Gelman, Hill, and Yajima 2013) notes that this shrinkage makes it possible to compare effects across groups without worrying about the classic multiple comparison procedures. They even propose building MLRs when these comparisons are required:

“We propose the construction of multilevel models in settings where multiple comparisons arise. Multilevel models perform partial pooling, while classical procedures typically keep the centers of the intervals fixed, adjusting for multiple comparisons by making the intervals wider [...]. Therefore, multilevel models address the multiple comparison problem and also produce more efficient estimates, especially in settings with low variation at the group level, which is where multiple comparisons are of particular concern.”

(Gelman, Hill, and Yajima 2013)

References

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Bürkner, Paul-Christian. 2018. “Advanced Bayesian Multilevel Modeling with the R Package brms.” The R Journal 10 (1): 395–411. https://doi.org/10.32614/RJ-2018-017.
———. 2020. Brms: Bayesian Regression Models Using ’Stan’. https://CRAN.R-project.org/package=brms.
Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software. https://doi.org/10.18637/jss.v076.i01.
Gabry, Jonah, and Ben Goodrich. 2020. Rstanarm: Bayesian Applied Regression Modeling via Stan. https://CRAN.R-project.org/package=rstanarm.
Gelman, Andrew, Jennifer Hill, and Masanao Yajima. 2013. “Why We (Usually) Don’t Have to Worry about Multiple Comparisons.” Journal of Research on Educational Effectiveness 5: 189–211.
Guo, Jiqiang, Jonah Gabry, Ben Goodrich, and Sebastian Weber. 2020. Rstan: R Interface to Stan. https://CRAN.R-project.org/package=rstan.
Lai, Mark H. 2015. “Examining the Rule of Thumb of Not Using Multilevel Modeling: The ‘Design Effect Smaller Than Two’ Rule.” The Journal of Experimental Education 83 (3): 423–38. https://doi.org/10.1080/00220973.2014.907229.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2016. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.
