We shall analyze Peter Mortimer’s dataset from his work “A Study of Effective Elementary Schools” (Mortimer, Ecob, and Sammons 1989). This dataset is known as the “Junior School Project” (JSP) and is available as jsp in the R package faraway.
The JSP was a longitudinal study of about 2,000 pupils from 50 primary schools chosen at random among the 636 schools under the Inner London Education Authority (ILEA) in 1980. Here we will use the subset available from that dataset (its variables are shown in the str() listing below). The father’s socioeconomic class, social, has the categories I, II, III nonmanual, III manual, IV, V, lterm unemp (long-term unemployed), curr unemp (currently unemployed), and father absent (absent father).
The Maths and Raven scores have negatively skewed (long left tail) distributions due to a ceiling effect: the scores of the most proficient students vary very little relative to the measured ability. Also, the score range of English differs from that of the other tests. To correct these problems, we have transformed the math, english, and raven variables using the R package bestNormalize (Peterson and Cavanaugh 2019), which evaluates several possible monotonic transformations (e.g., Yeo-Johnson, Lambert) and applies the most effective one for the given case. For our data, the selected method was Ordered Quantile normalization for all the scores. We applied the correction to each test score by school year.
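To build intuition for what Ordered Quantile normalization does, here is a minimal sketch of the underlying rank-to-normal-quantile mapping on simulated ceiling-affected scores. This is the core idea only, not the actual bestNormalize::orderNorm implementation, which handles ties and out-of-sample transformation more carefully; the data are made up for illustration.

```r
# Minimal sketch of rank-based (Ordered Quantile) normalization.
# Not bestNormalize::orderNorm, which handles ties and new data.
ordered_quantile <- function(x) {
  n <- length(x)
  qnorm((rank(x) - 0.5) / n)  # map ranks to standard normal quantiles
}

set.seed(1)
raw <- pmin(rbeta(500, 5, 1.5) * 100, 100)  # left-skewed scores with a ceiling
z <- ordered_quantile(raw)
c(mean = mean(z), sd = sd(z))  # approximately 0 and 1
```

Because the mapping is monotone, the ranking of the students is preserved; only the shape of the distribution changes.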
Figure 1.1: JSP Correction for Bias in Scores. (Left) Histogram of the original scores in math, and (Right) after normalization.
This normalization does not seem so arbitrary when considering this is not an uncommon practice in educational settings. The ceiling effect occurs when a test does not discriminate enough against the most outstanding students. Consequently, from a certain level of competence onwards, they are receiving very similar scores. Ideally, you should add more difficult questions to the test to better discriminate those skilled students. However, in post facto cases like this, the normalization redistributes the scores to correct this.
The transformed variables have also been standardized (centered and scaled) to have mean zero and unit standard deviation. These transformed variables have names ending in _z. The “JSP Correction for Bias in Scores” graph (Figure 1.1) shows the distribution of the original maths scores before and after the normalization and standardization.
str(jsp)
'data.frame': 3236 obs. of 9 variables:
$ school : Factor w/ 49 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ class : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ gender : Factor w/ 2 levels "m","f": 2 2 2 1 1 1 1 1 1 1 ...
$ social : Factor w/ 9 levels "I","II","III nonmanual",..: 9 9 9 2 2 2 2 2 9 9 ...
$ student : Factor w/ 1192 levels "1","2","3","4",..: 1 1 1 2 2 3 3 3 4 4 ...
$ year : Factor w/ 3 levels "3","4","5": 1 2 3 1 2 1 2 3 1 2 ...
$ raven_z : num -0.403 -0.403 -0.403 -1.588 -1.588 ...
$ math_z : num -0.351 -0.256 -1.146 -1.48 -1.626 ...
$ english_z: num 0.7718 0.4376 -0.0204 -1.9804 -1.9337 ...
In the “Examples of Math Scores” graph (Figure 1.2), we compare, for some selected schools, the mathematics scores in the fifth year against those in the third. The lines correspond to classical linear regressions fitted to the students of each school and are intended to show the score pattern in each school.
Figure 1.2: Examples of Math Scores. Scores at selected schools. The lines correspond to classical linear regressions fitted to the students of each school and are intended to show the score pattern in each school.
In some schools, the regression line is almost parallel to and completely above that of another school (e.g., 36:light blue over 3:orange), which corresponds to the fact that most students in the school above got better scores in 5th grade than those in the school below. Consequently, the top school was more effective in promoting student learning than the bottom one.
When the trend lines between two schools are not parallel, the one above on the left side corresponds to where the most lagging students had better progress from 3rd to 5th grade, while the one higher on the right side is where the most proficient students made relatively more progress. Consequently, the lines with little slope correspond to schools where the gap between lagging and skilled students is small.
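The per-school trend lines described above can be sketched with classical lm() fits, one per school. The data below are simulated for illustration (the school labels and effect sizes are assumptions, not the JSP values); the point is only that each school gets its own intercept/slope pair.

```r
# Sketch of per-school trend lines (simulated data, not JSP):
# fit a classical lm() of 5th-year on 3rd-year scores within each school.
set.seed(10)
d <- data.frame(
  school  = rep(c("3", "36"), each = 30),  # illustrative school labels
  score_3 = rnorm(60)
)
# school "36" is given a higher baseline than school "3"
d$score_5 <- ifelse(d$school == "36", 0.5, -0.5) +
  0.6 * d$score_3 + rnorm(60, sd = 0.4)

fits <- lapply(split(d, d$school), function(s) lm(score_5 ~ score_3, data = s))
sapply(fits, coef)  # one (intercept, slope) column per school
```

A line entirely above another, as in the 36-over-3 example, shows up here as a clearly larger fitted intercept for that school.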
Having the english and math scores for the 3rd and 5th year, we could analyze the school’s effect on performance by subject, but we would end up with two classifications of schools, one for english and another for math. Even though we might find a way to consolidate those results into a single ranking, this seems like an overly complicated route.
We have chosen to merge the english and math scores and treat them simply as scores in the third (score_3_z) and fifth (score_5_z) year. Joining the english and math scores this way lets us demonstrate the use of Multilevel Regressions (MLR) without an overly complicated analysis. Likewise, following the same criteria, we have kept only the students who stayed in the same school from the 3rd to the 5th year.
str(jsp_sc, give.attr = F)
tibble [1,774 × 8] (S3: tbl_df/tbl/data.frame)
$ school : Factor w/ 49 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ class : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ gender : Factor w/ 2 levels "m","f": 2 1 1 2 1 2 2 2 2 1 ...
$ social : Factor w/ 9 levels "I","II","III nonmanual",..: 9 2 9 4 9 9 4 2 7 5 ...
$ student : Factor w/ 1192 levels "1","2","3","4",..: 1 3 4 6 7 8 11 14 15 16 ...
$ raven_z : num [1:1774, 1] -0.445 -0.619 -1.868 -1.449 -1.303 ...
$ score_3_z: num [1:1774, 1] -0.435 1.598 -0.297 -0.964 -0.571 ...
$ score_5_z: num [1:1774, 1] -1.15896 1.8403 -0.00456 -2.1354 -0.83616 ...
The outcome variable \(y\) in our model is the 5th-year score score_5_z, and the potential predictor variables are score_3_z, raven_z, school, gender, and social. As we centered score_5_z, the outcome variable is the student’s 5th-year score as a deviation from its mean. As usual, the intercept has the unit of \(y\); in our case, due to the scaling, that unit is the standard deviation of the 5th-year scores. For short, we will write \(\sigma_y\) for the standard deviation of score_5_z.
The intercept corresponds to the case when the numeric predictors are zero, score_3_z = 0 and raven_z = 0. As these variables are centered at their means, the intercept is the expected 5th-year score for students with average numeric predictors. In other words, the intercept is the expected 5th-year score for an average 3rd-year student.
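This property of centered predictors can be checked directly with a classical lm() on simulated data (the data here are made up; only the algebraic fact matters): with mean-centered predictors, the fitted intercept equals the mean of the outcome, i.e., the expected score for an “average” student.

```r
# Toy check (simulated data, not JSP): with mean-centered predictors,
# the fitted intercept of a linear model equals the mean of the outcome.
set.seed(42)
x1 <- rnorm(200); x2 <- rnorm(200)
y  <- 0.6 * x1 + 0.2 * x2 + rnorm(200, sd = 0.5)

fit <- lm(y ~ scale(x1, scale = FALSE) + scale(x2, scale = FALSE))
c(intercept = unname(coef(fit)[1]), mean_y = mean(y))  # the two agree
```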
The primary goal of Mortimer’s study is to find out whether some schools are more effective than others in promoting student learning and development. The improvement of the scores from the third to the fifth year is indicative of the student’s progress. In causal analysis terminology, the school is the exposure, and the student’s 5th-year score is the outcome, adjusted by the 3rd-year scores.
Figure 1.3: JSP Causal Diagram. Factors that influence student performance.
The “JSP Causal Diagram” (Figure 1.3) shows our causal model, where we consider the school, the student’s cognitive ability, and his/her other personal traits as drivers of his/her progress in school. We use the Raven score as a proxy for cognitive ability.
In several schools, the number of students’ fathers in each socioeconomic classification is quite unbalanced towards the upper or lower classes. Hence, it seems appropriate to consider that the father’s socioeconomic class influences the choice of school. We also consider that the father’s class influences the environment and the facilities the student has to develop and study, affecting his/her scores and cognitive skills.
Finally, we also consider that gender may influence the pupil’s focus and predisposition to learn during the school years.
To get an idea of how much our subsequent models improve, we shall build the simplest possible one, where the 5th-year scores are constant. \[\begin{align*} \textrm{score\_5\_z}^{(i)} &= \beta_0 + \epsilon^{(i)} \\ \epsilon^{(i)} &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \end{align*}\]
(jsp_null <- brm(score_5_z ~ 1, data = jsp_sc))
loo(jsp_null)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: score_5_z ~ 1
Data: jsp_sc (Number of observations: 1774)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.00 0.02 -0.05 0.05 1.00 3520 2341
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.00 0.02 0.97 1.03 1.00 3591 2725
Computed from 4000 by 1774 log-likelihood matrix
Estimate SE
elpd_loo -2518.7 27.2
p_loo 1.9 0.1
looic 5037.4 54.4
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
In this model the intercept \(\beta_0 = 0.00\) is the mean of score_5_z, which is zero due to its centering. Recall that \(\sigma_{\epsilon}\), represented by sigma in the listing above, is in this model the standard deviation of \(y\), and because of the scaling it is equal to \(1\).
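As a quick sanity check of this interpretation, an intercept-only model fitted with lm() (a frequentist stand-in for the Bayesian point estimates, on simulated standardized data) recovers exactly the mean and the standard deviation of the outcome.

```r
# An intercept-only model on a standardized outcome recovers
# beta0 = 0 and sigma = 1 (lm() used as a frequentist stand-in).
set.seed(7)
y <- as.numeric(scale(rnorm(1774)))  # standardized like score_5_z
fit0 <- lm(y ~ 1)
c(beta0 = unname(coef(fit0)), sigma = sigma(fit0))  # essentially 0 and 1
```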
To assess whether it is appropriate to group by school, by the father’s socioeconomic classification social, and by student, we will fit a model with only an intercept in each of these groups. At the population level, we will try the predictors we expect to use in our final model.
jsp_m0 <- brm(
  score_5_z ~ gender + score_3_z + raven_z +
    (1 | school) + (1 | social) + (1 | student),
  data = jsp_sc, refresh = 0
)
There were 127 divergent transitions after warmup.
The largest R-hat is 1.06, indicating chains have not mixed.
Effective Samples Size (ESS) is too low, indicating posterior means and medians
and tail quantiles may be unreliable.
The messages in the fitting output, like the divergences after warmup, come from stan, and they warn us that it has difficulty sampling the posterior distribution. In this case, we need more iterations, iter = 3000, and more warmup, warmup = 2000 (by default, they are 2000 and 1000, respectively). We also need higher values for the parameters adapt_delta = 0.99 and max_treedepth = 12 (by default, they are \(0.8\) and \(10\), respectively). You can use the suggested link above to get documentation about these warnings.
jsp_m0 <- brm(
  score_5_z ~ gender + score_3_z + raven_z +
    (1 | school) + (1 | social) + (1 | student),
  data = jsp_sc,
  seed = 123,
  iter = 3000, warmup = 2000,
  control = list(adapt_delta = 0.99, max_treedepth = 12)
)
jsp_m0
Family: gaussian
Links: mu = identity; sigma = identity
Formula: score_5_z ~ gender + score_3_z + raven_z + (1 | school) + (1 | social) + (1 | student)
Data: jsp_sc (Number of observations: 1774)
Samples: 4 chains, each with iter = 3000; warmup = 2000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~school (Number of levels: 48)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.24 0.03 0.18 0.31 1.00 1575 2560
~social (Number of levels: 9)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.08 0.04 0.01 0.19 1.00 1119 1323
~student (Number of levels: 887)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.29 0.02 0.24 0.34 1.00 1073 1418
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.02 0.06 -0.13 0.09 1.00 1794 2635
genderf 0.05 0.03 -0.02 0.11 1.00 5082 3561
score_3_z 0.59 0.02 0.55 0.63 1.00 3162 3419
raven_z 0.19 0.02 0.15 0.23 1.00 3504 3347
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.55 0.01 0.52 0.57 1.00 1429 1913
The intercepts at the school, social, and student levels are their mean effects on the expected 5th-grade score, with \(\sigma_y\) as the unit. The population-level Intercept is the expected score for the boys (the reference level), while genderf is the girls’ offset from it, so the average girl scores about \(0.05\,\sigma_y\) better than the average boy.
The variance between schools is \(0.24^2\), between social categories \(0.08^2\), and between students \(0.29^2\), which corresponds to \(ICC=0.24^2/(0.55^2+0.24^2+0.08^2+0.29^2)=12.8\%\) for schools, \(1.4\%\) for social categories, and \(18.7\%\) for students. The average number of observations per school is \(37\), per social category \(197\), and per student \(2\). Then, the design effect index is \(\textit{deff}=1+(37-1)\times 0.128=5.6>1.1\) for school, \(3.74>1.1\) for social, and \(1.19>1.1\) for student, which means we should keep modeling the data with those groups (Lai 2015).
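The ICC and design-effect arithmetic above can be reproduced in a few lines of base R, using the fitted group-level standard deviations and the average cluster sizes quoted in the text:

```r
# ICC and design effect from the fitted sds (values from the summary)
sd_school <- 0.24; sd_social <- 0.08; sd_student <- 0.29; sigma <- 0.55
total_var <- sd_school^2 + sd_social^2 + sd_student^2 + sigma^2

icc <- c(school = sd_school^2, social = sd_social^2,
         student = sd_student^2) / total_var
m <- c(school = 37, social = 197, student = 2)  # mean observations per group
deff <- 1 + (m - 1) * icc                       # Lai (2015) design effect

round(rbind(icc = icc, deff = deff), 3)  # all deff > 1.1
```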
We will keep the school, social, and student groups as required by the design effect index, but now we will add score_3_z and raven_z as predictors in the school and social groups. In other words, we think these variables may influence the student differently depending on the school and on the father’s socioeconomic classification social.
formula <- score_5_z ~ 1 + gender + score_3_z + raven_z + (1 | student) +
  (1 + score_3_z + raven_z | school) + (1 + score_3_z + raven_z | social)
We have the categorical variable gender only at the population level. It is not a good idea to add a group with this variable: it has only two possible values, and technically it is impossible to estimate the variance between only two values with a reasonable level of certainty. On the other hand, by having both the intercept and gender at the population level, we will obtain the global intercept but only the intercept for one of the genders (as Intercept and genderf in the model summary above). A model with a global intercept plus one intercept per gender is not identifiable; that is why we lose one gender intercept. To avoid that, we shall remove the global intercept and obtain an intercept for each gender.
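The identifiability point can be seen directly in the design matrix that R builds for each parameterization: with an intercept, one gender level is absorbed into it; without it, each gender gets its own indicator column.

```r
# How dropping the global intercept yields one intercept per gender:
# compare the columns of the two design matrices.
g <- factor(c("m", "f", "f", "m"))
colnames(model.matrix(~ 1 + g))  # "(Intercept)" "gm": one level is absorbed
colnames(model.matrix(~ 0 + g))  # "gf" "gm": an intercept for each gender
```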
formula <- score_5_z ~ 0 + gender + score_3_z + raven_z + (1 | student) +
  (1 + score_3_z + raven_z | school) + (1 + score_3_z + raven_z | social)
This time, we will establish some prior distributions instead of using the default ones as we have been doing. To see the prior distributions of the model, we can use the get_prior()
function.
(jsp_priors <- get_prior(formula, jsp_sc))
prior class coef group resp dpar nlpar bound
1 b
2 b genderf
3 b genderm
4 b raven_z
5 b score_3_z
6 lkj(1) cor
7 cor school
8 cor social
9 student_t(3, 0, 2.5) sd
10 sd school
11 sd Intercept school
12 sd raven_z school
13 sd score_3_z school
14 sd social
15 sd Intercept social
16 sd raven_z social
17 sd score_3_z social
18 sd student
19 sd Intercept student
20 student_t(3, 0, 2.5) sigma
In the list above, we see that since we have omitted the global intercept, we will obtain the coefficients genderf and genderm (rows 2 and 3), corresponding to the intercept for each gender, as we wanted. In this article you will find suggestions for prior distributions.
Parameters with a blank prior value have a prior distribution assigned by default by stan, while those that are not blank are brms defaults. It is possible to establish a prior distribution for each parameter separately, or the same distribution can be established for a subset of them, as denoted by rows 1, 6, and 9. Thus, for example, the priors with class = b are assigned by stan (they are blank), while those of the classes class = cor, class = sd, and class = sigma have brm default distributions (rows 6, 9, and 20). More precisely, row 6 indicates that the prior with class = cor (the correlation between coefficients of the same group) for the school and social groups is lkj(1); this does not apply to the student group because it has only intercepts (Lewandowski, Kurowicka, and Joe 2001).
The class = b corresponds to the prior distribution for the coefficients at the population level. As our variables are standardized, we will give those coefficients the distribution \(\mathcal{N}(0,2)\).
jsp_priors[1, 1] <- prior(normal(0, 2))
The priors for class sd with a non-blank group column correspond to the standard deviations of the predictors in each group, such as \(\sigma_{\alpha_0}\) and \(\sigma_{\alpha_1}\) in (??). For these, we are going to use an Exponential(1) distribution.
jsp_priors[9, 1] <- prior(exponential(1))
The class = sigma corresponds to the population-level standard deviation \(\sigma_{\epsilon}\). As our outcome variable is scaled and its variance is 1, we will use a Cauchy(0,1) distribution.
jsp_priors[20, 1] <- prior(cauchy(0, 1))
As a standard deviation can only take positive values, when we assign it a distribution that admits negative values, such as a Normal or Cauchy, brm internally adjusts it to use only the positive side of the distribution.
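The effect of this truncation can be sketched by sampling the absolute value of a Cauchy(0,1) draw, which follows the corresponding half-Cauchy distribution (this is a conceptual illustration, not how brm implements the truncation internally):

```r
# A Cauchy(0, 1) prior on a standard deviation acts as a half-Cauchy:
# only the positive side of the distribution is used.
set.seed(1)
half_cauchy <- abs(rcauchy(1e4, location = 0, scale = 1))
all(half_cauchy >= 0)  # TRUE: support is [0, Inf)
median(half_cauchy)    # close to 1, the theoretical half-Cauchy(0,1) median
```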
We can check the goodness of our prior distributions by directly calculating their predictions. We sample parameters from the prior distribution and introduce them into the model formula, along with the predictors in the data, to obtain \(y_{\textrm{rep}}\) from the sampling distribution. This procedure is analogous to the Posterior Predictive Checking we did earlier. But instead of using the distribution of the parameters conditioned to the data, we use their prior distribution.
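The procedure can be hand-rolled for a toy intercept-only model to make it concrete. The priors below (beta0 ~ Normal(0, 2), sigma ~ Exponential(1)) are illustrative assumptions matching the classes we set above, not the full JSP model:

```r
# Hand-rolled prior predictive draws for a toy intercept-only model.
set.seed(2)
n_draws <- 100; n_obs <- 50
y_rep <- replicate(n_draws, {
  beta0 <- rnorm(1, 0, 2)     # draw parameters from the prior...
  sigma <- rexp(1, 1)
  rnorm(n_obs, beta0, sigma)  # ...then data from the sampling distribution
})
dim(y_rep)  # one column of simulated data per prior draw
```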
We can do this easily with brm() using the parameter sample_prior = "only"; then we graph the distribution of the predicted values \(y_{rep}\) and compare it with that of the data using pp_check().
jsp_m1_prior <- brm(
  formula,
  data = jsp_sc,
  seed = 123,
  prior = jsp_priors,
  sample_prior = "only",
  iter = 1200, warmup = 200, refresh = 0
)
pp_check(jsp_m1_prior, nsamples = 70) +
  theme(legend.position = c(0.8, 0.5)) +
  coord_cartesian(xlim = c(-20, 20))
Figure 1.4: Prior Predictive Checking. The predictions of our prior distributions look good, with greater dispersion than the observed data.
The predictions from our prior distribution look good. They cover a range of possibilities reasonably greater than the data.
Having confirmed our prior distribution is reasonable, we fit the model to the data.
jsp_m1 <- brm(
  formula,
  data = jsp_sc,
  seed = 123,
  prior = jsp_priors,
  iter = 2500, warmup = 1500,
  control = list(adapt_delta = 0.99)
)
jsp_m1 <- add_criterion(jsp_m1, "loo")
summary(jsp_m1)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: score_5_z ~ 0 + gender + score_3_z + raven_z + (1 | student) +
(1 + score_3_z + raven_z | school) + (1 + score_3_z + raven_z | social)
Data: jsp_sc (Number of observations: 1774)
Samples: 4 chains, each with iter = 2500; warmup = 1500; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~school (Number of levels: 48)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.24 0.03 0.19 0.31 1.00 1187 1818
sd(score_3_z) 0.11 0.03 0.06 0.16 1.00 1363 1872
sd(raven_z) 0.03 0.02 0.00 0.09 1.00 1008 1684
cor(Intercept,score_3_z) -0.75 0.15 -0.97 -0.39 1.00 1386 1461
cor(Intercept,raven_z) 0.16 0.44 -0.75 0.89 1.00 3238 2462
cor(score_3_z,raven_z) -0.27 0.46 -0.93 0.77 1.00 2636 2592
~social (Number of levels: 9)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.08 0.04 0.01 0.18 1.01 1029 1178
sd(score_3_z) 0.03 0.02 0.00 0.08 1.00 2140 2145
sd(raven_z) 0.06 0.04 0.00 0.16 1.00 1088 1310
cor(Intercept,score_3_z) 0.01 0.50 -0.87 0.88 1.00 4214 2632
cor(Intercept,raven_z) -0.07 0.46 -0.87 0.82 1.00 2442 2214
cor(score_3_z,raven_z) -0.15 0.51 -0.93 0.83 1.00 1861 2851
~student (Number of levels: 887)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.28 0.03 0.23 0.33 1.00 754 1185
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
genderm -0.01 0.05 -0.12 0.10 1.00 966 1413
genderf 0.04 0.05 -0.07 0.14 1.00 985 1475
score_3_z 0.59 0.03 0.53 0.65 1.00 1831 2716
raven_z 0.19 0.03 0.12 0.25 1.00 2243 2045
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.54 0.01 0.52 0.57 1.00 928 1842
The variance of raven_z between schools (0.03) and of score_3_z between social classes (0.03) seems too small. Let’s try without those predictors.
jsp_m2 <- update(
  jsp_m1,
  formula = score_5_z ~ 0 + gender + score_3_z + raven_z +
    (1 | student) +
    (1 + score_3_z | school) + (1 + raven_z | social)
)
loo(jsp_m2, jsp_m1)
Model comparisons:
elpd_diff se_diff
jsp_m2 0.0 0.0
jsp_m1 0.0 1.1
The difference between the previous model jsp_m1 and this last one jsp_m2 is not significant, so we will keep the simpler model, jsp_m2 (Occam’s Razor).
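The decision rule behind this conclusion is simple arithmetic on the loo output: a difference in elpd is only considered meaningful when it is clearly larger than its standard error (a common rule of thumb is about two standard errors).

```r
# Judge the elpd difference against its standard error
# (values taken from the loo comparison above).
elpd_diff <- 0.0
se_diff   <- 1.1
abs(elpd_diff) > 2 * se_diff  # FALSE -> the models are indistinguishable
```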
Our final model, including the prior distributions, is: \[\begin{align*} \text{--- At population level:} \\ \text{score\_5\_z} &= \beta_0^{(gender)} + \beta_1 \, \text{raven\_z} + \beta_2 \, \text{score\_3\_z} + \pmb{\alpha}^{(school)} + \pmb{\gamma}^{(social)} + \delta_0^{(student)} + \epsilon \\ \text{priors:} \\ \delta_0^{(student)} &\sim \mathcal{N}(0,\sigma_{student}^2) \\ \sigma_{student} &\sim \mathrm{Exponential}(1) \\ \epsilon &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \\ \sigma_{\epsilon} &\sim \mathrm{HalfCauchy}(0, 1) \\[4pt] \text{--- At school level:} \\ \pmb{\alpha}^{(school)} &= \alpha_0^{(school)} + \alpha_1^{(school)} \, \text{score\_3\_z} \\ \begin{pmatrix} \alpha_0^{(school)} \\[5pt] \alpha_1^{(school)} \end{pmatrix} &\sim \mathcal{N} \begin{pmatrix} \begin{pmatrix} 0 \\ 0 \end{pmatrix} , \Sigma_{school} \end{pmatrix} \\ \text{priors:} \\ \Sigma_{school} &= \operatorname{diag}(\pmb{\tau}_\alpha) \, \pmb{\Omega}_\alpha \, \operatorname{diag}(\pmb{\tau}_\alpha) \\ \pmb{\tau}_\alpha^{(j)} &\sim \mathrm{Exponential}(1), \; j = 1, 2 \\ \pmb{\Omega}_\alpha &\sim \mathrm{LKJ}(1) \\[4pt] \text{--- At social class level:} \\ \pmb{\gamma}^{(social)} &= \gamma_0^{(social)} + \gamma_1^{(social)} \, \text{raven\_z} \\ \begin{pmatrix} \gamma_0^{(social)} \\[5pt] \gamma_1^{(social)} \end{pmatrix} &\sim \mathcal{N} \begin{pmatrix} \begin{pmatrix} 0 \\ 0 \end{pmatrix} , \Sigma_{social} \end{pmatrix} \\ \text{priors:} \\ \Sigma_{social} &= \operatorname{diag}(\pmb{\tau}_\gamma) \, \pmb{\Omega}_\gamma \, \operatorname{diag}(\pmb{\tau}_\gamma) \\ \pmb{\tau}_\gamma^{(j)} &\sim \mathrm{Exponential}(1), \; j = 1, 2 \\ \pmb{\Omega}_\gamma &\sim \mathrm{LKJ}(1) \end{align*}\]
jsp_m2
loo(jsp_m2)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: score_5_z ~ gender + score_3_z + raven_z + (1 | student) +
(1 + score_3_z | school) + (1 + raven_z | social) - 1
Data: jsp_sc (Number of observations: 1774)
Samples: 4 chains, each with iter = 2500; warmup = 1500; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~school (Number of levels: 48)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.24 0.03 0.18 0.31 1.00 1021 2009
sd(score_3_z) 0.10 0.02 0.06 0.15 1.00 1400 2265
cor(Intercept,score_3_z) -0.80 0.15 -0.99 -0.43 1.00 1050 1347
~social (Number of levels: 9)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.08 0.05 0.02 0.19 1.00 1074 1258
sd(raven_z) 0.05 0.04 0.00 0.15 1.00 724 1163
cor(Intercept,raven_z) -0.05 0.52 -0.93 0.88 1.00 1699 2463
~student (Number of levels: 887)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.28 0.02 0.23 0.33 1.00 761 1499
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
genderm -0.01 0.06 -0.12 0.10 1.00 1046 1580
genderf 0.04 0.06 -0.08 0.15 1.00 1012 1386
score_3_z 0.59 0.02 0.54 0.64 1.00 1968 2668
raven_z 0.18 0.03 0.12 0.25 1.00 2071 1859
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.54 0.01 0.52 0.57 1.00 1113 2499
Computed from 4000 by 1774 log-likelihood matrix
Estimate SE
elpd_loo -1631.6 32.9
p_loo 328.2 11.0
looic 3263.2 65.7
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 1710 96.4% 303
(0.5, 0.7] (ok) 58 3.3% 119
(0.7, 1] (bad) 6 0.3% 54
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
Our final model explains \(1-0.54^2/1.00^2 = 71\%\) of the outcome variable’s variance. Also, the \(loo\) has improved very much, from elpd_loo = -2518.7 in our reference model to -1631.6 in this final one.
A caveat we must attach to every interpretation we make of the model parameters is: (*) “… in a regression modeling scores from the 3rd to the 5th school year, discounting the effect of school, gender, student, the student’s father’s social class, and the student’s abstract reasoning ability as measured by the Raven test.”
The population-level intercept is separated by gender (genderm: male, genderf: female) and is very small, although with a slight advantage of girls over boys. We will analyze this in detail later.
At the school level, there is a negative correlation of \(-0.80\) between the intercept and the slope of score_3_z; that is, the higher the intercept, the lower the slope, and vice versa. We analyze and interpret this later.
Let’s take a look at a couple of graphical checks to get a sense of the goodness of fit.
Figure 1.5: JSP: Final Model Check. (Left) Posterior Predictive Check of the model, which looks good. (Right) Residual check \(\epsilon\) vs. Fitted Values (\(\hat{y}\) predicted), also looks good, there are no patterns neither in the distribution of the points nor their variability. The diagonal patterns are artifacts caused by the discreteness of the data.
The model looks good in graphic 1.5. On the left side, we see the distribution of the model predictions supporting the observed data; even though there are segments where \(y\) is at the margin of the predictions, it does not stray too far. On the right side, there is no pattern in the residuals or their variability. The diagonal patterns are artifacts caused by the discreteness of the data.
Since the scores comprise the math and english subjects, we can draw a line for each student, joining those subjects in a plot of 5th- vs. 3rd-year scores.
Figure 1.6: JSP: Score Prediction at selected schools. Only the eight schools with the highest and lowest intercepts are included. Line segments link each student’s score predictions for English and math. These lines converge in the upper right corner of the graph, showing that the most outstanding students obtain high scores largely regardless of the school in which they study, as opposed to the most lagging ones (close to the lower-left corner), whose scores differ more depending on the school.
In the graph “JSP: Score Prediction at selected schools” (Figure 1.6), each line corresponds to a student, joining his/her scores in English and Mathematics in the third and fifth year. We show the eight schools with the highest or lowest intercepts. We can see these lines converge roughly in the upper right corner of the graph. This convergence reflects the negative correlation, estimated at \(\text{cor(Intercept, \, score\_3\_z)} = -0.80\), between the intercept and slope per school, whose interpretation is:
The more outstanding a student is, the less his/her score depends on the school. Conversely, the more lagging a student is, the more his/her scores vary with the school. Consequently, when the school effect is negative, the most affected are the least skilled students.
Having an approximation to the joint distribution of the model parameters, we can test hypotheses that would otherwise, with classical point estimates, be impossible to test with precision. By hypothesis, we mean a proposition about one or more model parameters, such as “\(\beta_0>0\)” or “\(\alpha_0^{(2)}> \alpha_0^{(3)}\).” Here are some examples.
Graph 1.7-A shows the distribution of two schools’ intercepts (\(\alpha_0^{(6)}\) and \(\alpha_0^{(29)}\)); at the base of each distribution, both its \(95\%\) credible interval and its mean are shown in blue. Since the distributions overlap so much, at first glance it seems counterintuitive to say that we have \(99\%\) certainty that the intercept of school s6 (top) is greater than that of school s29 (bottom). In 1.7-B, the red dashed line is the y=x line. Seeing that most points are above this line, it is now clear that the s6 intercept is greater than that of s29. When using classical pointwise regressions, the blue trace in graph A is all the information we have; with so much overlap, it is impossible to conclude that one school’s intercept is actually greater than the other’s.
Figure 1.7: JSP: Comparison of the Intercepts of Schools s29 and s6. In (A), the blue point and thick line are the estimated intercept value and range for the two schools, which is all you have in a classical pointwise regression. At first glance it seems counterintuitive, but the school s6 intercept is higher than that of school s29 with 99% certainty. In (B), you can see the marginal joint distribution of both intercepts, where the red dashed line is the y=x line. Now it is clear that, with high certainty, one school’s intercept is greater than the other’s.
In Bayesian inference, we compare two parameters through their values in the posterior sample, and we estimate the certainty by the proportion of the findings. For example, among the \(4000\) sample draws, in \(3958\) of them the school s6 intercept is greater than that of s29, so the estimated probability is \(3958/4000 = 98.95\%\). This way, we evaluate the expected value of an indicator function \(g(\theta)\), which is \(1\) when the predicate on the parameter value is true and zero otherwise. The predicate states the hypothesis we want to test.
\[
\mathop{{}\mathbb{E}}(g(\theta)) = \int g(\theta) \cdot \pi(\theta \mid y) \, \textrm{d}\theta \approx \frac{1}{N} \sum_i g(\theta_i)
\]
For our example \(g(\theta)=I(\theta^{(6)} > \theta^{(29)})\).
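The Monte Carlo version of this expectation takes only a couple of lines. The draws below are simulated stand-ins for the posterior (the means and spreads are illustrative assumptions, not the JSP posterior; with brms you would extract the real draws from the fit, e.g., via posterior_samples() or as_draws_df()):

```r
# Monte Carlo estimate of P(theta6 > theta29) from posterior-style draws.
set.seed(3)
theta6  <- rnorm(4000,  0.20, 0.12)  # illustrative draws, not the JSP posterior
theta29 <- rnorm(4000, -0.15, 0.12)

g <- theta6 > theta29  # indicator g(theta), one value per draw
mean(g)                # proportion of draws where the predicate holds
```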
With brms, you can test hypotheses with the hypothesis() function. In our previous example, the hypothesis is:
hypothesis(jsp_m2,
c("s6>s29" = "r_school[6,Intercept] > r_school[29,Intercept]"),
class = NULL
)
Hypothesis Tests for class :
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 s6>s29 0.35 0.16 0.1 0.62 94.24 0.99 *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.
In the hypothesis() function output, the Estimate column shows the estimated value of the difference between the schools’ intercepts (\(s6-s29\)). The Est.Error value is the standard error of the Estimate. The Evid.Ratio column is the ratio of the evidence in favor of the hypothesis to that against it: \(Evid.Ratio = 0.9895 / (1-0.9895) = 94.24\). Finally, Post.Prob is the proportion of the samples in favor of the hypothesis.
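This arithmetic can be reproduced directly from the counts reported earlier (3958 favorable draws out of 4000):

```r
# Reproduce the Post.Prob and Evid.Ratio arithmetic from the draw counts.
n_favor <- 3958
n_total <- 4000
post_prob  <- n_favor / n_total          # proportion of favorable draws
evid_ratio <- post_prob / (1 - post_prob)  # evidence for vs. against
round(c(Post.Prob = post_prob, Evid.Ratio = evid_ratio), 2)
```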
Figure 1.8: JSP: Difference of the Intercepts of Schools 6 and 29. In about 99% of the samples, the intercept of school 6 is greater than that of school 29. The dotted line indicates the mean difference. The annotations in red and blue are the percentages of samples with the difference below and above zero, respectively. SD is the standard deviation of the difference.
To better understand the results of the hypothesis() function, plot 1.8 shows the distribution of the difference of the schools’ intercepts (s6 minus s29) together with some statistics about it. As expected, \(98.95\%\) of the differences are positive, equal to the Post.Prob column in the result of hypothesis(); the standard deviation (SD in the graph) equals the Est.Error value, while the average difference is \(0.35\), as reported in the Estimate column.
In the jsp_m2 fitting summary, we see in the gender intercepts that the girls, genderf, have a very slight advantage (about \(0.04 \, \sigma_y\), the estimated genderf − genderm difference) with respect to the boys, genderm. But there is no great certainty that either of these two intercepts differs from zero, as their credible intervals include it. What can we say about the difference between these two intercepts? Perhaps the variance of the difference is so great that it precludes us from saying anything about it with reasonable certainty? Let’s solve this with the help of the hypothesis() function.
hypothesis(jsp_m2, "genderf - genderm > 0", class = "b")
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 (genderf-genderm) > 0 0.04 0.03 -0.01 0.09 10.33 0.91
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%; for two-sided hypotheses, the value tested against lies outside the 95%-CI. Posterior probabilities of point hypotheses assume equal prior probabilities.
We find that in \(91\%\) of the samples, the intercept difference favors the girls.
The 3rd-grade average girl student has a slightly better 5th-grade score than the average boy student. The data shows this with an estimated certainty of \(91\%\), and the estimated advantage of girls over boys is about \(0.04 \, \sigma_y\).
We can test as a hypothesis the sign of the intercept in each social class, that is, test whether the parents' socioeconomic class has a positive or negative effect on the student's learning ability.
hypothesis(jsp_m2, "Intercept > 0", scope = "ranef", group = "social")
Hypothesis Tests for class :
Group Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 I (Intercept) > 0 0.05 0.07 -0.05 0.18 3.02 0.75
2 II (Intercept) > 0 0.06 0.06 -0.02 0.15 7.08 0.88
3 III nonmanual (Intercept) > 0 0.03 0.05 -0.06 0.12 2.35 0.70
4 III manual (Intercept) > 0 -0.04 0.05 -0.11 0.03 0.21 0.18
5 IV (Intercept) > 0 0.00 0.06 -0.09 0.08 0.88 0.47
6 V (Intercept) > 0 -0.07 0.07 -0.19 0.02 0.15 0.13
7 lterm unemp (Intercept) > 0 -0.06 0.07 -0.18 0.03 0.22 0.18
8 curr unemp (Intercept) > 0 -0.02 0.07 -0.15 0.09 0.65 0.39
9 father absent (Intercept) > 0 0.05 0.05 -0.02 0.13 5.78 0.85
Let's remember that in our model these intercepts are deviations around the population-level mean. They are centered at zero, with positive values indicating intercepts above the mean and vice versa. That is why it makes sense to evaluate the certainty we have that these intercepts differ from zero.
When Evid.Ratio is less than 1, it indicates we have more evidence that the effect of the social class is negative than positive. Thus, in the case of row \(6\), Evid.Ratio=0.15, so we have \(1/0.15 \approx 6.7\) times more evidence that the effect is negative than positive. Correspondingly, the probability that this class has a negative impact is \(1-Post.Prob = 1-0.13 = 87\%\).
The influence of the father's socioeconomic class on student performance is small, with an average magnitude of \(0.04 \,\sigma_y\) up to an estimated maximum of \(0.07 \,\sigma_y\). We are more confident that this influence is favorable for students whose parents are in the first two social classes or in the father absent category, and that it is negative when the parents are in the last category V or long-term unemployed (lterm unemp). However, the certainty we have of these effects is in the order of \([75\% - 88\%]\).
Surprisingly, the father absent category has a positive effect on student learning, with an advantage of \(0.05-(-0.07) = 0.12 \,\sigma_y\) over the impact of a parent in the last category V.
Let us recall that with the ranef() function we can obtain the predictors' coefficients within the groups, such as the intercept and the score_3_z slope by school.
ranef(jsp_m2)$school %>% print(digits = 3)
, , Intercept
Estimate Est.Error Q2.5 Q97.5
1 -0.32204 0.0876 -0.4935 -0.1511
2 -0.15156 0.1226 -0.3919 0.0893
3 0.09473 0.1445 -0.1789 0.3798
4 0.02927 0.0893 -0.1440 0.2095
5 -0.06213 0.1028 -0.2666 0.1406
:
, , score_3_z
Estimate Est.Error Q2.5 Q97.5
1 0.12172 0.0542 0.02355 0.235792
2 0.04731 0.0635 -0.07774 0.177896
3 -0.03464 0.0749 -0.18919 0.111111
4 -0.00567 0.0508 -0.10957 0.094687
5 0.02995 0.0575 -0.08256 0.149794
:
The intercept in our model is composed of terms at the population level (by gender) and of the intercept in each of the three groups we have. If we call that total intercept \(\hat{\beta_0}\), we have: \[\begin{equation} \hat{\beta_0} = \beta_0^{(gender)} + \alpha_0^{(school)} + \gamma_0^{(social)} + \delta_0^{(student)} \end{equation}\]
Since this total intercept is the expected 5th-year score for the average third-year student, we can classify schools according to their effect on this average student. However, to make a fair comparison, we must consider the school's impact on the same typical student. We achieve this by taking the expected effect on the student: \[\begin{equation} \mathop{\mathbb{E}}(\hat{\beta_0}) = \mathop{\mathbb{E}}(\beta_0^{(gender)}) + \alpha_0^{(school)} + \mathop{\mathbb{E}}(\gamma_0^{(social)}) + \mathop{\mathbb{E}}(\delta_0^{(student)}) \tag{1.1} \end{equation}\]
where the expectation \(\mathbb{E}()\) is taken over the students' characteristics, so all of these terms are constant across students. The only term that depends on the school is \(\alpha_0^{(school)}\). We shall use this term to classify the schools by their effect on the average student.
Figure 1.9: Classification of Schools Effect on Average Pupil. Each point represents the estimated intercept, and the vertical bars its credible intervals at 50% and 95%. The relative effect between the best and worst school is enormous: about \(1 \,\sigma_y\), one standard deviation of the student's 5th-year score.
In the graph Classification of Schools Effect on Average Pupil 1.9, the schools are classified by their impact on the average third-year student; the credible intervals are at 50% and 95%. The relative effect between the best and the worst school is enormous: approximately one standard deviation \(1 \,\sigma_y\) in the student's 5th-year score. This agrees with the standard deviation of these schools' intercepts \((0.24)\).
If we consider that 52.1% of the students in the data are girls, we can estimate \(\mathop{\mathbb{E}}_{student}(\beta_0^{(gender)})\) in (1.1), with the sum of the intercepts by gender weighted by the frequency: \[\begin{align*} \beta_0 &= \begin{cases} -0.00991 & \text{if gender}= \text{m (47.9\%)} \\ \phantom{-}0.03656 & \text{if gender}= \text{f (52.1\%)} \end{cases} \\ \implies \mathop{\mathbb{E}}_{student}(\beta_0) &= 47.9\% \times -0.00991 + 52.1\% \times +0.03656 = 0.01429 \end{align*}\]
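This weighted sum is easy to reproduce in R (the intercepts are taken from the jsp_m2 fitting summary):

```r
# Gender intercepts from the fitting summary and the gender frequencies
b_gender <- c(m = -0.00991, f = 0.03656)
freq     <- c(m = 0.479,    f = 0.521)

# Expected gender intercept for a student drawn at random from the data
e_b0 <- sum(freq * b_gender)
round(e_b0, 4)  # 0.0143
```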
In the same way, we estimate the expected intercept by the father's socioeconomic class social, \(\mathop{\mathbb{E}}_{student}(\gamma_0^{(social)})\), at \(-0.00424\), and the expected slope of raven, \(\mathop{\mathbb{E}}_{student}(\gamma_1^{(social)})\), at \(0.00418\). Using these values, the regressions for the schools with the highest and lowest intercepts (best and worst schools) are:
\[\begin{align*}
\text{Best school:}
\\
\text{score\_5\_z} &= 0.5563 + 0.1874 \, \text{raven\_z} + 0.4041 \, \text{score\_3\_z}
\\
\text{Worst School:}
\\
\text{score\_5\_z} &= -0.4463 + 0.1874 \, \text{raven\_z} + 0.7315 \, \text{score\_3\_z}
\end{align*}\]
The average student in the best school has a score of \(0.5563 - (-0.4463) = 1.0 \,\sigma_y\) higher than one in the worst. For the worst school students to catch up with the ones in the best school, it is required that: \[\begin{align*} -0.4463 + 0.1874 \, \text{raven\_z} + 0.7315 \, \text{score\_3\_z} &= 0.5563\\ 0.1874 \, \text{raven\_z} + 0.7315 \, \text{score\_3\_z} &= 1.0 \end{align*}\]
where one of the infinitely many solutions is \(\text{raven\_z} = 1.1\) and \(\text{score\_3\_z} = 1.1\), since \(0.1874 \times 1.1 + 0.7315 \times 1.1 = 1.0 \, \sigma_y\), which allows us to state:
The school has a significant impact on student learning. The average student in the school with the most positive influence (best school) has an estimated advantage of \(1\) standard deviation in the 5th-year score over the same average student in the worst school.
For a student in the worst school to catch up with the average student in the best one, he or she must have scored \(1.1\) standard deviations above the population average both in the 3rd-year test and in abstract reasoning (raven).
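We can verify the required \(1.1\) value with a few lines of R, using the coefficients of the two regressions above and assuming the student improves both predictors by the same amount x:

```r
# Intercepts of the best and worst schools, and their common raven_z slope
best_intercept  <-  0.5563
worst_intercept <- -0.4463
b_raven <- 0.1874
b_score <- 0.7315  # score_3_z slope in the worst school

# With raven_z = score_3_z = x, the gap closes when (b_raven + b_score) * x
# equals the difference between the two intercepts
x <- (best_intercept - worst_intercept) / (b_raven + b_score)
round(x, 1)  # 1.1
```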
To classify the schools, we will cross-compare each school's intercept against all the others (i.e., \(48 \times 47/2 = 1128\) comparisons) using the hypothesis test we saw earlier. In each comparison, we will conclude that one school has a greater positive effect on the student than another when the certainty is greater than \(95\%\). Remember that classifying schools by their intercepts is equivalent to classifying them by their effect on the 3rd-year average student.
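As a sketch of how those \(1128\) pairwise hypotheses can be generated programmatically (for illustration we label the schools s1 … s48; in practice the labels come from the fitted model, e.g. from rownames(ranef(jsp_m2)$school)):

```r
# All pairs of the 48 schools: choose(48, 2) = 1128 comparisons
schools <- paste0("s", 1:48)  # illustrative labels, not the real levels
pairs   <- combn(schools, 2)

# One one-sided hypothesis string per pair, as in the s6 - s29 test above
hyps <- apply(pairs, 2, function(p) paste(p[1], "-", p[2], "> 0"))

length(hyps)  # 1128
head(hyps, 1) # "s1 - s2 > 0"

# Each string would then be tested with hypothesis(jsp_m2, ...), and a pair
# is flagged (green or red dot) when Post.Prob > 0.95 in either direction
```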
In the graph JSP: Comparison of each school against all the others 1.10, each green point flags that the school on the horizontal axis has a greater positive effect on the average student than the school on the vertical axis, with certainty greater than \(95\%\). For example, school 27 has a greater effect than schools 29 and 1. The red dots indicate the opposite: school 7 has less impact on the student than school 31.
Figure 1.10: JSP: Comparing each school against all others. Each green dot indicates the school in the row has a greater effect (95% or more) than the school in the column. Each red dot indicates the opposite, also at 95%.
The middle zone without points contains the combinations of schools for which there is no certainty (at 95%) about which one is better than the other; a larger volume of data would be required to resolve these cases. Even so, we have been able to discriminate \(461\) out of \(1128\) comparisons, that is, \(41\%\) of them. If we had only the intervals shown in the graph JSP: intercepts by school 1.9, as in classic point-estimate regressions, we would have been able to discriminate the difference in effect between far fewer schools.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.