1 The Junior School Project

We shall analyze the dataset from Peter Mortimer's study "A Study of Effective Junior Schools" (Mortimer, Ecob, and Sammons 1989). This dataset, known as the "Junior School Project" (JSP), is available as jsp in the R package faraway.

The JSP was a longitudinal study of about 2,000 pupils from 50 primary schools chosen at random among the 636 schools under the Inner London Education Authority (ILEA) in 1980. Here we will use the subset available from that dataset, which contains the variables shown in the str(jsp) listing below.

1.1 Skew Correction

The Maths and Raven scores have negatively skewed (long left tail) distributions due to a ceiling effect: the scores of the most proficient students vary very little relative to their measured ability. Also, the score range in English differs from that of the other tests. To correct these problems, we transformed the math, english, and raven variables using the R package bestNormalize (Peterson and Cavanaugh 2019), which evaluates several possible monotonic transformations (e.g., Yeo-Johnson, Lambert) and applies the most effective one for the given case. For our data, the selected method was Ordered Quantile normalization for all the scores. We applied the correction to each test score separately by school year.


Figure 1.1: JSP Correction for Bias in Scores. (Left) Histogram with the original scores in math, and (Right) after normalization.

This normalization is not as arbitrary as it may seem, since it is not an uncommon practice in educational settings. The ceiling effect occurs when a test does not discriminate well enough among the most outstanding students: from a certain level of competence onwards, they all receive very similar scores. Ideally, one would add more difficult questions to the test to better discriminate among those skilled students. However, in ex post facto cases like this, the normalization redistributes the scores to correct the problem.

The transformed variables have also been standardized (centered and scaled) to have mean zero and unit standard deviation. These transformed variables have names ending in _z. The "JSP Correction for Bias in Scores" graph (Figure 1.1) shows the distribution of the original maths scores and of the scores after normalization and standardization.
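A minimal sketch of this correction, assuming the raw math, english, and raven columns as in the faraway package (not the author's exact code; the package's student identifier is named id there):

# Sketch: per-year Ordered Quantile normalization plus standardization.
# bestNormalize() evaluates several monotonic transformations and keeps
# the most effective one; $x.t holds the transformed values.
library(dplyr)
library(bestNormalize)

jsp <- faraway::jsp %>%
  group_by(year) %>%
  mutate(across(c(math, english, raven),
                ~ as.numeric(scale(bestNormalize(.x)$x.t)),
                .names = "{.col}_z")) %>%
  ungroup()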

str(jsp)

'data.frame':   3236 obs. of  9 variables:
 $ school   : Factor w/ 49 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ class    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ gender   : Factor w/ 2 levels "m","f": 2 2 2 1 1 1 1 1 1 1 ...
 $ social   : Factor w/ 9 levels "I","II","III nonmanual",..: 9 9 9 2 2 2 2 2 9 9 ...
 $ student  : Factor w/ 1192 levels "1","2","3","4",..: 1 1 1 2 2 3 3 3 4 4 ...
 $ year     : Factor w/ 3 levels "3","4","5": 1 2 3 1 2 1 2 3 1 2 ...
 $ raven_z  : num  -0.403 -0.403 -0.403 -1.588 -1.588 ...
 $ math_z   : num  -0.351 -0.256 -1.146 -1.48 -1.626 ...
 $ english_z: num  0.7718 0.4376 -0.0204 -1.9804 -1.9337 ...

1.2 Effect of the School on the Student’s Academic Progress

In the "Examples of Math Scores" graph (Figure 1.2), we see, for some selected schools, a comparison between mathematics scores in the fifth year and those in the third. The lines correspond to classical linear regressions fitted on the students in each school. They are intended to show the score pattern in each school.


Figure 1.2: Examples of Math Scores. Scores at selected schools. The lines correspond to classical linear regressions fitted on the students in each school. They are intended to show the score pattern in each school.

In some schools, the regression line is almost parallel to and completely above that of another school (e.g., 36:light blue over 3:orange), which corresponds to the fact that most students in the school above got better scores in 5th grade than those in the school below. Consequently, the top school was more effective in promoting student learning than the bottom one.

When the trend lines of two schools are not parallel, the one higher on the left side corresponds to the school where the most lagging students made better progress from 3rd to 5th grade, while the one higher on the right side is the school where the most proficient students made relatively more progress. Consequently, lines with a shallow slope correspond to schools where the gap between lagging and skilled students is small.

Having english and math scores for both the 3rd and 5th years, we could analyze the school's effect on performance by subject, but we would end up with two classifications of schools, one for english and another for math. Even though we might find a way to consolidate those results into a single ranking, this seems like an overly complicated route.

Merging Subjects

We have chosen to merge the english and math scores and treat them simply as third-year (score_3_z) and fifth-year (score_5_z) scores. Joining the english and math scores this way lets us demonstrate the use of Multilevel Regressions (MLR) without an overly complicated analysis. Following the same criterion, we have kept only the students who stayed in the same school from the 3rd to the 5th year.
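One plausible way to build jsp_sc (a hedged sketch, not the author's exact code): english and math are stacked as two observations per student, giving 887 students × 2 subjects = 1,774 rows, matching the listing below.

# Stack english and math as two observations per student, keep students
# observed in years 3 and 5 at the same school, and spread the scores
# into score_3_z / score_5_z (class and raven_z handling simplified).
library(dplyr)
library(tidyr)

jsp_sc <- jsp %>%
  filter(year %in% c("3", "5")) %>%
  pivot_longer(c(english_z, math_z),
               names_to = "subject", values_to = "score") %>%
  group_by(student, subject) %>%
  filter(n() == 2, n_distinct(school) == 1) %>%
  summarise(school = first(school), gender = first(gender),
            social = first(social), raven_z = first(raven_z),
            score_3_z = score[year == "3"],
            score_5_z = score[year == "5"], .groups = "drop")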

str(jsp_sc, give.attr = F)
tibble [1,774 × 8] (S3: tbl_df/tbl/data.frame)
 $ school   : Factor w/ 49 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ class    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ gender   : Factor w/ 2 levels "m","f": 2 1 1 2 1 2 2 2 2 1 ...
 $ social   : Factor w/ 9 levels "I","II","III nonmanual",..: 9 2 9 4 9 9 4 2 7 5 ...
 $ student  : Factor w/ 1192 levels "1","2","3","4",..: 1 3 4 6 7 8 11 14 15 16 ...
 $ raven_z  : num [1:1774, 1] -0.445 -0.619 -1.868 -1.449 -1.303 ...
 $ score_3_z: num [1:1774, 1] -0.435 1.598 -0.297 -0.964 -0.571 ...
 $ score_5_z: num [1:1774, 1] -1.15896 1.8403 -0.00456 -2.1354 -0.83616 ...

The outcome variable \(y\) in our model is the 5th-year score score_5_z, and the potential predictor variables are score_3_z, raven_z, school, gender, and social. As we centered score_5_z, the outcome variable is the student's 5th-year score as a deviation from its mean. As usual, the intercept has the same unit as \(y\); in our case, due to the scaling, that unit is the standard deviation of the 5th-year scores. For short, we will write \(\sigma_y\) for the standard deviation of score_5_z.

The intercept corresponds to the case where the numeric predictors are zero, score_3_z = 0 and raven_z = 0. As these variables are centered at their means, the intercept is the expected 5th-year score for students with average numeric predictors. In other words, the intercept is the expected 5th-year score for an average 3rd-year student.

1.2.1 Causal Model

The primary goal of Mortimer's study is to find out whether some schools are more effective than others in promoting student learning and development. The improvement in scores from the third to the fifth year in the school is indicative of the student's progress. In causal-analysis terminology, the school is the exposure, and the student's 5th-year score, adjusted by the 3rd-year one, is the outcome.


Figure 1.3: JSP Causal Diagram. Factors that influence student performance.

The "JSP Causal Diagram" (Figure 1.3) shows our causal model, where we consider the school, the student's cognitive ability, and his/her other personal traits as drivers of his/her progress in the school. We use the Raven score as a proxy for this cognitive ability.

In several schools, the number of students' fathers in each socioeconomic classification is quite unbalanced towards the upper or lower classes. Hence, it seems appropriate to consider that the father's socioeconomic class influences the selection of the school. We also consider that the father's class influences the environment and the facilities the student has to develop and study, affecting his/her scores and cognitive skills.

Finally, we also consider that gender may influence the pupil's focus and predisposition to learn during the school years.

1.3 The Reference Model

To get an idea of how much our subsequent models improve, we shall build the simplest possible one, where the 5th-year scores are constant. \[\begin{align*} \text{score\_5\_z}^{(i)} &= \beta_0 + \epsilon^{(i)} \\ \epsilon^{(i)} &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \end{align*}\]

library(brms)  # Bayesian multilevel models via Stan

(jsp_null <- brm(score_5_z ~ 1, data = jsp_sc))
loo(jsp_null)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: score_5_z ~ 1 
   Data: jsp_sc (Number of observations: 1774) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
        total post-warmup samples = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.00      0.02    -0.05     0.05 1.00     3520     2341

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.00      0.02     0.97     1.03 1.00     3591     2725

Computed from 4000 by 1774 log-likelihood matrix
         Estimate   SE
elpd_loo  -2518.7 27.2
p_loo         1.9  0.1
looic      5037.4 54.4
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).

In this model the intercept \(\beta_0 = 0.00\) is the mean of score_5_z, which is zero due to the centering. Recall that \(\sigma_{\epsilon}\), shown as sigma in the listing above, is in this model the standard deviation of \(y\), and because of the scaling it is equal to \(1\).

1.4 With Intercept by School, Father's Class, and Student

To assess whether it is appropriate to group by school, by the father's socioeconomic classification social, and by student, we will fit a model with only an intercept in each of these groups. At the population level, we will try the predictors we expect to use in our final model.

jsp_m0 <- brm(
  score_5_z ~ gender + score_3_z + raven_z +
    (1 | school) + (1 | social) + (1 | student),
  data = jsp_sc, refresh = 0
)

There were 127 divergent transitions after warmup. 
The largest R-hat is 1.06, indicating chains have not mixed.
Effective Samples Size (ESS) is too low, indicating posterior means and medians 
and tail quantiles may be unreliable.

The messages in the fitting output, like the divergences after warmup, come from Stan, and they warn us that it had difficulty sampling the posterior distribution. In this case, we need more iterations, iter = 3000, and more warmup, warmup = 2000 (by default, they are 2000 and 1000 respectively). We also need higher values for the parameters adapt_delta = 0.99 and max_treedepth = 12 (by default, they are \(0.8\) and \(10\) respectively). The links included in Stan's warning messages point to documentation about these warnings.

jsp_m0 <- brm(
  score_5_z ~ gender + score_3_z + raven_z +
    (1 | school) + (1 | social) + (1 | student),
  data = jsp_sc,
  seed = 123,
  iter = 3000, warmup = 2000,
  control = list(adapt_delta = 0.99, max_treedepth = 12)
)
jsp_m0

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: score_5_z ~ gender + score_3_z + raven_z + (1 | school) + (1 | social) + (1 | student) 
   Data: jsp_sc (Number of observations: 1774) 
Samples: 4 chains, each with iter = 3000; warmup = 2000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~school (Number of levels: 48) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.24      0.03     0.18     0.31 1.00     1575     2560

~social (Number of levels: 9) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.08      0.04     0.01     0.19 1.00     1119     1323

~student (Number of levels: 887) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.29      0.02     0.24     0.34 1.00     1073     1418

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -0.02      0.06    -0.13     0.09 1.00     1794     2635
genderf       0.05      0.03    -0.02     0.11 1.00     5082     3561
score_3_z     0.59      0.02     0.55     0.63 1.00     3162     3419
raven_z       0.19      0.02     0.15     0.23 1.00     3504     3347

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.55      0.01     0.52     0.57 1.00     1429     1913

The intercepts at the school, social, and student levels are each group's mean effect on the expected 5th-year score, measured in units of \(\sigma_y\). The population-level intercept corresponds to the boys, while genderf is the girls' offset relative to them, so the average girl scores about \(0.05\,\sigma_y\) higher than the average boy.

The variance between schools is \(0.24^2\), between social categories \(0.08^2\), and between students \(0.29^2\), corresponding to \(ICC=0.24^2/(0.55^2+0.24^2+0.08^2+0.29^2)=12.8\%\), \(1.4\%\), and \(18.7\%\), respectively. The average number of observations per school is \(37\), per social category \(197\), and per student \(2\). The design effect index is then \(\textit{deff}=1+(37-1)\times 0.128=5.6>1.1\) for schools, \(3.74>1.1\) for social, and \(1.19>1.1\) for student, which means we should keep modeling the data with these groups (Lai 2015).
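These quantities are straightforward to reproduce from the posterior summaries (a sketch using the rounded estimates above; small differences from the in-text values are due to rounding):

# ICC and design effect (Lai 2015) from the group standard deviations.
sd_school <- 0.24; sd_social <- 0.08; sd_student <- 0.29; sigma <- 0.55
total_var <- sd_school^2 + sd_social^2 + sd_student^2 + sigma^2

icc   <- c(school = sd_school^2, social = sd_social^2,
           student = sd_student^2) / total_var
n_bar <- c(school = 37, social = 197, student = 2)  # mean obs. per group
deff  <- 1 + (n_bar - 1) * icc                      # deff > 1.1 => keep group
round(rbind(icc, deff), 3)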

1.5 Towards the Final Model

We will keep the school, social, and student groups, as required by the design effect index, but now we will add score_3_z and raven_z as predictors within the school and social groups. In other words, we think these variables may influence the student differently depending on the school and on the father's socioeconomic classification social.

formula <- score_5_z ~ 1 + gender + score_3_z + raven_z + (1 | student) +
  (1 + score_3_z + raven_z | school) + (1 + score_3_z + raven_z | social)

We have the categorical variable gender only at the population level. It is not a good idea to add a group with this variable: it has only two possible values, and it is technically impossible to estimate the variance between only two values with a reasonable level of certainty. On the other hand, by having both an intercept and gender at the population level, we obtain the global intercept but the intercept of only one of the genders (as Intercept and genderf in the model summary above). A model with a global intercept plus one intercept per gender is not identifiable; that is why we lose one gender intercept. To avoid that, we shall remove the global intercept and get an intercept for each gender.

formula <- score_5_z ~ 0 + gender + score_3_z + raven_z + (1 | student) +
  (1 + score_3_z + raven_z | school) + (1 + score_3_z + raven_z | social)

1.5.1 Setting Prior Distributions

This time, we will establish some prior distributions instead of using the default ones as we have been doing. To see the prior distributions of the model, we can use the get_prior() function.

(jsp_priors <- get_prior(formula, jsp_sc))

                  prior class      coef   group resp dpar nlpar bound
1                           b                                        
2                           b   genderf                              
3                           b   genderm                              
4                           b   raven_z                              
5                           b score_3_z                              
6                lkj(1)   cor                                        
7                         cor            school                      
8                         cor            social                      
9  student_t(3, 0, 2.5)    sd                                        
10                         sd            school                      
11                         sd Intercept  school                      
12                         sd   raven_z  school                      
13                         sd score_3_z  school                      
14                         sd            social                      
15                         sd Intercept  social                      
16                         sd   raven_z  social                      
17                         sd score_3_z  social                      
18                         sd           student                      
19                         sd Intercept student                      
20 student_t(3, 0, 2.5) sigma                                        

In the list above, we see that since we have omitted the global intercept, we will obtain the coefficients genderf and genderm (rows 2 and 3), corresponding to the intercept for each gender, as we wanted. Published guidance, such as the Stan wiki on prior choice recommendations, offers suggestions for prior distributions.

Parameters with a blank prior value get a flat default prior handled by Stan, while those that are not blank show the brms defaults. A prior distribution can be set for each parameter separately, or the same distribution can be set for a whole subset of them, as denoted by rows 1, 6, and 9. Thus, for example, the priors with class = b are left to Stan (they are blank), while those of classes class = cor, class = sd, and class = sigma have brms default distributions (rows 6, 9, and 20). More precisely, row 6 indicates that the prior with class = cor (the correlation between coefficients of the same group) for the school and social groups is lkj(1); this does not apply to the student group because it has only intercepts (Lewandowski, Kurowicka, and Joe 2009).

The class = b corresponds to the prior distribution of the population-level coefficients. As our variables are standardized, we will give those coefficients a \(\mathcal{N}(0, 2)\) distribution (mean 0, standard deviation 2).

jsp_priors[1, 1] <- prior(normal(0, 2))

The priors of class sd with a non-blank group column correspond to the standard deviations of the group-level coefficients, such as \(\sigma_{\alpha_0}\) and \(\sigma_{\alpha_1}\) (the components of \(\pmb{\tau}_\alpha\) in the final model of Section 1.7). For these, we are going to use an Exponential(1) distribution.

jsp_priors[9, 1] <- prior(exponential(1))

The class = sigma corresponds to the population-level standard deviation \(\sigma_{\epsilon}\). As our outcome variable is scaled to variance \(1\), we will use a Cauchy(0,1) distribution.

jsp_priors[20, 1] <- prior(cauchy(0, 1))

As a standard deviation can only take positive values, when we assign it a distribution with possible negative values, such as a Normal or Cauchy, brm internally truncates the distribution to its positive side.
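For reference, a sketch of an equivalent way to set these three priors in one step with set_prior(), matching each by class (the name jsp_priors_alt is ours):

# brms matches each prior by class and truncates the sd and sigma
# priors to their positive side automatically.
jsp_priors_alt <- c(
  set_prior("normal(0, 2)",   class = "b"),
  set_prior("exponential(1)", class = "sd"),
  set_prior("cauchy(0, 1)",   class = "sigma")
)
# jsp_priors_alt can then be passed as prior = jsp_priors_alt to brm().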

1.5.2 JSP | Prior Predictive Checking

We can check the adequacy of our prior distributions by directly computing their predictions. We sample parameters from the prior distribution and plug them into the model formula, along with the predictors in the data, to obtain \(y_{\textrm{rep}}\) from the sampling distribution. This procedure is analogous to the Posterior Predictive Checking we did earlier, but instead of using the distribution of the parameters conditioned on the data, we use their prior distribution.

We can do this easily with brm() using the argument sample_prior = "only"; we then plot the distribution of the predicted values \(y_{rep}\) and compare it with that of the data using pp_check().

jsp_m1_prior <- brm(
  formula,
  data = jsp_sc,
  seed = 123,
  prior = jsp_priors,
  sample_prior = "only",
  iter = 1200, warmup = 200, refresh = 0
)

pp_check(jsp_m1_prior, nsamples = 70) +
  theme(legend.position = c(0.8, 0.5)) +
  coord_cartesian(xlim = c(-20, 20))


Figure 1.4: Prior Predictive Checking. The predictions of our prior distributions look good, with greater dispersion than the observed data.

The predictions from our prior distributions look good. They cover a range of possibilities reasonably wider than that of the data.

1.6 Fitting the Model to Data

Having confirmed our prior distribution is reasonable, we fit the model to the data.

jsp_m1 <- brm(
  formula,
  data = jsp_sc,
  seed = 123,
  prior = jsp_priors,
  iter = 2500, warmup = 1500,
  control = list(adapt_delta = 0.99)
)
jsp_m1 <- add_criterion(jsp_m1, "loo")
summary(jsp_m1)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: score_5_z ~ 0 + gender + score_3_z + raven_z + (1 | student) + 
  (1 + score_3_z + raven_z | school) + (1 + score_3_z + raven_z | social) 
   Data: jsp_sc (Number of observations: 1774) 
Samples: 4 chains, each with iter = 2500; warmup = 1500; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~school (Number of levels: 48) 
                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)                0.24      0.03     0.19     0.31 1.00     1187     1818
sd(score_3_z)                0.11      0.03     0.06     0.16 1.00     1363     1872
sd(raven_z)                  0.03      0.02     0.00     0.09 1.00     1008     1684
cor(Intercept,score_3_z)    -0.75      0.15    -0.97    -0.39 1.00     1386     1461
cor(Intercept,raven_z)       0.16      0.44    -0.75     0.89 1.00     3238     2462
cor(score_3_z,raven_z)      -0.27      0.46    -0.93     0.77 1.00     2636     2592

~social (Number of levels: 9) 
                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)                0.08      0.04     0.01     0.18 1.01     1029     1178
sd(score_3_z)                0.03      0.02     0.00     0.08 1.00     2140     2145
sd(raven_z)                  0.06      0.04     0.00     0.16 1.00     1088     1310
cor(Intercept,score_3_z)     0.01      0.50    -0.87     0.88 1.00     4214     2632
cor(Intercept,raven_z)      -0.07      0.46    -0.87     0.82 1.00     2442     2214
cor(score_3_z,raven_z)      -0.15      0.51    -0.93     0.83 1.00     1861     2851

~student (Number of levels: 887) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.28      0.03     0.23     0.33 1.00      754     1185

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
genderm      -0.01      0.05    -0.12     0.10 1.00      966     1413
genderf       0.04      0.05    -0.07     0.14 1.00      985     1475
score_3_z     0.59      0.03     0.53     0.65 1.00     1831     2716
raven_z       0.19      0.03     0.12     0.25 1.00     2243     2045

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.54      0.01     0.52     0.57 1.00      928     1842

The variance of raven_z between schools (0.03) and score_3_z between social classes (0.03) seems too small. Let’s try without those predictors.

jsp_m2 <- update(
  jsp_m1,
  formula = score_5_z ~ 0 + gender + score_3_z + raven_z +
    (1 | student) +
    (1 + score_3_z | school) + (1 + raven_z | social)
)

loo(jsp_m2, jsp_m1)

: 
Model comparisons:
       elpd_diff se_diff
jsp_m2 0.0       0.0    
jsp_m1 0.0       1.1  

The difference between the previous model jsp_m1 and this last one jsp_m2 is not significant, so we will keep the simpler model, the last one (Occam's Razor).

1.7 The Final Model

Our final model, including the prior distributions, is: \[\begin{align*} \text{--- At population level:} \\ \text{score\_5\_z} &= \beta_0^{(gender)} + \beta_1 \, \text{raven\_z} + \beta_2 \, \text{score\_3\_z} + \pmb{\alpha}^{(school)} + \pmb{\gamma}^{(social)} + \delta_0^{(student)} + \epsilon \\ \epsilon &\sim \mathcal{N}(0, \sigma_{\epsilon}^2) \\ \text{priors:} \\ \delta_0^{(student)} &\sim \mathcal{N}(0,\sigma_{student}^2) \\ \sigma_{student} &\sim \mathrm{Exponential}(1) \\ \sigma_{\epsilon} &\sim \mathrm{HalfCauchy}(0, 1) \\[4pt] \text{--- At school level:} \\ \pmb{\alpha}^{(school)} &= \alpha_0^{(school)} + \alpha_1^{(school)} \, \text{score\_3\_z} \\ \begin{pmatrix} \alpha_0^{(school)} \\[5pt] \alpha_1^{(school)} \end{pmatrix} &\sim \mathcal{N} \begin{pmatrix} \begin{pmatrix} 0 \\ 0 \end{pmatrix} , \Sigma_{school} \end{pmatrix} \\ \text{priors:} \\ \Sigma_{school} &= \operatorname{diag}(\pmb{\tau}_\alpha) \, \pmb{\Omega}_\alpha \, \operatorname{diag}(\pmb{\tau}_\alpha) \\ \pmb{\tau}_\alpha^{(j)} &\sim \mathrm{Exponential}(1), \; j = 1, 2 \\ \pmb{\Omega}_\alpha &\sim \mathrm{LKJ}(1) \\[4pt] \text{--- At social class level:} \\ \pmb{\gamma}^{(social)} &= \gamma_0^{(social)} + \gamma_1^{(social)} \, \text{raven\_z} \\ \begin{pmatrix} \gamma_0^{(social)} \\[5pt] \gamma_1^{(social)} \end{pmatrix} &\sim \mathcal{N} \begin{pmatrix} \begin{pmatrix} 0 \\ 0 \end{pmatrix} , \Sigma_{social} \end{pmatrix} \\ \text{priors:} \\ \Sigma_{social} &= \operatorname{diag}(\pmb{\tau}_\gamma) \, \pmb{\Omega}_\gamma \, \operatorname{diag}(\pmb{\tau}_\gamma) \\ \pmb{\tau}_\gamma^{(j)} &\sim \mathrm{Exponential}(1), \; j = 1, 2 \\ \pmb{\Omega}_\gamma &\sim \mathrm{LKJ}(1) \end{align*}\]

jsp_m2
loo(jsp_m2)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: score_5_z ~ gender + score_3_z + raven_z + (1 | student) + 
  (1 + score_3_z | school) + (1 + raven_z | social) - 1 
   Data: jsp_sc (Number of observations: 1774) 
Samples: 4 chains, each with iter = 2500; warmup = 1500; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~school (Number of levels: 48) 
                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)                0.24      0.03     0.18     0.31 1.00     1021     2009
sd(score_3_z)                0.10      0.02     0.06     0.15 1.00     1400     2265
cor(Intercept,score_3_z)    -0.80      0.15    -0.99    -0.43 1.00     1050     1347

~social (Number of levels: 9) 
                       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)              0.08      0.05     0.02     0.19 1.00     1074     1258
sd(raven_z)                0.05      0.04     0.00     0.15 1.00      724     1163
cor(Intercept,raven_z)    -0.05      0.52    -0.93     0.88 1.00     1699     2463

~student (Number of levels: 887) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.28      0.02     0.23     0.33 1.00      761     1499

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
genderm      -0.01      0.06    -0.12     0.10 1.00     1046     1580
genderf       0.04      0.06    -0.08     0.15 1.00     1012     1386
score_3_z     0.59      0.02     0.54     0.64 1.00     1968     2668
raven_z       0.18      0.03     0.12     0.25 1.00     2071     1859

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.54      0.01     0.52     0.57 1.00     1113     2499
:
Computed from 4000 by 1774 log-likelihood matrix

         Estimate   SE
elpd_loo  -1631.6 32.9
p_loo       328.2 11.0
looic      3263.2 65.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1710  96.4%   303       
 (0.5, 0.7]   (ok)         58   3.3%   119       
   (0.7, 1]   (bad)         6   0.3%   54        
   (1, Inf)   (very bad)    0   0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

Our final model explains \(1-0.54^2/1.00^2 = 71\%\) of the outcome variable's variance. Also, \(\textrm{loo}\) has improved considerably, from elpd_loo = -2518.7 in our reference model to -1631.6 in this final one.
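As a cross-check (a sketch, not part of the original analysis), brms can also estimate a Bayesian \(R^2\) directly from the fitted model; we would expect a value near \(0.71\):

# Bayesian R-squared for the final model.
bayes_R2(jsp_m2)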

A caveat that we must attach to each interpretation we make of the model parameters is (*): "... in a regression modeling scores from the 3rd to the 5th school year, discounting the effects of school, gender, student, the student's father's social class, and the student's abstract reasoning ability measured by the Raven test."

The population-level intercept is separated by gender (genderm: male, genderf: female) and is very small, although with a slight advantage of girls over boys. We will later analyze this in detail.

At the school level, there is a negative correlation, \(-0.80\), between the intercept and the slope of score_3_z; that is, the higher the intercept, the lower the slope and vice versa. We will analyze and interpret this later.

1.7.1 Posterior Predictive Checking

Let’s take a look at a couple of graphical checks to get a sense of the goodness of fit.


Figure 1.5: JSP: Final Model Check. (Left) Posterior Predictive Check of the model, which looks good. (Right) Residual check, \(\epsilon\) vs. fitted values (predicted \(\hat{y}\)), which also looks good: there are no patterns in either the distribution of the points or their variability. The diagonal patterns are artifacts caused by the discreteness of the data.

The model looks good in Figure 1.5. On the left side, we see the distribution of the model predictions covering the observed data. Even though there are segments where \(y\) is at the margin of the predictions, it does not stray too far. On the right side, there is no pattern in the residuals or in their variability. The diagonal patterns are artifacts caused by the discreteness of the data.
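A minimal sketch of how these two panels could be produced (the actual figure's layout and theming are omitted):

# Left panel: posterior predictive check, as before.
pp_check(jsp_m2, nsamples = 70)

# Right panel: residuals vs. fitted values; both helpers return matrices
# whose "Estimate" column holds the posterior mean.
library(ggplot2)
d <- data.frame(
  fitted   = fitted(jsp_m2)[, "Estimate"],
  residual = residuals(jsp_m2)[, "Estimate"]
)
ggplot(d, aes(fitted, residual)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed")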

1.7.2 One Regression per Student

Since the scores are made up of the math and english subjects, we can draw a line for each student, joining those subjects in a 5th vs. 3rd-year scores graph.


Figure 1.6: JSP: Score Prediction at selected schools. Only the eight schools with the highest and lowest intercepts are included. Line segments link each student's score predictions for english and math. These lines converge towards the upper right corner of the graph, reflecting the fact that the most outstanding students obtain high scores regardless of the school in which they study, as opposed to the most lagging ones (close to the lower-left corner), whose scores differ more depending on the school.

In the graph "JSP: Score Prediction at selected schools" (Figure 1.6), each line corresponds to a student, joining his/her scores in english and mathematics in the third and fifth years. We show the eight schools with the highest and lowest intercepts. We can see these lines converge roughly towards the upper right corner of the graph. This convergence reflects the negative correlation, estimated at \(\text{cor(Intercept, \, score\_3\_z)} = -0.80\), between the intercept and slope per school, whose interpretation is:

The more outstanding a student is, the less his/her score depends on the school. Conversely, the more lagging a student is, the more his/her scores vary with the school. Consequently, when the school effect is negative, the most affected are the least skilled students.

1.8 A Hypothesis Predicated on Parameter Values

Having an approximation to the joint distribution of the model parameters, we can test hypotheses that would otherwise, with classical point estimates, be impossible to test with precision. By hypothesis we mean a proposition that predicates on one or more model parameters, such as "\(\beta_0>0\)" or "\(\alpha_0^{(2)}> \alpha_0^{(3)}\)." Here are some examples.

Example Comparing the Intercepts of Two Schools

Graph 1.7-A shows the distributions of two schools' intercepts (\(\alpha_0^{(6)}\) and \(\alpha_0^{(29)}\)), where, at the base of each distribution, both its \(95\%\) credible interval and its mean are shown in blue. Since the distributions overlap so much, at first glance it seems counterintuitive to say that we have \(99\%\) certainty that the intercept of school s6 (above) is greater than that of school s29 (below). In 1.7-B the red dashed line is the y=x line. Seeing that most points are above this line, it is now clear that the s6 intercept is greater than the s29 one. When using classical pointwise regressions, the blue trace in graph A is all the information we have; with so much overlap, it is impossible to guess that one school's intercept is actually greater than the other's.


Figure 1.7: JSP: Comparison of the Intercepts of Schools s29 and s6. In (A), the blue point and thick line are the estimated intercept value and range for the two schools, which is all you have in a classical pointwise regression. At first glance from that graph it seems counterintuitive, but the school s6 intercept is higher than that of school s29 with 99% certainty. In (B), you can see both intercepts' joint marginal distribution, where the red dashed line is the y=x line. Now it is clear that, with high certainty, one school's intercept is greater than the other's.

In Bayesian inference, we compare two parameters through their values in the posterior sample, and we estimate the certainty with the proportion of samples in which the comparison holds. For example, among the \(4000\) sample draws, in \(3958\) of them the school s6 intercept is greater than that of s29, so the estimated probability is \(3958/4000 = 98.95\%\). In this way, we evaluate the expected value of an indicator function \(g(\theta)\), which is \(1\) when the predicate on the parameter values is true and zero otherwise. The predicate states the hypothesis we want to test. \[ \mathop{{}\mathbb{E}}(g(\theta)) = \int g(\theta) \cdot \pi(\theta \mid y) \, \textrm{d}\theta \approx \frac{1}{N} \sum_i g(\theta_i) \]

For our example, \(g(\theta)=I(\alpha_0^{(6)} > \alpha_0^{(29)})\).
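We can compute this estimate by hand from the posterior draws (a sketch; the column names follow brms's r_school[level,Intercept] convention):

# Proportion of draws in which school 6's intercept exceeds school 29's.
post <- posterior_samples(jsp_m2, pars = "r_school")
mean(post[["r_school[6,Intercept]"]] > post[["r_school[29,Intercept]"]])
# approx. 0.99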

Testing Hypotheses in brms

With brms, you can test hypotheses with the hypothesis() function. In our previous example, the hypothesis is:

hypothesis(jsp_m2,
  c("s6>s29" = "r_school[6,Intercept] > r_school[29,Intercept]"),
  class = NULL
)

Hypothesis Tests for class :
  Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1     s6>s29     0.35      0.16      0.1     0.62      94.24      0.99    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

In the hypothesis() function output, the Estimate column shows the estimated difference between the schools' intercepts (\(s6-s29\)). The Est.Error value is the standard error of that estimate. The Evid.Ratio column is the ratio of the evidence in favor of the hypothesis to that against it: \(Evid.Ratio = 0.9895 / (1-0.9895) = 94.24\). Finally, Post.Prob is the proportion of the samples in favor of the hypothesis.


Figure 1.8: JSP: Difference between the Intercepts of Schools 6 and 29. In 98.95% of the samples, the intercept of school 6 is greater than that of school 29. The dotted line indicates the mean difference. The annotations in red and blue are the percentages of samples with the difference below and above zero, respectively. SD is the standard deviation of the difference.

To better understand the results of the hypothesis() function, plot 1.8 shows the distribution of the difference between the schools' intercepts (s6 minus s29), together with some statistics about it. As expected, \(98.95\%\) of the differences are positive, matching the Post.Prob column in the result of hypothesis(); the standard deviation (SD in the graph) is equal to the Est.Error value, while the average difference is \(0.35\), as reported in the Estimate column.

An example comparing the intercepts by gender

In the jsp_m2 fitting summary, we see from the gender intercepts that the girls genderf (gender:fem) have a very slight advantage (about \(0.04\,\sigma_y\)) over the boys genderm (gender:masc). But there is no great certainty that either of these two intercepts is different from zero, as their credible intervals include it. What can we say about the difference between these two intercepts? Perhaps the variance of the difference is so great that it precludes us from saying anything about it with reasonable certainty? Let's find out with the help of the hypothesis() function.

hypothesis(jsp_m2, "genderf - genderm > 0", class = "b")

Hypothesis Tests for class b:
             Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 (genderf-genderm) > 0     0.04      0.03    -0.01     0.09      10.33      0.91       
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
:

We find that in \(91\%\) of the samples, the intercept difference favors the girls.

The average 3rd-year girl ends up with a slightly better 5th-year score than the average boy. The data show this with an estimated certainty of \(91\%\), and the estimated advantage of girls over boys is about \(0.04 \, \sigma_y\).

1.8.1 Father's Socioeconomic Class Effect on the Student

We can test as a hypothesis the sign of the intercept of each social class, that is, test whether the father's socioeconomic class has a positive or negative effect on the student's learning.

hypothesis(jsp_m2, "Intercept > 0", scope = "ranef", group = "social")

Hypothesis Tests for class :
          Group      Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1             I (Intercept) > 0     0.05      0.07    -0.05     0.18       3.02      0.75     
2            II (Intercept) > 0     0.06      0.06    -0.02     0.15       7.08      0.88     
3 III nonmanual (Intercept) > 0     0.03      0.05    -0.06     0.12       2.35      0.70     
4    III manual (Intercept) > 0    -0.04      0.05    -0.11     0.03       0.21      0.18     
5            IV (Intercept) > 0     0.00      0.06    -0.09     0.08       0.88      0.47     
6             V (Intercept) > 0    -0.07      0.07    -0.19     0.02       0.15      0.13     
7   lterm unemp (Intercept) > 0    -0.06      0.07    -0.18     0.03       0.22      0.18     
8    curr unemp (Intercept) > 0    -0.02      0.07    -0.15     0.09       0.65      0.39     
9 father absent (Intercept) > 0     0.05      0.05    -0.02     0.13       5.78      0.85     

Let's remember that in our model these intercepts are deviations around the population-level mean. They are centered at zero, and positive values refer to intercepts above the mean and vice versa. That is why it makes sense to evaluate the certainty we have about these intercepts relative to zero.

When Evid.Ratio is less than 1, it indicates we have more evidence that the effect of the social class is negative than positive. Thus, in the case of row \(6\), with Evid.Ratio = 0.15, we have \(1/0.15 \approx 6.7\) times more evidence that the effect is negative than positive. Correspondingly, since Post.Prob = 0.13, the probability that the class has a negative impact is \(1-0.13 = 87\%\).

The influence of the father's socioeconomic class on student performance is small, with an average magnitude of \(0.04 \,\sigma_y\) up to an estimated maximum of \(0.07 \,\sigma_y\). We are more confident that this influence is favorable for students with fathers in the first two social classes and in the father absent category, and that it is negative for fathers in the last category V or unemployed in the long term, lterm unemp. However, the certainty we have about these effects is in the range of \([75\%, 88\%]\).

Surprisingly, the father absent category has a positive effect on student learning, with an advantage of \(0.05-(-0.07) = 0.12 \,\sigma_y\) over the effect of a father in the last category V.

1.9 Effect of the School on the Student

Let us recall that with the ranef() function, we can obtain the predictors’ coefficients in the groups, such as the intercept and score_3_z by school.

ranef(jsp_m2)$school %>% print(digits = 3)

, , Intercept
    Estimate Est.Error    Q2.5   Q97.5
1  -0.32204    0.0876 -0.4935 -0.1511
2  -0.15156    0.1226 -0.3919  0.0893
3   0.09473    0.1445 -0.1789  0.3798
4   0.02927    0.0893 -0.1440  0.2095
5  -0.06213    0.1028 -0.2666  0.1406
: 
, , score_3_z
    Estimate Est.Error     Q2.5    Q97.5
1   0.12172    0.0542  0.02355  0.235792
2   0.04731    0.0635 -0.07774  0.177896
3  -0.03464    0.0749 -0.18919  0.111111
4  -0.00567    0.0508 -0.10957  0.094687
5   0.02995    0.0575 -0.08256  0.149794
:

School Classification

The intercept in our model is composed of a term at the population level (by gender) and the intercepts in each of the three groups we have. If we write \(\hat{\beta_0}\) for that total intercept, we have: \[\begin{equation} \hat{\beta_0} = \beta_0^{(gender)} + \alpha_0^{(school)} + \gamma_0^{(social)} + \delta_0^{(student)} \end{equation}\]

Since this total intercept is the expected 5th-year score for the average third-year student, we can classify schools according to their effect on this average student. However, to make a fair comparison, we must consider the school's impact on the same typical student. We achieve this through the expected effect on the student: \[\begin{equation} \mathop{\mathbb{E}}(\hat{\beta_0}) = \mathop{\mathbb{E}}(\beta_0^{(gender)}) + \alpha_0^{(school)} + \mathop{\mathbb{E}}(\gamma_0^{(social)}) + \mathop{\mathbb{E}}(\delta_0^{(student)}) \tag{1.1} \end{equation}\]

where the expectation \(\mathbb{E}()\) is taken over the students' characteristics, so all of these terms are constant across students. The only term that depends on the school is \(\alpha_0^{(school)}\). We shall use this term to classify the schools by their effect on the average student.


Figure 1.9: Classification of Schools Effect on Average Pupil. Each point represents the estimated intercept, and the vertical bars its credible intervals at 50% and 95%. The relative effect between the best and worst school is enormous, about one standard deviation (\(1 \,\sigma_y\)) in the student's 5th-year score.

In the graph "Classification of Schools Effect on Average Pupil" (Figure 1.9), the schools are classified by their impact on the average third-year student; the credible intervals are at 50% and 95%. The relative effect between the best and the worst school is enormous: approximately one standard deviation, \(1 \,\sigma_y\), in the student's 5th-year score. This is consistent with the standard deviation of these schools' intercepts \((0.24)\).

Comparison of Effect between the Best and Worst Schools

If we consider that 52.1% of the students in the data are girls, we can estimate \(\mathop{\mathbb{E}}_{student}(\beta_0^{(gender)})\) in (1.1) with the sum of the intercepts by gender weighted by their frequencies: \[\begin{align*} \beta_0 &= \begin{cases} -0.00991 & \text{if gender}= \text{m (47.9\%)} \\ \phantom{-}0.03656 & \text{if gender}= \text{f (52.1\%)} \end{cases} \\ \implies \mathop{\mathbb{E}}_{student}(\beta_0) &= 47.9\% \times -0.00991 + 52.1\% \times +0.03656 = 0.01429 \end{align*}\]
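A sketch of this weighted expectation computed directly from the fitted model:

# Expected gender intercept over the students.
b   <- fixef(jsp_m2)[, "Estimate"]   # genderm, genderf, score_3_z, raven_z
p_f <- mean(jsp_sc$gender == "f")    # proportion of girls, about 52.1%
(1 - p_f) * b["genderm"] + p_f * b["genderf"]
# approx. 0.014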

In the same way, we estimate the expected intercept by the father's socioeconomic class social, \(\mathop{\mathbb{E}}_{student}(\gamma_0^{(social)})\), at \(-0.00424\), and the expected slope of raven, \(\mathop{\mathbb{E}}_{student}(\gamma_1^{(social)})\), at \(0.00418\). Using these values, the regressions for the schools with the highest and lowest intercepts (best and worst schools) are: \[\begin{align*} \text{Best school:} \\ \text{score\_5\_z} &= 0.5563 + 0.1874 \, \text{raven\_z} + 0.4041 \, \text{score\_3\_z} \\ \text{Worst School:} \\ \text{score\_5\_z} &= -0.4463 + 0.1874 \, \text{raven\_z} + 0.7315 \, \text{score\_3\_z} \end{align*}\]

The average student in the best school has a score \(0.5563 - (-0.4463) = 1.0 \,\sigma_y\) higher than one in the worst. For the worst school's students to catch up with those in the best school, it is required that: \[\begin{align*} -0.4463 + 0.1874 \, \text{raven\_z} + 0.7315 \, \text{score\_3\_z} &= 0.5563\\ 0.1874 \, \text{raven\_z} + 0.7315 \, \text{score\_3\_z} &= 1.0 \end{align*}\]

where one of the infinitely many solutions is \(\text{raven\_z} = 1.1\) and \(\text{score\_3\_z} = 1.1\), since \(0.1874 \times 1.1 + 0.7315 \times 1.1 \approx 1.0 \, \sigma_y\), which allows us to state:

The school has a significant impact on student learning. The average student in the school with the most positive influence (best school) has an estimated advantage of \(1\) standard deviation in the 5th-year score over the same average student in the worst school.

For a student in the worst school to catch up with the average student in the best one, he/she must have 3rd-year and abstract reasoning (raven) scores \(1.1\) standard deviations above the population average.

1.10 Classification of Schools by their Effect on the 3rd-year Average Student

To classify the schools, we will cross-compare each school's intercept against all the others (i.e., \(48 \times 47/2 = 1128\) comparisons) using the hypothesis tests we saw earlier. In each comparison, we will conclude that one school has a more positive effect on the student than another when the certainty is greater than \(95\%\). Remember that classifying schools by intercept is equivalent to classifying them by their effect on the average 3rd-year student.
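A sketch of this cross-comparison (assuming brms's r_school[level,Intercept] naming; evaluating the 1128 hypotheses takes a while):

# Build the 48 * 47 / 2 = 1128 pairwise one-sided hypotheses.
schools <- rownames(ranef(jsp_m2)$school)
pairs   <- t(combn(schools, 2))
hyps    <- sprintf("r_school[%s,Intercept] > r_school[%s,Intercept]",
                   pairs[, 1], pairs[, 2])
res <- hypothesis(jsp_m2, hyps, class = NULL)$hypothesis
# A pair is decided when either school wins with 95% certainty.
sum(res$Post.Prob > 0.95 | res$Post.Prob < 0.05)   # about 461 decided pairs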

In the graph "JSP: Comparing each school against all others" (Figure 1.10), each green point flags that the school on the horizontal axis has a more positive effect on the average student than the school on the vertical axis, with certainty greater than \(95\%\). For example, school 27 has a greater effect than schools 29 and 1. The red dots indicate the opposite; for instance, school 7 has less impact on the student than school 31.


Figure 1.10: JSP: Comparing each school against all others. Each green dot indicates the school in the row has a greater effect (95% or more) than the school in the column. Each red dot indicates the opposite, also at 95%.

The middle zone without points contains the combinations of schools for which there is no certainty (at 95%) about which one is better than the other. A larger volume of data would be required to resolve these cases. However, we have been able to discriminate in \(461\) of the \(1128\) comparisons, that is, \(41\%\) of them. If we had only the intervals shown in the graph "Classification of Schools Effect on Average Pupil" (Figure 1.9), as in classical point-estimate regressions, we would have been able to discriminate the difference in effect between far fewer school pairs.

References

Lai, Mark H. 2015. “Examining the Rule of Thumb of Not Using Multilevel Modeling: The ‘Design Effect Smaller Than Two’ Rule.” The Journal of Experimental Education 83 (3): 423–38. https://doi.org/10.1080/00220973.2014.907229.
Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. "Generating Random Correlation Matrices Based on Vines and Extended Onion Method." Journal of Multivariate Analysis 100 (9): 1989–2001. https://doi.org/10.1016/j.jmva.2009.04.008.
Mortimer, Peter, Russel Ecob, and Pamela Sammons. 1989. “A Study of Effective Junior Schools.” International Journal of Educational Research.
Peterson, Ryan A., and Joseph E. Cavanaugh. 2019. “Ordered Quantile Normalization: A Semiparametric Transformation Built for the Cross-Validation Era.” Journal of Applied Statistics, 1–16. https://doi.org/10.1080/02664763.2019.1630372.
