Model comparison

multilevel models, LOO, R squared

Using loo (approximate leave-one-out cross-validation) and Bayesian R squared

06-04-2021

Approximate leave-one-out cross-validation

Another way to evaluate and compare models is based on their ability to predict "out-of-sample data," that is, future observations, using what we have learned from the observed data. We can use cross-validation to test which of the models under consideration is able to learn the most from the data in order to make better predictions.

The idea here is that instead of assessing how well a model can predict the observed data themselves (the prior predictive density), we assess how well a model can predict new data, based on the posterior predictive density \(\log p(\tilde{y} \mid y, \mathcal{M})\). Essentially, LOO is a technique for approximating the expected log pointwise predictive density (elpd) using Pareto-smoothed importance sampling (Vehtari, Gelman, and Gabry 2017). A major advantage of cross-validation methods compared with Bayes factors is that the specification of the priors is less critical.
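To make this concrete, here is a minimal sketch of what loo does under the hood: it takes the matrix of pointwise log-likelihood values from the posterior draws and applies Pareto-smoothed importance sampling to approximate the leave-one-out elpd. The object fit is a placeholder for any fitted brms model with four chains.

# Sketch: PSIS-LOO from the pointwise log-likelihood matrix of a brms fit
# (`fit` is a placeholder for a fitted brmsfit object with 4 chains)
library(brms)
library(loo)

ll <- log_lik(fit)                       # draws x observations matrix
r_eff <- relative_eff(exp(ll),
                      chain_id = rep(1:4, each = nrow(ll) / 4))
psis_loo <- loo(ll, r_eff = r_eff)       # Pareto-smoothed importance sampling
psis_loo$estimates                       # elpd_loo, p_loo, looic with SEs

In practice we simply call loo() on the fitted model, as shown below.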

We will use the same dataset as in the bridge sampling example.

simulate_treatment <- function(a = 3.5,
                               b = -1,
                               sigma_a = 1,
                               sigma_b = 0.8,
                               rho = -0.7,
                               n_subjects = 10,
                               n_trials = 20,
                               sigma = 0.5) {

    # combine the terms into a covariance matrix
    mu <- c(a, b)
    cov_ab <- sigma_a * sigma_b * rho
    SD  <- matrix(c(sigma_a^2, cov_ab,
                    cov_ab, sigma_b^2), ncol = 2)

   #  sigmas <- c(sigma_a, sigma_b)          # standard deviations
   #  rho <- matrix(c(1, rho,             # correlation matrix
   #                 rho, 1), nrow = 2)
   # 
   # # now matrix multiply to get covariance matrix
   # SD <- diag(sigmas) %*% rho %*% diag(sigmas)

    varying_effects <-
        MASS::mvrnorm(n_subjects, mu, SD) |>
        # as_tibble(.name_repair = "unique") |>
        data.frame() |>
        purrr::set_names("a_j", "b_j")

    d_linpred <-
        varying_effects |>
        mutate(subject  = 1:n_subjects) |>
        expand(nesting(subject, a_j, b_j), post = c(0, 1)) |>
        mutate(mu = a_j + b_j * post,
               sigma = sigma) |>
        mutate(treatment = ifelse(post == 0, "pre", "post"),
               treatment = factor(treatment, levels = c("pre", "post")),
               subject = as_factor(subject))

    d <- d_linpred |>
        slice(rep(1:n(), each = n_trials)) |>
        mutate(response = rnorm(n = n(), mean = mu, sd = sigma))

   d
}

plot_linpred <- function(d, violin = FALSE) {
  
  d_linpred <- d |>
  group_by(subject, treatment) |> distinct(mu, .keep_all = TRUE)
  
  p <- d_linpred |>
    ggplot(aes(x = treatment, y = mu))
  
  if (isTRUE(violin)) {
  p <- p + 
      geom_violin(aes(x = treatment, y = response,
                          fill = treatment), 
                  alpha = 0.5,
                  data = d) +
      geom_jitter(aes(x = treatment, y = response,
                          color = treatment), 
                  width = 0.1, size = 2,
                  data = d)
  }
  
  p <- p +
    geom_line(aes(group = 1), color = "#8B9DAF", size = 1, linetype = 3) +
    geom_point(aes(fill = treatment), 
               shape = 21, 
               colour = "black", 
               size = 4, 
               stroke = 2) +
    scale_fill_brewer(type = "qual") +
    scale_color_brewer(type = "qual") +
    coord_cartesian(ylim = c(0, 8)) +
    ylab("Response") +
    theme(legend.position = "none",
          axis.ticks.x    = element_blank()) +
    facet_wrap(~ subject) +
  ggtitle("Linear predictor")
  
  p
}

We have a hierarchical dataset, consisting of 30 subjects, measured at two time points (with 20 trials per time point).

The parameters used to generate the data are:

params <- tribble(~Parameter, ~`Default value`, ~Description,
                  "a", 2, "average pre-treatment effect (intercepts)",
                  "b", 0.5, "average difference between pre and post",
                  "sigma_a", 1, "std dev in intercepts",
                  "sigma_b", 2, "std dev in differences (slopes)",
                  "rho", -0.2, "correlation between intercepts and slopes",
                  "sigma", 0.5, "residual standard deviation") 
params |> 
    kableExtra::kbl() |> 
    kableExtra::kable_paper("hover", full_width = TRUE)
Parameter Default value Description
a 2.0 average pre-treatment effect (intercepts)
b 0.5 average difference between pre and post
sigma_a 1.0 std dev in intercepts
sigma_b 2.0 std dev in differences (slopes)
rho -0.2 correlation between intercepts and slopes
sigma 0.5 residual standard deviation

These are the parameters we would like to recover with our model.

set.seed(867)
d <- simulate_treatment(n_subjects = 30,
                        a = 2, b = 0.5,
                        sigma_a = 1, sigma_b = 1.5,
                        rho = -0.2,
                        sigma = 0.5,
                        n_trials = 20)

We can plot 12 random subjects to get an idea of what the true means look like.

random_subjects <- sample(levels(d$subject), 12)
d |> 
  filter(subject %in% random_subjects) |> plot_linpred()

priors <- prior(normal(0, 1), class = b) +
  prior(lkj(2), class = cor)

fit1 <- brm(response ~ treatment + (treatment | subject),
                    prior = priors,
                    data = d,
                    file = "../../models/fit_loo_1",
                    file_refit = "on_change")
fit1
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: response ~ treatment + (treatment | subject) 
   Data: d (Number of observations: 1200) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~subject (Number of levels: 30) 
                             Estimate Est.Error l-95% CI u-95% CI
sd(Intercept)                    0.94      0.13     0.73     1.23
sd(treatmentpost)                1.61      0.22     1.25     2.11
cor(Intercept,treatmentpost)    -0.01      0.18    -0.35     0.33
                             Rhat Bulk_ESS Tail_ESS
sd(Intercept)                1.00      880     1029
sd(treatmentpost)            1.00      802     1177
cor(Intercept,treatmentpost) 1.00      735     1176

Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept         1.94      0.17     1.60     2.29 1.00      619
treatmentpost     0.65      0.29     0.08     1.20 1.01      805
              Tail_ESS
Intercept          861
treatmentpost     1207

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.50      0.01     0.48     0.53 1.00     4100     2694

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
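As a quick sanity check of parameter recovery (a sketch using standard brms accessors), we can set the posterior summaries against the values used to simulate the data (a = 2, b = 0.5, sigma_a = 1, sigma_b = 1.5, rho = -0.2, sigma = 0.5):

# Population-level effects: Intercept should be close to a, treatmentpost to b
fixef(fit1)
# Group-level SDs and correlation: compare with sigma_a, sigma_b and rho;
# the residual SD is also reported by VarCorr()
VarCorr(fit1)

We now compute the approximate leave-one-out criterion for this model: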
loo_fit1 <- loo(fit1)

loo_fit1

Computed from 4000 by 1200 log-likelihood matrix

         Estimate   SE
elpd_loo   -911.9 24.8
p_loo        59.4  2.4
looic      1823.9 49.6
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

The function loo reports three quantities with their standard errors: elpd_loo, the expected log pointwise predictive density for new data; p_loo, the effective number of parameters; and looic, the LOO information criterion, which is simply -2 * elpd_loo (the deviance scale).
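The same numbers can be extracted programmatically; the loo object stores them in its estimates matrix:

loo_fit1$estimates   # rows: elpd_loo, p_loo, looic; columns: Estimate, SE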

Next we will fit our null model (using more iterations than the default, as per Stan’s recommendation).

fit_null_varying_effects <- brm(response ~ 1 + (treatment | subject),
                    prior = prior(lkj(2), class = cor),
                    data = d,
                    iter = 4000,
                    file = "../../models/fit_loo_null",
                    file_refit = "on_change")
loo_fit_null_varying_effects <- loo(fit_null_varying_effects)

loo_fit_null_varying_effects

Computed from 8000 by 1200 log-likelihood matrix

         Estimate   SE
elpd_loo   -911.5 24.7
p_loo        58.9  2.4
looic      1823.0 49.5
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

To compare the models, we look at the difference in elpd_loo between them, together with the standard error (SE) of that difference:

loo_compare(loo_fit1, loo_fit_null_varying_effects)
                         elpd_diff se_diff
fit_null_varying_effects  0.0       0.0   
fit1                     -0.4       0.3   

The difference in elpd_loo is not larger than twice its standard error (a common rule of thumb), meaning that, from the perspective of LOO, the two models are practically indistinguishable.
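As a quick programmatic version of this rule of thumb (a sketch using the matrix returned by loo_compare), we can check whether the absolute difference exceeds twice its standard error:

comp <- loo_compare(loo_fit1, loo_fit_null_varying_effects)
abs(comp[2, "elpd_diff"]) > 2 * comp[2, "se_diff"]   # FALSE here: no clear winner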

We can also inspect the pointwise elpd contributions to see whether particular observations or conditions drive the (small) difference between the models:

d <- mutate(d,
            n = row_number(),
            diff_elpd = loo_fit1$pointwise[,"elpd_loo"] -
                        loo_fit_null_varying_effects$pointwise[,"elpd_loo"])

d |> 
  ggplot(aes(x = n, y = diff_elpd, color = treatment)) +
  geom_point(alpha = .4, position = position_jitter(w = 0.01, h = 0)) +
  scale_color_brewer(type = "qual")

d |> 
  ggplot(aes(x = treatment, y = diff_elpd, color = treatment)) +
  geom_point(alpha = .4, position = position_jitter(w = 0.1, h = 0)) +
  scale_color_brewer(type = "qual")

We can also compare posterior predictive checks for the two models:

pp_check(fit1)

pp_check(fit_null_varying_effects)

Refitting the same null model a second time (stored under a different file name) and including it in the comparison shows that the difference between two fits of an identical model is essentially zero, which gives a sense of the Monte Carlo noise in these comparisons.

fit_loo_null <- brm(response ~ 1 + (treatment | subject),
                    prior = prior(lkj(2), class = cor),
                    data = d,
                    iter = 4000,
                    file = "../../models/fit_loo_null-2",
                    file_refit = "on_change")
loo_fit_loo_null <- loo(fit_loo_null)
loo_compare(loo_fit1, loo_fit_null_varying_effects, loo_fit_loo_null)
                         elpd_diff se_diff
fit_loo_null              0.0       0.0   
fit_null_varying_effects  0.0       0.1   
fit1                     -0.4       0.3   

Let’s try this again, using a bigger true difference between treatment conditions.

set.seed(867)
d2 <- simulate_treatment(n_subjects = 30,
                        a = 3, b = 2,
                        sigma_a = 1, sigma_b = 1.5,
                        rho = -0.2,
                        sigma = 0.5,
                        n_trials = 20)

We can plot 12 random subjects to get an idea of what the true means look like.

random_subjects <- sample(levels(d2$subject), 12)
d2 |> 
  filter(subject %in% random_subjects) |> plot_linpred()

priors <- prior(normal(0, 1), class = b) +
  prior(lkj(2), class = cor)

fit2 <- brm(response ~ treatment + (treatment | subject),
                    prior = priors,
                    data = d2,
                    file = "../../models/fit_loo_d2",
                    file_refit = "on_change") |> 
  add_criterion("loo")
fit_null_2 <- brm(response ~ 1 + (treatment | subject),
                    prior = prior(lkj(2), class = cor),
                    data = d2,
                    iter = 4000,
                    file = "../../models/fit_loo_null_d2",
                    file_refit = "on_change") |> 
    add_criterion("loo")
loo_compare(fit2, fit_null_2)
           elpd_diff se_diff
fit2        0.0       0.0   
fit_null_2 -0.3       0.5   

Even with a larger true difference between the treatment conditions, the two models remain practically indistinguishable by LOO. This is not too surprising: the null model still contains subject-specific treatment effects in its group-level part, so it predicts held-out observations almost as well as the full model.

pp_check(fit2)

pp_check(fit_null_2)

R^2

The usual definition of \(R^2\), the variance of the predicted values \(\hat{y}\) divided by the variance of the observations \(y\),

\[ \frac{\text{var}(\hat{y})}{\text{var}(y)} \]

has problems as a measure of model fit: it never decreases when parameters are added, so it rewards overfitting, and it does not generalize well beyond single-level Gaussian models.

However, Gelman et al. (2019) have proposed a Bayesian \(R^2\), which is computed as

\[ \frac{\text{Explained variance}}{\text{Explained variance} + \text{Residual variance}} \]
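For a Gaussian model, this quantity can be computed by hand from the posterior draws. The following is a minimal sketch (using the fit2 model and the d2 data from above), mirroring, up to details, what bayes_R2() does in brms:

# Bayesian R^2 per posterior draw: explained variance of the linear predictor
# relative to explained plus residual variance (Gelman et al. 2019)
y        <- d2$response
y_pred   <- posterior_epred(fit2)                       # draws x observations
var_fit  <- apply(y_pred, 1, var)                       # explained variance per draw
var_res  <- apply(y_pred, 1, function(mu) var(y - mu))  # residual variance per draw
r2_draws <- var_fit / (var_fit + var_res)
quantile(r2_draws, probs = c(0.025, 0.5, 0.975))

The built-in bayes_R2() returns a summary of the same quantity: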

bayes_R2(fit2) %>% round(digits = 3)
   Estimate Est.Error  Q2.5 Q97.5
R2    0.924     0.001 0.921 0.927
bayes_R2(fit_null_2) %>% round(digits = 3)
   Estimate Est.Error  Q2.5 Q97.5
R2    0.924     0.001 0.921 0.927

The R package performance also implements both conditional and marginal \(R^2\) for multilevel models.
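For example (a sketch, assuming the performance package is installed), r2_bayes() reports both versions for a brmsfit:

performance::r2_bayes(fit2)   # conditional and marginal Bayesian R2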

Real-world example

As a real-world example we use the sleepstudy data from the lme4 package, which contains daily average reaction times of 18 subjects in a sleep-deprivation study.

library(lme4)
data("sleepstudy")
sleepstudy <- sleepstudy %>% 
  as_tibble() %>% 
  mutate(Subject = as.character(Subject))
ggplot(sleepstudy) + 
  aes(x = Days, y = Reaction) + 
  stat_smooth(method = "lm", se = FALSE) +
  geom_point() +
  facet_wrap(~Subject) +
  labs(x = "Days of sleep deprivation", 
       y = "Average reaction time (ms)") + 
  scale_x_continuous(breaks = 0:4 * 2)

We compare a no-pooling model, in which every subject gets a separate intercept and slope for Days, with a partial-pooling (multilevel) model:

fit_no_pooling <- brm(Reaction ~ Days*Subject,
                      prior = prior(normal(0, 10), class = b),
                      data = sleepstudy,
                      file = "../../models/sleepstudy_no_pooling")
fit_partial_pooling <- brm(Reaction ~ 1 + Days + (1 + Days | Subject),
                      prior = prior(normal(0, 10), class = b),
                      data = sleepstudy,
                      file = "../../models/sleepstudy_partial_pooling")
loo_no_pooling <- loo(fit_no_pooling)
loo_partial_pooling <- loo(fit_partial_pooling)
plot(loo_no_pooling)

loo_compare(loo_no_pooling, loo_partial_pooling)
                    elpd_diff se_diff
fit_partial_pooling  0.0       0.0   
fit_no_pooling      -7.1       6.7   
Finally, we fit a null model that drops the population-level effect of Days but keeps the varying intercepts and slopes:

fit_sleep_null_vary_effects <- brm(Reaction ~ 1 + (1 + Days | Subject),
                      prior = prior(lkj(2), class = cor),
                      data = sleepstudy,
                      file = "../../models/fit_sleep_null_vary_effects")
loo_sleep_null_vary_effects <- loo(fit_sleep_null_vary_effects)
loo_sleep_null_vary_effects

Computed from 4000 by 180 log-likelihood matrix

         Estimate   SE
elpd_loo   -862.5 22.8
p_loo        36.2  9.0
looic      1725.1 45.6
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     174   96.7%   654       
 (0.5, 0.7]   (ok)         3    1.7%   277       
   (0.7, 1]   (bad)        2    1.1%   11        
   (1, Inf)   (very bad)   1    0.6%   9         
See help('pareto-k-diagnostic') for details.
loo_compare(loo_sleep_null_vary_effects, loo_partial_pooling)
                            elpd_diff se_diff
fit_partial_pooling          0.0       0.0   
fit_sleep_null_vary_effects -1.7       1.5   
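Note that a few Pareto k values for the null model are above 0.7, so the importance-sampling approximation is unreliable for those observations. A common remedy (sketched below; reloo refits the model once per flagged observation and can be slow) is to recompute the criterion exactly for the problematic points:

# Exact leave-one-out refits for the observations with high Pareto k:
loo_null_exact <- loo(fit_sleep_null_vary_effects, reloo = TRUE)
# Alternatively, moment matching (requires fitting the model with
# save_pars = save_pars(all = TRUE)):
# loo(fit_sleep_null_vary_effects, moment_match = TRUE)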
References

Gelman, Andrew, Ben Goodrich, Jonah Gabry, and Aki Vehtari. 2019. “R-Squared for Bayesian Regression Models.” The American Statistician 73 (3): 307–9. https://doi.org/10.1080/00031305.2018.1549100.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/awellis/learnmultilevelmodels, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Ellis (2021, June 4). Learn multilevel models: Model comparison. Retrieved from https://awellis.github.io/learnmultilevelmodels/model-comparison.html

BibTeX citation

@misc{ellis2021model,
  author = {Ellis, Andrew},
  title = {Learn multilevel models: Model comparison},
  url = {https://awellis.github.io/learnmultilevelmodels/model-comparison.html},
  year = {2021}
}