Model comparison

multilevel models, LOO, R squared

Using loo (approximate leave-one-out cross-validation) and Bayesian R squared

06-04-2021

Approximate leave-one-out cross-validation

Another way to evaluate and compare models is based on their ability to predict "out-of-sample data," that is, future observations, using what we have learned from the observed data. We can use cross-validation to test which of the models under consideration is able to learn the most from the data in order to make better predictions.

The idea here is that instead of assessing how well a model can predict the observed data themselves (the prior predictive density), we assess how well a model can predict new data, based on the posterior predictive density \(\log p(\tilde{y} \mid y, \mathcal{M})\). Essentially, LOO is a technique for approximating the expected log pointwise predictive density (elpd) using Pareto-smoothed importance sampling (Vehtari, Gelman, and Gabry 2017). A major advantage of cross-validation methods compared with Bayes factors is that the specification of the priors is less critical.
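To make this concrete, here is a minimal sketch of what loo does under the hood: it takes the matrix of pointwise log-likelihood values from the posterior draws and applies Pareto-smoothed importance sampling to approximate the leave-one-out elpd. The object fit is a placeholder for any fitted brms model with four chains.

# Sketch: PSIS-LOO from the pointwise log-likelihood matrix of a brms fit
# (`fit` is a placeholder for a fitted brmsfit object with 4 chains)
library(brms)
library(loo)

ll <- log_lik(fit)                       # draws x observations matrix
r_eff <- relative_eff(exp(ll),
                      chain_id = rep(1:4, each = nrow(ll) / 4))
psis_loo <- loo(ll, r_eff = r_eff)       # Pareto-smoothed importance sampling
psis_loo$estimates                       # elpd_loo, p_loo, looic with SEs

In practice we simply call loo() on the fitted model, as shown below.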

We will use the same dataset as in the bridge sampling example.

simulate_treatment <- function(a = 3.5,
                               b = -1,
                               sigma_a = 1,
                               sigma_b = 0.8,
                               rho = -0.7,
                               n_subjects = 10,
                               n_trials = 20,
                               sigma = 0.5) {

    # combine the terms into a covariance matrix
    mu <- c(a, b)
    cov_ab <- sigma_a * sigma_b * rho
    SD  <- matrix(c(sigma_a^2, cov_ab,
                    cov_ab, sigma_b^2), ncol = 2)

   #  sigmas <- c(sigma_a, sigma_b)          # standard deviations
   #  rho <- matrix(c(1, rho,             # correlation matrix
   #                 rho, 1), nrow = 2)
   # 
   # # now matrix multiply to get covariance matrix
   # SD <- diag(sigmas) %*% rho %*% diag(sigmas)

    varying_effects <-
        MASS::mvrnorm(n_subjects, mu, SD) |>
        # as_tibble(.name_repair = "unique") |>
        data.frame() |>
        purrr::set_names("a_j", "b_j")

    d_linpred <-
        varying_effects |>
        mutate(subject  = 1:n_subjects) |>
        expand(nesting(subject, a_j, b_j), post = c(0, 1)) |>
        mutate(mu = a_j + b_j * post,
               sigma = sigma) |>
        mutate(treatment = ifelse(post == 0, "pre", "post"),
               treatment = factor(treatment, levels = c("pre", "post")),
               subject = as_factor(subject))

    d <- d_linpred |>
        slice(rep(1:n(), each = n_trials)) |>
        mutate(response = rnorm(n = n(), mean = mu, sd = sigma))

   d
}

plot_linpred <- function(d, violin = FALSE) {
  
  d_linpred <- d |>
  group_by(subject, treatment) |> distinct(mu, .keep_all = TRUE)
  
  p <- d_linpred |>
    ggplot(aes(x = treatment, y = mu))
  
  if (isTRUE(violin)) {
  p <- p + 
      geom_violin(aes(x = treatment, y = response,
                          fill = treatment), 
                  alpha = 0.5,
                  data = d) +
      geom_jitter(aes(x = treatment, y = response,
                          color = treatment), 
                  width = 0.1, size = 2,
                  data = d)
  }
  
  p <- p +
    geom_line(aes(group = 1), color = "#8B9DAF", size = 1, linetype = 3) +
    geom_point(aes(fill = treatment), 
               shape = 21, 
               colour = "black", 
               size = 4, 
               stroke = 2) +
    scale_fill_brewer(type = "qual") +
    scale_color_brewer(type = "qual") +
    coord_cartesian(ylim = c(0, 8)) +
    ylab("Response") +
    theme(legend.position = "none",
          axis.ticks.x    = element_blank()) +
    facet_wrap(~ subject) +
  ggtitle("Linear predictor")
  
  p
}

We have a hierarchical dataset, consisting of 30 subjects, measured at two time points (with 20 trials per time point).

The parameters used to generate the data are:

params <- tribble(~Parameter, ~`Default value`, ~Description,
                  "a", 2, "average pre-treatment effect (intercepts)",
                  "b", 0.5, "average difference between pre and post",
                  "sigma_a", 1, "std dev in intercepts",
                  "sigma_b", 2, "std dev in differences (slopes)",
                  "rho", -0.2, "correlation between intercepts and slopes",
                  "sigma", 0.5, "residual standard deviation") 
params |> 
    kableExtra::kbl() |> 
    kableExtra::kable_paper("hover", full_width = TRUE)
Parameter Default value Description
a 2.0 average pre-treatment effect (intercepts)
b 0.5 average difference between pre and post
sigma_a 1.0 std dev in intercepts
sigma_b 2.0 std dev in differences (slopes)
rho -0.2 correlation between intercepts and slopes
sigma 0.5 residual standard deviation

These are the parameters we would like to recover with our model.

set.seed(867)
d <- simulate_treatment(n_subjects = 30,
                        a = 2, b = 0.5,
                        sigma_a = 1, sigma_b = 1.5,
                        rho = -0.2,
                        sigma = 0.5,
                        n_trials = 20)

We can plot 12 random subjects to get an idea of what the true means look like.

random_subjects <- sample(levels(d$subject), 12)
d |> 
  filter(subject %in% random_subjects) |> plot_linpred()

priors <- prior(normal(0, 1), class = b) +
  prior(lkj(2), class = cor)

fit1 <- brm(response ~ treatment + (treatment | subject),
                    prior = priors,
                    data = d,
                    file = "../../models/fit_loo_1",
                    file_refit = "on_change")
fit1
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: response ~ treatment + (treatment | subject) 
   Data: d (Number of observations: 1200) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~subject (Number of levels: 30) 
                             Estimate Est.Error l-95% CI u-95% CI
sd(Intercept)                    0.94      0.13     0.73     1.23
sd(treatmentpost)                1.61      0.22     1.25     2.11
cor(Intercept,treatmentpost)    -0.01      0.18    -0.35     0.33
                             Rhat Bulk_ESS Tail_ESS
sd(Intercept)                1.00      880     1029
sd(treatmentpost)            1.00      802     1177
cor(Intercept,treatmentpost) 1.00      735     1176

Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept         1.94      0.17     1.60     2.29 1.00      619
treatmentpost     0.65      0.29     0.08     1.20 1.01      805
              Tail_ESS
Intercept          861
treatmentpost     1207

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.50      0.01     0.48     0.53 1.00     4100     2694

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
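As a quick sanity check of parameter recovery (a sketch using standard brms accessors), we can set the posterior summaries against the values used to simulate the data (a = 2, b = 0.5, sigma_a = 1, sigma_b = 1.5, rho = -0.2, sigma = 0.5):

# Population-level effects: Intercept should be close to a, treatmentpost to b
fixef(fit1)
# Group-level SDs and correlation: compare with sigma_a, sigma_b and rho;
# the residual SD is also reported by VarCorr()
VarCorr(fit1)

We now compute the approximate leave-one-out criterion for this model: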
loo_fit1 <- loo(fit1)

loo_fit1

Computed from 4000 by 1200 log-likelihood matrix

         Estimate   SE
elpd_loo   -911.9 24.8
p_loo        59.4  2.4
looic      1823.9 49.6
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

The function loo reports three quantities with their standard errors: elpd_loo, the expected log pointwise predictive density for new data; p_loo, the effective number of parameters; and looic, the LOO information criterion, which is simply -2 * elpd_loo (the deviance scale).
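The same numbers can be extracted programmatically; the loo object stores them in its estimates matrix:

loo_fit1$estimates   # rows: elpd_loo, p_loo, looic; columns: Estimate, SE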

Next we will fit our null model (using more iterations than the default, as per Stan’s recommendation).

fit_null_varying_effects <- brm(response ~ 1 + (treatment | subject),
                    prior = prior(lkj(2), class = cor),
                    data = d,
                    iter = 4000,
                    file = "../../models/fit_loo_null",
                    file_refit = "on_change")
loo_fit_null_varying_effects <- loo(fit_null_varying_effects)

loo_fit_null_varying_effects

Computed from 8000 by 1200 log-likelihood matrix

         Estimate   SE
elpd_loo   -911.5 24.7
p_loo        58.9  2.4
looic      1823.0 49.5
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

To compare the models, we look at the difference in elpd_loo between them, together with the standard error (SE) of that difference:

loo_compare(loo_fit1, loo_fit_null_varying_effects)
                         elpd_diff se_diff
fit_null_varying_effects  0.0       0.0   
fit1                     -0.4       0.3   

The difference in elpd_loo is not larger than twice its standard error (a common rule of thumb), meaning that, from the perspective of LOO, the two models are practically indistinguishable.
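As a quick programmatic version of this rule of thumb (a sketch using the matrix returned by loo_compare), we can check whether the absolute difference exceeds twice its standard error:

comp <- loo_compare(loo_fit1, loo_fit_null_varying_effects)
abs(comp[2, "elpd_diff"]) > 2 * comp[2, "se_diff"]   # FALSE here: no clear winner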

We can also inspect the pointwise elpd contributions to see whether particular observations or conditions drive the (small) difference between the models:

d <- mutate(d,
            n = row_number(),
            diff_elpd = loo_fit1$pointwise[,"elpd_loo"] -
                        loo_fit_null_varying_effects$pointwise[,"elpd_loo"])

d |> 
  ggplot(aes(x = n, y = diff_elpd, color = treatment)) +
  geom_point(alpha = .4, position = position_jitter(w = 0.01, h = 0)) +
  scale_color_brewer(type = "qual")

d |> 
  ggplot(aes(x = treatment, y = diff_elpd, color = treatment)) +
  geom_point(alpha = .4, position = position_jitter(w = 0.1, h = 0)) +
  scale_color_brewer(type = "qual")

We can also compare posterior predictive checks for the two models:

pp_check(fit1)

pp_check(fit_null_varying_effects)

Refitting the same null model a second time (stored under a different file name) and including it in the comparison shows that the difference between two fits of an identical model is essentially zero, which gives a sense of the Monte Carlo noise in these comparisons.

fit_loo_null <- brm(response ~ 1 + (treatment | subject),
                    prior = prior(lkj(2), class = cor),
                    data = d,
                    iter = 4000,
                    file = "../../models/fit_loo_null-2",
                    file_refit = "on_change")
loo_fit_loo_null <- loo(fit_loo_null)
loo_compare(loo_fit1, loo_fit_null_varying_effects, loo_fit_loo_null)
                         elpd_diff se_diff
fit_loo_null              0.0       0.0   
fit_null_varying_effects  0.0       0.1   
fit1                     -0.4       0.3   

Let’s try this again, using a bigger true difference between treatment conditions.

set.seed(867)
d2 <- simulate_treatment(n_subjects = 30,
                        a = 3, b = 2,
                        sigma_a = 1, sigma_b = 1.5,
                        rho = -0.2,
                        sigma = 0.5,
                        n_trials = 20)

We can plot 12 random subjects to get an idea of what the true means look like.

random_subjects <- sample(levels(d2$subject), 12)
d2 |> 
  filter(subject %in% random_subjects) |> plot_linpred()

priors <- prior(normal(0, 1), class = b) +
  prior(lkj(2), class = cor)

fit2 <- brm(response ~ treatment + (treatment | subject),
                    prior = priors,
                    data = d2,
                    file = "../../models/fit_loo_d2",
                    file_refit = "on_change") |> 
  add_criterion("loo")
fit_null_2 <- brm(response ~ 1 + (treatment | subject),
                    prior = prior(lkj(2), class = cor),
                    data = d2,
                    iter = 4000,
                    file = "../../models/fit_loo_null_d2",
                    file_refit = "on_change") |> 
    add_criterion("loo")
loo_compare(fit2, fit_null_2)
           elpd_diff se_diff
fit2        0.0       0.0   
fit_null_2 -0.3       0.5   

Even with a larger true difference between the treatment conditions, the two models remain practically indistinguishable by LOO. This is not too surprising: the null model still contains subject-specific treatment effects in its group-level part, so it predicts held-out observations almost as well as the full model.

pp_check(fit2)

pp_check(fit_null_2)

R^2

The usual definition of \(R^2\), the variance of the predicted values \(\hat{y}\) divided by the variance of the observations \(y\),

\[ \frac{\text{var}(\hat{y})}{\text{var}(y)} \]

has problems as a measure of model fit: it never decreases when parameters are added, so it rewards overfitting, and it does not generalize well beyond single-level Gaussian models.

However, Gelman et al. (2019) have proposed a Bayesian \(R^2\), which is computed as

\[ \frac{\text{Explained variance}}{\text{Explained variance} + \text{Residual variance}} \]
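For a Gaussian model, this quantity can be computed by hand from the posterior draws. The following is a minimal sketch (using the fit2 model and the d2 data from above), mirroring, up to details, what bayes_R2() does in brms:

# Bayesian R^2 per posterior draw: explained variance of the linear predictor
# relative to explained plus residual variance (Gelman et al. 2019)
y        <- d2$response
y_pred   <- posterior_epred(fit2)                       # draws x observations
var_fit  <- apply(y_pred, 1, var)                       # explained variance per draw
var_res  <- apply(y_pred, 1, function(mu) var(y - mu))  # residual variance per draw
r2_draws <- var_fit / (var_fit + var_res)
quantile(r2_draws, probs = c(0.025, 0.5, 0.975))

The built-in bayes_R2() returns a summary of the same quantity: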

bayes_R2(fit2) %>% round(digits = 3)
   Estimate Est.Error  Q2.5 Q97.5
R2    0.924     0.001 0.921 0.927
bayes_R2(fit_null_2) %>% round(digits = 3)
   Estimate Est.Error  Q2.5 Q97.5
R2    0.924     0.001 0.921 0.927

The R package performance also implements both conditional and marginal \(R^2\) for multilevel models.
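For example (a sketch, assuming the performance package is installed), r2_bayes() reports both versions for a brmsfit:

performance::r2_bayes(fit2)   # conditional and marginal Bayesian R2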

Real-world example

As a real-world example we use the sleepstudy data from the lme4 package, which contains daily average reaction times of 18 subjects in a sleep-deprivation study.

library(lme4)
data("sleepstudy")
sleepstudy <- sleepstudy %>% 
  as_tibble() %>% 
  mutate(Subject = as.character(Subject))
ggplot(sleepstudy) + 
  aes(x = Days, y = Reaction) + 
  stat_smooth(method = "lm", se = FALSE) +
  geom_point() +
  facet_wrap(~Subject) +
  labs(x = "Days of sleep deprivation", 
       y = "Average reaction time (ms)") + 
  scale_x_continuous(breaks = 0:4 * 2)

We compare a no-pooling model, in which every subject gets a separate intercept and slope for Days, with a partial-pooling (multilevel) model:

fit_no_pooling <- brm(Reaction ~ Days*Subject,
                      prior = prior(normal(0, 10), class = b),
                      data = sleepstudy,
                      file = "../../models/sleepstudy_no_pooling")
fit_partial_pooling <- brm(Reaction ~ 1 + Days + (1 + Days | Subject),
                      prior = prior(normal(0, 10), class = b),
                      data = sleepstudy,
                      file = "../../models/sleepstudy_partial_pooling")
loo_no_pooling <- loo(fit_no_pooling)
loo_partial_pooling <- loo(fit_partial_pooling)
plot(loo_no_pooling)

loo_compare(loo_no_pooling, loo_partial_pooling)
                    elpd_diff se_diff
fit_partial_pooling  0.0       0.0   
fit_no_pooling      -7.1       6.7   
Finally, we fit a null model that drops the population-level effect of Days but keeps the varying intercepts and slopes:

fit_sleep_null_vary_effects <- brm(Reaction ~ 1 + (1 + Days | Subject),
                      prior = prior(lkj(2), class = cor),
                      data = sleepstudy,
                      file = "../../models/fit_sleep_null_vary_effects")
loo_sleep_null_vary_effects <- loo(fit_sleep_null_vary_effects)
loo_sleep_null_vary_effects

Computed from 4000 by 180 log-likelihood matrix

         Estimate   SE
elpd_loo   -862.5 22.8
p_loo        36.2  9.0
looic      1725.1 45.6
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     174   96.7%   654       
 (0.5, 0.7]   (ok)         3    1.7%   277       
   (0.7, 1]   (bad)        2    1.1%   11        
   (1, Inf)   (very bad)   1    0.6%   9         
See help('pareto-k-diagnostic') for details.
loo_compare(loo_sleep_null_vary_effects, loo_partial_pooling)
                            elpd_diff se_diff
fit_partial_pooling          0.0       0.0   
fit_sleep_null_vary_effects -1.7       1.5   
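Note that a few Pareto k values for the null model are above 0.7, so the importance-sampling approximation is unreliable for those observations. A common remedy (sketched below; reloo refits the model once per flagged observation and can be slow) is to recompute the criterion exactly for the problematic points:

# Exact leave-one-out refits for the observations with high Pareto k:
loo_null_exact <- loo(fit_sleep_null_vary_effects, reloo = TRUE)
# Alternatively, moment matching (requires fitting the model with
# save_pars = save_pars(all = TRUE)):
# loo(fit_sleep_null_vary_effects, moment_match = TRUE)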
References

Gelman, Andrew, Ben Goodrich, Jonah Gabry, and Aki Vehtari. 2019. “R-Squared for Bayesian Regression Models.” The American Statistician 73 (3): 307–9. https://doi.org/10.1080/00031305.2018.1549100.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/awellis/learnmultilevelmodels, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Ellis (2021, June 4). Learn multilevel models: Model comparison. Retrieved from https://awellis.github.io/learnmultilevelmodels/model-comparison.html

BibTeX citation

@misc{ellis2021model,
  author = {Ellis, Andrew},
  title = {Learn multilevel models: Model comparison},
  url = {https://awellis.github.io/learnmultilevelmodels/model-comparison.html},
  year = {2021}
}