Using loo (approximate leave-one-out cross-validation) and Bayesian R squared
Another way to evaluate and compare models is by their ability to make predictions for “out-of-sample data,” that is, future observations, using what we have learned from the observed data. We can use cross-validation to test which of the models under consideration is able to learn the most from our data in order to make better predictions.
The idea is that instead of assessing how well a model can predict the observed data (as with the prior predictive density), we assess how well it can predict new data, based on the posterior predictive density \(\log p(\tilde{y} \mid y, \mathcal{M})\). Essentially, this is a technique for approximating the expected log pointwise predictive density (elpd) using importance sampling (Vehtari, Gelman, and Gabry 2017). A major advantage of cross-validation methods compared with Bayes factors is that the specification of the priors is less critical.
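Concretely, leave-one-out cross-validation estimates
\[ \text{elpd}_{\text{loo}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}), \qquad p(y_i \mid y_{-i}) = \int p(y_i \mid \theta) \, p(\theta \mid y_{-i}) \, d\theta, \]
where \(y_{-i}\) denotes the data without the \(i\)-th observation. The loo package approximates each term with Pareto-smoothed importance sampling instead of refitting the model \(n\) times.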
We will use the same dataset as in the bridge sampling example.
simulate_treatment <- function(a = 3.5,
                               b = -1,
                               sigma_a = 1,
                               sigma_b = 0.8,
                               rho = -0.7,
                               n_subjects = 10,
                               n_trials = 20,
                               sigma = 0.5) {
  # combine the means and construct the covariance matrix
  mu <- c(a, b)
  cov_ab <- sigma_a * sigma_b * rho
  SD <- matrix(c(sigma_a^2, cov_ab,
                 cov_ab, sigma_b^2), ncol = 2)

  # sigmas <- c(sigma_a, sigma_b)    # standard deviations
  # rho <- matrix(c(1, rho,          # correlation matrix
  #                 rho, 1), nrow = 2)
  #
  # # now matrix multiply to get covariance matrix
  # SD <- diag(sigmas) %*% rho %*% diag(sigmas)

  # draw varying intercepts and slopes for each subject
  varying_effects <-
    MASS::mvrnorm(n_subjects, mu, SD) |>
    # as_tibble(.name_repair = "unique") |>
    data.frame() |>
    purrr::set_names("a_j", "b_j")

  # linear predictor for each subject at both time points
  d_linpred <-
    varying_effects |>
    mutate(subject = 1:n_subjects) |>
    expand(nesting(subject, a_j, b_j), post = c(0, 1)) |>
    mutate(mu = a_j + b_j * post,
           sigma = sigma) |>
    mutate(treatment = ifelse(post == 0, "pre", "post"),
           treatment = factor(treatment, levels = c("pre", "post")),
           subject = as_factor(subject))

  # simulate n_trials responses per subject and time point
  d <- d_linpred |>
    slice(rep(1:n(), each = n_trials)) |>
    mutate(response = rnorm(n = n(), mean = mu, sd = sigma))
  d
}
plot_linpred <- function(d, violin = FALSE) {
  d_linpred <- d |>
    group_by(subject, treatment) |>
    distinct(mu, .keep_all = TRUE)

  p <- d_linpred |>
    ggplot(aes(x = treatment, y = mu))

  if (isTRUE(violin)) {
    p <- p +
      geom_violin(aes(x = treatment, y = response,
                      fill = treatment),
                  alpha = 0.5,
                  data = d) +
      geom_jitter(aes(x = treatment, y = response,
                      color = treatment),
                  width = 0.1, size = 2,
                  data = d)
  }

  p <- p +
    geom_line(aes(group = 1), color = "#8B9DAF", size = 1, linetype = 3) +
    geom_point(aes(fill = treatment),
               shape = 21,
               colour = "black",
               size = 4,
               stroke = 2) +
    scale_fill_brewer(type = "qual") +
    scale_color_brewer(type = "qual") +
    coord_cartesian(ylim = c(0, 8)) +
    ylab("Response") +
    theme(legend.position = "none",
          axis.ticks.x = element_blank()) +
    facet_wrap(~ subject) +
    ggtitle("Linear predictor")
  p
}
We have a hierarchical dataset, consisting of 30 subjects, measured at two time points (with 20 trials per time point).
The parameters used to generate the data are:
params <- tribble(~Parameter, ~`Default value`, ~Description,
                  "a", 2, "average pre-treatment effect (intercepts)",
                  "b", 0.5, "average difference between pre and post",
                  "sigma_a", 1, "std dev in intercepts",
                  "sigma_b", 2, "std dev in differences (slopes)",
                  "rho", -0.2, "correlation between intercepts and slopes",
                  "sigma", 0.5, "residual standard deviation")

params |>
  kableExtra::kbl() |>
  kableExtra::kable_paper("hover", full_width = T)
Parameter | Default value | Description |
---|---|---|
a | 2.0 | average pre-treatment effect (intercepts) |
b | 0.5 | average difference between pre and post |
sigma_a | 1.0 | std dev in intercepts |
sigma_b | 2.0 | std dev in differences (slopes) |
rho | -0.2 | correlation between intercepts and slopes |
sigma | 0.5 | residual standard deviation |
These are the parameters we would like to recover using our model.
set.seed(867)
d <- simulate_treatment(n_subjects = 30,
                        a = 2, b = 0.5,
                        sigma_a = 1, sigma_b = 1.5,
                        rho = -0.2,
                        sigma = 0.5,
                        n_trials = 20)
We can plot 12 random subjects to get an idea of what the true means look like.
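One way to do this, as a sketch using the plot_linpred() helper defined above (the original shows only the resulting figure, so the exact call is an assumption):

d |>
  filter(subject %in% sample(levels(d$subject), 12)) |>
  plot_linpred(violin = TRUE)

The summary printed below comes from a model with varying intercepts and slopes per subject. The fitting code is not shown in this excerpt; a minimal sketch, assuming brms defaults for priors and sampler settings (consistent with the 4 chains and 2000 iterations reported below), would be:

fit1 <- brm(response ~ treatment + (treatment | subject),
            data = d)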
fit1
Family: gaussian
Links: mu = identity; sigma = identity
Formula: response ~ treatment + (treatment | subject)
Data: d (Number of observations: 1200)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~subject (Number of levels: 30)
Estimate Est.Error l-95% CI u-95% CI
sd(Intercept) 0.94 0.13 0.73 1.23
sd(treatmentpost) 1.61 0.22 1.25 2.11
cor(Intercept,treatmentpost) -0.01 0.18 -0.35 0.33
Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.00 880 1029
sd(treatmentpost) 1.00 802 1177
cor(Intercept,treatmentpost) 1.00 735 1176
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept 1.94 0.17 1.60 2.29 1.00 619
treatmentpost 0.65 0.29 0.08 1.20 1.01 805
Tail_ESS
Intercept 861
treatmentpost 1207
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.50 0.01 0.48 0.53 1.00 4100 2694
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
loo_fit1 <- loo(fit1)
loo_fit1
Computed from 4000 by 1200 log-likelihood matrix
Estimate SE
elpd_loo -911.9 24.8
p_loo 59.4 2.4
looic 1823.9 49.6
------
Monte Carlo SE of elpd_loo is 0.1.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
The function loo reports three quantities with their standard errors:
- elpd_loo is the sum of the pointwise predictive accuracy (a larger, i.e. less negative, number indicates better predictions).
- p_loo is an estimate of the effective complexity of the model; it can be interpreted as the effective number of parameters. If p_loo is larger than the number of data points or parameters, this may indicate severe model misspecification.
- looic is simply -2 * elpd_loo. It is reported mainly for historical reasons: it converts the elpd to the deviance scale, the same scale as AIC and DIC.
Next, we fit our null model (using more iterations than the default, as per Stan’s recommendation).
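The fitting code for the null model is not shown in this excerpt; a minimal sketch, assuming the same structure as the fit_null_2 model used later for the second dataset:

fit_null_varying_effects <- brm(response ~ 1 + (treatment | subject),
                                prior = prior(lkj(2), class = cor),
                                data = d,
                                iter = 4000)  # 4000 iterations per chain gives the 8000 post-warmup draws reported below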
loo_fit_null_varying_effects <- loo(fit_null_varying_effects)
loo_fit_null_varying_effects
Computed from 8000 by 1200 log-likelihood matrix
Estimate SE
elpd_loo -911.5 24.7
p_loo 58.9 2.4
looic 1823.0 49.5
------
Monte Carlo SE of elpd_loo is 0.1.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
To compare the models, we look at the difference in elpd_loo between them, together with the standard error (SE) of that difference:
loo_compare(loo_fit1, loo_fit_null_varying_effects)
elpd_diff se_diff
fit_null_varying_effects 0.0 0.0
fit1 -0.4 0.3
The difference in elpd_loo is not larger than two times its standard error (a common rule of thumb), meaning that from the perspective of LOO the two models are practically indistinguishable.
We can also inspect the pointwise elpd differences between the two models:

d <- mutate(d,
            n = row_number(),
            diff_elpd = loo_fit1$pointwise[, "elpd_loo"] -
              loo_fit_null_varying_effects$pointwise[, "elpd_loo"])

d |>
  ggplot(aes(x = n, y = diff_elpd, color = treatment)) +
  geom_point(alpha = .4, position = position_jitter(w = 0.01, h = 0)) +
  scale_color_brewer(type = "qual")

d |>
  ggplot(aes(x = treatment, y = diff_elpd, color = treatment)) +
  geom_point(alpha = .4, position = position_jitter(w = 0.1, h = 0)) +
  scale_color_brewer(type = "qual")
pp_check(fit1)
pp_check(fit_null_varying_effects)
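pp_check() overlays densities of data simulated from the posterior predictive distribution on the observed data. A grouped summary-statistic check can make the comparison more targeted, for example (a variant not shown in the original):

pp_check(fit1, type = "stat_grouped", stat = "mean", group = "treatment")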
We can also include a third model, fit_loo_null, in the comparison:
loo_fit_loo_null <- loo(fit_loo_null)
loo_compare(loo_fit1, loo_fit_null_varying_effects, loo_fit_loo_null)
elpd_diff se_diff
fit_loo_null 0.0 0.0
fit_null_varying_effects 0.0 0.1
fit1 -0.4 0.3
Let’s try this again, using a bigger true difference between treatment conditions.
set.seed(867)
d2 <- simulate_treatment(n_subjects = 30,
                         a = 3, b = 2,
                         sigma_a = 1, sigma_b = 1.5,
                         rho = -0.2,
                         sigma = 0.5,
                         n_trials = 20)
We can plot 12 random subjects to get an idea of what the true means look like.
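As before, a possible sketch (the original shows only the resulting figure):

d2 |>
  filter(subject %in% sample(levels(d2$subject), 12)) |>
  plot_linpred(violin = TRUE)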
priors <- prior(normal(0, 1), class = b) +
  prior(lkj(2), class = cor)

fit2 <- brm(response ~ treatment + (treatment | subject),
            prior = priors,
            data = d2,
            file = "../../models/fit_loo_d2",
            file_refit = "on_change") |>
  add_criterion("loo")

fit_null_2 <- brm(response ~ 1 + (treatment | subject),
                  prior = prior(lkj(2), class = cor),
                  data = d2,
                  iter = 4000,
                  file = "../../models/fit_loo_null_d2",
                  file_refit = "on_change") |>
  add_criterion("loo")
loo_compare(fit2, fit_null_2)
elpd_diff se_diff
fit2 0.0 0.0
fit_null_2 -0.3 0.5
pp_check(fit2)
pp_check(fit_null_2)
The usual definition of \(R^2\) (the variance of the predicted values \(\hat{y}\) divided by the variance of the observations \(y\)) has problems as a measure of model fit: it always favours models with more parameters (it overfits), and it does not generalize well beyond the single-level Gaussian framework.
However, Gelman et al. (2019) have proposed a Bayesian \(R^2\), which is computed as
\[ \frac{\text{Explained variance}}{\text{Explained variance} + \text{Residual variance}} \]
The classical \(R^2\) is given by
\[ \frac{\text{var}(\hat{y})}{\text{var}(y)} \]
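In the Bayesian version, the explained and residual variances are computed separately for each posterior draw \(s\), which yields a posterior distribution for \(R^2\) rather than a single number. A sketch of the definition from Gelman et al. (2019), using the modelled residual variance (for a Gaussian model, \(\sigma_s^2\)):
\[ R^2_s = \frac{\text{Var}(\hat{y}_s)}{\text{Var}(\hat{y}_s) + \sigma_s^2} \]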
The R package performance also implements both conditional and marginal \(R^2\) for multilevel models.
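For example (a sketch; the output is not shown in the original):

# posterior distribution of the Bayesian R^2 from brms
bayes_R2(fit2)

# conditional and marginal R^2 for the multilevel model
performance::r2_bayes(fit2)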
To illustrate this, we use the sleepstudy dataset from the lme4 package (average reaction times over days of sleep deprivation).

sleepstudy <- sleepstudy %>%
  as_tibble() %>%
  mutate(Subject = as.character(Subject))

ggplot(sleepstudy) +
  aes(x = Days, y = Reaction) +
  stat_smooth(method = "lm", se = FALSE) +
  geom_point() +
  facet_wrap(~Subject) +
  labs(x = "Days of sleep deprivation",
       y = "Average reaction time (ms)") +
  scale_x_continuous(breaks = 0:4 * 2)
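The no-pooling and partial-pooling fits compared below are not shown in this excerpt. A plausible specification (an assumption; the exact formulas used by the author may differ) is:

# no pooling: a separate intercept and slope for every subject
fit_no_pooling <- brm(Reaction ~ 0 + Subject + Subject:Days,
                      data = sleepstudy)

# partial pooling: varying intercepts and slopes drawn from a common distribution
fit_partial_pooling <- brm(Reaction ~ 1 + Days + (1 + Days | Subject),
                           data = sleepstudy)

loo_no_pooling <- loo(fit_no_pooling)
loo_partial_pooling <- loo(fit_partial_pooling)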
plot(loo_no_pooling)
loo_compare(loo_no_pooling, loo_partial_pooling)
elpd_diff se_diff
fit_partial_pooling 0.0 0.0
fit_no_pooling -7.1 6.7
loo_sleep_null_vary_effects <- loo(fit_sleep_null_vary_effects)
loo_sleep_null_vary_effects
Computed from 4000 by 180 log-likelihood matrix
Estimate SE
elpd_loo -862.5 22.8
p_loo 36.2 9.0
looic 1725.1 45.6
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 174 96.7% 654
(0.5, 0.7] (ok) 3 1.7% 277
(0.7, 1] (bad) 2 1.1% 11
(1, Inf) (very bad) 1 0.6% 9
See help('pareto-k-diagnostic') for details.
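Some Pareto k values are above 0.7, so the importance-sampling approximation is unreliable for those observations and this elpd_loo estimate should be treated with caution. One remedy (not shown in the original) is to recompute the flagged observations, either via moment matching or by refitting the model without them:

# moment matching requires the model to be fitted with save_pars = save_pars(all = TRUE)
loo(fit_sleep_null_vary_effects, moment_match = TRUE)

# or, more expensive but exact for the flagged observations:
loo(fit_sleep_null_vary_effects, reloo = TRUE)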
loo_compare(loo_sleep_null_vary_effects, loo_partial_pooling)
elpd_diff se_diff
fit_partial_pooling 0.0 0.0
fit_sleep_null_vary_effects -1.7 1.5