A general method for estimating Bayes factors
While the Savage-Dickey method is very useful for nested models, it is restricted to model comparisons between nested models, and it is not the most robust method.
We will look at an alternative method for obtaining Bayes factor, one that is both more reliable, and can be applied to model comparisons between non-nested models (as long as they are fit to the same data).
We will first simulate a dataset, using the function we defined in session 2.
simulate_treatment <- function(a = 3.5,
b = -1,
sigma_a = 1,
sigma_b = 0.8,
rho = -0.7,
n_subjects = 10,
n_trials = 20,
sigma = 0.5) {
# Ccombine the terms
mu <- c(a, b)
cov_ab <- sigma_a * sigma_b * rho
SD <- matrix(c(sigma_a^2, cov_ab,
cov_ab, sigma_b^2), ncol = 2)
# sigmas <- c(sigma_a, sigma_b) # standard deviations
# rho <- matrix(c(1, rho, # correlation matrix
# rho, 1), nrow = 2)
#
# # now matrix multiply to get covariance matrix
# SD <- diag(sigmas) %*% rho %*% diag(sigmas)
varying_effects <-
MASS::mvrnorm(n_subjects, mu, SD) |>
# as_tibble(.name_repair = "unique") |>
data.frame() |>
purrr::set_names("a_j", "b_j")
d_linpred <-
varying_effects |>
mutate(subject = 1:n_subjects) |>
expand(nesting(subject, a_j, b_j), post = c(0, 1)) |>
mutate(mu = a_j + b_j * post,
sigma = sigma) |>
mutate(treatment = ifelse(post == 0, "pre", "post"),
treatment = factor(treatment, levels = c("pre", "post")),
subject = as_factor(subject))
d <- d_linpred |>
slice(rep(1:n(), each = n_trials)) |>
mutate(response = rnorm(n = n(), mean = mu, sd = sigma))
d
}
plot_linpred <- function(d, violin = FALSE) {
d_linpred <- d |>
group_by(subject, treatment) |> distinct(mu, .keep_all = TRUE)
p <- d_linpred |>
ggplot(aes(x = treatment, y = mu))
if (isTRUE(violin)) {
p <- p +
geom_violin(aes(x = treatment, y = response,
fill = treatment),
alpha = 0.5,
data = d) +
geom_jitter(aes(x = treatment, y = response,
color = treatment),
width = 0.1, size = 2,
data = d)
}
p <- p +
geom_line(aes(group = 1), color = "#8B9DAF", size = 1, linetype = 3) +
geom_point(aes(fill = treatment),
shape = 21,
colour = "black",
size = 4,
stroke = 2) +
scale_fill_brewer(type = "qual") +
scale_color_brewer(type = "qual") +
coord_cartesian(ylim = c(0, 8)) +
ylab("Response") +
theme(legend.position = "none",
axis.ticks.x = element_blank()) +
facet_wrap(~ subject) +
ggtitle("Linear predictor")
p
}
We will create a hierarchical dataset, consisting of 30 subjects, measured at two time points (with 20 trials per time point).
The parameters used to generate the data are:
params <- tribble(~Parameter, ~`Default value`, ~Desccription,
"a", 2, "average pre-treatment effect (intercepts)",
"b", 0.5, "average difference between pre and post",
"sigma_a", 1, "std dev in intercepts",
"sigma_b", 2, "std dev in differences (slopes)",
"rho", -0.2, "correlation between intercepts and slopes",
"sigma", 0.5, "residual standard deviation")
params |>
kableExtra::kbl() |>
kableExtra::kable_paper("hover", full_width = T)
Parameter | Default value | Desccription |
---|---|---|
a | 2.0 | average pre-treatment effect (intercepts) |
b | 0.5 | average difference between pre and post |
sigma_a | 1.0 | std dev in intercepts |
sigma_b | 2.0 | std dev in differences (slopes) |
rho | -0.2 | correlation between intercepts and slopes |
sigma | 0.5 | residual standard deviation |
These are the parameter we would like to recover using our model.
set.seed(867)
d <- simulate_treatment(n_subjects = 30,
a = 2, b = 0.5,
sigma_a = 1, sigma_b = 1.5,
rho = -0.2,
sigma = 0.5,
n_trials = 20)
We can plot 12 random subjects to get an idea of what the true means look like.
We will now attempt to estimate the parameter, using the model given by
\[\begin{align*} \text{response}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \alpha_{\text{subject}_i} + \beta_{\text{subject}_i} \text{treatment}_i \\ \begin{bmatrix} \alpha_\text{subject} \\ \beta_\text{treatment} \end{bmatrix} & \sim \text{MVNormal} \left (\begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \mathbf{S} \right ) \\ \mathbf S & = \begin{bmatrix} \sigma_\alpha & 0 \\ 0 & \sigma_\beta \end{bmatrix} \mathbf R \begin{bmatrix} \sigma_\alpha & 0 \\ 0 & \sigma_\beta \end{bmatrix} \\ \end{align*}\]
We are assuming that our respone variable is conditionally normally distributed, with mean \(\mu\) and standard deviation \(\sigma\). We will predict the mean using a linear model, i.e. as a function of the treatments (pre and post).
The treatment variable is a factor with two levels, so we can use treamtent coding (the default in R) to give us an intercept representing the reference category, and an indicator variable representing the difference between the reference and non-reference categories.
levels(d$treatment)
[1] "pre" "post"
The first level of the variable will automatiaclly be used as the reference.
contrasts(d$treatment)
post
pre 0
post 1
In the output of the contrasts()
function the column post
shows the indicator variable that we will get, and the rows show the values it takes when the trial is from the pre or post condition, respectively.
Our formula will look like this: response ~ 1 + treatment
. However, don’t simply want to estimate the avarage effect using a complete pooling model—we want to account for the fatc that every subject has their of mean response in the pre condition, and their own effect of the treatment, i.e. difference between pre and post treatment conditions.
We know (because we generated the data), that the average response in the pre condition is \(\alpha = 2\) and the average difference between pre and post is \(\beta = 0.5\), but if we look at the individual subjects, each one has their own effects \(a_j\) and \(b_j\). We want to simultaneously estimate the average effects, and the individual effects.
d |>
group_by(subject) |> slice_head(n = 1)
# A tibble: 30 x 8
# Groups: subject [30]
subject a_j b_j post mu sigma treatment response
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 1 2.22 1.39 0 2.22 0.5 pre 1.96
2 2 0.0683 1.11 0 0.0683 0.5 pre -0.213
3 3 3.46 -0.441 0 3.46 0.5 pre 3.12
4 4 1.61 2.25 0 1.61 0.5 pre 2.21
5 5 0.844 0.496 0 0.844 0.5 pre 1.46
6 6 1.52 0.510 0 1.52 0.5 pre 1.23
7 7 2.13 0.436 0 2.13 0.5 pre 2.02
8 8 1.75 1.92 0 1.75 0.5 pre 0.653
9 9 2.20 -2.16 0 2.20 0.5 pre 2.23
10 10 1.12 2.31 0 1.12 0.5 pre 1.01
# … with 20 more rows
This is what the lines
\[ \mu_i = \alpha_{\text{subject}_i} + \beta_{\text{subject}_i} \text{treatment}_i \]
\[\begin{align*} \begin{bmatrix} \alpha_\text{subject} \\ \beta_\text{treatment} \end{bmatrix} & \sim \text{MVNormal} \left (\begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \mathbf{S} \right ) \end{align*}\]
are referring to. Each set of subject=specific \(a_j\) and \(b_j\) parameters is jointly drawn from a multivariate normal, with average effects \(\alpha\) and \(\beta\), and a covariance matrix \(\Sigma\). In diagonal, \(\Sigma\) contains the standard deviations, which tell us how much variation there is between subjects around the average effects. Lastly, we can estimate the correlation between \(a_j\) and \(b_j\); a negative correlation, for eaxmple, would imply that a large value in the pre condition would be associated with a smaller difference between pre and post.
The equations can be expressed as an R formula
response ~ treatment + (treatment | subject)
response ~ treatment + (treatment | subject)
If we were doing frequentist maximum likelook estimation, we would use this formula with the lmer()
function from the lme4
package.
fit_lme4 <- lme4::lmer(response ~ treatment + (treatment | subject),
data = d, REML = TRUE)
summary(fit_lme4)
Linear mixed model fit by REML ['lmerMod']
Formula: response ~ treatment + (treatment | subject)
Data: d
REML criterion at convergence: 2042.2
Scaled residuals:
Min 1Q Median 3Q Max
-3.7884 -0.6602 -0.0186 0.6604 2.7987
Random effects:
Groups Name Variance Std.Dev. Corr
subject (Intercept) 0.7934 0.8908
treatmentpost 2.3808 1.5430 -0.02
Residual 0.2538 0.5038
Number of obs: 1200, groups: subject, 30
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.9472 0.1639 11.879
treatmentpost 0.6861 0.2832 2.423
Correlation of Fixed Effects:
(Intr)
treatmntpst -0.027
Can you identify all the parameters in the output, and find the corresponding true parameter values?
If you want p-values, you can use lmerTest::lmer()
, because lme4
doesn’t report p-values.
fit_lmer <- lmerTest::lmer(response ~ treatment + (treatment | subject),
data = d)
summary(fit_lmer)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: response ~ treatment + (treatment | subject)
Data: d
REML criterion at convergence: 2042.2
Scaled residuals:
Min 1Q Median 3Q Max
-3.7884 -0.6602 -0.0186 0.6604 2.7987
Random effects:
Groups Name Variance Std.Dev. Corr
subject (Intercept) 0.7934 0.8908
treatmentpost 2.3808 1.5430 -0.02
Residual 0.2538 0.5038
Number of obs: 1200, groups: subject, 30
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.9472 0.1639 29.0000 11.879 1.16e-12 ***
treatmentpost 0.6861 0.2832 28.9999 2.423 0.0219 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
treatmntpst -0.027
Since we are Bayesian now, we would like to obtain posterior distributions for all the parameters. This means that we need priors. Let’s ask brms what priors we can set in this model, and what the defaults are.
get_prior(response ~ treatment + (treatment | subject),
data = d) |>
as_tibble() |> select(1:4)
# A tibble: 10 x 4
prior class coef group
<chr> <chr> <chr> <chr>
1 "" b "" ""
2 "" b "treatmentpost" ""
3 "lkj(1)" cor "" ""
4 "" cor "" "subject"
5 "student_t(3, 2.2, 2.5)" Intercept "" ""
6 "student_t(3, 0, 2.5)" sd "" ""
7 "" sd "" "subject"
8 "" sd "Intercept" "subject"
9 "" sd "treatmentpost" "subject"
10 "student_t(3, 0, 2.5)" sigma "" ""
The defaults are acceptable, except for the flat prior of the average difference between pre and post, treatmentpost
. This is not a good idea when estimating the parameter, but we could estimate the parameter from the data. When the goal is to obtain a Bayes factor, however, a flat prior makes it impossible to sample from the prior, and we need to set an informative prior.
\[\begin{align*} \alpha & \sim \operatorname{Student}(3, 2.7, 2.5) \\ \beta & \sim \operatorname{Uniform}(-\infty, \infty) \\ \sigma & \sim \operatorname{Half-Student}(3, 0, 2.5) \\ \sigma_\alpha & \sim \operatorname{Half-Student}(3, 0, 2.5) \\ \sigma_\beta & \sim \operatorname{Half-Student}(3, 0, 2.5) \\ \mathbf R & \sim \operatorname{LKJcorr}(1) \end{align*}\]
We well therefore set a normal(0, 1)
prior on the difference. This expresses our prior belief that the difference will lie between -2 and 2 with 95% certainty.
Do you think this is a good prior?
Additionallu, we will set an LKJ(2)
prior on the correlation between intercepts and slopes, meaning that we expect the correlation to be centred at 0, and favouring smaller values, rather than being uniform between \([-1, 1]\).
\[\begin{align*} \alpha & \sim \operatorname{Student}(3, 2.7, 2.5) \\ \beta & \sim \operatorname{Normal}(0, 11) \\ \sigma & \sim \operatorname{HalfCauchy}(0, 1) \\ \sigma_\alpha & \sim \operatorname{Half-Student}(3, 0, 2.5) \\ \sigma_\beta & \sim \operatorname{Half-Student}(3, 0, 2.5) \\ \mathbf R & \sim \operatorname{LKJcorr}(2) \end{align*}\]
Now we are ready to fit the model. We will now make sure that we save all the parameters using save_pars = save_pars(all = TRUE),
, even those that are not save by default. This is necessary if we want to estimate the marginal likelhood using bridge sampling. The other thing is that now need to run the model for longer, i.e. we need many more iterations. A rule of thumb is a 10 fold increase in the number of iterations, but here we will settle for 10’000.
summary(fit_twosided)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: response ~ treatment + (treatment | subject)
Data: d (Number of observations: 1200)
Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
total post-warmup samples = 20000
Group-Level Effects:
~subject (Number of levels: 30)
Estimate Est.Error l-95% CI u-95% CI
sd(Intercept) 0.94 0.14 0.72 1.25
sd(treatmentpost) 1.62 0.23 1.25 2.14
cor(Intercept,treatmentpost) -0.01 0.18 -0.36 0.33
Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.00 3676 5219
sd(treatmentpost) 1.00 3944 6364
cor(Intercept,treatmentpost) 1.00 3285 4596
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept 1.95 0.17 1.61 2.29 1.00 2585
treatmentpost 0.64 0.28 0.08 1.19 1.00 2884
Tail_ESS
Intercept 4662
treatmentpost 5648
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.50 0.01 0.48 0.53 1.00 20542 13834
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Can you identify all the parameters?
fixef(fit_twosided)
Estimate Est.Error Q2.5 Q97.5
Intercept 1.9468557 0.1738593 1.61017682 2.288157
treatmentpost 0.6412073 0.2825676 0.07713288 1.185885
mcmc_plot(fit_twosided)
conditional_effects(fit_twosided, dpar = "mu")
The posterior distribution of the parameter of interest gives us a 95% credible interval that is positive, but now we want to quantify evidence for the altervative hypothesis.
The next step is to specify an explicit null model. In this case, our null model is nested within the alternative, with the parameter representing the difference between treatments fixed to zero.
We can specify this model (with the varying treatment effects)
and the model without botj population-level and varying treatment effects.
We can now obtain the marginal log-likelihoods for these models using the bridge_sampler()
function.
margLogLik_twosided <- bridge_sampler(fit_twosided,
silent = TRUE)
margLogLik_null_varying_effects <- bridge_sampler(fit_null_varying_effects,
silent = TRUE)
margLogLik_null <- bridge_sampler(fit_null,
silent = TRUE)
Bayes factors can then be computed manually, or using the bayes_factor()
function.
exp(margLogLik_twosided$logml - margLogLik_null_varying_effects$logml)
[1] 3.373619
BF_twosided_null <- bayes_factor(margLogLik_twosided, margLogLik_null_varying_effects)
BF_twosided_null
Estimated Bayes factor in favor of x1 over x2: 3.37362
This gives us a Bayes factor in favour of the model that includes the difference between treatments of approximately 3.33.
You can also directly compute the Bayes factor without the intermediate step.
bayes_factor(fit_twosided, fit_null_varying_effects)
The same Bayes factor can be obtained via the Savage-Dickey density ration method:
fit_twosided_SD <- brm(response ~ treatment + (treatment | subject),
prior = priors,
data = d,
sample_prior = TRUE,
iter = 2000,
file = "models/fit_twosided_SD",
file_refit = "on_change")
BF_01 <- fit_twosided_SD |> hypothesis("treatmentpost = 0")
BF_10 <- 1/BF_01$hypothesis$Evid.Ratio
BF_10
[1] 3.05984
This results in a Bayes factor that is very similar to the one obtained via bridge sampling.
For comparison, let’s look at traditional ways of analyzing these data. The most common way to do this would be to perfom the data analysis at two levels. At the first level, we sumarise the data per subject and condition, aggregating over trials (Here we are ignoring our uncertainty in the subject means).
d_sum <- d |>
group_by(subject, treatment) |>
summarise(response = mean(response))
d_sum
# A tibble: 60 x 3
# Groups: subject [30]
subject treatment response
<fct> <fct> <dbl>
1 1 pre 2.26
2 1 post 3.55
3 2 pre 0.198
4 2 post 1.34
5 3 pre 3.35
6 3 post 3.12
7 4 pre 1.73
8 4 post 4.13
9 5 pre 0.777
10 5 post 1.19
# … with 50 more rows
Subsequently, we perform a statistical test at the level 2, that at the group level, using the subject-specific means.
t.test(response ~ treatment, data = d_sum, var.equal = FALSE)
Welch Two Sample t-test
data: response by treatment
t = -1.8925, df = 42.982, p-value = 0.06518
alternative hypothesis: true difference in means between group pre and group post is not equal to 0
95 percent confidence interval:
-1.41724003 0.04504785
sample estimates:
mean in group pre mean in group post
1.947236 2.633333
Call:
lm(formula = response ~ treatment, data = d_sum)
Residuals:
Min 1Q Median 3Q Max
-3.6903 -0.7625 0.1156 0.7018 5.0725
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9472 0.2564 7.596 2.92e-10 ***
treatmentpost 0.6861 0.3625 1.892 0.0634 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.404 on 58 degrees of freedom
Multiple R-squared: 0.05816, Adjusted R-squared: 0.04192
F-statistic: 3.581 on 1 and 58 DF, p-value: 0.06342
summary(fit_lmer)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: response ~ treatment + (treatment | subject)
Data: d
REML criterion at convergence: 2042.2
Scaled residuals:
Min 1Q Median 3Q Max
-3.7884 -0.6602 -0.0186 0.6604 2.7987
Random effects:
Groups Name Variance Std.Dev. Corr
subject (Intercept) 0.7934 0.8908
treatmentpost 2.3808 1.5430 -0.02
Residual 0.2538 0.5038
Number of obs: 1200, groups: subject, 30
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.9472 0.1639 29.0000 11.879 1.16e-12 ***
treatmentpost 0.6861 0.2832 28.9999 2.423 0.0219 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
treatmntpst -0.027
Follow all the steps shown above, but vary the parameters of the data generating function. You can set the true difference between treatment conditions to be zero, for example, or increase the standard deviation of the subjects’ difference parameters.
For example, setting the true difference to be zero (shown here for only 12 subjects—you should use more).
dex1 <- simulate_treatment(b = 0, n_subjects = 12)
plot_linpred(dex1)
Setting the true difference to be zero and increasing the variability between subjects.
dex2 <- simulate_treatment(b = 0, sigma_b = 2.5, n_subjects = 12)
plot_linpred(dex2)
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/awellis/learnmultilevelmodels, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Ellis (2021, June 4). Learn multilevel models: Estimating a Bayes factor using bridge sampling. Retrieved from https://awellis.github.io/learnmultilevelmodels/bayes-factor-bridgesampling.html
BibTeX citation
@misc{ellis2021estimating, author = {Ellis, Andrew}, title = {Learn multilevel models: Estimating a Bayes factor using bridge sampling}, url = {https://awellis.github.io/learnmultilevelmodels/bayes-factor-bridgesampling.html}, year = {2021} }