class: center, middle, inverse, title-slide

# Model comparison
## Part 1: Bayes factors
### Andrew Ellis
### Bayesian multilevel modelling workshop 2021
### 5-29-2021

---
layout: true

<!-- Home icon -->

<div class="my-footer">
<span>
<a href="https://awellis.github.io/learnmultilevelmodels/" target="_blank">
</a>
Graduate School workshop 2021
</span>
</div>

<!-- Name (left) -->
<!-- <div class="my-footer"> -->
<!-- <span> -->
<!-- Andrew Ellis - <a href="https://kogpsy.github.io/neuroscicomplab" target="_blank">kogpsy.github.io/neuroscicomplab</a> -->
<!-- </span> -->
<!-- </div> -->

<!-- slide separator (for xaringan) -->
---

```r
library(tidyverse)
library(rmarkdown)
library(kableExtra)
library(countdown)

theme_set(theme_grey(base_size = 14) +
            theme(panel.grid = element_blank()))
```

## Introduction

.pull-left[
So far we have

- estimated parameters
- summarised posterior distributions
    + credible intervals
    + point estimates
]

--

.pull-right[
We would like to be able to compare models:

- e.g. model 1 explains the data better than model 2
- model 1 has a higher probability than model 2
]

---

.pull-left[
.discussion[
Which model comparison methods do you know of?

- Look at the example. How would you go about showing that the intervention has an effect?
]
]

.pull-right[
We have a within factor (`time`) and a between factor (`intervention`).

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-3-1.png" width="100%" />
]
```r
countdown(minutes = 3)
```
---

## Possible methods

- hypothesis testing, e.g.

.pull-left[

```r
t.test(diff ~ intervention, data = dwide)
```
]

.pull-right[

```r
m1 <- lmer(score ~ intervention + time + (1|id), data = d)
m2 <- lmer(score ~ intervention * time + (1|id), data = d)
```
]

- model comparison
    - explained variance
    - information criteria
    - cross validation

--

Let's look at Bayesian methods for model comparison.

---
class: middle

.pull-left-narrow[
  .huge-blue-number[1]
]
.pull-right-wide[
.larger[
Theory
]
]

---

## Bayesian model comparison

We'll look at three methods:

1) Bayes factors

2) Out-of-sample predictive accuracy: approximate leave-one-out cross-validation (LOO)

3) Posterior predictive checking using test statistics (posterior predictive p-values)

--

Let's begin with Bayes factors. These are not everybody's cup of tea; some statisticians don't like them, whereas they are popular among psychologists, who often want to test hypotheses.

---

.panelset[
.panel[.panel-name[Bayesian workflow]
<img src="images/Bayesian-workflow-1.png" width="100%" />
]
.panel[.panel-name[Posterior evaluation]
<img src="images/Bayesian-workflow-2.png" width="80%" />
]
.panel[.panel-name[Model comparison]
<img src="images/Bayesian-workflow-3.png" width="80%" />
]
]

---

## Bayes factor

Bayes' theorem, explicitly showing the dependency of the parameters `\(\mathbf{\theta}\)` on the model `\(\mathcal{M}\)`:

$$ p(\theta | y, \mathcal{M}) = \frac{p(y|\theta, \mathcal{M}) p(\theta | \mathcal{M})}{p(y | \mathcal{M})}$$

`\(\mathcal{M}\)` refers to a specific model.

--

`\(p(y | \mathcal{M})\)` is the probability of the data, averaged over all possible parameter values of model `\(\mathcal{M}\)`. It is known as the marginal likelihood.

--

.panelset[
.panel[.panel-name[Bayes theorem]

$$ P(\theta|Data) = \frac{ P(Data|\theta) * P(\theta) } {P(Data)} $$
]
.panel[.panel-name[Bayes theorem without marginal likelihood]

$$ P(\theta|Data) \propto P(Data|\theta) * P(\theta) $$
]
.panel[.panel-name[Marginal likelihood]

$$ p(y | \mathcal{M}) = \int_{\theta}{p(y | \theta, \mathcal{M}) p(\theta|\mathcal{M})d\theta} $$

To compute the marginal likelihood, we have to average over all values of `\(\theta\)`.
]
]

---

## Complexity

- The marginal likelihood is also known as the __model evidence__. It depends on what kinds of predictions a model can make.
- A model that makes many predictions is a complex model.

--

Complexity depends on

- the number of parameters
- the prior distributions of these parameters

--

In frequentist models priors don't exist, so complexity depends only on the number of parameters.

---

## Prior distributions

Uninformative priors make many predictions, each with low likelihood.

```r
n_points <- 100
theta_grid <- seq(from = 0, to = 1, length.out = n_points)
likelihood <- dbinom(6, size = 9, prob = theta_grid)

compute_posterior <- function(likelihood, prior) {
  unstandardized_posterior <- likelihood * prior
  posterior <- unstandardized_posterior / sum(unstandardized_posterior)
  par(mfrow = c(1, 3))
  plot(theta_grid, prior, type = "l", main = "Prior",
       col = "dodgerblue3", lwd = 2)
  plot(theta_grid, likelihood, type = "l", main = "Likelihood",
       col = "firebrick3", lwd = 2)
  plot(theta_grid, posterior, type = "l", main = "Posterior",
       col = "darkorchid3", lwd = 2)
}
```

---

## Prior distributions

Uninformative priors make many predictions, each with low likelihood.

.panelset[
.panel[.panel-name[Uniform prior]

```r
prior <- dbeta(x = theta_grid, shape1 = 1, shape2 = 1)
compute_posterior(likelihood, prior)
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-11-1.png" width="100%" />
]
.panel[.panel-name[Informative prior]

```r
prior <- dbeta(x = theta_grid, shape1 = 20, shape2 = 20)
compute_posterior(likelihood, prior)
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-12-1.png" width="100%" />
]
.panel[.panel-name[Opinionated prior (wrong)]

```r
prior <- dbeta(x = theta_grid, shape1 = 2, shape2 = 40)
compute_posterior(likelihood, prior)
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-13-1.png" width="100%" />
]
.panel[.panel-name[Opinionated prior (better)]

```r
prior <- dbeta(x = theta_grid, shape1 = 48, shape2 = 30)
compute_posterior(likelihood, prior)
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-14-1.png" width="100%" />
]
]
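---

## Priors and the model evidence

The four priors above change the model evidence for the same data. As a minimal sketch of what the integral does (the helper `marginal_likelihood` is just for these slides, not from a package), we can numerically average the 6-out-of-9 binomial likelihood over each Beta prior:

```r
# p(y | M): average the likelihood over the prior
marginal_likelihood <- function(shape1, shape2) {
  integrate(function(theta) {
    dbinom(6, size = 9, prob = theta) * dbeta(theta, shape1, shape2)
  }, lower = 0, upper = 1)$value
}

marginal_likelihood(1, 1)    # uniform prior: 1/10
marginal_likelihood(20, 20)  # informative, centred on 0.5
marginal_likelihood(2, 40)   # opinionated, wrong
marginal_likelihood(48, 30)  # opinionated, better
```

The uniform prior spreads its predictions evenly over all 10 possible outcomes, so each gets probability 1/10; a prior that concentrates its mass in the wrong place gets a much lower evidence. This is the mechanism behind Ockham's razor on the next slide.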
---

## Ockham's Razor

- Complex models have a lower marginal likelihood.
- If we prefer models with a higher marginal likelihood, we are choosing the more parsimonious model.

--

Bayes theorem for two models:

$$ p(\mathcal{M}_1 | y) = \frac{P(y | \mathcal{M}_1) p(\mathcal{M}_1)}{p(y)} $$

and

$$ p(\mathcal{M}_2 | y) = \frac{P(y | \mathcal{M}_2) p(\mathcal{M}_2)}{p(y)} $$

---

## Model comparison

Model odds:

$$ \frac{p(\mathcal{M}_1 | y)}{p(\mathcal{M}_2 | y)} = \frac{\frac{P(y | \mathcal{M}_1) p(\mathcal{M}_1)}{p(y)}}{\frac{P(y | \mathcal{M}_2) p(\mathcal{M}_2)}{p(y)}} $$

--

`\(p(y)\)` cancels out.

--

`$$\underbrace{\frac{p(\mathcal{M}_1 | y)} {p(\mathcal{M}_2 | y)}}_\text{Posterior odds} = \frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)} \cdot \underbrace{ \frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}}_\text{Prior odds}$$`

--

<br>

`\(\frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}\)` are the **prior odds**, and `\(\frac{p(\mathcal{M}_1 | y)}{p(\mathcal{M}_2 | y)}\)` are the **posterior odds**.

---

Assuming the prior odds are 1, we are only interested in the ratio of marginal likelihoods:

$$ \frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)} $$

--

This term is called the **Bayes factor**: it is multiplied with the prior odds, and indicates under which model the data are more likely. We write `\(BF_{12}\)` - this is the Bayes factor for model 1 vs model 2.

$$ BF_{12} = \frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)}$$

--

`\(BF_{12}\)` indicates the degree of evidence provided by the data for `\(\mathcal{M}_1\)`, relative to `\(\mathcal{M}_2\)`.

Example: `\(BF_{12} = 5\)`: the data are 5 times more probable under model 1 than under model 2. With prior odds of 1, this corresponds to a posterior probability of 5/6 ≈ 0.83 for model 1.

---

## Classification

<img src="images/bf-classification.png" width="80%" />

---

## Bayes factor

- BFs depend on the prior distributions.
- BFs are challenging to compute.
    + Savage-Dickey Density Ratio with `Stan`/`brms`
    + Package [BayesFactor](https://cran.r-project.org/web/packages/BayesFactor/vignettes/manual.html) (for general linear models)
    + [JASP](https://jasp-stats.org/): a GUI for `BayesFactor`
    + Bridge sampling with `brms`: difficult to use, but the most reliable (see the sketch at the end of these slides)

---
class: middle

.pull-left-narrow[
  .huge-blue-number[2]
]
.pull-right-wide[
.larger[
Application: binomial model
]
]

---

## Savage-Dickey Density Ratio

If we have two nested models, we can use the Savage-Dickey density ratio.

Under the null model: `\(H_0: \theta = \theta_0\)`

Under the alternative model: `\(H_1: \theta \neq \theta_0\)`

--

`\(\theta\)` needs a distribution under model 1, e.g.
`\(\theta \sim \text{Beta}(1, 1)\)`

--

The Savage-Dickey density ratio trick: we look at model 1 and divide the posterior by the prior density at the null value `\(\theta_0\)`.

.footnote[
In the null model the parameter is fixed (point hypothesis).
]

---

## Savage-Dickey Density Ratio

Let's look at an example from Wagenmakers (2010): you observe that somebody correctly answers 9 out of 10 questions.

```r
d <- tibble(s = 9, k = 10)
```

What is the probability that this happened by chance ( `\(\theta = 0.5\)` )?

.footnote[
Wagenmakers, Eric-Jan, Tom Lodewyckx, Himanshu Kuriyal, and Raoul Grasman. “Bayesian Hypothesis Testing for Psychologists: A Tutorial on the Savage–Dickey Method.” Cognitive Psychology 60, no. 3 (May 1, 2010): 158–89. https://doi.org/10.1016/j.cogpsych.2009.12.001.
]

---

.panelset[
.panel[.panel-name[Prior]
<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-17-1.png" width="100%" />
]
.panel[.panel-name[Posterior]
<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-18-1.png" width="100%" />

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Bayes factors.</caption>
<thead>
<tr>
<th style="text-align:right;"> x </th>
<th style="text-align:right;"> Prior </th>
<th style="text-align:right;"> Posterior </th>
<th style="text-align:right;"> BF01 </th>
<th style="text-align:right;"> BF10 </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 0.5 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0.107 </td>
<td style="text-align:right;"> 0.107 </td>
<td style="text-align:right;"> 9.309 </td>
</tr>
</tbody>
</table>
]
]

---

## Savage-Dickey Density Ratio

.panelset[
.panel[.panel-name[Formula]

```r
library(brms)

m1_formula <- bf(
  s | trials(k) ~ 0 + Intercept,
  family = binomial(link = "identity"))
```
]
.panel[.panel-name[Get Priors]

```r
get_prior(m1_formula, data = d)
```

```
##   prior class      coef group resp dpar nlpar bound       source
##  (flat)     b                                            default
##  (flat)     b Intercept                             (vectorized)
```
]
.panel[.panel-name[Set Priors]

```r
priors <- set_prior("beta(1, 1)", class = "b", lb = 0, ub = 1)
```
]
.panel[.panel-name[Posterior]

```r
m1 <- brm(m1_formula,
          prior = priors,
          data = d,
*         sample_prior = TRUE,
          file = "models/05-m1")
```
]
.panel[.panel-name[Summary]

```r
summary(m1)
```

```
##  Family: binomial 
##   Links: mu = identity 
## Formula: s | trials(k) ~ 0 + Intercept 
##    Data: d (Number of observations: 1) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     0.84      0.10     0.60     0.98 1.01     1273     1320
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
]
]
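---

## Savage-Dickey by hand

Because the Beta prior is conjugate here, the Bayes factor in the table a few slides back can be checked analytically, as a sanity check on the MCMC results: a Beta(1, 1) prior with 9 successes in 10 trials gives a Beta(1 + 9, 1 + 1) posterior, and the Savage-Dickey ratio is the posterior density at `\(\theta_0 = 0.5\)` divided by the prior density.

```r
# Posterior is Beta(10, 2); the Beta(1, 1) prior density at 0.5 is 1
BF01 <- dbeta(0.5, 10, 2) / dbeta(0.5, 1, 1)
BF01      # 0.107
1 / BF01  # BF10 = 9.31
```

This matches the BF01 and BF10 values in the table.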
---

## Posterior summary

```r
m1 %>%
  mcmc_plot(c("b_Intercept", "prior_b"))
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-25-1.png" width="100%" />

---

## Prior and posterior

.pull-left[

```r
samples <- m1 %>%
  posterior_samples("b")
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>First six rows of posterior samples.</caption>
<thead>
<tr>
<th style="text-align:right;"> b_Intercept </th>
<th style="text-align:right;"> prior_b </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 0.84 </td>
<td style="text-align:right;"> 0.94 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.73 </td>
<td style="text-align:right;"> 0.31 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.78 </td>
<td style="text-align:right;"> 0.26 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.80 </td>
<td style="text-align:right;"> 0.74 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.79 </td>
<td style="text-align:right;"> 0.70 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.80 </td>
<td style="text-align:right;"> 0.74 </td>
</tr>
</tbody>
</table>
]

.pull-right[

```r
samples %>%
  pivot_longer(everything(), names_to = "Type", values_to = "value") %>%
  ggplot(aes(value, color = Type)) +
  geom_density(size = 1.5) +
  scale_color_viridis_d(end = 0.8) +
  labs(x = bquote(theta), y = "Density") +
  geom_vline(xintercept = .9)
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-27-1.png" width="100%" />
]

---

## Savage-Dickey Density Ratio

### With `brms`

```r
h <- m1 %>%
  hypothesis("Intercept = 0.5")

print(h, digits = 4)
```

```
## Hypothesis Tests for class b:
##              Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
## 1 (Intercept)-(0.5) = 0   0.3361    0.1007   0.0953   0.4754     0.0862    0.0793    *
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
```
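---

## What `hypothesis()` computes

`Evid.Ratio` is the Savage-Dickey ratio `\(BF_{01}\)`, estimated from the prior and posterior draws. A rough by-hand version of the same estimate (the value fluctuates with the draws and the density estimator, so it will not exactly match the analytic 0.107):

```r
# Density of the draws at theta_0 = 0.5, posterior over prior
post_d  <- density(samples$b_Intercept)
prior_d <- density(samples$prior_b)

approx(post_d$x, post_d$y, xout = 0.5)$y /
  approx(prior_d$x, prior_d$y, xout = 0.5)$y
```

Either way, `\(BF_{10} \approx 10\)`: the data are about ten times more probable under the alternative than under chance performance.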
---

## Savage-Dickey Density Ratio

```r
plot(h)
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-29-1.png" width="100%" />

---
class: middle

.pull-left-narrow[
  .huge-blue-number[3]
]
.pull-right-wide[
.larger[
Normal distribution
]
]

---

## Bayes factor for comparing two means

```r
smart <- tibble(IQ = c(101,100,102,104,102,97,105,105,98,101,100,123,105,103,
                       100,95,102,106,109,102,82,102,100,102,102,101,102,102,
                       103,103,97,97,103,101,97,104,96,103,124,101,101,100,
                       101,101,104,100,101),
                Group = "SmartDrug")

placebo <- tibble(IQ = c(99,101,100,101,102,100,97,101,104,101,102,102,100,105,
                         88,101,100,104,100,100,100,101,102,103,97,101,101,100,101,
                         99,101,100,100,101,100,99,101,100,102,99,100,99),
                  Group = "Placebo")

TwoGroupIQ <- bind_rows(smart, placebo) %>%
  mutate(Group = fct_relevel(as.factor(Group), "Placebo"))
```

---

```r
t.test(IQ ~ Group, data = TwoGroupIQ)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  IQ by Group
## t = -1.6222, df = 63.039, p-value = 0.1098
## alternative hypothesis: true difference in means between group Placebo and group SmartDrug is not equal to 0
## 95 percent confidence interval:
##  -3.4766863  0.3611848
## sample estimates:
##   mean in group Placebo mean in group SmartDrug 
##                100.3571                101.9149
```

---

.panelset[
.panel[.panel-name[Formula]

```r
m2_formula <- bf(IQ ~ 1 + Group)
```
]
.panel[.panel-name[Get Priors]

```r
get_prior(m2_formula, data = TwoGroupIQ)
```

```
##                   prior     class           coef group resp dpar nlpar bound       source
##                  (flat)         b                                                 default
##                  (flat)         b GroupSmartDrug                             (vectorized)
##  student_t(3, 101, 2.5) Intercept                                                 default
##    student_t(3, 0, 2.5)     sigma                                                 default
```
]
.panel[.panel-name[Set Priors]

```r
priors <- c(set_prior("normal(0, 1)",
                      class = "b",
                      coef = "GroupSmartDrug"))
```
]
.panel[.panel-name[Posterior]

```r
m2 <- brm(m2_formula,
          prior = priors,
          data = TwoGroupIQ,
          cores = parallel::detectCores(),
          sample_prior = TRUE,
          file = here::here("models/05-m2-iq-bf"))
```
]
.panel[.panel-name[Summary]

```r
summary(m2)
```

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: IQ ~ 1 + Group 
##    Data: TwoGroupIQ (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept        100.76      0.61    99.58   101.96 1.00     4294     2943
## GroupSmartDrug     0.78      0.70    -0.63     2.13 1.00     4380     3215
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     4.73      0.36     4.10     5.50 1.00     3911     3181
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
]
]

---

```r
m2 %>%
  mcmc_plot("GroupSmartDrug")
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-37-1.png" width="100%" />

---

```r
BF <- hypothesis(m2, hypothesis = 'GroupSmartDrug = 0')
BF
```

```
## Hypothesis Tests for class b:
##            Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
## 1 (GroupSmartDrug) = 0     0.78       0.7    -0.63     2.13       0.82      0.45     
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
```
```r
1/BF$hypothesis$Evid.Ratio
```

```
## [1] 1.215046
```
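---

## Cross-checks

The `Evid.Ratio` above is `\(BF_{01} = 0.82\)`, so `\(BF_{10} \approx 1.2\)`: essentially no evidence either way. The bridge sampling route mentioned earlier can serve as a cross-check. A sketch, not run in these slides: the model names `m_full` and `m_null` are ours, and `save_pars()`/`bayes_factor()` assume a recent `brms` version.

```r
# Bridge sampling: compare the models with and without the Group effect.
# Proper priors on all parameters are required; save_pars(all = TRUE)
# stores the draws that the bridgesampling package needs.
m_full <- brm(IQ ~ 1 + Group, data = TwoGroupIQ,
              prior = set_prior("normal(0, 1)", class = "b"),
              save_pars = save_pars(all = TRUE))

m_null <- brm(IQ ~ 1, data = TwoGroupIQ,
              save_pars = save_pars(all = TRUE))

bayes_factor(m_full, m_null)  # BF for the Group effect
```

The `BayesFactor` package offers a third route with its own default Cauchy prior on the effect size, e.g. `ttestBF(formula = IQ ~ Group, data = as.data.frame(TwoGroupIQ))`; since the priors differ, the numbers will not match exactly.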