class: center, middle, inverse, title-slide

# Model comparison
## Part 1: Bayes factors
### Andrew Ellis
### Bayesian multilevel modelling workshop 2021
### 5-29-2021

---
layout: true

<!-- Home icon -->

<div class="my-footer">
<span>
<a href="https://awellis.github.io/learnmultilevelmodels/" target="_blank">
</a>
Graduate School workshop 2021
</span>
</div>

<!-- Name (left) -->
<!-- <div class="my-footer"> -->
<!-- <span> -->
<!-- Andrew Ellis - <a href="https://kogpsy.github.io/neuroscicomplab" target="_blank">kogpsy.github.io/neuroscicomplab</a> -->
<!-- </span> -->
<!-- </div> -->

<!-- slide separator (for xaringan) -->
---

```r
library(tidyverse)
library(rmarkdown)
library(kableExtra)
library(countdown)

theme_set(theme_grey(base_size = 14) +
            theme(panel.grid = element_blank()))
```

## Introduction

.pull-left[
So far we have

- estimated parameters
- summarised posterior distributions
    + credible intervals
    + point estimates
]

--

.pull-right[
We would like to be able to compare models:

- e.g. model 1 explains the data better than model 2
- model 1 has a higher probability than model 2
]

---

.pull-left[
.discussion[
Which model comparison methods do you know of?

- Look at the example. How would you go about showing that the intervention has an effect?
]
]

.pull-right[
We have a within factor (`time`) and a between factor (`intervention`).

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-3-1.png" width="100%" />
]
```r
countdown(minutes = 3)
```
---

## Possible methods

- hypothesis testing, e.g.

.pull-left[

```r
t.test(diff ~ intervention, data = dwide)
```
]

.pull-right[

```r
m1 <- lmer(score ~ intervention + time + (1|id), data = d)
m2 <- lmer(score ~ intervention * time + (1|id), data = d)
```
]

- model comparison
    - explained variance
    - information criteria
    - cross validation

--

Let's look at Bayesian methods for model comparison.

---
class: middle

.pull-left-narrow[
  .huge-blue-number[1]
]
.pull-right-wide[
.larger[
Theory
]
]

---

## Bayesian model comparison

We'll look at three methods:

1) Bayes factors

2) Out-of-sample predictive accuracy: approximate leave-one-out cross-validation (LOO)

3) Posterior predictive checking using test statistics (posterior predictive p-values)

--

Let's begin with Bayes factors. These are not everybody's cup of tea; some statisticians don't like them, whereas they are popular among psychologists, who often want to test hypotheses.

---

.panelset[
.panel[.panel-name[Bayesian workflow]
<img src="images/Bayesian-workflow-1.png" width="100%" />
]
.panel[.panel-name[Posterior evaluation]
<img src="images/Bayesian-workflow-2.png" width="80%" />
]
.panel[.panel-name[Model comparison]
<img src="images/Bayesian-workflow-3.png" width="80%" />
]
]

---

## Bayes factor

Bayes' theorem, explicitly showing the dependency of the parameters `\(\mathbf{\theta}\)` on the model `\(\mathcal{M}\)`:

$$ p(\theta | y, \mathcal{M}) = \frac{p(y|\theta, \mathcal{M}) p(\theta | \mathcal{M})}{p(y | \mathcal{M})}$$

`\(\mathcal{M}\)` refers to a specific model.

--

`\(p(y | \mathcal{M})\)` is the probability of the data, averaged over all possible parameter values of model `\(\mathcal{M}\)`. It is known as the marginal likelihood.

--

.panelset[
.panel[.panel-name[Bayes theorem]

$$ P(\theta|Data) = \frac{ P(Data|\theta) * P(\theta) } {P(Data)} $$
]
.panel[.panel-name[Bayes theorem without marginal likelihood]

$$ P(\theta|Data) \propto P(Data|\theta) * P(\theta) $$
]
.panel[.panel-name[Marginal likelihood]

$$ p(y | \mathcal{M}) = \int_{\theta}{p(y | \theta, \mathcal{M}) p(\theta|\mathcal{M})d\theta} $$

To compute the marginal likelihood, we have to average over all values of `\(\theta\)`.
]
]

---

## Complexity

- The marginal likelihood is also known as the __model evidence__. It depends on what kinds of predictions a model can make.
- A model that makes many predictions is a complex model.

--

Complexity depends on

- the number of parameters
- the prior distributions of these parameters

--

In frequentist models priors don't exist, so complexity depends only on the number of parameters.

---

## Prior distributions

Uninformative priors make many predictions, each with low likelihood.

```r
n_points <- 100
theta_grid <- seq(from = 0, to = 1, length.out = n_points)
likelihood <- dbinom(6, size = 9, prob = theta_grid)

compute_posterior <- function(likelihood, prior) {
  unstandardized_posterior <- likelihood * prior
  posterior <- unstandardized_posterior / sum(unstandardized_posterior)
  par(mfrow = c(1, 3))
  plot(theta_grid, prior, type = "l", main = "Prior",
       col = "dodgerblue3", lwd = 2)
  plot(theta_grid, likelihood, type = "l", main = "Likelihood",
       col = "firebrick3", lwd = 2)
  plot(theta_grid, posterior, type = "l", main = "Posterior",
       col = "darkorchid3", lwd = 2)
}
```

---

## Prior distributions

Uninformative priors make many predictions, each with low likelihood.

.panelset[
.panel[.panel-name[Uniform prior]

```r
prior <- dbeta(x = theta_grid, shape1 = 1, shape2 = 1)
compute_posterior(likelihood, prior)
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-11-1.png" width="100%" />
]
.panel[.panel-name[Informative prior]

```r
prior <- dbeta(x = theta_grid, shape1 = 20, shape2 = 20)
compute_posterior(likelihood, prior)
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-12-1.png" width="100%" />
]
.panel[.panel-name[Opinionated prior (wrong)]

```r
prior <- dbeta(x = theta_grid, shape1 = 2, shape2 = 40)
compute_posterior(likelihood, prior)
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-13-1.png" width="100%" />
]
.panel[.panel-name[Opinionated prior (better)]

```r
prior <- dbeta(x = theta_grid, shape1 = 48, shape2 = 30)
compute_posterior(likelihood, prior)
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-14-1.png" width="100%" />
]
]
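---

## Priors and the model evidence

The four priors above change the model evidence for the same data. As a minimal sketch of what the integral does (the helper `marginal_likelihood` is just for these slides, not from a package), we can numerically average the 6-out-of-9 binomial likelihood over each Beta prior:

```r
# p(y | M): average the likelihood over the prior
marginal_likelihood <- function(shape1, shape2) {
  integrate(function(theta) {
    dbinom(6, size = 9, prob = theta) * dbeta(theta, shape1, shape2)
  }, lower = 0, upper = 1)$value
}

marginal_likelihood(1, 1)    # uniform prior: 1/10
marginal_likelihood(20, 20)  # informative, centred on 0.5
marginal_likelihood(2, 40)   # opinionated, wrong
marginal_likelihood(48, 30)  # opinionated, better
```

The uniform prior spreads its predictions evenly over all 10 possible outcomes, so each gets probability 1/10; a prior that concentrates its mass in the wrong place gets a much lower evidence. This is the mechanism behind Ockham's razor on the next slide.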
---

## Ockham's Razor

- Complex models have a lower marginal likelihood.
- If we prefer models with a higher marginal likelihood, we are choosing the more parsimonious model.

--

Bayes theorem for two models:

$$ p(\mathcal{M}_1 | y) = \frac{P(y | \mathcal{M}_1) p(\mathcal{M}_1)}{p(y)} $$

and

$$ p(\mathcal{M}_2 | y) = \frac{P(y | \mathcal{M}_2) p(\mathcal{M}_2)}{p(y)} $$

---

## Model comparison

Model odds:

$$ \frac{p(\mathcal{M}_1 | y)}{p(\mathcal{M}_2 | y)} = \frac{\frac{P(y | \mathcal{M}_1) p(\mathcal{M}_1)}{p(y)}}{\frac{P(y | \mathcal{M}_2) p(\mathcal{M}_2)}{p(y)}} $$

--

`\(p(y)\)` cancels out.

--

`$$\underbrace{\frac{p(\mathcal{M}_1 | y)} {p(\mathcal{M}_2 | y)}}_\text{Posterior odds} = \frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)} \cdot \underbrace{ \frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}}_\text{Prior odds}$$`

--

<br>

`\(\frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}\)` are the **prior odds**, and `\(\frac{p(\mathcal{M}_1 | y)}{p(\mathcal{M}_2 | y)}\)` are the **posterior odds**.

---

Assuming the prior odds are 1, we are only interested in the ratio of marginal likelihoods:

$$ \frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)} $$

--

This term is called the **Bayes factor**: it is multiplied with the prior odds, and indicates under which model the data are more likely. We write `\(BF_{12}\)` - this is the Bayes factor for model 1 vs model 2.

$$ BF_{12} = \frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)}$$

--

`\(BF_{12}\)` indicates the degree of evidence provided by the data for `\(\mathcal{M}_1\)`, relative to `\(\mathcal{M}_2\)`.

Example: `\(BF_{12} = 5\)`: the data are 5 times more probable under model 1 than under model 2. With prior odds of 1, this corresponds to a posterior probability of 5/6 ≈ 0.83 for model 1.

---

## Classification

<img src="images/bf-classification.png" width="80%" />

---

## Bayes factor

- BFs depend on the prior distributions.
- BFs are challenging to compute.
    + Savage-Dickey Density Ratio with `Stan`/`brms`
    + Package [BayesFactor](https://cran.r-project.org/web/packages/BayesFactor/vignettes/manual.html) (for general linear models)
    + [JASP](https://jasp-stats.org/): a GUI for `BayesFactor`
    + Bridge sampling with `brms`: difficult to use, but the most reliable (see the sketch at the end of these slides)

---
class: middle

.pull-left-narrow[
  .huge-blue-number[2]
]
.pull-right-wide[
.larger[
Application: binomial model
]
]

---

## Savage-Dickey Density Ratio

If we have two nested models, we can use the Savage-Dickey density ratio.

Under the null model: `\(H_0: \theta = \theta_0\)`

Under the alternative model: `\(H_1: \theta \neq \theta_0\)`

--

`\(\theta\)` needs a distribution under model 1, e.g.
`\(\theta \sim \text{Beta}(1, 1)\)`

--

The Savage-Dickey density ratio trick: we look at model 1 and divide the posterior by the prior density at the null value `\(\theta_0\)`.

.footnote[
In the null model the parameter is fixed (point hypothesis).
]

---

## Savage-Dickey Density Ratio

Let's look at an example from Wagenmakers (2010): you observe that somebody correctly answers 9 out of 10 questions.

```r
d <- tibble(s = 9, k = 10)
```

What is the probability that this happened by chance ( `\(\theta = 0.5\)` )?

.footnote[
Wagenmakers, Eric-Jan, Tom Lodewyckx, Himanshu Kuriyal, and Raoul Grasman. “Bayesian Hypothesis Testing for Psychologists: A Tutorial on the Savage–Dickey Method.” Cognitive Psychology 60, no. 3 (May 1, 2010): 158–89. https://doi.org/10.1016/j.cogpsych.2009.12.001.
]

---

.panelset[
.panel[.panel-name[Prior]
<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-17-1.png" width="100%" />
]
.panel[.panel-name[Posterior]
<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-18-1.png" width="100%" />

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Bayes factors.</caption>
<thead>
<tr>
<th style="text-align:right;"> x </th>
<th style="text-align:right;"> Prior </th>
<th style="text-align:right;"> Posterior </th>
<th style="text-align:right;"> BF01 </th>
<th style="text-align:right;"> BF10 </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 0.5 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0.107 </td>
<td style="text-align:right;"> 0.107 </td>
<td style="text-align:right;"> 9.309 </td>
</tr>
</tbody>
</table>
]
]

---

## Savage-Dickey Density Ratio

.panelset[
.panel[.panel-name[Formula]

```r
library(brms)

m1_formula <- bf(
  s | trials(k) ~ 0 + Intercept,
  family = binomial(link = "identity"))
```
]
.panel[.panel-name[Get Priors]

```r
get_prior(m1_formula, data = d)
```

```
##   prior class      coef group resp dpar nlpar bound       source
##  (flat)     b                                            default
##  (flat)     b Intercept                             (vectorized)
```
]
.panel[.panel-name[Set Priors]

```r
priors <- set_prior("beta(1, 1)", class = "b", lb = 0, ub = 1)
```
]
.panel[.panel-name[Posterior]

```r
m1 <- brm(m1_formula,
          prior = priors,
          data = d,
*         sample_prior = TRUE,
          file = "models/05-m1")
```
]
.panel[.panel-name[Summary]

```r
summary(m1)
```

```
##  Family: binomial 
##   Links: mu = identity 
## Formula: s | trials(k) ~ 0 + Intercept 
##    Data: d (Number of observations: 1) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     0.84      0.10     0.60     0.98 1.01     1273     1320
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
]
]
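---

## Savage-Dickey by hand

Because the Beta prior is conjugate here, the Bayes factor in the table a few slides back can be checked analytically, as a sanity check on the MCMC results: a Beta(1, 1) prior with 9 successes in 10 trials gives a Beta(1 + 9, 1 + 1) posterior, and the Savage-Dickey ratio is the posterior density at `\(\theta_0 = 0.5\)` divided by the prior density.

```r
# Posterior is Beta(10, 2); the Beta(1, 1) prior density at 0.5 is 1
BF01 <- dbeta(0.5, 10, 2) / dbeta(0.5, 1, 1)
BF01      # 0.107
1 / BF01  # BF10 = 9.31
```

This matches the BF01 and BF10 values in the table.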
---

## Posterior summary

```r
m1 %>%
  mcmc_plot(c("b_Intercept", "prior_b"))
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-25-1.png" width="100%" />

---

## Prior and posterior

.pull-left[

```r
samples <- m1 %>%
  posterior_samples("b")
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>First six rows of posterior samples.</caption>
<thead>
<tr>
<th style="text-align:right;"> b_Intercept </th>
<th style="text-align:right;"> prior_b </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 0.84 </td>
<td style="text-align:right;"> 0.94 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.73 </td>
<td style="text-align:right;"> 0.31 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.78 </td>
<td style="text-align:right;"> 0.26 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.80 </td>
<td style="text-align:right;"> 0.74 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.79 </td>
<td style="text-align:right;"> 0.70 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.80 </td>
<td style="text-align:right;"> 0.74 </td>
</tr>
</tbody>
</table>
]

.pull-right[

```r
samples %>%
  pivot_longer(everything(), names_to = "Type", values_to = "value") %>%
  ggplot(aes(value, color = Type)) +
  geom_density(size = 1.5) +
  scale_color_viridis_d(end = 0.8) +
  labs(x = bquote(theta), y = "Density") +
  geom_vline(xintercept = .9)
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-27-1.png" width="100%" />
]

---

## Savage-Dickey Density Ratio

### With `brms`

```r
h <- m1 %>%
  hypothesis("Intercept = 0.5")

print(h, digits = 4)
```

```
## Hypothesis Tests for class b:
##              Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
## 1 (Intercept)-(0.5) = 0   0.3361    0.1007   0.0953   0.4754     0.0862    0.0793    *
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
```
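---

## What `hypothesis()` computes

`Evid.Ratio` is the Savage-Dickey ratio `\(BF_{01}\)`, estimated from the prior and posterior draws. A rough by-hand version of the same estimate (the value fluctuates with the draws and the density estimator, so it will not exactly match the analytic 0.107):

```r
# Density of the draws at theta_0 = 0.5, posterior over prior
post_d  <- density(samples$b_Intercept)
prior_d <- density(samples$prior_b)

approx(post_d$x, post_d$y, xout = 0.5)$y /
  approx(prior_d$x, prior_d$y, xout = 0.5)$y
```

Either way, `\(BF_{10} \approx 10\)`: the data are about ten times more probable under the alternative than under chance performance.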
---

## Savage-Dickey Density Ratio

```r
plot(h)
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-29-1.png" width="100%" />

---
class: middle

.pull-left-narrow[
  .huge-blue-number[3]
]
.pull-right-wide[
.larger[
Normal distribution
]
]

---

## Bayes factor for comparing two means

```r
smart <- tibble(IQ = c(101,100,102,104,102,97,105,105,98,101,100,123,105,103,
                       100,95,102,106,109,102,82,102,100,102,102,101,102,102,
                       103,103,97,97,103,101,97,104,96,103,124,101,101,100,
                       101,101,104,100,101),
                Group = "SmartDrug")

placebo <- tibble(IQ = c(99,101,100,101,102,100,97,101,104,101,102,102,100,105,
                         88,101,100,104,100,100,100,101,102,103,97,101,101,100,101,
                         99,101,100,100,101,100,99,101,100,102,99,100,99),
                  Group = "Placebo")

TwoGroupIQ <- bind_rows(smart, placebo) %>%
  mutate(Group = fct_relevel(as.factor(Group), "Placebo"))
```

---

```r
t.test(IQ ~ Group, data = TwoGroupIQ)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  IQ by Group
## t = -1.6222, df = 63.039, p-value = 0.1098
## alternative hypothesis: true difference in means between group Placebo and group SmartDrug is not equal to 0
## 95 percent confidence interval:
##  -3.4766863  0.3611848
## sample estimates:
##   mean in group Placebo mean in group SmartDrug 
##                100.3571                101.9149
```

---

.panelset[
.panel[.panel-name[Formula]

```r
m2_formula <- bf(IQ ~ 1 + Group)
```
]
.panel[.panel-name[Get Priors]

```r
get_prior(m2_formula, data = TwoGroupIQ)
```

```
##                   prior     class           coef group resp dpar nlpar bound       source
##                  (flat)         b                                                 default
##                  (flat)         b GroupSmartDrug                             (vectorized)
##  student_t(3, 101, 2.5) Intercept                                                 default
##    student_t(3, 0, 2.5)     sigma                                                 default
```
]
.panel[.panel-name[Set Priors]

```r
priors <- c(set_prior("normal(0, 1)",
                      class = "b",
                      coef = "GroupSmartDrug"))
```
]
.panel[.panel-name[Posterior]

```r
m2 <- brm(m2_formula,
          prior = priors,
          data = TwoGroupIQ,
          cores = parallel::detectCores(),
          sample_prior = TRUE,
          file = here::here("models/05-m2-iq-bf"))
```
]
.panel[.panel-name[Summary]

```r
summary(m2)
```

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: IQ ~ 1 + Group 
##    Data: TwoGroupIQ (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept        100.76      0.61    99.58   101.96 1.00     4294     2943
## GroupSmartDrug     0.78      0.70    -0.63     2.13 1.00     4380     3215
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     4.73      0.36     4.10     5.50 1.00     3911     3181
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
]
]

---

```r
m2 %>%
  mcmc_plot("GroupSmartDrug")
```

<img src="03-bayes-factors-slides_files/figure-html/unnamed-chunk-37-1.png" width="100%" />

---

```r
BF <- hypothesis(m2, hypothesis = 'GroupSmartDrug = 0')
BF
```

```
## Hypothesis Tests for class b:
##            Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
## 1 (GroupSmartDrug) = 0     0.78       0.7    -0.63     2.13       0.82      0.45     
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
```
```r
1/BF$hypothesis$Evid.Ratio
```

```
## [1] 1.215046
```
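---

## Cross-checks

The `Evid.Ratio` above is `\(BF_{01} = 0.82\)`, so `\(BF_{10} \approx 1.2\)`: essentially no evidence either way. The bridge sampling route mentioned earlier can serve as a cross-check. A sketch, not run in these slides: the model names `m_full` and `m_null` are ours, and `save_pars()`/`bayes_factor()` assume a recent `brms` version.

```r
# Bridge sampling: compare the models with and without the Group effect.
# Proper priors on all parameters are required; save_pars(all = TRUE)
# stores the draws that the bridgesampling package needs.
m_full <- brm(IQ ~ 1 + Group, data = TwoGroupIQ,
              prior = set_prior("normal(0, 1)", class = "b"),
              save_pars = save_pars(all = TRUE))

m_null <- brm(IQ ~ 1, data = TwoGroupIQ,
              save_pars = save_pars(all = TRUE))

bayes_factor(m_full, m_null)  # BF for the Group effect
```

The `BayesFactor` package offers a third route with its own default Cauchy prior on the effect size, e.g. `ttestBF(formula = IQ ~ Group, data = as.data.frame(TwoGroupIQ))`; since the priors differ, the numbers will not match exactly.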