class: left, bottom, inverse, title-slide

# Bayesian Statistics
## Lecture 8: Introduction to Bayesian Regression
### Yanfei Kang
### 2019/10/25 (updated: 2019-11-25)

---
# What we have learnt

- Bayesian basics.
- Introduction to Bayesian computing.
- Modern Bayesian computing algorithms.
- We will now see how to apply these techniques to modeling.
- Let us first recall classical vs. Bayesian inference.

---
class: inverse, center, middle

# A recap of Bayesian inference

---
# Inference

- Consider a simple problem of inference.
- Suppose I am interested in the proportion of students in this class who were born in Beijing. This will be denoted as `\(\theta\)`.
- In particular, I am interested in whether half of the class was born in Beijing ( `\(\theta=0.5\)`) or whether it is less than half ( `\(\theta<0.5\)`).
- I am unable to ask everybody in the class whether they were born in Beijing, so I can only take a sample.

---
# Classical Framework

- In the **classical framework**, `\(\theta\)` is NOT a random variable. It is a fixed number that is unknown.
- Using the sample, an estimate `\(\hat{\theta}\)` can be obtained.
- A `\(95\%\)` confidence interval around `\(\hat{\theta}\)` can be constructed.
- The null hypothesis `\(\theta=0.5\)` can be tested against the alternative.

---
# Interpretation in the Classical Framework

- Suppose the `\(95\%\)` confidence interval is (0.35, 0.45). How is this interpreted?
- Correct (classical) interpretation:
    - If an infinite number of samples is taken, `\(95\%\)` of the confidence intervals constructed in this way will contain the true value of `\(\theta\)`.
- Incorrect (classical) interpretation:
    - There is a `\(95\%\)` probability that `\(\theta\)` is in the interval (0.35, 0.45).
    - Reason: `\(\theta\)` is not a random variable.

---
# Interpretation in the Classical Framework

- Suppose `\(\hat{\theta}=0.4\)` and the null is rejected at the `\(5\%\)` level of significance.
- Correct (classical) interpretation:
    - If the null were true, then the probability of observing `\(\hat{\theta}\leq 0.4\)` would be less than the level of significance ( `\(5\%\)`). Therefore the null is rejected and we conclude `\(\theta< 0.5\)`.
- Incorrect (classical) interpretation:
    - The probability that `\(\theta\leq 0.5\)` is `\(95\%\)`.
    - Reason: `\(\theta\)` is not a random variable.

---
# Bayesian Framework

- There is another way to do statistical inference, known as the **Bayesian framework**.
- It relies on a very different understanding of probability.
- In the Bayesian framework, probability distributions represent uncertainty about quantities that are unknown.
- In our example, under the Bayesian framework `\(\theta\)` is a random variable.

---
# Why Bayes?

- Let `\({y}=(y_1,y_2,\ldots,y_n)'\)` be the data in the sample, and `\(\theta\)` be the unknown parameter of interest.
- Inferences about `\(\theta\)` are based on the distribution `\(p(\theta|{y})\)`. This can be found using Bayes' rule:
`$$p(\theta|{y})=\frac{p({y}|\theta)p(\theta)}{p({y})}.$$`
- Consider each term in the equation.

---
#

### Question 1: what kind of inferences have we made using the Bayesian framework? Examples?

### Question 2: what techniques have we learnt? Why do we need sampling algorithms? Examples?

---
class: inverse, center, middle

# The Bayesian linear regression model

---
# Linear regression

- Recall the linear regression model
`$$y_i = \beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip} + \epsilon_i,$$`
where `\(\epsilon_i \sim N(0, \sigma^2).\)`
- We are interested in the joint distribution of `\(\beta_0,\beta_1,\ldots,\beta_p\)` and `\(\sigma^2\)`.
- You know how to get that by OLS under the Gaussian assumption. We have `\(\beta|\sigma^2 \sim N\left[(x'x)^{-1}x'y,(x'x)^{-1}\sigma^2\right]\)` and `\(\sigma^2 \sim \chi^2(n-p)\)`.

---
# The Bayesian approach

- We know by Bayes' rule
`$$\begin{aligned} p\left(\beta, \sigma^{2} | y, x\right) & \propto p\left(y | \beta, \sigma^{2}, x\right) p\left(\beta, \sigma^{2}\right) \\ &=p\left(y | \beta, \sigma^{2}, x\right) p\left(\beta | \sigma^{2}\right) p\left(\sigma^{2}\right), \end{aligned}$$`
where `\(p(y|\beta,\sigma^2,x)\)` is the **likelihood** for the model, `\(p(\beta,\sigma^2) = p(\beta|\sigma^2)p(\sigma^2)\)` is the **prior** information about the parameters, and `\(p(\beta,\sigma^2|y,x)\)` is called the **posterior**.
- **The Bayesian way**: let's just draw random samples from `\(p(\beta,\sigma^2|y,x)\)`.

---
# Sampling the Posterior

- Write down the likelihood function.
- Specify the prior.
- Write down the posterior.
- Use Gibbs sampling to draw. Set initial values `\(\beta^{(0)}\)` and `\(\sigma^{2(0)}\)`.
    - Draw a random vector `\(\beta^{(1)}\)` from `\(p(\beta^{(1)} | \sigma^{2(0)},y,x)\)`.
    - Draw a random number `\(\sigma^{2(1)}\)` from `\(p(\sigma^{2(1)}| \beta^{(1)} ,y,x)\)`.
    - Draw a random vector `\(\beta^{(2)}\)` from `\(p(\beta^{(2)} | \sigma^{2(1)},y,x)\)`.
    - Draw a random number `\(\sigma^{2(2)}\)` from `\(p(\sigma^{2(2)}| \beta^{(2)} ,y,x)\)`.
    - Draw a random vector `\(\beta^{(3)}\)` from `\(p(\beta^{(3)} | \sigma^{2(2)},y,x)\)`.
    - Draw a random number `\(\sigma^{2(3)}\)` from `\(p(\sigma^{2(3)}| \beta^{(3)} ,y,x)\)`.
    - ...
- Summarize `\(\beta^{(1)},\beta^{(2)},\ldots,\beta^{(n)}\)`.
- Summarize `\(\sigma^{2(1)},\sigma^{2(2)},\ldots,\sigma^{2(n)}\)`.

---
# Lab

Work on random walk Metropolis within Gibbs for multiple linear regression.

- Prior for `\(\beta\)`: `\(\beta \sim N(0, I)\)`.
- Prior for `\(\sigma^2\)`: `\(\sigma^2 \sim \chi^2(10)\)`.
- Write down the likelihood.
- Write down the joint posterior.
- Go to Gibbs.
- Use a random walk proposal for Metropolis.
- Summarize your samples.

---
# Lab (continued)


```r
## Generate data for linear regression
n <- 1000
p <- 3
epsilon <- rnorm(n, 0, 0.1)
beta <- matrix(c(-3, 1, 3, 5))
X <- cbind(1, matrix(runif(n * p), n))
y <- X %*% beta + epsilon
```

---
# Lab (continued)


```r
## Log posterior
library(mvtnorm)
LogPosterior <- function(beta, sigma2, y, X) {
    p <- length(beta)
    ## The log likelihood function
    LogLikelihood <- sum(dnorm(x = y, mean = X %*% beta, sd = sqrt(sigma2), log = TRUE))
    ## The priors
    LogPrior4beta <- dmvnorm(x = t(beta), mean = matrix(0, 1, p), sigma = diag(p), log = TRUE)
    LogPrior4sigma2 <- dchisq(x = sigma2, df = 10, log = TRUE)
    LogPrior <- LogPrior4beta + LogPrior4sigma2
    ## The log posterior
    LogPosterior <- LogLikelihood + LogPrior
    return(LogPosterior)
}
```
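---
# Lab (continued)

One possible implementation of the sampler is sketched below, assuming the data and the `LogPosterior()` function defined above; the proposal step sizes, number of iterations, and burn-in are illustrative choices rather than part of the lab specification.


```r
## Random walk Metropolis within Gibbs (a minimal sketch)
set.seed(2019)
niter <- 5000
draws.beta <- matrix(0, niter, p + 1)
draws.sigma2 <- numeric(niter)
beta.cur <- rep(0, p + 1)   # initial values
sigma2.cur <- 1
for (i in 1:niter) {
    ## Block 1: update beta given sigma2 with a Gaussian random walk proposal
    beta.prop <- beta.cur + rnorm(p + 1, 0, 0.05)
    logratio <- LogPosterior(beta.prop, sigma2.cur, y, X) -
        LogPosterior(beta.cur, sigma2.cur, y, X)
    if (log(runif(1)) < logratio) beta.cur <- beta.prop
    ## Block 2: update sigma2 given beta; proposals below zero are always rejected
    sigma2.prop <- sigma2.cur + rnorm(1, 0, 0.01)
    if (sigma2.prop > 0) {
        logratio <- LogPosterior(beta.cur, sigma2.prop, y, X) -
            LogPosterior(beta.cur, sigma2.cur, y, X)
        if (log(runif(1)) < logratio) sigma2.cur <- sigma2.prop
    }
    draws.beta[i, ] <- beta.cur
    draws.sigma2[i] <- sigma2.cur
}
## Summarize the draws after discarding a burn-in period
colMeans(draws.beta[-(1:1000), ])
mean(draws.sigma2[-(1:1000)])
```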
---
class: inverse, center, middle

# Connections between Bayesian and OLS

--

### Let's come back to the simple regression model

---
# Frequentist Ordinary Least Squares (OLS) Simple Linear Regression

- Obtaining accurate measurements of body fat is expensive and not easy to do.
- Instead, predictive models of the percentage of body fat that use readily available measurements, such as abdominal circumference, are easy to use and inexpensive.
- Let's use a simple linear regression to predict body fat using abdominal circumference.
- The data set `bodyfat` can be found in the library `BAS`.

---
# Frequentist OLS


```r
library(BAS)
data(bodyfat)
summary(bodyfat)

# Frequentist OLS linear regression
bodyfat.lm = lm(Bodyfat ~ Abdomen, data = bodyfat)
summary(bodyfat.lm)
```

---
# The model

This regression model can be formulated as
`$$y_{i}=\alpha+\beta x_{i}+\epsilon_{i}, \quad i=1, \cdots, 252.$$`
We assume the errors are
`$$\epsilon_{i} \stackrel{\text { iid }}{\sim} \text { Normal }\left(0, \sigma^{2}\right).$$`

---
# `lm`

.pull-left[

```r
# Extract coefficients
beta = coef(bodyfat.lm)

# Visualize regression line on the scatter plot
library(ggplot2)
ggplot(data = bodyfat, aes(x = Abdomen, y = Bodyfat)) + geom_point(color = "blue") +
    geom_abline(intercept = beta[1], slope = beta[2], size = 1) +
    xlab("abdomen circumference (cm)")
```
]

.pull-right[
![](BS-L8-bregression_files/figure-html/unnamed-chunk-4-1.png)<!-- -->
]

---
# OLS

- We usually use the residuals for diagnostics and for estimating the variance of the error term `\(\sigma^2\)`:
`$$\hat{\sigma}^{2}=\frac{1}{n-2} \sum_{i}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}=\frac{1}{n-2} \sum_{i}^{n} \hat{\epsilon}_{i}^{2} = \frac{SSE}{n-2}.$$`
- We have the estimates
`$$\hat{\beta}=\frac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i}\left(x_{i}-\bar{x}\right)^{2}}=\frac{S_{x y}}{S_{x x}}, \quad \quad \hat{\alpha}=\bar{y}-\hat{\beta} \bar{x}=\bar{y}-\frac{S_{x y}}{S_{x x}} \bar{x}.$$`

---
# OLS

- The standard errors are given as
`$$\begin{array}{l}{\operatorname{se}_{\alpha}=\sqrt{\frac{\operatorname{SSE}}{n-2}\left(\frac{1}{n}+\frac{\bar{x}^{2}}{\mathrm{S}_{x x}}\right)}=\hat{\sigma} \sqrt{\frac{1}{n}+\frac{\bar{x}^{2}}{\mathrm{S}_{x x}}}} \\ {\operatorname{se}_{\beta}=\sqrt{\frac{\mathrm{SSE}}{n-2} \frac{1}{\mathrm{S}_{x x}}}=\frac{\hat{\sigma}}{\sqrt{\mathrm{S}_{x x}}}}\end{array}$$`
- We can then construct confidence intervals.
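---
# OLS by hand

As a sanity check, the estimators and standard errors above can be computed directly on the `bodyfat` data (a sketch; the results should agree with the coefficient table from `summary(bodyfat.lm)`).


```r
## Compute the OLS estimates and standard errors from the formulas above
x <- bodyfat$Abdomen
y <- bodyfat$Bodyfat
n <- length(y)
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
beta.hat <- Sxy / Sxx
alpha.hat <- mean(y) - beta.hat * mean(x)
SSE <- sum((y - alpha.hat - beta.hat * x)^2)
sigma.hat <- sqrt(SSE / (n - 2))
se.beta <- sigma.hat / sqrt(Sxx)
se.alpha <- sigma.hat * sqrt(1 / n + mean(x)^2 / Sxx)
c(alpha.hat = alpha.hat, beta.hat = beta.hat, se.alpha = se.alpha, se.beta = se.beta)
```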
---
# Bayesian way

- Bear in mind our model:
`$$y_{i}=\alpha+\beta x_{i}+\epsilon_{i}, \quad \epsilon_{i} \stackrel{\text { iid }}{\sim} \text { Normal }\left(0, \sigma^{2}\right).$$`
- The Bayesian approach returns a whole distribution of so-called credible values for each parameter.
- Credibility is a term used to express our belief in certain parameter values.
- The benefit of working with distributions is that they provide a natural way to reflect the uncertainty of the estimates.

---
# Bayesian way

- Can you write down the likelihood?
- We now use a non-informative prior:
`$$p\left(\alpha, \beta, \sigma^{2}\right) \propto \frac{1}{\sigma^{2}}.$$`

---
# The joint posterior

`$$\begin{aligned} p^{*}\left(\alpha, \beta, \sigma^{2} | y_{1}, \cdots, y_{n}\right) & \propto\left[\prod_{i}^{n} p\left(y_{i} | x_{i}, \alpha, \beta, \sigma^{2}\right)\right] p\left(\alpha, \beta, \sigma^{2}\right) \\ & \propto\left[\left(\frac{1}{\left(\sigma^{2}\right)^{1 / 2}} \exp \left(-\frac{\left(y_{1}-\left(\alpha+\beta x_{1}\right)\right)^{2}}{2 \sigma^{2}}\right)\right) \times \cdots\right.\\ &\left.\times\left(\frac{1}{\left(\sigma^{2}\right)^{1 / 2}} \exp \left(-\frac{\left(y_{n}-\left(\alpha+\beta x_{n}\right)\right)^{2}}{2 \sigma^{2}}\right)\right)\right] \times\left(\frac{1}{\sigma^{2}}\right) \\ & \propto \frac{1}{\left(\sigma^{2}\right)^{(n+2) / 2}} \exp \left(-\frac{\sum_{i}\left(y_{i}-\alpha-\beta x_{i}\right)^{2}}{2 \sigma^{2}}\right). \end{aligned}$$`

---
# The marginal posterior

`$$\beta | y_{1}, \cdots, y_{n} \sim \mathrm{t}\left(n-2, \hat{\beta}, \frac{\hat{\sigma}^{2}}{\mathrm{S}_{x x}}\right)=\mathrm{t}\left(n-2, \hat{\beta},\left(\mathrm{se}_{\beta}\right)^{2}\right).$$`

`$$\alpha | y_{1}, \cdots, y_{n} \sim \mathrm{t}\left(n-2, \hat{\alpha}, \hat{\sigma}^{2}\left(\frac{1}{n}+\frac{\bar{x}^{2}}{\mathrm{S}_{x x}}\right)\right)=\mathrm{t}\left(n-2, \hat{\alpha},\left(\mathrm{se}_{\alpha}\right)^{2}\right).$$`

`$$\phi=\frac{1}{\sigma^{2}} | y_{1}, \cdots, y_{n} \sim \text { Gamma }\left(\frac{n-2}{2}, \frac{\mathrm{SSE}}{2}\right).$$`

---
# The conditional posterior

`$$\beta | \sigma^{2}, \text { data } \sim \text { Normal }\left(\hat{\beta}, \frac{\sigma^{2}}{\mathrm{S}_{x x}}\right).$$`

`$$\alpha | \sigma^{2}, \text { data } \sim \text { Normal }\left(\hat{\alpha}, \sigma^{2}\left(\frac{1}{n}+\frac{\bar{x}^{2}}{\mathrm{S}_{x x}}\right)\right).$$`

---
# Credible intervals (CI) for `\(\alpha\)` and `\(\beta\)`


```r
output = summary(bodyfat.lm)$coef[, 1:2]
out = cbind(output, confint(bodyfat.lm))
colnames(out) = c("posterior mean", "posterior std", "2.5", "97.5")
round(out, 2)
```

```
##             posterior mean posterior std    2.5   97.5
## (Intercept)         -39.28          2.66 -44.52 -34.04
## Abdomen               0.63          0.03   0.58   0.69
```

---
# Credible intervals for the mean

The mean at point `\(x_i\)` is
`$$\mu_{Y | x_{i}}=E\left[Y | x_{i}\right]=\alpha+\beta x_{i}.$$`

For the mean, we have its posterior
`$$\alpha+\beta x_{i} | \text { data } \sim t\left(n-2, \hat{\alpha}+\hat{\beta} x_{i}, \mathrm{S}_{Y | X_{i}}^{2}\right),$$`
where
`$$\mathrm{S}_{Y | X_{i}}^{2}=\hat{\sigma}^{2}\left(\frac{1}{n}+\frac{\left(x_{i}-\bar{x}\right)^{2}}{\mathrm{S}_{x x}}\right).$$`

---
# Credible intervals for the prediction

Any new prediction `\(y_{n+1}\)` at a point `\(x_{n+1}\)` also follows a `\(t\)`-distribution.
For the prediction, we have
`$$\begin{array}{l}{\qquad y_{n+1} | \text { data, } x_{n+1} \sim \mathrm{t}\left(n-2, \hat{\alpha}+\hat{\beta} x_{n+1}, \mathrm{S}_{Y | X_{n+1}}^{2}\right)} \\ \\ {\qquad \mathrm{S}_{Y | X_{n+1}}^{2}=\hat{\sigma}^{2}+\hat{\sigma}^{2}\left(\frac{1}{n}+\frac{\left(x_{n+1}-\bar{x}\right)^{2}}{\mathrm{S}_{x x}}\right)=\hat{\sigma}^{2}\left(1+\frac{1}{n}+\frac{\left(x_{n+1}-\bar{x}\right)^{2}}{\mathrm{S}_{x x}}\right)}\end{array}$$`

**The variance for predicting a new observation `\(y_{n+1}\)` has an extra `\(\hat{\sigma}^{2}\)`, which comes from the uncertainty of a new observation about the mean `\(\mu_Y\)` estimated by the regression line.**

---
# Credible intervals

.pull-left[

```r
library(ggplot2)

# Construct current prediction
alpha = bodyfat.lm$coefficients[1]
beta = bodyfat.lm$coefficients[2]
new_x = seq(min(bodyfat$Abdomen), max(bodyfat$Abdomen), length.out = 100)
y_hat = alpha + beta * new_x

# Get lower and upper bounds for mean
ymean = data.frame(predict(bodyfat.lm, newdata = data.frame(Abdomen = new_x),
    interval = "confidence", level = 0.95))

# Get lower and upper bounds for prediction
ypred = data.frame(predict(bodyfat.lm, newdata = data.frame(Abdomen = new_x),
    interval = "prediction", level = 0.95))

output = data.frame(x = new_x, y_hat = y_hat, ymean_lwr = ymean$lwr, ymean_upr = ymean$upr,
    ypred_lwr = ypred$lwr, ypred_upr = ypred$upr)

# Extract potential outlier data point
outlier = data.frame(x = bodyfat$Abdomen[39], y = bodyfat$Bodyfat[39])

# Scatter plot of original
plot1 = ggplot(data = bodyfat, aes(x = Abdomen, y = Bodyfat)) + geom_point(color = "blue")

# Add bounds of mean and prediction
plot2 = plot1 +
    geom_line(data = output, aes(x = new_x, y = y_hat, color = "first"), lty = 1) +
    geom_line(data = output, aes(x = new_x, y = ymean_lwr, lty = "second")) +
    geom_line(data = output, aes(x = new_x, y = ymean_upr, lty = "second")) +
    geom_line(data = output, aes(x = new_x, y = ypred_upr, lty = "third")) +
    geom_line(data = output, aes(x = new_x, y = ypred_lwr, lty = "third")) +
    scale_colour_manual(values = c("orange"), labels = "Posterior mean", name = "") +
    scale_linetype_manual(values = c(2, 3), labels = c("95% CI for mean", "95% CI for predictions"),
        name = "") +
    theme_bw() + theme(legend.position = c(1, 0), legend.justification = c(1.5, 0))

# Identify potential outlier
plot2 + geom_point(data = outlier, aes(x = x, y = y), color = "orange", pch = 1, cex = 6)
```
]

.pull-right[
![](BS-L8-bregression_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
]

---
# Interpretation of CI


```r
pred.39 = predict(bodyfat.lm, newdata = bodyfat[39, ], interval = "prediction", level = 0.95)
out = cbind(bodyfat[39, ]$Abdomen, pred.39)
colnames(out) = c("abdomen", "prediction", "lower", "upper")
out
```

```
##    abdomen prediction   lower    upper
## 39   148.1   54.21599 44.0967 64.33528
```

---
# Lab

.pull-left[
- Use the Bayesian framework to work out the simple linear regression.
- Bear in mind that in the Bayesian framework, you actually get a distribution of credible regression lines (see the figure on the right), representing the uncertainty in the parameter estimates.
- (Optional) Calculate and plot the credible intervals for the mean and the prediction (see slide 33).
- One possible starting point is sketched after this slide.
]

.pull-right[
![](http://www.dataminingapps.com/wp-content/uploads/2017/09/fig4.png)
]
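---
# Lab (hint)

A minimal sketch of one way to start, assuming the `bodyfat` data and the non-informative prior `\(p(\alpha, \beta, \sigma^2) \propto 1/\sigma^2\)` from the previous slides: draw `\(\phi = 1/\sigma^2\)` from its Gamma posterior, then `\(\alpha\)` and `\(\beta\)` from their conditional Normal posteriors.


```r
## Direct Monte Carlo from the exact posterior (sketch)
x <- bodyfat$Abdomen
y <- bodyfat$Bodyfat
n <- length(y)
Sxx <- sum((x - mean(x))^2)
beta.hat <- sum((x - mean(x)) * (y - mean(y))) / Sxx
alpha.hat <- mean(y) - beta.hat * mean(x)
SSE <- sum((y - alpha.hat - beta.hat * x)^2)

ndraw <- 10000
phi <- rgamma(ndraw, shape = (n - 2) / 2, rate = SSE / 2)  # phi = 1 / sigma^2
sigma2 <- 1 / phi
beta.draw <- rnorm(ndraw, beta.hat, sqrt(sigma2 / Sxx))
alpha.draw <- rnorm(ndraw, alpha.hat, sqrt(sigma2 * (1 / n + mean(x)^2 / Sxx)))

## 95% credible intervals; these should be close to confint(bodyfat.lm)
quantile(beta.draw, c(0.025, 0.975))
quantile(alpha.draw, c(0.025, 0.975))
```

Each draw of `\((\alpha, \beta)\)` corresponds to one credible regression line, so plotting `alpha.draw + beta.draw * x` for a subset of draws reproduces a picture like the one on the previous slide.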
---
class: inverse, center, middle

# Bayesian Model Choice

---
# Bayesian Information Criterion (BIC)

We will see Bayesian model selection using the Bayesian information criterion, or BIC:
`$$\mathrm{BIC}=-2 \ln (\text { likelihood })+(p+1) \ln (n),$$`
where `\(n\)` is the number of observations and `\(p\)` is the number of predictors.

Recall that in simple linear regression:
`$$\text { likelihood }=p\left(y_{i} | \alpha, \beta, \sigma^{2}\right)=\mathcal{L}\left(\alpha, \beta, \sigma^{2}\right).$$`

---
# BIC

More generally,
`$$\text { likelihood }=p(\text { data } | \boldsymbol{\theta}, M)=\mathcal{L}(\boldsymbol{\theta}, M).$$`

In the definition of BIC, the likelihood is evaluated at the estimate `\(\hat{\boldsymbol{\theta}}\)`:
`$$\text { likelihood }=p(\text { data } | \hat{\boldsymbol{\theta}}, M)=\mathcal{L}(\hat{\boldsymbol{\theta}}, M).$$`

---
# BIC and `\(R^2\)`

If we connect BIC and `\(R^2\)`, we have
`$$\mathrm{BIC}=n \ln \left(1-R^{2}\right)+(p+1) \ln (n)+\text { constant }.$$`

- Increasing `\(p\)` will result in a larger `\(R^2\)`.
- A larger `\(R^2\)` means better goodness of fit, but a large `\(p\)` leads to overfitting.
- The second term penalizes models with too many predictors.
- There is a trade-off between the goodness of fit given by the first term and the model complexity represented by the second term.

---
# Backward Elimination with BIC

We will use the kid's cognitive score data set `cognitive`.


```r
# Load the library in order to read in data from website
library(foreign)

# Read in cognitive score data set and process data transformations
cognitive = read.dta("http://www.stat.columbia.edu/~gelman/arm/examples/child.iq/kidiq.dta")
cognitive$mom_work = as.numeric(cognitive$mom_work > 1)
cognitive$mom_hs = as.numeric(cognitive$mom_hs > 0)
colnames(cognitive) = c("kid_score", "hs", "IQ", "work", "age")
```

---
# Backward Elimination with BIC


```r
n = nrow(cognitive)
cog.lm = lm(kid_score ~ ., data = cognitive)
cog.step = stats::step(cog.lm, k = log(n))
```

Your findings?
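---
# BIC by hand

As a quick check of the definition above, BIC for the full model can also be computed directly from its log likelihood (a sketch; `cog.lm` and `n` are taken from the previous slide, and only differences in BIC between models matter for model comparison).


```r
## BIC of the full model from the definition on the earlier slide
p <- length(coef(cog.lm)) - 1   # number of predictors
-2 * as.numeric(logLik(cog.lm)) + (p + 1) * log(n)

## Note: stats::BIC() also counts sigma as an estimated parameter, so it is
## larger than the value above by log(n); model rankings are unaffected.
BIC(cog.lm)
```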
---
# Bayesian Model Uncertainty

- We discussed how to use BIC to pick the best model.
- In reality, we may often have several models with similar BIC.
- If we only pick the one with the lowest BIC, we may ignore the presence of other models that are equally good or can provide useful information.
- The credible intervals of coefficients may be narrower, since the uncertainty is being ignored when we consider only one model.
- Narrower intervals are not always better if they miss the true values of the parameters.
- To account for the uncertainty, getting the posterior probability of all possible models is necessary.

---
# Model Uncertainty

- To represent model uncertainty, we need to construct a probability distribution over all possible models, where each probability provides a measure of how likely the model is.
- Suppose we have a multiple linear regression
`$$y_{i}=\beta_{0}+\beta_{1}\left(x_{1, i}-\bar{x}_{1}\right)+\beta_{2}\left(x_{2, i}-\bar{x}_{2}\right)+\cdots+\beta_{p}\left(x_{p, i}-\bar{x}_{p}\right)+\epsilon_{i}, \quad 1 \leq i \leq n,$$`
how many models in total?

---
# Model uncertainty

- Denote each model as `\(M_m, m = 1, 2, \cdots, 2^p\)`.
- We want to calculate `\(p\left(M_{m} | \text { data }\right)\)`:
`$$p\left(M_{m} | \text { data }\right)=\frac{\text { marginal likelihood of } M_{m} \times p\left(M_{m}\right)}{\sum_{j=1}^{2^p} \text { marginal likelihood of } M_{j} \times p\left(M_{j}\right)}=\frac{p\left(\text { data } | M_{m}\right) p\left(M_{m}\right)}{\sum_{j=1}^{2^p} p\left(\text { data } | M_{j}\right) p\left(M_{j}\right)}.$$`
- Bayes factor:
`$$B F\left[M_{1}: M_{2}\right]=\frac{p\left(\text { data } | M_{1}\right)}{p\left(\text { data } | M_{2}\right)}.$$`

---
# Calculating Posterior Probability in R


```r
library(BAS)

# Use `bas.lm` for regression
cog_bas = bas.lm(kid_score ~ hs + IQ + work + age, data = cognitive)
```

---
# Visualisation

.pull-left[

```r
image(cog_bas)
```
]

.pull-right[
![](BS-L8-bregression_files/figure-html/unnamed-chunk-10-1.png)<!-- -->
]

---
# Calculating Posterior Probability in R


```r
round(summary(cog_bas), 3)
```

```
##           P(B != 0 | Y) model 1 model 2 model 3 model 4 model 5
## Intercept         1.000   1.000   1.000   1.000   1.000   1.000
## hs                0.632   1.000   0.000   1.000   1.000   0.000
## IQ                1.000   1.000   1.000   1.000   1.000   1.000
## work              0.188   0.000   0.000   1.000   0.000   1.000
## age               0.136   0.000   0.000   0.000   1.000   0.000
## BF                   NA   1.000   0.429   0.137   0.095   0.112
## PostProbs            NA   0.440   0.284   0.091   0.063   0.049
## R2                   NA   0.214   0.201   0.216   0.215   0.206
## dim                  NA   3.000   2.000   4.000   4.000   3.000
## logmarg              NA  45.193  44.347  43.206  42.842  43.000
```

---
# Calculating Posterior Probability in R


```r
print(cog_bas)
```

```
## 
## Call:
## bas.lm(formula = kid_score ~ hs + IQ + work + age, data = cognitive)
## 
## 
## Marginal Posterior Inclusion Probabilities:
## Intercept        hs        IQ      work       age
##    1.0000    0.6324    1.0000    0.1883    0.1363
```

This is obtained by summing the posterior model probabilities across models for each predictor:
`$$P(\beta \neq 0 | D)=\sum_{k \in\{i: \beta \neq 0 \text{ in } M_i\}} P\left(M_{k} | D\right).$$`
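For example, among the five models displayed in the `summary` output, `hs` appears in models 1, 3 and 4, which contribute `\(0.440 + 0.091 + 0.063 = 0.594\)` to its inclusion probability; the remaining models that also include `hs` (not shown) account for the rest of the reported `\(0.6324\)`.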
---
# Bayesian model averaging

- Once we have obtained the posterior probability of each model, we can make inferences and obtain weighted averages of quantities of interest, using these probabilities as weights.
- Models with higher posterior probabilities receive higher weights, while models with lower posterior probabilities receive lower weights.
- This gives the name **Bayesian Model Averaging** (BMA).
- e.g., the prediction of the next observation `\(\hat{Y}^*\)` after seeing the data can be calculated as a "weighted average" of the model-specific predictions `\(\hat{Y}_{j}^*\)`, with the posterior probability of `\(M_j\)` being the weight:
`$$\hat{Y}^{*}=\sum_{j=1}^{2^{p}} \hat{Y}_{j}^{*} p\left(M_{j} | \text { data }\right).$$`

---
# BMA

- More generally, we can use this weighted average formula to obtain the posterior distribution of a quantity of interest `\(\Delta\)`:
`$$p(\Delta | \text { data })=\sum_{j=1}^{2^{p}} p\left(\Delta | M_{j}, \text { data }\right) p\left(M_{j} | \text { data }\right).$$`
- We then have
`$$E[\Delta | \text { data }]=\sum_{j=1}^{2^p} E\left[\Delta | M_{j}, \operatorname{data}\right] p\left(M_{j} | \text { data }\right).$$`

---
# Coefficient Summary under BMA


```r
coef(cog_bas)
```

```
## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using BMA 
## 
##  Based on the top 16 models 
##            post mean  post SD   post p(B != 0)
## Intercept  86.79724    0.87276   1.00000
## hs          3.63750    3.29677   0.63239
## IQ          0.57679    0.06365   1.00000
## work        0.55219    1.54807   0.18831
## age         0.03551    0.15273   0.13633
```

---
# Posterior distributions of coefficients

.pull-left[

```r
par(mfrow = c(2, 2))
plot(coef(cog_bas), subset = c(2:5), ask = F)
```

- The vertical bar represents the posterior probability that the coefficient is 0.
- The curve represents the density of plausible values from all the models where the coefficient is non-zero.
]

.pull-right[
![](BS-L8-bregression_files/figure-html/unnamed-chunk-14-1.png)<!-- -->
]

---
# Posterior distributions of coefficients

.pull-left[

```r
plot(confint(coef(cog_bas), parm = 2:5))
```

This uses Monte Carlo sampling to draw from the mixture model over the coefficients, where models are sampled based on their posterior probabilities.
]

.pull-right[
![](BS-L8-bregression_files/figure-html/unnamed-chunk-15-1.png)<!-- -->

```
## NULL
```
]

---
# Prediction


```r
# Predict with model averaging
predict(cog_bas, estimator = 'BMA')

# Predict with model selection (highest probability model)
predict(cog_bas, estimator = 'HPM')

# Credible intervals for BMA predictions
plot(confint(predict(cog_bas, estimator = 'BMA', se.fit = TRUE)))
```

---
# Conclusion

- Bayesian linear regression.
- Connections between Bayesian regression and OLS.
- Bayesian model selection and model averaging.