Bayesian Statistics

class: left, bottom, inverse, title-slide

# Bayesian Statistics
## Lecture 10: Review
### Yanfei Kang
### 2019/11/20 (updated: 2019-12-10)

---

# What have we learnt?

- Basic Bayesian inference.

- Basic Bayesian computation.

- MC and MCMC.

- Bayesian modelling.

---
# Basic Bayesian inference

- We started from Bayes' Rule `$P(A|B) = \frac{P(A \& B)}{P(B)}$`.

- The HIV diagnostic testing example leads us to Bayesian updating, which refers to going from the **prior** to the **posterior**.

- Bayesian inference via Bayes' Rule `$P(\theta|\text{data}) = \frac{P(\text{data}| \theta)P(\theta))}{P(\text{data})}$`.

- We used different types of priors in the inference about the proportion `$p$` of college students get at least eight hours of sleep.

---
# Bayesian inference of the proportion `$p$`

- Prior: `$g(p)$`.

- Data: `$s$` sucesses and `$f$` failures.

- Likelihood: `$L(p) \propto p^{s}(1-p)^{f}, 0<p<1.$`.

- According to the Bayes' Rule: `$g(p|\text{data}) \propto g(p)L(p).$`

---
# Lab 1

In this example, 11 of 27 students sleep a sufficient number of hours. Make Bayesian inference of `$p$` based on discrete prior.

```r
pdisc <- function(pr, p, s, f, plot.it = TRUE) {
    like <- p^s * (1 - p)^f  # calculate likelihood
    post <- pr * like
    post <- post/sum(post)  # calculate posterior
    if (plot.it) 
        {
            par(mfrow = c(2, 1))
            plot(p, pr, type = "h", lty = 2, ylim = c(0, 1), ylab = "prior")
            plot(p, post, type = "h", lty = 1, ylim = c(0, 1), ylab = "posterior")
        }  # plot prior/posterior probability
    result <- list(prior = pr, likelihood = like, posterior = post)
    return(result)
}
```

---
# Lab 1

.pull-left[

```r
# Bayesian inference
p = seq(0.05, 0.95, by = 0.1)
prior = c(1, 5.2, 8, 7.2, 4.6, 2.1, 0.7, 0.1, 0, 0)
prior = prior/sum(prior)
pdisc(prior, p, 11, 16)
```
]

.pull-right[
![](BS-L10-review_files/figure-html/unnamed-chunk-2-1.png)

```
## $prior
##  [1] 0.034602076 0.179930796 0.276816609 0.249134948 0.159169550
##  [6] 0.072664360 0.024221453 0.003460208 0.000000000 0.000000000
## 
## $likelihood
##  [1] 2.149056e-15 6.422538e-11 2.389573e-09 9.803072e-09 1.074337e-08
##  [6] 3.939035e-09 4.437472e-10 9.833634e-12 1.099179e-14 8.679201e-22
## 
## $posterior
##  [1] 1.451714e-08 2.256022e-03 1.291349e-01 4.767910e-01 3.338350e-01
##  [6] 5.587820e-02 2.098297e-03 6.642742e-06 0.000000e+00 0.000000e+00
```
]

---
# Single parameter models

- We looked at the Bayesian Inference for the Cancer Death Rate.

- Assume `$y_j \sim \text{Poisson}(10n\theta_j)$`.

- Prior: `$\theta_j \sim \text{Beta}(\alpha, \beta)$`.

- Posterior: `$\theta_j|y_j \sim \text{Beta}(\alpha + y_j, \beta + 10n_j)$`.

---
# Lab 2.1

In theory, what is the predictive distribution of `$y$`?

Since `$$p(y)=\frac{p(y | \theta) p(\theta)}{p(\theta | y)},$$` we have

`$$\begin{aligned} p(y_j) & = \frac{\frac{\left(10 n_j \theta_{j}\right)^{y_{j}} e^{-10 n_{j} \theta_{j}}}{y_{j} !} \frac{\beta^{\alpha}}{\Gamma(\alpha)} \theta_{j}^{\alpha-1} e^{-\beta \theta_{j}}}{\frac{\left(\beta+10 n_{j}\right)^{\alpha+y_{j}}}{\Gamma\left(\alpha+y_{j}\right)} \theta_{j}^{\alpha+y_{j}-1} e^{-\left(\beta+10 n_{j}\right) \theta_{j}}} \\ &= \frac{\Gamma\left(\alpha+y_{j}\right) 10 n_{j}^{y_{j}} \beta^{\alpha}}{\Gamma(\alpha) y_{j} !\left(\beta+10 n_{j}\right)^{\alpha+y_{j}}} \sim\text{Neg-Bin}(\alpha, \beta/10n_j)
\end{aligned}$$`

---
# Lab 2.2

Simulate from the predictive distribution of `$y$`.
.pull-left[

```r
sim_pre_dis <- function(p_alpha, p_beta, n) {
    # simulate theta
    thetas <- rbeta(1000, p_alpha, p_beta)
    y <- c()
    for (theta in thetas) {
        # simulate y for every theta
        ys <- rpois(1000, 10 * n * theta)
        y <- c(y, ys)
    }
    return(y)
}
y <- sim_pre_dis(21, 440000, 1000)
hist(y)
```
]
.pull-right[
![](BS-L10-review_files/figure-html/unnamed-chunk-3-1.png)
]

---
# Multi-parameter models

- Normal data with both parameters unknown.

- Prior: `$g\left(\mu, \sigma^{2}\right) \propto 1 / \sigma^{2}.$`

- Posterior: `$$\begin{aligned} g\left(\mu, \sigma^{2} | y\right) & \propto \sigma^{-n-2} \exp \left(-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}\right) \\ &=\sigma^{-n-2} \exp \left(-\frac{1}{2 \sigma^{2}}\left[\sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}+n(\overline{y}-\mu)^{2}\right]\right) \\ &=\sigma^{-n-2} \exp \left(-\frac{1}{2 \sigma^{2}}\left[(n-1) s^{2}+n(\overline{y}-\mu)^{2}\right]\right) \end{aligned},$$`

- This can be factorized as the product of conditional and marginal posterior densities.

---
# Multi-parameter models

### Conditional posterior

`$$\mu | \sigma^{2}, y \sim \mathrm{N}\left(\overline{y}, \sigma^{2} / n\right).$$`

### Marginal posterior
`$$\begin{aligned} p\left(\sigma^{2} | y\right) & \propto \sigma^{-n-2} \exp \left(-\frac{1}{2 \sigma^{2}}(n-1) s^{2}\right) \sqrt{2 \pi \sigma^{2} / n} \\ & \propto\left(\sigma^{2}\right)^{-(n+1) / 2} \exp \left(-\frac{(n-1) s^{2}}{2 \sigma^{2}}\right) \end{aligned},$$`

which is a scaled inverse- `$\chi^2$` density: `$\sigma^{2} | y \sim \text{Inv-} \chi^2(n-1, s^2).$`

---
# Sampling from the joint posterior distribution

1. First draw `$\sigma^2$` from the marginal posterior distribution.

2. Draw `$\mu$` from the conditional posterior distribution.

**We can also use grid approximation to summarize the posterior**, but when we have more parameters we need **Bayesian Computing**.

---
# Introduction to Bayesian Computing

- Bear in mind that given the prior and data, the posterior is fixed and a Bayesian analysis really boils down to **summarizing the posterior**.

- Approaches to Bayesian Computing

- Just use a point estimate: Maximum a Posteriori (MAP) estimate.

- Approximate the posterior as Gaussian.

- Numerical integration.

- Markov Chain Monte Carlo (MCMC) sampling.

---
# Lab 3

Say `$Y|\theta \sim \text{Poisson}(N\theta),$` and `$\theta \sim \text{Gamma}(a, b)$`, find `$\hat{\theta}_{MAP}$`.

### Posterior

`$$\begin{aligned} 
 P(\theta | Y) &\propto P(Y | \theta) \cdot P(\theta)\\&= \frac{(N \theta)^{\sum y} e^{-N \theta}}{\Pi y !} \times \frac{b^{a} \theta^{a-1} e^{-b \theta}}{\Gamma(a)} \\&=\theta^{(\Sigma y)+a-1} e^{-N \theta} \cdot \frac{N^{y} b^{a} e^{b \theta}}{\Pi y ! \Gamma(a)}\\&=\theta^{\left(\sum y\right)+a-1} e^{-(N+b) \theta} \frac{N^{y} b^{a}}{\Pi y ! \Gamma(a)} \\ & \propto \theta^{\left(\sum y\right)+a-1} e^{-(N+b) \theta} 
\end{aligned}$$`

---
# Lab 3

- MAP: `$\hat{\boldsymbol{\theta}}=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} p(\boldsymbol{\theta} | \mathbf{Y})$`.

- For this question, by optimization, we have `$\hat{\theta}_{MAP}=\frac{\sum y+a-1}{N+b}.$`

- Remember that you can also use `optim` function in **R**.

---
# Bayesian CLT

- For large `$n$`, `$$\boldsymbol{\theta} \sim \operatorname{Normal}\left[\hat{\boldsymbol{\theta}}_{M A P}, \mathbf{I}\left(\hat{\boldsymbol{\theta}}_{M A P}\right)^{-1}\right].$$`

- The `$(j, k)$`-th element of `$\mathbf{I}$` is `$-\frac{\partial^{2}}{\partial \theta_{j} \partial \theta_{k}} \log [p(\boldsymbol{\theta} | \mathbf{Y})],$` evaluated at `$\hat{\theta}_{MAP}$`.

- Then we can do everything we do for normal distribution, e.g., sd, intervals etc.

---
# Strategies to summarize posterior

1. If we have familiar distributions, we can simply **simulate** samples in R.

2. **Grid approximation**.
    - Compute values of the posterior on a grid of points.
    - Approximate the continuous posterior by a discrete posterior.
    - Applicable for one- and two-parameter problems.

3. If there are more parameters, we consider **deterministic tools**.
    - MAP
    - Bayesian CLT
    - Numerical Integration
    
4. **Monte Carlo** methods are more general.
    - Independent Monte Carlo 
    - Markov Chain Monte Carlo

---
# Independent Monte Carlo: Direct sampling

- For any continuous r.v. `$X$`, `$F_{X}(X) \sim U(0,1)$`. **Why?**

- Suppose that we are given `$U \sim U(0,1)$` and want to simulate a continuous r.v. `$X$` with cdf `$F_X$`. Put `$Y = F_{X}^{-1}(U)$`, we have 
`$$F_{Y}(y)=\mathbb{P}(Y \leq y)=\mathbb{P}\left(F_{X}^{-1}(U) \leq y\right)=\mathbb{P}\left(U \leq F_{X}(y)\right)=F_{X}(y).$$`

---
# Independent Monte Carlo: Rejection sampling

- Although we cannot easily sample from `$f(\theta|y)$`, there exists another density `$g(\theta)$` (called *proposal* density), like a Normal distribution or perhaps a `$t$`-distribution, from which it is easy for us to sample.

- Then we can sample from  `$g(\theta)$` directly and then "reject" the samples in a strategic way to make the resulting "non-rejected" samples look like they came from `$f(\theta|y)$`.

---
# Independent Monte Carlo: Rejection sampling

1. Sample `$U \sim \operatorname{Unif}(0,1)$`.

2. Sample `$\theta$` at random from the probability density proportional to `$g(\theta)$`.

3. If `$U \leq \frac{f(\theta|y)}{M g(\theta)}$`, accept `$\theta$` as a draw from `$f$`. If the drawn `$\theta$` is rejected, return to step 1. (With probability `$\frac{f(\theta | y)}{M g(\theta)}$`), accept `$\theta$`).

The algorithm can be repeated until the desired number of samples from the target density  `$f$` has been accepted.

---
# Lab 4

Use the rejection sampling algorithm to simulate from `$N(0, 1)$`.  Try the following proposal densities.

1. Logistic Distribution.
2. Cauchy distribution.
3. Student `$t_5$` distribution.
4. `$N(1,2)$` distribution.

For each proposal record the percentage of iterates that are accepted.
Which is the best proposal?

---
# Lab 4

.pull-left[

```r
curve(dnorm(x), -6, 6, xlab="x", ylab = "Density", n=200)
curve(dlogis(x), -6, 6,add=TRUE, col=2, n=200)
curve(dcauchy(x), -6, 6, add=TRUE, col=3, n=200)
curve(dt(x, df=5), -6, 6, add=TRUE, col=4, n=200)
curve(dnorm(x, 1, 2), -6, 6, add=TRUE, col=5, n=200)
legend("topright", c("N(0,1)", "logistic", "cauchy", "t(5)", "N(1,2)"), col=1:5, bty="n", lty=1)
```
]

.pull-right[
![](BS-L10-review_files/figure-html/unnamed-chunk-4-1.png)
]

---
# Lab 4

```r
dist_ratio <- function(x, dist_a, dist_b, ...) {
  return(dist_a(x) / dist_b(x, ...))
}
1 / optimise(dist_ratio, c(-2,2), dnorm, dlogis, maximum=TRUE)$objective
```

```
## [1] 0.6266571
```

```r
1 / optimise(dist_ratio, c(-2,2), dnorm, dcauchy, maximum=TRUE)$objective
```

```
## [1] 0.6577446
```

```r
1 / optimise(dist_ratio, c(-2,2), dnorm, dt, df=5, maximum=TRUE)$objective
```

```
## [1] 0.9078776
```

```r
1 / optimise(dist_ratio, c(-3,3), dnorm, dnorm, mean=1, sd=2, maximum=TRUE)$objective
```

```
## [1] 0.4232409
```

---
# Independent Monte Carlo: Importance sampling

- We can rewrite the target `$$\mathbb{E}_{f}[h(\theta|y)]=\mathbb{E}_{g}\left[\frac{f(\theta|y)}{g(\theta)} h(\theta|y)\right].$$`
- If `$\theta_{1}, \dots, \theta_{n} \sim g$`, we get 
`$$\tilde{\mu}_{n}=\frac{1}{n} \sum_{i=1}^{n} \frac{f\left(\theta_{i}|y\right)}{g\left(\theta_{i}\right)} h\left(\theta_{i}|y\right)=\frac{1}{n} \sum_{i=1}^{n} w_{i} h\left(\theta_{i}|y\right) \approx \mathbb{E}_{f}[h(\theta|y)].$$`
- `$w_i$` are referred to as the importance weights.

---
# Limitations of independent Monte Carlo methods

- The ability to sample from the posterior distribution is essential to the practice of Bayesian statistics.

- Direct sampling
    - Sometimes difficult to find the inverse function.

- Rejection sampling and importance sampling 
    - Doesn’t work well if proposal is very different from the target.
    - Yet constructing a proposal similar to the target can be difficult.

- Intuition of MCMC
    - Instead of a fixed proposal, use an adaptive proposal.
    
---
# The key to MCMC's success

- Not the Markov property.

- But the approximate distributions are improved at each step in simulation (in the sense of converging to the target distribution).

---
# Metropolis

To make sure this Markov chain converges to our target, we need to include the following.

1. At step `$i+1$`, set `$x^{[\mathrm{old}]} = x^{[i]}$`.
2. Generate `$x^{[\mathrm{new}]} \sim N\left(x^{[\mathrm{old}]}, 1\right)$`, and compute `$$\alpha=\min \left(1, \frac{p\left(x^{[\mathrm{new}]}\right)}{p\left(x^{[\mathrm{old}]}\right)}\right).$$`
3. Set `$x^{[i+1]} = x^{[\mathrm{new}]}$` with probability `$\alpha$` (accept); set `$x^{[i+1]} = x^{[\mathrm{old}]}$` with probability `$1-\alpha$` (reject).

---
# Example

.pull-left[
Here is a target distribution to sample from. It’s the weighted sum of two normal distributions. Sample using the Metropolis algorithm, and plot the trace.

```r
p <- 0.4
mu <- c(-1, 2)
sd <- c(0.5, 2)
f <- function(x) {
    p * dnorm(x, mu[1], sd[1]) + (1 - p) * dnorm(x, mu[2], sd[2])
}
curve(f(x), -4, 8, n = 301, las = 1)
```
]

.pull-right[
![](BS-L10-review_files/figure-html/unnamed-chunk-6-1.png)
]

---
#

### Define `$q$`

```r
q <- function(x) rnorm(1, x, 4)
```

### Define the step

```r
step <- function(x, f, q) {
  ## Pick new point
  xp <- q(x)
  ## Acceptance probability:
  alpha <- min(1, f(xp) / f(x))
  ## Accept new point with probability alpha:
  if (runif(1) < alpha)
    x <- xp
  ## Returning the point:
  x
}
```

---
#  Let's run it

```r
run <- function(x, f, q, nsteps) {
  res <- matrix(NA, nsteps, length(x))
  for (i in seq_len(nsteps))
    res[i,] <- x <- step(x, f, q)
  return(res)
}
res <- run(-10, f, q, 1100)
```

---
# Trace plot (1100 samples, with the first 100 burn-in)

---
# Does it resembles our target?

.pull-left[

```r
hist(res[101:1100], 50, freq=FALSE, main="", ylim=c(0, .4), las=1,
     xlab="x", ylab="Probability density")
z <- integrate(f, -Inf, Inf)$value
curve(f(x) / z, add=TRUE, col="red", n=200)
```
]

.pull-right[
![](BS-L10-review_files/figure-html/unnamed-chunk-11-1.png)
]

---
# Lab 5

1. Try 10000 samples.

2. Try different initial values (even very unrealistic ones).

3. Try different values of standard deviation for `$q$` (e.g., 50, 0.1). Plot their trace plots together with the previous one. Calculate their acceptance rates. Your findings?

4. Based on 3, plot the autocorrelation plots of the three traces. Your findings?

---
# Lab5

.pull-left[

```r
res.long <- run(-10, f, q, 50000)
hist(res.long, 100, freq=FALSE, main="", ylim=c(0, .4), las=1,
     xlab="x", ylab="Probability density", col="grey")
z <- integrate(f, -Inf, Inf)$value
curve(f(x) / z, add=TRUE, col="red", n=200)
```
]

.pull-right[
![](BS-L10-review_files/figure-html/unnamed-chunk-12-1.png)
]

---
# Lab5

.pull-left[

```r
res.fast <- run(-10, f, function(x) rnorm(1, x,  50), 1000)
res.slow <- run(-10, f, function(x) rnorm(1, x,  .1), 1000)
layout(matrix(c(1, 2), 1, 2), widths=c(4, 1))
par(mar=c(4.1, .5, .5, .5), oma=c(0, 4.1, 0, 0))
plot(res, type="s", xpd=NA, ylab="Parameter", xlab="Sample", las=1,
     col="grey")
lines(res.fast, col="red")
lines(res.slow, col="blue")
plot(f(xx), xx, type="l", yaxs="i", axes=FALSE)
```
]

.pull-right[
![](BS-L10-review_files/figure-html/unnamed-chunk-13-1.png)
]

---
# Lab5

.pull-left[

```r
par(mfrow=c(3, 1), mar=c(4, 2, 3.5, .5))
acf(res.slow, las=1, main="Small steps")
acf(res, las=1, main="Intermediate")
acf(res.fast, las=1, main="Large steps")
```
]

.pull-right[
![](BS-L10-review_files/figure-html/unnamed-chunk-14-1.png)
]

---
# Lab5

```r
coda::effectiveSize(res)
```

```
##     var1 
## 173.5628
```

```r
coda::effectiveSize(res.fast)
```

```
##     var1 
## 31.53674
```

```r
coda::effectiveSize(res.slow)
```

```
##     var1 
## 1.823605
```

---
# Metropolis-Hastings

1. At step `$i+1$`, set `$x^{[\mathrm{old}]} = x^{[i]}$`.
2. Simulate `$x^{[\mathrm{new}]}$` from the proposal density `$q\left(x^{[\mathrm{new}]}|x^{[\mathrm{old}]}\right)$`, and compute `$$\alpha=\min \left(1, \frac{p\left(x^{[\mathrm{new}]}\right) q\left(x^{[\mathrm{old}]}|x^{[\mathrm{new}]}\right)}{p\left(x^{[\mathrm{old}]}\right) q\left(x^{[\mathrm{new}]}|x^{[\mathrm{old}]}\right)}\right).$$`
3. Set `$x^{[i+1]} = x^{[\mathrm{new}]}$` with probability `$\alpha$` (accept); set `$x^{[i+1]} = x^{[\mathrm{old}]}$` with probability `$1-\alpha$` (reject).

---
# Gibbs sampler

- This gives the Gibbs Sampler. If our current state is `$(x^{[i]}, z^{[i]})$`, then we update our parameter values with the following steps.
    1. Generate `$x^{[i+1]} \sim p\left(x|z^{[i]}\right).$`
    2. Generate `$z^{[i+1]} \sim p\left(z|x^{[i+1]}\right).$`
    
- `$x$` and `$z$` can be swapped around.

- It works for more than two variables (think about the case with three parameters).

- Always make sure the conditioning variables are at the most recent state.

---
# Lab 6

Simulate using Gibbs sampler from a bivariate Normal distribution with mean `$(\mu_1, \mu_2)$` and known variance matrix  `$\Sigma=\left(\begin{array}{cc}{\sigma_{1}^{2}} & {\rho} \\ {\rho} & {\sigma_{2}^{2}}\end{array}\right)$`, and plot the sample traces.

---
# Lab 6

```r
gibbs <- function(n, mu1, s1, mu2, s2, rho) {
    mat <- matrix(ncol = 2, nrow = n)
    x <- 0
    y <- 0
    mat[1, ] <- c(x, y)
    for (i in 2:n) {
        x <- rnorm(1, mu1 + (s1/s2) * rho * (y - mu2), sqrt((1 - rho^2) * s1^2))
        y <- rnorm(1, mu2 + (s2/s1) * rho * (x - mu1), sqrt((1 - rho^2) * s2^2))
        mat[i, ] <- c(x, y)
    }
    mat
}
gibbs_sim1 <- gibbs(1000, 1, 0.5, 2, 8, 0.4)
```

---
# Bayesian linear regression

- Recall the linear regression model

`$$y_i = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p + \epsilon_i,$$`
    where `$\epsilon_i \sim N(0, \sigma^2).$`

- We are intertested in the joint distribution of
    `$\beta_0,\beta_1,...\beta_p$` and `$\sigma^2$`.

- You know how to get that by OLS under the Gaussian assumption. We have
    `$\beta|\sigma^2 \sim N\left[(x'x)^{-1}x'y,(x'x)^{-1}\sigma^2\right]$` and `$\sigma^2 \sim \chi^2(n-p)$`.
    
---
# The Bayesian approach

- We know by the Bayes' rule `$$\begin{aligned} p\left(\beta, \sigma^{2} | y, x\right) & \propto p\left(y | \beta, \sigma^{2}, x\right) p\left(\beta, \sigma^{2}\right) \\ &=p\left(y | \beta, \sigma^{2}, x\right) p\left(\beta | \sigma^{2}\right) p\left(\sigma^{2}\right) \end{aligned}$$`
                              
      where `$p(y|\beta,\sigma^2,x)$` is the **likelihood** for the
      model and `$p(\beta,\sigma^2) = p(\beta|\sigma^2)p(\sigma^2)$` is
      the **prior** information of the parameters and
      `$p(\beta,\sigma^2|y,x)$` is called the **posterior**.

- **The Bayesian way**: Let's just draw random samples
      from `$p(\beta,\sigma^2|y,x)$`.

---
# Sampling the Posterior

- Write down the likelihood function.
- Specify the prior.
- Write down the posterior.
- Use Gibbs to draw. Set a initial value for `$\beta^{(0)}$` and `$\sigma^{2(0)}$`.
    - Draw a random vector `$\beta^{(1)}$` from `$p(\beta^{(1)} | \sigma^{2(0)},y,x)$`
    - Draw a random number `$\sigma^{2(1)}$` from `$p(\sigma^{2(1)}| \beta^{(1)} ,y,x)$`
    - Draw a random vector `$\beta^{(2)}$` from `$p(\beta^{(2)} | \sigma^{2(1)},y,x)$`
    - Draw a random number `$\sigma^{2(2)}$` from `$p(\sigma^{2(2)}| \beta^{(2)} ,y,x)$`
    - Draw a random vector `$\beta^{(3)}$` from `$p(\beta^{(3)} | \sigma^{2(2)},y,x)$`
    - Draw a random number `$\sigma^{(3)}$` from `$p(\sigma^{2(3)}| \beta^{(3)} ,y,x)$`
    - ...
    - ...
- Summarize `$\beta^{(1)},\beta^{(2)},...,\beta^{(n)}$`
- Summarize `$\sigma^{(1)},\sigma^{(2)},...,\sigma^{(n)}$`

---
# Lab 7

Work on random walk Metroplis within Gibbs for multiple linear regression.

- Prior for `$\beta$`: `$\beta \sim N(0, I)$`.

- Prior for `$\sigma^2$`: `$\sigma^2 \sim \chi^2(10)$`.

- Write down the likelihood.

- Write down the joint posterior.

- Go to Gibbs.

- Use random walk proposal for Metropolis.

- Summary your samples.

---
# Lab 7.1

```r
## Generate data for linear regression
n <- 1000
p <- 3
epsilon <- rnorm(n, 0, 0.1)
beta <- matrix(c(-3, 1, 3, 5))
X <- cbind(1, matrix(runif(n * p), n))
y <- X %*% beta + epsilon
```

---
# Lab 7.1

```r
## Log posterior
library(mvtnorm)
LogPosterior <- function(beta, sigma2, y, X) {
    p <- length(beta)
    ## The log likelihood function
    LogLikelihood <- sum(dnorm(x = y, mean = X %*% beta, sd = sqrt(sigma2), 
        log = TRUE))
    ## The priors
    LogPrior4beta <- dmvnorm(x = t(beta), mean = matrix(0, 1, p), sigma = diag(p), 
        log = TRUE)
    LogPrior4sigma2 <- dchisq(x = sigma2, df = 10, log = TRUE)
    LogPrior <- LogPrior4beta + LogPrior4sigma2
    ## The log posterior
    LogPosterior <- LogLikelihood + LogPrior
    return(LogPosterior)
}
```

---
# Lab 7.1

```r
step <- function(x, f, q, ...) {
    ## Pick new point
    xp <- q(x)
    ## Acceptance probability:
    alpha <- min(1, exp(f(xp, ...) - f(x, ...)))
    ## Accept new point with probability alpha:
    if (runif(1) < alpha) 
        x <- xp
    ## Returning the point:
    return(x)
}
```

---
# Lab 7.1

```r
q1 <- function(x){x + rnorm(length(x), sd=0.5)}
q2 <- function(x){rgamma(1, 1)}
mgibbs <- function(y, X, beta.init, sigma2.init, q1, q2, f, nIter = 1000, nBurnin = nIter/10){
  mean.X <- apply(X, 2, mean)
  X.demean <- apply(X, 2, function(x){x - mean(x)})
  X <- cbind(1, X.demean)
  beta <- matrix(NA, nIter, ncol(X))
  sigma2 <- matrix(NA, nIter, 1)
  beta[1, ] <- beta.init
  sigma2[1, ] <- sigma2.init
  for (i in 2:nIter) {
    beta[i, ] <- step(beta[i-1, ], LogPosterior, q1, sigma2=sigma2[i-1], y=y, X=X)
    sigma2[i, ] <- step(sigma2[i-1], LogPosterior, q2, beta=beta[i,], y=y, X=X)}
  beta[, 1] = beta[,1] - beta[, 2:ncol(X)] %*% as.matrix(mean.X, ncol(X) - 1, 1)
  return(list(beta = beta[-c(1:nBurnin), ],
              sigma2 = sigma2[-c(1:nBurnin)]))
}
```

---
# Lab 7.1

```r
toy_mgibbs <- mgibbs(y, X[, 2:4], c(-1,2,3,4), 0.5, q1, q2, LogPosterior, 5000)
apply(toy_mgibbs$beta, 2, mean)
```

```
## [1] -3.083617  1.070846  2.989839  5.064991
```

---
# Lab 7.2

```r
library(BAS)
data("bodyfat")
X <- matrix(bodyfat$Abdomen)
y <- matrix(bodyfat$Bodyfat)
bodyfat.mgibbs <- mgibbs(y, X, c(-10,0.6), 0.5, q1, q2, LogPosterior, 10000)
bodyfat.coef <- apply(bodyfat.mgibbs$beta, 2, mean)
```

---
# Lab 7.2

.pull-left[

```r
plot(X, y, xlab = 'Abdomen', ylab = 'Bodyfat')
for (i in 1:dim(bodyfat.mgibbs$beta)[1]){
  abline(a = bodyfat.mgibbs$beta[i, 1], b = bodyfat.mgibbs$beta[i, 2], col = 'lightblue')
}
abline(a = bodyfat.coef[1], b = bodyfat.coef[2], col = 2)
legend("bottomright", c('Posteror mean', 'Posterior sample regression lines'), col = c('red', 'lightblue'), lty = 1)
```
]

.pull-right[
![](BS-L10-review_files/figure-html/unnamed-chunk-23-1.png)
]

---
# Lab 7.2

```r
library(ggplot2)
# Construct current prediction
beta0 = bodyfat.coef[1]; beta1 = bodyfat.coef[2]
n.newx = 100; alpha = 0.05
new_x = seq(min(bodyfat$Abdomen), max(bodyfat$Abdomen), 
            length.out = n.newx)
y_hat = beta0 + beta1 * new_x
# Get lower and upper bounds for mean and pred
ymean_lwr = ymean_upr = ypred_lwr = ypred_upr = rep(NA, n.newx)
for (i in 1:n.newx){
  y_hats = bodyfat.mgibbs$beta %*% matrix(c(1, new_x[i]), 2, 1)
  ymean_lwr[i] = quantile(y_hats, alpha/2)
  ymean_upr[i] = quantile(y_hats, 1 - alpha/2)
  ypred_hats = bodyfat.mgibbs$beta %*% matrix(c(1, new_x[i]), 2, 1) + sapply(1:dim(bodyfat.mgibbs$beta)[1], function(i){rnorm(1, 0, bodyfat.mgibbs$sigma2[i])})
  ypred_lwr[i] = quantile(ypred_hats, alpha/2)
  ypred_upr[i] = quantile(ypred_hats, 1- alpha/2)
}
output = data.frame(x = new_x, y_hat = y_hat, ymean_lwr = ymean_lwr, ymean_upr = ymean_upr,  ypred_lwr = ypred_lwr, ypred_upr = ypred_upr)
```

---
# Lab 7.2

.pull-left[

```r
ggplot(data = bodyfat, aes(x = Abdomen, y = Bodyfat)) + geom_point(color = "blue") + 
    geom_line(data = output, aes(x = new_x, y = y_hat, color = "first"), lty = 1) + 
    geom_line(data = output, aes(x = new_x, y = ymean_lwr, lty = "second")) + 
    geom_line(data = output, aes(x = new_x, y = ymean_upr, lty = "second")) + 
    geom_line(data = output, aes(x = new_x, y = ypred_upr, lty = "third")) + 
    geom_line(data = output, aes(x = new_x, y = ypred_lwr, lty = "third")) + 
    scale_colour_manual(values = c("orange"), labels = "Posterior mean", name = "") + 
    scale_linetype_manual(values = c(2, 3), labels = c("95% CI for mean", "95% CI for predictions"), 
        name = "") + theme_bw() + theme(legend.position = c(1, 0), legend.justification = c(1.5, 
    0)) + xlab("abdomen circumference (cm)")
```
]

.pull-right[
![](BS-L10-review_files/figure-html/unnamed-chunk-25-1.png)
]

---
# BMA

- More generally, we can use this weighted average formula to obtain the value of a quantity of interest `$\Delta$`, `$$p(\Delta | \text { data })=\sum_{j=1}^{2^{p}} p\left(\Delta | M_{j}, \text { data }\right) p\left(M_{j} | \text { data }\right).$$`

- We then have `$$E[\Delta | \text { data }]=\sum_{j=1}^{2^p} E\left[\Delta | M_{j}, \operatorname{data}\right] p\left(M_{j} | \text { data }\right).$$`

---
# Classification with logistic regression

- Response is assumed to be binary.

- Example: Spam/Ham. Covariates: symbols, etc.

- Logistic regression: `$$p(y_i = 1|x_i) = \frac{\exp(x_i^{'}\beta)}{1 + \exp(x_i^{'}\beta)}.$$`

- Likelihood: `$$p(y | X, \beta) = \prod_i^n \frac{(\exp(x_i^{'}\beta))^{y_i}}{1 + \exp(x_i^{'}\beta)}.$$`

- Prior: `$\beta \sim N(0, \tau^2I)$`.

---
# Normal approximation of posterior

- Remember MAP?

- How to realize normal approximation in R?

---
# Bayesian forecasting

- The Bayesian paradigm: quantifies uncertainty about: **unknown|known** using a **distribution**.

- In forecasting the unknown quantity of interest is `$y_{T+1}$`:

`$$\begin{aligned} p\left(y_{T+1} | \mathbf{y}\right) &=\int_{\theta} p\left(y_{T+1}, \boldsymbol{\theta} | \mathbf{y}\right) d \boldsymbol{\theta} \\ &=\int_{\theta} p\left(y_{T+1} | \boldsymbol{\theta}, \mathbf{y}\right) p(\boldsymbol{\theta} | \mathbf{y}) d \boldsymbol{\theta} \\ &=E_{\theta | \mathbf{y}}\left[p\left(y_{T+1} | \boldsymbol{\theta}, \mathbf{y}\right)\right] \end{aligned}.$$`

- Bayesian forecasting is to **evaluate the posterior expectation**.

---
class: inverse, center, middle
# All the best!