Bayesian Statistics and Computing

Lecture 13: Independent Monte Carlo Methods

Yanfei Kang | BSC | Beihang University

1 / 23

Strategies to summarize posterior

  1. If we have familiar distributions, we can simply simulate samples in R. What if there isn’t a built-in function for doing this?

  2. Grid approximation.

    • Compute values of the posterior on a grid of points.
    • Approximate the continuous posterior by a discrete posterior.
    • Applicable for one- and two-parameter problems.
  3. If there are more parameters, we consider deterministic tools.

    • MAP
    • Bayesian CLT
    • Numerical Integration
  4. Monte Carlo methods are more general.

    • Independent Monte Carlo
    • Markov Chain Monte Carlo
2 / 23

Direct sampling

3 / 23

Inverse CDF transformation

  • Suppose that we are given $U \sim U(0,1)$ and want to simulate a continuous r.v. $X$ with cdf $F_X$. Put $Y = F_X^{-1}(U)$; then $$F_Y(y) = P(Y \le y) = P(F_X^{-1}(U) \le y) = P(U \le F_X(y)) = F_X(y).$$

  • For any continuous r.v. $X$, $F_X(X) \sim U(0,1)$. Why?

4 / 23

Define $Z = F_X(X)$; then $$F_Z(x) = P(F_X(X) \le x) = P(X \le F_X^{-1}(x)) = F_X(F_X^{-1}(x)) = x.$$
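As a quick sanity check of this probability integral transform (a sketch, not from the original slides), we can push draws from any continuous distribution through their own cdf and inspect the result for uniformity:

# If X ~ N(0, 1), then F_X(X) = pnorm(X) should be Uniform(0, 1).
set.seed(42)
x <- rnorm(10000)
u <- pnorm(x)
hist(u, freq = FALSE, main = "F_X(X) for X ~ N(0, 1)")  # should look flat
ks.test(u, "punif")  # Kolmogorov-Smirnov test against Uniform(0, 1)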

Example: exponential distribution

  • If $X \sim \mathrm{Exp}(\lambda)$, then $$F_X(x) = \begin{cases} 0 & \text{for } x < 0, \\ 1 - e^{-\lambda x} & \text{for } x \ge 0. \end{cases}$$

  • How do we apply the inverse CDF transformation here? Setting $u = 1 - e^{-\lambda x}$ and solving for $x$ gives $F_X^{-1}(u) = -\frac{1}{\lambda}\log(1 - u)$.

5 / 23

Example: exponential distribution

lambda <- 1
u <- runif(20)                   # uniform draws on (0, 1)
x <- -1/lambda * log(1 - u)      # inverse CDF: F^{-1}(u) = -log(1 - u)/lambda
curve(1 - exp(-x), 0, 5, ylab = "u", main = "Inverse CDF transformation for exp(1)")
for (i in seq_along(x)) {
  lines(c(x[i], x[i]), c(u[i], 0), lty = "dashed")   # drop from the curve to the x-axis
  lines(c(x[i], 0), c(u[i], u[i]), lty = "dashed")   # trace back to the u-axis
}

6 / 23

Example: exponential distribution

# Direct sampling
lambda <- 1
u <- runif(1000)             # uniform draws
x <- -1/lambda * log(1 - u)  # transformed to exp(lambda) draws
hist(x)
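As a quick check (not in the original slides), these inverse-CDF draws should agree with R's built-in exponential sampler:

# Compare the inverse-CDF samples with rexp() via a QQ-plot.
qqplot(x, rexp(1000, rate = lambda), xlab = "Inverse-CDF samples", ylab = "rexp() samples")
abline(0, 1, col = 2)  # points should hug this line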

7 / 23

Rejection sampling

8 / 23

What if we cannot write down the inverse?

Rejection sampling

  • Although we cannot easily sample from $f(\theta|y)$, there exists another density $g(\theta)$ (called the proposal density), like a Normal distribution or perhaps a t-distribution, from which it is easy for us to sample.

  • Then we can sample from g(θ) directly and then "reject" the samples in a strategic way to make the resulting "non-rejected" samples look like they came from f(θ|y).

9 / 23

Rejection sampling

  1. Sample $U \sim \mathrm{Unif}(0,1)$.

  2. Sample $\theta$ at random from the probability density proportional to $g(\theta)$.

  3. If $U \le \frac{f(\theta|y)}{M g(\theta)}$, accept $\theta$ as a draw from $f$; otherwise reject it and return to step 1. (That is, each drawn $\theta$ is accepted with probability $\frac{f(\theta|y)}{M g(\theta)}$.)

The algorithm can be repeated until the desired number of samples from the target density f has been accepted.
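A minimal R sketch of this loop (not from the slides), using the standard normal target and $t_2$ proposal from the example below:

# Rejection sampling: target f = N(0, 1), proposal g = t with 2 df.
# M bounds f/g; the ratio dnorm(x)/dt(x, 2) is maximized at x = +/- 1.
set.seed(1)
M <- dnorm(1) / dt(1, 2)
n <- 1000
samples <- numeric(n)
accepted <- 0
tries <- 0
while (accepted < n) {
  u <- runif(1)                        # step 1: U ~ Unif(0, 1)
  theta <- rt(1, 2)                    # step 2: draw from the proposal g
  tries <- tries + 1
  if (u <= dnorm(theta) / (M * dt(theta, 2))) {  # step 3: accept/reject
    accepted <- accepted + 1
    samples[accepted] <- theta
  }
}
accepted / tries  # empirical acceptance rate, close to 1/M (about 0.8)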

10 / 23

Illustration

  • The top curve is an approximation function, $Mg(\theta)$, and the bottom curve is the target density, $f(\theta|y)$.
  • As required, $Mg(\theta) \ge f(\theta|y)$ for all $\theta$.
  • The vertical line indicates a single random draw $\theta$ from the density proportional to $g$.
  • The probability that a sampled draw $\theta$ is accepted is the ratio of the height of the lower curve to the height of the upper curve at the value $\theta$.
11 / 23

What should g(θ) look like?

  1. $\mathcal{X}_f \subseteq \mathcal{X}_g$, where $\mathcal{X}_f$ is the support of $f(\theta|y)$ and $\mathcal{X}_g$ is the support of $g(\theta)$.

  2. For any $\theta \in \mathcal{X}_f$, $\frac{f(\theta|y)}{g(\theta)} \le M$; that is, $M = \sup_{\theta \in \mathcal{X}_f} \frac{f(\theta|y)}{g(\theta)} < \infty$.

A simple choice is a $g(\theta)$ with heavier tails than the target.
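For instance (an illustrative check, not from the slides), a proposal with lighter tails than the target fails condition 2, because the ratio $f/g$ is unbounded:

# A t_2 target with a N(0, 1) proposal: f/g diverges in the tails,
# so no finite M exists and rejection sampling cannot be applied.
ratio <- function(x) dt(x, 2) / dnorm(x)
ratio(c(2, 4, 6, 8))  # grows without bound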

12 / 23

Example

Suppose we want samples from $N(0,1)$. We could use the $t_2$ distribution as our proposal density, as it has heavier tails than the Normal.

curve(dnorm(x), -6, 6, xlab = expression(theta), ylab = "Density", n = 200)
curve(dt(x, 2), -6, 6, add = TRUE, col = 4, n = 200)
legend("topright", c("Normal density", "t density"), col = c(1, 4), bty = "n", lty = 1)
x <- rt(200, 2)
rug(x, col = 4)
y <- rnorm(200)
rug(y, col = 1)

13 / 23


Probability of θ accepted

$$P(\theta \text{ accepted}) = P\left(U \le \frac{f(\theta|y)}{Mg(\theta)}\right) = \int P\left(U \le \frac{f(\theta|y)}{Mg(\theta)} \,\middle|\, \theta\right) g(\theta)\, d\theta = \int \frac{f(\theta|y)}{Mg(\theta)}\, g(\theta)\, d\theta = \frac{1}{M}$$

  1. This has implications for how we choose the proposal density g.

  2. We want to choose g so that it matches f as closely as possible.

14 / 23

Back to the example

$f/g$ is maximized at $x = \pm 1$, so $M = 1.257$. The acceptance probability is therefore $1/M \approx 0.8$.
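These numbers are easy to verify in R (a quick check, not part of the original slides):

# The ratio dnorm(x)/dt(x, 2) is maximized at x = +/- 1.
ratio <- function(x) dnorm(x) / dt(x, 2)
optimize(ratio, c(0, 6), maximum = TRUE)  # maximum near x = 1
M <- ratio(1)
M      # about 1.257
1 / M  # acceptance probability, about 0.8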

15 / 23

Changing df

16 / 23

Why does rejection sampling work?

It can be shown that the distribution function of the accepted values is equal to the distribution function corresponding to the target density: $P(\Theta \le \theta \mid \Theta \text{ accepted}) = F(\theta)$.
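A short derivation of this fact, filling in the "can be shown" step with the acceptance probability $1/M$ from the previous slide:

$$P(\Theta \le \theta \mid \Theta \text{ accepted}) = \frac{P(\Theta \le \theta,\ \Theta \text{ accepted})}{P(\Theta \text{ accepted})} = \frac{\int_{-\infty}^{\theta} \frac{f(t|y)}{Mg(t)}\, g(t)\, dt}{1/M} = \int_{-\infty}^{\theta} f(t|y)\, dt = F(\theta).$$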

What is the problem with rejection sampling?

17 / 23

Rejection sampling

  • Suppose we are interested in $E_f[h(\theta|y)]$; what do we do?

  • Note that in most cases, the candidates generated from g fall within the domain of f, so that they are in fact values that could plausibly come from f.

  • They are simply over- or under-represented in the frequency with which they appear.

  • For example, if g has heavier tails than f, then there will be too many extreme values generated from g. Rejection sampling simply thins out those extreme values to obtain the right proportion.

  • But what if we could take those rejected values and, instead of discarding them, simply downweight or upweight them in a specific way?

18 / 23

Lab 10

Use the rejection sampling algorithm to simulate from N(0,1). Try the following proposal densities.

  1. Logistic distribution.
  2. Cauchy distribution.
  3. Student $t_5$ distribution.
  4. $N(1,2)$ distribution.

For each proposal record the percentage of iterates that are accepted. Which is the best proposal?

19 / 23

Importance sampling

20 / 23

Importance sampling

  • We can rewrite the target: $E_f[h(\theta|y)] = E_g\left[\frac{f(\theta|y)}{g(\theta)}\, h(\theta|y)\right]$.
  • If $\theta_1, \ldots, \theta_n \sim g$, we get $$\tilde{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} \frac{f(\theta_i|y)}{g(\theta_i)}\, h(\theta_i|y) = \frac{1}{n}\sum_{i=1}^{n} w_i\, h(\theta_i|y) \approx E_f[h(\theta|y)].$$
  • The $w_i = f(\theta_i|y)/g(\theta_i)$ are referred to as the importance weights.
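A minimal R sketch of this estimator (not from the slides), taking $f = N(0,1)$, a $t_2$ proposal $g$, and $h(\theta) = \theta^2$, so the true value of $E_f[h]$ is 1:

# Importance sampling estimate of E_f[theta^2] under f = N(0, 1).
set.seed(1)
n <- 10000
theta <- rt(n, 2)                  # draws from the proposal g
w <- dnorm(theta) / dt(theta, 2)   # importance weights w_i = f / g
mean(w * theta^2)                  # estimate, close to the true value 1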
21 / 23

Summary

  • Direct sampling
  • Rejection sampling
  • Importance sampling
22 / 23

References and further readings

References

  • Chapter 10 of the BDA book by Gelman et al.

Further readings

  • Chapter 6 of Advanced Statistical Computing by Roger Peng.
23 / 23
