layout: true

---
class: inverse, center, middle
background-image: url(../figs/titlepage16-9.png)
background-size: cover

<br>
<br>
# Bayesian Statistics and Computing
## Lecture 11: Bayesian Thinking
<img src="../figs/slides.png" width="150px"/>
#### *Yanfei Kang | BSC | Beihang University*

---
# Background

- We build models to predict complex phenomena.
- We need to make inferences to derive estimates of parameters.
- Bayesian statistics allows us to go from data back to probabilistic statements about the parameters.
- The procedure relies on Bayes' rule.

---
class: inverse, center, middle

# Some History of Bayes' Rule

Formulated by Thomas Bayes (1701–1761) when he solved the problem of 'inverse probability'.

![Thomas Bayes](https://upload.wikimedia.org/wikipedia/commons/d/d4/Thomas_Bayes.gif)

---
class: inverse, center, middle

# Some History of Bayes' Rule

After Bayes' death, Richard Price published this work in 1763.

![Richard Price](https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Dr_Richard_Price%2C_DD%2C_FRS_-_Benjamin_West.jpg/330px-Dr_Richard_Price%2C_DD%2C_FRS_-_Benjamin_West.jpg)

---
class: inverse, center, middle

# Some History of Bayes' Rule

Independently pioneered and popularised by Pierre-Simon Laplace, beginning in 1774.

![Pierre-Simon Laplace](https://upload.wikimedia.org/wikipedia/commons/3/39/Laplace%2C_Pierre-Simon%2C_marquis_de.jpg)

---
# Bayesian Theories in Machine Learning

- At the core of machine learning.
- E.g. 1: spell checking.
- E.g. 2: word segmentation.
- E.g. 3: machine translation.
- E.g. 4: spam email filtering.

---
class: inverse, center, middle

# Bayes' Rule

---
# Bayes' Rule

.large[
$$ P(A \mid B) = \frac{P(A \,\&\, B)}{P(B)} $$
]

---
# Bayes' Rule and Diagnostic Testing

- Let us consider an example involving the human immunodeficiency virus (HIV).
- In the early 1980s, HIV had just been discovered and was spreading rapidly.
- False positives (?) and false negatives (?) in HIV testing are highly undesirable.

---
# Think about probabilities

We have the following estimated information:

$$ P(\text{ELISA is positive} \mid \text{Person tested has HIV}) = 93\% = 0.93. $$

$$ P(\text{ELISA is negative} \mid \text{Person tested has no HIV}) = 99\% = 0.99. $$

$$ P(\text{Person tested has HIV}) = \frac{1.48}{1000} = 0.00148. $$

What is the probability of HIV if ELISA is positive?

`$$P(\text{Person tested has HIV} \mid \text{ELISA is positive}) = ?$$`

---
# Bayes Learning

- One might want to do a second ELISA test after a first one comes up positive.
- What is the probability of being HIV positive if the second ELISA test also comes back positive?

---
# Bayes Learning

- The process of using Bayes' rule to update a probability based on an event affecting it is called Bayes updating.
- More generally, what one tries to update can be considered 'prior' information, sometimes simply called the **prior**.
- Updating this prior using Bayes' rule gives the information conditional on the data, also known as the **posterior**, as in the information after having seen the data.
- Going from the prior to the posterior is **Bayes learning** or Bayes updating; a numerical sketch for the ELISA example is on the next slide.
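---
# Bayes Updating in R: the ELISA Example

A minimal numerical sketch, using only the estimates from the *Think about probabilities* slide:

``` r
# Estimated probabilities quoted earlier
sens  <- 0.93      # P(ELISA positive | HIV)
spec  <- 0.99      # P(ELISA negative | no HIV)
prior <- 0.00148   # P(HIV)

# First positive ELISA: Bayes' rule gives P(HIV | positive)
post1 <- sens * prior / (sens * prior + (1 - spec) * (1 - prior))
post1   # roughly 0.12

# Bayes updating: the posterior becomes the prior for a second positive test
post2 <- sens * post1 / (sens * post1 + (1 - spec) * (1 - post1))
post2   # roughly 0.93
```

Because HIV is rare in this population, even a positive ELISA leaves the probability of HIV fairly low; a second positive test raises it much further.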
---
# Bayesian vs. Frequentist Definitions of Probability

- The frequentist definition of probability is based on observation of a large number of trials. Write the probability for an event `\(E\)` to occur as `\(P(E)\)`, and assume we observe `\(E\)` in `\(n_E\)` out of `\(n\)` trials. Then we have
$$ P(E) = \lim_{n \rightarrow \infty} \dfrac{n_E}{n}. $$
- On the other hand, the Bayesian definition of probability `\(P(E)\)` reflects our prior beliefs, so `\(P(E)\)` can be any probability distribution.
- Think about weather forecasting.

---
# Example

Based on a 2015 Pew Research poll of 1,500 adults: "We are 95% confident that 60% to 64% of Americans think the federal government does not do enough for middle class people."

- There is a 95% chance that this confidence interval includes the true population proportion?
- The true population proportion is in this interval 95% of the time?
- ?

---
# Example

- The **Bayesian** alternative is the credible interval, which has a definition that is easier to interpret.
- Since a Bayesian is allowed to express uncertainty in terms of probability, a Bayesian credible interval is a range for which the Bayesian thinks that the probability of including the true value is, say, 0.95.
- Thus a Bayesian can say that there is a 95% chance that the credible interval contains the true parameter value.

---
# Bayesian Inference

- Bayesian inference modifies our beliefs to account for new data.
- Before we have data, we have our initial beliefs, which we call the prior.
- After we have data, we update our beliefs:
`$$\text{prior} + \text{data} \rightarrow \text{posterior}.$$`

---
# Bayesian Inference via Bayes' Rule

.large[
`$$p(\theta|\text{data}) = \frac{p(\text{data}|\theta)p(\theta)}{p(\text{data})}$$`
]

---
# How to select a prior?

- There is no "true" or "correct" prior.
- In some cases expert opinion or similar studies can be used to specify an informative prior.
- It would be a waste to discard this information.
- If prior information is unavailable, then the prior should be uninformative.
- The prior is best viewed as an initial value to a statistical procedure.

---
# How to select a prior?

- As Bayesian learning continues and more and more data are collected, the posterior concentrates around the true value for any reasonable prior.
- However, in a finite sample the prior can have some effect.
- Different analysts may pick different priors and thus obtain different results.

---
class: inverse, center, middle

# Example

---
# Learning About the Proportion of Heavy Sleepers

- Suppose a person is interested in learning about the sleeping habits of American college students.
- She hears that doctors recommend eight hours of sleep for an average adult.
- **What proportion of college students get at least eight hours of sleep?**

---
# Learning About the Proportion of Heavy Sleepers

Now we reorganize the problem.

1. **Population:** all American college students.
2. `\(p\)`: the proportion of this population who sleep at least eight hours.
3. Our aim is to learn about the location of `\(p\)`, which is the unknown parameter.

---
# Bayesian viewpoint

- Use a **probability distribution** to describe the person's beliefs about the uncertainty in `\(p\)`.
- This distribution reflects the person's **subjective prior** opinion about `\(p\)`.
- A random sample will be taken.
- Before that, she does some research on sleeping habits (e.g. internet articles or newspapers) to build a prior distribution.
- Prior beliefs: `\(p < 0.5\)`, and `\(p\)` could be any value between 0 and 0.5.

---
# Bayesian viewpoint

- Suppose our prior density for `\(p\)` is `\(g(p)\)`.
- Regard "success" as "sleeping at least 8 hours".
- We take a random sample with `\(s\)` successes and `\(f\)` failures.
--

- Now we have the likelihood function
`$$L(p) \propto p^{s}(1-p)^{f}, 0<p<1.$$`
- The posterior for `\(p\)` can be obtained by Bayes' rule:
`$$g(p|\text{data}) \propto g(p)L(p).$$`

---
# Discrete Prior

- List all possible values for `\(p\)` and assign weights to them.
- `\(p\)` could be 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95.
- Their weights are 1, 5.2, 8, 7.2, 4.6, 2.1, 0.7, 0.1, 0, 0.

---
# Discrete Prior

.pull-left[
``` r
p = seq(0.05, 0.95, by = 0.1)
prior = c(1, 5.2, 8, 7.2, 4.6, 2.1, 0.7, 0.1, 0, 0)
prior = prior/sum(prior)  # normalize the weights
plot(p, prior, type = "h", ylab = "Prior Probability")
```
]

.pull-right[
<img src="figure/plot-label-out-1.svg" width="504" style="display: block; margin: auto;" />
]

---
# Likelihood

- In our example, 11 of 27 students sleep a sufficient number of hours.
- `\(s = 11\)` and `\(f = 16\)`.
- Likelihood function:
`$$L(p) \propto p^{11}(1-p)^{16}.$$`

---
# Posterior

.pull-left[
Now compute the posterior probabilities.

``` r
s <- 11; f <- 16
like <- p^s*(1-p)^f      # likelihood at each value of p
post <- prior*like       # prior times likelihood
post <- post/sum(post)   # normalize to get posterior probabilities
par(mfrow=c(2,1))
plot(p,prior,type = "h",lty = 2,ylim = c(0,1),ylab = "prior")
plot(p,post,type = "h",lty = 1,ylim = c(0,1),ylab = "posterior")
```
]

.pull-right[
<img src="figure/post-out-1.svg" width="504" style="display: block; margin: auto;" />
]

---
# Beta Prior

- `\(p\)` is a continuous parameter.
- An alternative is to construct a density `\(g(p)\)` on [0, 1].
- Suppose she believes that the proportion is equally likely to be smaller or larger than `\(p = 0.3\)`.
- And she is 90% confident that `\(p\)` is less than 0.5.
- A convenient family of densities for a proportion is
`$$g(p) \propto p^{a-1}(1-p)^{b-1}, 0<p<1.$$`
- **How to select `\(a\)` and `\(b\)`?**

---
# Beta Prior

- Here the person believes that the median and 90th percentile of the proportion are given, respectively, by 0.3 and 0.5.
- `beta.select()` in R can find the shape parameters of the beta density that match this prior knowledge.

``` r
quantile1=list(p=.5,x=.3)   # median is 0.3
quantile2=list(p=.9,x=.5)   # 90th percentile is 0.5
LearnBayes::beta.select(quantile1,quantile2)
#> [1] 3.26 7.19
```

---
# Posterior

- Combining this beta prior with the likelihood function, one can show that the posterior density is also of the beta form with updated parameters `\(a + s\)` and `\(b + f\)`:
`$$g(p | \text { data }) \propto p^{a+s-1}(1-p)^{b+f-1}, 0<p<1.$$`
- This is an example of a **conjugate** analysis, where the prior and posterior densities have the same functional form.
- The prior, likelihood, and posterior are all in the beta family.

---
# Posterior

.pull-left[
``` r
# Plot beta prior, likelihood and posterior
a = 3.26; b = 7.19; s = 11; f = 16
*curve(dbeta(x,a+s,b+f), from=0, to=1,
*      xlab="p",ylab="Density",lty=1,lwd=4)
curve(dbeta(x,s+1,f+1),add=TRUE,lty=2,lwd=4)
curve(dbeta(x,a,b),add=TRUE,lty=3,lwd=4)
legend(.7,4,c("Prior","Likelihood","Posterior"),
       lty=c(3,2,1),lwd=c(3,3,3))
```
]

.pull-right[
<img src="figure/beta-out-1.svg" width="504" style="display: block; margin: auto;" />
]

---
# Making Inferences via Posterior

**1. Is it likely that the proportion of heavy sleepers is greater than 0.5?**

**2. What is the 90% interval estimate for `\(p\)`?**

???
``` r
# Exact inferences
1 - pbeta(0.5, a + s, b + f)
#> [1] 0.0690226
qbeta(c(0.05, 0.95), a + s, b + f)
#> [1] 0.2555267 0.5133608

# Inferences based on simulation
ps = rbeta(1000, a + s, b + f)
hist(ps,xlab="p",main="")
sum(ps >= 0.5)/1000
#> [1] 0.057
quantile(ps, c(0.05, 0.95))
#>        5%       95% 
#> 0.2646908 0.5046892
```

<img src="figure/unnamed-chunk-2-1.svg" width="504" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Normal Distribution with Known Mean but Unknown Variance

---
# Normal Distribution with Known Mean but Unknown Variance

- We estimate an unknown variance using American football scores.
- The focus is on the difference `\(d\)` between a game outcome (winning score minus losing score) and a published point spread.
- We have observations: `\(d_1, d_2, \cdots, d_n\)`.
- We assume they are normally distributed with mean 0 and variance `\(\sigma^2\)`.

---
#

### Likelihood function

`$$L\left(\sigma^{2}\right) \propto\left(\sigma^{2}\right)^{-n / 2} \exp \left\{-\sum_{i=1}^{n} d_{i}^{2} /\left(2 \sigma^{2}\right)\right\}, \sigma^{2}>0.$$`

--

### Prior

We use the noninformative prior ([Jeffreys prior](https://en.wikipedia.org/wiki/Jeffreys_prior))

`$$p\left(\sigma^{2}\right) \propto 1 / \sigma^{2}.$$`

--

### Posterior

`$$g\left(\sigma^{2} | \text { data }\right) \propto\left(\sigma^{2}\right)^{-n / 2-1} \exp \left\{-v /\left(2 \sigma^{2}\right)\right\},$$`

where `\(v=\sum_{i=1}^{n} d_{i}^{2}.\)`

???

### Aside

Let `\(X\)` be a random variable with density `\(g(x)\)` and let `\(Y=h(X)\)` be a one-one transformation. Then the density of `\(Y\)` satisfies

$$ f(y)=g(x)\left|\frac{d x}{d y}\right|=g(x)\left|h^{\prime}(x)\right|^{-1}, \quad \text { where } x=h^{-1}(y). $$

### Jeffreys' Priors

- Jeffreys' principle states that any rule for determining the prior density `\(p(\theta)\)` should yield an equivalent result if applied to the transformed parameter.
- Applying this principle gives `\(p(\theta)=[J(\theta)]^{1 / 2},\)` where `\(J(\theta)\)` is the Fisher information for `\(\theta\)`:

$$ J(\theta)=E\left[\left(\frac{d \log p(y \mid \theta)}{d \theta}\right)^{2} \mid \theta\right]=-E\left[\frac{d^{2} \log p(y \mid \theta)}{d \theta^{2}} \mid \theta\right] $$

---
# Now define the precision parameter `\(Y=1 / \sigma^{2}\)`

1. **What is the distribution of `\(Y\)`?**
2. **What are the point estimate and a 95% probability interval for the standard deviation `\(\sigma\)`?**

???

- If `\(X\)` is distributed with density `\(g(x)\)`, then the density of `\(y = 1/x\)` is `\(f(y)=\frac{1}{y^{2}} g\left(\frac{1}{y}\right)\)`.
- `\(Y\)` is distributed as `\(U/v\)`.
- `\(U\)` has a `\(\chi^2\)` distribution with `\(n\)` degrees of freedom.

---
# Normal Dist. with Known Mean but Unknown Variance

.pull-left[
Data available [here](https://yanfei.site/docs/bsc/footballscores.txt). The highlighted simulation step is justified on the next slide.

``` r
# Football scores from 672 professional American football games.
d = read.table('footballscores.txt', header = TRUE)$scores
n = length(d); v = sum(d^2)

# simulate from the posterior of the precision Y = 1/sigma^2
*Y = rchisq(1000, n)/v
s = sqrt(1/Y)
quantile(s, probs = c(0.025, 0.5, 0.975))
hist(s,main="")
```
]

.pull-right[
```
#>     2.5%      50%    97.5% 
#> 13.17469 13.82649 14.65818
```

<img src="figure/normal-out-1.svg" width="504" style="display: block; margin: auto;" />
]
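---
# Why does `Y = rchisq(1000, n)/v` work?

A quick sketch of the change of variables behind the simulation step above. Applying the transformation formula `\(f(y)=g(\sigma^{2})\,|d\sigma^{2}/dy|\)` to the precision `\(Y = 1/\sigma^{2}\)` (so `\(\sigma^{2} = 1/y\)` and `\(|d\sigma^{2}/dy| = 1/y^{2}\)`) gives

`$$f(y \mid \text{data}) \propto (1/y)^{-n/2-1} \exp\{-v y / 2\}\, \frac{1}{y^{2}} = y^{n/2-1} \exp\{-v y / 2\},$$`

which is the kernel of a Gamma `\((n/2, \text{rate} = v/2)\)` density. Equivalently, `\(vY \mid \text{data} \sim \chi^{2}_{n}\)`, so drawing `\(U \sim \chi^{2}_{n}\)` and setting `\(Y = U/v\)` produces draws from the posterior of the precision.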
---
class: inverse, center, middle

# Normal Data with Both Parameters Unknown

---
# Normal Data with Both Parameters Unknown

- A standard inference problem is to learn about a normal population where both the mean and variance are unknown.
- To illustrate Bayesian computation for this problem, suppose we are interested in learning about the distribution of completion times for men between ages 20 and 29 who are running the New York Marathon.
- Data: `\(y_1, \cdots, y_{20}\)` in minutes for 20 runners.
- They are from `\(N(\mu, \sigma^2)\)`.

---
# A noninformative prior distribution

`$$g\left(\mu, \sigma^{2}\right) \propto 1 / \sigma^{2}.$$`

---
# The joint posterior distribution

`$$\begin{aligned} g\left(\mu, \sigma^{2} | y\right) & \propto \sigma^{-n-2} \exp \left(-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}\right) \\ &=\sigma^{-n-2} \exp \left(-\frac{1}{2 \sigma^{2}}\left[\sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}+n(\overline{y}-\mu)^{2}\right]\right) \\ &=\sigma^{-n-2} \exp \left(-\frac{1}{2 \sigma^{2}}\left[(n-1) s^{2}+n(\overline{y}-\mu)^{2}\right]\right) \end{aligned},$$`

where `\(s^{2}=\frac{1}{n-1} \sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}\)` is the sample variance.

**The joint posterior density can be factorized as the product of conditional and marginal posterior densities. How?**

---
# The conditional posterior distribution

`$$\mu | \sigma^{2}, y \sim \mathrm{N}\left(\overline{y}, \sigma^{2} / n\right).$$`

---
# The marginal posterior distribution

`$$p\left(\sigma^{2} | y\right) \propto \int \sigma^{-n-2} \exp \left(-\frac{1}{2 \sigma^{2}}\left[(n-1) s^{2}+n(\overline{y}-\mu)^{2}\right]\right) d \mu.$$`

`$$\begin{aligned} p\left(\sigma^{2} | y\right) & \propto \sigma^{-n-2} \exp \left(-\frac{1}{2 \sigma^{2}}(n-1) s^{2}\right) \sqrt{2 \pi \sigma^{2} / n} \\ & \propto\left(\sigma^{2}\right)^{-(n+1) / 2} \exp \left(-\frac{(n-1) s^{2}}{2 \sigma^{2}}\right) \end{aligned},$$`

which is a scaled inverse-`\(\chi^2\)` density: `\(\sigma^{2} | y \sim \text{Scale-Inv-}\chi^2(n-1, s^2).\)`

---
# Sampling from the joint posterior distribution

1. First draw `\(\sigma^2\)` from the marginal posterior distribution.
2. Then draw `\(\mu\)` from the conditional posterior distribution.

---
# Contour plot of the joint posterior density

.pull-left[
``` r
library(LearnBayes)
data(marathontimes)
attach(marathontimes)
d = mycontour(normchi2post, c(220, 330, 500, 9000), time,
              xlab="mean",ylab="variance")
```
]

.pull-right[
<img src="figure/mara-out-1.svg" width="504" style="display: block; margin: auto;" />
]

---
# Simulation

.pull-left[
``` r
S = sum((time - mean(time))^2)
n = length(time)
# draw sigma2 from its scaled inverse-chi-square marginal
sigma2 = S/rchisq(1000, n - 1)
# draw mu from its normal conditional given sigma2
mu = rnorm(1000, mean = mean(time), sd = sqrt(sigma2)/sqrt(n))
d = mycontour(normchi2post, c(220, 330, 500, 9000), time,
              xlab="mean",ylab="variance")
points(mu, sigma2)
```
]

.pull-right[
<img src="figure/mara1-out-1.svg" width="504" style="display: block; margin: auto;" />
]

---
# Inference

``` r
quantile(mu, c(0.025, 0.975))
#>     2.5%    97.5% 
#> 252.5888 302.0254
quantile(sqrt(sigma2), c(0.025, 0.975))
#>     2.5%    97.5% 
#> 37.69567 73.07723
```

---
# What else can we do to summarize the posterior?

### Do not forget the crude approach: grid approximation

- Grid approximation is a very nice way to visualize the posterior (especially in the bivariate case).
- We can compute `\(g\left(\mu, \sigma^{2} | y\right)\)` for all combinations of `\(\mu\)` and `\(\sigma^2\)`.
---
# Grid approximation: create a grid

``` r
# Data
library(LearnBayes)
Y <- marathontimes$time
n <- length(Y)

# Create a grid
npts <- 100
mu <- seq(250, 350, length=npts)
s2 <- seq(30^2, 80^2, length=npts)
grid_mu <- matrix(mu,npts,npts,byrow=FALSE)  # mu varies down the rows
grid_s2 <- matrix(s2,npts,npts,byrow=TRUE)   # sigma2 varies across the columns
```

---
# Grid approximation: prior

.pull-left[
``` r
# Noninformative prior p(mu, sigma2) proportional to 1/sigma2,
# evaluated at every grid point and normalized
prior <- 1/grid_s2
prior <- prior/sum(prior)
library(fields)
image.plot(mu,s2,prior,main="Prior")
```
]

.pull-right[
<img src="figure/unnamed-chunk-5-1.svg" width="504" style="display: block; margin: auto;" />
]

---
# Grid approximation: likelihood

.pull-left[
``` r
like <- matrix(1,npts,npts)
for(i in 1:n){
  like <- like * dnorm(Y[i],grid_mu,sqrt(grid_s2))
}
like <- like/sum(like)
image.plot(mu,s2,like,main="Likelihood")
```
]

.pull-right[
<img src="figure/unnamed-chunk-6-1.svg" width="504" style="display: block; margin: auto;" />
]

---
# Grid approximation: posterior

.pull-left[
``` r
post <- prior * like
post <- post/sum(post)
image.plot(mu,s2,post,main="Posterior")
```
]

.pull-right[
<img src="figure/unnamed-chunk-7-1.svg" width="504" style="display: block; margin: auto;" />
]

---
# Marginal of `\(\mu\)`

.pull-left[
``` r
post_mu <- rowSums(post)   # sum over the sigma2 dimension
post_mu <- post_mu/sum(post_mu)
plot(mu,post_mu,ylab="f(mu|Y)",type="l")
abline(v=mean(Y))
```
]

.pull-right[
<img src="figure/unnamed-chunk-8-1.svg" width="504" style="display: block; margin: auto;" />
]

---
# Marginal of `\(\sigma^2\)`

.pull-left[
``` r
post_s2 <- colSums(post)   # sum over the mu dimension
post_s2 <- post_s2/sum(post_s2)
plot(s2,post_s2,ylab="f(sigma2|Y)",type="l")
abline(v=var(Y))
```
]

.pull-right[
<img src="figure/unnamed-chunk-9-1.svg" width="504" style="display: block; margin: auto;" />
]

---
# Summarizing the results in a table

.pull-left[
``` r
post_mean_mu <- sum(mu*post_mu)
post_sd_mu   <- sqrt(sum(mu*mu*post_mu) - post_mean_mu^2)
post_q025_mu <- max(mu[cumsum(post_mu)<0.025])
post_q975_mu <- max(mu[cumsum(post_mu)<0.975])

post_mean_s2 <- sum(s2*post_s2)
post_sd_s2   <- sqrt(sum(s2*s2*post_s2) - post_mean_s2^2)
post_q025_s2 <- max(s2[cumsum(post_s2)<0.025])
post_q975_s2 <- max(s2[cumsum(post_s2)<0.975])

out1 <- c(post_mean_mu,post_sd_mu,post_q025_mu,post_q975_mu)
out2 <- c(post_mean_s2,post_sd_s2,post_q025_s2,post_q975_s2)
out <- rbind(out1,out2)
colnames(out) <- c("Mean","SD","Q025","Q975")
rownames(out) <- c("mu","sigma2")
knitr::kable(round(out,2), format = 'html')
```
]

.pull-right[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> Mean </th>
   <th style="text-align:right;"> SD </th>
   <th style="text-align:right;"> Q025 </th>
   <th style="text-align:right;"> Q975 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> mu </td>
   <td style="text-align:right;"> 282.71 </td>
   <td style="text-align:right;"> 270.58 </td>
   <td style="text-align:right;"> 258.08 </td>
   <td style="text-align:right;"> 322.73 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sigma2 </td>
   <td style="text-align:right;"> 2435.52 </td>
   <td style="text-align:right;"> 380807.48 </td>
   <td style="text-align:right;"> 1177.78 </td>
   <td style="text-align:right;"> 3622.22 </td>
  </tr>
</tbody>
</table>
]

---
# What if we have more than a few parameters?

- Integration is difficult.
- For most analyses the marginal posteriors will not be a nice, recognizable distribution.
- A grid approximation is not feasible either.

---
class: inverse, center, middle

# No worries! We will come to *Bayesian Computing*.

---
class: inverse, center, middle

# Everything boils down to summarizing the posterior.

--

#### 1. Simulate using R.
--

#### 2. Brute-force.

--

#### 3. Deterministic tools.

--

#### 4. Computing (more general).

---
# Summary

- Bayes' rule.
- Bayesian inference.
- Differences from the frequentist approach.
- Priors.
- Everything boils down to summarizing the posterior.
- Examples.

---
# References and further readings

### References

- Sections 2.1, 2.2, 2.3 and 2.5 of the BDA book by Gelman et al.
- Chapter 1 of [An Introduction to Bayesian Thinking](https://statswithr.github.io/book/the-basics-of-bayesian-statistics.html).

### Further readings

- Section 2.8 of the BDA book for noninformative priors.