Bayesian Statistics

class: left, bottom, inverse, title-slide

# Bayesian Statistics
## Lecture 3: Single-Parameter Models
### Yanfei Kang
### 2019/09/08 (updated: 2019-09-30)

---

# Introduction

- We will see several single-parameter models.

- Bayesian inference for a variance for a normal population and inference for a Poisson mean when informative prior information is available.

- One may have limited beliefs about a parameter.

- Inferences may or may not be sensitive to the choice of prior.

---
class: inverse, center, middle
# Normal Distribution with Known Mean but Unknown Variance

---
# Normal Distribution with Known Mean but Unknown Variance

- We estimate an unknown variance using American football scores.

- The focus is on the difference `$d$` between a game outcome (winning score minus losing score) and a published point spread.

- We have observations: `$d_1, d_2, \cdots, d_n$`.

- We assume they are normally distributed with mean 0 and variance `$\sigma^2$`.

---
#

### Likelihood function
`$$L\left(\sigma^{2}\right) \propto\left(\sigma^{2}\right)^{-n / 2} \exp \left\{-\sum_{i=1}^{n} d_{i}^{2} /\left(2 \sigma^{2}\right)\right\}, \sigma^{2}>0.$$`

### Prior
We use the  noninformative prior 
`$$p\left(\sigma^{2}\right) \propto 1 / \sigma^{2}.$$`

### Posterior
`$$g\left(\sigma^{2} | \text { data }\right) \propto\left(\sigma^{2}\right)^{-n / 2-1} \exp \left\{-v /\left(2 \sigma^{2}\right)\right\},$$`
where `$v=\sum_{i=1}^{n} d_{i}^{2}.$`

---
#

Now if we define precision parameter `$P=1 / \sigma^{2},$` 
1. **What is the distribution of `$P$`?** 
2. **What is the  point estimate and a 95% probability interval for the standard deviation `$\sigma$`?**

???

- If `$X$` is distributed with density `$f(x)$`, then the density of `$y = 1/x$` is `$g(y)=\frac{1}{y^{2}} f\left(\frac{1}{y}\right)$`.
- `$P$` is distributed as `$U/v$`.
- `$U$` has a `$\chi^2$` distribution with `$n$` degrees of freedom.

---
# Normal Dis. with Known Mean but Unknown Variance

.pull-left[

```r
library(LearnBayes); data(footballscores); attach(footballscores)
d = favorite - underdog - spread
n = length(d); v = sum(d^2)
# simulate from posterior
*P = rchisq(1000, n)/v
s = sqrt(1/P) 
quantile(s, probs = c(0.025, 0.5, 0.975))
hist(s,main="")
```
]

.pull-right[

```
##     2.5%      50%    97.5% 
## 13.16735 13.85951 14.60327
```

![](BS-L3-single_files/figure-html/normal-out-1.png)
]
---
class: inverse, center, middle
# Informative Prior for Cancer Rates

---
# A Puzzling Pattern in a Map

Let's look at the counties in the United States with the highest and lowest kidney cancer death rates during the 1980s (raw data is available [here](http://www.stat.columbia.edu/~gelman/book/data/cancer/)).

.pull-left[
![](cancerhigh.png)]
--

.pull-right[
![](cancerlow.png)]

**What patterns do you notice?**

---
# Bayesian Inference for the Cancer Death Rate

- The misleading patterns in the maps of raw rates suggest that a model-based approach to estimating the true underlying rates might be helpful.

- In particular, it is natural to estimate the underlying cancer death rate in each county `$j$` using the model `$$y_{j} \sim \operatorname{Poisson}\left(10 n_{j} \theta_{j}\right),$$`
where `$y_j$` is the number of kidney cancer deaths in county `$j$`, `$n_j$` is the
population of the county, and `$\theta_j$` is the underlying rate in units of deaths per person per year.

- Were we performing inference for just one of the counties, we would simply write `$y \sim \operatorname{Poisson}(10 n \theta).$`

---
# Prior

- A gamma distribution with parameters `$\alpha = 20$` and `$\beta = 430,000$` is a reasonable prior distribution for underlying kidney cancer death rates.

- This prior distribution has a mean of `$\frac{\alpha}{\beta}=4.65 \times 10^{-5}$` and
standard deviation `$\frac{\sqrt{\alpha}}{\beta}=1.04 \times 10^{-5}.$`

---
# Posterior

- The posterior distribution of `$θ_j$` is then, `$$\theta_{j} | y_{j} \sim \operatorname{Gamma}\left(20+y_{j}, 430,000+10 n_{j}\right),$$` which has mean and variance, `$$\begin{aligned} \mathrm{E}\left(\theta_{j} | y_{j}\right) &=\frac{20+y_{j}}{430,000+10 n_{j}} \\ \operatorname{var}\left(\theta_{j} | y_{j}\right) &=\frac{20+y_{j}}{\left(430,000+10 n_{j}\right)^{2}} \end{aligned}$$`

---
# Inference for a Small County: `$n_j = 1000$`

- For this county, if `$y_j = 0$`, then the raw death rate is 0 but the posterior mean is `$\frac{20}{440,000}=4.55 \times 10^{-5}.$` 
- If `$y_j = 1$`, then the raw death rate is 1 per 1000 per 10 years, or `$10^{−4}$` per person-year (about twice as high as the national mean), but the posterior mean is only `$\frac{21}{440,000}=4.77 \times 10^{-5}.$`
- If `$y_j = 2$`, then the raw death rate is an extremely high `$2 \times 10^{−4}$` per person-year, but `$4.77 \times 10^{−5}$`. The posterior mean is still only `$\frac{22}{440,000}=5.00 \times 10^{-5}.$`

**With such a small population size, the data are dominated by the prior distribution.**

---
# Inference for a Small County: `$n_j = 1000$`

**But how likely, a priori, is it that `$y_j$` will equal 0, 1, 2, and so forth, for this county with `$n_j = 1000?$`**

- We need to figure out the predictive distribution of `$y_j$`.
- The Poisson model with gamma prior distribution has a negative binomial predictive distribution: `$$y_{j} \sim \operatorname{Neg}-\operatorname{bin}\left(\alpha, \frac{\beta}{10 n_{j}}\right).$$`
- Other ways?

---
# Inference for a Large County: `$n_j = 1~\text{million}$`

**How many cancer deaths `$y_j$` might we expect to see in a ten-year period?**
--
- Assume we found a median of 473 and a 50% interval of [393,545]. The raw death rate in such a county is then as likely or not to fall between `$3.93 \times 10^{−5}$` and `$5.45 \times 10^{−5}$`.
- Bayesianly estimated or ‘Bayes-adjusted’ death rate? 
    - if `$y_j$` takes on the low value of 393, then the raw death rate is `$3.93 \times 10^{−5}$` and the posterior mean is `$\theta_j$` is `$\frac{20+393}{10^{7}+430.000}=3.96 \times 10^{-5}.$`
    - if `$y_j$` takes on the low value of 545, then the raw death rate is `$5.45 \times 10^{−5}$` and the posterior mean is `$5.41 \times 10^{-5}.$`
--
- **In this large county, the data dominate the prior distribution.**

---
# A General Feature of Bayesian Inference

- The posterior distribution is centered at a point that represents a compromise between the prior information and the data

- The compromise is controlled to a greater extent by the data as the sample size increases.

---
# Comparing Counties of Different Sizes

.pull-left[
<img src="cancer1.png" width="600" style="display: block; margin: auto;" /><img src="cancer2.png" width="600" style="display: block; margin: auto;" />
]

.pull-right[
1. Kidney cancer death rates v.s. the population size.
2. Replotted on the scale of `$log_{10}$` population to see the data more clearly. 
3. Bayes-estimated posterior mean kidney cancer death rates
4. Posterior medians and 50% intervals for `$\theta_j$` for a sample of 100 counties `$j$`.
]

---
class: inverse, center, middle
# Mixtures of Conjugate Priors

---
# Probability of a biased coin landing head

- If `$p$` represents the probability that the coin lands heads.
- Either `$p$` is in the neighborhood of 0.3 or in the neighborhood of 0.7.
- It is equally likely that `$p$` is in one of the two neighborhoods.

--
### Prior

.pull-left[This belief can be modeled using the prior density `$$g(p)=\gamma g_{1}(p)+(1-\gamma) g_{2}(p),$$` 
where `$g_1$` is beta(6, 14), `$g_2$` is beta(14, 6), and the mixing probability is `$\gamma = 0.5$`.]
.pull-right[
<img src="betamix.png" width="300" style="display: block; margin: auto;" />
]

---
# Posterior

Suppose we flip the coin `$n$` times, obtaining `$s$` heads and `$f = n − s$` tails. The posterior is `$$g(p | \text { data })=\gamma(\text { data }) g_{1}(p | \text { data })+(1-\gamma(\text { data })) g_{2}(p | \text { data }),$$`
where `$g_1$` is beta(6 + `$s$`, 14 + `$f$`), `$g_2$` is beta(14 + `$s$`, 6 + `$f$`), and the mixing probability `$\gamma(\text { data })$` has the form 
`$$\gamma(\text { data })=\frac{\gamma f_{1}(s, f)}{\gamma f_{1}(s, f)+(1-\gamma) f_{2}(s, f)},$$` 
where `$f_j(s,f)$` is the prior predictive probability of `$s$` heads in `$n$` flips when `$p$` has the prior density `$g_j$`.

---
# Posterior in R

```r
probs=c(.5,.5)
beta.par1=c(6, 14)
beta.par2=c(14, 6)
betapar=rbind(beta.par1, beta.par2)
data=c(7,3)
post=binomial.beta.mix(probs,betapar,data)
post
```

```
## $probs
##  beta.par1  beta.par2 
## 0.09269663 0.90730337 
## 
## $betapar
##           [,1] [,2]
## beta.par1   13   17
## beta.par2   21    9
```

---
# The prior and posterior densities

.pull-left[

```r
curve(post$probs[1]*dbeta(x,13,17)+post$probs[2]*dbeta(x,21,9),
      from=0, to=1, lwd=3, xlab="P", ylab="DENSITY")
curve(.5*dbeta(x,6,12)+.5*dbeta(x,12,6),0,1,add=TRUE)
legend("topleft",legend=c("Prior","Posterior"),lwd=c(1,3))
```
Your findings?
]
.pull-right[
![](BS-L3-single_files/figure-html/betamix-out-1.png)
]

---
class: inverse, center, middle
# Summary

---
# Summary

- Single-parameter models.

- Bayesian inference for a variance for a normal population and inference for a Poisson mean when informative prior information is available.

- One may have limited beliefs about a parameter.

- Mixtures.