class: left, bottom, inverse, title-slide

# Bayesian Statistics
## Lecture 1: The Basics of Bayesian Statistics
### Yanfei Kang
### 2019/08/01 (updated: 2019-09-04)

---
# Background

---
# Background

- We build models to predict complex phenomena.
- Two approaches to modelling: frequentist (classical) inference and Bayesian inference.
- It is easy to calculate the probabilities of obtaining different data if we know the data generating process (DGP).
- But we usually have no perfect knowledge about the DGP.
- So we use inference to derive estimates of its parameters.
- Bayesian statistics allows us to go from the data back to probabilistic statements about the parameters.
- The procedure relies on Bayes' rule.

---
class: inverse, center, middle

# Some History of Bayes' Rule

Bayes' rule is named after Thomas Bayes, who solved the problem of 'inverse probability'.

![Thomas Bayes](https://upload.wikimedia.org/wikipedia/commons/d/d4/Thomas_Bayes.gif)

---
class: inverse, center, middle

# Some History of Bayes' Rule

After Bayes' death, Richard Price edited and published his work.

![Richard Price](https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Dr_Richard_Price%2C_DD%2C_FRS_-_Benjamin_West.jpg/330px-Dr_Richard_Price%2C_DD%2C_FRS_-_Benjamin_West.jpg)

---
class: inverse, center, middle

# Some History of Bayes' Rule

The rule was later pioneered and popularised by Pierre-Simon Laplace.

![Pierre-Simon Laplace](https://upload.wikimedia.org/wikipedia/commons/3/39/Laplace%2C_Pierre-Simon%2C_marquis_de.jpg)

---
# Bayesian Theories in Machine Learning

- Bayesian ideas are at the core of machine learning.
- Example 1: spell checking.
- Example 2: word segmentation.
- Example 3: machine translation.
- Example 4: spam email filtering.

---
class: inverse, center, middle

# Bayes' Rule

---
# Conditional Probabilities

The following table shows the results of a poll on the use of online dating sites among 1,738 adult Americans, broken down by age group.
<br>
<table>
<caption>Results from a 2015 Gallup poll on the use of online dating sites by age group</caption>
<thead>
<tr>
<th style="text-align:left;"> </th>
<th style="text-align:right;"> 18-29 </th>
<th style="text-align:right;"> 30-49 </th>
<th style="text-align:right;"> 50-64 </th>
<th style="text-align:right;"> 65+ </th>
<th style="text-align:right;"> Total </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> Used online dating site </td>
<td style="text-align:right;"> 60 </td>
<td style="text-align:right;"> 86 </td>
<td style="text-align:right;"> 58 </td>
<td style="text-align:right;"> 21 </td>
<td style="text-align:right;"> 225 </td>
</tr>
<tr>
<td style="text-align:left;"> Did not use online dating site </td>
<td style="text-align:right;"> 255 </td>
<td style="text-align:right;"> 426 </td>
<td style="text-align:right;"> 450 </td>
<td style="text-align:right;"> 382 </td>
<td style="text-align:right;"> 1513 </td>
</tr>
<tr>
<td style="text-align:left;"> Total </td>
<td style="text-align:right;"> 315 </td>
<td style="text-align:right;"> 512 </td>
<td style="text-align:right;"> 508 </td>
<td style="text-align:right;"> 403 </td>
<td style="text-align:right;"> 1738 </td>
</tr>
</tbody>
</table>

---
# Conditional Probabilities

1. What is the probability of an adult American using an online dating site?
2. What is the probability of using an online dating site if one falls in the age group 30-49?
3. Now rewrite this conditional probability in terms of ‘regular’ probabilities, by dividing both numerator and denominator by the total number of respondents:
$$ P(A \mid B) = \frac{P(A \,\&\, B)}{P(B)}. $$

---
# Bayes' Rule and Diagnostic Testing

- Let us consider an example involving the human immunodeficiency virus (HIV).
- In the early 1980s, HIV had just been discovered and was spreading rapidly.
- False positives (?) and false negatives (?) in HIV testing are highly undesirable.
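---
# Conditional Probabilities in Code

The three questions above can be answered directly from the table counts. A minimal sketch in Python (the variable names are illustrative, not from the lecture):

```python
# Counts from the poll table (used / did not use an online dating site)
used     = {"18-29": 60,  "30-49": 86,  "50-64": 58,  "65+": 21}
not_used = {"18-29": 255, "30-49": 426, "50-64": 450, "65+": 382}

total = sum(used.values()) + sum(not_used.values())  # 1738 respondents

# Q1: marginal probability of using an online dating site
p_used = sum(used.values()) / total                  # 225 / 1738

# Q2: conditional probability given the age group 30-49
p_used_given_30_49 = used["30-49"] / (used["30-49"] + not_used["30-49"])

# Q3: the same number via P(A | B) = P(A & B) / P(B)
p_a_and_b = used["30-49"] / total
p_b = (used["30-49"] + not_used["30-49"]) / total
assert abs(p_a_and_b / p_b - p_used_given_30_49) < 1e-12

print(round(p_used, 3), round(p_used_given_30_49, 3))
```

Running this prints `0.129 0.168`: about 12.9% of all respondents used an online dating site, rising to about 16.8% within the 30-49 age group.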
---
# Think about probabilities

We have the following estimated information:

$$ P(\text{ELISA is positive} \mid \text{Person tested has HIV}) = 93\% = 0.93. $$

$$ P(\text{ELISA is negative} \mid \text{Person tested has no HIV}) = 99\% = 0.99. $$

$$ P(\text{Person tested has HIV}) = \frac{1.48}{1000} = 0.00148. $$

What is the probability of HIV if the ELISA test comes back positive?

`$$P(\text{Person tested has HIV} \mid \text{ELISA is positive}) = ?$$`

---
# Bayes Updating

- One might want to do a second ELISA test after a first one comes up positive.
- What is the probability of being HIV positive if the second ELISA test also comes back positive?

---
# Bayes Updating

- The process of using Bayes' rule to update a probability based on an event affecting it is called Bayes updating.
- More generally, the information one tries to update can be considered 'prior' information, sometimes simply called the **prior**.
- Updating this prior using Bayes' rule gives the information conditional on the data, also known as the **posterior**, as in the information after having seen the data.
- Going from the prior to the posterior is Bayes updating.

---
# Bayesian vs. Frequentist Definitions of Probability

- The frequentist definition of probability is based on observing a large number of trials. Let `\(P(E)\)` be the probability that an event `\(E\)` occurs, and suppose we observe `\(n_E\)` successes out of `\(n\)` trials. Then

$$ P(E) = \lim_{n \rightarrow \infty} \dfrac{n_E}{n}. $$

- On the other hand, the Bayesian definition of probability `\(P(E)\)` reflects our prior beliefs, so `\(P(E)\)` can be any probability distribution.
- Think about weather forecasting.

---
# Example

Based on a 2015 Pew Research poll of 1,500 adults: "We are 95% confident that 60% to 64% of Americans think the federal government does not do enough for middle class people."

- There is a 95% chance that this confidence interval includes the true population proportion?
- The true population proportion is in this interval 95% of the time?
- ?

---
# Example

- The **Bayesian** alternative is the credible interval, which has a definition that is easier to interpret.
- Since a Bayesian is allowed to express uncertainty in terms of probability, a Bayesian credible interval is a range for which the Bayesian thinks that the probability of including the true value is, say, 0.95.
- Thus a Bayesian can say that there is a 95% chance that the credible interval contains the true parameter value.

---
class: inverse, center, middle

# Frequentist vs. Bayesian Inference

---
# Bayesian Inference

- Bayesian inference means modifying our beliefs to account for new data.
- Before we see the data, we have our initial beliefs, which we call the prior.
- After we see the data, we update our beliefs:

`$$\text{prior} + \text{data} \rightarrow \text{posterior}.$$`

---
# Bayesian Inference via Bayes' Rule

`$$p(\theta|\text{data}) = \frac{p(\text{data}|\theta)p(\theta)}{p(\text{data})}.$$`

---
# Example

We have a population of M&M's, and in this population the percentage of yellow M&M's is either 10% or 20%. You have been hired as a statistical consultant to decide whether the true percentage of yellow M&M's is 10% or 20%.

Payoffs/losses: You are being asked to make a decision, and there are associated payoffs/losses that you should consider. If you make the correct decision, your boss gives you a bonus. On the other hand, if you make the wrong decision, you lose your job.

Data: You can "buy" a random sample from the population -- you pay `\(200\)` for each M&M, and you must buy in `\(1,000\)` increments (5 M&Ms at a time). You have a total of `\(4,000\)` to spend, i.e., you may buy 5, 10, 15, or 20 M&Ms.

---
# Frequentist Inference

* Hypotheses: `\(H_0\)` is 10% yellow M&Ms, and `\(H_A\)` is >10% yellow M&Ms.
* Significance level: `\(\alpha = 0.05\)`.
* Sample: red, green, **yellow**, blue, orange
* Observed data: `\(k=1, n=5\)`
* P-value: `\(P(k \geq 1 | n=5, p=0.10) = 1 - P(k=0 | n=5, p=0.10) = 1 - 0.90^5 \approx 0.41\)`

---
# Bayesian Inference

* Hypotheses: `\(H_1\)` is 10% yellow M&Ms, and `\(H_2\)` is 20% yellow M&Ms.
* Prior: `\(P(H_1) = P(H_2) = 0.5\)`
* Sample: red, green, **yellow**, blue, orange
* Observed data: `\(k=1, n=5\)`
* Likelihood:

`$$\begin{aligned} P(k=1 | H_1) &= \left( \begin{array}{c} 5 \\ 1 \end{array} \right) \times 0.10 \times 0.90^4 \approx 0.33 \\ P(k=1 | H_2) &= \left( \begin{array}{c} 5 \\ 1 \end{array} \right) \times 0.20 \times 0.80^4 \approx 0.41 \end{aligned}$$`

---
# Bayesian Inference

* Posterior:

`$$\begin{aligned} P(H_1 | k=1) &= \frac{P(H_1)P(k=1 | H_1)}{P(k=1)} = \frac{0.5 \times 0.33}{0.5 \times 0.33 + 0.5 \times 0.41} \approx 0.45 \\ P(H_2 | k=1) &= 1 - 0.45 = 0.55 \end{aligned}$$`

---
# A Larger Sample Size

<table>
<caption>Frequentist and Bayesian probabilities for larger sample sizes</caption>
<thead>
<tr>
<th style="text-align:left;"> </th>
<th style="text-align:left;"> Frequentist </th>
<th style="text-align:left;"> Bayesian H_1 </th>
<th style="text-align:left;"> Bayesian H_2 </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> Observed Data </td>
<td style="text-align:left;"> P(k or more | 10% yellow) </td>
<td style="text-align:left;"> P(10% yellow | n, k) </td>
<td style="text-align:left;"> P(20% yellow | n, k) </td>
</tr>
<tr>
<td style="text-align:left;"> n = 5, k = 1 </td>
<td style="text-align:left;"> 0.41 </td>
<td style="text-align:left;"> 0.45 </td>
<td style="text-align:left;"> 0.55 </td>
</tr>
<tr>
<td style="text-align:left;"> n = 10, k = 2 </td>
<td style="text-align:left;"> 0.26 </td>
<td style="text-align:left;"> 0.39 </td>
<td style="text-align:left;"> 0.61 </td>
</tr>
<tr>
<td style="text-align:left;"> n = 15, k = 3 </td>
<td style="text-align:left;"> 0.18 </td>
<td style="text-align:left;"> 0.34 </td>
<td style="text-align:left;"> 0.66 </td>
</tr>
<tr>
<td style="text-align:left;"> n = 20, k = 4 </td>
<td style="text-align:left;"> 0.13 </td>
<td style="text-align:left;"> 0.29 </td>
<td style="text-align:left;"> 0.71 </td>
</tr>
</tbody>
</table>

---
class: inverse, center, middle

# Summary

---
# Summary

- Bayes' rule
- Bayesian inference
- Frequentist vs. Bayesian
- Examples
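---
# Appendix: Reproducing the M&M Numbers

The frequentist p-values and Bayesian posteriors in the comparison table can be reproduced with a few lines of code. A sketch in Python (the helper names are mine, not from the lecture):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(K = k) for a Binomial(n, p) random variable."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_value(n, k, p=0.10):
    """Frequentist p-value P(K >= k | n, p) under H0: 10% yellow."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

def posterior_h1(n, k, p1=0.10, p2=0.20, prior1=0.5):
    """Posterior P(H1 | data) for H1: 10% yellow vs. H2: 20% yellow."""
    l1, l2 = binom_pmf(k, n, p1), binom_pmf(k, n, p2)
    return prior1 * l1 / (prior1 * l1 + (1 - prior1) * l2)

for n, k in [(5, 1), (10, 2), (15, 3), (20, 4)]:
    h1 = posterior_h1(n, k)
    print(f"n={n:2d}, k={k}: p-value={p_value(n, k):.2f}, "
          f"P(H1|data)={h1:.2f}, P(H2|data)={1 - h1:.2f}")
```

As the sample grows while the yellow rate stays at 20%, the posterior mass shifts toward `\(H_2\)` and the p-value shrinks, matching the table. (For `\(n=5, k=1\)` exact arithmetic gives a posterior of 0.44 for `\(H_1\)`; the 0.45 on the earlier slide comes from first rounding the likelihoods to 0.33 and 0.41.)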