class: left, bottom, inverse, title-slide

# Bayesian Statistics and Computing
## Lecture 11: Bayesian Thinking
### Yanfei Kang
### 2020/03/15 (updated: 2020-03-31)

---
# Background

- We build models to predict complex phenomena.
- We need to make inferences to derive estimates of parameters.
- Bayesian statistics allows us to go from data back to probabilistic statements about the parameters.
- The procedure relies on Bayes' rule.

---
class: inverse, center, middle

# Some History of Bayes' Rule

Formulated by Thomas Bayes (1701–1761), who solved the problem of 'inverse probability'.

![Thomas Bayes](https://upload.wikimedia.org/wikipedia/commons/d/d4/Thomas_Bayes.gif)

---
class: inverse, center, middle

# Some History of Bayes' Rule

After Bayes' death, Richard Price published this work in 1763.

![Richard Price](https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Dr_Richard_Price%2C_DD%2C_FRS_-_Benjamin_West.jpg/330px-Dr_Richard_Price%2C_DD%2C_FRS_-_Benjamin_West.jpg)

---
class: inverse, center, middle

# Some History of Bayes' Rule

Independently pioneered and popularised by Pierre-Simon Laplace in 1774.

![Pierre-Simon Laplace](https://upload.wikimedia.org/wikipedia/commons/3/39/Laplace%2C_Pierre-Simon%2C_marquis_de.jpg)

---
# Bayesian Thinking in Machine Learning

- At the core of machine learning.
- Example 1: spell checking.
- Example 2: word segmentation.
- Example 3: machine translation.
- Example 4: spam email filtering.

---
class: inverse, center, middle

# Bayes' Rule

---
# Bayes' Rule

.large[
$$ P(A \mid B) = \frac{P(A \,\&\, B)}{P(B)} $$
]

---
# Bayes' Rule and Diagnostic Testing

- Let us consider an example involving the human immunodeficiency virus (HIV).
- In the early 1980s, HIV had just been discovered and was spreading rapidly.
- False positives (?) and false negatives (?) in HIV testing are both highly undesirable.

---
# Think about probabilities

We have the following estimated information:

$$ P(\text{ELISA is positive} \mid \text{Person tested has HIV}) = 93\% = 0.93. $$

$$ P(\text{ELISA is negative} \mid \text{Person tested has no HIV}) = 99\% = 0.99. $$

$$ P(\text{Person tested has HIV}) = \frac{1.48}{1000} = 0.00148. $$

What is the probability of HIV if the ELISA comes back positive?

`$$P(\text{Person tested has HIV} \mid \text{ELISA is positive}) = ?$$`

---
# Bayes Learning

- One might want to do a second ELISA test after a first one comes back positive.
- What is the probability of being HIV positive if the second ELISA test also comes back positive?

---
# Bayes Learning

- The process of using Bayes' rule to update a probability in light of new evidence is called Bayesian updating.
- More generally, what one tries to update can be considered 'prior' information, sometimes simply called the **prior**.
- Updating this prior using Bayes' rule gives the information conditional on the data, also known as the **posterior**, that is, the information after having seen the data.
- Going from the prior to the posterior is **Bayes learning**, or Bayesian updating.

---
# Bayesian vs. Frequentist Definitions of Probability

- The frequentist definition of probability is based on observing a large number of trials. Denote the probability of an event `\(E\)` by `\(P(E)\)`, and suppose we observe `\(n_E\)` occurrences of `\(E\)` in `\(n\)` trials. Then

$$ P(E) = \lim_{n \rightarrow \infty} \dfrac{n_E}{n}. $$

- On the other hand, the Bayesian definition of probability `\(P(E)\)` reflects our prior beliefs, so `\(P(E)\)` can be any probability distribution.
- Think about weather forecasting.
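---
# The ELISA Example in R

A minimal R sketch of the ELISA calculations from the earlier slides, shown here as a quick numerical check; the variable names are only illustrative, and the numbers are the estimates given above.

```r
p_hiv <- 0.00148   # P(person tested has HIV)
sens  <- 0.93      # P(ELISA positive | HIV)
spec  <- 0.99      # P(ELISA negative | no HIV)

# P(ELISA positive), by the law of total probability
p_pos <- sens * p_hiv + (1 - spec) * (1 - p_hiv)

# Bayes' rule: P(HIV | first ELISA positive), roughly 0.12
post1 <- sens * p_hiv / p_pos

# Bayes learning: use post1 as the prior when a second ELISA is also positive
post2 <- sens * post1 / (sens * post1 + (1 - spec) * (1 - post1))  # roughly 0.93
c(first = post1, second = post2)
```

Even after one positive test, having HIV is still fairly unlikely because the condition is rare in the tested population; a second positive test moves the probability much closer to 1.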
---
# Example

Based on a 2015 Pew Research poll of 1,500 adults: "We are 95% confident that 60% to 64% of Americans think the federal government does not do enough for middle class people."

- There is a 95% chance that this confidence interval includes the true population proportion?
- The true population proportion is in this interval 95% of the time?
- ?

---
# Example

- The **Bayesian** alternative is the credible interval, which has a definition that is easier to interpret.
- Since a Bayesian is allowed to express uncertainty in terms of probability, a Bayesian credible interval is a range for which the Bayesian thinks that the probability of including the true value is, say, 0.95.
- Thus a Bayesian can say that there is a 95% chance that the credible interval contains the true parameter value.

---
# Bayesian Inference

- Bayesian inference modifies our beliefs to account for new data.
- Before we have data, we have our initial beliefs, which we call the prior.
- After we have data, we update our beliefs:

`$$\text{prior} + \text{data} \rightarrow \text{posterior}.$$`

---
# Bayesian Inference via Bayes' Rule

.large[
`$$p(\theta|\text{data}) = \frac{p(\text{data}|\theta)p(\theta)}{p(\text{data})}$$`
]

---
# How to select a prior?

- There is no "true" or "correct" prior.
- In some cases expert opinion or similar studies can be used to specify an informative prior.
- It would be a waste to discard this information.
- If prior information is unavailable, then the prior should be uninformative.
- The prior is best viewed as an initial value to a statistical procedure.

---
# How to select a prior?

- As Bayesian learning continues and more and more data are collected, the posterior concentrates around the true value for any reasonable prior.
- However, in a finite sample the prior can have some effect.
- Different analysts may pick different priors and thus obtain different results.

---
class: inverse, center, middle

# Example

---
# Learning About the Proportion of Heavy Sleepers

- Suppose a person is interested in learning about the sleeping habits of American college students.
- She hears that doctors recommend eight hours of sleep for an average adult.
- **What proportion of college students get at least eight hours of sleep?**

---
# Learning About the Proportion of Heavy Sleepers

Now we reorganize the problem.

1. **Population:** all American college students.
2. `\(p\)`: the proportion of this population who sleep at least eight hours.
3. Our aim is to learn about the location of `\(p\)`, the unknown parameter.

---
# Bayesian viewpoint

- Use a **probability distribution** to describe the person's beliefs about the uncertainty in `\(p\)`.
- This distribution reflects the person's **subjective prior** opinion about `\(p\)`.
- A random sample will be taken.
- Before that, she researches sleeping habits (e.g. in internet articles or newspapers) to build a prior distribution.
- Prior beliefs: `\(p < 0.5\)`, and `\(p\)` can be any value between 0 and 0.5.

---
# Bayesian viewpoint

- Suppose our prior density for `\(p\)` is `\(g(p)\)`.
- Regard "success" as "sleeping at least 8 hours".
- We take a random sample with `\(s\)` successes and `\(f\)` failures.

--

- Now we have the likelihood function

`$$L(p) \propto p^{s}(1-p)^{f}, 0<p<1.$$`

- The posterior for `\(p\)` can be obtained by Bayes' rule:

`$$g(p|\text{data}) \propto g(p)L(p).$$`

---
# Discrete Prior

- List all possible values for `\(p\)` and assign weights to them.
- `\(p\)` could be 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95.
- Their weights are 1, 5.2, 8, 7.2, 4.6, 2.1, 0.7, 0.1, 0, 0.

---
# Discrete Prior

.pull-left[

```r
p = seq(0.05, 0.95, by = 0.1)
prior = c(1, 5.2, 8, 7.2, 4.6, 2.1, 0.7, 0.1, 0, 0)
prior = prior/sum(prior)
plot(p, prior, type = "h", ylab = "Prior Probability")
```
]

.pull-right[
![](BSC-L11-thinking_files/figure-html/plot-label-out-1.png)<!-- -->
]

---
# Likelihood

- In our example, 11 of 27 students sleep a sufficient number of hours.
- `\(s = 11\)` and `\(f = 16.\)`
- Likelihood function:

`$$L(p) \propto p^{11}(1-p)^{16}.$$`

---
# Posterior

.pull-left[
Now compute the posterior probabilities.

```r
s <- 11; f <- 16
like <- p^s*(1-p)^f   # likelihood at each grid value of p
post <- prior*like
post <- post/sum(post)
par(mfrow = c(2,1))
plot(p, prior, type = "h", lty = 2, ylim = c(0,1), ylab = "prior")
plot(p, post, type = "h", lty = 1, ylim = c(0,1), ylab = "posterior")
```
]

.pull-right[
![](BSC-L11-thinking_files/figure-html/post-out-1.png)<!-- -->
]

---
# Beta Prior

- `\(p\)` is a continuous parameter.
- An alternative is to construct a prior density `\(g(p)\)` on [0, 1].
- Suppose she believes that the proportion is equally likely to be smaller or larger than `\(p = 0.3\)`.
- And she is 90% confident that `\(p\)` is less than 0.5.
- A convenient family of densities for a proportion is

`$$g(p) \propto p^{a-1}(1-p)^{b-1}, 0<p<1.$$`

- **How to select `\(a\)` and `\(b\)`?**

---
# Beta Prior

- Here the person believes that the median and the 90th percentile of the proportion are 0.3 and 0.5, respectively.
- `beta.select()` in R can find the shape parameters of the beta density that match this prior knowledge.

```r
quantile2 = list(p = .9, x = .5)
quantile1 = list(p = .5, x = .3)
LearnBayes::beta.select(quantile1, quantile2)
```

```
## [1] 3.26 7.19
```

---
# Posterior

- Combining this beta prior with the likelihood function, one can show that the posterior density is also of the beta form, with updated parameters `\(a + s\)` and `\(b + f\)`:

`$$g(p | \text { data }) \propto p^{a+s-1}(1-p)^{b+f-1}, 0<p<1.$$`

- This is an example of a **conjugate** analysis, where the prior and posterior densities have the same functional form.
- The prior, likelihood, and posterior are all in the beta family.

---
# Posterior

.pull-left[

```r
# Plot beta prior, likelihood and posterior
a = 3.26; b = 7.19; s = 11; f = 16
*curve(dbeta(x,a+s,b+f), from=0, to=1,
*      xlab="p",ylab="Density",lty=1,lwd=4)
curve(dbeta(x,s+1,f+1),add=TRUE,lty=2,lwd=4)
curve(dbeta(x,a,b),add=TRUE,lty=3,lwd=4)
legend(.7,4,c("Prior","Likelihood","Posterior"),
       lty=c(3,2,1),lwd=c(3,3,3))
```
]

.pull-right[
![](BSC-L11-thinking_files/figure-html/beta-out-1.png)<!-- -->
]

---
# Making Inferences via the Posterior

**1. Is it likely that the proportion of heavy sleepers is greater than 0.5?**

**2. What is the 90% interval estimate for `\(p\)`?**

???

```r
# Exact inferences
1 - pbeta(0.5, a + s, b + f)
```

```
## [1] 0.0690226
```

```r
qbeta(c(0.05, 0.95), a + s, b + f)
```

```
## [1] 0.2555267 0.5133608
```

```r
# Inferences based on simulation
ps = rbeta(1000, a + s, b + f)
hist(ps, xlab = "p", main = "")
```

![](BSC-L11-thinking_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

```r
sum(ps >= 0.5)/1000
```

```
## [1] 0.076
```

```r
quantile(ps, c(0.05, 0.95))
```

```
##        5%       95% 
## 0.2485545 0.5153966
```

---
class: inverse, center, middle

# Example

---
# Normal Distribution with Known Mean but Unknown Variance

- We estimate an unknown variance using American football scores.
- The focus is on the difference `\(d\)` between a game outcome (winning score minus losing score) and a published point spread.
- We have observations: `\(d_1, d_2, \cdots, d_n\)`.
- We assume they are normally distributed with mean 0 and variance `\(\sigma^2\)`.
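---
# A First Look at the Data

Before deriving the posterior, it helps to load the game data and compute the quantities used on the next slides, namely `\(n\)` and `\(v=\sum_{i=1}^{n} d_{i}^{2}\)`. This is a minimal sketch; it reads the file linked on a later slide and assumes it has the same `scores` column used there.

```r
d <- read.table("https://yanfei.site/docs/bsc/footballscores.txt",
                header = TRUE)$scores
n <- length(d)  # number of games
v <- sum(d^2)   # sufficient statistic for sigma^2 when the mean is fixed at 0
c(n = n, v = v)
mean(d)         # should be near 0 if the mean-zero assumption is reasonable
```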
---
### Likelihood function

`$$L\left(\sigma^{2}\right) \propto\left(\sigma^{2}\right)^{-n / 2} \exp \left\{-\sum_{i=1}^{n} d_{i}^{2} /\left(2 \sigma^{2}\right)\right\}, \sigma^{2}>0.$$`

--

### Prior

We use the noninformative prior

`$$p\left(\sigma^{2}\right) \propto 1 / \sigma^{2}.$$`

--

### Posterior

`$$g\left(\sigma^{2} | \text { data }\right) \propto\left(\sigma^{2}\right)^{-n / 2-1} \exp \left\{-v /\left(2 \sigma^{2}\right)\right\},$$`

where `\(v=\sum_{i=1}^{n} d_{i}^{2}.\)`

---
# Now define the precision parameter `\(P=1 / \sigma^{2}\)`

1. **What is the posterior distribution of `\(P\)`?**
2. **What are a point estimate and a 95% probability interval for the standard deviation `\(\sigma\)`?**

???

- If `\(X\)` is distributed with density `\(f(x)\)`, then the density of `\(y = 1/x\)` is `\(g(y)=\frac{1}{y^{2}} f\left(\frac{1}{y}\right)\)`.
- `\(P\)` is distributed as `\(U/v\)`, where `\(U\)` has a `\(\chi^2\)` distribution with `\(n\)` degrees of freedom.

---
# Normal Dist. with Known Mean but Unknown Variance

.pull-left[
Data available [here](https://yanfei.site/docs/bsc/footballscores.txt).

```r
# Football scores from 672 professional American football games.
d = read.table('footballscores.txt', header = TRUE)$scores
n = length(d); v = sum(d^2)
# simulate from the posterior
*P = rchisq(1000, n)/v
s = sqrt(1/P)
quantile(s, probs = c(0.025, 0.5, 0.975))
hist(s, main = "")
```
]

.pull-right[

```
##     2.5%      50%    97.5% 
## 13.14580 13.81775 14.55710
```

![](BSC-L11-thinking_files/figure-html/normal-out-1.png)<!-- -->
]

---
class: inverse, center, middle

# Everything boils down to summarizing the posterior.

--

#### 1. Simulation (e.g. in R).

--

#### 2. Brute force.

--

#### 3. Deterministic tools.

--

#### 4. Computing (more general).

---
# Summary

- Bayes' rule.
- Bayesian inference.
- Differences from the frequentist approach.
- Priors.
- Everything boils down to summarizing the posterior.
- Examples.

---
# References and further readings

### References

- Sections 2.1, 2.2, 2.3 and 2.5 of the BDA book by Gelman et al.
- Chapter 1 of [An Introduction to Bayesian Thinking](https://statswithr.github.io/book/the-basics-of-bayesian-statistics.html).

### Further readings

- Section 2.8 for noninformative priors.
- Section 3.2 for the normal distribution with `\(\mu\)` and `\(\sigma\)` both unknown.
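---
# Appendix: Deterministic Posterior Summaries

The simulation for `\(\sigma\)` above can be cross-checked without sampling. As noted earlier, the posterior of `\(P=1/\sigma^{2}\)` is that of `\(U/v\)` with `\(U \sim \chi^{2}_{n}\)`, so posterior quantiles of `\(\sigma\)` follow directly from `qchisq()`. A minimal sketch, assuming `footballscores.txt` is in the working directory as on the earlier slide:

```r
d <- read.table("footballscores.txt", header = TRUE)$scores
n <- length(d); v <- sum(d^2)

# sigma = sqrt(v / U) is decreasing in U, so quantile levels reverse order
sqrt(v / qchisq(c(0.975, 0.5, 0.025), df = n))  # 2.5%, 50%, 97.5% quantiles of sigma
```

These should agree closely with the simulated quantiles (13.15, 13.82, 14.56) reported earlier, since the 1000 posterior draws approximate the same distribution.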