layout: true

---
class: inverse, center, middle
background-image: url(../figs/titlepage16-9.png)
background-size: cover

<br>
<br>

# Bayesian Statistics and Computing
## Lecture 7: Newton's Method

<img src="../figs/slides.png" width="150px"/>

#### *Yanfei Kang | BSC | Beihang University*

---
# Optimisation in Business

Many problems in business require something to be minimised or maximised, for example:

- Maximising revenue
- Minimising costs
- Minimising delivery time
- Maximising financial returns

---
# Optimisation in Statistics

- Maximum likelihood
- Least squares
- Method of moments
- Posterior mode

---
# Optimisation

- Suppose we want to find a minimum or maximum of a function `\(f(x)\)`.
- Sometimes `\(f(x)\)` will be very complicated.
- Are there computer algorithms that can help?
- Yes!

---
class: inverse, center, middle

# Root Finding

---
# Root Finding

- Consider the problem of finding the root of a function.
- For the function `\(f(x)\)`, the root is the point `\(x^*\)` such that `\(f(x^*) = 0\)`.
- If there is no closed form for `\(f^{-1}(x)\)` and `\(f\)` is continuous, the root must be found by an iterative process.

---
# Bisection method

- The bisection algorithm is a simple method for finding the roots of one-dimensional functions.
- The goal is to find a root `\(x^* \in [a, b]\)` such that `\(f(x^*) = 0\)`.
- The algorithm starts with a large interval known to contain `\(x^*\)`, and then successively reduces the size of the interval until it brackets the root.

---
# Bisection method

- Set an initial interval `\([a, b]\)` such that `\(\text{sign}(f(a)) \neq \text{sign}(f(b))\)`.
- Iterate until `\(|b-a| < \epsilon\)` or `\(|f(b) - f(a)|<\epsilon\)`:
    1. Calculate `\(c = \frac{a+b}{2}\)`.
    2. If `\(f(c) = 0\)`, stop and return `\(c\)`.
    3. If `\(\text{sign}(f(a)) \neq \text{sign}(f(c))\)`, set `\(b \leftarrow c\)`; otherwise set `\(a \leftarrow c\)`, and return to step 1.
- Output `\(c\)`.

---
# Animation

<img src="figure/unnamed-chunk-1-.gif" width="504" style="display: block; margin: auto;" />

```
#> [1] 4.470348e-08
```
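---
# Bisection in R (a sketch)

A minimal sketch of the steps above, assuming `\(f(a)\)` and `\(f(b)\)` have opposite signs. The function name `bisection` and the default tolerance are illustrative choices, not from the course code; here it brackets the root of `\(f(x) = x^2 - 5\)`.

```r
## Minimal bisection sketch (illustrative, not the course code)
bisection <- function(f, a, b, tol = 1e-8) {
  if (sign(f(a)) == sign(f(b))) stop("f(a) and f(b) must have opposite signs")
  while (abs(b - a) > tol) {
    c <- (a + b) / 2
    if (f(c) == 0) return(c)                           # exact root found
    if (sign(f(a)) != sign(f(c))) b <- c else a <- c   # keep the half that brackets the root
  }
  (a + b) / 2
}

bisection(function(x) x^2 - 5, 0, 5)   # converges to sqrt(5), approximately 2.236068
```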
---
# Newton-Raphson Method

- Newton-Raphson is an iterative method that begins with an initial guess of the root; it dates back to Newton and Raphson in the 17th century.
- The method uses the original function `\(f(x)\)` as well as its derivative `\(f^{'}(x)\)`.
- When successful, it usually converges rapidly, but like any root-finding algorithm it may fail.

---
# Newton-Raphson Method

The method tries to solve an equation of the form `\(f(x) = 0\)` through the iterative procedure
`$$x_{n+1} = x_n - \frac{f(x_n)}{f^{'}(x_n)}.$$`

Think about the **Taylor expansion**: `\(x_{n+1}\)` is the root of the first-order Taylor approximation of `\(f\)` around `\(x_n\)`.

---
# Geometric interpretation

<center>
<img src="newton.gif" width="500" height="500" />
</center>

---
# Stopping Rule

- With each step the algorithm should get closer to the root.
- However, it can run for a long time without reaching the exact root.
- There must be a stopping rule, otherwise the program could run forever.
- Let `\(\epsilon\)` be an extremely small number, e.g. `\(1 \times 10^{-10}\)`, called the **tolerance level**.
- If `\(|f(x^{*})| < \epsilon\)`, then the solution is close enough and there is a root at `\(x^{*}\)`.

---
# Example

- Find the root of `\(f(x) = x^2 - 5\)` using Newton's method by hand.
- Explain its geometric interpretation.

---
# Newton-Raphson Method

1. Select an initial guess `\(x_0\)`, and set `\(n = 0\)`.
2. Set `\(x_{n+1} = x_n - \frac{f(x_n)}{f^{'}(x_n)}\)`.
3. Evaluate `\(|f(x_{n+1})|\)`.
    1. If `\(|f(x_{n+1})| < \epsilon\)`, then stop;
    2. Otherwise set `\(n = n + 1\)` and go back to step 2.
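---
# Newton-Raphson in R (a sketch)

A minimal sketch of the algorithm above, applied to the example `\(f(x) = x^2 - 5\)`. The names `newton_raphson`, `f` and `fprime` are illustrative, not from the course code.

```r
## Minimal Newton-Raphson sketch for root finding (illustrative)
newton_raphson <- function(f, fprime, x0, tol = 1e-10, max_iter = 100) {
  x <- x0
  for (i in seq_len(max_iter)) {
    x <- x - f(x) / fprime(x)        # Newton-Raphson update
    if (abs(f(x)) < tol) return(x)   # stop once |f(x)| is below the tolerance
  }
  warning("did not converge")
  x
}

newton_raphson(function(x) x^2 - 5, function(x) 2 * x, x0 = 2)
```

Starting from `\(x_0 = 2\)` this converges to `\(\sqrt{5} \approx 2.236068\)` in a handful of iterations; the lab exercises later in the lecture explore starting values and functions for which the method behaves far less nicely.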
---
# Convergence rate

Suppose a sequence `\(\{x_n\}\)` converges to `\(x_\infty\)`.

- Linear convergence: there exists `\(r \in (0,1)\)` such that
`$$\frac{\|x_{n+1}-x_\infty\|}{\|x_n-x_\infty\|}\leq r.$$`
- Superlinear convergence:
`$$\lim_{n\rightarrow\infty}\frac{\|x_{n+1}-x_\infty\|}{\|x_n-x_\infty\|}=0.$$`
- Quadratic convergence: there exists some constant `\(M > 0\)` such that
`$$\frac{\|x_{n+1}-x_\infty\|}{\|x_n-x_\infty\|^2}\leq M.$$`

---
# Convergence rate: Bisection

- For bisection, the error in estimating the root is bounded by the interval width `\(e_n = |b_n - a_n|\)`.
- Each iteration halves the interval, so `\(e_n = 2^{-n}|b_0 - a_0|\)`.
- The convergence rate for the bisection method is therefore
`$$\frac{e_{n+1}}{e_n}= \frac{2^{-(n+1)}|b_0-a_0|}{2^{-n}|b_0-a_0|}= \frac{1}{2}.$$`
- The error of the bisection algorithm converges linearly to 0.

---
# Convergence rate: Newton-Raphson

- Let `\(f(x)\)` have a root at `\(\alpha\)`. The Taylor approximation states that
`$$f(\alpha)\approx f(x_n)+f'(x_n)(\alpha-x_n)+\frac{1}{2}f''(x_n)(\alpha-x_n)^2.$$`
- The quality of the approximation depends on the function and how close `\(x_n\)` is to `\(\alpha\)`.
- Since `\(\alpha\)` is a root, `\(f(\alpha)=0\)`. This implies
`$$0\approx f(x_n)+f'(x_n)(\alpha-x_n)+\frac{1}{2}f''(x_n)(\alpha-x_n)^2.$$`
- Dividing by `\(f'(x_n)\)` and rearranging gives
`$$\frac{f(x_n)}{f'(x_n)}+(\alpha-x_n)\approx\frac{-f''(x_n)}{2f'(x_n)}(\alpha-x_n)^2.$$`

---
# Convergence rate: Newton-Raphson

- More rearranging:
`$$\alpha-\left(x_n-\frac{f(x_n)}{f'(x_n)}\right)\approx\frac{-f''(x_n)}{2f'(x_n)}(\alpha-x_n)^2.$$`
- The term in brackets on the left-hand side is the Newton-Raphson update for `\(x\)`, so
`$$(\alpha-x_{n+1})\approx\frac{-f''(x_n)}{2f'(x_n)}(\alpha-x_n)^2.$$`
- This can be rewritten in terms of the errors `\(e_{n+1}=\alpha-x_{n+1}\)` and `\(e_n=\alpha-x_n\)`:
`$$e_{n+1}\approx\frac{-f''(x_n)}{2f'(x_n)}e_n^2.$$`

---
# Convergence rate: Newton-Raphson

- Newton's method has quadratic convergence.
- It is very fast in the neighbourhood of the root.
- We need to calculate the derivative (more information about `\(f\)` is required).
- It can sometimes swing wildly out of control and diverge.

---
# Lab

1. Write R code to find the root of `\(x^2 = 5\)`.
2. Now use your Newton-Raphson code to find the root of `\(f(x) = \sqrt{|x|}\)`. Try the initial value 0.25.
3. Now use your Newton-Raphson code to find the root of `\(x e^{-x^2} = 0.4(e^x + 1)^{-1} + 0.2\)`. Try the initial values 0.5 and 0.6.
4. `\(f(x) = x^3 - 2x^2 - 11x + 12\)`: try the starting values 2.35287527 and 2.35284172.
5. `\(f(x) = 2x^3 + 3x^2 + 5\)`: try the starting values 0.5 and 0.

What did you learn from your mistakes?

---
# Learn from Mistakes

- Newton-Raphson does not always converge.
- The initial guess matters.
- Different starting values may converge to different roots.
- Newton-Raphson does not work if the first derivative is zero.
- If the function is not continuously differentiable in a neighbourhood of the root, Newton's method may diverge or fail.
- Newton's method requires calculating the derivative.
- Newton's method requires some intelligence from the user.

---
# Conclusion

- Why did we spend so much time on finding roots of an equation?
- Isn't this topic meant to be about optimisation?
- Can we change the algorithm slightly so that it works for optimisation?

---
class: inverse, center, middle

# Optimisation

---
# Finding a Maximum or Minimum

- Suppose we want to find a minimum or maximum of a function `\(f(x)\)` (think about maximum likelihood estimation).
- Find the derivative `\(f^{'}(x)\)` and the point `\(x^{*}\)` such that `\(f^{'}(x^*) = 0\)`.
- This is the same as finding a root of the first derivative, so we can apply the Newton-Raphson algorithm to the first derivative.

---
# Newton-Raphson algorithm for finding local maxima or minima

1. Select an initial guess `\(x_0\)`, and set `\(n = 0\)`.
2. Set `\(x_{n+1} = x_n - \frac{f^{'}(x_n)}{f^{''}(x_n)}\)`.
3. Evaluate `\(|f^{'}(x_{n+1})|\)`.
    1. If `\(|f^{'}(x_{n+1})| < \epsilon\)`, then stop;
    2. Otherwise set `\(n = n + 1\)` and go back to step 2.

---
# Different Stopping Rules

Three stopping rules can be used.

- `\(|f^{'}(x_{n+1})| < \epsilon\)`.
- `\(|x_{n} - x_{n-1}| < \epsilon\)`.
- `\(|f(x_{n}) - f(x_{n-1})| < \epsilon\)`.

---
# Intuition

- Focus on the step size `\(-\frac{f'(x)}{f''(x)}\)`.
- The signs of the derivatives control the direction of the next step.
- The size of the derivatives controls the size of the next step.
- Consider the concave function `\(f(x)=-x^4\)`, which has `\(f'(x)=-4x^3\)` and `\(f''(x)=-12x^2\)`. There is a maximum at `\(x^{*}=0\)`.

---
# Role of first derivative

<center>
![Newton Method](RCode/climb1.png)

---
# Role of first derivative

<center>
![Newton Method](RCode/climb2.png)

---
# Role of first derivative

<center>
![Newton Method](RCode/climb3.png)

---
# Role of first derivative

- If `\(f''(x)\)` is negative, the function is locally concave and the search is for a local maximum.
- To the left of this maximum, `\(f'(x)>0\)`.
- Therefore `\(-\frac{f'(x)}{f''(x)}>0\)`.
- The next step is to the right.
- The reverse holds if `\(f'(x)<0\)`.
- Large absolute values of `\(f'(x)\)` imply a steep slope, so a big step is needed to get close to the optimum. The reverse holds for small absolute values of `\(f'(x)\)`.

---
# Role of first derivative

- If `\(f''(x)\)` is positive, the function is locally convex and the search is for a local minimum.
- To the left of this minimum, `\(f'(x)<0\)`.
- Therefore `\(-\frac{f'(x)}{f''(x)}>0\)`.
- The next step is to the right.
- The reverse holds if `\(f'(x)>0\)`.
- Large absolute values of `\(f'(x)\)` imply a steep slope, so a big step is needed to get close to the optimum. The reverse holds for small absolute values of `\(f'(x)\)`.

---
# Role of second derivative

<center>
![Newton Method](RCode/climb4.png)

---
# Role of second derivative

- Together with the sign of the first derivative, the sign of the second derivative controls the direction of the next step.
- A larger second derivative (in absolute value) implies larger curvature.
- In this case smaller steps are needed to stop the algorithm from overshooting.
- The opposite holds for a small second derivative.

---
# Lab - continued

Based on your root-finding function, write a function to find the maximum of `\(f(x) = -x^4\)`.
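---
# Newton's method for a 1-D optimum in R (a sketch)

A minimal sketch of the update `\(x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}\)`, applied to the lab function `\(f(x) = -x^4\)`. The names `newton_optim`, `fprime` and `fsecond` are illustrative choices, not from the course code.

```r
## Minimal 1-D Newton sketch: finds a stationary point of f from its derivatives
newton_optim <- function(fprime, fsecond, x0, tol = 1e-10, max_iter = 100) {
  x <- x0
  for (i in seq_len(max_iter)) {
    x <- x - fprime(x) / fsecond(x)       # Newton step applied to f'
    if (abs(fprime(x)) < tol) return(x)   # stop when the gradient is (almost) zero
  }
  warning("did not converge")
  x
}

## Maximum of f(x) = -x^4: f'(x) = -4x^3, f''(x) = -12x^2
newton_optim(function(x) -4 * x^3, function(x) -12 * x^2, x0 = 1)
```

Note that for this particular function each step is `\(x_{n+1} = x_n - \frac{-4x_n^3}{-12x_n^2} = \frac{2}{3}x_n\)`, so the iterates approach the maximum at 0 only linearly: quadratic convergence relies on `\(f''\)` being non-zero at the optimum, which fails here.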
---
class: inverse, center, middle

# Multidimensional Optimization

---
# Functions with more than one input

- Most interesting optimization problems involve multiple inputs.
- In determining the most risk-efficient portfolio, the return is a function of many weights (one for each asset).
- In least squares estimation for a linear regression model, the sum of squares is a function of many coefficients (one for each regressor).
- How do we optimize a function `\(f(\mathbf{x})\)` where `\(\mathbf{x}\)` is a vector?

---
# Derivatives

- Newton's algorithm has a simple update rule based on first and second derivatives.
- What do these derivatives look like when the function is `\(y=f(\mathbf{x})\)`, where `\(y\)` is a scalar and `\(\mathbf{x}\)` is a `\(d\times 1\)` vector?

---
# First derivative

Simply take the partial derivatives and put them in a vector:
`$$\frac{\partial y}{\partial{\mathbf{x}}}= \left( \begin{array}{c} \frac{\partial y}{\partial x_1}\\ \frac{\partial y}{\partial x_2}\\ \vdots\\ \frac{\partial y}{\partial x_d} \end{array} \right).$$`

This is called the gradient vector.

---
# An example

The function `\(y=x_1^2-x_1x_2+x_2^2+e^{x_2}\)` has gradient vector
`$$\frac{\partial y}{\partial \mathbf{x}}=\left(\begin{array}{c}2 x_{1}-x_{2} \\ -x_{1}+2 x_{2}+e^{x_{2}}\end{array}\right).$$`

---
# Second derivative

Simply take the second-order partial derivatives. This gives a matrix:
`$$\frac{\partial^2 y}{\partial \mathbf{x} \partial \mathbf{x}^{\prime}}=\left(\begin{array}{cccc}\frac{\partial^{2} y}{\partial x_{1}^{2}} & \frac{\partial^{2} y}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^{2} y}{\partial x_{1} \partial x_{d}} \\ \frac{\partial^{2} y}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} y}{\partial x_{2}^{2}} & \cdots & \frac{\partial^{2} y}{\partial x_{2} \partial x_{d}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^{2} y}{\partial x_{d} \partial x_{1}} & \frac{\partial^{2} y}{\partial x_{d} \partial x_{2}} & \cdots & \frac{\partial^{2} y}{\partial x_{d}^{2}}\end{array}\right).$$`

This is called the Hessian matrix.

---
# An example

The function `\(y=x_1^2-x_1x_2+x_2^2+e^{x_2}\)` has Hessian matrix
`$$\frac{\partial^2 y}{\partial \mathbf{x} \partial \mathbf{x}^{\prime}}=\left(\begin{array}{cc}2 & -1 \\ -1 & 2+e^{x_{2}}\end{array}\right).$$`

---
# Newton's algorithm for multidimensional optimization

We can now generalise the update step in Newton's method:
`$$\mathbf{x}_{n+1}=\mathbf{x}_n-\left(\frac{\partial^2 f(\mathbf{x}_n)}{\partial {\mathbf{x}}\partial{\mathbf{x}}'}\right)^{-1} \frac{\partial f(\mathbf{x}_n)}{\partial {\mathbf{x}}},$$`
i.e. the reciprocal of the second derivative becomes the inverse of the Hessian, and the first derivative becomes the gradient.

---
# Lab

Now write code to minimise `\(y=x_1^2-x_1x_2+x_2^2+e^{x_2}\)`.
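---
# Multidimensional Newton in R (a sketch)

A minimal sketch of the multidimensional update, applied to the lab function. The `newton()` below is my own illustrative version; its interface mimics the call used on the next slide (a function returning the objective value, gradient and Hessian, plus a starting vector), but the course file `./RCode/newton.R` may differ in its details.

```r
## Minimal multidimensional Newton sketch (illustrative, not ./RCode/newton.R).
## `fn` returns list(value, gradient, Hessian); `x0` is the starting vector.
newton <- function(fn, x0, tol = 1e-10, max_iter = 100) {
  x <- matrix(x0, ncol = 1)
  for (i in seq_len(max_iter)) {
    out <- fn(x)
    x <- x - solve(out[[3]], out[[2]])        # x - Hessian^{-1} %*% gradient
    if (sqrt(sum(fn(x)[[2]]^2)) < tol) break  # stop when the gradient norm is tiny
  }
  x
}

## Lab function y = x1^2 - x1*x2 + x2^2 + exp(x2) with its gradient and Hessian
fn <- function(x) {
  x1 <- x[1]; x2 <- x[2]
  list(x1^2 - x1 * x2 + x2^2 + exp(x2),
       matrix(c(2 * x1 - x2, -x1 + 2 * x2 + exp(x2)), 2, 1),
       matrix(c(2, -1, -1, 2 + exp(x2)), 2, 2))
}
newton(fn, c(0, 0))   # converges to approximately (-0.216, -0.433)
```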
---
class: inverse, center, middle

# Maximum likelihood estimation for linear models

---
# Maximum likelihood estimation for linear models

Assume you want to fit the regression model
`$$y_i = \beta_0 + \beta_1x_{i} + \epsilon_i,~\text{where}~\epsilon_i \sim N(0, 1).$$`

- How do we estimate the parameters?
- Write down the log-likelihood function with respect to the unknown parameters.
- Write down the gradient of the log-likelihood function.
- Write down the Hessian of the log-likelihood function.
- Use your Newton function to obtain the parameter estimates.

---
# Optimizing Using Newton's Method

```r
## Generate some data
beta0 <- 1
beta1 <- 3
sigma <- 1
n <- 1000
x <- rnorm(n, 3, 1)
y <- beta0 + x * beta1 + rnorm(n, mean = 0, sd = sigma)

## With sigma known, maximising the normal log-likelihood is equivalent to
## minimising the sum of squares, so we work with the latter
func <- function(beta) {
    sum((y - beta[1] - beta[2] * x)^2)
}
grad <- function(beta) {
    matrix(c(sum(-2 * (y - beta[1] - beta[2] * x)),
             sum(-2 * x * (y - beta[1] - beta[2] * x))), 2, 1)
}
hess <- function(beta) {
    matrix(c(2 * length(x), 2 * sum(x),
             2 * sum(x), 2 * sum(x^2)), 2, 2)
}
```

---
# Optimizing Using Newton's Method

```r
## The optimization
source('./RCode/newton.R')

## Newton optimization
optimOut <- newton(function(beta) {
    list(func(beta), grad(beta), hess(beta))
}, c(-4, -5))
optimOut
#>          [,1]
#> [1,] 1.109307
#> [2,] 2.962898

beta0Hat <- optimOut[1, ]
beta1Hat <- optimOut[2, ]
```

---
# Comparison with `lm()`

.pull-left[
```r
myLM <- lm(y ~ x)
myLMCoef <- myLM$coefficients
coefs <- data.frame(Method = c("Newton", "LM"),
                    intercept = c(beta0Hat, myLMCoef[1]),
                    slope = c(beta1Hat, myLMCoef[2]))
coefs
#>             Method intercept    slope
#>             Newton  1.109307 2.962898
#> (Intercept)     LM  1.109307 2.962898
```

```r
library(ggplot2)
qplot(x, y) +
    geom_abline(aes(slope = myLMCoef[2], intercept = myLMCoef[1],
                    color = "LM"), size = 3) +
    geom_abline(aes(slope = beta1Hat, intercept = beta0Hat,
                    color = "Newton"), linetype = "dashed", size = 2) +
    labs(color = "Method")
```
]

.pull-right[
<img src="figure/comp-ols-plot-1.png" width="504" style="display: block; margin: auto;" />
]

---
# Newton's method in R

- The `nlm()` function in R minimises a function using a Newton-type algorithm, given a vector of starting values.
- Redo the linear regression example using `nlm()`.

```r
nlm(func, c(-4, 5))$estimate
#> [1] 1.109348 2.962884
```

---
# References

- Chapter 1 of the [Advanced Statistical Computing](https://bookdown.org/rdpeng/advstatcomp/steepest-descent.html) book by Roger Peng.
- Chapter 2 of the online Chinese textbook.
- Chapters 10 and 12 of "Introduction to Scientific Programming and Simulation Using R" by Jones, Maillardet and Robinson.