layout: true

---
class: inverse, center, middle
background-image: url(../figs/titlepage16-9.png)
background-size: cover

<br>
<br>
# Bayesian Statistics and Computing
## Lecture 8: Quasi-Newton Methods
<img src="../figs/slides.png" width="150px"/>

#### *Yanfei Kang | BSC | Beihang University*

---
# Quasi-Newton Methods

- One of the most difficult parts of Newton's method is working out the derivatives, especially the Hessian.
- However, there are methods that approximate the Hessian (and also the gradient).
- These are known as quasi-Newton methods.
- In general they converge more slowly than pure Newton methods.

---
# The BFGS Algorithm

Introduced over several papers by Broyden, Fletcher, Goldfarb and Shanno, it is the most popular quasi-Newton algorithm.

- Recall the Newton iteration:
`$$x_{n+1} = x_n - f^{''}(x_n)^{-1}f^{'}(x_n).$$`
- Is there some matrix we can use to replace `\(f^{''}(x_n)\)` or `\(f^{''}(x_n)^{-1}\)`?
- Can we use a revised iteration `\(x_{n+1} = x_n - B_n^{-1}f^{'}(x_n),\)` where `\(B_n\)` is simpler to compute but still allows the algorithm to converge quickly?

---
# The BFGS Algorithm

- The idea behind quasi-Newton methods is to find a matrix `\(B_n\)` that solves
`$$f^{'}(x_n) - f^{'}(x_{n-1}) = B_n(x_n - x_{n-1}).$$`
- Let `\(y_n = f^{'}(x_n) - f^{'}(x_{n-1})\)` and `\(s_n = x_n - x_{n-1}\)`. One updating procedure for `\(B_n\)` is
`$$B_n = B_{n-1} + \frac{y_ny_n^{'}}{y_n^{'}s_n} - \frac{B_{n-1}s_n s_n^{'}B_{n-1}^{'}}{s_n^{'}B_{n-1}s_n}.$$`
- A minimal R sketch of this iteration is given at the end of these slides.

---
# The L-BFGS-B Algorithm

- The R function `optim()` also offers a variant of BFGS called L-BFGS-B.
- L-BFGS-B uses less computer memory than BFGS and allows for box constraints.

---
# Box Constraints

- Box constraints have the form
`$$l_i \leq x_i \leq u_i,~ \forall i.$$`
- In statistics this can be very useful. Parameters are often constrained:
    - A variance must be greater than 0.
    - For a stationary AR(1), the coefficient must be between -1 and 1.
    - Weights in a portfolio must be between 0 and 1.

---
# `optim()` in R

- `optim()` requires at least two inputs:
    - Initial values
    - The function that needs to be optimized
- By default it minimises a function.
- A function that computes the gradient vector can also be provided.
- The optimization method can be set (choices include BFGS, L-BFGS-B and Nelder-Mead).
- Lower and upper bounds can be set through the arguments `lower` and `upper` if the L-BFGS-B method is used.
- Worked examples are sketched at the end of these slides.

---
# `optim()` in R

- Further arguments can be passed in an argument called `control`.
- Some things that can be included in this list are
    - Maximum number of iterations (`maxit`)
    - How much information about the algorithm to display (`trace`)
    - How often to display that information (`REPORT`)

---
# `optim()` in R

- The result of `optim()` can be saved in an object that is a list containing
    - The value of the function at the optimum (`value`)
    - The optimal parameters (`par`)
    - Information about whether the algorithm has converged (`convergence`)
- For all methods, `convergence = 0` if the algorithm has converged (slightly confusing).

---
# Exercise

Use `optim()` to carry out maximum likelihood estimation for the linear regression model from our last lecture.

---
# References

Section 3.3 of the book ["Advanced Statistical Computing"](https://bookdown.org/rdpeng/advstatcomp/).
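
---
# Sketch: A Bare-Bones BFGS Iteration in R

As a complement to the update formula on the earlier BFGS slide, here is a minimal sketch of the quasi-Newton iteration `\(x_{n+1} = x_n - B_n^{-1}f^{'}(x_n)\)` with the BFGS update of `\(B_n\)`. It assumes the gradient function `grad` is supplied by the user, starts from `\(B_0 = I\)`, and omits the line search and safeguards used by production implementations such as `optim()`.

```r
# Minimal quasi-Newton loop with the BFGS update of the Hessian
# approximation B. Illustrative only: no line search, no checks
# on the curvature condition.
bfgs_sketch <- function(x0, grad, maxit = 100, tol = 1e-8) {
  x <- x0
  B <- diag(length(x0))            # start from the identity matrix
  g <- grad(x)
  for (i in seq_len(maxit)) {
    s <- -solve(B, g)              # quasi-Newton step: solve B s = -g
    x_new <- x + s
    g_new <- grad(x_new)
    y <- g_new - g
    # BFGS update of B using y_n and s_n
    B <- B + (y %*% t(y)) / c(t(y) %*% s) -
      (B %*% s %*% t(s) %*% B) / c(t(s) %*% B %*% s)
    x <- x_new
    g <- g_new
    if (sqrt(sum(g^2)) < tol) break  # stop when the gradient is ~0
  }
  list(par = x, grad = g, iterations = i)
}

# Minimise f(x) = (x1 - 1)^2 + 2 (x2 + 3)^2 from its gradient;
# the iterates should approach (1, -3).
grad_f <- function(x) c(2 * (x[1] - 1), 4 * (x[2] + 3))
bfgs_sketch(c(0, 0), grad_f)$par
```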
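
---
# Sketch: Calling `optim()` with BFGS

The following sketch illustrates the `optim()` arguments discussed earlier (initial values, the objective function, an optional gradient, the method, and the `control` list) on a toy quadratic. The objective function and starting values are made up for illustration.

```r
# Objective and its gradient (illustrative toy problem)
f  <- function(x) (x[1] - 1)^2 + 2 * (x[2] + 3)^2
gr <- function(x) c(2 * (x[1] - 1), 4 * (x[2] + 3))

fit <- optim(par = c(0, 0),            # initial values
             fn  = f,                  # function to be minimised
             gr  = gr,                 # optional gradient
             method  = "BFGS",
             control = list(maxit = 200, trace = 1, REPORT = 5))

fit$par           # optimal parameters, close to (1, -3)
fit$value         # value of f at the optimum
fit$convergence   # 0 means the algorithm converged
```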
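
---
# Sketch: Box Constraints with L-BFGS-B

To illustrate box constraints, the sketch below maximises a normal log-likelihood (by minimising its negative) while keeping the standard deviation positive through the `lower` argument. The simulated data and the bounds are made up for illustration.

```r
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)     # simulated data for illustration

# Negative log-likelihood of an N(mu, sigma^2) sample; theta = (mu, sigma)
negloglik <- function(theta) {
  -sum(dnorm(y, mean = theta[1], sd = theta[2], log = TRUE))
}

fit <- optim(par    = c(0, 1),
             fn     = negloglik,
             method = "L-BFGS-B",
             lower  = c(-Inf, 1e-6),  # keep sigma > 0
             upper  = c(Inf, Inf))

fit$par           # estimates of (mu, sigma), close to (5, 2)
fit$convergence   # 0 means the algorithm converged
```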