School of Economics and Management
Beihang University
http://yanfei.site

Many problems in business require something to be minimized or maximized.

• Maximizing Revenue
• Minimizing Costs
• Minimizing Delivery Time
• Maximizing Financial Returns

## Optimisation in Statistics

• Maximum Likelihood
• Least Squares
• Method of Moments
• Posterior Mode

## Optimisation

• Suppose we want to find a minimum or maximum of a function $$f(x)$$.
• Sometimes $$f(x)$$ will be very complicated.
• Are there computer algorithms that can help?
• Yes!

## Root Finding

• Consider the problem of finding the root of a function.
• For the function $$f(x)$$, the root is the point $$x^*$$ such that $$f(x^*) = 0$$.
• An algorithm for solving this problem was proposed by Newton and Raphson more than 300 years ago.

## Newton-Raphson Method

• Newton-Raphson is an iterative method that begins with an initial guess of the root.
• The method uses both the original function $$f(x)$$ and its derivative $$f^{'}(x)$$.
• When successful, it usually converges rapidly, but like any other root-finding algorithm it may fail.

## Newton-Raphson Method

The method tries to solve an equation in the form of $$f(x) = 0$$ through the iterative procedure: $x_{n+1} = x_n - \frac{f(x_n)}{f^{'}(x_n)}.$
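
This update comes from approximating $$f$$ at $$x_n$$ by its tangent line and solving for where the tangent crosses zero: $0 = f(x_n) + f^{'}(x_n)(x_{n+1} - x_n) \quad\Longrightarrow\quad x_{n+1} = x_n - \frac{f(x_n)}{f^{'}(x_n)}.$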

## Stopping Rule

• With each step the algorithm should get closer to the root.
• However, it can run for a long time without reaching the exact root.
• There must be a stopping rule; otherwise the program could run forever.
• Let $$\epsilon$$ be an extremely small number, e.g. $$1 \times 10^{-10}$$, called the tolerance level.
• If $$|f(x^{*})| < \epsilon$$, then the solution is close enough and we report a root at $$x^{*}$$.

## Example

• Now find the root of $$f(x) = x^2 - 5$$ using Newton's method by hand (a worked check follows below).
• Explain its geometric interpretation.
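
For instance, starting from the (arbitrary) initial guess $$x_0 = 2$$, the update rule $$x_{n+1} = x_n - \frac{x_n^2 - 5}{2x_n}$$ gives $x_1 = 2.25,\quad x_2 \approx 2.236111,\quad x_3 \approx 2.236068,$ which already matches $$\sqrt{5} \approx 2.236068$$ to six decimal places.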

## Newton-Raphson Method

1. Select an initial guess $$x_0$$, and set $$n = 0$$.
2. Set $$x_{n+1} = x_n - \frac{f(x_n)}{f^{'}(x_n)}$$.
3. Evaluate $$|f(x_{n+1})|$$.
    1. If $$|f(x_{n+1})| < \epsilon$$, then stop;
    2. Otherwise set $$n = n+1$$ and go back to step 2.

## Lab Session 5

1. Write R code to find the root of $$x^2 = 5$$.
2. Now use your Newton-Raphson code to find the root of $$f(x) = \sqrt{|x|}$$. Try initial value 0.25.
3. Now use your Newton-Raphson code to find the root of $$x e^{-x^2} = 0.4(e^x + 1)^{-1} + 0.2$$. Try initial values 0.5 and 0.6.

What did you learn from your mistakes?

## You can use D()

```r
newton.raphson <- function(f, init.value, df = NULL, tol = 1e-5, maxiter = 1000) {
  # f is an unevaluated expression in x, e.g. expression(x^2 - 5)
  if (is.null(df)) {
    df <- D(f, 'x')  # symbolic derivative of f with respect to x
  }
  niter <- 0
  diff <- tol + 1  # guarantees at least one iteration
  x <- init.value
  while (diff >= tol && niter <= maxiter) {
    niter <- niter + 1
    fx <- eval(f)    # function value at the current x
    dfx <- eval(df)  # derivative value at the current x
    if (dfx == 0) {
      warning("Slope is zero: no further improvement possible.")
      break
    }
    diff <- -fx/dfx  # the Newton step
    x <- x + diff
    diff <- abs(diff)
  }
  if (niter > maxiter) {
    warning("Maximum number of iterations 'maxiter' was reached.")
  }
  return(list(root = x, f.root = fx, niter = niter, estim.prec = diff))
}
```
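
A quick check on the Lab Session problem (note that `f` must be passed as an unevaluated R expression in `x`):

```r
f <- expression(x^2 - 5)
newton.raphson(f, init.value = 2)  # root should be close to sqrt(5)
```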

## You can also use deriv()

```r
newton.raphson <- function(ftn, x0, tol = 1e-9, max.iter = 100) {
  # Newton-Raphson algorithm for solving ftn(x)[1] == 0
  # we assume that ftn is a function of a single variable that returns
  # the function value and the first derivative as a vector of length 2
  #
  # x0 is the initial guess at the root
  # the algorithm terminates when the function value is within distance
  # tol of 0, or the number of iterations exceeds max.iter

  # initialise
  x <- x0
  fx <- ftn(x)
  iter <- 0

  # continue iterating until stopping conditions are met
  while ((abs(fx[1]) > tol) && (iter < max.iter)) {
    x <- x - fx[1]/fx[2]
    fx <- ftn(x)
    iter <- iter + 1
    cat("At iteration", iter, "value of x is:", x, "\n")
  }

  # output depends on success of algorithm
  if (abs(fx[1]) > tol) {
    cat("Algorithm failed to converge\n")
    return(NULL)
  } else {
    cat("Algorithm converged\n")
    return(x)
  }
}
```

```r
f <- expression(x^2 - 5)
df <- deriv(f, 'x', func = TRUE)  # returns the value with a "gradient" attribute
ftn <- function(x) {
  dfx <- df(x)
  # return the function value and first derivative as a vector of length 2
  c(dfx[1], attr(dfx, "gradient")[1, 1])
}

newton.raphson(ftn, 1)
```

## Other Examples

Now try these functions:

1. $$f(x) = x^3 - 2x^2 - 11x + 12$$, try starting values 2.35287527 and 2.35284172.
2. $$f(x) = 2x^3 + 3x^2 + 5$$, try starting values 0.5 and 0.

## Learn from Mistakes

• Newton-Raphson does not always converge.
• For some functions, certain starting values lead to a sequence that converges, while other starting values lead to a sequence that diverges.
• For other functions, different starting values converge to different roots.
• Be careful when choosing the initial value.
• Newton-Raphson does not work if the first derivative is zero.

## Conclusion

• Why did we spend so much time on finding roots of an equation?
• Isn’t this topic meant to be about optimization?
• Can we change the algorithm slightly so that it works for optimization?

## Finding a Maximum or Minimum

• Suppose we want to find a minimum or maximum of a function $$f(x)$$ (think about maximum likelihood estimation).
• Find the derivative $$f^{'}(x)$$ and find $$x^{*}$$ such that $$f^{'}(x^*) = 0$$.
• This is the same as finding a root of the first derivative, so we can use the Newton-Raphson algorithm on the first derivative.

## Newton-Raphson algorithm for finding local maxima or minima

1. Select an initial guess $$x_0$$, and set $$n = 0$$.
2. Set $$x_{n+1} = x_n - \frac{f^{'}(x_n)}{f^{''}(x_n)}$$.
3. Evaluate $$|f^{'}(x_{n+1})|$$.
    1. If $$|f^{'}(x_{n+1})| < \epsilon$$, then stop;
    2. Otherwise set $$n = n+1$$ and go back to step 2.

## Different Stopping Rules

Three stopping rules can be used.

• $$|f^{'}(x_{n+1})| < \epsilon$$.
• $$|x_{n} - x_{n-1}| < \epsilon$$.
• $$|f(x_{n}) - f(x_{n-1})| < \epsilon$$.

## Intuition

• Focus on the step size $$-\frac{f'(x)}{f''(x)}$$.
• The signs of the derivatives control the direction of the next step.
• The sizes of the derivatives control the size of the next step.
• Consider the concave function $$f(x)=-x^4$$, which has $$f'(x)=-4x^3$$ and $$f''(x)=-12x^2$$. There is a maximum at $$x^{*}=0$$; the simplified update below makes the behaviour concrete.
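
For this function the step simplifies to $x_{n+1} = x_n - \frac{-4x_n^3}{-12x_n^2} = x_n - \frac{x_n}{3} = \frac{2}{3}x_n,$ so each iteration moves a third of the way toward the maximum at $$x^{*}=0$$, whatever the starting value. (Convergence here is only linear because $$f''(x^{*})=0$$.)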

## Role of first derivative

• If $$f''(x)$$ is negative the function is locally concave, and the search is for a local maximum.
• To the left of this maximum $$f'(x)>0$$.
• Therefore $$-\frac{f'(x)}{f''(x)}>0$$.
• The next step is to the right.
• The reverse holds if $$f'(x)<0$$.
• Large absolute values of $$f'(x)$$ imply a steep slope, so a big step is needed to get close to the optimum. The reverse holds for small absolute values of $$f'(x)$$.

## Role of first derivative

• If $$f''(x)$$ is positive the function is locally convex, and the search is for a local minimum.
• To the left of this minimum $$f'(x)<0$$.
• Therefore $$-\frac{f'(x)}{f''(x)}>0$$.
• The next step is to the right.
• The reverse holds if $$f'(x)>0$$.
• Large absolute values of $$f'(x)$$ imply a steep slope, so a big step is needed to get close to the optimum. The reverse holds for small absolute values of $$f'(x)$$.

## Role of second derivative

• Together with the sign of the first derivative, the sign of the second derivative controls the direction of the next step.
• A larger second derivative (in absolute value) implies a larger curvature.
• In this case smaller steps are needed to stop the algorithm from overshooting.
• The opposite holds for a small second derivative.

## Functions with more than one input

• Most interesting optimization problems involve multiple inputs.
• In determining the most risk efficient portfolio the return is a function of many weights (one for each asset).
• In least squares estimation for a linear regression model, the sum of squares is a function of many coefficients (one for each regressor).
• How do we optimize a function $$f(\mathbf{x})$$ where $$\mathbf{x}$$ is a vector?

## Derivatives

• Newton's algorithm has a simple update rule based on first and second derivatives.
• What do these derivatives look like when the function is $$y=f(\mathbf{x})$$ where $$y$$ is a scalar and $$\mathbf{x}$$ is a $$d\times 1$$ vector?

## First derivative

Simply take the partial derivatives and put them in a vector $\frac{\partial y}{\partial{\mathbf{x}}}= \left( \begin{array}{c} \frac{\partial y}{\partial x_1}\\ \frac{\partial y}{\partial x_2}\\ \vdots\\ \frac{\partial y}{\partial x_d} \end{array} \right)$ This is called the gradient vector.

## An example

The function $$y=x_1^2-x_1x_2+x_2^2+e^{x_2}$$ has gradient vector

$\frac{\partial y}{\partial{\mathbf{x}}}=\left(\begin{array}{c} 2x_1-x_2\\ -x_1+2x_2+e^{x_2} \end{array} \right).$

## Second derivative

Simply take the second order partial derivatives. This will give a matrix $\frac{\partial^2 y}{\partial{\mathbf{x}}\partial{\mathbf{x}}'}= \left( \begin{array}{cccc} \frac{\partial^2 y}{\partial x_1^2}&\frac{\partial^2 y}{\partial x_1\partial x_2}&\cdots&\frac{\partial^2 y}{\partial x_1\partial x_d}\\ \frac{\partial^2 y}{\partial x_2\partial x_1}&\frac{\partial^2 y}{\partial x_2^2}&\cdots&\frac{\partial^2 y}{\partial x_2\partial x_d}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{\partial^2 y}{\partial x_d\partial x_1}&\frac{\partial^2 y}{\partial x_d\partial x_2}&\cdots&\frac{\partial^2 y}{\partial x_d^2}\\ \end{array} \right).$

This is called the Hessian matrix.

## An example

The function $$y=x_1^2-x_1x_2+x_2^2+e^{x_2}$$ has Hessian matrix $\frac{\partial^2 y}{\partial{\mathbf{x}}\partial{ \mathbf{x}}'}=\left(\begin{array}{cc} 2 & -1\\ -1 & 2 + e^{x_2} \end{array} \right)$
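
These derivatives can be checked symbolically in R; a small sketch using `deriv()` with its `hessian` argument (the evaluation point $$(1, 1)$$ is arbitrary):

```r
g <- deriv(expression(x1^2 - x1*x2 + x2^2 + exp(x2)),
           c("x1", "x2"), function.arg = c("x1", "x2"), hessian = TRUE)
out <- g(1, 1)
attr(out, "gradient")  # the gradient vector above, returned as a 1 x 2 matrix
attr(out, "hessian")   # the Hessian matrix above, returned as a 1 x 2 x 2 array
```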

## Newton’s algorithm for multidimensional optimization

We can now generalise the update step in Newton's method: $\mathbf{x_{n+1}}=\mathbf{x_n}-\left(\frac{\partial^2 f({ \mathbf{x}})}{\partial {\mathbf{x}}\partial{\mathbf{x}}'}\right)^{-1} \frac{\partial f({\mathbf{x}})}{\partial {\mathbf{x}}}$

Now write code to minimise $$y=x_1^2-x_1x_2+x_2^2+e^{x_2}$$.
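
One possible sketch, assuming the gradient and Hessian derived on the previous slides and an arbitrary starting point $$(1, 1)$$:

```r
f <- function(x) x[1]^2 - x[1]*x[2] + x[2]^2 + exp(x[2])
grad <- function(x) c(2*x[1] - x[2], -x[1] + 2*x[2] + exp(x[2]))
hess <- function(x) matrix(c(2, -1, -1, 2 + exp(x[2])), 2, 2)

x <- c(1, 1)  # initial guess
for (i in 1:100) {
  step <- solve(hess(x), grad(x))  # Newton step: solve H %*% step = gradient
  x <- x - step
  if (sum(abs(step)) < 1e-9) break  # stop once the step is negligible
}
x     # approximate minimiser
f(x)  # minimum value
```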

## Maximum likelihood Estimate for linear models

Assume you want to fit a regression model $y_i = \beta_0 + \beta_1x_{i} + \epsilon_i,~\text{where}~\epsilon_i \sim N(0, 1).$

• How do we estimate the parameters?
• Write down the log likelihood function with respect to the unknown parameters (see the hint below the list).
• Write down the gradient for the log likelihood function.
• Write down the Hessian for the log likelihood function.
• Use your newton function to obtain the best parameter estimate.
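
As a hint for the log likelihood step: with $$\sigma = 1$$, $\ell(\beta_0, \beta_1) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2,$ so maximising the log likelihood is equivalent to minimising the sum of squares, which is what the code on the following slides does.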

## Optimizing Using Newton's Method

## Generate some data
```r
beta0 <- 1
beta1 <- 3
sigma <- 1
n <- 1000
x <- rnorm(n, 3, 1)
y <- beta0 + x * beta1 + rnorm(n, mean = 0, sd = sigma)
# plot(x, y, col = "blue", pch = 20)
```

## Make the normal log likelihood function

```r
# minus the log likelihood is (up to constants) the sum of squares
func = function(beta){
  sum((y - beta[1] - beta[2]*x)^2)
}

grad = function(beta){
  matrix(c(sum(-2*(y - beta[1] - beta[2]*x)),
           sum(-2*x*(y - beta[1] - beta[2]*x))), 2, 1)
}

hess = function(beta){
  matrix(c(2*length(x), 2*sum(x), 2*sum(x), 2*sum(x^2)), 2, 2)
}
```

## The optimization
```r
source('./RCode/newton.R')
optimOut <- newton(function(beta){list(func(beta), grad(beta), hess(beta))},
                   c(1.1, 1.3))
beta0Hat <- optimOut[1]
beta1Hat <- optimOut[2]
yHat <- beta0Hat + beta1Hat * x
```

## Plot
```r
plot(x, y, pch = 20, col = "blue")
points(sort(x), yHat[order(x)], type = "l", col = "red", lwd = 2)
```

## Comparison with OLS

```r
myLM <- lm(y ~ x)
myLMCoef <- myLM$coefficients
yHatOLS <- myLMCoef[1] + myLMCoef[2] * x
plot(x, y, pch = 20, col = "blue")
points(sort(x), yHat[order(x)], type = "l", col = "red", lwd = 10)
points(sort(x), yHatOLS[order(x)], type = "l",
       col = "blue", lty = "dashed", lwd = 2)
```

## Lab Session 7

Use Newton's method to find the maximum likelihood estimate for the coefficients in a logistic regression. The steps are:

1. Write down the likelihood function.
2. Find the gradient and Hessian matrix.
3. Code these up in R.
4. Simulate some data from a logistic regression model and test your code (a simulation sketch follows below).
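
A small sketch for step 4 (the seed and the true parameter values are arbitrary choices); `glm()` provides a benchmark against which to check your Newton estimates:

```r
set.seed(1)
n <- 500
x <- rnorm(n)
beta <- c(-1, 2)  # assumed true intercept and slope
p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))  # logistic probabilities
y <- rbinom(n, size = 1, prob = p)
glm(y ~ x, family = binomial)  # compare with your Newton-Raphson estimates
```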

## References

Chapters 10 and 12 of Jones, Maillardet, and Robinson, "Introduction to Scientific Programming and Simulation Using R".