Many problems in business require something to be minimized or maximized.

- Maximizing Revenue
- Minimizing Costs
- Minimizing Delivery Time
- Maximizing Financial Returns

The same is true in statistics, where many estimation methods are built around an optimization problem:

- Maximum Likelihood
- Least Squares
- Method of Moments
- Posterior Mode

- Suppose we want to find a minimum or maximum of a function \(f(x)\).
- Sometimes \(f(x)\) will be very complicated.
- Are there computer algorithms that can help?
- Yes!

- Consider the problem of finding the root of a function.
- For the function \(f(x)\), the root is the point \(x^*\) such that \(f(x^*) = 0\).
- An algorithm for solving this problem was proposed by Newton and Raphson more than 300 years ago.

- Newton-Raphson is an iterative method that begins with an initial guess of the root.
- The method uses both the original function \(f(x)\) and its derivative \(f^{'}(x)\).
- When successful, it usually converges rapidly, but like any root-finding algorithm it may fail.

The method tries to solve an equation of the form \(f(x) = 0\) through the iterative procedure: \[x_{n+1} = x_n - \frac{f(x_n)}{f^{'}(x_n)}.\]

Please think about **Taylor expansion**.
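
The update rule follows from the first-order Taylor expansion of \(f\) around the current guess \(x_n\):
\[f(x) \approx f(x_n) + f^{'}(x_n)(x - x_n).\]
Setting the right-hand side to zero and solving for \(x\) gives the next iterate \(x_{n+1} = x_n - f(x_n)/f^{'}(x_n)\).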

- With each step the algorithm should get closer to the root.
- However, it can run for a long time without reaching the exact root.
- There must be a stopping rule otherwise the program could run forever.
- Let \(\epsilon\) be an extremely small number, e.g. \(1 \times 10^{-10}\), called the **tolerance level**.
- If \(|f(x^{*})| < \epsilon\), then the solution is close enough and there is a root at \(x^{*}\).

- Now find the root of \(f(x) = x^2 - 5\) using Newton's method by hand (the first iterations are worked out below).
- Describe its geometric interpretation.
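
As a check on the hand calculation, starting (for illustration) from \(x_0 = 2\):
\[x_1 = 2 - \frac{2^2-5}{2\cdot 2} = 2.25, \qquad x_2 = 2.25 - \frac{2.25^2-5}{2\cdot 2.25} \approx 2.2361,\]
which is already close to \(\sqrt{5}\approx 2.23607\). Geometrically, each iteration follows the tangent line to \(f\) at \((x_n, f(x_n))\) down to the \(x\)-axis; the crossing point becomes \(x_{n+1}\).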

1. Select an initial guess \(x_0\), and set \(n = 0\).
2. Set \(x_{n+1} = x_n - \frac{f(x_n)}{f^{'}(x_n)}\) and evaluate \(|f(x_{n+1})|\).
    - If \(|f(x_{n+1})| < \epsilon\), then stop;
    - otherwise set \(n = n + 1\) and go back to step 2.

- Write R code to find the root of \(x^2 = 5\).
- Now use your Newton-Raphson code to find the root of \(f(x) = \sqrt{|x|}\). Try initial value 0.25.
- Now use your Newton-Raphson code to find the root of \(x e^{-x^2} = 0.4(e^x + 1)^{-1} + 0.2\). Try initial values 0.5 and 0.6.

What did you learn from these failures?

One implementation, using `D()` to obtain the derivative symbolically:

```r
newton.raphson <- function(f, init.value, df = NULL, tol = 1e-5, maxiter = 1000) {
  if (is.null(df)) {
    df <- D(f, 'x')
  }
  niter <- 0
  diff <- tol + 1
  x <- init.value
  while (diff >= tol && niter <= maxiter) {
    niter <- niter + 1
    fx <- eval(f)
    dfx <- eval(df)
    if (dfx == 0) {
      warning("Slope is zero: no further improvement possible.")
      break
    }
    diff <- -fx/dfx
    x <- x + diff
    diff <- abs(diff)
  }
  if (niter > maxiter) {
    warning("Maximum number of iterations 'maxiter' was reached.")
  }
  return(list(root = x, f.root = fx, niter = niter, estim.prec = diff))
}
```

An alternative implementation, using `deriv()`:

```r
newton.raphson <- function(ftn, x0, tol = 1e-9, max.iter = 100) {
  # Newton-Raphson algorithm for solving ftn(x)[1] == 0
  # we assume that ftn is a function of a single variable that returns
  # the function value and the first derivative as a vector of length 2
  #
  # x0 is the initial guess at the root
  # the algorithm terminates when the function value is within distance
  # tol of 0, or the number of iterations exceeds max.iter

  # initialise
  x <- x0
  fx <- ftn(x)
  iter <- 0

  # continue iterating until stopping conditions are met
  while ((abs(fx[1]) > tol) && (iter < max.iter)) {
    x <- x - fx[1]/fx[2]
    fx <- ftn(x)
    iter <- iter + 1
    cat("At iteration", iter, "value of x is:", x, "\n")
  }

  # output depends on success of algorithm
  if (abs(fx[1]) > tol) {
    cat("Algorithm failed to converge\n")
    return(NULL)
  } else {
    cat("Algorithm converged\n")
    return(x)
  }
}

f <- expression(x^2 - 5)
df <- deriv(f, 'x', func = TRUE)
ftn <- function(x){
  dfx <- df(x)
  f <- dfx[1]
  gradf <- attr(dfx, 'gradient')[1,]
  return(c(f, gradf))
}
newton.raphson(ftn, 1)
```

Now try these functions:

- \(f(x) = x^3 - 2x^2 - 11x + 12\), try starting values 2.35287527 and 2.35284172.
- \(f(x) = 2x^3 + 3x^2 + 5\), try starting values 0.5 and 0.

- Newton-Raphson does not always converge.
- For some functions, certain starting values lead to a sequence that converges, while other starting values lead to a sequence that diverges.
- For other functions, different starting values converge to different roots.
- Be careful when choosing the initial value.
- Newton-Raphson doesn't work if the first derivative is zero.

- Why did we spend so much time on finding roots of an equation?
- Isn't this topic meant to be about optimization?
- Can we change the algorithm slightly so that it works for optimization?

- Suppose we want to find a minimum or maximum of a function \(f(x)\) (think about maximum likelihood estimation).
- Find the derivative \(f^{'}(x)\) and find \(x^{*}\) such that \(f^{'}(x^*) = 0\).
- This is the same as finding a root of the first derivative, so we can apply the Newton-Raphson algorithm to \(f^{'}(x)\).

1. Select an initial guess \(x_0\), and set \(n = 0\).
2. Set \(x_{n+1} = x_n - \frac{f^{'}(x_n)}{f^{''}(x_n)}\) and evaluate \(|f^{'}(x_{n+1})|\).
    - If \(|f^{'}(x_{n+1})| < \epsilon\), then stop;
    - otherwise set \(n = n + 1\) and go back to step 2.
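
A small illustration of these steps, reusing the `D()`-based `newton.raphson()` from earlier. The function \(f(x) = x e^{-x^2}\) and the starting value are illustrative choices: to optimize \(f\), we apply the root-finder to \(f^{'}(x)\).

```r
## Optimize f(x) = x * exp(-x^2) by finding the root of f'(x)
## (illustrative example; reuses the D()-based newton.raphson() above)
f  <- expression(x * exp(-x^2))
df <- D(f, 'x')                       # f'(x); its root is the optimum
newton.raphson(df, init.value = 0.5)  # converges near 1/sqrt(2) ~ 0.707
```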

Three stopping rules can be used.

- \(|f^{'}(x_{n+1})| < \epsilon\).
- \(|x_{n} - x_{n-1}| < \epsilon\).
- \(|f(x_{n}) - f(x_{n-1})| < \epsilon\).

- Focus on the step size \(-\frac{f'(x)}{f''(x)}\).
- The signs of the derivatives control the direction of the next step.
- The sizes of the derivatives control the size of the next step.
- Consider the concave function \(f(x)=-x^4\), which has \(f'(x)=-4x^3\) and \(f''(x)=-12x^2\). There is a maximum at \(x^{*}=0\) (the update for this case is worked out below).
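
For this function the Newton update can be written out explicitly:
\[x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)} = x_n - \frac{-4x_n^3}{-12x_n^2} = \frac{2}{3}x_n,\]
so each step moves a fixed fraction of the way towards \(x^{*}=0\); convergence is only linear here because \(f''(x^{*})=0\).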

- If \(f''(x)\) is negative the function is locally concave, and the search is for a local maximum.
- To the left of this maximum \(f'(x)>0\).
- Therefore \(-\frac{f'(x)}{f''(x)}>0\).
- The next step is to the right.
- The reverse holds if \(f'(x)<0\).
- Large absolute values of \(f'(x)\) imply a steep slope, so a big step is needed to get close to the optimum. The reverse holds for small absolute values of \(f'(x)\).

- If \(f''(x)\) is positive the function is locally convex, and the search is for a local minimum.
- To the left of this minimum \(f'(x)<0\).
- Therefore \(-\frac{f'(x)}{f''(x)}>0\).
- The next step is to the right.
- The reverse holds if \(f'(x)>0\).
- Large absolute values of \(f'(x)\) imply a steep slope, so a big step is needed to get close to the optimum. The reverse holds for small absolute values of \(f'(x)\).

- Together with the sign of the first derivative, the sign of the second derivative controls the direction of the next step.
- A larger second derivative (in absolute value) implies a larger curvature.
- In this case smaller steps are needed to stop the algorithm from overshooting.
- The opposite holds for a small second derivative.

- Most interesting optimization problems involve multiple inputs.
- In determining the most risk-efficient portfolio, the return is a function of many weights (one for each asset).
- In least squares estimation for a linear regression model, the sum of squares is a function of many coefficients (one for each regressor).

- How do we optimize a function \(f(\mathbf{x})\) where \(\mathbf{x}\) is a vector?

- Newton's algorithm has a simple update rule based on first and second derivatives.
- What do these derivatives look like when the function is \(y=f(\mathbf{x})\), where \(y\) is a scalar and \(\mathbf{x}\) is a \(d\times 1\) vector?

Simply take the partial derivatives and put them in a vector \[ \frac{\partial y}{\partial{\mathbf{x}}}= \left( \begin{array}{c} \frac{\partial y}{\partial x_1}\\ \frac{\partial y}{\partial x_2}\\ \vdots\\ \frac{\partial y}{\partial x_d} \end{array} \right)\] This is called the gradient vector.

The function \(y=x_1^2-x_1x_2+x_2^2+e^{x_2}\) has gradient vector

\[\frac{\partial y}{\partial{\mathbf{x}}}=\left(\begin{array}{c} 2x_1-x_2\\ -x_1+2x_2+e^{x_2} \end{array} \right).\]

Simply take the second-order partial derivatives. This gives a matrix \[ \frac{\partial^2 y}{\partial{\mathbf{x}}\partial{\mathbf{x}}'}= \left( \begin{array}{cccc} \frac{\partial^2 y}{\partial x_1^2}&\frac{\partial^2 y}{\partial x_1\partial x_2}&\cdots&\frac{\partial^2 y}{\partial x_1\partial x_d}\\ \frac{\partial^2 y}{\partial x_2\partial x_1}&\frac{\partial^2 y}{\partial x_2^2}&\cdots&\frac{\partial^2 y}{\partial x_2\partial x_d}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{\partial^2 y}{\partial x_d\partial x_1}&\frac{\partial^2 y}{\partial x_d\partial x_2}&\cdots&\frac{\partial^2 y}{\partial x_d^2}\\ \end{array} \right).\]

This is called the Hessian matrix.

The function \(y=x_1^2-x_1x_2+x_2^2+e^{x_2}\) has Hessian matrix \[\frac{\partial^2 y}{\partial{\mathbf{x}}\partial{ \mathbf{x}}'}=\left(\begin{array}{cc} 2 & -1\\ -1 & 2 + e^{x_2} \end{array} \right). \]
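
As a quick check of the hand-derived gradient and Hessian, `deriv()` with `hessian = TRUE` computes both symbolically. A small illustration; the evaluation point \((1, 2)\) is arbitrary.

```r
## Verify the gradient and Hessian of y = x1^2 - x1*x2 + x2^2 + exp(x2)
h <- deriv(expression(x1^2 - x1*x2 + x2^2 + exp(x2)),
           c("x1", "x2"), function.arg = c("x1", "x2"), hessian = TRUE)
out <- h(1, 2)
attr(out, "gradient")   # 1 x 2: (2*x1 - x2, -x1 + 2*x2 + exp(x2))
attr(out, "hessian")    # 1 x 2 x 2 array matching the matrix above
```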

We can now generalise the update step in Newton's method: \[ \mathbf{x}_{n+1}=\mathbf{x}_n-\left(\frac{\partial^2 f(\mathbf{x}_n)}{\partial {\mathbf{x}}\partial{\mathbf{x}}'}\right)^{-1} \frac{\partial f(\mathbf{x}_n)}{\partial {\mathbf{x}}},\] i.e. the inverse Hessian times the gradient, both evaluated at the current iterate \(\mathbf{x}_n\).

Now write code to minimise \(y=x_1^2-x_1x_2+x_2^2+e^{x_2}\).
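
A minimal sketch of one possible solution. It assumes the objective is supplied as a function returning `list(value, gradient, Hessian)`; the `newton.R` sourced in the regression example later follows a similar interface, but its exact implementation may differ.

```r
## A minimal multivariate Newton sketch. Assumption: ftn(x) returns
## list(value, gradient, Hessian).
newton <- function(ftn, x0, tol = 1e-9, max.iter = 100) {
  x <- x0
  for (iter in seq_len(max.iter)) {
    fx <- ftn(x)
    if (max(abs(fx[[2]])) < tol) break            # gradient (nearly) zero: stop
    x <- x - as.vector(solve(fx[[3]], fx[[2]]))   # x - Hessian^{-1} %*% gradient
  }
  x
}

## Apply it to y = x1^2 - x1*x2 + x2^2 + exp(x2), using the gradient and
## Hessian derived above.
ftn <- function(x) {
  value    <- x[1]^2 - x[1]*x[2] + x[2]^2 + exp(x[2])
  gradient <- c(2*x[1] - x[2], -x[1] + 2*x[2] + exp(x[2]))
  hessian  <- matrix(c(2, -1, -1, 2 + exp(x[2])), 2, 2)
  list(value, gradient, hessian)
}
newton(ftn, c(0, 0))
```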

Assume you want to fit the regression model \[ y_i = \beta_0 + \beta_1x_{i} + \epsilon_i,~\text{where}~\epsilon_i \sim N(0, 1). \]

- How do we estimate the parameters?
- Write down the log likelihood function with respect to the unknown parameters.
- Write down the gradient for the log likelihood function.
- Write down the Hessian for the log likelihood function.
- Use your newton function to obtain the best parameter estimate.
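
For reference, with \(\sigma = 1\) the log likelihood is
\[\ell(\beta_0,\beta_1) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1 x_i)^2,\]
so maximizing \(\ell\) is equivalent to minimizing the sum of squares \(S(\beta_0,\beta_1)=\sum_{i=1}^{n}(y_i-\beta_0-\beta_1 x_i)^2\), whose gradient and Hessian (the forms used in the code below) are
\[\frac{\partial S}{\partial \boldsymbol{\beta}}= \left(\begin{array}{c} -2\sum_{i}(y_i-\beta_0-\beta_1 x_i)\\ -2\sum_{i} x_i(y_i-\beta_0-\beta_1 x_i) \end{array}\right), \qquad \frac{\partial^2 S}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}'}= \left(\begin{array}{cc} 2n & 2\sum_{i} x_i\\ 2\sum_{i} x_i & 2\sum_{i} x_i^2 \end{array}\right).\]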

```r
## Generate some data
beta0 <- 1
beta1 <- 3
sigma <- 1
n <- 1000
x <- rnorm(n, 3, 1)
y <- beta0 + x * beta1 + rnorm(n, mean = 0, sd = sigma)
# plot(x, y, col = "blue", pch = 20)

## With sigma = 1, maximising the log likelihood is equivalent to
## minimising the sum of squares
func = function(beta){
  sum((y - beta[1] - beta[2]*x)^2)
}
grad = function(beta){
  matrix(c(sum(-2*(y - beta[1] - beta[2]*x)),
           sum(-2*x*(y - beta[1] - beta[2]*x))), 2, 1)
}
hess = function(beta){
  matrix(c(2*length(x), 2*sum(x), 2*sum(x), 2*sum(x^2)), 2, 2)
}

## The optimization
source('./RCode/newton.R')
optimOut <- newton(function(beta){list(func(beta), grad(beta), hess(beta))},
                   c(1.1, 1.3))
beta0Hat <- optimOut[1]
beta1Hat <- optimOut[2]
yHat <- beta0Hat + beta1Hat * x

## Plot
plot(x, y, pch = 20, col = "blue")
points(sort(x), yHat[order(x)], type = "l", col = "red", lwd = 2)
```

```r
myLM <- lm(y ~ x)
myLMCoef <- myLM$coefficients
yHatOLS <- myLMCoef[1] + myLMCoef[2] * x
plot(x, y, pch = 20, col = "blue")
points(sort(x), yHat[order(x)], type = "l", col = "red", lwd = 10)
points(sort(x), yHatOLS[order(x)], type = "l", col = "blue",
       lty = "dashed", lwd = 2, pch = 20)
```

Use Newton's method to find the maximum likelihood estimate for the coefficients in a logistic regression. The steps are:

- Write down the likelihood function.
- Find the gradient and Hessian matrix.
- Code these up in R.
- Simulate some data from a logistic regression model and test your code.
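
A minimal sketch of one possible solution; the sample size, the true coefficients, and the inline Newton loop are all illustrative choices, not prescribed by the exercise.

```r
## Logistic regression by Newton's method (illustrative sketch).
## Model: P(y_i = 1) = 1 / (1 + exp(-(b0 + b1 * x_i))).
set.seed(1)
n <- 1000
x <- rnorm(n)
X <- cbind(1, x)                      # design matrix with intercept
beta.true <- c(-0.5, 2)               # illustrative true coefficients
p <- as.vector(1 / (1 + exp(-X %*% beta.true)))
y <- rbinom(n, 1, p)

## Log likelihood, gradient and Hessian
ftn <- function(beta) {
  eta <- as.vector(X %*% beta)
  p   <- 1 / (1 + exp(-eta))
  ll  <- sum(y * eta - log(1 + exp(eta)))
  g   <- as.vector(t(X) %*% (y - p))        # gradient
  H   <- -t(X) %*% (X * (p * (1 - p)))      # Hessian (negative definite)
  list(ll, g, H)
}

## Newton iterations: beta <- beta - H^{-1} g
beta <- c(0, 0)
for (i in 1:25) {
  fx <- ftn(beta)
  if (max(abs(fx[[2]])) < 1e-9) break
  beta <- beta - as.vector(solve(fx[[3]], fx[[2]]))
}
beta
## Compare with the built-in fit: coef(glm(y ~ x, family = binomial))
```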

Chapters 10 and 12 of the book "Introduction to Scientific Programming and Simulation Using R".