
Bayesian Statistics and Computing

Lecture 10: Stochastic Gradient Descent

Yanfei Kang

2020/03/10 (updated: 2020-03-25)

1 / 23

Stochastic gradient descent

  • Popular for training a wide range of models in machine learning.

  • Optimization is a big part of machine learning.

  • The goal of all supervised machine learning algorithms is to best estimate a target function that maps input data X onto output variables y.

  • This requires an optimization process to find the set of coefficients that results in the best estimate of the target function.

  • The quality of a candidate set of coefficients is measured by a cost function.

2 / 23

Gradient descent

3 / 23

Gradient descent

  • Gradient descent (steepest descent) is an iterative algorithm for optimization.

  • The most obvious direction to choose when attempting to minimize a function $f$ starting at $x_n$ is the direction of steepest descent, $-\nabla f(x_n)$.

4 / 23

Gradient descent: 1-d function

5 / 23

Gradient descent: 2-d function

6 / 23

Gradient descent: 2-d function

7 / 23

Gradient descent algorithm

Suppose we want to minimize $f(x)$, where $f(\cdot)$ is a function with continuous partial derivatives everywhere. The gradient descent algorithm performs the following iteration:

$$x_{n+1} = x_n - \alpha \nabla f(x_n), \quad \text{where } \alpha > 0.$$

  1. Please think about the role of $\alpha$ (a quick 1-d experiment is sketched after this list).

  2. How does the gradient descent algorithm compare with Newton's method?
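As a hint for question 1, here is a quick 1-d experiment; the objective $f(x) = x^2$, the starting point, and the step sizes are illustrative assumptions. It shows how $\alpha$ controls the step length and hence the speed and stability of convergence.

# 1-d gradient descent on f(x) = x^2, whose gradient is 2x
grad_descent_1d <- function(x0, alpha, n_iter = 15) {
  x <- x0
  for (i in 1:n_iter) {
    x <- x - alpha * 2 * x  # move against the gradient
  }
  x
}
# small alpha: slow progress; moderate alpha: fast convergence; alpha too large: divergence
sapply(c(0.05, 0.5, 1.1), function(a) grad_descent_1d(x0 = 5, alpha = a))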

8 / 23

Gradient descent algorithm and Newton method

  • A comparison of gradient descent (green) and Newton's method (red).

  • Newton's method uses curvature information (i.e. the second derivative) to take a more direct route. A 1-d sketch contrasting the two updates follows below.
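To make the comparison concrete, here is a minimal 1-d sketch contrasting the two updates; the objective $f(x) = x^4$, the starting point, and the step size are illustrative assumptions.

# Gradient descent vs Newton's method on f(x) = x^4
fp  <- function(x) 4 * x^3     # first derivative
fpp <- function(x) 12 * x^2    # second derivative
x_gd <- x_nt <- 2
for (i in 1:10) {
  x_gd <- x_gd - 0.05 * fp(x_gd)       # gradient descent: fixed step size
  x_nt <- x_nt - fp(x_nt) / fpp(x_nt)  # Newton: step scaled by the curvature
}
c(gradient_descent = x_gd, newton = x_nt)  # Newton ends up much closer to the minimum at 0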

9 / 23

Gradient descent algorithm

  • For an initial value $x_0$, we have $f(x_0) \ge f(x_1) \ge f(x_2) \ge \cdots$.

  • Upon convergence, the sequence $\{x_n\}$ converges to a local minimum of $f(x)$. We can use $|f(x_{n+1}) - f(x_n)| \le \epsilon$ as the stopping rule (a minimal sketch follows).
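A minimal sketch of this stopping rule on a 1-d example; the objective $f(x) = x^2$, the starting point, step size, and $\epsilon$ are illustrative assumptions.

# Iterate until the decrease in f is smaller than epsilon
f1d <- function(x) x^2
g1d <- function(x) 2 * x  # gradient of f1d
x <- 5; alpha <- 0.2; eps <- 1e-8
repeat {
  x_new <- x - alpha * g1d(x)
  if (abs(f1d(x_new) - f1d(x)) <= eps) break  # stopping rule |f(x_{n+1}) - f(x_n)| <= eps
  x <- x_new
}
x_new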

10 / 23

Gradient descent algorithm in R

# Objective function f(x, y) = x^2 + y^2 and its partial derivatives
f <- function(x, y) {
  x^2 + y^2
}
dx <- function(x, y) {
  2 * x
}
dy <- function(x, y) {
  2 * y
}
# Settings: number of iterations, learning rate (alpha) and starting point
num_iter <- 10
learning_rate <- 0.2
x_val <- 6
y_val <- 6
# Store the trajectory of (x, y, f(x, y)) across iterations
updates_x <- c(x_val, rep(NA, num_iter))
updates_y <- c(y_val, rep(NA, num_iter))
updates_z <- c(f(x_val, y_val), rep(NA, num_iter))
# Iterate: move against the gradient, scaled by the learning rate
for (i in 1:num_iter) {
  dx_val <- dx(x_val, y_val)
  dy_val <- dy(x_val, y_val)
  x_val <- x_val - learning_rate * dx_val
  y_val <- y_val - learning_rate * dy_val
  z_val <- f(x_val, y_val)
  updates_x[i + 1] <- x_val
  updates_y[i + 1] <- y_val
  updates_z[i + 1] <- z_val
}
11 / 23

Gradient descent algorithm in R

## plotting
m <- data.frame(x = updates_x, y = updates_y)
x <- seq(-6, 6, length = 100)
y <- x
g <- expand.grid(x, y)
z <- f(g[, 1], g[, 2])
f_long <- data.frame(x = g[, 1], y = g[, 2], z = z)
library(ggplot2)
ggplot(f_long, aes(x, y, z = z)) +
  geom_contour_filled(aes(fill = stat(level)), bins = 50) +
  guides(fill = FALSE) +
  geom_path(data = m, aes(x, y, z = 0), col = 2, arrow = arrow()) +
  geom_point(data = m, aes(x, y, z = 0), size = 3, col = 2) +
  xlab(expression(x[1])) +
  ylab(expression(x[2])) +
  ggtitle(parse(text = paste0("~ f(x[1],x[2]) == ~ x[1]^2 + x[2]^2")))

12 / 23

High correlation between variables

13 / 23

Gradient descent for linear regression

  • Consider the simple linear regression model

$$y = \beta_0 + \beta_1 x + \epsilon, \quad \text{where } \epsilon \sim N(0, 1).$$

  • Write out the cost function (one common form is sketched below).

  • Calculate the gradient of the cost function.

  • How about multiple linear regression?
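For reference, one common formulation is sketched below; the $1/(2n)$ scaling is an assumption chosen to match the lm.cost() function on the exercise slide.

$$J(\beta_0, \beta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2,$$

$$\frac{\partial J}{\partial \beta_0} = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right), \qquad \frac{\partial J}{\partial \beta_1} = -\frac{1}{n} \sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right).$$

For multiple linear regression, the same expressions in matrix form are $J(\beta) = \frac{1}{2n} \lVert X\beta - y \rVert^2$ and $\nabla J(\beta) = \frac{1}{n} X^\top (X\beta - y)$.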

14 / 23

Gradient descent for linear regression in R

# Gradient descent for linear regression.
# X: design matrix, y: response, beta.init: starting values, alpha: learning rate,
# tol: convergence tolerance on the change in cost, max.iter: iteration cap.
graddesc.lm <- function(X, y, beta.init, alpha = 0.1, tol = 1e-09, max.iter = 100) {
  beta.old <- beta.init
  J <- betas <- list()
  betas[[1]] <- beta.old
  J[[1]] <- lm.cost(X, y, beta.old)
  # First update before entering the loop
  beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- lm.cost(X, y, beta.new)
  iter <- 1
  # Keep updating until the change in cost falls below tol or max.iter is reached
  while ((abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) & (iter < max.iter)) {
    beta.old <- beta.new
    beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
    iter <- iter + 1
    betas[[iter + 1]] <- beta.new  # iter + 1, so no gap is left in the stored path
    J[[iter + 1]] <- lm.cost(X, y, beta.new)
  }
  if (abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) {
    cat("Did not converge \n")
  } else {
    cat("Converged \n")
    cat("Iterated", iter, "times", "\n")
    cat("Coef: ", beta.new, "\n")
    return(list(coef = betas, cost = J))
  }
}
15 / 23

Data generation

## Generate some data
beta0 <- 1
beta1 <- 3
sigma <- 1
n <- 10000
x <- rnorm(n, 3, 1)
y <- beta0 + x * beta1 + rnorm(n, mean = 0, sd = sigma)
X <- cbind(1, x)
16 / 23

Exercise

  1. Please write out the gradient function in the following code.

  2. Run the code and compare with lm().

  3. Change the value of $\alpha$ and see what happens.

  4. Think about how to select α.

## The cost function (squared-error loss with a 1/(2m) scaling)
lm.cost <- function(X, y, beta) {
  m <- length(y)
  loss <- sum((X %*% beta - y)^2) / (2 * m)
  return(loss)
}
## Calculate the gradient of the cost function (to be completed)
lm.cost.grad <- function(X, y, beta) {
  ???
}
## Replace alpha = NULL with a learning rate of your choice (see questions 3 and 4);
## a sketch of the comparison with lm() follows below.
gd.est <- graddesc.lm(X, y, beta.init = c(0, 0), alpha = NULL, max.iter = 1000)
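Once lm.cost.grad() is completed, one way to check the estimates against lm() might look like the sketch below; alpha = 0.1 and max.iter = 5000 are illustrative choices, and graddesc.lm() returns its coefficient path only when it converges.

## Sketch: compare gradient descent with the ordinary least squares fit from lm()
gd.est <- graddesc.lm(X, y, beta.init = c(0, 0), alpha = 0.1, max.iter = 5000)
if (!is.null(gd.est)) {
  gd.coef <- gd.est$coef[[length(gd.est$coef)]]  # final beta iterate
  print(rbind(gradient.descent = gd.coef, lm = coef(lm(y ~ x))))
}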
17 / 23

Stochastic gradient descent (SGD)

18 / 23

Stochastic gradient descent (SGD)

  • Gradient descent can be slow to run on very large datasets.

  • One iteration of the gradient descent algorithm uses the entire training dataset.

  • Because each update treats the whole training set as a single batch, this form of gradient descent is referred to as batch gradient descent.

  • SGD updates parameters for each training sample.

19 / 23

SGD for linear regression

  1. Shuffle all the training data randomly.
  2. Repeat the following:
    • For each training sample $i = 1, \ldots, m$: $\beta_{n+1} = \beta_n - \alpha \nabla J_i(\beta_n)$, where $J_i$ is the cost evaluated on sample $i$ alone (a minimal sketch follows below).
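A minimal R sketch of this update for simple linear regression, reusing the simulated X and y from the data-generation slide; the function name sgd.lm(), the number of epochs, and alpha = 0.01 are illustrative assumptions.

## Stochastic gradient descent for linear regression: one sample per update
sgd.lm <- function(X, y, beta.init, alpha = 0.01, n_epochs = 5) {
  beta <- beta.init
  m <- length(y)
  for (epoch in 1:n_epochs) {
    idx <- sample(m)                         # shuffle the training data
    for (i in idx) {
      resid <- sum(X[i, ] * beta) - y[i]     # prediction error for sample i
      beta <- beta - alpha * resid * X[i, ]  # gradient of the single-sample cost
    }
  }
  beta
}
sgd.lm(X, y, beta.init = c(0, 0))  # should end up roughly near the true (1, 3)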
20 / 23

Lab 8

  1. Improve the R function graddesc.lm(): try to optimize $\alpha$ at each step instead of setting it to a constant.

  2. Write an R function for stochastic gradient descent for linear regression, and test your function on the Bodyfat data.

  3. Compare gradient descent, stochastic gradient descent, and Newton's method.

21 / 23


Summary

  • GD and SGD.

  • When to use SGD?

  • Heaps of ML and DL methods use SGD for training.

  • Simple answer: Use stochastic gradient descent when training time is the bottleneck.

22 / 23

References and further readings


Further readings

  • Stochastic Gradient Descent Tricks, by Léon Bottou, Microsoft Research, Redmond, WA.
23 / 23
