


Bayesian Statistics and Computing

Lecture 10: Stochastic Gradient Descent

Yanfei Kang | BSC | Beihang University

1 / 38

Stochastic gradient descent

  • Popular for training a wide range of models in machine learning.

  • Optimization is a big part of machine learning.

  • The goal of all supervised machine learning algorithms is to best estimate a target function that maps input data $X$ onto output variables $y$.

  • They require a process of optimization to find the set of coefficients that results in the best estimate of the target function.

  • The quality of a candidate solution is measured by a cost function.

2 / 38

Gradient descent

3 / 38

Gradient descent

  • Gradient descent (steepest descent) is an iterative algorithm for optimization.

  • The most obvious direction to choose when attempting to minimize a function $f$ starting at $x_n$ is the direction of steepest descent, $-\nabla f(x_n)$.

4 / 38

Gradient descent: 2-d function

5 / 38

Gradient descent: 2-d function

6 / 38

Gradient descent algorithm

Suppose we want to minimize $f(x)$, where $f(\cdot)$ is a function with continuous partial derivatives everywhere. The gradient descent algorithm performs the following iteration:

$$x_{n+1} = x_n - \alpha \nabla f(x_n), \text{ where } \alpha > 0.$$

  1. Please think about the role of $\alpha$.

  2. How does gradient descent compare with Newton's method?

7 / 38

Gradient descent algorithm and Newton method

  • A comparison of gradient descent (green) and Newton's method (red).

  • Newton's method uses curvature information (i.e. the second derivative) to take a more direct route; a minimal one-dimensional sketch follows.
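A minimal one-dimensional sketch of the two update rules (an illustration, not from the slides): gradient descent steps along $-f'(x)$ with a fixed step size, while Newton's method rescales the step by $1/f''(x)$. Here $f(x) = x^4$, minimized at 0.

f.prime  <- function(x) 4 * x^3     # f'(x) for f(x) = x^4
f.second <- function(x) 12 * x^2    # f''(x)
x.gd <- x.newton <- 2
alpha <- 0.01
for (i in 1:20) {
  x.gd     <- x.gd - alpha * f.prime(x.gd)                       # gradient descent
  x.newton <- x.newton - f.prime(x.newton) / f.second(x.newton)  # Newton
}
c(gd = x.gd, newton = x.newton)  # Newton ends up much closer to 0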

8 / 38

Gradient descent algorithm

  • For an initial value $x_0$, we have $f(x_0) \ge f(x_1) \ge f(x_2) \ge \cdots$.

  • Upon convergence, $\{x_n\}$ converges to a local minimum of $f(x)$. We can use $|f(x_{n+1}) - f(x_n)| \le \epsilon$ as the stopping rule.

9 / 38

Gradient descent algorithm in R

f <- function(x, y) x^2 + y^2
dx <- function(x, y) 2 * x
dy <- function(x, y) 2 * y
num_iter <- 10
learning_rate <- 0.2
x_val <- 6
y_val <- 6
updates_x <- c(x_val, rep(NA, num_iter))
updates_y <- c(y_val, rep(NA, num_iter))
updates_z <- c(f(x_val, y_val), rep(NA, num_iter))
## iterate
for (i in 1:num_iter) {
  dx_val <- dx(x_val, y_val)
  dy_val <- dy(x_val, y_val)
  x_val <- x_val - learning_rate * dx_val
  y_val <- y_val - learning_rate * dy_val
  z_val <- f(x_val, y_val)
  updates_x[i + 1] <- x_val
  updates_y[i + 1] <- y_val
  updates_z[i + 1] <- z_val
}
10 / 38

Gradient descent algorithm in R

## plotting
m <- data.frame(x = updates_x, y = updates_y)
x <- seq(-6, 6, length = 100)
y <- x
g <- expand.grid(x, y)
z <- f(g[, 1], g[, 2])
f_long <- data.frame(x = g[, 1], y = g[, 2], z = z)
library(ggplot2)
ggplot(f_long, aes(x, y, z = z)) +
  geom_contour_filled(aes(fill = after_stat(level)), bins = 50) +
  guides(fill = "none") +
  geom_path(data = m, aes(x, y, z = 0), col = 2, arrow = arrow()) +
  geom_point(data = m, aes(x, y, z = 0), size = 3, col = 2) +
  xlab(expression(x[1])) +
  ylab(expression(x[2])) +
  ggtitle(parse(text = paste0('~ f(x[1],x[2]) == ~ x[1]^2 + x[2]^2')))

11 / 38

Questions

  1. What happens if we choose a larger $\alpha$? And a smaller one?

  2. How can we choose a better $\alpha$?

  3. What happens when the variables are highly correlated? (A quick illustration follows.)
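A sketch of question 3 (not from the slides): with nearly collinear predictors the cost surface becomes a long, narrow valley, so a stable $\alpha$ must be small and GD zig-zags slowly along the valley floor.

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)  # nearly collinear with x1
Xc <- cbind(x1, x2)
kappa(crossprod(Xc))  # very large condition number of X'X: ill-conditioned for GD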

12 / 38

High correlation between variables

13 / 38

Gradient descent for linear regression

14 / 38

Gradient descent for linear regression

Consider linear regression

$$y = X\beta + \epsilon, \text{ where } \epsilon \sim N(0, I).$$

  • Write out the cost function: $J(\beta) = \frac{1}{2n}(X\beta - y)^T(X\beta - y)$.

  • Calculate the gradient: $\nabla J(\beta) = \frac{1}{n}X^T(X\beta - y)$.

  • Iterate: $\beta := \beta - \alpha \nabla J(\beta)$. (A numerical sanity check follows.)
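As a sanity check (not on the slides), the analytic gradient can be compared with a numerical one, here via the numDeriv package (assumed installed); all object names below are illustrative.

library(numDeriv)
set.seed(42)
Xs <- cbind(1, rnorm(5))
ys <- rnorm(5)
b  <- c(0.5, -1)
ns <- length(ys)
analytic  <- as.vector(t(Xs) %*% (Xs %*% b - ys) / ns)
numerical <- grad(function(bb) sum((Xs %*% bb - ys)^2) / (2 * ns), b)
all.equal(analytic, numerical)  # TRUE up to numerical tolerance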

15 / 38

Gradient descent for linear regression in R

graddesc.lm <- function(X, y, beta.init, alpha = 0.1, tol = 1e-9, max.iter = 100){
  beta.old <- beta.init
  J <- betas <- list()
  betas[[1]] <- beta.old
  J[[1]] <- lm.cost(X, y, beta.old)
  beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- lm.cost(X, y, beta.new)
  iter <- 0
  while ((abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) &
         (iter < max.iter)){
    beta.old <- beta.new
    beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
    iter <- iter + 1
    betas[[iter + 2]] <- beta.new
    J[[iter + 2]] <- lm.cost(X, y, beta.new)
  }
  if (abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol){
    cat('Did not converge \n')
  } else{
    cat('Converged \n')
    cat('Iterated', iter + 1, 'times', '\n')
    cat('Coef: ', beta.new, '\n')
    return(list(coef = betas, cost = J))
  }
}
16 / 38

Data generation

## Generate some data
beta0 <- 1
beta1 <- 3
sigma <- 1
n <- 10000
x <- rnorm(n, 0, 1)
y <- beta0 + x * beta1 + rnorm(n, mean = 0, sd = sigma)
X <- cbind(1, x)
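For reference (exercise 2 on the next slide compares against it), the closed-form least-squares fit is available via lm(); this check is not part of the original code.

coef(lm(y ~ x))  # should be close to (beta0, beta1) = (1, 3)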
17 / 38

Exercise

  1. Please write out the gradient function in the following code.

  2. Run the code and compare with lm().

  3. Change the value of $\alpha$ and see what happens.

  4. Think about how to select $\alpha$.

## Make the cost function
lm.cost <- function(X, y, beta){
  n <- length(y)
  loss <- sum((X %*% beta - y)^2) / (2 * n)
  return(loss)
}
## Calculate the gradient (exercise 1: fill this in)
lm.cost.grad <- function(X, y, beta){
  n <- length(y)
  return()
}
gd.est <- graddesc.lm(X, y, beta.init = c(-4, -5), alpha = 0.1, max.iter = 1000)
18 / 38

Convergence speed

19 / 38

Improving GD by optimizing $\alpha$ in each step

gd.lm <- function(X, y, beta.init, alpha, tol = 1e-5,
                  max.iter = 100){
  beta.old <- beta.init
  J <- betas <- list()
  ## Remember whether alpha should be re-optimized at every step;
  ## comparing alpha itself to 'auto' would stop working once it has
  ## been replaced by a number.
  auto.alpha <- identical(alpha, 'auto')
  ## One-dimensional line search for the step size, given the current beta.
  pick.alpha <- function(beta.cur){
    opt <- optim(0.1, function(a) {
      lm.cost(X, y, beta.cur - a * lm.cost.grad(X, y, beta.cur))
    }, method = 'L-BFGS-B', lower = 0, upper = 1)
    if (opt$convergence == 0) opt$par else 0.1
  }
  if (auto.alpha) alpha <- pick.alpha(beta.old)
  betas[[1]] <- beta.old
  J[[1]] <- lm.cost(X, y, beta.old)
  beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- lm.cost(X, y, beta.new)
  iter <- 0
  while ((abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) &
         (iter < max.iter)) {
    beta.old <- beta.new
    if (auto.alpha) alpha <- pick.alpha(beta.old)
    beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
    iter <- iter + 1
    betas[[iter + 2]] <- beta.new
    J[[iter + 2]] <- lm.cost(X, y, beta.new)
  }
  if (abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol){
    cat('Did not converge. \n')
  } else{
    cat('Converged..\n')
    cat('Iterated ', iter + 1, ' times.', '\n')
    cat('Coefficients: ', beta.new, '\n')
    return(list(coef = betas,
                cost = J,
                niter = iter + 1))
  }
}
20 / 38

Improving GD by optimizing $\alpha$ in each step

## Generate some data
beta0 <- 1
beta1 <- 3
sigma <- 1
n <- 10000
x <- rnorm(n, 0, 1)
y <- beta0 + x * beta1 + rnorm(n, mean = 0, sd = sigma)
X <- cbind(1, x)

## Make the cost function
lm.cost <- function(X, y, beta){
  n <- length(y)
  loss <- sum((X %*% beta - y)^2) / (2 * n)
  return(loss)
}

## Calculate the gradient
lm.cost.grad <- function(X, y, beta){
  n <- length(y)
  (1 / n) * (t(X) %*% (X %*% beta - y))
}

## Use optimized alpha
gd.auto <- gd.lm(X, y, beta.init = c(-4, -5), alpha = 'auto',
                 tol = 1e-5, max.iter = 10000)
#> Converged..
#> Iterated 3 times.
#> Coefficients: 1.003497 3.010837
gd1 <- gd.lm(X, y, beta.init = c(-4, -5), alpha = 0.1,
             tol = 1e-5, max.iter = 10000)
#> Converged..
#> Iterated 66 times.
#> Coefficients: 0.9995082 3.002953
gd2 <- gd.lm(X, y, beta.init = c(-4, -5), alpha = 0.01,
             tol = 1e-5, max.iter = 10000)
#> Converged..
#> Iterated 567 times.
#> Coefficients: 0.9889579 2.983279
21 / 38

Stochastic gradient descent (SGD)

22 / 38

Stochastic gradient descent (SGD)

  • Gradient descent can be slow to run on very large datasets.

  • One iteration of the gradient descent algorithm processes the entire dataset.

  • Because each iteration uses the whole dataset (one batch), this form is referred to as batch gradient descent.

  • SGD instead updates the parameters after each training sample.

23 / 38

SGD for linear regression

Replace $\nabla J(\beta) = \frac{1}{n}X^T(X\beta - y)$ with its single-sample version:

$$\nabla J(\beta)_i = X_i^T(X_i\beta - y_i).$$

24 / 38

SGD for linear regression

  1. Shuffle all the training data randomly.
  2. Repeat the following.
    • For each training sample $i = 1, \ldots, m$: $\beta_{n+1} = \beta_n - \alpha \nabla J(\beta_n)_i$. (A schematic epoch is sketched below.)
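A schematic epoch in R (a sketch; note that the sgd.lm() implementation on the next slide instead draws random samples with replacement at each step):

## One epoch of SGD: shuffle the rows, then update beta once per sample.
one.epoch <- function(X, y, beta, alpha) {
  for (i in sample(nrow(X))) {                             # step 1: shuffle
    xi <- matrix(X[i, ], nrow = 1)
    beta <- beta - alpha * t(xi) %*% (xi %*% beta - y[i])  # step 2: update
  }
  beta
}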
25 / 38

SGD in R

sgd.lm <- function(X, y, beta.init,
                   alpha = 0.5, n.samples = 1,
                   tol = 1e-5, max.iter = 100){
  n <- length(y)
  beta.old <- beta.init
  J <- betas <- list()
  sto.sample <- sample(1:n, n.samples, replace = TRUE)
  betas[[1]] <- beta.old
  J[[1]] <- lm.cost(X, y, beta.old)
  beta.new <- beta.old - alpha *
    sgd.lm.cost.grad(X[sto.sample, ], y[sto.sample], beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- lm.cost(X, y, beta.new)
  iter <- 0
  while ((abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) &
         (iter + 2 < max.iter)){
    beta.old <- beta.new
    ## draw a fresh random sample for each update
    sto.sample <- sample(1:n, n.samples, replace = TRUE)
    beta.new <- beta.old - alpha *
      sgd.lm.cost.grad(X[sto.sample, ], y[sto.sample], beta.old)
    iter <- iter + 1
    betas[[iter + 2]] <- beta.new
    J[[iter + 2]] <- lm.cost(X, y, beta.new)
  }
  if (abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol){
    cat('Did not converge. \n')
  } else{
    cat('Converged. \n')
    cat('Iterated ', iter + 1, ' times.', '\n')
    cat('Coefficients: ', beta.new, '\n')
    return(list(coef = betas,
                cost = J,
                niter = iter + 1))
  }
}

## Make the cost function
sgd.lm.cost <- function(X, y, beta){
  n <- length(y)
  if (!is.matrix(X)){
    X <- matrix(X, nrow = 1)  # a single sampled row arrives as a vector
  }
  loss <- sum((X %*% beta - y)^2) / (2 * n)
  return(loss)
}

## Calculate the gradient
sgd.lm.cost.grad <- function(X, y, beta){
  n <- length(y)
  if (!is.matrix(X)){
    X <- matrix(X, nrow = 1)
  }
  t(X) %*% (X %*% beta - y) / n
}
26 / 38

SGD in R

Example run of sgd.lm() on the simulated regression data:

#> Converged.
#> Iterated 113 times.
#> Coefficients: 1.004577 2.905418

27 / 38

A comparison: GD, SGD and Newton methods

28 / 38

Your turn

Test the sgd.lm() function on the Bodyfat data. A hedged template is sketched below.
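In the template below, bodyfat, response, and predictors are placeholders, not actual column names; adapt them to the data you load. Standardizing the predictors is advisable, since GD/SGD are sensitive to the scaling of X.

# X <- cbind(1, scale(as.matrix(bodyfat[, predictors])))
# sgd.est <- sgd.lm(X, bodyfat$response, beta.init = rep(0, ncol(X)),
#                   alpha = 0.1, max.iter = 1000)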

29 / 38

SGD for logistic regression

Suppose the response variable $Y_i$ takes values 0 or 1 with probability $P(Y_i = 1) = p_i$ (i.e., a Bernoulli distribution).

We relate $p_i$ to the predictors:

$$\eta_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_q x_{i,q}, \qquad \eta_i = g(p_i)$$

  • The link function $g$ should be monotone.
  • $0 \le g^{-1}(\eta) \le 1$ for any $\eta$.
30 / 38

Logit link function

  • $\eta = g(p) = \log\left(\frac{p}{1-p}\right)$ is the logit function. It maps $(0,1) \to \mathbb{R}$.

  • The inverse logit is $g^{-1}(\eta) = \frac{e^\eta}{1+e^\eta}$, which maps $\mathbb{R} \to (0,1)$.

  • $p_i = \frac{e^{\eta_i}}{1+e^{\eta_i}} = P(Y_i = 1)$.

31 / 38

Log-likelihood

$$\eta_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_q x_{i,q}$$

$$\begin{aligned}
\log L(\beta) &= \sum_{i=1}^n \left[ \log(p_i)\,1_{Y_i=1} + \log(1-p_i)\,1_{Y_i=0} \right] \\
&= \sum_{i=1}^n y_i\left[\eta_i - \log(1+e^{\eta_i})\right] + (1-y_i)\log\left(1 - \frac{e^{\eta_i}}{1+e^{\eta_i}}\right) \\
&= \sum_{i=1}^n y_i\left[\eta_i - \log(1+e^{\eta_i})\right] + (1-y_i)\log\left(\frac{1}{1+e^{\eta_i}}\right) \\
&= \sum_{i=1}^n y_i\left[\eta_i - \log(1+e^{\eta_i})\right] - (1-y_i)\log(1+e^{\eta_i}) \\
&= \sum_{i=1}^n \left[ y_i\eta_i - \log(1+e^{\eta_i}) \right]
\end{aligned}$$

  • Use MLE to obtain $\hat{\beta}$; the per-sample gradient needed for SGD is derived below.
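The per-sample SGD update follows directly (a sketch, not from the slides): differentiating $y_i\eta_i - \log(1+e^{\eta_i})$ with respect to $\beta$ gives $(y_i - p_i)X_i$, so each step ascends the log-likelihood.

## One SGD step for logistic regression: x_i is a predictor vector,
## y_i its 0/1 response (illustrative names).
sgd.logit.step <- function(x_i, y_i, beta, alpha) {
  p <- 1 / (1 + exp(-sum(x_i * beta)))  # inverse logit
  beta + alpha * (y_i - p) * x_i        # gradient ascent on log L
}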
32 / 38

SGD for Poisson regression

Let $Y$ = the number of events in a given time interval. If events are independent, and the probability of an event is proportional to the length of the interval, then $Y$ is Poisson distributed:

$$P(Y = y) = \frac{e^{-\mu}\mu^y}{y!}$$

  • $E(Y) = V(Y) = \mu$.
  • If $Y \sim B(n, p)$, then $Y \approx \text{Poisson}(np)$ for large $n$ and small $p$.
  • If $Y \sim \text{Poisson}(\mu)$, then $Y \approx N(\mu, \mu)$ for large $\mu$.
  • $\text{Poisson}(\mu_1) + \text{Poisson}(\mu_2) \sim \text{Poisson}(\mu_1 + \mu_2)$ for independent summands.
33 / 38

Regression with count data

Suppose the response $Y$ is a count (0, 1, 2, …).

  • If the count is bounded and the bound is small, use binomial regression.

  • If the minimum count is large, use a normal approximation.

  • Otherwise, use Poisson or negative binomial regression.

34 / 38

Poisson regression

$$y_i \sim \text{Poisson}(\mu_i), \qquad \log(\mu_i) = \eta_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_q x_{i,q}$$

  • The log link function forces a positive mean.
  • Log-likelihood: $\log L = \sum_{i=1}^n \left[ y_i X_i^T\beta - \exp(X_i^T\beta) - \log(y_i!) \right]$ (its per-sample gradient is sketched below).
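Analogously (a sketch, not from the slides), differentiating one term of the Poisson log-likelihood gives the per-sample gradient $(y_i - \exp(X_i^T\beta))X_i$.

## One SGD step for Poisson regression (illustrative names).
sgd.pois.step <- function(x_i, y_i, beta, alpha) {
  mu <- exp(sum(x_i * beta))       # fitted mean for sample i
  beta + alpha * (y_i - mu) * x_i  # gradient ascent on log L
}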
35 / 38

Lab 8

Apply GD/SGD to the estimation of any statistical/ML algorithm.

36 / 38


Summary

  • GD and SGD.

  • When to use SGD?

  • Heaps of ML and DL methods use SGD for training.

  • Simple answer: use stochastic gradient descent when training time is the bottleneck.

  • You might refer to the sgd package in R.

37 / 38

References and further readings

Further readings

  • Stochastic Gradient Descent Tricks, by Léon Bottou (Microsoft Research, Redmond, WA).
38 / 38
