layout: true

---
class: inverse, center, middle
background-image: url(../figs/titlepage16-9.png)
background-size: cover

<br>
<br>
# Bayesian Statistics and Computing
## Lecture 10: Stochastic Gradient Descent
<img src="../figs/slides.png" width="150px"/>

#### *Yanfei Kang | BSC | Beihang University*

---
# Stochastic gradient descent

- Popular for training a wide range of models in machine learning.
- Optimization is a big part of machine learning.
- The goal of all supervised machine learning algorithms is to best estimate a target function that maps input data `\(X\)` onto output variables `\(y\)`.
- This requires an optimization process to find the set of coefficients that gives the best estimate of the target function.
- The quality of a candidate set of coefficients is measured by a **cost function**.

---
class: inverse, center, middle

# Gradient descent

---
# Gradient descent

- Gradient descent (steepest descent) is an iterative optimization algorithm.
- When minimizing a function `\(f\)` starting from `\(x_n\)`, the most natural direction to move in is the direction of steepest descent, `\(-f^{\prime}(x_n)\)`.

---
# Gradient descent: 2-d function
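As a running example, take the bowl-shaped function `\(f(x_1, x_2) = x_1^2 + x_2^2\)`, whose contours appear on the next slide. Below is a minimal R sketch of the function and its gradient (the names `f` and `grad_f` are for illustration only):

``` r
## The example 2-d function and its analytic gradient
f <- function(x, y) x^2 + y^2
grad_f <- function(x, y) c(2 * x, 2 * y)

## Steepest descent direction at the starting point (6, 6)
-grad_f(6, 6)
#> [1] -12 -12
```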
---
# Gradient descent: 2-d function

<img src="figure/unnamed-chunk-2-1.svg" width="504" style="display: block; margin: auto;" />

---
# Gradient descent algorithm

Suppose we want to minimize `\(f(\mathbf{x})\)`, where `\(f(\cdot)\)` is a function with continuous partial derivatives everywhere. The gradient descent algorithm performs the following iteration:

`$$\mathbf{x}_{n+1}=\mathbf{x}_{n}-\alpha \nabla f\left(\mathbf{x}_{n}\right),~\mathrm{where}~ \alpha > 0.$$`

1. Please think about the role of `\(\alpha\)` (see the sketch two slides ahead).

2. How does gradient descent relate to Newton's method?

---
# Gradient descent algorithm and Newton's method

.pull-left[
<img src="comp.png" width="400px"/>
]

.pull-right[
- A comparison of gradient descent (green) and Newton's method (red).

- Newton's method uses curvature information (i.e., the second derivative) to take a more direct route.
]

---
# Gradient descent algorithm

- Starting from an initial value `\(\mathbf{x}_0\)`, the iteration produces `\(f(\mathbf{x}_0) \geq f(\mathbf{x}_1) \geq f(\mathbf{x}_2) \geq \cdots\)`.

- Upon convergence, `\(\{\mathbf{x}_n\}\)` converges to a local minimum of `\(f(\mathbf{x})\)`. We can use `\(|f(\mathbf{x}_{n+1})- f(\mathbf{x}_n)| \leq \epsilon\)` as the stopping rule.

---
# Gradient descent algorithm in R

.scroll-output[

``` r
f  <- function(x, y) {x^2 + y^2}
dx <- function(x, y) {2 * x}  # partial derivative w.r.t. x
dy <- function(x, y) {2 * y}  # partial derivative w.r.t. y

num_iter <- 10
learning_rate <- 0.2
x_val <- 6
y_val <- 6

updates_x <- c(x_val, rep(NA, num_iter))
updates_y <- c(y_val, rep(NA, num_iter))
updates_z <- c(f(x_val, y_val), rep(NA, num_iter))

## iterate
for (i in 1:num_iter) {
  dx_val <- dx(x_val, y_val)
  dy_val <- dy(x_val, y_val)
  x_val <- x_val - learning_rate * dx_val
  y_val <- y_val - learning_rate * dy_val
  z_val <- f(x_val, y_val)
  updates_x[i + 1] <- x_val
  updates_y[i + 1] <- y_val
  updates_z[i + 1] <- z_val
}
```
]

---
# Gradient descent algorithm in R

.pull-left[

``` r
## plotting
m <- data.frame(x = updates_x, y = updates_y)
x <- seq(-6, 6, length = 100)
y <- x
g <- expand.grid(x, y)
z <- f(g[, 1], g[, 2])
f_long <- data.frame(x = g[, 1], y = g[, 2], z = z)
library(ggplot2)
ggplot(f_long, aes(x, y, z = z)) +
  geom_contour_filled(aes(fill = after_stat(level)), bins = 50) +
  guides(fill = "none") +
  geom_path(data = m, aes(x, y, z = 0), col = 2, arrow = arrow()) +
  geom_point(data = m, aes(x, y, z = 0), size = 3, col = 2) +
  xlab(expression(x[1])) +
  ylab(expression(x[2])) +
  ggtitle(parse(text = paste0('~ f(x[1],x[2]) == ~ x[1]^2 + x[2]^2')))
```
]

.pull-right[
<img src="figure/unnamed-chunk-4-1.svg" width="504" style="display: block; margin: auto;" />
]

---
# Questions

1. What happens with a larger `\(\alpha\)`? With a smaller one?

2. How can we choose a better `\(\alpha\)`?

3. What happens when the variables are highly correlated?
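---
# Trying different learning rates

Before answering question 1, it helps to experiment. Here is a minimal sketch (the helper `run_gd` and the chosen `\(\alpha\)` values are illustrative, not part of the original code) that reruns the update loop on `\(f(x_1,x_2)=x_1^2+x_2^2\)` for several learning rates:

``` r
f  <- function(x, y) x^2 + y^2
dx <- function(x, y) 2 * x
dy <- function(x, y) 2 * y

## objective value after num_iter gradient descent steps
run_gd <- function(alpha, num_iter = 10, x = 6, y = 6) {
  for (i in 1:num_iter) {
    gx <- dx(x, y)
    gy <- dy(x, y)
    x <- x - alpha * gx
    y <- y - alpha * gy
  }
  f(x, y)
}

## For this quadratic: a tiny alpha makes slow progress, alpha = 0.5
## jumps to the minimum in one step, and alpha > 1 diverges.
sapply(c(0.01, 0.1, 0.5, 1.1), run_gd)
```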
---
# High correlation between variables

<img src="figure/unnamed-chunk-5-1.svg" width="504" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Gradient descent for linear regression

---
# Gradient descent for linear regression

Consider the linear regression

`$$y = X\beta + \epsilon,~\mathrm{where}~\epsilon\sim N(0,I).$$`

- Write out the cost function:
`$$J(\beta) = \frac{1}{2n}(X\beta - y)^T(X\beta - y).$$`

- Calculate the gradient:
`$$\nabla J(\beta) = \frac{1}{n}X^T(X\beta-y).$$`

- Iterate:
`$$\beta := \beta - \alpha \nabla J(\beta).$$`

---
# Gradient descent for linear regression in R

.scroll-output[

``` r
graddesc.lm <- function(X, y, beta.init, alpha = 0.1, tol = 1e-9, max.iter = 100) {
  beta.old <- beta.init
  J <- betas <- list()
  betas[[1]] <- beta.old
  J[[1]] <- lm.cost(X, y, beta.old)
  beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- lm.cost(X, y, beta.new)
  iter <- 0
  while ((abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) &
         (iter < max.iter)) {
    beta.old <- beta.new
    beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
    iter <- iter + 1
    betas[[iter + 2]] <- beta.new
    J[[iter + 2]] <- lm.cost(X, y, beta.new)
  }
  if (abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) {
    cat('Did not converge \n')
  } else {
    cat('Converged \n')
    cat('Iterated', iter + 1, 'times', '\n')
    cat('Coef: ', beta.new)
    return(list(coef = betas, cost = J))
  }
}
```
]

---
# Data generation

``` r
## Generate some data
beta0 <- 1
beta1 <- 3
sigma <- 1
n <- 10000
x <- rnorm(n, 0, 1)
y <- beta0 + x * beta1 + rnorm(n, mean = 0, sd = sigma)
X <- cbind(1, x)
```

---
# Exercise

1. Please write out the gradient function in the following code.

2. Run the code and compare with `lm()`.

3. Change the value of `\(\alpha\)` and see what happens.

4. Think about how to select `\(\alpha\)`.

.scroll-output[

``` r
## Make the cost function
lm.cost <- function(X, y, beta) {
  n <- length(y)
  loss <- sum((X %*% beta - y)^2) / (2 * n)
  return(loss)
}

## Calculate the gradient (left for you to fill in)
lm.cost.grad <- function(X, y, beta) {
  n <- length(y)
  return()
}

gd.est <- graddesc.lm(X, y, beta.init = c(-4, -5), alpha = 0.1, max.iter = 1000)
```
]

---
# Convergence speed

<img src="figure/unnamed-chunk-9-1.svg" width="864" style="display: block; margin: auto;" />
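---
# Choosing `\(\alpha\)` by exact line search

For the quadratic cost of linear regression, the best step size along `\(-\nabla J(\beta)\)` has a closed form. Writing `\(g = \nabla J(\beta)\)` and minimizing `\(J(\beta - \alpha g)\)` over `\(\alpha\)` gives

`$$\alpha^* = \frac{g^T g}{\frac{1}{n} g^T X^T X g}.$$`

A minimal sketch (the function `exact.alpha` is illustrative and assumes `lm.cost.grad` from the exercise has been completed):

``` r
exact.alpha <- function(X, y, beta) {
  n <- length(y)
  g <- lm.cost.grad(X, y, beta)  # current gradient
  Xg <- X %*% g
  as.numeric(sum(g^2) / (sum(Xg^2) / n))
}
```

The next slides instead optimize `\(\alpha\)` numerically with `optim()`, which also works for non-quadratic costs.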
---
# Improving GD by optimizing `\(\alpha\)` in each step

.scroll-output[

``` r
gd.lm <- function(X, y, beta.init, alpha, tol = 1e-5, max.iter = 100) {
  beta.old <- beta.init
  J <- betas <- list()
  ## remember whether alpha should be re-optimized at every step
  auto <- identical(alpha, 'auto')
  if (auto) {
    alpha <- optim(0.1, function(alpha) {
      lm.cost(X, y, beta.old - alpha * lm.cost.grad(X, y, beta.old))
    }, method = 'L-BFGS-B', lower = 0, upper = 1)
    if (alpha$convergence == 0) {
      alpha <- alpha$par
    } else {
      alpha <- 0.1
    }
  }
  betas[[1]] <- beta.old
  J[[1]] <- lm.cost(X, y, beta.old)
  beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- lm.cost(X, y, beta.new)
  iter <- 0
  while ((abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) &
         (iter < max.iter)) {
    beta.old <- beta.new
    if (auto) {
      alpha <- optim(0.1, function(alpha) {
        lm.cost(X, y, beta.old - alpha * lm.cost.grad(X, y, beta.old))
      }, method = 'L-BFGS-B', lower = 0, upper = 1)
      if (alpha$convergence == 0) {
        alpha <- alpha$par
      } else {
        alpha <- 0.1
      }
    }
    beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
    iter <- iter + 1
    betas[[iter + 2]] <- beta.new
    J[[iter + 2]] <- lm.cost(X, y, beta.new)
  }
  if (abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) {
    cat('Did not converge. \n')
  } else {
    cat('Converged..\n')
    cat('Iterated ', iter + 1, ' times.', '\n')
    cat('Coefficients: ', beta.new, '\n')
    return(list(coef = betas, cost = J, niter = iter + 1))
  }
}
```
]

---
# Improving GD by optimizing `\(\alpha\)` in each step

.scroll-output[

``` r
## Generate some data
beta0 <- 1
beta1 <- 3
sigma <- 1
n <- 10000
x <- rnorm(n, 0, 1)
y <- beta0 + x * beta1 + rnorm(n, mean = 0, sd = sigma)
X <- cbind(1, x)

## Make the cost function
lm.cost <- function(X, y, beta) {
  n <- length(y)
  loss <- sum((X %*% beta - y)^2) / (2 * n)
  return(loss)
}

## Calculate the gradient
lm.cost.grad <- function(X, y, beta) {
  n <- length(y)
  (1 / n) * (t(X) %*% (X %*% beta - y))
}

## Use optimized alpha
gd.auto <- gd.lm(X, y, beta.init = c(-4, -5), alpha = 'auto',
                 tol = 1e-5, max.iter = 10000)
#> Converged..
#> Iterated 3 times.
#> Coefficients: 1.003497 3.010837

gd1 <- gd.lm(X, y, beta.init = c(-4, -5), alpha = 0.1,
             tol = 1e-5, max.iter = 10000)
#> Converged..
#> Iterated 66 times.
#> Coefficients: 0.9995082 3.002953

gd2 <- gd.lm(X, y, beta.init = c(-4, -5), alpha = 0.01,
             tol = 1e-5, max.iter = 10000)
#> Converged..
#> Iterated 567 times.
#> Coefficients: 0.9889579 2.983279
```
]

---
class: inverse, center, middle

# Stochastic gradient descent (SGD)

---
# Stochastic gradient descent (SGD)

- Gradient descent can be slow to run on very large datasets.

- One iteration of the gradient descent algorithm processes the entire training dataset.

- Because each iteration uses the whole dataset as one batch, this form of gradient descent is referred to as **batch gradient descent**.

- SGD instead updates the parameters using one training sample at a time.

---
# SGD for linear regression

Replace

`$$\nabla J(\beta) = \frac{1}{n}X^T(X\beta-y)$$`

with:

`$$\nabla J(\beta)_i = X_i^T(X_i\beta-y_i).$$`

---
# SGD for linear regression

1. Shuffle all the training data randomly.

2. Repeat the following.
  - For each training sample `\(i = 1, \cdots, n\)`, update
`$$\beta_{k+1}= \beta_k - \alpha \nabla J(\beta_k)_i.$$`

---
# SGD in R

.scroll-output[

``` r
sgd.lm <- function(X, y, beta.init, alpha = 0.5, n.samples = 1,
                   tol = 1e-5, max.iter = 100) {
  n <- length(y)
  beta.old <- beta.init
  J <- betas <- list()
  sto.sample <- sample(1:n, n.samples, replace = TRUE)
  betas[[1]] <- beta.old
  J[[1]] <- sgd.lm.cost(X, y, beta.old)
  beta.new <- beta.old - alpha *
    sgd.lm.cost.grad(X[sto.sample, ], y[sto.sample], beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- sgd.lm.cost(X, y, beta.new)
  iter <- 0
  while ((abs(sgd.lm.cost(X, y, beta.new) - sgd.lm.cost(X, y, beta.old)) > tol) &
         (iter + 2 < max.iter)) {
    beta.old <- beta.new
    ## draw a fresh random sample for each update
    sto.sample <- sample(1:n, n.samples, replace = TRUE)
    beta.new <- beta.old - alpha *
      sgd.lm.cost.grad(X[sto.sample, ], y[sto.sample], beta.old)
    iter <- iter + 1
    betas[[iter + 2]] <- beta.new
    J[[iter + 2]] <- sgd.lm.cost(X, y, beta.new)
  }
  if (abs(sgd.lm.cost(X, y, beta.new) - sgd.lm.cost(X, y, beta.old)) > tol) {
    cat('Did not converge. \n')
  } else {
    cat('Converged. \n')
    cat('Iterated ', iter + 1, ' times.', '\n')
    cat('Coefficients: ', beta.new, '\n')
    return(list(coef = betas, cost = J, niter = iter + 1))
  }
}

## Make the cost function (handles a single-row X as well)
sgd.lm.cost <- function(X, y, beta) {
  n <- length(y)
  if (!is.matrix(X)) {
    X <- matrix(X, nrow = 1)
  }
  loss <- sum((X %*% beta - y)^2) / (2 * n)
  return(loss)
}

## Calculate the gradient
sgd.lm.cost.grad <- function(X, y, beta) {
  n <- length(y)
  if (!is.matrix(X)) {
    X <- matrix(X, nrow = 1)
  }
  t(X) %*% (X %*% beta - y) / n
}
```
]
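---
# SGD in R

A hedged usage sketch: a call like the one below (the seed and the `max.iter` value are assumptions, not from the original slides) fits the simulated data from earlier and yields output like that shown on the next slide.

``` r
set.seed(2020)  # assumed seed; SGD results vary from run to run
sgd.est <- sgd.lm(X, y, beta.init = c(-4, -5), alpha = 0.5,
                  n.samples = 1, tol = 1e-5, max.iter = 1000)
```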
---
# SGD in R

```
#> Converged. 
#> Iterated 113 times. 
#> Coefficients: 1.004577 2.905418
```

<img src="figure/unnamed-chunk-13-1.svg" width="432" style="display: block; margin: auto;" />

---
# A comparison: GD, SGD and Newton's method

<img src="figure/optimcomp-1.svg" width="504" style="display: block; margin: auto;" />

---
# Your turn

Test the `sgd.lm()` function on the [Bodyfat](https://yanfei.site/docs/bsc/Bodyfat.csv) data.

---
# SGD for logistic regression

Suppose the response variable `\(Y_i\)` takes values 0 or 1 with probability `\(P(Y_i=1)=p_i\)`, i.e., `\(Y_i\)` follows a Bernoulli distribution.

We relate `\(p_i\)` to the predictors:

`\begin{align*}
\eta_i &= \beta_0 + \beta_1 x_{i,1} + \dots + \beta_q x_{i,q}\\
\eta_i &= g(p_i)
\end{align*}`

* The link function `\(g\)` should be monotone.

* `\(0 \leq g^{-1}(\eta) \leq 1\)` for any `\(\eta\)`.

---
# Logit link function

* `\(\eta = g(p) = \log(\frac{p}{1-p})\)` is the *logit* function. It maps `\((0,1) \rightarrow \mathbb{R}\)`.

* The inverse logit is `\(g^{-1}(\eta) = \frac{e^\eta}{1+e^\eta}\)`, which maps `\(\mathbb{R} \rightarrow (0,1)\)`.

* `\(p_i = \frac{e^{\eta_i}}{1+e^{\eta_i}} = P(Y_i=1)\)`.

---
# Log-likelihood

`$$\eta_i = \beta_0 + \beta_1 x_{i,1} + \dots + \beta_q x_{i,q}$$`

`\begin{align*}
\log L({\beta}) &= \sum_{i=1}^n \log(p_i)1_{Y_i=1} + \log(1-p_i)1_{Y_i=0} \\
&= \sum_{i=1}^n y_i[\eta_i - \log(1+e^{\eta_i})] + (1-y_i)\log\left(1-\frac{e^{\eta_i}}{1+e^{\eta_i}}\right) \\
&= \sum_{i=1}^n y_i[\eta_i - \log(1+e^{\eta_i})] + (1-y_i)\log\left(\frac{1}{1+e^{\eta_i}}\right) \\
&= \sum_{i=1}^n y_i[\eta_i - \log(1+e^{\eta_i})] - (1-y_i)\log(1+e^{\eta_i}) \\
&= \sum_{i=1}^n [y_i\eta_i - \log(1+e^{\eta_i})]
\end{align*}`

* Use MLE to obtain `\(\hat{{\beta}}\)`.

---
# SGD for Poisson regression

Let `\(Y=\)` the number of events in a given time interval. If the events are independent, and the probability of an event is proportional to the length of the interval, then `\(Y\)` is Poisson distributed.

`$$P(Y=y) = \frac{e^{-\mu}\mu^y}{y!}$$`

* `\(E(Y)=V(Y)=\mu\)`.

* If `\(Y\sim B(n,p)\)`, then `\(Y \approx \text{Poisson}(np)\)` for large `\(n\)` and small `\(p\)`.

* If `\(Y\sim \text{Poisson}(\mu)\)`, then `\(Y\approx N(\mu,\mu)\)` for large `\(\mu\)`.

* For independent `\(Y_1 \sim \text{Poisson}(\mu_1)\)` and `\(Y_2 \sim \text{Poisson}(\mu_2)\)`, `\(Y_1 + Y_2 \sim \text{Poisson}(\mu_1+\mu_2)\)`.

---
# Regression with count data

Suppose the response `\(Y\)` is a count (0, 1, 2, `\(\dots\)`).

* If the count is bounded and the bound is small, use binomial regression.

* If the minimum count is large, use a normal approximation.

* Otherwise, use Poisson or negative binomial regression.

---
# Poisson regression

`\begin{align*}
&y_i \sim \text{Poisson}(\mu_i) \\
&\log(\mu_i) = \eta_i = \beta_0 + \beta_1 x_{i,1} + \dots + \beta_q x_{i,q}
\end{align*}`

* The log link function forces the mean to be positive.

* Log-likelihood:
`$$\log L = \sum_{i=1}^n \left[y_i X_i^T \beta - \exp(X_i^T{\beta}) - \log(y_i!)\right]$$`

---
# Lab 8

Apply GD/SGD to the estimation of any statistical/ML model.

---
# Summary

- GD and SGD.

- When to use SGD?

--

- Heaps of ML and DL methods use SGD for training.

- Simple answer: use stochastic gradient descent when training time is the bottleneck.

- You might refer to the [sgd](https://cran.r-project.org/web/packages/sgd/vignettes/sgd-jss.pdf) package in R.

---
# References and further readings

### References

- Section 3.1 of the [Advanced Statistical Computing](https://bookdown.org/rdpeng/advstatcomp/steepest-descent.html) book by Roger Peng.

### Further readings

- Stochastic Gradient Descent Tricks, by Léon Bottou, Microsoft Research, Redmond, WA. Click [here](https://yanfei.site/docs/bsc/sgdMR.pdf) to download.