layout: true

---
class: inverse, center, middle
background-image: url(../figs/titlepage16-9.png)
background-size: cover

<br>
<br>
# Bayesian Statistics and Computing
## Lecture 10: Stochastic Gradient Descent
<img src="../figs/slides.png" width="150px"/>

#### *Yanfei Kang | BSC | Beihang University*

---
# Stochastic gradient descent

- Popular for training a wide range of models in machine learning.
- Optimization is a big part of machine learning.
- The goal of all supervised machine learning algorithms is to best estimate a target function that maps input data `\(X\)` onto output variables `\(y\)`.
- This requires an optimization process to find the set of coefficients that gives the best estimate of the target function.
- The quality of a candidate set of coefficients is measured by a **cost function**.

---
class: inverse, center, middle

# Gradient descent

---
# Gradient descent

- Gradient descent (steepest descent) is an iterative optimization algorithm.
- When minimizing a function `\(f\)` starting from `\(x_n\)`, the most natural direction to move in is the direction of steepest descent, `\(-f^{\prime}(x_n)\)`.

---
# Gradient descent: 2-d function
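As a running example, take the bowl-shaped function `\(f(x_1, x_2) = x_1^2 + x_2^2\)`, whose contours appear on the next slide. Below is a minimal R sketch of the function and its gradient (the names `f` and `grad_f` are for illustration only):

``` r
## The example 2-d function and its analytic gradient
f <- function(x, y) x^2 + y^2
grad_f <- function(x, y) c(2 * x, 2 * y)

## Steepest descent direction at the starting point (6, 6)
-grad_f(6, 6)
#> [1] -12 -12
```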
---
# Gradient descent: 2-d function

<img src="figure/unnamed-chunk-2-1.svg" width="504" style="display: block; margin: auto;" />

---
# Gradient descent algorithm

Suppose we want to minimize `\(f(\mathbf{x})\)`, where `\(f(\cdot)\)` is a function with continuous partial derivatives everywhere. The gradient descent algorithm performs the following iteration:

`$$\mathbf{x}_{n+1}=\mathbf{x}_{n}-\alpha \nabla f\left(\mathbf{x}_{n}\right),~\mathrm{where}~ \alpha > 0.$$`

1. Please think about the role of `\(\alpha\)` (see the sketch two slides ahead).

2. How does gradient descent relate to Newton's method?

---
# Gradient descent algorithm and Newton's method

.pull-left[
<img src="comp.png" width="400px"/>
]

.pull-right[
- A comparison of gradient descent (green) and Newton's method (red).

- Newton's method uses curvature information (i.e., the second derivative) to take a more direct route.
]

---
# Gradient descent algorithm

- Starting from an initial value `\(\mathbf{x}_0\)`, the iteration produces `\(f(\mathbf{x}_0) \geq f(\mathbf{x}_1) \geq f(\mathbf{x}_2) \geq \cdots\)`.

- Upon convergence, `\(\{\mathbf{x}_n\}\)` converges to a local minimum of `\(f(\mathbf{x})\)`. We can use `\(|f(\mathbf{x}_{n+1})- f(\mathbf{x}_n)| \leq \epsilon\)` as the stopping rule.

---
# Gradient descent algorithm in R

.scroll-output[

``` r
f  <- function(x, y) {x^2 + y^2}
dx <- function(x, y) {2 * x}  # partial derivative w.r.t. x
dy <- function(x, y) {2 * y}  # partial derivative w.r.t. y

num_iter <- 10
learning_rate <- 0.2
x_val <- 6
y_val <- 6

updates_x <- c(x_val, rep(NA, num_iter))
updates_y <- c(y_val, rep(NA, num_iter))
updates_z <- c(f(x_val, y_val), rep(NA, num_iter))

## iterate
for (i in 1:num_iter) {
  dx_val <- dx(x_val, y_val)
  dy_val <- dy(x_val, y_val)
  x_val <- x_val - learning_rate * dx_val
  y_val <- y_val - learning_rate * dy_val
  z_val <- f(x_val, y_val)
  updates_x[i + 1] <- x_val
  updates_y[i + 1] <- y_val
  updates_z[i + 1] <- z_val
}
```
]

---
# Gradient descent algorithm in R

.pull-left[

``` r
## plotting
m <- data.frame(x = updates_x, y = updates_y)
x <- seq(-6, 6, length = 100)
y <- x
g <- expand.grid(x, y)
z <- f(g[, 1], g[, 2])
f_long <- data.frame(x = g[, 1], y = g[, 2], z = z)
library(ggplot2)
ggplot(f_long, aes(x, y, z = z)) +
  geom_contour_filled(aes(fill = after_stat(level)), bins = 50) +
  guides(fill = "none") +
  geom_path(data = m, aes(x, y, z = 0), col = 2, arrow = arrow()) +
  geom_point(data = m, aes(x, y, z = 0), size = 3, col = 2) +
  xlab(expression(x[1])) +
  ylab(expression(x[2])) +
  ggtitle(parse(text = paste0('~ f(x[1],x[2]) == ~ x[1]^2 + x[2]^2')))
```
]

.pull-right[
<img src="figure/unnamed-chunk-4-1.svg" width="504" style="display: block; margin: auto;" />
]

---
# Questions

1. What happens with a larger `\(\alpha\)`? With a smaller one?

2. How can we choose a better `\(\alpha\)`?

3. What happens when the variables are highly correlated?
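---
# Trying different learning rates

Before answering question 1, it helps to experiment. Here is a minimal sketch (the helper `run_gd` and the chosen `\(\alpha\)` values are illustrative, not part of the original code) that reruns the update loop on `\(f(x_1,x_2)=x_1^2+x_2^2\)` for several learning rates:

``` r
f  <- function(x, y) x^2 + y^2
dx <- function(x, y) 2 * x
dy <- function(x, y) 2 * y

## objective value after num_iter gradient descent steps
run_gd <- function(alpha, num_iter = 10, x = 6, y = 6) {
  for (i in 1:num_iter) {
    gx <- dx(x, y)
    gy <- dy(x, y)
    x <- x - alpha * gx
    y <- y - alpha * gy
  }
  f(x, y)
}

## For this quadratic: a tiny alpha makes slow progress, alpha = 0.5
## jumps to the minimum in one step, and alpha > 1 diverges.
sapply(c(0.01, 0.1, 0.5, 1.1), run_gd)
```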
---
# High correlation between variables

<img src="figure/unnamed-chunk-5-1.svg" width="504" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Gradient descent for linear regression

---
# Gradient descent for linear regression

Consider the linear regression

`$$y = X\beta + \epsilon,~\mathrm{where}~\epsilon\sim N(0,I).$$`

- Write out the cost function:
`$$J(\beta) = \frac{1}{2n}(X\beta - y)^T(X\beta - y).$$`

- Calculate the gradient:
`$$\nabla J(\beta) = \frac{1}{n}X^T(X\beta-y).$$`

- Iterate:
`$$\beta := \beta - \alpha \nabla J(\beta).$$`

---
# Gradient descent for linear regression in R

.scroll-output[

``` r
graddesc.lm <- function(X, y, beta.init, alpha = 0.1, tol = 1e-9, max.iter = 100) {
  beta.old <- beta.init
  J <- betas <- list()
  betas[[1]] <- beta.old
  J[[1]] <- lm.cost(X, y, beta.old)
  beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- lm.cost(X, y, beta.new)
  iter <- 0
  while ((abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) &
         (iter < max.iter)) {
    beta.old <- beta.new
    beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
    iter <- iter + 1
    betas[[iter + 2]] <- beta.new
    J[[iter + 2]] <- lm.cost(X, y, beta.new)
  }
  if (abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) {
    cat('Did not converge \n')
  } else {
    cat('Converged \n')
    cat('Iterated', iter + 1, 'times', '\n')
    cat('Coef: ', beta.new)
    return(list(coef = betas, cost = J))
  }
}
```
]

---
# Data generation

``` r
## Generate some data
beta0 <- 1
beta1 <- 3
sigma <- 1
n <- 10000
x <- rnorm(n, 0, 1)
y <- beta0 + x * beta1 + rnorm(n, mean = 0, sd = sigma)
X <- cbind(1, x)
```

---
# Exercise

1. Please write out the gradient function in the following code.

2. Run the code and compare with `lm()`.

3. Change the value of `\(\alpha\)` and see what happens.

4. Think about how to select `\(\alpha\)`.

.scroll-output[

``` r
## Make the cost function
lm.cost <- function(X, y, beta) {
  n <- length(y)
  loss <- sum((X %*% beta - y)^2) / (2 * n)
  return(loss)
}

## Calculate the gradient (left for you to fill in)
lm.cost.grad <- function(X, y, beta) {
  n <- length(y)
  return()
}

gd.est <- graddesc.lm(X, y, beta.init = c(-4, -5), alpha = 0.1, max.iter = 1000)
```
]

---
# Convergence speed

<img src="figure/unnamed-chunk-9-1.svg" width="864" style="display: block; margin: auto;" />
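---
# Choosing `\(\alpha\)` by exact line search

For the quadratic cost of linear regression, the best step size along `\(-\nabla J(\beta)\)` has a closed form. Writing `\(g = \nabla J(\beta)\)` and minimizing `\(J(\beta - \alpha g)\)` over `\(\alpha\)` gives

`$$\alpha^* = \frac{g^T g}{\frac{1}{n} g^T X^T X g}.$$`

A minimal sketch (the function `exact.alpha` is illustrative and assumes `lm.cost.grad` from the exercise has been completed):

``` r
exact.alpha <- function(X, y, beta) {
  n <- length(y)
  g <- lm.cost.grad(X, y, beta)  # current gradient
  Xg <- X %*% g
  as.numeric(sum(g^2) / (sum(Xg^2) / n))
}
```

The next slides instead optimize `\(\alpha\)` numerically with `optim()`, which also works for non-quadratic costs.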
---
# Improving GD by optimizing `\(\alpha\)` in each step

.scroll-output[

``` r
gd.lm <- function(X, y, beta.init, alpha, tol = 1e-5, max.iter = 100) {
  beta.old <- beta.init
  J <- betas <- list()
  ## remember whether alpha should be re-optimized at every step
  auto <- identical(alpha, 'auto')
  if (auto) {
    alpha <- optim(0.1, function(alpha) {
      lm.cost(X, y, beta.old - alpha * lm.cost.grad(X, y, beta.old))
    }, method = 'L-BFGS-B', lower = 0, upper = 1)
    if (alpha$convergence == 0) {
      alpha <- alpha$par
    } else {
      alpha <- 0.1
    }
  }
  betas[[1]] <- beta.old
  J[[1]] <- lm.cost(X, y, beta.old)
  beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- lm.cost(X, y, beta.new)
  iter <- 0
  while ((abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) &
         (iter < max.iter)) {
    beta.old <- beta.new
    if (auto) {
      alpha <- optim(0.1, function(alpha) {
        lm.cost(X, y, beta.old - alpha * lm.cost.grad(X, y, beta.old))
      }, method = 'L-BFGS-B', lower = 0, upper = 1)
      if (alpha$convergence == 0) {
        alpha <- alpha$par
      } else {
        alpha <- 0.1
      }
    }
    beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
    iter <- iter + 1
    betas[[iter + 2]] <- beta.new
    J[[iter + 2]] <- lm.cost(X, y, beta.new)
  }
  if (abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) {
    cat('Did not converge. \n')
  } else {
    cat('Converged..\n')
    cat('Iterated ', iter + 1, ' times.', '\n')
    cat('Coefficients: ', beta.new, '\n')
    return(list(coef = betas, cost = J, niter = iter + 1))
  }
}
```
]

---
# Improving GD by optimizing `\(\alpha\)` in each step

.scroll-output[

``` r
## Generate some data
beta0 <- 1
beta1 <- 3
sigma <- 1
n <- 10000
x <- rnorm(n, 0, 1)
y <- beta0 + x * beta1 + rnorm(n, mean = 0, sd = sigma)
X <- cbind(1, x)

## Make the cost function
lm.cost <- function(X, y, beta) {
  n <- length(y)
  loss <- sum((X %*% beta - y)^2) / (2 * n)
  return(loss)
}

## Calculate the gradient
lm.cost.grad <- function(X, y, beta) {
  n <- length(y)
  (1 / n) * (t(X) %*% (X %*% beta - y))
}

## Use optimized alpha
gd.auto <- gd.lm(X, y, beta.init = c(-4, -5), alpha = 'auto',
                 tol = 1e-5, max.iter = 10000)
#> Converged..
#> Iterated 3 times.
#> Coefficients: 1.003497 3.010837

gd1 <- gd.lm(X, y, beta.init = c(-4, -5), alpha = 0.1,
             tol = 1e-5, max.iter = 10000)
#> Converged..
#> Iterated 66 times.
#> Coefficients: 0.9995082 3.002953

gd2 <- gd.lm(X, y, beta.init = c(-4, -5), alpha = 0.01,
             tol = 1e-5, max.iter = 10000)
#> Converged..
#> Iterated 567 times.
#> Coefficients: 0.9889579 2.983279
```
]

---
class: inverse, center, middle

# Stochastic gradient descent (SGD)

---
# Stochastic gradient descent (SGD)

- Gradient descent can be slow to run on very large datasets.

- One iteration of the gradient descent algorithm processes the entire training dataset.

- Because each iteration uses the whole dataset as one batch, this form of gradient descent is referred to as **batch gradient descent**.

- SGD instead updates the parameters using one training sample at a time.

---
# SGD for linear regression

Replace

`$$\nabla J(\beta) = \frac{1}{n}X^T(X\beta-y)$$`

with:

`$$\nabla J(\beta)_i = X_i^T(X_i\beta-y_i).$$`

---
# SGD for linear regression

1. Shuffle all the training data randomly.

2. Repeat the following.
  - For each training sample `\(i = 1, \cdots, n\)`, update
`$$\beta_{k+1}= \beta_k - \alpha \nabla J(\beta_k)_i.$$`

---
# SGD in R

.scroll-output[

``` r
sgd.lm <- function(X, y, beta.init, alpha = 0.5, n.samples = 1,
                   tol = 1e-5, max.iter = 100) {
  n <- length(y)
  beta.old <- beta.init
  J <- betas <- list()
  sto.sample <- sample(1:n, n.samples, replace = TRUE)
  betas[[1]] <- beta.old
  J[[1]] <- sgd.lm.cost(X, y, beta.old)
  beta.new <- beta.old - alpha *
    sgd.lm.cost.grad(X[sto.sample, ], y[sto.sample], beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- sgd.lm.cost(X, y, beta.new)
  iter <- 0
  while ((abs(sgd.lm.cost(X, y, beta.new) - sgd.lm.cost(X, y, beta.old)) > tol) &
         (iter + 2 < max.iter)) {
    beta.old <- beta.new
    ## draw a fresh random sample for each update
    sto.sample <- sample(1:n, n.samples, replace = TRUE)
    beta.new <- beta.old - alpha *
      sgd.lm.cost.grad(X[sto.sample, ], y[sto.sample], beta.old)
    iter <- iter + 1
    betas[[iter + 2]] <- beta.new
    J[[iter + 2]] <- sgd.lm.cost(X, y, beta.new)
  }
  if (abs(sgd.lm.cost(X, y, beta.new) - sgd.lm.cost(X, y, beta.old)) > tol) {
    cat('Did not converge. \n')
  } else {
    cat('Converged. \n')
    cat('Iterated ', iter + 1, ' times.', '\n')
    cat('Coefficients: ', beta.new, '\n')
    return(list(coef = betas, cost = J, niter = iter + 1))
  }
}

## Make the cost function (handles a single-row X as well)
sgd.lm.cost <- function(X, y, beta) {
  n <- length(y)
  if (!is.matrix(X)) {
    X <- matrix(X, nrow = 1)
  }
  loss <- sum((X %*% beta - y)^2) / (2 * n)
  return(loss)
}

## Calculate the gradient
sgd.lm.cost.grad <- function(X, y, beta) {
  n <- length(y)
  if (!is.matrix(X)) {
    X <- matrix(X, nrow = 1)
  }
  t(X) %*% (X %*% beta - y) / n
}
```
]
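---
# SGD in R

A hedged usage sketch: a call like the one below (the seed and the `max.iter` value are assumptions, not from the original slides) fits the simulated data from earlier and yields output like that shown on the next slide.

``` r
set.seed(2020)  # assumed seed; SGD results vary from run to run
sgd.est <- sgd.lm(X, y, beta.init = c(-4, -5), alpha = 0.5,
                  n.samples = 1, tol = 1e-5, max.iter = 1000)
```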
---
# SGD in R

```
#> Converged. 
#> Iterated 113 times. 
#> Coefficients: 1.004577 2.905418
```

<img src="figure/unnamed-chunk-13-1.svg" width="432" style="display: block; margin: auto;" />

---
# A comparison: GD, SGD and Newton's method

<img src="figure/optimcomp-1.svg" width="504" style="display: block; margin: auto;" />

---
# Your turn

Test the `sgd.lm()` function on the [Bodyfat](https://yanfei.site/docs/bsc/Bodyfat.csv) data.

---
# SGD for logistic regression

Suppose the response variable `\(Y_i\)` takes values 0 or 1 with probability `\(P(Y_i=1)=p_i\)`, i.e., `\(Y_i\)` follows a Bernoulli distribution.

We relate `\(p_i\)` to the predictors:

`\begin{align*}
\eta_i &= \beta_0 + \beta_1 x_{i,1} + \dots + \beta_q x_{i,q}\\
\eta_i &= g(p_i)
\end{align*}`

* The link function `\(g\)` should be monotone.

* `\(0 \leq g^{-1}(\eta) \leq 1\)` for any `\(\eta\)`.

---
# Logit link function

* `\(\eta = g(p) = \log(\frac{p}{1-p})\)` is the *logit* function. It maps `\((0,1) \rightarrow \mathbb{R}\)`.

* The inverse logit is `\(g^{-1}(\eta) = \frac{e^\eta}{1+e^\eta}\)`, which maps `\(\mathbb{R} \rightarrow (0,1)\)`.

* `\(p_i = \frac{e^{\eta_i}}{1+e^{\eta_i}} = P(Y_i=1)\)`.

---
# Log-likelihood

`$$\eta_i = \beta_0 + \beta_1 x_{i,1} + \dots + \beta_q x_{i,q}$$`

`\begin{align*}
\log L({\beta}) &= \sum_{i=1}^n \log(p_i)1_{Y_i=1} + \log(1-p_i)1_{Y_i=0} \\
&= \sum_{i=1}^n y_i[\eta_i - \log(1+e^{\eta_i})] + (1-y_i)\log\left(1-\frac{e^{\eta_i}}{1+e^{\eta_i}}\right) \\
&= \sum_{i=1}^n y_i[\eta_i - \log(1+e^{\eta_i})] + (1-y_i)\log\left(\frac{1}{1+e^{\eta_i}}\right) \\
&= \sum_{i=1}^n y_i[\eta_i - \log(1+e^{\eta_i})] - (1-y_i)\log(1+e^{\eta_i}) \\
&= \sum_{i=1}^n [y_i\eta_i - \log(1+e^{\eta_i})]
\end{align*}`

* Use MLE to obtain `\(\hat{{\beta}}\)`.

---
# SGD for Poisson regression

Let `\(Y=\)` the number of events in a given time interval. If the events are independent, and the probability of an event is proportional to the length of the interval, then `\(Y\)` is Poisson distributed.

`$$P(Y=y) = \frac{e^{-\mu}\mu^y}{y!}$$`

* `\(E(Y)=V(Y)=\mu\)`.

* If `\(Y\sim B(n,p)\)`, then `\(Y \approx \text{Poisson}(np)\)` for large `\(n\)` and small `\(p\)`.

* If `\(Y\sim \text{Poisson}(\mu)\)`, then `\(Y\approx N(\mu,\mu)\)` for large `\(\mu\)`.

* For independent `\(Y_1 \sim \text{Poisson}(\mu_1)\)` and `\(Y_2 \sim \text{Poisson}(\mu_2)\)`, `\(Y_1 + Y_2 \sim \text{Poisson}(\mu_1+\mu_2)\)`.

---
# Regression with count data

Suppose the response `\(Y\)` is a count (0, 1, 2, `\(\dots\)`).

* If the count is bounded and the bound is small, use binomial regression.

* If the minimum count is large, use a normal approximation.

* Otherwise, use Poisson or negative binomial regression.

---
# Poisson regression

`\begin{align*}
&y_i \sim \text{Poisson}(\mu_i) \\
&\log(\mu_i) = \eta_i = \beta_0 + \beta_1 x_{i,1} + \dots + \beta_q x_{i,q}
\end{align*}`

* The log link function forces the mean to be positive.

* Log-likelihood:
`$$\log L = \sum_{i=1}^n \left[y_i X_i^T \beta - \exp(X_i^T{\beta}) - \log(y_i!)\right]$$`

---
# Lab 8

Apply GD/SGD to the estimation of any statistical/ML model.

---
# Summary

- GD and SGD.

- When to use SGD?

--

- Heaps of ML and DL methods use SGD for training.

- Simple answer: use stochastic gradient descent when training time is the bottleneck.

- You might refer to the [sgd](https://cran.r-project.org/web/packages/sgd/vignettes/sgd-jss.pdf) package in R.

---
# References and further readings

### References

- Section 3.1 of the [Advanced Statistical Computing](https://bookdown.org/rdpeng/advstatcomp/steepest-descent.html) book by Roger Peng.

### Further readings

- Stochastic Gradient Descent Tricks, by Léon Bottou, Microsoft Research, Redmond, WA. Click [here](https://yanfei.site/docs/bsc/sgdMR.pdf) to download.