Gradient descent is popular for training a wide range of models in machine learning.
Optimization is a big part of machine learning.
The goal of all supervised machine learning algorithms is to best estimate a target function that maps input data X onto output variables y.
This requires a process of optimization to find the set of coefficients that results in the best estimate of the target function.
The quality of a candidate set of coefficients is measured by a cost function.
Gradient descent (also called steepest descent) is an iterative optimization algorithm.
The most obvious direction to choose when attempting to minimize a function f starting at x_n is the direction of steepest descent, −f′(x_n).
Suppose we want to minimize f(x), where f(⋅) is a function with continuous partial derivatives everywhere. The gradient descent algorithm performs the following iteration:
x_{n+1} = x_n − α ∇f(x_n), where α > 0.
Please think about the role of the step size α (the learning rate).
How does the gradient descent algorithm compare with Newton's method?
A comparison of gradient descent (green) and Newton's method (red).
Newton's method uses curvature information (i.e., the second derivative) to take a more direct route; a minimal one-dimensional sketch of both updates follows.
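To make the comparison concrete, here is a minimal one-dimensional sketch (not from the original slides); the test function, starting point, step size, and iteration count are illustrative assumptions.

## Minimize f(x) = x^4 - 3x^3 + 2, which has its minimum at x = 9/4
f_1d   <- function(x) x^4 - 3*x^3 + 2
df_1d  <- function(x) 4*x^3 - 9*x^2        # first derivative
d2f_1d <- function(x) 12*x^2 - 18*x        # second derivative (curvature)

x_gd <- x_nt <- 4                          # common starting point
alpha <- 0.01                              # step size for gradient descent
for (i in 1:20) {
  x_gd <- x_gd - alpha * df_1d(x_gd)           # gradient descent step
  x_nt <- x_nt - df_1d(x_nt) / d2f_1d(x_nt)    # Newton step rescales by curvature
}
c(gradient_descent = x_gd, newton = x_nt)      # both approach 2.25; Newton gets there in far fewer steps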
For an initial value x_0, we have f(x_0) ≥ f(x_1) ≥ f(x_2) ≥ ⋯.
Upon convergence, {x_n} converges to a local minimum of f(x). We can use |f(x_{n+1}) − f(x_n)| ≤ ϵ as the stopping rule; a variant of the example below that uses this rule is sketched after the plotting code.
f <- function(x, y) { x^2 + y^2 }
dx <- function(x, y) { 2*x }
dy <- function(x, y) { 2*y }

num_iter <- 10
learning_rate <- 0.2
x_val <- 6
y_val <- 6
updates_x <- c(x_val, rep(NA, num_iter))
updates_y <- c(y_val, rep(NA, num_iter))
updates_z <- c(f(x_val, y_val), rep(NA, num_iter))

# iterate
for (i in 1:num_iter) {
  dx_val <- dx(x_val, y_val)
  dy_val <- dy(x_val, y_val)
  x_val <- x_val - learning_rate*dx_val
  y_val <- y_val - learning_rate*dy_val
  z_val <- f(x_val, y_val)
  updates_x[i + 1] <- x_val
  updates_y[i + 1] <- y_val
  updates_z[i + 1] <- z_val
}
## plotting
m <- data.frame(x = updates_x, y = updates_y)
x <- seq(-6, 6, length = 100)
y <- x
g <- expand.grid(x, y)
z <- f(g[, 1], g[, 2])
f_long <- data.frame(x = g[, 1], y = g[, 2], z = z)

library(ggplot2)
ggplot(f_long, aes(x, y, z = z)) +
  geom_contour_filled(aes(fill = stat(level)), bins = 50) +
  guides(fill = FALSE) +
  geom_path(data = m, aes(x, y, z = 0), col = 2, arrow = arrow()) +
  geom_point(data = m, aes(x, y, z = 0), size = 3, col = 2) +
  xlab(expression(x[1])) +
  ylab(expression(x[2])) +
  ggtitle(parse(text = paste0('~ f(x[1],x[2]) == ~ x[1]^2 + x[2]^2')))
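As a minimal sketch (not in the original slides), the fixed 10-iteration loop above can be rewritten to use the stopping rule |f(x_{n+1}) − f(x_n)| ≤ ϵ, reusing f, dx, dy, and learning_rate defined above; the tolerance value is an illustrative assumption.

## Stopping-rule variant: iterate until the decrease in f falls below a tolerance
epsilon <- 1e-8
x_val <- 6
y_val <- 6
z_old <- f(x_val, y_val)
repeat {
  dx_val <- dx(x_val, y_val)
  dy_val <- dy(x_val, y_val)
  x_val <- x_val - learning_rate * dx_val
  y_val <- y_val - learning_rate * dy_val
  z_new <- f(x_val, y_val)
  if (abs(z_new - z_old) <= epsilon) break
  z_old <- z_new
}
c(x = x_val, y = y_val, f = z_new)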
What happens if we choose a larger α? What about a smaller one?
How can we choose a better α?
What happens when there is high correlation between the variables? (A sketch of the closely related ill-conditioning issue follows.)
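High correlation between variables makes X^T X ill-conditioned, so the cost surface has long, narrow contours and gradient descent zig-zags along the steep direction. As a hedged illustration (not from the original slides), the quadratic f(x_1, x_2) = x_1^2 + 10 x_2^2 mimics such a surface; the step size is an assumption chosen just inside the stable range.

## Ill-conditioned quadratic: curvature differs sharply between directions
f_ill   <- function(x1, x2) x1^2 + 10 * x2^2
dx1_ill <- function(x1, x2) 2 * x1
dx2_ill <- function(x1, x2) 20 * x2

x1_val <- 6
x2_val <- 6
alpha <- 0.095      # must stay below 2/20 = 0.1 to keep the steep x2 direction stable
for (i in 1:50) {
  g1 <- dx1_ill(x1_val, x2_val)
  g2 <- dx2_ill(x1_val, x2_val)
  x1_val <- x1_val - alpha * g1
  x2_val <- x2_val - alpha * g2
}
## The steep direction forces a small alpha: x2 oscillates with factor 1 - 0.095*20 = -0.9
## while x1 contracts with factor 0.81, so convergence is much slower than for x1^2 + x2^2.
c(x1 = x1_val, x2 = x2_val)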
Consider linear regression:
y = Xβ + ϵ, where ϵ ∼ N(0, I).
Take the cost function J(β) = (1/(2n)) ‖Xβ − y‖^2 and calculate its gradient: ∇J(β) = (1/n) X^T (Xβ − y).
Iteration: β := β − α ∇J(β).
graddesc.lm <- function(X, y, beta.init, alpha = 0.1, tol = 1e-9, max.iter = 100){
  beta.old <- beta.init
  J <- betas <- list()
  betas[[1]] <- beta.old
  J[[1]] <- lm.cost(X, y, beta.old)
  beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- lm.cost(X, y, beta.new)
  iter <- 0
  while ((abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) & (iter < max.iter)){
    beta.old <- beta.new
    beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
    iter <- iter + 1
    betas[[iter + 2]] <- beta.new
    J[[iter + 2]] <- lm.cost(X, y, beta.new)
  }
  if (abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol){
    cat('Did not converge \n')
  } else{
    cat('Converged \n')
    cat('Iterated', iter + 1, 'times', '\n')
    cat('Coef: ', beta.new)
    return(list(coef = betas, cost = J))
  }
}
## Generate some data
beta0 <- 1
beta1 <- 3
sigma <- 1
n <- 10000
x <- rnorm(n, 0, 1)
y <- beta0 + x * beta1 + rnorm(n, mean = 0, sd = sigma)
X <- cbind(1, x)
Please write out the gradient function in the following code.
Run the code and compare the result with lm().
Change the value of α and see what happens.
Think about how to select α.
## Make the cost function
lm.cost <- function(X, y, beta){
  n <- length(y)
  loss <- sum((X %*% beta - y)^2) / (2*n)
  return(loss)
}

## Calculate the gradient (exercise: fill this in before running)
lm.cost.grad <- function(X, y, beta){
  n <- length(y)
  return()
}

gd.est <- graddesc.lm(X, y, beta.init = c(-4, -5), alpha = 0.1, max.iter = 1000)
gd.lm <- function(X, y, beta.init, alpha, tol = 1e-5, max.iter = 100){
  beta.old <- beta.init
  J <- betas <- list()
  ## alpha = 'auto': pick the step size by a one-dimensional line search on the cost via optim()
  if (alpha == 'auto'){
    alpha <- optim(0.1, function(alpha) {
      lm.cost(X, y, beta.old - alpha * lm.cost.grad(X, y, beta.old))
    }, method = 'L-BFGS-B', lower = 0, upper = 1)
    if (alpha$convergence == 0) {
      alpha <- alpha$par
    } else{
      alpha <- 0.1
    }
  }
  betas[[1]] <- beta.old
  J[[1]] <- lm.cost(X, y, beta.old)
  beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- lm.cost(X, y, beta.new)
  iter <- 0
  while ((abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) & (iter < max.iter)) {
    beta.old <- beta.new
    if (alpha == 'auto'){
      alpha <- optim(0.1, function(alpha) {
        lm.cost(X, y, beta.old - alpha * lm.cost.grad(X, y, beta.old))
      }, method = 'L-BFGS-B', lower = 0, upper = 1)
      if (alpha$convergence == 0) {
        alpha <- alpha$par
      } else{
        alpha <- 0.1
      }
    }
    beta.new <- beta.old - alpha * lm.cost.grad(X, y, beta.old)
    iter <- iter + 1
    betas[[iter + 2]] <- beta.new
    J[[iter + 2]] <- lm.cost(X, y, beta.new)
  }
  if (abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol){
    cat('Did not converge. \n')
  } else{
    cat('Converged..\n')
    cat('Iterated ', iter + 1, ' times.', '\n')
    cat('Coefficients: ', beta.new, '\n')
    return(list(coef = betas, cost = J, niter = iter + 1))
  }
}
## Generate some data
beta0 <- 1
beta1 <- 3
sigma <- 1
n <- 10000
x <- rnorm(n, 0, 1)
y <- beta0 + x * beta1 + rnorm(n, mean = 0, sd = sigma)
X <- cbind(1, x)

## Make the cost function
lm.cost <- function(X, y, beta){
  n <- length(y)
  loss <- sum((X %*% beta - y)^2) / (2*n)
  return(loss)
}

## Calculate the gradient
lm.cost.grad <- function(X, y, beta){
  n <- length(y)
  (1/n) * (t(X) %*% (X %*% beta - y))
}

## Use optimized alpha
gd.auto <- gd.lm(X, y, beta.init = c(-4, -5), alpha = 'auto', tol = 1e-5, max.iter = 10000)
#> Converged..
#> Iterated 3 times. 
#> Coefficients: 1.003497 3.010837

gd1 <- gd.lm(X, y, beta.init = c(-4, -5), alpha = 0.1, tol = 1e-5, max.iter = 10000)
#> Converged..
#> Iterated 66 times. 
#> Coefficients: 0.9995082 3.002953

gd2 <- gd.lm(X, y, beta.init = c(-4, -5), alpha = 0.01, tol = 1e-5, max.iter = 10000)
#> Converged..
#> Iterated 567 times. 
#> Coefficients: 0.9889579 2.983279
Gradient descent can be slow to run on very large datasets.
One iteration of the gradient descent algorithm processes the entire dataset.
One such pass is called one batch, and this form of gradient descent is referred to as batch gradient descent.
SGD updates parameters for each training sample.
Replace ∇J(β) = (1/n) X^T (Xβ − y) with the single-observation gradient:
∇J(β)_i = X_i^T (X_i β − y_i).
sgd.lm <- function(X, y, beta.init, alpha = 0.5, n.samples = 1, tol = 1e-5, max.iter = 100){
  n <- length(y)
  beta.old <- beta.init
  J <- betas <- list()
  sto.sample <- sample(1:n, n.samples, replace = TRUE)
  betas[[1]] <- beta.old
  J[[1]] <- lm.cost(X, y, beta.old)
  beta.new <- beta.old - alpha * sgd.lm.cost.grad(X[sto.sample, ], y[sto.sample], beta.old)
  betas[[2]] <- beta.new
  J[[2]] <- lm.cost(X, y, beta.new)
  iter <- 0
  n.best <- 0
  while ((abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol) & (iter + 2 < max.iter)){
    beta.old <- beta.new
    sto.sample <- sample(1:n, n.samples, replace = TRUE)
    beta.new <- beta.old - alpha * sgd.lm.cost.grad(X[sto.sample, ], y[sto.sample], beta.old)
    iter <- iter + 1
    betas[[iter + 2]] <- beta.new
    J[[iter + 2]] <- lm.cost(X, y, beta.new)
  }
  if (abs(lm.cost(X, y, beta.new) - lm.cost(X, y, beta.old)) > tol){
    cat('Did not converge. \n')
  } else{
    cat('Converged. \n')
    cat('Iterated ', iter + 1, ' times.', '\n')
    cat('Coefficients: ', beta.new, '\n')
    return(list(coef = betas, cost = J, niter = iter + 1))
  }
}

## Make the cost function
sgd.lm.cost <- function(X, y, beta){
  n <- length(y)
  if (!is.matrix(X)){
    X <- matrix(X, nrow = 1)
  }
  loss <- sum((X %*% beta - y)^2) / (2*n)
  return(loss)
}

## Calculate the gradient
sgd.lm.cost.grad <- function(X, y, beta){
  n <- length(y)
  if (!is.matrix(X)){
    X <- matrix(X, nrow = 1)
  }
  t(X) %*% (X %*% beta - y) / n
}
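The output below appears without the call that produced it; a call along these lines would generate it (the exact argument values are not shown in the source, so those used here are assumptions).

sgd.est <- sgd.lm(X, y, beta.init = c(-4, -5), alpha = 0.5, n.samples = 1,
                  tol = 1e-5, max.iter = 1000)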
#> Converged. 
#> Iterated 113 times. 
#> Coefficients: 1.004577 2.905418
Suppose the response variable Y_i takes values 0 or 1 with probability P(Y_i = 1) = p_i (i.e., a Bernoulli distribution).
We relate p_i to the predictors:
η_i = β_0 + β_1 x_{i,1} + ⋯ + β_q x_{i,q}, with η_i = g(p_i).
η = g(p) = log(p/(1 − p)) is the logit function. It maps (0,1) → R.
The inverse logit is g^{−1}(η) = e^η/(1 + e^η), which maps R → (0,1).
p_i = e^{η_i}/(1 + e^{η_i}) = P(Y_i = 1).
With η_i = β_0 + β_1 x_{i,1} + ⋯ + β_q x_{i,q}, the log-likelihood is
log L(β) = Σ_{i=1}^n [ log(p_i) 1_{Y_i=1} + log(1 − p_i) 1_{Y_i=0} ]
         = Σ_{i=1}^n y_i [η_i − log(1 + e^{η_i})] + (1 − y_i) log(1 − e^{η_i}/(1 + e^{η_i}))
         = Σ_{i=1}^n y_i [η_i − log(1 + e^{η_i})] + (1 − y_i) log(1/(1 + e^{η_i}))
         = Σ_{i=1}^n y_i [η_i − log(1 + e^{η_i})] − (1 − y_i) log(1 + e^{η_i})
         = Σ_{i=1}^n [ y_i η_i − log(1 + e^{η_i}) ].
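Differentiating gives ∂ log L/∂β = X^T (y − p) with p = e^η/(1 + e^η), so gradient descent can be run on the negative log-likelihood. A hedged sketch follows (the simulated data, starting values, step size, and iteration count are illustrative assumptions, not from the slides).

## Logistic regression by gradient descent on the average negative log-likelihood
logit.cost.grad <- function(X, y, beta){
  n <- length(y)
  p <- 1 / (1 + exp(-drop(X %*% beta)))   # inverse logit of eta = X beta
  drop(-t(X) %*% (y - p)) / n             # gradient of -logL(beta)/n
}

set.seed(1)
nb <- 5000
x1 <- rnorm(nb)
p_true <- 1 / (1 + exp(-(1 + 2 * x1)))    # true beta = (1, 2), an illustrative choice
yb <- rbinom(nb, 1, p_true)
Xb <- cbind(1, x1)

beta <- c(0, 0)
alpha <- 0.5                              # step size (assumed; tune as needed)
for (i in 1:2000) {
  beta <- beta - alpha * logit.cost.grad(Xb, yb, beta)
}
cbind(gd = beta, glm = coef(glm(yb ~ x1, family = binomial)))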
Let Y = the number of events in a given time interval. If events are independent and the probability of an event is proportional to the length of the interval, then Y is Poisson distributed:
P(Y = y) = e^{−μ} μ^y / y!
Suppose the response Y is a count (0, 1, 2, …).
If the count is bounded and the bound is small, use binomial regression.
If the minimum count is large, use a normal approximation.
Otherwise, use Poisson or negative binomial regression.
y_i ∼ Poisson(μ_i), with log(μ_i) = η_i = β_0 + β_1 x_{i,1} + ⋯ + β_q x_{i,q}.
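Dropping the constant log(y_i!) term, the Poisson log-likelihood is Σ_{i=1}^n [ y_i η_i − e^{η_i} ], with gradient X^T (y − μ) where μ = e^η. A hedged gradient descent sketch follows (the simulated data, step size, and iteration count are illustrative assumptions, not from the slides).

## Poisson regression by gradient descent on the average negative log-likelihood
pois.cost.grad <- function(X, y, beta){
  n <- length(y)
  mu <- exp(drop(X %*% beta))        # log link: mu = exp(eta)
  drop(-t(X) %*% (y - mu)) / n       # gradient of -logL(beta)/n (the log(y!) term drops out)
}

set.seed(2)
np <- 5000
xp <- rnorm(np)
mu_true <- exp(0.5 + 0.8 * xp)       # true beta = (0.5, 0.8), an illustrative choice
yp <- rpois(np, mu_true)
Xp <- cbind(1, xp)

beta <- c(0, 0)
alpha <- 0.1                         # step size (assumed; the exp() link makes large steps unstable)
for (i in 1:5000) {
  beta <- beta - alpha * pois.cost.grad(Xp, yp, beta)
}
cbind(gd = beta, glm = coef(glm(yp ~ xp, family = poisson)))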
GD/SGD can be applied to the estimation of essentially any statistical/ML model.
GD and SGD.
When to use SGD?
Many ML and DL methods use SGD for training.
Simple answer: use stochastic gradient descent when training time is the bottleneck.
You might refer to the sgd package in R; a short, hedged usage sketch follows.
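As a hedged sketch of that package's interface, written from memory rather than taken from the slides (check ?sgd::sgd; the model argument and the returned element used here are assumptions), it can fit the simulated linear regression from earlier:

# install.packages("sgd")
library(sgd)
dat <- data.frame(y = y, x = x)                 # reuse the simulated linear-regression data
fit <- sgd(y ~ x, data = dat, model = "lm")     # estimate by stochastic gradient descent
fit$coefficients                                # compare with coef(lm(y ~ x, data = dat))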