AERSP 597 - Machine Learning in Aerospace Engineering¶

Lecture 15, Neural Networks: Gradient Descent¶

Instructor: Daning Huang¶

$$ \newcommand{\bR}{\mathbb{R}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cN}{\mathcal{N}} \newcommand{\vH}{\mathbf{H}} \newcommand{\vI}{\mathbf{I}} \newcommand{\vK}{\mathbf{K}} \newcommand{\vL}{\mathbf{L}} \newcommand{\vM}{\mathbf{M}} \newcommand{\vO}{\mathbf{O}} \newcommand{\vQ}{\mathbf{Q}} \newcommand{\vV}{\mathbf{V}} \newcommand{\vb}{\mathbf{b}} \newcommand{\vf}{\mathbf{f}} \newcommand{\vm}{\mathbf{m}} \newcommand{\vx}{\mathbf{x}} \newcommand{\vy}{\mathbf{y}} \newcommand{\sns}{\sigma_n^2} \newcommand{\Dx}{{\nabla\vx}} $$

TODAY: Neural Networks - II¶

  • Loss functions
  • Newton-type methods
  • Gradient descent methods

References¶

  • PRML Chp. 5
  • Referenced papers

Training Neural Networks¶

Repeat until convergence (a minimal sketch of one iteration is given after this list):

  • $(\textbf{x},\textbf{y}) \leftarrow$ Sample an example (or a mini-batch) from data
  • $\hat{\textbf{y}} \leftarrow f\left( \textbf{x} ; \theta \right)$ Forward propagation
  • Compute $\mathcal{L}\left(\textbf{y},\hat{\textbf{y}}\right)$
  • $\nabla_{\theta}\mathcal{L} \leftarrow$ Backward propagation
  • Update weights using (stochastic) gradient descent
    • $\theta \leftarrow \theta - \alpha \nabla_{\theta}\mathcal{L}$
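
As a toy illustration, here is a minimal NumPy sketch of one pass through this loop for a single-hidden-layer network with squared loss; the layer sizes, the sigmoid activation, the learning rate, and the randomly generated "data" are made-up placeholders.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 5 inputs -> 3 hidden units -> 1 output
W1, b1 = rng.normal(size=(3, 5)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
alpha = 0.01  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(100):                              # repeat (fixed count here)
    x, y = rng.normal(size=5), np.array([1.0])    # sample an example from "data"
    # Forward propagation
    h = sigmoid(W1 @ x + b1)
    y_hat = W2 @ h + b2
    # Loss and its gradient w.r.t. the prediction (squared loss)
    loss = 0.5 * np.sum((y - y_hat)**2)
    dy = y_hat - y
    # Backward propagation (chain rule, written out by hand for this tiny model)
    dW2, db2 = np.outer(dy, h), dy
    dh = W2.T @ dy
    dz1 = dh * h * (1 - h)
    dW1, db1 = np.outer(dz1, x), dz1
    # Gradient-descent update
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2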

Types of Losses: Squared Loss¶

$$ \mathcal{L}\left(\textbf{y},\hat{\textbf{y}}\right) = \frac{1}{2}\left\Vert \textbf{y}-\hat{\textbf{y}}\right\Vert^2_2 $$$$ \frac{\partial\mathcal{L}}{\partial \hat{y}_i}=\hat{y}_i-y_i \iff \nabla_{\hat{\textbf{y}}}\mathcal{L}=\hat{\textbf{y}}-\textbf{y} $$
  • Used for regression problems
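
As a quick sanity check of this gradient, the following sketch compares the analytic expression with a finite difference; the vectors are arbitrary examples.

In [ ]:
import numpy as np

y = np.array([1.0, -2.0, 0.5])       # target
y_hat = np.array([0.8, -1.5, 0.0])   # prediction

loss = 0.5 * np.sum((y - y_hat)**2)
grad = y_hat - y                     # analytic gradient w.r.t. y_hat

# Finite-difference check of the first component
eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])
fd = (0.5 * np.sum((y - (y_hat + e0))**2) - loss) / eps
print(grad[0], fd)                   # the two values should nearly match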

Forward/Backward propagation¶

We are going to cover backward propagation in the next lecture.

Newton's Method: Overview¶

  • Goal: Minimize a general function $F(w)$ in one dimension by solving for $$ f(w) = \frac{\partial F}{\partial w} = 0 $$
  • Newton's Method: To find roots of $f$, repeat until convergence: $$ w \leftarrow w - \frac{f(w)}{f'(w)} $$

Geometric Intuition¶

  • Find the roots of $f(w)$ by following its tangent lines. The tangent line to $f$ at $w_{k-1}$ has equation $$ \ell(w) = f(w_{k-1}) + (w-w_{k-1}) f'(w_{k-1}) $$
  • Set next iterate $w$ to be root of tangent line: $$ \begin{gather} f(w_{k-1}) + (w-w_{k-1}) f'(w_{k-1}) = 0 \\ \implies \boxed{ w = w_{k-1} - \frac{f(w_{k-1})}{f'(w_{k-1})} } \end{gather} $$

Application to minimization¶

To minimize $F(w)$, find roots of $F'(w)$ via Newton's Method.

Repeat until convergence: $$ \boxed{w \leftarrow w - \frac{F'(w)}{F''(w)}} $$
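
A minimal sketch of this iteration for a made-up objective, $F(w) = w^4 - 3w^2 + w$, with the derivatives coded by hand.

In [ ]:
# Newton's method for minimizing F(w) = w**4 - 3*w**2 + w (a made-up example)
def Fp(w):    # F'(w)
    return 4 * w**3 - 6 * w + 1

def Fpp(w):   # F''(w)
    return 12 * w**2 - 6

w = 2.0                    # initial guess
for _ in range(20):        # repeat until convergence (fixed count here)
    w = w - Fp(w) / Fpp(w)
print(w, Fp(w))            # F'(w) should be ~0 at the returned point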

Minimization in multivariate case¶

Replace second derivative with the Hessian Matrix, $$ H_{ij}(w) = \frac{\partial^2 F}{\partial w_i \partial w_j} $$

Newton update becomes: $$ \boxed{ w \leftarrow w - H^{-1} \nabla_w F} $$

  • Advantage: Fast convergence (quadratic)
  • Disadvantage: Computing Hessian and its inverse is slow
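
A sketch of the multivariate update for a hypothetical convex objective; in practice the Newton direction is obtained by solving the linear system $Hp = \nabla_w F$ rather than forming $H^{-1}$ explicitly.

In [ ]:
import numpy as np

# Hypothetical objective: F(w) = 0.5 w^T A w - b^T w + 0.25 ||w||^4
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(w):
    return A @ w - b + (w @ w) * w

def hess(w):
    return A + 2.0 * np.outer(w, w) + (w @ w) * np.eye(2)

w = np.array([1.0, 1.0])
for _ in range(20):
    p = np.linalg.solve(hess(w), grad(w))   # Newton direction from H p = grad F
    w = w - p
print(w, grad(w))                           # gradient should be ~0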

Sometimes line search is necessary¶

Let $p=-H^{-1} \nabla_w F$ and find a step size $\alpha$ by solving $$ \min_\alpha F(w+\alpha p) $$ Then $$ \boxed{ w \leftarrow w - \alpha H^{-1} \nabla_w F} $$
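
In practice the exact minimization over $\alpha$ is usually replaced by an inexact backtracking (Armijo) search; below is a minimal sketch, where the constants `c` and `shrink` are typical but arbitrary choices.

In [ ]:
import numpy as np

def backtracking_step(F, gradF, w, p, alpha0=1.0, c=1e-4, shrink=0.5):
    # Shrink alpha until the Armijo sufficient-decrease condition holds
    alpha, f0, g0 = alpha0, F(w), gradF(w)
    while F(w + alpha * p) > f0 + c * alpha * (g0 @ p):
        alpha *= shrink
    return alpha

# Tiny usage on a made-up quadratic, stepping along the negative gradient
F = lambda w: 0.5 * w @ w
gradF = lambda w: w
w = np.array([2.0, -1.0])
p = -gradF(w)
print(backtracking_step(F, gradF, w, p))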

Issues¶

  • Computing the Hessian is expensive
    • Approximately solve the linear system instead of forming $H^{-1}$ explicitly
    • Quasi-Newton methods: Construct an approximate $H$ from past gradients (e.g. BFGS, via the secant equation; see the sketch after this list) $$\mbox{Scalar:}\quad F''(w_{k+1}) \approx [F'(w_{k+1}) - F'(w_k)]/(w_{k+1}-w_k)$$ $$\mbox{Vector:}\quad H_{k+1}(w_{k+1}-w_k) \approx \nabla_w F(w_{k+1}) - \nabla_w F(w_k)$$
  • Newton-type methods do not work well for non-convex problems
    • They only solve for $\nabla_w F=0$
    • The solution can be a saddle point, or even a maximizer!
    • One remedy is the Levenberg-Marquardt algorithm (regularizing the Hessian) $$\tilde{H} = \lambda I + H,\quad \lambda>0$$
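
Quasi-Newton methods are also readily available off the shelf; as a sketch, here is SciPy's BFGS applied to a made-up Rosenbrock-type objective with a hand-coded gradient.

In [ ]:
import numpy as np
from scipy.optimize import minimize

# Made-up smooth, non-convex objective and its gradient
def F(w):
    return (1 - w[0])**2 + 100 * (w[1] - w[0]**2)**2

def gradF(w):
    return np.array([-2 * (1 - w[0]) - 400 * w[0] * (w[1] - w[0]**2),
                     200 * (w[1] - w[0]**2)])

res = minimize(F, x0=np.array([-1.5, 0.5]), jac=gradF, method='BFGS')
print(res.x)   # should approach [1, 1]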

Unfortunately, saddle points prevail in high dimensions¶

The $\epsilon$ vs. $\alpha$ curve [1]

  • $\alpha$: Percentage of negative eigenvalues of Hessian at critical point (i.e. gradient=0)
  • $\epsilon$: Error at critical point
  • There are many saddle points (the figure is for a relatively small NN)
  • Strong positive correlation between $\alpha$ and $\epsilon$, esp. in high-dim optimization

[1] Y. Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NeurIPS 2014.

Intuitively, in high dimensions, the chance that all directions around a critical point curve upward (positive curvature) is exponentially small in the number of dimensions.

Implications:

  • If you get a "converged" model and it does not perform well, the model is likely stuck at a saddle point
  • The local minima are likely to be as good as the global minima
    • For even larger problems, even saddle points with low index (i.e., few descent directions), which are more common, tend to have sufficiently low error
  • Since Hessian-based methods can be problematic, we will just use gradient descent

Simple Gradient Descent¶

Batch mode¶

Given a dataset $\{ (\vec{x}^{(n)}, y^{(n)}) \}_{n=1}^N$ and an initial guess $\vec{w}_0$, repeat until convergence (a NumPy sketch follows below):

  • $\vec{w} = \vec{w} - \eta \nabla_{\vec{w}} E(\vec{w})$

where $\eta$ is the step size, or learning rate; assuming a quadratic loss,

$$ \nabla_{\vec{w}} E(\vec{w}) = \sum_{n=1}^N \left[ f\left( \vec{x}^{(n)} ; \vec{w} \right) - y^{(n)} \right] \nabla_{\vec{w}} f\left( \vec{x}^{(n)} ; \vec{w} \right) $$
  • Disadvantage: Each iteration can be slow if the network is large and the dataset is huge
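
A sketch of the batch update for the simplest case, a linear model $f(\vec{x};\vec{w})=\vec{w}^\top\vec{x}$ with quadratic loss; the synthetic data, learning rate, and iteration count are made up for illustration.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # N = 200 samples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)   # noisy synthetic targets

w, eta = np.zeros(3), 1e-3
for _ in range(500):                          # repeat until convergence (fixed count here)
    r = X @ w - y                             # residuals f(x; w) - y over the whole dataset
    grad = X.T @ r                            # batch gradient: sum over all N samples
    w = w - eta * grad
print(w)                                      # should be close to w_true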

Stochastic mode, a.k.a., Stochastic Gradient Descent (SGD)¶

Main idea: Instead of computing the batch gradient (over the entire training set), compute the gradient for an individual example and update immediately.

Repeat until convergence: for $n=1,\dots,N$

$$ w := w - \eta \nabla_\vec{w} E(\vec{w} | \vec{x}^{(n)}) $$

where

$$ \nabla_{\vec{w}} E(\vec{w} \,|\, \vec{x}^{(n)}) = \left[ f\left( \vec{x}^{(n)} ; \vec{w} \right) - y^{(n)} \right] \nabla_{\vec{w}} f\left( \vec{x}^{(n)} ; \vec{w} \right) $$
  • Advantage: Easier to compute, esp. when the dataset is huge; it can sometimes escape local minima
  • Disadvantage: The gradient becomes noisy, which impacts convergence; the Hessian is not well-defined even if we wanted to use it
  • In between lies the "mini-batch" mode (see the sketch after this list)
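
The corresponding mini-batch version on the same kind of synthetic linear-regression problem (setting `batch = 1` recovers plain SGD); the batch size and learning rate are arbitrary choices.

In [ ]:
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)

w, eta, batch = np.zeros(3), 0.05, 16
for epoch in range(20):
    idx = rng.permutation(len(y))             # shuffle every epoch
    for start in range(0, len(y), batch):
        j = idx[start:start + batch]          # a mini-batch (batch = 1 gives plain SGD)
        r = X[j] @ w - y[j]
        grad = X[j].T @ r / len(j)            # noisy estimate of the full-batch gradient
        w = w - eta * grad
print(w)                                      # should be close to w_true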

Comparison - Batch mode seems more reliable?¶

Comparison - but if we look at cost¶

  • Think of L-BFGS as a stand-in for batch GD here
  • SGD shows fast initial progress, though it slows down later.

Let $n$ be the training-set size (i.e., the batch size in batch mode), $\epsilon\sim 10^{-3}$ the target final error, and $T$ the allowable time; then [1]

  • Batch: Work$\sim n\log(1/\epsilon)$, Error$\sim(\log(T)+1)/T$ - better to focus on a certain number of samples
  • Stochastic: Work$\sim 1/\epsilon$, Error$\sim 1/T$ - just process as many samples as possible

[1] L. Bottou, et al. Stochastic Gradient Methods For Large-Scale Machine Learning. Arxiv: 1606.04838; Also here as slides: http://users.iems.northwestern.edu/~nocedal/ICML

Practical SGD methods¶

$$ w := w - \eta \nabla_\vec{w} E(\vec{w} | \vec{x}^{(n)}) $$

  • How to determine learning rate?
  • How to improve convergence rate?
  • How to get out of saddle points?

Fancier variants¶

SGD with momentum¶

Let the optimizer remember the previous gradients, $$ v_k = \rho v_{k-1} + (1-\rho)\nabla_w F(w_k) $$ Then $$ \boxed{ w_{k+1} = w_k - \eta v_k} $$

  • Momentum is a weighted sum of past gradients; the weights of older gradients decay exponentially
  • Less sensitive to gradient noise
  • Easier to pass through local minima
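
A sketch of this update written as a small stateful step function; the default $\eta$ and $\rho$ are common choices, not prescribed here.

In [ ]:
import numpy as np

def sgd_momentum_step(w, grad, v, eta=0.01, rho=0.9):
    # v_k = rho * v_{k-1} + (1 - rho) * grad;  w_{k+1} = w_k - eta * v_k
    v = rho * v + (1 - rho) * grad
    return w - eta * v, v

# Usage pattern (the gradient here is just a made-up example)
w, v = np.array([1.0, -2.0]), np.zeros(2)
w, v = sgd_momentum_step(w, grad=np.array([0.5, 0.1]), v=v)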

AdaGrad¶

Let the optimizer tune learning rate automatically, $$ g_k = g_{k-1} + [\nabla_w F(w_k)]^2 $$ Then $$ \boxed{ w_{k+1} = w_k - \frac{\eta}{\sqrt{g_k}+\epsilon} \nabla_w F(w_k)} $$

  • Note that the square and square root are element-wise; $\epsilon\sim 10^{-6}$ avoids division by zero
  • The effective learning rate decreases as the iterate travels farther (i.e., accumulates more squared gradient)
  • Now $\eta$ can be very large - it will become small anyway
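
A sketch of the AdaGrad update in the same step-function style as the momentum sketch above; `g` carries the accumulated squared gradients between calls.

In [ ]:
import numpy as np

def adagrad_step(w, grad, g, eta=1.0, eps=1e-6):
    # g_k = g_{k-1} + grad**2 (element-wise); the effective step shrinks over time
    g = g + grad**2
    return w - eta / (np.sqrt(g) + eps) * grad, g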

RMSProp¶

Similar to AdaGrad, but allows the optimizer to "forget", $$ g_k = \alpha g_{k-1} + (1-\alpha)[\nabla_w F(w_k)]^2 $$ Then $$ \boxed{ w_{k+1} = w_k - \frac{\eta}{\sqrt{g_k}+\epsilon} \nabla_w F(w_k)} $$

  • Like momentum, the old squared gradients decay exponentially. Typically $\alpha=0.9$.
  • The rest is the same as AdaGrad.
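
The corresponding RMSProp sketch; only the accumulation line changes relative to AdaGrad.

In [ ]:
import numpy as np

def rmsprop_step(w, grad, g, eta=0.01, alpha=0.9, eps=1e-6):
    # Like AdaGrad, but the squared-gradient history decays with factor alpha
    g = alpha * g + (1 - alpha) * grad**2
    return w - eta / (np.sqrt(g) + eps) * grad, g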

Adam¶

Combines the best of SGD with momentum and RMSProp, $$v_k = \beta_1 v_{k-1} + (1-\beta_1)\nabla_w F(w_k)$$ $$g_k = \beta_2 g_{k-1} + (1-\beta_2)[\nabla_w F(w_k)]^2$$ Then $$ \boxed{ w_{k+1} = w_k - \frac{\eta}{\sqrt{g_k}+\epsilon} v_k} $$

  • Usually $v_k$ and $g_k$ are bias-corrected before updating $w$, $$ v_k \leftarrow \frac{v_k}{1-\beta_1^k},\quad g_k \leftarrow \frac{g_k}{1-\beta_2^k} $$
  • The suggested defaults are $\beta_1=0.9$, $\beta_2=0.999$.
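
A sketch of the Adam update including the bias correction, with the commonly suggested defaults; `k` is the 1-based iteration count.

In [ ]:
import numpy as np

def adam_step(w, grad, v, g, k, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-6):
    v = beta1 * v + (1 - beta1) * grad        # momentum-like term
    g = beta2 * g + (1 - beta2) * grad**2     # RMSProp-like term
    v_hat = v / (1 - beta1**k)                # bias correction
    g_hat = g / (1 - beta2**k)
    return w - eta / (np.sqrt(g_hat) + eps) * v_hat, v, g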

More methods¶

People have devised many, many other gradient-descent methods. Some perform best in specific applications, but there is no one-size-fits-all solution.

You will implement one of them in HW3.

There are also attempts to combine the advantages of different methods, e.g. [1]

[1] N. Keskar, R. Socher, Improving Generalization Performance by Switching from Adam to SGD, Arxiv 1712.07628

Visual comparison of the methods - I¶

From https://ruder.io/optimizing-gradient-descent/:

Visual comparison of the methods - II¶

From ruder.io:

Again, the loss surfaces of deep NNs are filled with saddle points

In [3]:
# gradDesc, the test functions FGH/FGHS, the optimizers (BBLR, BFGS, NR, fixLR, GDM,
# AdaGrad, RMSProp, Adam), and the plotting helpers are defined/imported earlier in the notebook.

# Deterministic gradients: plain GD, BFGS, or Newton-Raphson (uncomment one)
x0 = [-1.5,0.5]
# x0 = [1.0, 0.0]
# xs, fs, gs, hs = gradDesc(FGH, x0, BBLR)
# xs, fs, gs, hs = gradDesc(FGH, x0, BFGS, ifLS=True)
# xs, fs, gs, hs = gradDesc(FGH, x0, NR, ifLS=True)

# Noisy gradients: fixed-step GD, momentum, AdaGrad, RMSProp, or Adam (uncomment one)
x0 = [-1.5,0.5]
ss = 0.5  # Noise level in the gradient
# xs, fs, gs, hs = gradDesc(FGHS(ss), x0, fixLR(0.0001))
# xs, fs, gs, hs = gradDesc(FGHS(ss), x0, GDM(0.0001,0.9))
# xs, fs, gs, hs = gradDesc(FGHS(ss), x0, AdaGrad(2.0))
# xs, fs, gs, hs = gradDesc(FGHS(ss), x0, RMSProp(0.1,0.9))
xs, fs, gs, hs = gradDesc(FGHS(ss), x0, Adam(0.1,0.9,0.999))

# Plot the optimization trajectory and the convergence history
f1, ax = pltFunc()
ax[0] = pltTraj(ax[0], xs)
ax[1] = pltHist(ax[1], xs)

One more awesome interactive visualization: https://github.com/lilipads/gradient_descent_viz

Now that we know there are many saddle points and local minima...¶

  • What kind of local minima is better? The flat ones - they (empirically) give better generalization performance
  • Compare Adam and vanilla SGD [1]
    • Adam escapes saddle points more easily (see the GIF earlier)
    • SGD favors flat minima more

Additional conclusion from [1]: a small learning rate or a large batch size makes it much harder to escape from sharp minima

[1] Z. Xie. A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima. Arxiv: 2002.03495

Practical aspects in training¶

  • General procedure
  • Techniques for improved convergence and better solutions

Basics: General procedure¶

  • If there is not much data, try cross-validation to make use of all of it
    • i.e., do multiple train-test splits so that every data point is used for testing once.
    • See Linear Regression module for review.
  • Otherwise, do a split of train-valid-test.
    • Train: for decreasing the loss
    • Valid: for picking the best parameters during training (behaves like the test set in CV)
    • Test: Completely unseen in the training process, used to verify the model performance

Basics: Diagnose your loss curve¶

  • Linearly decreasing loss: increase the learning rate
  • Shaky curves: increase the batch size
  • Distant train/valid curves: overfitting
  • Overlapping train/valid curves: underfitting

Basics: More on loss curves¶

Note: the "dev" set in [1] is the validation set.

[1] A. Ng, Nuts and Bolts of Building Applications using Deep Learning, NeurIPS 2016

Intermediate: Caveats and Tuning tricks - I¶

  • Always normalize and shuffle your data!

    • But normalization factors should be based on training dataset only.
  • Pick the right learning rate: e.g., Grid search [1], or (cyclic) scheduling [2,3]

    • Use a higher learning rate for a larger batch size (and a larger batch size if you have a bigger machine)

[1] S. Gugger, How Do You Find A Good Learning Rate

[2] I. Loshchilov, SGDR: Stochastic gradient descent with warm restarts, Arxiv: 1608.03983

[3] L. Smith, Cyclical Learning Rates for Training Neural Networks, 2017

Intermediate: Caveats and Tuning tricks - II¶

  • Pick the right optimizer: Typically start with Adam, fine-tune with SGD

  • Start with ReLU or tanh for activation; avoid sigmoid unless for "gate"-like layers

  • Try overfitting on a small dataset

    • If your code/model is correct, the training should converge really fast
  • Record your training experiments - learn from history

More techniques¶

[1] R. Sun. Optimization for Deep Learning: Theory and Algorithms. arXiv:1912.08957. A nice comprehensive summary.

More techniques - Weight initialization¶

  • Basically to get better initial guesses of the weights [1,2]
  • Let $n_i$ be the input size and $n_o$ the output size, so the weight $W$ has size $(n_i,n_o)$; let $n=n_i$ (for Kaiming) or $n=(n_i+n_o)/2$ (for Xavier)
  • Uniform $U[-s,s]$
    • Xavier (tanh/sigmoid): $s=\sqrt{3/n}$
    • Kaiming (ReLU): $s=\sqrt{6/n}$
  • Gaussian $N(0,\sigma^2)$
    • Xavier (tanh/sigmoid): $\sigma=\sqrt{1/n}$
    • Kaiming (ReLU): $\sigma=\sqrt{2/n}$

[1] X. Glorot, et al. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010

[2] K. He, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852
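
A NumPy sketch of the four schemes above for a single weight matrix; the layer sizes are hypothetical placeholders.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
n_i, n_o = 64, 32              # hypothetical layer sizes
n_avg = (n_i + n_o) / 2        # "fan average", used by Xavier

# Uniform U[-s, s]
W_xavier_u  = rng.uniform(-1, 1, size=(n_i, n_o)) * np.sqrt(3 / n_avg)   # tanh/sigmoid
W_kaiming_u = rng.uniform(-1, 1, size=(n_i, n_o)) * np.sqrt(6 / n_i)     # ReLU

# Gaussian N(0, sigma^2)
W_xavier_g  = rng.normal(size=(n_i, n_o)) * np.sqrt(1 / n_avg)           # tanh/sigmoid
W_kaiming_g = rng.normal(size=(n_i, n_o)) * np.sqrt(2 / n_i)             # ReLU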

More techniques - Dropout¶

  • Randomly "throw away" neurons during training
    • so these neurons do not contribute to the prediction and their weights are not updated.
    • This is as if multiple NNs are being learned during training
  • Avoids overfitting, since now there is a "committee" of NNs

[1] G. Hinton, et al. Improving neural networks by preventing co-adaptation of feature detectors. Arxiv:1207.0580
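
A sketch of dropout applied to one layer's activations; the rescaling by $1/(1-p)$ is the common "inverted dropout" convention, an implementation choice rather than something prescribed above.

In [ ]:
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=np.random.default_rng()):
    # Randomly zero out activations with probability p_drop during training
    if not training or p_drop == 0.0:
        return h                                # no-op at test time
    mask = rng.random(h.shape) >= p_drop        # keep each neuron with prob 1 - p_drop
    return h * mask / (1.0 - p_drop)            # rescale so the expected activation is unchanged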

More techniques - Normalization within Network - I¶

  • If we think of each NN layer as a small model that we want to fit
    • Then it is natural to think about normalizing the "inputs" to these layers
    • The inputs, however, are the outputs of the preceding layer
  • Batch Normalization [1] does the trick
    • Turns out it maintains the distribution of inputs/outputs of the NN layers
    • Leads to faster convergence and perhaps better accuracy

[1] S. Ioffe, et al. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.
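
A minimal sketch of the batch-normalization transform for a fully connected layer during training (per-feature statistics over the mini-batch, followed by a learnable scale and shift); the running statistics used at inference time are omitted.

In [ ]:
import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    # H has shape (batch, features); gamma and beta are learnable per-feature parameters
    mu = H.mean(axis=0)                    # per-feature mean over the mini-batch
    var = H.var(axis=0)                    # per-feature variance over the mini-batch
    H_hat = (H - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * H_hat + beta            # scale and shift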

More techniques - Normalization within Network - II¶

  • There are more variants, too, such as weight/layer/instance/group normalization, etc.
    • See weight normalization [2], for example; it reparameterizes the weights instead of normalizing the inputs.

[2] T. Salimans, et al. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. NeurIPS 2016

More techniques - Skip connection¶

  • This is an important type of NN architecture. It improves the convergence and has deeper implications.
  • We will come back to this in L17.

Pursuing zero training error?¶

BTW, look at that deep double descent!

[1] T. Ishida, et al. Do We Need Zero Training Loss After Achieving Zero Training Error? Arxiv: 2002.08709