L2 Regularization:
\(L_\lambda(\mathbf{w}) = L(\mathbf{w}) + \lambda \|\mathbf{w}\|_2^2\)
Also known as weight decay or Ridge regression.
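As a minimal sketch (not from the slides), the L2 penalty and the weight-decay term it adds to the gradient could look like the following in NumPy; `base_loss` and `base_loss_grad` are hypothetical callables standing in for \(L(\mathbf{w})\) and its gradient:

```python
import numpy as np

def l2_loss(w, base_loss, lam):
    # L_lambda(w) = L(w) + lam * ||w||_2^2
    return base_loss(w) + lam * np.sum(w ** 2)

def l2_grad(w, base_loss_grad, lam):
    # The penalty adds 2 * lam * w to the gradient, which shrinks
    # ("decays") the weights a little at every update step.
    return base_loss_grad(w) + 2.0 * lam * w
```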
L1 Regularization:
\(L_\lambda(\mathbf{w}) = L(\mathbf{w}) + \lambda \|\mathbf{w}\|_1\)
https://commons.wikimedia.org/wiki/File:Regularization.jpg
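A corresponding sketch for the L1 penalty, under the same assumptions; its (sub)gradient is \(\lambda\,\mathrm{sign}(\mathbf{w})\), which tends to push small weights exactly to zero:

```python
import numpy as np

def l1_loss(w, base_loss, lam):
    # L_lambda(w) = L(w) + lam * ||w||_1
    return base_loss(w) + lam * np.sum(np.abs(w))

def l1_grad(w, base_loss_grad, lam):
    # Subgradient of the L1 term: lam * sign(w), encouraging sparse weights.
    return base_loss_grad(w) + lam * np.sign(w)
```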
“Vanishing gradients” are a problem when training deep neural networks: backpropagation multiplies per-layer derivatives, so repeated small derivatives shrink the gradient toward zero.
Logistic function: \(\sigma(x) = \frac{1}{1 + e^{-x}}\); its derivative is at most 0.25 and saturates to zero for large \(|x|\).
ReLU: \(\text{ReLU}(x) = \max(0, x)\); its derivative is 1 for every positive input.
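A small NumPy illustration (not from the slides) of why the comparison matters: chaining many saturated logistic units multiplies many small derivatives together, while ReLU's derivative of 1 lets the gradient pass through unchanged.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_grad(x):
    s = logistic(x)
    return s * (1.0 - s)           # at most 0.25; near zero when |x| is large

def relu_grad(x):
    return (x > 0).astype(float)   # exactly 1 for every positive input

# Product of per-layer derivatives through 20 layers, all evaluated at x = 3:
x = np.full(20, 3.0)
print(np.prod(logistic_grad(x)))   # ~1e-27: the gradient has vanished
print(np.prod(relu_grad(x)))       # 1.0: the gradient is preserved
```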
Batch Normalization (Ioffe & Szegedy, 2015): around 53,000 citations on Google Scholar.
At each layer, each input dimension \(x^{(k)}\) is normalized over the batch according to:
\[\hat{x}^{(k)} = \frac{x^{(k)} - \text{E}[x^{(k)}]}{\sqrt{\text{Var}[x^{(k)}]}}\]
\(\text{E}[x^{(k)}]\) - The mean of \(x^{(k)}\) for the batch.
\(\text{Var}[x^{(k)}]\) - The variance of \(x^{(k)}\) for the batch.
The result is then scaled and shifted as:
\[y^{(k)} = \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)}\]
where \(\gamma^{(k)}\) and \(\beta^{(k)}\) are learned parameters, to restore representational power.
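A minimal NumPy sketch of this per-feature batch normalization for a (batch, features) activation matrix; the small `eps` constant for numerical stability is an assumption not shown in the equation above, and `gamma` / `beta` would be learned during training rather than fixed as here:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature k over the batch: x_hat = (x - E[x]) / sqrt(Var[x] + eps)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Scale and shift with the (learned) parameters gamma and beta.
    return gamma * x_hat + beta

# Example: a batch of 4 examples with 3 features each.
x = np.random.randn(4, 3) * 5.0 + 2.0
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0))  # ~0 per feature
print(y.var(axis=0))   # ~1 per feature
```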