L2 Regularization:
\(L_\lambda(\mathbf{w}) = L(\mathbf{w}) + \lambda \|\mathbf{w}\|_2^2\)
Also known as weight decay or Ridge regression.
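As a minimal sketch (not from the slides), the L2 penalty and the weight-decay term it adds to the gradient could look like the following in NumPy; `base_loss` and `base_loss_grad` are hypothetical callables standing in for \(L(\mathbf{w})\) and its gradient:

```python
import numpy as np

def l2_loss(w, base_loss, lam):
    # L_lambda(w) = L(w) + lam * ||w||_2^2
    return base_loss(w) + lam * np.sum(w ** 2)

def l2_grad(w, base_loss_grad, lam):
    # The penalty adds 2 * lam * w to the gradient, which shrinks
    # ("decays") the weights a little at every update step.
    return base_loss_grad(w) + 2.0 * lam * w
```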
L1 Regularization:
\(L_\lambda(\mathbf{w}) = L(\mathbf{w}) + \lambda \|\mathbf{w}\|_1\)
https://commons.wikimedia.org/wiki/File:Regularization.jpg
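A corresponding sketch for the L1 penalty, under the same assumptions; its (sub)gradient is \(\lambda\,\mathrm{sign}(\mathbf{w})\), which tends to push small weights exactly to zero:

```python
import numpy as np

def l1_loss(w, base_loss, lam):
    # L_lambda(w) = L(w) + lam * ||w||_1
    return base_loss(w) + lam * np.sum(np.abs(w))

def l1_grad(w, base_loss_grad, lam):
    # Subgradient of the L1 term: lam * sign(w), encouraging sparse weights.
    return base_loss_grad(w) + lam * np.sign(w)
```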
“Vanishing gradients” are a problem when training deep neural networks: backpropagation multiplies per-layer derivatives, so repeated small derivatives shrink the gradient toward zero.
Logistic function: \(\sigma(x) = \frac{1}{1 + e^{-x}}\); its derivative is at most 0.25 and saturates to zero for large \(|x|\).
ReLU: \(\text{ReLU}(x) = \max(0, x)\); its derivative is 1 for every positive input.
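A small NumPy illustration (not from the slides) of why the comparison matters: chaining many saturated logistic units multiplies many small derivatives together, while ReLU's derivative of 1 lets the gradient pass through unchanged.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_grad(x):
    s = logistic(x)
    return s * (1.0 - s)           # at most 0.25; near zero when |x| is large

def relu_grad(x):
    return (x > 0).astype(float)   # exactly 1 for every positive input

# Product of per-layer derivatives through 20 layers, all evaluated at x = 3:
x = np.full(20, 3.0)
print(np.prod(logistic_grad(x)))   # ~1e-27: the gradient has vanished
print(np.prod(relu_grad(x)))       # 1.0: the gradient is preserved
```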
Batch Normalization (Ioffe & Szegedy, 2015): around 53,000 citations on Google Scholar.
At each layer, each input dimension \(x^{(k)}\) is normalized over the batch according to:
\[\hat{x}^{(k)} = \frac{x^{(k)} - \text{E}[x^{(k)}]}{\sqrt{\text{Var}[x^{(k)}]}}\]
\(\text{E}[x^{(k)}]\) - The mean of \(x^{(k)}\) for the batch.
\(\text{Var}[x^{(k)}]\) - The variance of \(x^{(k)}\) for the batch.
The result is then scaled and shifted as:
\[y^{(k)} = \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)}\]
where \(\gamma^{(k)}\) and \(\beta^{(k)}\) are learned parameters, to restore representational power.
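A minimal NumPy sketch of this per-feature batch normalization for a (batch, features) activation matrix; the small `eps` constant for numerical stability is an assumption not shown in the equation above, and `gamma` / `beta` would be learned during training rather than fixed as here:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature k over the batch: x_hat = (x - E[x]) / sqrt(Var[x] + eps)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Scale and shift with the (learned) parameters gamma and beta.
    return gamma * x_hat + beta

# Example: a batch of 4 examples with 3 features each.
x = np.random.randn(4, 3) * 5.0 + 2.0
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0))  # ~0 per feature
print(y.var(axis=0))   # ~1 per feature
```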