0. Definition of Regularization

One of the main goals in machine learning is not only to perform well on the training set, but also to generalize well to unseen test data. According to the book Deep Learning, regularization can be described as any modification made to a learning algorithm that is intended to reduce its generalization error but not its training error. Generally, there are two ways of adding regularization to a network. One is to place an extra penalty on the parameters, as the $L_1$ and $L_2$ norm penalties do by adding a term to the objective function; the other is to modify the training procedure itself, as Dropout does.

In the deep learning setting, most regularization strategies work by regularizing the estimator. This trades increased bias for reduced variance: a good regularizer is one that significantly decreases the variance while only slightly increasing the bias.

1. Parameter Norm Penalties

Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty $\Omega(\theta)$ to the objective function $J$. This can be described as: \begin{equation} \bar{J}(\theta;X,y) = J(\theta;X,y)+\alpha\Omega(\theta) \end{equation} where $\alpha \in [0, \infty)$ is a hyperparameter that weights the relative contribution of the norm penalty term $\Omega$ relative to the standard objective function $J$. When the training algorithm minimizes the regularized objective $\bar{J}$, it reduces both the original objective $J$ on the training data and, at the same time, some measure of the size of the parameters $\theta$.

Note that in a neural network the word "parameters" refers to all the weights and biases in each layer. However, when applying regularization we penalize only the weights $w$, not the biases. This is because accurately fitting a weight, which encodes the relationship between two variables, requires observing those variables together under a variety of conditions, so weights need much more data to estimate precisely than biases do. Each bias, by contrast, affects only a single variable, so its variance remains small even without regularization. What is more, regularizing the biases can introduce significant underfitting. Therefore, we apply regularization to the weights but not the biases.
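The weights-only convention above can be sketched as follows. This assumes a simple dictionary of parameters whose keys distinguish weights from biases; the naming scheme (`"W…"` vs. `"b…"`) is purely illustrative.

```python
import numpy as np

def weight_only_l2(params, alpha):
    # Accumulate (alpha/2) * ||W||_2^2 over weight matrices only;
    # bias vectors (keys starting with "b") contribute nothing.
    penalty = 0.0
    for name, value in params.items():
        if name.startswith("W"):
            penalty += 0.5 * np.sum(value ** 2)
    return alpha * penalty

params = {
    "W1": np.ones((2, 2)),    # ||W1||_2^2 = 4, so this contributes
    "b1": np.full(2, 100.0),  # large biases, but ignored by the penalty
}
print(weight_only_l2(params, alpha=0.1))  # 0.1 * 0.5 * 4 = 0.2
```
However large the biases grow, the penalty is unchanged, which is exactly the behavior the text argues for.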

1.1 $L_2$ Regularization

The $L_2$ parameter norm penalty, also known as weight decay, ridge regression, or Tikhonov regularization, adds the penalty term $\Omega(\theta) = \frac{1}{2}||w||_2^2$ to the objective. This regularization drags the weight vector closer to the origin.
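The shrinking effect can be made explicit by writing out the gradient of the regularized objective (using the notation from the previous section, with $w$ in place of $\theta$ and bias terms omitted): \begin{equation} \nabla_w \bar{J}(w;X,y) = \alpha w + \nabla_w J(w;X,y) \end{equation} so a single gradient step with learning rate $\epsilon$ becomes \begin{equation} w \leftarrow w - \epsilon\left(\alpha w + \nabla_w J(w;X,y)\right) = (1-\epsilon\alpha)\,w - \epsilon\nabla_w J(w;X,y). \end{equation}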
In short, $L_2$ regularization modifies the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update. As a result, only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact. In directions that do not contribute to reducing the objective function, a small eigenvalue of the Hessian tells us that movement in this direction will not significantly increase the gradient. Components of the weight vector corresponding to such unimportant directions are decayed away through the use of the regularization throughout training. What is more, $L_2$ regularization causes the learning algorithm to "perceive" the input $X$ as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
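The constant-factor shrink before the usual gradient update can be sketched numerically; this is a minimal sketch assuming plain gradient descent with learning rate `eps`, and all the names are illustrative.

```python
import numpy as np

def weight_decay_step(w, grad_J, eps, alpha):
    # One L2-regularized gradient step:
    #   w <- (1 - eps*alpha) * w - eps * grad_J
    # i.e. shrink the weights multiplicatively, then apply the
    # usual gradient update on the unregularized objective J.
    return (1.0 - eps * alpha) * w - eps * grad_J

w = np.array([1.0, -2.0])
grad = np.zeros(2)  # at a flat point of J, only the decay acts
w = weight_decay_step(w, grad, eps=0.1, alpha=0.5)
print(w)  # each component shrunk by the factor (1 - 0.1 * 0.5) = 0.95
```
When the gradient of $J$ vanishes, as in this example, each step multiplies the weights by $(1-\epsilon\alpha) < 1$, so components that never receive a meaningful gradient signal decay toward zero over training.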