Interaction of Feature Size and Regularization

In this post we’re going to explore how estimated feature size is affected by regularization. The intuition is that the shrinkage applied by regularization means low-amplitude features get (even more) swamped by additive noise: their estimated values are dominated by the noise and don’t generalize, so cross validation would push us to drop them.

We’re going to model the situation using a single feature with amplitude $A$ that is corrupted by additive Gaussian noise with variance $\sigma^2$. We’re going to estimate $A$ from $n$ independent observations, and then compute our test error on $n$ new observations. Call our observations $$ y_i = A + \veps_i, \quad \veps_i \sim \mathcal{N}(0, \sigma^2).$$
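If you want to play with this numerically, here’s a minimal sketch of the setup in Python; the particular values of A, sigma, and n are arbitrary choices for illustration, not anything from the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values (assumptions, not from the derivation).
A, sigma, n = 1.0, 2.0, 50

# n noisy observations of the single feature: y_i = A + eps_i.
y = A + sigma * rng.normal(size=n)

# The unregularized estimate is just the sample mean.
print(y.mean())  # close to A, with standard error sigma / sqrt(n)
```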

Ridge Regression

We’ll try ridge regression first. Our regularized estimate of $A$ will be $$\mu = \argmin_\mu \; {1 \over 2} \sum_i (y_i - \mu)^2 + {\lambda \over 2} \mu^2.$$ Solving this for $\mu$ we get
$$ -\sum_i (y_i - \mu) + \lambda \mu = 0 \implies (n + \lambda) \mu = \sum_i y_i \implies \mu = {\overline{y} \over n + \lambda}, \quad \overline{y} \triangleq \sum_i y_i.$$

So the regularization acts as if we’d had $\lambda$ observations with value 0, and shrinks the value of $\mu$ down accordingly.
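A quick sanity check on the closed form: the sketch below compares $\overline{y}/(n+\lambda)$ with a brute-force grid minimization of the ridge objective. The parameter values and the grid are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
A, sigma, n, lam = 1.0, 2.0, 50, 10.0   # illustrative values
y = A + sigma * rng.normal(size=n)

# Closed-form ridge estimate: mu = sum(y) / (n + lambda).
mu_closed = y.sum() / (n + lam)

# Brute-force check: minimize (1/2) sum (y_i - mu)^2 + (lambda/2) mu^2 on a grid.
grid = np.linspace(-3, 3, 60_001)
objective = 0.5 * ((y[:, None] - grid[None, :]) ** 2).sum(axis=0) + 0.5 * lam * grid**2
mu_grid = grid[objective.argmin()]

print(mu_closed, mu_grid)  # should agree to within the grid spacing
```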

Training Error

What’s our expected training error? We can find this by summing the error over the training data and taking the expectation; the cross terms vanish because the $\veps_i$ are independent. \begin{align} \text{Err}_\text{train} &= \sum_i (y_i - \mu)^2 \\&= \sum_i \left(A + \veps_i - {n A + \sum_j \veps_j \over n + \lambda}\right)^2 \\&= \sum_i \left({\lambda \over n + \lambda}A + {n + \lambda - 1 \over n + \lambda} \veps_i - \sum_{j \neq i} {\veps_j \over n + \lambda}\right)^2 \\&= \sum_i \left[\left({\lambda \over n + \lambda}\right)^2 A^2 + \left({n + \lambda - 1 \over n + \lambda}\right)^2 \sigma^2 + \sum_{j \neq i} {\sigma^2 \over (n + \lambda)^2}\right] \\ {1 \over n} \text{Err}_\text{train} &= \left({\lambda \over n + \lambda}\right)^2 A^2 + \left({n + \lambda - 1 \over n + \lambda}\right)^2 \sigma^2 + {\sigma^2 (n - 1) \over (n + \lambda)^2}\\ &= \left({\lambda \over n + \lambda}\right)^2 A^2 + {(n + \lambda - 1)^2 + (n-1) \over (n + \lambda)^2} \sigma^2. \end{align} Both terms are increasing functions of $\lambda$ — the derivative of the second term is $2 \lambda / (n + \lambda)^3$ — so the training error is minimized at $\lambda = 0$, as expected.
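The algebra is easy to check by Monte Carlo; the sketch below compares the simulated per-sample training error with the formula above. The parameter values and number of trials are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
A, sigma, n, lam = 1.0, 2.0, 20, 5.0   # illustrative values
trials = 200_000

# Simulate: for each trial, form the ridge estimate and measure training error.
eps = sigma * rng.normal(size=(trials, n))
y = A + eps
mu = y.sum(axis=1) / (n + lam)                    # ridge estimate per trial
err_train = ((y - mu[:, None]) ** 2).sum(axis=1)  # training error per trial

# Per-sample training error: simulation vs. the closed-form expression.
simulated = err_train.mean() / n
formula = (lam / (n + lam)) ** 2 * A**2 \
          + ((n + lam - 1) ** 2 + (n - 1)) / (n + lam) ** 2 * sigma**2
print(simulated, formula)
```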

Test Error

To compute the test error, we compute the expected squared error when we sample a new batch of $n$ data points $y_i = A + \eta_i$, where the $\eta_i \sim \mathcal{N}(0, \sigma^2)$ are fresh noise samples independent of the training noise. \begin{align} \text{Err}_\text{test} &= \sum_i (y_i - \mu)^2\\ &= \sum_i \left(A + \eta_i - {n A + \sum_j \veps_j \over n + \lambda}\right)^2\\&= \sum_i \left({\lambda \over n + \lambda} A + \eta_i - {\sum_j \veps_j \over n + \lambda}\right)^2\\&= \sum_i \left[\left({\lambda \over n + \lambda}\right)^2 A^2 + \sigma^2 + {n \sigma^2 \over (n + \lambda)^2}\right]\\ {1 \over n} \text{Err}_\text{test} &= \left({\lambda \over n + \lambda}\right)^2 A^2 + { n \over (n + \lambda)^2} \sigma^2 + \sigma^2 .\end{align}
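The same kind of check works for the test error, drawing fresh test noise that is independent of the training noise; again, the parameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
A, sigma, n, lam = 1.0, 2.0, 20, 5.0   # illustrative values
trials = 200_000

# The ridge estimate depends on the training data only through
# sum(y) ~ N(n A, n sigma^2), so we can sample it directly.
mu = (n * A + sigma * np.sqrt(n) * rng.normal(size=trials)) / (n + lam)

# Fresh test batch of n points per trial.
y_test = A + sigma * rng.normal(size=(trials, n))
err_test = ((y_test - mu[:, None]) ** 2).sum(axis=1)

# Per-sample test error: simulation vs. the closed-form expression.
simulated = err_test.mean() / n
formula = (lam / (n + lam)) ** 2 * A**2 + n / (n + lam) ** 2 * sigma**2 + sigma**2
print(simulated, formula)
```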

We could normalize the amplitude with respect to the noise level at this point, but I think leaving it as is makes it clearer what’s going on. The first term tells us how much of the true amplitude we’re missing. When regularization is low, we capture most of $A$, so our error from that part is low. In the limit of infinite regularization, where our estimate is $\mu = 0$, we pay the full error contribution of $A^2$ from that part.

The second term reflects the training noise we pick up in our estimate. When regularization is 0, we take all the noise into our estimate, so, after averaging, we get an error contribution of $\sigma^2/n$. When regularization goes to infinity, this contribution goes to zero as our estimate shrinks to zero.

We can take the ratio of these two contributions — the noise we absorb over the signal we miss — as an overfitting index: $$\text{overfit} = {n \over \lambda^2} {\sigma^2 \over A^2}.$$ This behaves as we would expect with respect to all the parameters, though it’s a bit strange that it says overfitting increases with $n$ — although I suppose that reflects the noise that we accumulate in the estimate.

The final term in the test error is the irreducible contribution due to the new noise samples in the test data.

Let’s write the test error out again: $${1 \over n} \text{Err}_\text{test} = \left({\lambda \over n + \lambda}\right)^2 A^2 + \left({ n \over n + \lambda} \right)^2 {\sigma^2 \over n} + \sigma^2.$$

As $\lambda$ ranges from 0 to $\infty$, the first term goes from 0 to $A^2$, while the second term goes from $\sigma^2 / n$ to 0.

If the coefficients summed to 1, the problem would be easy: set $\lambda$ to 0 if $A^2 > \sigma^2/ n$, to $\infty$ otherwise.

Unfortunately, the coefficients don’t sum to 1. For example, if $A^2 > \sigma^2/n$, we can start at $\lambda = 0$ and get a reducible error of $\sigma^2/n$. As we increase $\lambda$, however, the rate at which the noise contribution falls isn’t necessarily the same as the rate at which the $A^2$ contribution grows, so it’s not clear that we shouldn’t increase $\lambda$ to some intermediate value.

So let’s do the calculation. Instead of working with $\lambda$, let’s work with $p = \lambda / (n + \lambda)$. Then the reducible part of our test error is $$ E(p) = p^2 A^2 + (1 - p)^2 {\sigma^2 \over n}.$$ Taking the derivative and setting it to zero, we get $$ 2p A^2 - 2(1 - p) {\sigma^2 \over n} = 0 \implies p = {\sigma^2 \over \sigma^2 + n A^2} \implies \lambda = {\sigma^2 \over A^2}.$$ The second derivative is positive, so this is a minimum.
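We can confirm the optimum numerically by minimizing the closed-form test error over a grid of $\lambda$ and comparing the minimizer with $\sigma^2 / A^2$; the parameter values and the grid are arbitrary assumptions.

```python
import numpy as np

A, sigma, n = 0.5, 2.0, 20   # illustrative values

def test_error(lam):
    """Per-sample expected test error of the ridge estimate (from the derivation)."""
    return (lam / (n + lam)) ** 2 * A**2 + n / (n + lam) ** 2 * sigma**2 + sigma**2

lams = np.linspace(0.0, 100.0, 1_000_001)
lam_star = lams[np.argmin(test_error(lams))]

print(lam_star, sigma**2 / A**2)  # both should be about 16 for these values
```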

The result makes sense: as $A$ increases relative to $\sigma$, the optimal regularization should decrease.

Our estimator with minimum test error then becomes $$\mu^* = {\overline{y} \over n + {\sigma^2 \over A^2}} = { n A + \mathcal{N}(0, n\sigma^2) \over n + {\sigma^2 \over A^2}} = { A + \mathcal{N}(0, \sigma^2/n) \over 1 + {\sigma^2 / n \over A^2}} = { \mathcal{N}(A, \sigma^2/n) \over 1 + \text{SNR}^{-1}},$$ where $\text{SNR} \triangleq {A^2 \over \sigma^2/n} = {n A^2 \over \sigma^2}$.
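To make the shrinkage view concrete, the sketch below computes the optimally regularized estimate two ways: directly from the closed form, and by shrinking the sample mean by $1/(1 + \text{SNR}^{-1})$. The parameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
A, sigma, n = 0.5, 2.0, 20   # illustrative values

y = A + sigma * rng.normal(size=n)

# Optimal ridge penalty from the derivation, and the resulting estimate.
lam_star = sigma**2 / A**2
mu_star = y.sum() / (n + lam_star)

# Equivalent view: shrink the sample mean by 1 / (1 + 1/SNR), SNR = n A^2 / sigma^2.
snr = n * A**2 / sigma**2
mu_shrunk = y.mean() / (1 + 1 / snr)

print(mu_star, mu_shrunk)  # identical up to floating point
```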

So, at least with $\ell_2$ regularization, all we get is shrinkage – we don’t actually have the estimator reporting zero if the amplitude is too low.

Lasso

Shrinkage was to be expected from ridge regression; what about $\ell_1$ regularization? That’s what’s effectively happening in the nuclear norm minimization, which penalizes the $\ell_1$ norm of the singular values.

We’ll follow the same pattern as before. The $\ell_1$-regularized estimate of $A$ is
$$ \argmin_\mu \; {1 \over 2} \sum_i (y_i - \mu)^2 + \lambda |\mu|.$$ Taking the derivative and setting it to zero, we get the soft-thresholding solution \begin{align}-\sum_i (y_i - \mu) + \lambda \sgn{\mu} = 0 &\implies {\overline{y} \over n} - \mu - {\lambda \over n} \sgn{\mu} = 0\\ &\implies \mu = \begin{cases}\left({|\overline{y}| \over n} - \lambda'\right) \sgn{\overline{y}} & {|\overline{y}| \over n} \ge \lambda' \\ 0 & \text{otherwise} \end{cases}\\&= T_{\lambda'}\!\left({\overline{y} \over n}\right), \end{align} where $\lambda' \triangleq \lambda/n$; recall that $\overline{y}$ is the sum of the observations, so $\overline{y}/n$ is the sample mean.
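Here’s a small check of the soft-thresholding solution against a brute-force minimization of the $\ell_1$ objective; the parameter values and the grid are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
A, sigma, n, lam = 0.3, 2.0, 20, 15.0   # illustrative values
y = A + sigma * rng.normal(size=n)

# Soft threshold of the sample mean at level lambda' = lambda / n.
def soft_threshold(x, t):
    return np.sign(x) * max(abs(x) - t, 0.0)

mu_closed = soft_threshold(y.mean(), lam / n)

# Brute-force check: minimize (1/2) sum (y_i - mu)^2 + lambda * |mu| on a grid.
grid = np.linspace(-3, 3, 60_001)
objective = 0.5 * ((y[:, None] - grid[None, :]) ** 2).sum(axis=0) + lam * np.abs(grid)
mu_grid = grid[objective.argmin()]

print(mu_closed, mu_grid)  # should agree to within the grid spacing
```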

Training Error

The training error is \begin{align} \text{Err}_\text{train} &= \sum_i (y_i - \mu)^2 \\&= \sum_i \left(A + \veps_i - T_{\lambda'}(\overline{y}/n)\right)^2, \end{align} and, splitting on whether the threshold zeros the estimate, its expectation is \begin{align} \mathbb{E}\,\text{Err}_\text{train} &= p_0\, \mathbb{E}\Big[\sum_i (A + \veps_i)^2 \,\Big|\, \mu = 0\Big] + (1 - p_0)\, \mathbb{E}\Big[\sum_i \Big(A + \veps_i - \Big({|\overline{y}| \over n} - \lambda'\Big)\sgn{\overline{y}}\Big)^2 \,\Big|\, \mu \neq 0\Big], \end{align} where $p_0$ is the probability that the soft thresholding actually kicks in and outputs a zero. What will make this complicated is that conditioning on whether $\mu = 0$ constrains the $\veps_i$, so we can't treat the probability and those values independently.
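To see the conditioning issue concretely, the sketch below estimates $p_0$ and the two conditional per-sample errors by Monte Carlo; the parameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
A, sigma, n, lam = 0.3, 2.0, 20, 15.0   # illustrative values
trials = 200_000

eps = sigma * rng.normal(size=(trials, n))
y = A + eps
ybar = y.mean(axis=1)

# Soft-threshold the sample mean at level lambda' = lambda / n.
lam_p = lam / n
mu = np.sign(ybar) * np.maximum(np.abs(ybar) - lam_p, 0.0)

zeroed = (mu == 0.0)
p0 = zeroed.mean()

err = ((y - mu[:, None]) ** 2).sum(axis=1) / n
print(p0)                   # estimate of p_0
print(err[zeroed].mean())   # per-sample training error given mu == 0
print(err[~zeroed].mean())  # per-sample training error given mu != 0
# Note: the eps_i are not typical draws within each branch; the conditioning
# on whether mu == 0 is exactly what couples p_0 and the error terms.
```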

$$\blacksquare$$

