AERSP 597 - Machine Learning in Aerospace Engineering¶

Lecture 5, Linear Regression: Regularization¶

Instructor: Daning Huang¶

TODAY: Linear Regression - II¶

  • Under/over-fitting
  • Regularization
    • Ridge regression
    • Lasso
    • Elastic net

References:¶

  • [PRML] Chp. 3.1
  • [MLaPP] Chps. 7, 13

Previously: Curve-fitting examples...¶

  • Consider 1d case (e.g. D=1)
    • Given a set of observations $x_1, \dots, x_N$
    • and corresponding target values $t_1, \dots, t_N$
  • We want to learn a function $y(x, \vec{w}) \approx t$ to predict future values.
$$ y(x, \vec{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{k=0}^M w_k x^k $$
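The plotting cells below assume a small noisy training set (x, y), a dense grid xx with the noise-free curve yy, and a helper generateData, all defined in an earlier cell that is not shown here. A minimal sketch of such a setup, assuming (as in the classic PRML example) that the data come from a noisy sine curve:

import numpy as np
import matplotlib.pyplot as plt

def generateData(N, seed=True):
    # Hypothetical data generator: N noisy samples of sin(2*pi*x) on [0, 1].
    rng = np.random.default_rng(0 if seed else None)
    x = np.sort(rng.uniform(0, 1, N))
    t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)
    return x, t

x, y = generateData(10)          # small noisy training set
xx = np.linspace(0, 1, 200)      # dense grid for plotting
yy = np.sin(2 * np.pi * xx)      # noise-free ground truth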

0th Order Polynomial¶

In [3]:
coeffs = np.polyfit(x, y, 0)       # least-squares fit of a degree-0 polynomial
poly = np.poly1d(coeffs)           # wrap the coefficients as a callable polynomial
plt.plot(xx, yy, "g--")            # ground-truth curve
plt.plot(x, y, "ro")               # noisy training data
_=plt.plot(xx, poly(xx), "b-")     # fitted polynomial

3rd Order Polynomial¶

In [4]:
coeffs = np.polyfit(x, y, 3)
poly = np.poly1d(coeffs)
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
_=plt.plot(xx, poly(xx), "b-")

12th Order Polynomial¶

In [5]:
coeffs = np.polyfit(x, y, 12)
poly = np.poly1d(coeffs)
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
_=plt.plot(xx, poly(xx), "b-")

Overfitting¶

Root-Mean-Square (RMS) Error: $E_{RMS} = \sqrt{2E(w^*) / N} = \sqrt{\frac{1}{N}\sum_{i=1}^N[y(x_i, w^*)-t_i]^2}$
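The arrays _ps, _trns, and _tsts plotted in the next cell are built in a cell that is not shown; they could be produced by a sweep over polynomial orders like the following sketch, reusing the setup above and a hypothetical held-out test set (xt, yt):

def rmse(poly, x, t):
    # Root-mean-square error of a fitted np.poly1d model on (x, t).
    return np.sqrt(np.mean((poly(x) - t) ** 2))

xt, yt = generateData(100, False)    # hypothetical held-out test set
_ps, _trns, _tsts = [], [], []
for M in range(13):
    poly = np.poly1d(np.polyfit(x, y, M))
    _ps.append(M)
    _trns.append(rmse(poly, x, y))
    _tsts.append(rmse(poly, xt, yt))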

In [7]:
f = plt.figure()
plt.plot(_ps, _trns, 'bo-', label='Training error')
plt.plot(_ps, _tsts, 'rs-', label='Test error')
plt.legend()
plt.xlabel("Order of polynomial")
_=plt.ylabel("RMSE")

Polynomial Coefficients¶

In [23]:
for _s in _d:
    print(_s)
            M=1         M=3         M=6        M=12
w0      0.66712    -0.06918     0.10366     0.12353
w1     -0.22739     1.66469     0.04525   156.79232
w2                 -0.80783     1.39030  -874.67537
w3                  0.08731    -1.05527  2007.14361
w4                              0.27977 -2543.62965
w5                             -0.03264  2006.68009
w6                              0.00147 -1045.52459
w7                                        369.91998
w8                                        -89.40320
w9                                         14.53250
w10                                         -1.51902
w11                                          0.09219
w12                                         -0.00247

Data Set Size: $N = 15$¶

(12th order polynomial)

In [10]:
# New training set
x1, y1 = generateData(15, True)
# fit
coeffs = np.polyfit(x1, y1, 12)
poly = np.poly1d(coeffs)
_=plt.plot(xx, yy, "g--", x1, y1, "ro", xx, poly(xx), "b-")

Data Set Size: $N = 100$¶

(12th order polynomial)

In [11]:
# New training set
x2, y2 = generateData(100, True)
# fit
coeffs = np.polyfit(x2, y2, 12)
poly = np.poly1d(coeffs)
_=plt.plot(xx, yy, "g--", x2, y2, "ro", xx, poly(xx), "b-")

How do we choose the degree of our polynomial?¶

That is, the number of features (basis functions).

Numerically, the issue traces back to the "conditioning" of the $\Phi^T\Phi$ matrix.

  • Recall the condition number of $A$, defined as $\kappa(A)=\|A\|\,\|A^{-1}\|$.
  • One wants $\kappa(A)$ close to 1 for "numerically stable" solutions of $Ax=b$.

Example: Solve for $x=A^{-1}b$ with $$ A=\begin{bmatrix}2\alpha & 1 \\ 1 & \frac{2}{\alpha}\end{bmatrix},\quad b=\begin{bmatrix}1 \\ 10^{10}\end{bmatrix}, $$ then check whether we recover $b$ by computing $Ax=AA^{-1}b$, which should equal $b$ exactly.

In [12]:
kA = lambda a: np.linalg.cond([[2*a,1.],[1.,2/a]])
a = 10**np.linspace(1,11,21)
k = [kA(_) for _ in a]
plt.loglog(a, k)
plt.xlabel(r'$\alpha$')
_=plt.ylabel(r'$\kappa(A)$')
In [13]:
fA = lambda a: np.array([[2*a,1.],[1.,2./a]])
sA = lambda a: fA(a).dot(np.linalg.solve(fA(a), [1,1e10]))[0]
a = 10**np.linspace(1,11,21)
s = [sA(_) for _ in a]
plt.semilogx(a, np.ones_like(a), 'r--', label='Expected')
plt.semilogx(a, s, 'b-', label='Numerical')
plt.xlabel(r'$\alpha$')
plt.ylabel(r'First element of $AA^{-1}b$')
_=plt.legend()

Rule of Thumb - I¶

  • First of all, normalize your data
    • For example, scale to $[0,1]$ using the min and max, or standardize using the mean and standard deviation: $$\hat{x}=\frac{x-x_{\min}}{x_{\max}-x_{\min}},\quad\mbox{or}\quad \hat{x}=\frac{x-\mu}{\sigma}$$
    • Scale the inputs only, or both inputs and outputs, depending on the problem (e.g., how large an offset the outputs carry)
    • If you scaled the output, remember to scale the predictions back to the original units when using your trained model.
    • The min/max or mean/std should be computed from the training dataset only (a minimal sketch follows this list)
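A minimal sketch of both options (variable names are illustrative, not taken from the lecture code):

import numpy as np

rng = np.random.default_rng(0)
x_train, x_test = rng.uniform(0, 1, 20), rng.uniform(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(20)

# Min-max scaling to [0, 1], using training-set statistics only
x_min, x_max = x_train.min(), x_train.max()
x_train_s = (x_train - x_min) / (x_max - x_min)
x_test_s = (x_test - x_min) / (x_max - x_min)     # reuse the training constants

# Standardization of the targets using the training mean and std
mu, sigma = t_train.mean(), t_train.std()
t_train_s = (t_train - mu) / sigma

# After predicting with the trained model, map scaled outputs back:
# t_pred = t_pred_s * sigma + mu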

Rule of Thumb - II¶

  • Keep the design matrix $\Phi$ tall and thin for better conditioning (see the quick check after this list), which means:
  • For a small number of datapoints, use a low degree
    • Otherwise, the model will overfit!
  • As you obtain more data, you can gradually increase the degree
    • Add more features to represent more data
    • Warning: Your model is still limited by the finite amount of data available.

Or -

  • Use regularization to control model complexity.
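A quick numerical check of the conditioning point above (a sketch, assuming 15 equally spaced training inputs on $[0,1]$): the condition number of $\Phi^T\Phi$ for a polynomial basis grows rapidly with the degree.

import numpy as np

x = np.linspace(0, 1, 15)                      # assumed training inputs
for M in (1, 3, 6, 12):
    Phi = np.vander(x, M + 1)                  # polynomial design matrix
    print(M, np.linalg.cond(Phi.T @ Phi))      # conditioning worsens quickly with M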

Regularized Linear Regression¶

Regularized Least Squares¶

  • Consider the error function $E_D(w) + \lambda E_W(w)$
    • Data term $E_D(w)$
    • Regularization term $E_W(w)$
  • With the sum-of-squares error function and quadratic regularizer, $$ \widetilde{E}(w) = \frac12 \sum_{n=1}^N (y(x_n, w) - t_n)^2 + \boxed{\frac{\lambda}{2} || w ||^2} $$
  • This is minimized by $$ w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t $$

Regularized Least Squares: Derivation¶

Recall that our objective function is

\begin{align} E(w) &= \frac12 \sum_{n=1}^N (w^T \phi(x^{(n)}) - t^{(n)})^2 + \frac{\lambda}{2} w^T w \\ &= \frac12 w^T \Phi^T \Phi w - w^T \Phi^T t + \frac12 t^T t + \frac{\lambda}{2} w^T w \end{align}

Regularized Least Squares: Derivation¶

Compute gradient and set to zero:

$$ \begin{align} \nabla_w E(w) &= \nabla_w \left[ \frac12 w^T \Phi^T \Phi w - w^T \Phi^T t + \frac12 t^T t + \frac{\lambda}{2} w^T w \right] \\ &= \Phi^T \Phi w - \Phi^T t + \lambda w \\ &= (\Phi^T \Phi + \lambda I) w - \Phi^T t = 0 \end{align} $$

Therefore, the regularized least-squares solution is $\boxed{w^* = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t}$
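A minimal NumPy sketch of this closed-form solution, checked against scikit-learn's Ridge (assuming polynomial features with the bias column included, i.e. fit_intercept=False; the data below are synthetic):

import numpy as np
from sklearn import linear_model as lm

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 20))
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(20)

lam = 1.0
Phi = np.vander(x, 7, increasing=True)         # 6th-order polynomial features

# Closed form: w = (Phi^T Phi + lam I)^{-1} Phi^T t
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

# scikit-learn's Ridge minimizes ||t - Phi w||^2 + alpha ||w||^2, so alpha = lam
ridge = lm.Ridge(alpha=lam, fit_intercept=False).fit(Phi, t)
print(np.allclose(w, ridge.coef_))             # expected: True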

Regularized Least Squares: Norms¶

We can make use of the various $L_q$ norms for different regularizers:

$$ \widetilde{E}(w) = \frac12 \sum_{n=1}^N (t_n - w^T \phi(x_n))^2 + \frac{\lambda}{2} \sum_{j=1}^M |w_j|^q $$

[Figure: unit contours of the regularizer $\sum_j |w_j|^q$ for several values of $q$]

L2 regularization is also called Ridge Regression.

Regularized Least Squares: Comparison¶

  • At the solution, the contours of the data error and of the regularization constraint are tangent to each other.
  • Lasso tends to generate sparser solutions than a quadratic regularizer.

[Figure: contours of the unregularized error together with the $L_2$ and $L_1$ constraint regions; the corner of the $L_1$ region yields a sparse solution]

Another explanation for L2 regularization, using Fourier analysis: https://arxiv.org/abs/1806.08734v1
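A small numerical illustration of the sparsity point above (a sketch with synthetic data in which only two of twenty features matter; the alpha values are arbitrary):

import numpy as np
from sklearn import linear_model as lm

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
t = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(50)

ridge = lm.Ridge(alpha=1.0).fit(X, t)
lasso = lm.Lasso(alpha=0.1).fit(X, t)
print(np.sum(np.abs(ridge.coef_) > 1e-6))      # typically all 20 coefficients are nonzero
print(np.sum(np.abs(lasso.coef_) > 1e-6))      # typically only a few survive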

Examples: Regularized regression¶

Ridge Regression¶

In [15]:
rdg_sc, rdg_coef = fitModel(1.0, lm.Ridge)
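The helper fitModel is defined in a cell that is not shown. A plausible sketch of what it might do, reusing the sketched setup (x, y, xx, yy) from above: fit the given regularized model on 12th-order polynomial features, plot the result, and return the score and coefficients.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model as lm
from sklearn.preprocessing import PolynomialFeatures

def fitModel(alpha, Model, degree=12):
    # Hypothetical reimplementation, not the lecture's exact code.
    P = PolynomialFeatures(degree)
    F = P.fit_transform(x.reshape(-1, 1))
    m = Model(alpha=alpha).fit(F, y)
    plt.plot(xx, yy, "g--", x, y, "ro",
             xx, m.predict(P.transform(xx.reshape(-1, 1))), "b-")
    return m.score(F, y), m.coef_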

Lasso¶

In [16]:
las_sc, las_coef = fitModel(0.01, lm.Lasso)

LR vs Ridge vs Lasso¶

In [17]:
ii = np.arange(13)
f = plt.figure(figsize=(8,6))
plt.semilogy(ii, np.abs(_cs[11]), 'r--', label='LR')
plt.semilogy(ii, np.abs(rdg_coef), 'b-', label='Ridge')
plt.semilogy(ii, np.abs(las_coef), 'g-', label='Lasso')
plt.legend()
_=plt.xlabel(r'$w_i$')

Bad choice of hyperparameter¶

In [18]:
_, _ = fitModel(1e-8, lm.Ridge)
In [19]:
_, _ = fitModel(1e8, lm.Ridge)

Choosing the best hyperparameter¶

In [21]:
f = plt.figure()
plt.plot(np.log(lams), errs[0], 'b-', label='Training error')
plt.plot(np.log(lams), errs[1], 'r-', label='Test error')
plt.xlabel(r'$\log(\lambda)$')
_=plt.ylabel(r'$1-R^2$')
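The arrays lams and errs plotted above come from a cell that is not shown; a sketch of such a sweep, assuming the sketched training set (x, y), test set (xt, yt), and 12th-order polynomial features from earlier:

from sklearn.preprocessing import PolynomialFeatures

P = PolynomialFeatures(12)
F_train = P.fit_transform(x.reshape(-1, 1))
F_test = P.transform(xt.reshape(-1, 1))

lams = 10.0 ** np.linspace(-8, 4, 25)
errs = [[], []]
for lam in lams:
    m = lm.Ridge(alpha=lam).fit(F_train, y)
    errs[0].append(1 - m.score(F_train, y))    # training error, measured as 1 - R^2
    errs[1].append(1 - m.score(F_test, yt))    # test error, measured as 1 - R^2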

Regularized Least Squares: Summary¶

  • Simple modification of linear regression
  • L2 Regularization controls the tradeoff between fitting error and complexity.
    • Small L2 regularization results in complex models, but with risk of overfitting
    • Large L2 regularization results in simple models, but with risk of underfitting
  • It is important to find a regularization strength that balances the two
  • More on model selection in future lectures

One more model... Elastic net¶

  • Combination of Ridge and Lasso (a usage sketch follows this list) $$ \widetilde{E}(w) = \frac12 \sum_{n=1}^N (t_n - w^T \phi(x_n))^2 + \lambda\left( \frac{1-\alpha}{2} \sum_{j=1}^M |w_j|^2 + \alpha\sum_{j=1}^M |w_j| \right) $$
  • Works better when the features are correlated.
  • More details:
    • [MLaPP] Chp. 13.5.3
    • H. Zou and T. Hastie, Regularization and variable selection via the elastic net, 2005
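In scikit-learn this model is ElasticNet, where alpha plays the role of $\lambda$ (up to scikit-learn's $1/(2N)$ scaling of the data term) and l1_ratio plays the role of $\alpha$. A minimal usage sketch with two strongly correlated synthetic features:

import numpy as np
from sklearn import linear_model as lm

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(50)    # feature 1 nearly duplicates feature 0
t = X[:, 0] + 0.1 * rng.standard_normal(50)

enet = lm.ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, t)
print(enet.coef_[:3])    # correlated features tend to share weight, unlike pure Lasso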

Yet more models...¶

The bias of the Lasso estimate for a truly nonzero variable is about $\lambda$ for large regression coefficients.

To fix the issue: Adaptive Lasso, MCP, SCAD

  • for those who are interested: https://myweb.uiowa.edu/pbreheny/7600/s16/notes/2-29.pdf

When it comes to deep learning¶

We will see the Double-Descent phenomenon - as the number of features keeps increasing past the point where the model interpolates the training data, the training error stays near zero while the test error can start decreasing again.

To be discussed more in L7.