SF6 Linear Model Selection
Methods of model selection:
Today we introduce model selection. There are three main approaches:
- Subset selection
- Shrinkage methods
- Dimension reduction methods
We will go through each of these three approaches in turn.
Subset selection
Best Subset Selection
- start from the null model (intercept only)
- within each model size, pick the best model by training error
- choose the final model among these using an estimate of testing error (indirect or direct estimate)
Best subset selection is not recommended when there are more than about 40 predictors, since it must fit $2^p$ models ($2^{40}$ already at $p = 40$).
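As a minimal sketch (assuming the ISLR Hitters data used in the code demo below), best subset selection can be run with regsubsets() from the leaps package; nvmax = 19 searches every subset size up to all 19 predictors:
library(leaps)
library(ISLR)
Hitters <- na.omit(Hitters)  # drop rows with missing Salary
# exhaustive search: the best model of each size from 1 to 19 predictors
regfit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
summary(regfit.full)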
Forward Stepwise Selection
Far fewer models are fit: only $\frac{p^2+p}{2}$ (plus the null model), instead of $2^p$.
Backward Stepwise Selection
Restriction: $n$ must be greater than $p$ (otherwise the full least squares model cannot be fit).
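A sketch of the greedy alternatives, reusing the setup above: regsubsets() runs forward or backward stepwise selection through its method argument:
# forward: start from the null model and add one predictor at a time
regfit.fwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
# backward: start from the full model and drop one predictor at a time (needs n > p)
regfit.bwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")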
Choosing the Optimal Model
Indirect estimate of testing error
- $C_p = \frac{1}{n}(RSS+2d\hat{\sigma}^2)$
n: number of observations, d: number of predictors, $\hat{\sigma}^2$: estimate of the variance of the error
- $AIC = \frac{1}{n\hat{\sigma}^2}(RSS+2d\hat{\sigma}^2)$; in general, $AIC = -2\log L+2d$
L: the maximized value of the likelihood function; the second expression is the general AIC definition, the first is the special case for least squares linear regression
- $BIC=\frac{1}{n}(RSS+\log(n)\,d\hat{\sigma}^2)$
When $n > 7$, $\log(n) > 2$, so BIC penalizes models with many predictors more heavily than $C_p$ or AIC; BIC therefore tends to favor models with fewer parameters (the criteria are compared in the sketch after this list).
- $\text{Adjusted } R^2 = 1- \frac{RSS/(n-d-1)}{TSS/(n-1)}$
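A sketch of how these criteria are used in practice: summary() on a regsubsets fit reports Cp, BIC and adjusted R^2 for the best model of each size (the best subset fit is repeated here so the block stands alone):
library(leaps)
library(ISLR)
Hitters <- na.omit(Hitters)
reg.summary <- summary(regsubsets(Salary ~ ., data = Hitters, nvmax = 19))
which.min(reg.summary$cp)     # model size minimizing Cp
which.min(reg.summary$bic)    # model size minimizing BIC
which.max(reg.summary$adjr2)  # model size maximizing adjusted R^2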
Direct estimate of testing error
- Validation
- Cross Validation
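A minimal sketch of the direct approach with a single validation split (a hypothetical 2/3 vs 1/3 split of Hitters; cross-validation repeats this over several folds and averages the errors):
library(ISLR)
Hitters <- na.omit(Hitters)
set.seed(1)
train <- sample(nrow(Hitters), round(nrow(Hitters) * 2/3))
fit <- lm(Salary ~ ., data = Hitters[train, ])
pred <- predict(fit, newdata = Hitters[-train, ])
mean((Hitters$Salary[-train] - pred)^2)  # validation-set estimate of the test MSE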
Shrinkage Methods
- Ridge regression
- Lasso
Recall the least squares fitting procedure, which chooses the coefficients to minimize the RSS:
$RSS = \sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij})^2$
Ridge regression:
$RSS + \lambda\sum_{j=1}^{p}\beta_{j}^2$
$\lambda\sum_{j=1}^{p}\beta_{j}^2$: shrinkage penalty
$\lambda$: tuning parameter ($\lambda = 0$ gives ordinary least squares; as $\lambda$ increases, the coefficients shrink further toward 0)
$\hat{\beta}$: least squares coefficient estimates
$\hat{\beta}_{\lambda}^{R}$: ridge regression coefficient
$\vert\vert\beta\vert\vert_2 =\sqrt{\sum_{j=1}^{p}\beta_{j}^2}$
LASSO:
$RSS + \lambda\sum_{j=1}^{p}|\beta_j|$
A variant of ridge regression (introduced in 1996). When $\lambda$ is large enough, the penalty forces some coefficients to be exactly 0 rather than merely close to 0 (sparsity), so the LASSO performs variable selection (see the glmnet sketch below).
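A minimal glmnet sketch contrasting the two penalties on the Hitters data (x and y are built the same way as in the code demo below); alpha = 0 gives ridge regression, alpha = 1 (the default) gives the LASSO:
library(ISLR)
library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]  # predictor matrix without the intercept column
y <- Hitters$Salary
ridge.fit <- glmnet(x, y, alpha = 0)  # L2 penalty: coefficients shrink toward 0 but stay nonzero
lasso.fit <- glmnet(x, y, alpha = 1)  # L1 penalty: some coefficients become exactly 0
plot(ridge.fit, xvar = "lambda")      # coefficient paths as lambda grows
plot(lasso.fit, xvar = "lambda")      # paths hitting exactly 0 show the sparsity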
Issue of scale equivariance
Scale equivariance: with least squares, if the estimated coefficient for $X_j$ is $\hat{\beta}_j$, then multiplying $X_j$ by a constant $c$ changes the estimate to $\hat{\beta}_j/c$, so the product $X_j\hat{\beta}_j$ stays the same. Ridge regression and the LASSO do not have this property, so the best practice is to standardize the predictors beforehand:
$\tilde{x_{ij}}=\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}$
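A sketch of this standardization, dividing each predictor by its standard deviation computed with the 1/n form of the formula above (x as in the glmnet sketch earlier); note that glmnet() already standardizes internally by default (standardize = TRUE), so this manual step is mainly illustrative:
pop_sd <- apply(x, 2, function(col) sqrt(mean((col - mean(col))^2)))  # 1/n standard deviation per column
x_std <- sweep(x, 2, pop_sd, "/")                                     # the x-tilde of the formula above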
Trade-off between Ridge regression and Least squares
Bias-Variance trade-off:
- least squares: high variance and low bias (complex model)
- ridge regression: high bias and low variance (simple model)
Visualize the variable selection property
Another, equivalent formulation of the LASSO and ridge regression is as constrained optimization: minimize the RSS subject to $\sum_{j=1}^{p}|\beta_j| \le s$ (LASSO) or $\sum_{j=1}^{p}\beta_j^2 \le s$ (ridge regression); the shapes of these constraint regions illustrate why the LASSO can set coefficients exactly to 0.
How should we choose between ridge regression and the LASSO?
- dense model, where many predictors matter (e.g. gene data) → ridge regression
- sparse model, where only a few predictors matter (e.g. gender) → LASSO
What about choosing the tuning parameter $\lambda$?
Cross Validation:
- Split the data into a training set and a test set
- Fit the model (ridge or LASSO) on the training set
- The fitted model contains a grid of candidate $\lambda$ values
- For each $\lambda$, predict on the test set and compute the RMSE
- Pick the $\lambda$ with the smallest RMSE and report that model's coefficients
Code demo
library(leaps)
library(ISLR)
library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]  # predictor matrix (drop the intercept column)
y <- Hitters$Salary
trainNum <- round(dim(Hitters)[1] * 2/3)
set.seed(1)
trainIdx <- sample(dim(Hitters)[1], trainNum, replace = FALSE)
# fit the LASSO on the training set (glmnet fits a whole path of lambda values)
lasso.tr <- glmnet(x[trainIdx, ], y[trainIdx])
lasso.tr  # prints the lambda sequence (88 values of lambda)
# validation on the held-out set
pred <- predict(lasso.tr, x[-trainIdx, ])
dim(pred)  # rows: test observations; columns: one per lambda value
rmse <- sqrt(apply((y[-trainIdx] - pred)^2, 2, mean))
plot(log(lasso.tr$lambda), rmse, type = "b", xlab = "log(lambda)")
lam.best <- lasso.tr$lambda[order(rmse)[1]]  # lambda with the smallest test RMSE
coef(lasso.tr, s = lam.best)  # coefficients of the training model at the best lambda
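The demo above tunes lambda with a single validation split. As an alternative sketch, glmnet ships a built-in k-fold cross-validation, cv.glmnet(), here reusing x, y and trainIdx from the demo:
set.seed(1)
cv.out <- cv.glmnet(x[trainIdx, ], y[trainIdx])  # 10-fold CV over the lambda path
plot(cv.out)                                     # CV error as a function of log(lambda)
cv.out$lambda.min                                # lambda with the smallest CV error
coef(cv.out, s = "lambda.min")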
Dimension Reduction Methods
- Concept: create new predictors as linear transformations of the original predictors
- Principle: exploit the correlation among the original predictors to aggregate them into new variables and reduce collinearity
- Goal: reduce the original $p$ variables to $m$ new variables, with $m < p$
Principal Component Regression:
First principal component: the direction along which the predictors vary the most. The usual statistical assumption is that the direction of highest variance is the one most likely to be associated with the response.
Drawback: PCR only considers linear transformations of the predictors and ignores their relationship with the response, so the chosen model may end up being the full model (using all the components), defeating the dimension reduction that principal components are supposed to deliver.
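A minimal PCR sketch using pcr() from the pls package (the function used in the ISLR lab); scale = TRUE standardizes the predictors and validation = "CV" reports cross-validated error for each number of components:
library(pls)
library(ISLR)
set.seed(2)
pcr.fit <- pcr(Salary ~ ., data = na.omit(Hitters), scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")  # CV error versus the number of components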
Partial Least Squares
In contrast to PCR, PLS builds the new features $Z_1,...,Z_m$ using the relationship between the predictors and $y$, so it is a supervised learning procedure.
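A matching PLS sketch; plsr() from the pls package has the same interface as pcr():
library(pls)
library(ISLR)
set.seed(1)
pls.fit <- plsr(Salary ~ ., data = na.omit(Hitters), scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
summary(pls.fit)  # % variance explained in the predictors and in Salary per component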