SF6 Linear Model Selection
Methods of model selection:
Today we introduce model selection. There are three main approaches:
- Subset selection
- Shrinkage methods
- Dimension reduction methods
We will go through each of these three approaches in turn.
Subset selection
Best Subset Selection
- start from the null model (intercept only)
- within each model size, pick the best model by training error
- choose the final model among these using an estimate of testing error (indirect or direct estimate)
Best subset selection is not recommended when there are more than about 40 predictors, since it must fit $2^p$ models ($2^{40}$ already at $p = 40$).
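As a minimal sketch (assuming the ISLR Hitters data used in the code demo below), best subset selection can be run with regsubsets() from the leaps package; nvmax = 19 searches every subset size up to all 19 predictors:
library(leaps)
library(ISLR)
Hitters <- na.omit(Hitters)  # drop rows with missing Salary
# exhaustive search: the best model of each size from 1 to 19 predictors
regfit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
summary(regfit.full)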
Forward Stepwise Selection
Far fewer models are fit: only $\frac{p^2+p}{2}$ (plus the null model), instead of $2^p$.
Backward Stepwise Selection
Restriction: $n$ must be greater than $p$ (otherwise the full least squares model cannot be fit).
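A sketch of the greedy alternatives, reusing the setup above: regsubsets() runs forward or backward stepwise selection through its method argument:
# forward: start from the null model and add one predictor at a time
regfit.fwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
# backward: start from the full model and drop one predictor at a time (needs n > p)
regfit.bwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")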
Choosing the Optimal Model
Indirect estimate of testing error
- $C_p = \frac{1}{n}(RSS+2d\hat{\sigma}^2)$
n: number of observations, d: number of predictors, $\hat{\sigma}^2$: estimate of the variance of the error
- $AIC = \frac{1}{n\hat{\sigma}^2}(RSS+2d\hat{\sigma}^2)$; in general, $AIC = -2\log L+2d$
L: the maximized value of the likelihood function; the second expression is the general AIC definition, the first is the special case for least squares linear regression
- $BIC=\frac{1}{n}(RSS+\log(n)\,d\hat{\sigma}^2)$
When $n > 7$, $\log(n) > 2$, so BIC penalizes models with many predictors more heavily than $C_p$ or AIC; BIC therefore tends to favor models with fewer parameters (the criteria are compared in the sketch after this list).
- $\text{Adjusted } R^2 = 1- \frac{RSS/(n-d-1)}{TSS/(n-1)}$
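A sketch of how these criteria are used in practice: summary() on a regsubsets fit reports Cp, BIC and adjusted R^2 for the best model of each size (the best subset fit is repeated here so the block stands alone):
library(leaps)
library(ISLR)
Hitters <- na.omit(Hitters)
reg.summary <- summary(regsubsets(Salary ~ ., data = Hitters, nvmax = 19))
which.min(reg.summary$cp)     # model size minimizing Cp
which.min(reg.summary$bic)    # model size minimizing BIC
which.max(reg.summary$adjr2)  # model size maximizing adjusted R^2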
Direct estimate of testing error
- Validation
- Cross Validation
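A minimal sketch of the direct approach with a single validation split (a hypothetical 2/3 vs 1/3 split of Hitters; cross-validation repeats this over several folds and averages the errors):
library(ISLR)
Hitters <- na.omit(Hitters)
set.seed(1)
train <- sample(nrow(Hitters), round(nrow(Hitters) * 2/3))
fit <- lm(Salary ~ ., data = Hitters[train, ])
pred <- predict(fit, newdata = Hitters[-train, ])
mean((Hitters$Salary[-train] - pred)^2)  # validation-set estimate of the test MSE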
Shrinkage Methods
- Ridge regression
- Lasso
Recall the least squares fitting procedure, which chooses the coefficients to minimize the RSS:
$RSS = \sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij})^2$
Ridge regression:
$RSS + \lambda\sum_{j=1}^{p}\beta_{j}^2$
$\lambda\sum_{j=1}^{p}\beta_{j}^2$: shrinkage penalty
$\lambda$: tuning parameter ($\lambda = 0$ gives ordinary least squares; as $\lambda$ increases, the coefficients shrink further toward 0)
$\hat{\beta}$: least squares coefficient estimates
$\hat{\beta}_{\lambda}^{R}$: ridge regression coefficient
$\vert\vert\beta\vert\vert_2 =\sqrt{\sum_{j=1}^{p}\beta_{j}^2}$
LASSO:
$RSS + \lambda\sum_{j=1}^{p}|\beta_j|$
A variant of ridge regression (introduced in 1996). When $\lambda$ is large enough, the penalty forces some coefficients to be exactly 0 rather than merely close to 0 (sparsity), so the LASSO performs variable selection (see the glmnet sketch below).
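A minimal glmnet sketch contrasting the two penalties on the Hitters data (x and y are built the same way as in the code demo below); alpha = 0 gives ridge regression, alpha = 1 (the default) gives the LASSO:
library(ISLR)
library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]  # predictor matrix without the intercept column
y <- Hitters$Salary
ridge.fit <- glmnet(x, y, alpha = 0)  # L2 penalty: coefficients shrink toward 0 but stay nonzero
lasso.fit <- glmnet(x, y, alpha = 1)  # L1 penalty: some coefficients become exactly 0
plot(ridge.fit, xvar = "lambda")      # coefficient paths as lambda grows
plot(lasso.fit, xvar = "lambda")      # paths hitting exactly 0 show the sparsity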
Issue of scale equivariance
Scale equivariance: with least squares, if the estimated coefficient for $X_j$ is $\hat{\beta}_j$, then multiplying $X_j$ by a constant $c$ changes the estimate to $\hat{\beta}_j/c$, so the product $X_j\hat{\beta}_j$ stays the same. Ridge regression and the LASSO do not have this property, so the best practice is to standardize the predictors beforehand:
$\tilde{x_{ij}}=\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}$
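A sketch of this standardization, dividing each predictor by its standard deviation computed with the 1/n form of the formula above (x as in the glmnet sketch earlier); note that glmnet() already standardizes internally by default (standardize = TRUE), so this manual step is mainly illustrative:
pop_sd <- apply(x, 2, function(col) sqrt(mean((col - mean(col))^2)))  # 1/n standard deviation per column
x_std <- sweep(x, 2, pop_sd, "/")                                     # the x-tilde of the formula above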
Trade-off between Ridge regression and Least squares
Bias-Variance trade-off:
- least squares: high variance and low bias (complex model)
- ridge regression: high bias and low variance (simple model)
Visualize the variable selection property
Another, equivalent formulation of the LASSO and ridge regression is as constrained optimization: minimize the RSS subject to $\sum_{j=1}^{p}|\beta_j| \le s$ (LASSO) or $\sum_{j=1}^{p}\beta_j^2 \le s$ (ridge regression); the shapes of these constraint regions illustrate why the LASSO can set coefficients exactly to 0.
How should we choose between ridge regression and the LASSO?
- dense model, where many predictors matter (e.g. gene data) → ridge regression
- sparse model, where only a few predictors matter (e.g. gender) → LASSO
What about choosing the tuning parameter $\lambda$?
Cross Validation:
- Split the data into a training set and a test set
- Fit the model (ridge or LASSO) on the training set
- The fitted model contains a grid of candidate $\lambda$ values
- For each $\lambda$, predict on the test set and compute the RMSE
- Pick the $\lambda$ with the smallest RMSE and report that model's coefficients
Code demo
library(leaps)
library(ISLR)
library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]  # predictor matrix (drop the intercept column)
y <- Hitters$Salary
trainNum <- round(dim(Hitters)[1] * 2/3)
set.seed(1)
trainIdx <- sample(dim(Hitters)[1], trainNum, replace = FALSE)
# fit the LASSO on the training set (glmnet fits a whole path of lambda values)
lasso.tr <- glmnet(x[trainIdx, ], y[trainIdx])
lasso.tr  # prints the lambda sequence (88 values of lambda)
# validation on the held-out set
pred <- predict(lasso.tr, x[-trainIdx, ])
dim(pred)  # rows: test observations; columns: one per lambda value
rmse <- sqrt(apply((y[-trainIdx] - pred)^2, 2, mean))
plot(log(lasso.tr$lambda), rmse, type = "b", xlab = "log(lambda)")
lam.best <- lasso.tr$lambda[order(rmse)[1]]  # lambda with the smallest test RMSE
coef(lasso.tr, s = lam.best)  # coefficients of the training model at the best lambda
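The demo above tunes lambda with a single validation split. As an alternative sketch, glmnet ships a built-in k-fold cross-validation, cv.glmnet(), here reusing x, y and trainIdx from the demo:
set.seed(1)
cv.out <- cv.glmnet(x[trainIdx, ], y[trainIdx])  # 10-fold CV over the lambda path
plot(cv.out)                                     # CV error as a function of log(lambda)
cv.out$lambda.min                                # lambda with the smallest CV error
coef(cv.out, s = "lambda.min")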
Dimension Reduction Methods
- Concept: create new predictors as linear transformations of the original predictors
- Principle: exploit the correlation among the original predictors to aggregate them into new variables and reduce collinearity
- Goal: reduce the original $p$ variables to $m$ new variables, with $m < p$
Principal Component Regression:
First principal component: the direction along which the predictors vary the most. The usual statistical assumption is that the direction of highest variance is the one most likely to be associated with the response.
Drawback: PCR only considers linear transformations of the predictors and ignores their relationship with the response, so the chosen model may end up being the full model (using all the components), defeating the dimension reduction that principal components are supposed to deliver.
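A minimal PCR sketch using pcr() from the pls package (the function used in the ISLR lab); scale = TRUE standardizes the predictors and validation = "CV" reports cross-validated error for each number of components:
library(pls)
library(ISLR)
set.seed(2)
pcr.fit <- pcr(Salary ~ ., data = na.omit(Hitters), scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")  # CV error versus the number of components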
Partial Least Squares
In contrast to PCR, PLS builds the new features $Z_1,...,Z_m$ using the relationship between the predictors and $y$, so it is a supervised learning procedure.
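A matching PLS sketch; plsr() from the pls package has the same interface as pcr():
library(pls)
library(ISLR)
set.seed(1)
pls.fit <- plsr(Salary ~ ., data = na.omit(Hitters), scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
summary(pls.fit)  # % variance explained in the predictors and in Salary per component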