YD's blog

Posted Wed 22 Mar 2017

SF6 Linear Model Selection

Methods of model selection:

This post introduces model selection. There are three main approaches: subset selection, shrinkage, and dimension reduction.

The sections below cover each of these in turn.

Subset selection

Best Subset Selection

  1. Start from the null model $M_0$, which contains only an intercept.
  2. For each number of predictors $k = 1, \dots, p$, fit every model of that size and keep the one with the lowest training error (smallest RSS, equivalently largest $R^2$) as $M_k$.
  3. Choose among $M_0, \dots, M_p$ using an estimate of the testing error (indirect or direct estimate).

Best subset selection is not recommended once the number of predictors exceeds about 40, because it has to fit $2^p$ models (already $2^{40}$ iterations at $p = 40$).
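
A minimal sketch of best subset selection with the leaps package on the ISLR Hitters data (the Salary ~ . formula and the nvmax cap are illustrative choices, not taken from the original post):

library(leaps)
library(ISLR)
Hitters <- na.omit(Hitters)
# exhaustive search over all subsets, limited here to models with up to 10 predictors
best.fit <- regsubsets(Salary ~ ., data = Hitters, nvmax = 10)
best.sum <- summary(best.fit)
# indirect estimate of the test error: pick the model size that minimises BIC
which.min(best.sum$bic)
coef(best.fit, which.min(best.sum$bic))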


Forward Stepwise Selection

The number of models to fit is much smaller, only about $\frac{p^2+p}{2}$ (compared with $2^p$ for best subset selection).
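
The same regsubsets interface also runs forward stepwise selection (still a sketch, on the Hitters data set up above):

# forward stepwise: start from the null model and add one predictor at a time
fwd.fit <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
summary(fwd.fit)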


Backward Stepwise Selection

Restriction: $n$ must be greater than $p$, otherwise the full least squares model that backward selection starts from cannot be fitted.
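
Backward stepwise is the same call with method = "backward" (a sketch; it works on Hitters because there are far more observations than predictors):

# backward stepwise: start from the full model and drop one predictor at a time
bwd.fit <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
summary(bwd.fit)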


Choosing Optimal Model

Indirect estimate of testing error

$n$: number of observations; $d$: number of predictors in the model; $\hat{\sigma}^2$: estimate of the variance of the error term

$L$: the maximized value of the likelihood function; AIC has a general form in terms of $L$, and a simpler special-case form for linear regression with Gaussian errors.

When $n > 7$ we have $\log n > 2$, so BIC penalizes models with more predictors more heavily than $C_p$/AIC does; using BIC therefore favors models with fewer parameters (see the formulas below).
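
For reference, the standard forms of these criteria as given in ISLR (up to constants that do not affect the ranking of models) are:

$C_p = \frac{1}{n}\left(RSS + 2d\hat{\sigma}^2\right)$

$AIC = -2\log L + 2d$ (general form; for linear regression with Gaussian errors this is proportional to $C_p$)

$BIC = \frac{1}{n}\left(RSS + \log(n)\,d\,\hat{\sigma}^2\right)$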


Direct estimate of testing error

This means estimating the test error directly by holding data out, with a validation set or cross-validation; this is used later in the post to choose the shrinkage tuning parameter $\lambda$.

Shrinkage Methods

Recall the least squares fitting procedure and its residual sum of squares (RSS):

$RSS = \sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij})^2$

Ridge regression:

$RSS + \lambda\sum_{j=1}^{p}\beta_{j}^2$

$\lambda\sum_{j=1}^{p}\beta_{j}^2$: shrinkage penalty

$\lambda$: tuning parameter ($\lambda = 0$ recovers ordinary least squares fitting; as $\lambda$ grows, the coefficients shrink toward 0)

$\hat{\beta}$: least squares coefficient estimates

$\hat{\beta}_{\lambda}^{R}$: ridge regression coefficient estimates

$\vert\vert\beta\vert\vert_2 =\sqrt{\sum_{j=1}^{p}\beta_{j}^2}$
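
A minimal glmnet sketch of ridge regression on the Hitters data (alpha = 0 selects the ridge penalty; x and y are built the same way as in the code demo further down):

library(glmnet)
library(ISLR)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]   # predictor matrix without the intercept column
y <- Hitters$Salary
fit.ridge <- glmnet(x, y, alpha = 0)           # alpha = 0 -> ridge penalty
plot(fit.ridge, xvar = "lambda", label = TRUE) # coefficient paths shrinking as lambda grows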

LASSO:

$RSS + \lambda\sum_{j=1}^{p}|\beta_j|$

A variant of ridge regression (proposed in 1996). When $\lambda$ is large enough, some coefficients become exactly 0 rather than merely approaching 0 (sparsity).
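
With glmnet, the same call with alpha = 1 (the default) gives the LASSO instead, reusing the x and y from the ridge sketch above:

fit.lasso <- glmnet(x, y, alpha = 1)           # alpha = 1 -> LASSO penalty (the glmnet default)
plot(fit.lasso, xvar = "lambda", label = TRUE) # some paths hit exactly 0 as lambda grows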

Issue of scale equivariance

Scale equivariance: suppose least squares applied to $X_j$ gives the coefficient $\hat{\beta}_j$. If $X_j$ is multiplied by a constant $c$, the resulting coefficient becomes $1/c$ times the original, so $X_j\hat{\beta}_j$ stays the same. Ridge regression and the LASSO do not have this property, so the best practice is to standardize the predictors beforehand:

$\tilde{x_{ij}}=\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}$
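
A small sketch of this standardization in R, using the population (1/n) standard deviation as in the formula above and the x matrix from the sketches above; note that glmnet already does this internally by default (standardize = TRUE):

# divide every column of x by its population standard deviation
pop.sd <- apply(x, 2, function(col) sqrt(mean((col - mean(col))^2)))
x.std  <- sweep(x, 2, pop.sd, "/")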

Trade-off between Ridge regression and Least squares

Bias-Variance trade-off:

Visualize the variable selection property

Another formula for the LASSO and ridge regression:
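
Written out, these are the equivalent constrained formulations given in ISLR:

$\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\vert\beta_j\vert \le s \quad \text{(LASSO)}$

$\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_{j}^2 \le s \quad \text{(ridge regression)}$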

How do we choose between ridge regression and the LASSO?

dense signal (many predictors each contributing a little, e.g. genes) → ridge regression

sparse signal (only a few predictors matter, e.g. gender) → LASSO

What about choosing $\lambda$ (the tuning parameter)?

Cross Validation:

  1. Split the data into a training set and a test set.
  2. Fit the model (ridge or LASSO) on the training set.
  3. The fitted object contains models for a whole grid of $\lambda$ values.
  4. Predict on the test set for each $\lambda$ and compute the RMSE.
  5. Pick the $\lambda$ with the minimum RMSE and report that model's coefficients.

Code demo

library(leaps)
library(ISLR)
library(glmnet)   # for glmnet(), predict() and coef() on the LASSO path
Hitters <- na.omit(Hitters)
# build the predictor matrix (dropping the intercept column) and the response
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary
# split the data into training and test sets
trainNum <- round(dim(Hitters)[1]*2/3)
set.seed(1)
trainIdx <- sample(dim(Hitters)[1], trainNum, replace = FALSE)
# fit the LASSO path on the training set (alpha = 1 is the glmnet default)
lasso.tr <- glmnet(x[trainIdx,], y[trainIdx])
lasso.tr # the path contains a grid of lambda values (88 here)
# validation-set estimate of the test error for every lambda on the path
pred <- predict(lasso.tr, x[-trainIdx,])
dim(pred) # one row per test observation, one column per lambda value
rmse <- sqrt(apply((y[-trainIdx]-pred)^2, 2, mean))
plot(log(lasso.tr$lambda), rmse, type = "b", xlab = "log(lambda)")
# lambda with the smallest validation RMSE, and the coefficients of that model
lam.best <- lasso.tr$lambda[order(rmse)[1]]
coef(lasso.tr, s = lam.best)
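
As an alternative to the manual validation split, glmnet also provides cv.glmnet, which cross-validates over the lambda grid directly (a sketch reusing the same x, y and trainIdx):

cv.out <- cv.glmnet(x[trainIdx,], y[trainIdx]) # 10-fold CV over the lambda path by default
plot(cv.out)                                   # CV error versus log(lambda)
cv.out$lambda.min                              # lambda with the lowest CV error
coef(cv.out, s = "lambda.min")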

Dimension Reduction Methods

Concept: build new predictors as linear transformations of the original predictors.

Rationale: use the correlation among the predictors to aggregate them into new variables, reducing collinearity.

Goal: reduce the original $p$ variables to $m$ new variables, with $m < p$.
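
In ISLR's notation (adapted to this post's $m < p$), the constructed variables and the reduced regression are:

$Z_k = \sum_{j=1}^{p}\phi_{jk}X_j, \quad k = 1,\dots,m$

$y_i = \theta_0 + \sum_{k=1}^{m}\theta_k z_{ik} + \epsilon_i, \quad i = 1,\dots,n$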

Principal Component Regression:

First principal component: the direction along which the data vary the most; the working assumption statisticians commonly make is that this high-variance direction is probably going to be associated with the response.

Drawback: PCR only uses linear transformations of the predictors and never looks at the relationship between the predictors and the response, so the best model can turn out to be the full model, and the simplification that principal components are supposed to provide never materializes.
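
A minimal PCR sketch with the pls package on the same Hitters data (validation = "CV" asks pcr to cross-validate over the number of components; the choices here are illustrative):

library(pls)
set.seed(1)
pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP") # CV error versus number of components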

Partial Least Squares

Like PCR, PLS constructs new variables $Z_1,\dots,Z_m$, but it uses their relationship with $y$ when forming them, which makes it a supervised learning procedure.
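
The pls package provides plsr with the same interface as pcr (a sketch, with pls loaded as above):

set.seed(1)
pls.fit <- plsr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")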

Category: Stat
Tags: Stat