Day 5 : Cross Validation and Hyperparameter Tuning

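Today's Highlights

How to use Cross Validation to split the data and judge how good a model is
How to use Hyperparameter Tuning to pick the best combination of parameters for a model
How to implement Cross Validation & XGBoost Tuning in R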

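Accuracy (準確) is really two concepts: 準 (accurate) means low bias, and 確 (precise) means low variance. Accuracy is a relative notion because of the bias-variance tradeoff. -- Liam Huang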

Part 1 : Cross Validation

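In machine learning, Cross Validation (CV) is commonly used to make sure a trained model both performs well (high prediction accuracy) and does not overfit (generalizes well).

For that purpose, the data is usually split into the following three sets:

Training Set: used to train and fit the model
Validation Set: a portion of the data set aside to compare different hyperparameters or different models
Testing Set: a portion of the data held out in advance. It must not be used at any point during the machine learning project; only at the end is it used to test the model and confirm that it also performs well on new data

The test.csv provided for this event is the testing set. It does not include the target column “SalePrice”, so it cannot be used for training and serves only to check the model's performance on new data. To train the model, train.csv is further split into a training set and a validation set during CV.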

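The basic CV procedure is:

Hold out part of the data as the validation set
Train the model on the remaining data
Test the trained model on the validation set

Commonly used CV methods are:

Hold Out CV: performs a single training/testing split. The drawbacks are that some of the data is never used for training, and that a single split gives a high-variance estimate.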

Figure 1. Hold Out CV
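
As a rough illustration of a single hold-out split, here is a minimal base R sketch. It uses the built-in mtcars data and a plain linear model rather than the House Prices data, and the 80/20 ratio and the variable names are assumptions made only for this example.

```r
# Hold-out validation sketch on the built-in mtcars data (illustrative only).
set.seed(1)                                    # make the random split repeatable
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))  # 80% of the rows go to training

train_df <- mtcars[train_idx, ]
val_df   <- mtcars[-train_idx, ]

# Fit on the training rows only, then score on the held-out rows.
fit  <- lm(mpg ~ wt + hp, data = train_df)
pred <- predict(fit, newdata = val_df)
holdout_rmse <- sqrt(mean((val_df$mpg - pred)^2))
holdout_rmse
```

Changing the seed changes which rows are held out and therefore the RMSE, which is the high variability of a single split mentioned above.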

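Leave-one-out CV: each observation in turn serves as the validation set while all remaining observations form the training set, repeated until every observation has been used as the validation set once. The advantages are that each round trains on almost all of the samples and that no randomness is involved; the drawback is the high computational cost, so it is best suited to small datasets.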

Figure 2. Leave-one-out CV

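k-fold CV: the data is randomly split into k equally sized subsets (folds). Each round uses one fold as the validation set and the other k-1 folds as the training set, repeated until every fold has been used as the validation set once; the final result is the average over the k rounds. k-fold CV combines, to some extent, the advantages of the two methods above. The smaller k is, the closer it gets to Hold Out CV (k = 2 is the smallest valid choice) and the larger the bias; the larger k is, the smaller the bias and the closer it gets to Leave-one-out CV (which is the case k = n, the number of samples).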

Figure 3. k-fold CV
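
The k-fold procedure can be sketched in a few lines of base R. The helper name kfold_rmse, the mtcars data, and the linear model are all assumptions for illustration; caret does this bookkeeping for you in Part 3.

```r
# k-fold CV sketch in base R; `kfold_rmse` is a hypothetical helper, not a library function.
kfold_rmse <- function(data, formula, k, seed = 1) {
  set.seed(seed)
  n <- nrow(data)
  # Assign each row to one of k folds of (nearly) equal size, in random order.
  fold_id <- sample(rep(seq_len(k), length.out = n))
  response <- all.vars(formula)[1]
  fold_rmse <- numeric(k)
  for (i in seq_len(k)) {
    val_rows <- fold_id == i
    fit  <- lm(formula, data = data[!val_rows, ])     # train on the other k-1 folds
    pred <- predict(fit, newdata = data[val_rows, ])  # validate on fold i
    fold_rmse[i] <- sqrt(mean((data[[response]][val_rows] - pred)^2))
  }
  mean(fold_rmse)  # average the k validation scores
}

cv5   <- kfold_rmse(mtcars, mpg ~ wt + hp, k = 5)
loocv <- kfold_rmse(mtcars, mpg ~ wt + hp, k = nrow(mtcars))  # k = n is leave-one-out
c(cv5 = cv5, loocv = loocv)
```

With k = n each fold holds a single row, so the per-fold RMSE is just the absolute error of that one prediction.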

Part 2 : Hyperparameter Tuning

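Tuning is complicated because different models have different hyperparameters, each parameter means something different, and the same hyperparameter can even behave differently on different training datasets. Faced with a real dataset, we usually cannot know in advance which combination of parameters works well, so the best combination generally has to be found by trial and error. For the mainstream algorithms, rules of thumb shared online are easy to find, so a more efficient approach is to study that prior experience and combine it with a tuning strategy. Commonly used tuning methods are:

Random Search: samples randomly from the defined parameter space
Grid Search: exhaustively evaluates every candidate value in the defined search space
Bayesian Search: based on Bayesian optimization; it chooses the next hyperparameters to sample in an informed way, using how the previous samples performed

Finally, evaluating candidates to find the best parameter combination on the training set is itself done through CV.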
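
To make the difference between grid search and random search concrete, here is a small base R sketch. The "hyperparameters" are just the polynomial degrees of two predictors in a linear model on mtcars, scored by 4-fold CV; these choices and all names are assumptions for illustration, and Bayesian search is not shown.

```r
# Grid search vs. random search sketch (base R, mtcars); all names are illustrative.
set.seed(1)
fold_id <- sample(rep(1:4, length.out = nrow(mtcars)))  # fixed 4-fold assignment

# CV score for one combination: polynomial degrees for wt and hp.
cv_rmse <- function(deg_wt, deg_hp) {
  errs <- sapply(1:4, function(i) {
    train <- mtcars[fold_id != i, ]
    val   <- mtcars[fold_id == i, ]
    fit <- lm(mpg ~ poly(wt, degree = deg_wt) + poly(hp, degree = deg_hp),
              data = train)
    sqrt(mean((val$mpg - predict(fit, newdata = val))^2))
  })
  mean(errs)
}

space <- expand.grid(deg_wt = 1:3, deg_hp = 1:3)  # 9 candidate combinations

# Grid search: evaluate every combination.
grid_scores <- mapply(cv_rmse, space$deg_wt, space$deg_hp)
best_grid <- space[which.min(grid_scores), ]

# Random search: evaluate only a random subset (here 4 of the 9).
picked <- sample(nrow(space), 4)
rand_scores <- mapply(cv_rmse, space$deg_wt[picked], space$deg_hp[picked])
best_rand <- space[picked[which.min(rand_scores)], ]

best_grid
best_rand
```

Random search costs fewer model fits but can miss the combination that grid search finds, so its best score is never better than the full grid's on the same folds.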

Part 3 : Demo XGBoost Tuning in R

3.1 Environment setup and package installation

## Set up the environment
#setwd(dir) # set the working directory
# MAC : setwd("/Users/rladiestaipei/R_DragonBall/") 
# Windows : setwd("C:/Users/rladiestaipei/Desktop/R_DragonBall/")  

# Install packages (only needs to be run once)
#install.packages(c("caTools", "caret", "dplyr", "xgboost"),
#                 dependencies = TRUE)

#load packages
library(caTools)
library(caret)
library(dplyr)
library(xgboost)

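3.2 Load the feature-engineered data

A quick review of yesterday: reading the data, splitting it, and building the matrix for XGBoost

Read the data: Day 3 Feature Engineering produces the cleaned data file (download link)
Split the data into a training set and a validation set
Convert all variables to numeric, then turn the training set and validation set into separate DMatrix objects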

dataset <- read.csv("train_new.csv")

# select features you want to put in models
# Put in the variables you consider important, based on the previous days (the ones below are only an example)
dataset <- dataset %>% dplyr::select(SalePrice_log, X1stFlrSF, TotalBsmtSF, 
    YearBuilt, LotArea, Neighborhood, GarageCars, GarageArea, GrLivArea_stand, 
    MasVnrArea_stand, LotFrontage_log, is_Fireplace, TotalBathrooms, TotalSF_stand)

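# Convert categorical features to integer codes. Since R 4.0, read.csv() keeps
# strings as character by default, so handle both factor and character columns.
cat_index <- which(sapply(dataset, function(col) is.factor(col) || is.character(col)))
dataset[cat_index] <- lapply(dataset[cat_index], function(col) as.numeric(as.factor(col)))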

# Splitting the dataset into the Training set and Validation set
set.seed(1)
split <- sample.split(dataset$SalePrice_log, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
val_set <- subset(dataset, split == FALSE)


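# Put training & validation data into two separate DMatrix objects.
# Drop the target column from the features; leaving it in leaks the label
# into the model and makes the validation RMSE look far better than it is.
# Note: the very small RMSE values printed later in this post are consistent
# with a run that still included this column.
feature_cols <- setdiff(names(training_set), "SalePrice_log")
tr_x <- as.matrix(training_set[, feature_cols])
tr_y <- training_set$SalePrice_log
val_x <- as.matrix(val_set[, feature_cols])
val_y <- val_set$SalePrice_log

dtrain <- xgb.DMatrix(data = tr_x, label = tr_y) 
dval <- xgb.DMatrix(data = val_x, label = val_y)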

3.3 Train a baseline model with XGBoost's default parameters

nrounds: Number of trees (boosting rounds), default: 100
max_depth: Maximum tree depth, default: 6
eta: Learning rate, default: 0.3
gamma: Minimum loss reduction required to make a further split (regularization), default: 0
colsample_bytree: Fraction of columns sampled for each tree, default: 1
min_child_weight: Minimum sum of instance weights needed in a child node, default: 1
subsample: Fraction of rows sampled for each tree, default: 1

#default parameters
default_params <- expand.grid(
  nrounds = 100,
  max_depth = 6,
  eta = 0.3,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

train_control <- caret::trainControl(
  method = "none",
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)

xgb_base <- caret::train(
  x = tr_x,
  y = tr_y,
  trControl = train_control,
  tuneGrid = default_params,
  method = "xgbTree",
  verbose = TRUE
)

xgb_base_rmse <- ModelMetrics::rmse(val_y, predict(xgb_base, newdata = val_x))
xgb_base_rmse

[1] 0.004612619
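
For reference, the RMSE used above is the square root of the mean squared difference between actual and predicted values. A minimal hand-written version might look like this; rmse_manual and the three made-up values are assumptions for illustration only, not part of ModelMetrics.

```r
# RMSE written out by hand; `rmse_manual` is a hypothetical helper for illustration.
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

actual    <- c(12.0, 11.5, 12.3)  # made-up log prices
predicted <- c(12.1, 11.4, 12.3)
rmse_manual(actual, predicted)
```

Because the target here is SalePrice_log, the RMSE is measured in log-price units rather than in dollars.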

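3.4 XGBoost tuning strategy

Reference: Complete Guide to Parameter Tuning in XGBoost with codes in Python

The lower the learning rate (eta), the longer training takes, so start by fixing a relatively high eta. 0.1 is the usual choice, and 0.05-0.3 can also work depending on the problem
Determine the best number of trees (nrounds) for the chosen eta
With eta & nrounds chosen, tune the tree-specific parameters (max_depth, min_child_weight)
Then tune subsample (the fraction of rows sampled) and colsample_bytree (the fraction of columns sampled)
Tune the regularization parameter gamma, which reduces model complexity and helps prevent overfitting
Finally, lower the learning rate and increase the number of trees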

Step 1. Number of iterations and learning rate

## Tune with library(caret)
# This example uses grid search
grid <- expand.grid(
  nrounds = seq(from = 200, to = 1000, by = 50),
  eta = c(0.025, 0.05, 0.1, 0.3),
  max_depth = c(2, 3, 4, 5, 6),
  gamma = 0, 
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

control <- caret::trainControl(
  method = "cv",
  number = 3, # cross validation with n(n=3) folds
  verboseIter = FALSE,
  allowParallel = FALSE # FALSE for reproducible results 
)

xgb_tune <- caret::train(
  x = tr_x,
  y = tr_y,
  trControl = control,
  tuneGrid = grid,
  method = "xgbTree",
  verbose = TRUE
)

# Inspect the best tuned parameters and results
xgb_tune$bestTune

   nrounds max_depth   eta gamma colsample_bytree min_child_weight subsample
24     500         3 0.025     0                1                1         1

Step 2. Maximum depth and minimum child weight

grid2 <- expand.grid(
  nrounds = seq(from = 200, to = 1000, by = 50),
  eta = 0.025, # or set dynamically: xgb_tune$bestTune$eta
  max_depth =  c(5, 6, 7), # or set dynamically around d <- xgb_tune$bestTune$max_depth:
    # if (d == 2) 2:4 else (d - 1):(d + 1)
  gamma = 0, 
  colsample_bytree = 1,
  min_child_weight = c(1, 2, 3),
  subsample = 1
)

xgb_tune2 <- caret::train(
  x = tr_x,
  y = tr_y,
  trControl = control,
  tuneGrid = grid2,
  method = "xgbTree",
  verbose = TRUE
)
# Inspect the best tuned parameters and results
xgb_tune2$bestTune

    nrounds max_depth   eta gamma colsample_bytree min_child_weight subsample
112     650         7 0.025     0                1                1         1

Step 3. Column and row sampling

grid3 <- expand.grid(
  nrounds = seq(from = 200, to = 1000, by = 50),
  eta = 0.025, # or set dynamically: xgb_tune$bestTune$eta
  max_depth =  6,  # or set dynamically: xgb_tune2$bestTune$max_depth
  gamma = 0, 
  colsample_bytree = c(0.4, 0.6, 0.8, 1.0),
  min_child_weight = 1, # or set dynamically: xgb_tune2$bestTune$min_child_weight
  subsample = c(0.5, 0.65, 0.8, 0.95, 1.0)
)

xgb_tune3 <- caret::train(
  x = tr_x,
  y = tr_y,
  trControl = control,
  tuneGrid = grid3,
  method = "xgbTree",
  verbose = TRUE
)
# Inspect the best tuned parameters and results
xgb_tune3$bestTune

    nrounds max_depth   eta gamma colsample_bytree min_child_weight subsample
304     900         6 0.025     0                1                1       0.8

Step 4. Gamma

grid4 <- expand.grid(
  nrounds = seq(from = 200, to = 1000, by = 50),
  eta = 0.025, # or set dynamically: xgb_tune$bestTune$eta
  max_depth =  6,  # or set dynamically: xgb_tune2$bestTune$max_depth
  gamma = c(0, 0.05, 0.1, 0.5, 0.7, 0.9, 1.0),
  colsample_bytree = 1, # or set dynamically: xgb_tune3$bestTune$colsample_bytree
  min_child_weight = 1, # or set dynamically: xgb_tune2$bestTune$min_child_weight
  subsample = 0.5 # or set dynamically: xgb_tune3$bestTune$subsample
)

xgb_tune4 <- caret::train(
  x = tr_x,
  y = tr_y,
  trControl = control,
  tuneGrid = grid4,
  method = "xgbTree",
  verbose = TRUE
)
# Inspect the best tuned parameters and results
xgb_tune4$bestTune

   nrounds max_depth   eta gamma colsample_bytree min_child_weight subsample
16     950         6 0.025     0                1                1       0.5

Step 5. Reducing the learning rate and increasing nrounds

grid5 <- expand.grid(
  nrounds = seq(from = 500, to = 10000, by = 100),
  eta = c(0.01, 0.015, 0.025, 0.05, 0.1), 
  max_depth =  6,  # or set dynamically: xgb_tune2$bestTune$max_depth
  gamma = 0, # or set dynamically: xgb_tune4$bestTune$gamma
  colsample_bytree = 1, # or set dynamically: xgb_tune3$bestTune$colsample_bytree
  min_child_weight = 1, # or set dynamically: xgb_tune2$bestTune$min_child_weight
  subsample = 0.5 # or set dynamically: xgb_tune3$bestTune$subsample
)

xgb_tune5 <- caret::train(
  x = tr_x,
  y = tr_y,
  trControl = control,
  tuneGrid = grid5,
  method = "xgbTree",
  verbose = TRUE
)
# Inspect the best tuned parameters and results
xgb_tune5$bestTune

   nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
85    8900         6 0.01     0                1                1       0.5

## Collect the final parameters
final_grid <- expand.grid(
  nrounds = xgb_tune5$bestTune$nrounds,
  eta = xgb_tune5$bestTune$eta,
  max_depth = xgb_tune5$bestTune$max_depth,
  gamma = xgb_tune5$bestTune$gamma,
  colsample_bytree = xgb_tune5$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5$bestTune$min_child_weight,
  subsample = xgb_tune5$bestTune$subsample
)
final_grid

  nrounds  eta max_depth gamma colsample_bytree min_child_weight subsample
1    8900 0.01         6     0                1                1       0.5

# Train the model with the best tuned parameters
xgb_model <- caret::train(
  x = tr_x,
  y = tr_y,
  trControl = train_control,
  tuneGrid = final_grid,
  method = "xgbTree",
  verbose = TRUE
)

# Check how the trained model performs on the validation set
xgb_tuned_rmse <- ModelMetrics::rmse(val_y, predict(xgb_model, newdata = val_x))

xgb_tuned_rmse # validation RMSE of the tuned model

[1] 0.0008546333

xgb_base_rmse # validation RMSE of the model with default parameters

[1] 0.004612619

# RMSE improvement
xgb_base_rmse - xgb_tuned_rmse

[1] 0.003757986

## Compared with the baseline model, tuning lowered the RMSE from 0.0046 to 0.00085!

Today's Challenge

How do you handle the “Bias-Variance Tradeoff” problem in machine learning?