Today's key points
If a model's predictions deviate too much from the actual values, that deviation is the "loss".
For example: building a price regression model in order to set prices.
To minimize loss, we want the gap between predicted and actual values to be as small as possible. Different loss functions give rise to different evaluation metrics; below we introduce two common ones.
There is one more common metric that can be used to judge a model's explanatory power:
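R-Squared measures the proportion of variance in the target that the model explains. As a minimal base-R sketch of the formula (the toy vectors `actual` and `predicted` below are made up for illustration, not taken from the competition data):

```r
# R-Squared = 1 - SS_res / SS_tot  (toy vectors, for illustration only)
actual    <- c(3.0, 5.0, 7.0, 9.0)   # hypothetical true values
predicted <- c(2.8, 5.3, 6.9, 9.2)   # hypothetical model predictions

ss_res <- sum((actual - predicted)^2)     # residual sum of squares
ss_tot <- sum((actual - mean(actual))^2)  # total sum of squares
r_squared <- 1 - ss_res / ss_tot
print(r_squared)  # close to 1 => the model explains most of the variance
```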
To compute the common metrics RMSE, MAE, and R-Squared in R later, we first need to set up the environment, fit the models, and save the predictions made on the validation data (this is only an example).
## Set up the environment
#setwd(dir) # set the location of your working directory
# MAC : setwd("/Users/rladiestaipei/R_DragonBall/")
# Windows : setwd("C:/Users/rladiestaipei/Desktop/R_DragonBall/")
# Install packages (only needs to be run once)
#install.packages(c("caTools", "caret", "dplyr", "xgboost", "Metrics"),
#                  dependencies = TRUE)
#load packages
library(caTools)
library(caret)
library(dplyr)
library(xgboost)
library(Metrics)
# Read in the data
dataset <- read.csv("train_new.csv")
# select features you want to put in models
# put in the features you consider important, based on what you learned in the previous days (below is just an example)
dataset <- dataset %>% dplyr::select(SalePrice_log, X1stFlrSF, TotalBsmtSF,
YearBuilt, LotArea, Neighborhood, GarageCars, GarageArea, GrLivArea_stand,
MasVnrArea_stand, LotFrontage_log, is_Fireplace, TotalBathrooms, TotalSF_stand)
# Splitting the dataset into the Training set and Validation set
# library(caTools)
set.seed(1)
split <- sample.split(dataset$SalePrice_log, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
val_set <- subset(dataset, split == FALSE)
First, fit a linear regression model.
##Linear Regression
# Fitting Multiple Linear Regression to the Training set
Reg <- lm(formula = SalePrice_log ~ ., data = training_set)
#Get prediction of validation set
pred_reg <- predict(Reg, newdata = val_set)
Then fit an XGBoost model.
##XGBoost
# library(xgboost)
# XGBoost expects matrix input, so all features must be converted to numeric values and stored in a matrix
# convert all features to numeric
training_set_new <- training_set %>% dplyr::select(-SalePrice_log)
val_set_new <- val_set %>% dplyr::select(-SalePrice_log)
cat_index <- which(sapply(training_set_new, class) == "factor")
training_set_new[cat_index] <- lapply(training_set_new[cat_index], as.numeric)
val_set_new[cat_index] <- lapply(val_set_new[cat_index], as.numeric)
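One caveat with the conversion above: calling `as.numeric()` on a factor returns the underlying integer level codes (simple label encoding), not the original strings. A quick illustration with a made-up factor:

```r
# as.numeric() on a factor yields each value's integer level code;
# levels are sorted alphabetically by default
f <- factor(c("OldTown", "CollgCr", "OldTown", "Edwards"))
levels(f)      # "CollgCr" "Edwards" "OldTown"
as.numeric(f)  # 3 1 3 2
```

Tree models such as XGBoost can work with these codes, but note that they impose an arbitrary ordering on the categories.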
# put the training & validation data into two separate DMatrix objects
labels <- training_set$SalePrice_log
dtrain <- xgb.DMatrix(data = as.matrix(training_set_new),label = labels)
dval <- xgb.DMatrix(data = as.matrix(val_set_new))
# set parameters
param <- list(objective = "reg:linear",  # renamed "reg:squarederror" in newer xgboost versions
              booster = "gbtree",
              eta = 0.01,              # default = 0.3
              gamma = 0,
              max_depth = 3,           # default = 6
              min_child_weight = 4,    # default = 1
              subsample = 1,
              colsample_bytree = 1
              )
# Fitting XGBoost to the Training set
set.seed(1)
xgb_base <- xgb.train(params = param, data = dtrain, nrounds = 3000
                      #, watchlist = list(train = dtrain, val = dval)
                      #, print_every_n = 50, early_stopping_rounds = 300
                      )
#Get prediction of validation set
pred_xgb_base <- predict(xgb_base, dval)
RMSE
A review of Day 4 & Day 5: we previously computed RMSE with two different packages (the Metrics package on Day 4, and ModelMetrics, which caret uses, on Day 5).
Calculate RMSE with the Metrics package (Day 4)
# Linear Regression
rmse(val_set$SalePrice_log, pred_reg)
[1] 0.1531402
# XGBoost
rmse(val_set$SalePrice_log, pred_xgb_base)
[1] 0.1442952
You can also go back to the formulas and compute MSE & RMSE directly; take the XGBoost predictions as an example.
First compute MSE
mse_cal <- mean((pred_xgb_base - val_set$SalePrice_log)**2)
print(mse_cal)
[1] 0.0208211
Then compute RMSE
rmse_cal <- sqrt(mse_cal)
print(rmse_cal)
[1] 0.1442952
# Same result either way: 0.1442952 = 0.1442952
MAE
# XGBoost - Metrics Package
mae(val_set$SalePrice_log, pred_xgb_base)
[1] 0.09977234
Other metrics can likewise be computed with the Metrics package or directly from their formulas.
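For example, MAE (mean absolute error) can be reproduced directly from its formula, just like MSE/RMSE above; with the real `val_set$SalePrice_log` and `pred_xgb_base` this should match `mae()` from the Metrics package (toy vectors are used here for illustration):

```r
# MAE = mean of the absolute differences between predictions and actuals
actual    <- c(12.1, 11.8, 12.5)  # hypothetical log prices
predicted <- c(12.0, 12.0, 12.4)  # hypothetical predictions
mae_cal <- mean(abs(predicted - actual))
print(mae_cal)  # (0.1 + 0.2 + 0.1) / 3, about 0.1333
```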
The evaluation criterion for this Kaggle competition is the RMSE between the predicted and actual log(SalePrice) on the test set –> the smaller the better –> so pick the model with the smallest validation-set RMSE.
# Example of reversing the log transform
x <- 87
x_log <- log(x)
exp(x_log)
[1] 87
Pick one of the two:
A. Find your best model, upload your test-set predictions to the Kaggle platform, and record the actual RMSE that Kaggle computes for you.
B. Work out how to obtain the model's R-Squared.