本日重點: 學習 dplyr , tidyr 等套件中的資料清理及轉換技能
##設定環境
#setwd(dir) #設定working directory的存放位置
# MAC : setwd("/Users/rladiestaipei/R_DragonBall/")
# Windows : setwd("C://Users/rladiestaipei/Desktop/R_DragonBall/")
#安裝套件(僅需執行一次)
#install.packages(c("tidyverse", "ploty", "zoo", "lubridate", "rmarkdown","data.table", "DT", "kableExtra"), dependencies = TRUE)
#load packages
library(DT)
library(zoo)
library(plotly)
library(lubridate)
library(rmarkdown)
library(data.table)
library(tidyverse)
library(kableExtra)
options(dplyr.print_max=1e9)
介紹兩種最常用的把資料載入的方式:
read.csv()
以及fread()
read.csv()
: 是 R
的 base function,最常用也最直接讀入 csv 檔的方式fread()
: 來自 data.table
library,當資料量大到一個程度的時候(ex. 百萬筆以上), fread()
會用更快的速度把資料載入.xlsx
檔,可以用 xlsx::read.xlsx()
str()
(或 glimpse()
)這兩個 function 來檢視 read.csv()
以及 fread()
讀入的資料差異為何?#資料需放在此workspace中,若不是的話需要指定路徑
train1<-read.csv("train.csv")
train2<-fread("train.csv")
#str(train1) #檢視所有欄位
str(train1[,1:10]) #避免佔用教材過大篇幅,僅顯示前十個欄位
'data.frame': 1460 obs. of 10 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
$ LotFrontage: int 65 80 68 60 84 85 75 NA 51 50 ...
$ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
$ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
$ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
$ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
$ LandContour: Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
#str(train2)
str(train2[,1:10])
Classes 'data.table' and 'data.frame': 1460 obs. of 10 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning : chr "RL" "RL" "RL" "RL" ...
$ LotFrontage: int 65 80 68 60 84 85 75 NA 51 50 ...
$ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
$ Street : chr "Pave" "Pave" "Pave" "Pave" ...
$ Alley : chr NA NA NA NA ...
$ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
$ LandContour: chr "Lvl" "Lvl" "Lvl" "Lvl" ...
$ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
- attr(*, ".internal.selfref")=<externalptr>
發現差異了嗎?
read.csv
讀入的資料會自動把類別型資料 categorical variables
轉換成factor
,而 fread()
則預設把類別型資料讀入成文字 character
。factor
讀入的好處是,我們用 str()
檢視的時候,可以很快的知道這個欄位有幾種數值,例如 LotShape
這個欄位,顯示為 Factor w/ 4 levels "IR1","IR2","IR3"....
,我們就知道此欄位至少有 IR1
~ IR4
四種。read.csv
做這樣的轉換,可以設定 stringsAsFactors = FALSE
即可。factor
或 character
變數類型在後續跑模型的時候會有影響,有些模型只接受 categorical variables 一定得為 factor
才行,因此模型跑得熟練之後,大家就知道資料讀入的時候應該做何種設定比較合適。dim()
看資料的行列數 (rows x columns)
rows
, 也常被稱為 observations
或 records
columns
,也被稱為 fields
,單指資料集的垂直組合columns
通常稱為 variables
columns
, variables
都是一樣的概念,由於機器學習以及深度學習正夯,現在較廣泛地稱為 features
喔!#讀入raw data
train0<-read.csv("train.csv", stringsAsFactors = FALSE)
test0<-read.csv("test.csv", stringsAsFactors = FALSE)
dim(train0) #training data共有1460個房子的資料,81個欄位
[1] 1460 81
dim(test0) #testing data共有1460個房子的資料,80個欄位
[1] 1459 80
#training 與 testing 資料差別在training data多了房價 (SalesPrice),這是testing data在建模完畢後最終要去預測的欄位
colnames(train0)[!colnames(train0) %in% colnames(test0)]
[1] "SalePrice"
分析資料的第一步就是理解資料,這也是考驗分析者R coding的基本功。
在
tidyverse
系列套件推出之前,資料的清理難度比現在高,rbind
,cbind
,subset
,merge
到地老天荒(其實只有我?),還有腦筋要轉個幾圈才能搞懂的apply
,lapply
,tapply
,sapply
,mapply
等等全部長很像系列家族。一直到 2014 年
dplyr
package 推出後,R的語法變得更直覺,跟資料庫語言 SQL 更貼近,可以一步接著一步的 piping 語法(%>%
)讓 R codes 變得更好讀,瞬間將 R 的 coding style 提升到另一個境界!!!後續推出的
tibble
,tidyr
,purrr
等套件,再加上本來就紅透半邊天的ggplot2
,如果沒聽過tidyverse
家族?,別說你懂 R 。
根據位置
dat<-train0%>%select(10:15)
head(dat)
Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
1 AllPub Inside Gtl CollgCr Norm Norm
2 AllPub FR2 Gtl Veenker Feedr Norm
3 AllPub Inside Gtl CollgCr Norm Norm
4 AllPub Corner Gtl Crawfor Norm Norm
5 AllPub FR2 Gtl NoRidge Norm Norm
6 AllPub Inside Gtl Mitchel Norm Norm
根據名稱
dat<-train0%>%select(MSZoning, Utilities, HouseStyle, Heating, YearBuilt, SalePrice)
head(dat)
MSZoning Utilities HouseStyle Heating YearBuilt SalePrice
1 RL AllPub 2Story GasA 2003 208500
2 RL AllPub 1Story GasA 1976 181500
3 RL AllPub 2Story GasA 2001 223500
4 RL AllPub 2Story GasA 1915 140000
5 RL AllPub 2Story GasA 2000 250000
6 RL AllPub 1.5Fin GasA 1993 143000
根據某種條件
#1. 欄位含某個字串:選取欄位名稱含有Lot,Bsmt開頭,或是以Condition結尾
dat<-train0%>%select(contains("Lot"), starts_with("Bsmt"), ends_with("Condition"))
head(dat)
LotFrontage LotArea LotShape LotConfig BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
1 65 8450 Reg Inside Gd TA No GLQ 706
2 80 9600 Reg FR2 Gd TA Gd ALQ 978
3 68 11250 IR1 Inside Gd TA Mn GLQ 486
4 60 9550 IR1 Corner TA Gd No ALQ 216
5 84 14260 IR1 FR2 Gd TA Av GLQ 655
6 85 14115 IR1 Inside Gd TA No GLQ 732
BsmtFinType2 BsmtFinSF2 BsmtUnfSF BsmtFullBath BsmtHalfBath SaleCondition
1 Unf 0 150 1 0 Normal
2 Unf 0 284 0 1 Normal
3 Unf 0 434 1 0 Normal
4 Unf 0 540 1 0 Abnorml
5 Unf 0 490 1 0 Normal
6 Unf 0 64 1 0 Normal
#2. 欄位符合某種pattern
dat<-train0%>%select(matches("Yr|Year|year|yr")) #跟年份有關的欄位
head(dat)
YearBuilt YearRemodAdd GarageYrBlt YrSold
1 2003 2003 2003 2008
2 1976 1976 1976 2007
3 2001 2002 2001 2008
4 1915 1970 1998 2006
5 2000 2000 2000 2008
6 1993 1995 1993 2009
#3. 刪除欄位
dat<-train0%>%select(-PoolArea, -Fence, -matches("Bsmt|Lot|Garage")) #PoolArea, Fence, 含有Bsmt或Lot或Garage的欄位都刪除
colnames(dat)
[1] "Id" "MSSubClass" "MSZoning" "Street" "Alley" "LandContour"
[7] "Utilities" "LandSlope" "Neighborhood" "Condition1" "Condition2" "BldgType"
[13] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt" "YearRemodAdd" "RoofStyle"
[19] "RoofMatl" "Exterior1st" "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
[25] "ExterCond" "Foundation" "Heating" "HeatingQC" "CentralAir" "Electrical"
[31] "X1stFlrSF" "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "FullBath" "HalfBath"
[37] "BedroomAbvGr" "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional" "Fireplaces"
[43] "FireplaceQu" "PavedDrive" "WoodDeckSF" "OpenPorchSF" "EnclosedPorch" "X3SsnPorch"
[49] "ScreenPorch" "PoolQC" "MiscFeature" "MiscVal" "MoSold" "YrSold"
[55] "SaleType" "SaleCondition" "SalePrice"
先把房價簡單的視覺化
plot_ly(train0, x=~SalePrice, type="histogram")
- 房價在某個範圍
range(train0$SalePrice) #所有資料的房價範圍
[1] 34900 755000
dat<-train0%>%filter(SalePrice>=100000, SalePrice<=300000)
range(dat$SalePrice) #選取後的房價範圍
[1] 100000 299800
#房價100000~300000佔所有資料的比例達84.3%
scales::percent(nrow(dat)/nrow(train0))
[1] "84.3%"
- 根據欄位資料類型篩選
table(train0$SaleType) #SaleType當中WD類型最多
COD Con ConLD ConLI ConLw CWD New Oth WD
43 2 9 5 5 4 122 3 1267
#選SaleTeyp=="WD"的資料(WD:Warranty Deed - Conventional)
dat<-train0%>%filter(SaleType=="WD")
table(dat$SaleType)
WD
1267
- 可根據多種條件
table(train0$YrSold, train0$SaleType) #售出年份與SaleType的數量
COD Con ConLD ConLI ConLw CWD New Oth WD
2006 5 0 2 1 1 0 43 1 261
2007 5 0 0 0 0 3 40 2 279
2008 14 1 5 0 1 1 14 0 268
2009 14 0 2 3 3 0 18 0 298
2010 5 1 0 1 0 0 7 0 161
#選2008年以前售出,而且SaleType為New(剛蓋好就賣出)的資料
dat<-train0%>%
filter(YrSold<2008, SaleType=="New")
table(dat$YrSold, dat$SaleType)
New
2006 43
2007 40
- 根據rows的位置
dat<-train0%>%slice(1000:1001) #第1000-1001筆資料
head(dat[,1:3])
Id MSSubClass MSZoning
1 1000 20 RL
2 1001 20 RL
- 根據某欄位前幾名
#選出售價前五名
dat<-train0%>%top_n(5, SalePrice)
#列出Id, 社區名稱並將售價由高至低排序
head(dat%>%select(Id, Neighborhood, SalePrice)%>%arrange(-SalePrice))
Id Neighborhood SalePrice
1 692 NoRidge 755000
2 1183 NoRidge 745000
3 1170 NoRidge 625000
4 899 NridgHt 611657
5 804 NridgHt 582933
根據社區算該區最低,最高,平均房價,以及房價標準差
dat<-train0%>%
group_by(Neighborhood)%>%
summarise(low=min(SalePrice),
high=max(SalePrice),
average=mean(SalePrice),
sd=sd(SalePrice))%>%
arrange(-average)
Neighborhood | low | high | average | sd |
---|---|---|---|---|
NoRidge | 190000 | 755000 | 335295.32 | 121412.659 |
NridgHt | 154000 | 611657 | 316270.62 | 96392.545 |
StoneBr | 170000 | 556581 | 310499.00 | 112969.677 |
Timber | 137500 | 378500 | 242247.45 | 64845.652 |
Veenker | 162500 | 385000 | 238772.73 | 72369.318 |
Somerst | 144152 | 423000 | 225379.84 | 56177.556 |
ClearCr | 130000 | 328000 | 212565.43 | 50231.539 |
Crawfor | 90350 | 392500 | 210624.73 | 68866.395 |
CollgCr | 110000 | 424870 | 197965.77 | 51403.666 |
Blmngtn | 159895 | 264561 | 194870.88 | 30393.229 |
Gilbert | 141000 | 377500 | 192854.51 | 35986.779 |
NWAmes | 82500 | 299800 | 189050.07 | 37172.218 |
SawyerW | 76000 | 320000 | 186555.80 | 55651.998 |
Mitchel | 84500 | 271000 | 156270.12 | 36486.625 |
NAmes | 87500 | 345000 | 145847.08 | 33075.345 |
NPkVill | 127500 | 155000 | 142694.44 | 9377.315 |
SWISU | 60000 | 200000 | 142591.36 | 32622.918 |
Blueste | 124000 | 151000 | 137500.00 | 19091.883 |
Sawyer | 62383 | 190000 | 136793.14 | 22345.129 |
OldTown | 37900 | 475000 | 128225.30 | 52650.583 |
Edwards | 58500 | 320000 | 128219.70 | 43208.616 |
BrkSide | 39300 | 223500 | 124834.05 | 40348.689 |
BrDale | 83000 | 125000 | 104493.75 | 14330.176 |
IDOTRR | 34900 | 169500 | 100123.78 | 33376.710 |
MeadowV | 75000 | 151400 | 98576.47 | 23491.050 |
亦可增加 grouping 條件 (ex. street)
dat<-train0%>%
group_by(Neighborhood, Street)%>%
summarise(low=min(SalePrice),
high=max(SalePrice),
average=mean(SalePrice),
sd=sd(SalePrice))
Neighborhood | Street | low | high | average | sd |
---|---|---|---|---|---|
Blmngtn | Pave | 159895 | 264561 | 194870.88 | 30393.229 |
Blueste | Pave | 124000 | 151000 | 137500.00 | 19091.883 |
BrDale | Pave | 83000 | 125000 | 104493.75 | 14330.176 |
BrkSide | Pave | 39300 | 223500 | 124834.05 | 40348.689 |
ClearCr | Pave | 130000 | 328000 | 212565.43 | 50231.539 |
CollgCr | Pave | 110000 | 424870 | 197965.77 | 51403.666 |
Crawfor | Pave | 90350 | 392500 | 210624.73 | 68866.395 |
Edwards | Grvl | 118500 | 118500 | 118500.00 | NA |
Edwards | Pave | 58500 | 320000 | 128317.88 | 43417.297 |
Gilbert | Pave | 141000 | 377500 | 192854.51 | 35986.779 |
IDOTRR | Grvl | 55993 | 110000 | 82331.00 | 27028.091 |
IDOTRR | Pave | 34900 | 169500 | 101693.74 | 33759.212 |
MeadowV | Pave | 75000 | 151400 | 98576.47 | 23491.050 |
Mitchel | Pave | 84500 | 271000 | 156270.12 | 36486.625 |
NAmes | Pave | 87500 | 345000 | 145847.08 | 33075.345 |
NoRidge | Pave | 190000 | 755000 | 335295.32 | 121412.659 |
NPkVill | Pave | 127500 | 155000 | 142694.44 | 9377.315 |
NridgHt | Pave | 154000 | 611657 | 316270.62 | 96392.545 |
NWAmes | Pave | 82500 | 299800 | 189050.07 | 37172.218 |
OldTown | Pave | 37900 | 475000 | 128225.30 | 52650.583 |
Sawyer | Pave | 62383 | 190000 | 136793.14 | 22345.129 |
SawyerW | Pave | 76000 | 320000 | 186555.80 | 55651.998 |
Somerst | Pave | 144152 | 423000 | 225379.84 | 56177.556 |
StoneBr | Pave | 170000 | 556581 | 310499.00 | 112969.677 |
SWISU | Pave | 60000 | 200000 | 142591.36 | 32622.918 |
Timber | Grvl | 186700 | 228950 | 207825.00 | 29875.262 |
Timber | Pave | 137500 | 378500 | 244159.81 | 65941.417 |
Veenker | Pave | 162500 | 385000 | 238772.73 | 72369.318 |
簡單地看一下房屋建造年份,外牆材質跟售價的關係
dat<-train0%>%
group_by(YearBuilt, MasVnrType)%>%
summarise(average=mean(SalePrice))%>%
arrange(-average)
plot_ly(dat, x = ~YearBuilt, y = ~average, text = ~MasVnrType, type = 'scatter', mode = 'markers', size = ~average, color = ~MasVnrType,
#Choosing the range of the bubbles' sizes:
sizes = c(10, 80),
marker = list(opacity = 0.5, sizemode = 'diameter')) %>%
layout(title = 'Estate Sale Price by Neighborhood',
xaxis = list(showgrid = FALSE),
yaxis = list(showgrid = FALSE),
showlegend = TRUE)
BsmtQual看起來是地下室的 quality, 但看了資料說明才發現是 basement height 為了讓欄位名稱更直覺,決定改成 BsmtHght (注意:新名稱放前面)
dat<-train0%>%
select(BsmtQual)%>%
rename(BsmtHght=BsmtQual)
head(dat)
BsmtHght
1 Gd
2 Gd
3 Gd
4 TA
5 Gd
6 Gd
關於 basement 的欄位實在有點多,可以做一些合併
colnames(train0%>%select(contains("Bsmt")))
[1] "BsmtQual" "BsmtCond" "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
[7] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "BsmtFullBath" "BsmtHalfBath"
#以BsmtFullBath, BsmtHalfBath為例,可以將兩個欄位合併成BsmtBath
Bsmt<-train0%>%
select(matches("Bsmt.*Bath"))%>%
mutate(BsmtBath=case_when(BsmtFullBath>0|BsmtHalfBath>0~1, #只要有Bath,無論種類都標示為1,否則為0
TRUE~0))
head(Bsmt%>%arrange(-BsmtBath))
BsmtFullBath BsmtHalfBath BsmtBath
1 1 0 1
2 0 1 1
3 1 0 1
4 1 0 1
5 1 0 1
6 1 0 1
根據房屋整體品質來分組,新增“平均房價”欄位
dat<-train0%>%
group_by(OverallQual)%>%
mutate(average_SalePrice=mean(SalePrice))%>%
select(Id, OverallQual, SalePrice)
dat[1:10,]
# A tibble: 10 x 3
# Groups: OverallQual [4]
Id OverallQual SalePrice
<int> <int> <int>
1 1 7 208500
2 2 6 181500
3 3 7 223500
4 4 7 140000
5 5 8 250000
6 6 5 143000
7 7 8 307000
8 8 7 200000
9 9 7 129900
10 10 5 118000
很快地看一下 data description ,嗯,格式有點亂,欄位名稱跟數值顯示在同一欄
dcr0<-read.delim("data_description.txt", header = FALSE, stringsAsFactors = FALSE)
datatable(dcr0)
把 data description 做個簡單清理與 reformatting
dcr<-dcr0%>%
#先切開欄位名稱跟該欄位的數值種類(by tab or space),切完後取前面那段
mutate(feature=sapply(strsplit(V1, '\t|[[:space:]]'), "[", 1))%>%
filter(!is.na(feature))%>% #為了下一步的fill, 先把空白的欄位用NA取代
mutate_all(na_if, "")%>% #把feature這個欄位的NA用前一個非NA的值取代
fill(feature, .direction="down")%>%
rename(value=V1, description=V2)%>% #改成比較直覺的名稱
select(feature, value, description)
利用
DT
,datatable
建立一個方便查詢的表格 (可輸入某欄位名稱,檢視所有數值代表的涵意)
datatable(dcr, options = list(pageLength=20))
tidyr
中的 spread
function把欄位展開spread
相反的 function 是 gather
#BsmtFinType1
fintype1<-train0%>% #計算每個Id在某個BsmtFinType1出現的頻率
group_by(Id, BsmtFinType1)%>%
summarise(count=n())%>%
spread(BsmtFinType1, count, fill=0) #fill=0的意思是該Id沒有某種Type時,用0取代,預設值為NA
head(fintype1)
# A tibble: 6 x 8
# Groups: Id [6]
Id ALQ BLQ GLQ LwQ Rec Unf `<NA>`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 1 0 0 0 0
2 2 1 0 0 0 0 0 0
3 3 0 0 1 0 0 0 0
4 4 1 0 0 0 0 0 0
5 5 0 0 1 0 0 0 0
6 6 0 0 1 0 0 0 0
#BsmtFinType2
fintype2<-train0%>%
group_by(Id, BsmtFinType2)%>%
summarise(count=n())%>%
spread(BsmtFinType2, count, fill=0)
head(fintype2)
# A tibble: 6 x 8
# Groups: Id [6]
Id ALQ BLQ GLQ LwQ Rec Unf `<NA>`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 0 0 0 1 0
2 2 0 0 0 0 0 1 0
3 3 0 0 0 0 0 1 0
4 4 0 0 0 0 0 1 0
5 5 0 0 0 0 0 1 0
6 6 0 0 0 0 0 1 0
#合併以及加總
bsmtfintype<-
#bind_rows與rbind的不同是前者會自行比對相同的欄位去合併,後者必須欄位序相同才行
bind_rows(fintype1, fintype2) %>%
group_by(Id) %>%
summarise_all(funs(sum(., na.rm = TRUE)))%>%
rename_all(function(x) paste0("BsmtFinType_", x)) #標記"BsmtFinType"當作prefix
#併回train data
train<-train0%>%
left_join(bsmtfintype, by=c("Id"="BsmtFinType_Id"))
我們可以看到原本兩個欄位已經成功合併並且展開,利用這樣的方法我們可以把重要的 feature 做更多層次的運用。
datatable(train%>%select(Id, contains("BsmtFinType")))
請挑選 training data 中 1-3 個你覺得重要的欄位,或好幾個性質類似的欄位,進行轉換,拆解,或合併(自由發揮)。