Day 1 : Data Cleaning and Wrangling

本日重點: 學習 dplyr , tidyr 等套件中的資料清理及轉換技能

##設定環境
#setwd(dir) #設定working directory的存放位置
# MAC : setwd("/Users/rladiestaipei/R_DragonBall/") 
# Windows : setwd("C://Users/rladiestaipei/Desktop/R_DragonBall/")  

#安裝套件(僅需執行一次)
#install.packages(c("tidyverse", "ploty", "zoo", "lubridate", "rmarkdown","data.table", "DT", "kableExtra"), dependencies = TRUE)

#load packages
library(DT)
library(zoo)
library(plotly)
library(lubridate)
library(rmarkdown)
library(data.table)
library(tidyverse)
library(kableExtra)
options(dplyr.print_max=1e9)

Part 1 : Read and load data

介紹兩種最常用的把資料載入的方式： read.csv() 以及 fread()

read.csv() : 是 R 的 base function，最常用也最直接讀入 csv 檔的方式
fread() : 來自 data.table library，當資料量大到一個程度的時候（ex. 百萬筆以上）， fread() 會用更快的速度把資料載入
若讀入的資料為 .xlsx 檔，可以用 xlsx::read.xlsx()
大家可以用 str() (或 glimpse() )這兩個 function 來檢視 read.csv() 以及 fread() 讀入的資料差異為何？

#資料需放在此workspace中，若不是的話需要指定路徑
train1<-read.csv("train.csv")  
train2<-fread("train.csv")
#str(train1)        #檢視所有欄位
str(train1[,1:10])  #避免佔用教材過大篇幅，僅顯示前十個欄位

'data.frame':   1460 obs. of  10 variables:
 $ Id         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass : int  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoning   : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
 $ LotFrontage: int  65 80 68 60 84 85 75 NA 51 50 ...
 $ LotArea    : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ Street     : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alley      : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
 $ LotShape   : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
 $ LandContour: Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Utilities  : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...

#str(train2)
str(train2[,1:10])

Classes 'data.table' and 'data.frame':  1460 obs. of  10 variables:
 $ Id         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass : int  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoning   : chr  "RL" "RL" "RL" "RL" ...
 $ LotFrontage: int  65 80 68 60 84 85 75 NA 51 50 ...
 $ LotArea    : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ Street     : chr  "Pave" "Pave" "Pave" "Pave" ...
 $ Alley      : chr  NA NA NA NA ...
 $ LotShape   : chr  "Reg" "Reg" "IR1" "IR1" ...
 $ LandContour: chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
 $ Utilities  : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
 - attr(*, ".internal.selfref")=<externalptr>

發現差異了嗎？

用 read.csv 讀入的資料會自動把類別型資料 categorical variables 轉換成factor，而 fread() 則預設把類別型資料讀入成文字 character 。
用 factor 讀入的好處是，我們用 str() 檢視的時候，可以很快的知道這個欄位有幾種數值，例如 LotShape 這個欄位，顯示為 Factor w/ 4 levels "IR1","IR2","IR3".... ，我們就知道此欄位至少有 IR1 ~ IR4 四種。
若不希望 read.csv 做這樣的轉換，可以設定 stringsAsFactors = FALSE 即可。
factor 或 character 變數類型在後續跑模型的時候會有影響，有些模型只接受 categorical variables 一定得為 factor 才行，因此模型跑得熟練之後，大家就知道資料讀入的時候應該做何種設定比較合適。

Part 2 : 檢視 data 結構

請搭配資料集介紹服用 link
用 dim() 看資料的行列數 (rows x columns)
- 資料行(rows) : 水平資料為 rows , 也常被稱為 observations 或 records
- 欄位(columns)：垂直資料為 columns ，也被稱為 fields ，單指資料集的垂直組合
- 變數(variables)：加入統計模型的 columns 通常稱為 variables
- 參數(features)：其實 columns , variables 都是一樣的概念，由於機器學習以及深度學習正夯，現在較廣泛地稱為 features 喔！

#讀入raw data 
train0<-read.csv("train.csv", stringsAsFactors = FALSE)
test0<-read.csv("test.csv", stringsAsFactors = FALSE)
dim(train0)  #training data共有1460個房子的資料，81個欄位

[1] 1460   81

dim(test0)   #testing data共有1460個房子的資料，80個欄位

[1] 1459   80

#training 與 testing 資料差別在training data多了房價 (SalesPrice)，這是testing data在建模完畢後最終要去預測的欄位
colnames(train0)[!colnames(train0) %in% colnames(test0)]

[1] "SalePrice"

Part 3 : 初階資料清理及轉換

分析資料的第一步就是理解資料，這也是考驗分析者R coding的基本功。

在 tidyverse 系列套件推出之前，資料的清理難度比現在高， rbind , cbind , subset , merge 到地老天荒( ~~其實只有我？~~ )，還有腦筋要轉個幾圈才能搞懂的 apply , lapply , tapply , sapply , mapply 等等全部長很像系列家族。

一直到 2014 年 dplyr package 推出後，R的語法變得更直覺，跟資料庫語言 SQL 更貼近，可以一步接著一步的 piping 語法(%>%)讓 R codes 變得更好讀，瞬間將 R 的 coding style 提升到另一個境界！！！

後續推出的 tibble , tidyr , purrr 等套件，再加上本來就紅透半邊天的 ggplot2 ，如果沒聽過 tidyverse 家族?，別說你懂 R 。

3-1 : Subsetting data-資料行列的選取

3-1-1 : 選取特定欄位(select)

根據位置

dat<-train0%>%select(10:15)
head(dat)

  Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
1    AllPub    Inside       Gtl      CollgCr       Norm       Norm
2    AllPub       FR2       Gtl      Veenker      Feedr       Norm
3    AllPub    Inside       Gtl      CollgCr       Norm       Norm
4    AllPub    Corner       Gtl      Crawfor       Norm       Norm
5    AllPub       FR2       Gtl      NoRidge       Norm       Norm
6    AllPub    Inside       Gtl      Mitchel       Norm       Norm

根據名稱

dat<-train0%>%select(MSZoning, Utilities, HouseStyle, Heating, YearBuilt, SalePrice)
head(dat)

  MSZoning Utilities HouseStyle Heating YearBuilt SalePrice
1       RL    AllPub     2Story    GasA      2003    208500
2       RL    AllPub     1Story    GasA      1976    181500
3       RL    AllPub     2Story    GasA      2001    223500
4       RL    AllPub     2Story    GasA      1915    140000
5       RL    AllPub     2Story    GasA      2000    250000
6       RL    AllPub     1.5Fin    GasA      1993    143000

根據某種條件

#1. 欄位含某個字串:選取欄位名稱含有Lot，Bsmt開頭，或是以Condition結尾
dat<-train0%>%select(contains("Lot"), starts_with("Bsmt"), ends_with("Condition"))
head(dat)

  LotFrontage LotArea LotShape LotConfig BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
1          65    8450      Reg    Inside       Gd       TA           No          GLQ        706
2          80    9600      Reg       FR2       Gd       TA           Gd          ALQ        978
3          68   11250      IR1    Inside       Gd       TA           Mn          GLQ        486
4          60    9550      IR1    Corner       TA       Gd           No          ALQ        216
5          84   14260      IR1       FR2       Gd       TA           Av          GLQ        655
6          85   14115      IR1    Inside       Gd       TA           No          GLQ        732
  BsmtFinType2 BsmtFinSF2 BsmtUnfSF BsmtFullBath BsmtHalfBath SaleCondition
1          Unf          0       150            1            0        Normal
2          Unf          0       284            0            1        Normal
3          Unf          0       434            1            0        Normal
4          Unf          0       540            1            0       Abnorml
5          Unf          0       490            1            0        Normal
6          Unf          0        64            1            0        Normal

#2. 欄位符合某種pattern
dat<-train0%>%select(matches("Yr|Year|year|yr"))  #跟年份有關的欄位
head(dat)

  YearBuilt YearRemodAdd GarageYrBlt YrSold
1      2003         2003        2003   2008
2      1976         1976        1976   2007
3      2001         2002        2001   2008
4      1915         1970        1998   2006
5      2000         2000        2000   2008
6      1993         1995        1993   2009

#3. 刪除欄位
dat<-train0%>%select(-PoolArea, -Fence, -matches("Bsmt|Lot|Garage"))  #PoolArea, Fence, 含有Bsmt或Lot或Garage的欄位都刪除
colnames(dat)

 [1] "Id"            "MSSubClass"    "MSZoning"      "Street"        "Alley"         "LandContour"  
 [7] "Utilities"     "LandSlope"     "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
[13] "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"     "YearRemodAdd"  "RoofStyle"    
[19] "RoofMatl"      "Exterior1st"   "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
[25] "ExterCond"     "Foundation"    "Heating"       "HeatingQC"     "CentralAir"    "Electrical"   
[31] "X1stFlrSF"     "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "FullBath"      "HalfBath"     
[37] "BedroomAbvGr"  "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"    "Fireplaces"   
[43] "FireplaceQu"   "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"   "EnclosedPorch" "X3SsnPorch"   
[49] "ScreenPorch"   "PoolQC"        "MiscFeature"   "MiscVal"       "MoSold"        "YrSold"       
[55] "SaleType"      "SaleCondition" "SalePrice"

3-1-2 : 選取特定 rows (filter, slice, top_n)

先把房價簡單的視覺化

plot_ly(train0, x=~SalePrice, type="histogram")

房價在某個範圍

range(train0$SalePrice)  #所有資料的房價範圍

[1]  34900 755000

dat<-train0%>%filter(SalePrice>=100000, SalePrice<=300000)
range(dat$SalePrice)  #選取後的房價範圍

[1] 100000 299800

#房價100000~300000佔所有資料的比例達84.3%
scales::percent(nrow(dat)/nrow(train0))

[1] "84.3%"

根據欄位資料類型篩選

table(train0$SaleType)  #SaleType當中WD類型最多


  COD   Con ConLD ConLI ConLw   CWD   New   Oth    WD 
   43     2     9     5     5     4   122     3  1267

#選SaleTeyp=="WD"的資料(WD:Warranty Deed - Conventional)
dat<-train0%>%filter(SaleType=="WD") 
table(dat$SaleType)


  WD 
1267

可根據多種條件

table(train0$YrSold, train0$SaleType)  #售出年份與SaleType的數量

      
       COD Con ConLD ConLI ConLw CWD New Oth  WD
  2006   5   0     2     1     1   0  43   1 261
  2007   5   0     0     0     0   3  40   2 279
  2008  14   1     5     0     1   1  14   0 268
  2009  14   0     2     3     3   0  18   0 298
  2010   5   1     0     1     0   0   7   0 161

#選2008年以前售出，而且SaleType為New(剛蓋好就賣出)的資料
dat<-train0%>%        
  filter(YrSold<2008, SaleType=="New")   
table(dat$YrSold, dat$SaleType)

根據rows的位置

dat<-train0%>%slice(1000:1001)  #第1000-1001筆資料
head(dat[,1:3])

    Id MSSubClass MSZoning
1 1000         20       RL
2 1001         20       RL

根據某欄位前幾名

#選出售價前五名
dat<-train0%>%top_n(5, SalePrice)    
#列出Id, 社區名稱並將售價由高至低排序
head(dat%>%select(Id, Neighborhood, SalePrice)%>%arrange(-SalePrice))

    Id Neighborhood SalePrice
1  692      NoRidge    755000
2 1183      NoRidge    745000
3 1170      NoRidge    625000
4  899      NridgHt    611657
5  804      NridgHt    582933

3-2 : Grouping and summarizing

根據社區算該區最低，最高，平均房價，以及房價標準差

dat<-train0%>%
  group_by(Neighborhood)%>%
  summarise(low=min(SalePrice),
            high=max(SalePrice),
            average=mean(SalePrice),
            sd=sd(SalePrice))%>%
  arrange(-average)

Neighborhood	low	high	average	sd
NoRidge	190000	755000	335295.32	121412.659
NridgHt	154000	611657	316270.62	96392.545
StoneBr	170000	556581	310499.00	112969.677
Timber	137500	378500	242247.45	64845.652
Veenker	162500	385000	238772.73	72369.318
Somerst	144152	423000	225379.84	56177.556
ClearCr	130000	328000	212565.43	50231.539
Crawfor	90350	392500	210624.73	68866.395
CollgCr	110000	424870	197965.77	51403.666
Blmngtn	159895	264561	194870.88	30393.229
Gilbert	141000	377500	192854.51	35986.779
NWAmes	82500	299800	189050.07	37172.218
SawyerW	76000	320000	186555.80	55651.998
Mitchel	84500	271000	156270.12	36486.625
NAmes	87500	345000	145847.08	33075.345
NPkVill	127500	155000	142694.44	9377.315
SWISU	60000	200000	142591.36	32622.918
Blueste	124000	151000	137500.00	19091.883
Sawyer	62383	190000	136793.14	22345.129
OldTown	37900	475000	128225.30	52650.583
Edwards	58500	320000	128219.70	43208.616
BrkSide	39300	223500	124834.05	40348.689
BrDale	83000	125000	104493.75	14330.176
IDOTRR	34900	169500	100123.78	33376.710
MeadowV	75000	151400	98576.47	23491.050

亦可增加 grouping 條件 (ex. street)

dat<-train0%>%
  group_by(Neighborhood, Street)%>%
  summarise(low=min(SalePrice),
            high=max(SalePrice),
            average=mean(SalePrice),
            sd=sd(SalePrice))

Neighborhood	Street	low	high	average	sd
Blmngtn	Pave	159895	264561	194870.88	30393.229
Blueste	Pave	124000	151000	137500.00	19091.883
BrDale	Pave	83000	125000	104493.75	14330.176
BrkSide	Pave	39300	223500	124834.05	40348.689
ClearCr	Pave	130000	328000	212565.43	50231.539
CollgCr	Pave	110000	424870	197965.77	51403.666
Crawfor	Pave	90350	392500	210624.73	68866.395
Edwards	Grvl	118500	118500	118500.00	NA
Edwards	Pave	58500	320000	128317.88	43417.297
Gilbert	Pave	141000	377500	192854.51	35986.779
IDOTRR	Grvl	55993	110000	82331.00	27028.091
IDOTRR	Pave	34900	169500	101693.74	33759.212
MeadowV	Pave	75000	151400	98576.47	23491.050
Mitchel	Pave	84500	271000	156270.12	36486.625
NAmes	Pave	87500	345000	145847.08	33075.345
NoRidge	Pave	190000	755000	335295.32	121412.659
NPkVill	Pave	127500	155000	142694.44	9377.315
NridgHt	Pave	154000	611657	316270.62	96392.545
NWAmes	Pave	82500	299800	189050.07	37172.218
OldTown	Pave	37900	475000	128225.30	52650.583
Sawyer	Pave	62383	190000	136793.14	22345.129
SawyerW	Pave	76000	320000	186555.80	55651.998
Somerst	Pave	144152	423000	225379.84	56177.556
StoneBr	Pave	170000	556581	310499.00	112969.677
SWISU	Pave	60000	200000	142591.36	32622.918
Timber	Grvl	186700	228950	207825.00	29875.262
Timber	Pave	137500	378500	244159.81	65941.417
Veenker	Pave	162500	385000	238772.73	72369.318

簡單地看一下房屋建造年份，外牆材質跟售價的關係

dat<-train0%>%
  group_by(YearBuilt, MasVnrType)%>%
  summarise(average=mean(SalePrice))%>%
  arrange(-average)

plot_ly(dat, x = ~YearBuilt, y = ~average, text = ~MasVnrType, type = 'scatter', mode = 'markers', size = ~average, color = ~MasVnrType,
        #Choosing the range of the bubbles' sizes:
        sizes = c(10, 80),
        marker = list(opacity = 0.5, sizemode = 'diameter')) %>%
  layout(title = 'Estate Sale Price by Neighborhood',
         xaxis = list(showgrid = FALSE),
         yaxis = list(showgrid = FALSE),
         showlegend = TRUE)

3-3 Mutating : 建立新欄位

3-3-1 更改現有欄位名稱

BsmtQual看起來是地下室的 quality, 但看了資料說明才發現是 basement height 為了讓欄位名稱更直覺，決定改成 BsmtHght (注意：新名稱放前面)

dat<-train0%>%   
  select(BsmtQual)%>%
  rename(BsmtHght=BsmtQual)   
head(dat)

  BsmtHght
1       Gd
2       Gd
3       Gd
4       TA
5       Gd
6       Gd

3-3-2 新增欄位

關於 basement 的欄位實在有點多，可以做一些合併

colnames(train0%>%select(contains("Bsmt")))

 [1] "BsmtQual"     "BsmtCond"     "BsmtExposure" "BsmtFinType1" "BsmtFinSF1"   "BsmtFinType2"
 [7] "BsmtFinSF2"   "BsmtUnfSF"    "TotalBsmtSF"  "BsmtFullBath" "BsmtHalfBath"

#以BsmtFullBath, BsmtHalfBath為例，可以將兩個欄位合併成BsmtBath
Bsmt<-train0%>%                    
  select(matches("Bsmt.*Bath"))%>%
  mutate(BsmtBath=case_when(BsmtFullBath>0|BsmtHalfBath>0~1,  #只要有Bath，無論種類都標示為1，否則為0
                            TRUE~0))
head(Bsmt%>%arrange(-BsmtBath))

  BsmtFullBath BsmtHalfBath BsmtBath
1            1            0        1
2            0            1        1
3            1            0        1
4            1            0        1
5            1            0        1
6            1            0        1

3-3-3 根據分組建立新欄位

根據房屋整體品質來分組，新增“平均房價”欄位

dat<-train0%>%                         
  group_by(OverallQual)%>%
  mutate(average_SalePrice=mean(SalePrice))%>%    
  select(Id, OverallQual, SalePrice)
dat[1:10,]

# A tibble: 10 x 3
# Groups:   OverallQual [4]
      Id OverallQual SalePrice
   <int>       <int>     <int>
 1     1           7    208500
 2     2           6    181500
 3     3           7    223500
 4     4           7    140000
 5     5           8    250000
 6     6           5    143000
 7     7           8    307000
 8     8           7    200000
 9     9           7    129900
10    10           5    118000

3-4 小練習 : 清理 Data description 檔案

很快地看一下 data description ，嗯，格式有點亂，欄位名稱跟數值顯示在同一欄

dcr0<-read.delim("data_description.txt", header = FALSE, stringsAsFactors = FALSE)

datatable(dcr0)

把 data description 做個簡單清理與 reformatting

dcr<-dcr0%>%
  #先切開欄位名稱跟該欄位的數值種類(by tab or space)，切完後取前面那段
  mutate(feature=sapply(strsplit(V1, '\t|[[:space:]]'), "[", 1))%>%  
  filter(!is.na(feature))%>%           #為了下一步的fill, 先把空白的欄位用NA取代
  mutate_all(na_if, "")%>%             #把feature這個欄位的NA用前一個非NA的值取代
  fill(feature, .direction="down")%>% 
  rename(value=V1, description=V2)%>%  #改成比較直覺的名稱
  select(feature, value, description)

利用 DT , datatable 建立一個方便查詢的表格 (可輸入某欄位名稱，檢視所有數值代表的涵意)

datatable(dcr, options = list(pageLength=20))

Part 4 : 進階資料清理及轉換

4-1 Reshaping

BsmtFinType1 以及 BsmtFinType2 是記錄房屋的多種 finish type
我們可以試著把這些 finish type 變成獨立的欄位，也就是每種 BsmtFinishType 都變成一個變數
利用 tidyr 中的 spread function把欄位展開
我們先分別把 BsmtFinType1 以及 BsmtFinType2 展開之後合併加總，最後再把兩組資料利用 Id 併回原始資料
ps. 跟 spread 相反的 function 是 gather

#BsmtFinType1
fintype1<-train0%>%             #計算每個Id在某個BsmtFinType1出現的頻率
  group_by(Id, BsmtFinType1)%>%
  summarise(count=n())%>%
  spread(BsmtFinType1, count, fill=0)  #fill=0的意思是該Id沒有某種Type時，用0取代，預設值為NA
head(fintype1)

# A tibble: 6 x 8
# Groups:   Id [6]
     Id   ALQ   BLQ   GLQ   LwQ   Rec   Unf `<NA>`
  <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1     1     0     0     1     0     0     0      0
2     2     1     0     0     0     0     0      0
3     3     0     0     1     0     0     0      0
4     4     1     0     0     0     0     0      0
5     5     0     0     1     0     0     0      0
6     6     0     0     1     0     0     0      0

#BsmtFinType2
fintype2<-train0%>%
  group_by(Id, BsmtFinType2)%>%
  summarise(count=n())%>%
  spread(BsmtFinType2, count, fill=0)
head(fintype2)

# A tibble: 6 x 8
# Groups:   Id [6]
     Id   ALQ   BLQ   GLQ   LwQ   Rec   Unf `<NA>`
  <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1     1     0     0     0     0     0     1      0
2     2     0     0     0     0     0     1      0
3     3     0     0     0     0     0     1      0
4     4     0     0     0     0     0     1      0
5     5     0     0     0     0     0     1      0
6     6     0     0     0     0     0     1      0

#合併以及加總
bsmtfintype<-
  #bind_rows與rbind的不同是前者會自行比對相同的欄位去合併，後者必須欄位序相同才行
  bind_rows(fintype1, fintype2) %>%            
  group_by(Id) %>%
  summarise_all(funs(sum(., na.rm = TRUE)))%>% 
  rename_all(function(x) paste0("BsmtFinType_", x)) #標記"BsmtFinType"當作prefix

#併回train data
train<-train0%>%
  left_join(bsmtfintype, by=c("Id"="BsmtFinType_Id"))

我們可以看到原本兩個欄位已經成功合併並且展開，利用這樣的方法我們可以把重要的 feature 做更多層次的運用。

datatable(train%>%select(Id, contains("BsmtFinType")))

本日小挑戰

請挑選 training data 中 1-3 個你覺得重要的欄位，或好幾個性質類似的欄位，進行轉換，拆解，或合併（自由發揮）。

Day 1 : Data Cleaning and Wrangling

Tricia

2019-08-18

Part 1 : Read and load data

Part 2 : 檢視 data 結構

Part 3 : 初階資料清理及轉換

3-1 : Subsetting data-資料行列的選取

3-1-1 : 選取特定欄位(select)

3-1-2 : 選取特定 rows (filter, slice, top_n)

3-2 : Grouping and summarizing

3-3 Mutating : 建立新欄位

3-3-1 更改現有欄位名稱

3-3-2 新增欄位

3-3-3 根據分組建立新欄位

3-4 小練習 : 清理 Data description 檔案

Part 4 : 進階資料清理及轉換

4-1 Reshaping

本日小挑戰