最近看到好多小伙伴提了数据清洗的问题,开个数据清洗的坑,分享下我的经验。
数据清洗说难也不难,觉得棘手的清洗用万能的for循环就行。但是,如何通过批量操作、向量化取代循环,这是值得思考的地方,也决定了清洗效率与代码简洁程度。
本系列通过一些典型例子,比如长宽表转换、移动平均、字符串批量拆分成行(本文未列出)等,并给出解决方案。
读者可以细细体会里面的处理思路,处理思路可以延伸到其他包、其他语言上。
note:本文所提方法案例仅适用于未达到分布式需求的场景,如总处理量50G以下,单批次处理15G以内。
主要工具
本人在公司也负责部分数据生产的工作,以R的data.table包为主,经生产环境实践后效率还是不错的。本系列的案例也主要以data.table为主,
关于data.table包的效率,在我之前的文章提到过,
对比了R、py、clickhouse、spark等的清洗效率。单批次处理10G以下的数据是一个常态,目前R在这个量级上优势很大。
note:生产环境千万别用dplyr(R)或者pandas(Py),数据量大(千万级别以上或者G级别以上)的情况下处理效率低。
废话了这么多,下面直接上例子。首先,用ggplot2中的数据集diamonds构造了1千万行的测试数据,其中price列构造了千分之一的缺失值。
后续的每个例子都会基于该数据计算,并附上运行时间与结果。
library(data.table)diamonds <- ggplot2::diamondsn = 10000000set.seed(1234)dtranges <- seq.Date(from = as.Date("2011-01-01"),to = as.Date("2020-01-01"),by = 1)n1 <- sample(nrow(diamonds), n, replace = TRUE)dat1 <- as.data.table(diamonds[n1, ])dat1[, "dt"] <- sample(dtranges, n, replace = TRUE) # 增加dt列n2 <- sample(nrow(dat1), nrow(dat1)/1000)dat1[n2, "price"] <- NA # price列构造千分之一缺失值dat2 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),price1 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))dat3 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),price2 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))print(dat1)## carat cut color clarity depth table price x y z## 1: 0.61 Good E I1 63.4 57.1 1168 5.37 5.43 3.42## 2: 0.53 Premium G SI2 60.8 58.0 1173 5.21 5.19 3.16## 3: 0.23 Very Good E VVS2 62.3 55.0 505 3.90 3.93 2.44## 4: 1.33 Ideal J VS1 61.3 57.0 6118 7.11 7.08 4.35## 5: 0.30 Ideal E VVS1 61.6 56.0 838 4.30 4.34 2.66## ---## 9999996: 1.00 Premium D SI2 62.4 58.0 4816 6.36 6.33 3.96## 9999997: 0.40 Very Good D VS2 63.2 57.0 850 4.66 4.71 2.96## 9999998: 1.51 Very Good H SI2 62.3 58.0 7823 7.29 7.25 4.53## 9999999: 0.34 Ideal D VS1 60.3 56.0 998 4.53 4.56 2.74## 10000000: 1.25 Very Good G VS2 63.5 60.0 8575 6.74 6.64 4.25## dt## 1: 2018-09-12## 2: 2012-07-31## 3: 2017-08-19## 4: 2016-11-20## 5: 2016-01-10## ---## 9999996: 2015-10-05## 9999997: 2011-11-17## 9999998: 2018-12-30## 9999999: 2019-01-13## 10000000: 2018-06-24
聚合
求每种切割类型、每种颜色钻石的平均价格、中位数价格与最高价格
t1_1 <- system.time({r1_1 <- dat1[, .(mean_price = mean(price, na.rm = TRUE),median_price = median(price, na.rm = TRUE),max_price = max(price, na.rm = TRUE)),by = .(cut, color)]})[["elapsed"]]sprintf("耗时%s秒", round(t1_1, 2))r1_1
求每天最高出售价格对应的那笔订单
先排序,再分组求首行:
t1_2 <- system.time({setorder(dat1, dt, -price) # 先以dt升序,price倒序r1_2 <- dat1[!is.na(price), .SD[1], by = .(dt)]})[["elapsed"]]sprintf("耗时%s秒", round(t1_2, 2))# [1] "耗时1.15秒"r1_2# dt carat cut color clarity depth table price x y z# 1: 2011-01-01 2.80 Good G SI2 63.8 58.0 18788 8.90 8.85 0.00# 2: 2011-01-02 2.03 Ideal G SI1 60.0 55.8 18757 8.17 8.30 4.95# 3: 2011-01-03 2.08 Premium I VS2 58.9 61.0 18447 8.36 8.35 4.92# 4: 2011-01-04 2.07 Ideal G SI2 62.5 55.0 18804 8.20 8.13 5.11# 5: 2011-01-05 2.05 Ideal G SI1 61.9 57.0 18787 8.10 8.16 5.03# ---# 3284: 2019-12-28 2.55 Premium I VS1 61.8 62.0 18766 8.70 8.65 5.36# 3285: 2019-12-29 2.00 Very Good G SI1 63.5 56.0 18818 7.90 7.97 5.04# 3286: 2019-12-30 2.02 Ideal G VS2 62.0 57.0 18700 8.10 8.05 5.01# 3287: 2019-12-31 1.71 Ideal G VVS2 62.1 55.0 18768 7.66 7.63 4.75# 3288: 2020-01-01 2.03 Ideal G SI1 60.0 55.8 18757 8.17 8.30 4.95
join
dat1与dat2以dt列左连接
t2_1 <- system.time({r2_1 <- merge(dat1, dat2, by = "dt", all.x = TRUE)})[["elapsed"]]sprintf("耗时%s秒", t2_1)## [1] "耗时0.78秒"r2_1## dt carat cut color clarity depth table price x y## 1: 2011-01-01 0.71 Ideal H VS1 61.8 55 NA 5.72 5.77## 2: 2011-01-01 0.46 Ideal D VS2 62.7 57 NA 4.89 4.93## 3: 2011-01-01 2.80 Good G SI2 63.8 58 18788 8.90 8.85## 4: 2011-01-01 1.70 Premium G VVS2 59.8 59 18718 7.70 7.75## 5: 2011-01-01 2.16 Very Good F SI2 62.4 59 18678 8.28 8.33## ---## 9999996: 2020-01-01 0.31 Very Good J SI1 61.9 59 363 4.28 4.32## 9999997: 2020-01-01 0.28 Very Good H SI1 61.5 56 360 4.21 4.24## 9999998: 2020-01-01 0.23 Good F VS2 63.8 57 357 3.93 3.84## 9999999: 2020-01-01 0.23 Good F VS2 63.8 57 357 3.93 3.84## 10000000: 2020-01-01 0.23 Good E VS2 61.8 63 357 3.88 3.89## z price1## 1: 3.55 310## 2: 3.08 310## 3: 0.00 310## 4: 4.62 310## 5: 5.18 310## ---## 9999996: 2.66 65## 9999997: 2.60 65## 9999998: 2.48 65## 9999999: 2.48 65## 10000000: 2.40 65
多重join
dat1以dt列左连接dat2,dat3…,多重聚合需要reduce操作:
t2_2 <- system.time({mymerge <- function(x, y) merge.data.table(x, y, by = "dt", all.x = TRUE)r2_2 <- Reduce(mymerge, list(dat1, dat2, dat3))})[["elapsed"]]sprintf("耗时%s秒", round(t2_2, 2))## [1] "耗时1.53秒"r2_2## dt carat cut color clarity depth table price x y## 1: 2011-01-01 0.71 Ideal H VS1 61.8 55 NA 5.72 5.77## 2: 2011-01-01 0.46 Ideal D VS2 62.7 57 NA 4.89 4.93## 3: 2011-01-01 2.80 Good G SI2 63.8 58 18788 8.90 8.85## 4: 2011-01-01 1.70 Premium G VVS2 59.8 59 18718 7.70 7.75## 5: 2011-01-01 2.16 Very Good F SI2 62.4 59 18678 8.28 8.33## ---## 9999996: 2020-01-01 0.31 Very Good J SI1 61.9 59 363 4.28 4.32## 9999997: 2020-01-01 0.28 Very Good H SI1 61.5 56 360 4.21 4.24## 9999998: 2020-01-01 0.23 Good F VS2 63.8 57 357 3.93 3.84## 9999999: 2020-01-01 0.23 Good F VS2 63.8 57 357 3.93 3.84## 10000000: 2020-01-01 0.23 Good E VS2 61.8 63 357 3.88 3.89## z price1 price2## 1: 3.55 310 280## 2: 3.08 310 280## 3: 0.00 310 280## 4: 4.62 310 280## 5: 5.18 310 280## ---## 9999996: 2.66 65 963## 9999997: 2.60 65 963## 9999998: 2.48 65 963## 9999999: 2.48 65 963## 10000000: 2.40 65 963
长宽表转换
长宽表是通用但不高频的操作,也是个难点。
长表转宽表
一般用于方便查询,或者清洗成机器学习的训练集(一行为一个样本)。
以dat1为例,转成纵轴为切割类型(cut)、横轴为(color)的透视表,计算每个子类的平均深度(depth)与最高价格(price):
t3_1 <- system.time({# 由于price列有缺失值,需先构造忽略缺失值的mean、max函数mean1 <- function(x) mean(x, na.rm = TRUE)max1 <- function(x) max(x, na.rm = TRUE)r3_1 <- dcast.data.table(dat1, cut ~ color, value.var = c("depth", "price"), fun = list(mean1, max1))})[["elapsed"]]sprintf("耗时%s秒", round(t3_1, 2))## [1] "耗时0.33秒"r3_1## cut depth_mean1_D depth_mean1_E depth_mean1_F depth_mean1_G## 1: Fair 64.03644 63.28830 63.52278 64.33349## 2: Good 62.36325 62.20834 62.20131 62.52365## 3: Very Good 61.75085 61.73201 61.72085 61.84400## 4: Premium 61.16783 61.17557 61.25882 61.27935## 5: Ideal 61.67709 61.68604 61.67461 61.70079## depth_mean1_H depth_mean1_I depth_mean1_J price_mean1_D price_mean1_E## 1: 64.59086 64.21593 64.37899 4333.475 3677.251## 2: 62.49905 62.48628 62.38909 3408.291 3415.467## 3: 61.96961 61.93466 61.90480 3475.854 3212.806## 4: 61.31966 61.33115 61.39429 3635.446 3541.243## 5: 61.73493 61.79271 61.82294 2633.070 2591.356## price_mean1_F price_mean1_G price_mean1_H price_mean1_I price_mean1_J## 1: 3833.757 4234.331 5143.889 4695.712 5002.936## 2: 3502.666 4110.020 4286.863 5097.310 4576.685## 3: 3780.723 3870.807 4521.889 5255.918 5108.285## 4: 4326.868 4497.991 5214.150 5928.537 6288.934## 5: 3371.575 3721.588 3882.526 4457.514 4926.173## depth_max1_D depth_max1_E depth_max1_F depth_max1_G depth_max1_H## 1: 71.6 79.0 71.0 72.9 71.8## 2: 67.0 65.9 66.5 65.8 66.1## 3: 64.7 64.5 64.5 64.9 64.7## 4: 63.0 63.0 63.0 63.0 63.0## 5: 64.5 65.5 65.3 64.1 65.1## depth_max1_I depth_max1_J price_max1_D price_max1_E price_max1_F## 1: 71.3 73.6 16386 15584 17995## 2: 65.9 66.0 18468 18236 18686## 3: 64.4 64.1 18542 18731 18777## 4: 63.0 63.0 18575 18477 18791## 5: 66.7 65.4 18693 18729 18780## price_max1_G price_max1_H price_max1_I price_max1_J## 1: 18574 18565 18242 18531## 2: 18788 18640 18707 18325## 3: 18818 18803 18500 18430## 4: 18741 18795 18823 18710## 5: 18806 18760 18779 18508
宽表转长表
以dat1为例,切割类型(cut)、横轴为(color)不动,扩展x、y、z字段及其值为长表,转后结果为3千万行:
t3_2 <- system.time({r3_2 <- melt.data.table(dat1, id.vars = c("cut", "color"), measure.vars = c("x", "y", "z"), variable.name = "xyz", value.name = "xyzvalue")})[["elapsed"]]sprintf("耗时%s秒", round(t3_2, 2))## [1] "耗时0.08秒"r3_2## cut color xyz xyzvalue## 1: Ideal H x 5.72## 2: Ideal D x 4.89## 3: Good G x 8.90## 4: Premium G x 7.70## 5: Very Good F x 8.28## ---## 29999996: Very Good J z 2.66## 29999997: Very Good H z 2.60## 29999998: Good F z 2.48## 29999999: Good F z 2.48## 30000000: Good E z 2.40
高阶
向上/下填充空值
一个常见的应用场景,停牌股票的价格填充。
此处先给出最小示例,再以dat1做例子:
x <- c(1, 2, 3, 3, NA, NA, 4, 5)nafill(x, type = "nocb") # 向上填充## [1] 1 2 3 3 4 4 4 5t4_1 <- system.time({dat1$price <- nafill(dat1$price, type = "locf") # 向下填充})[["elapsed"]]sprintf("耗时%s秒", round(t4_1, 2))## [1] "耗时0.13秒"print(dat1)## carat cut color clarity depth table price x y z## 1: 0.71 Ideal H VS1 61.8 55 NA 5.72 5.77 3.55## 2: 0.46 Ideal D VS2 62.7 57 NA 4.89 4.93 3.08## 3: 2.80 Good G SI2 63.8 58 18788 8.90 8.85 0.00## 4: 1.70 Premium G VVS2 59.8 59 18718 7.70 7.75 4.62## 5: 2.16 Very Good F SI2 62.4 59 18678 8.28 8.33 5.18## ---## 9999996: 0.31 Very Good J SI1 61.9 59 363 4.28 4.32 2.66## 9999997: 0.28 Very Good H SI1 61.5 56 360 4.21 4.24 2.60## 9999998: 0.23 Good F VS2 63.8 57 357 3.93 3.84 2.48## 9999999: 0.23 Good F VS2 63.8 57 357 3.93 3.84 2.48## 10000000: 0.23 Good E VS2 61.8 63 357 3.88 3.89 2.40## dt## 1: 2011-01-01## 2: 2011-01-01## 3: 2011-01-01## 4: 2011-01-01## 5: 2011-01-01## ---## 9999996: 2020-01-01## 9999997: 2020-01-01## 9999998: 2020-01-01## 9999999: 2020-01-01## 10000000: 2020-01-01
添加子维度聚合结果为新列
对于这类问题,很常见的思路是,先计算聚合结果,再join到原表上。data.table提供了直接添加聚合结果列的功能,代码少且效率高。
以dat1为例,添加两列,一列为以cut、color聚合求price的均值,另一列是求标准差:
t5_1 <- system.time({dat1[, `:=`(mean_price = mean(price, na.rm = TRUE),sd_price = sd(price, na.rm = TRUE)),by = .(cut, color)]})[["elapsed"]]sprintf("耗时%s秒", round(t5_1, 2))## [1] "耗时0.34秒"setorder(dat1, cut, color)print(dat1)## carat cut color clarity depth table price x y z dt## 1: 2.00 Fair D SI1 64.8 58 15540 7.98 7.89 5.14 2011-01-01## 2: 0.90 Fair D SI1 64.0 63 4796 6.12 6.04 3.89 2011-01-01## 3: 1.02 Fair D SI2 65.5 60 4381 6.27 6.24 4.10 2011-01-01## 4: 1.01 Fair D SI1 65.9 60 4276 6.32 6.18 4.12 2011-01-01## 5: 0.90 Fair D SI2 64.6 59 3847 6.04 6.01 3.89 2011-01-01## ---## 9999996: 0.40 Ideal J VS1 62.1 53 633 4.75 4.78 2.96 2020-01-01## 9999997: 0.37 Ideal J VS1 62.4 55 624 4.64 4.59 2.88 2020-01-01## 9999998: 0.32 Ideal J VS2 62.1 54 504 4.44 4.38 2.74 2020-01-01## 9999999: 0.37 Ideal J SI1 60.5 57 497 4.63 4.66 2.81 2020-01-01## 10000000: 0.30 Ideal J IF 61.5 57 489 4.29 4.36 2.66 2020-01-01## mean_price sd_price## 1: 4329.044 3291.299## 2: 4329.044 3291.299## 3: 4329.044 3291.299## 4: 4329.044 3291.299## 5: 4329.044 3291.299## ---## 9999996: 4921.961 4479.684## 9999997: 4921.961 4479.684## 9999998: 4921.961 4479.684## 9999999: 4921.961 4479.684## 10000000: 4921.961 4479.684
以dat1为例,以dt分组添加一列序号id:
t5_2 <- system.time({dat1[, id := seq(.N), by = .(dt)]})[["elapsed"]]sprintf("耗时%s秒", round(t5_2, 2))## [1] "耗时0.27秒"print(dat1)## carat cut color clarity depth table price x y z dt## 1: 2.00 Fair D SI1 64.8 58 15540 7.98 7.89 5.14 2011-01-01## 2: 0.90 Fair D SI1 64.0 63 4796 6.12 6.04 3.89 2011-01-01## 3: 1.02 Fair D SI2 65.5 60 4381 6.27 6.24 4.10 2011-01-01## 4: 1.01 Fair D SI1 65.9 60 4276 6.32 6.18 4.12 2011-01-01## 5: 0.90 Fair D SI2 64.6 59 3847 6.04 6.01 3.89 2011-01-01## ---## 9999996: 0.40 Ideal J VS1 62.1 53 633 4.75 4.78 2.96 2020-01-01## 9999997: 0.37 Ideal J VS1 62.4 55 624 4.64 4.59 2.88 2020-01-01## 9999998: 0.32 Ideal J VS2 62.1 54 504 4.44 4.38 2.74 2020-01-01## 9999999: 0.37 Ideal J SI1 60.5 57 497 4.63 4.66 2.81 2020-01-01## 10000000: 0.30 Ideal J IF 61.5 57 489 4.29 4.36 2.66 2020-01-01## mean_price sd_price id## 1: 4329.044 3291.299 1## 2: 4329.044 3291.299 2## 3: 4329.044 3291.299 3## 4: 4329.044 3291.299 4## 5: 4329.044 3291.299 5## ---## 9999996: 4921.961 4479.684 3105## 9999997: 4921.961 4479.684 3106## 9999998: 4921.961 4479.684 3107## 9999999: 4921.961 4479.684 3108## 10000000: 4921.961 4479.684 3109
移动函数
举例,k线图中的5日、10日均线为移动平均。
以dat1为例,以color分组,求price步长为10的移动均值、移动标准差:
t6_1 <- system.time({dat1[, `:=`(MA10_price = frollmean(price, 10),MSD10_price = frollapply(price, 10, FUN = sd)),by = .(color)] # 实际计算均线时一定要先按时间排序哦~})[["elapsed"]]sprintf("耗时%s秒", round(t6_1, 2)) # 移动标准差耗时很久## [1] "耗时61.33秒"print(dat1)## carat cut color clarity depth table price x y z dt## 1: 2.00 Fair D SI1 64.8 58 15540 7.98 7.89 5.14 2011-01-01## 2: 0.90 Fair D SI1 64.0 63 4796 6.12 6.04 3.89 2011-01-01## 3: 1.02 Fair D SI2 65.5 60 4381 6.27 6.24 4.10 2011-01-01## 4: 1.01 Fair D SI1 65.9 60 4276 6.32 6.18 4.12 2011-01-01## 5: 0.90 Fair D SI2 64.6 59 3847 6.04 6.01 3.89 2011-01-01## ---## 9999996: 0.40 Ideal J VS1 62.1 53 633 4.75 4.78 2.96 2020-01-01## 9999997: 0.37 Ideal J VS1 62.4 55 624 4.64 4.59 2.88 2020-01-01## 9999998: 0.32 Ideal J VS2 62.1 54 504 4.44 4.38 2.74 2020-01-01## 9999999: 0.37 Ideal J SI1 60.5 57 497 4.63 4.66 2.81 2020-01-01## 10000000: 0.30 Ideal J IF 61.5 57 489 4.29 4.36 2.66 2020-01-01## mean_price sd_price id MA10_price MSD10_price## 1: 4329.044 3291.299 1 NA NA## 2: 4329.044 3291.299 2 NA NA## 3: 4329.044 3291.299 3 NA NA## 4: 4329.044 3291.299 4 NA NA## 5: 4329.044 3291.299 5 NA NA## ---## 9999996: 4921.961 4479.684 3105 1072.5 483.8719## 9999997: 4921.961 4479.684 3106 947.5 409.5646## 9999998: 4921.961 4479.684 3107 812.6 279.7758## 9999999: 4921.961 4479.684 3108 723.8 210.1792## 10000000: 4921.961 4479.684 3109 650.0 126.9444
Sessioninfo
sessionInfo()# R version 3.6.1 (2019-07-05)# Platform: x86_64-w64-mingw32/x64 (64-bit)# Running under: Windows 10 x64 (build 18362)## Matrix products: default## locale:# [1] LC_COLLATE=Chinese (Simplified)_China.936# [2] LC_CTYPE=Chinese (Simplified)_China.936# [3] LC_MONETARY=Chinese (Simplified)_China.936# [4] LC_NUMERIC=C# [5] LC_TIME=Chinese (Simplified)_China.936## attached base packages:# [1] stats graphics grDevices utils datasets methods base## other attached packages:# [1] data.table_1.12.8## loaded via a namespace (and not attached):# [1] Rcpp_1.0.3 knitr_1.28 magrittr_1.5 tidyselect_1.0.0# [5] munsell_0.5.0 colorspace_1.4-1 R6_2.4.1 rlang_0.4.4# [9] dplyr_0.8.4 stringr_1.4.0 tools_3.6.1 grid_3.6.1# [13] gtable_0.3.0 xfun_0.12 htmltools_0.4.0 assertthat_0.2.1# [17] yaml_2.2.1 digest_0.6.23 tibble_2.1.3 lifecycle_0.1.0# [21] crayon_1.3.4 purrr_0.3.3 ggplot2_3.3.0.9000 glue_1.3.1# [25] evaluate_0.14 rmarkdown_2.1 stringi_1.4.5 compiler_3.6.1# [29] pillar_1.4.3 scales_1.1.0 pkgconfig_2.0.3
本系列第一篇文章就到这里,本人PC为8核16G内存,若在生产环境中效率会更高。
如果有更好的方案或者典型案例的,欢迎在下方留言~
Send to Author