Data Joins: Speed and Efficiency of `dplyr` and `data.table`
11 Oct 2019This short post is looking at data joins for both dplyr
and data.table
. There are a lot of moving parts when assessing these things, so the results here are just for this situation. It may differ in others. However, the results here are quite instructive.
As I’ve mentioned before:
I want to emphasize that this post is not to say one approach is better than another. My opinion is use what works for you. Ultimately, this is why I am trying to understand the basic behavior of
data.table
anddplyr
to do basic data manipulation—to understand when different tools are going to be more useful to me.
TL;DR
Ordering (aka arranging, sorting) is very important in speeding up and increasing efficiency of a data join. This is particularly poignant in data.table
which provides a very distinct form of ordering. Herein, I benchmark the performance of joining (unordered data) and ordering + joining data for both data.table
and dplyr
. The results of the benchmarks show:
data.table
joins data extremely quickly, especially when the key is numeric. It joins even faster when we join (including the time it took to order the data). This is close to magic. When the key is character, it is still good but far slower than with a numeric key.- Ordering data can be expensive.
data.table
does a great job of this, and so it reaps the rewards.dplyr
, although it does benefit from ordered data (not shown in this post but figures are shown on Twitter), loses a lot of time ordering the data. So overall, ordering + joining takes more time than simply just joining (for character ID vectors).dplyr
, with numeric ID vectors, is faster (but less efficient) with ordering + joining. Fordata.table
(like magic), ordering + joining is faster and more efficient for both character and numeric ID vectors. - Memory usage is roughly the same for
dplyr
anddata.table
overall (except for when we are also ordering the data withdplyr
).
Background
I’ve been interested in understand the inner workings of dplyr
and data.table
for some time. This is partly just for my personal benefit and learning but also to see in which situations one may work better than the other. My interest was further advanced by some posts I saw regarding the effect of ordering data before doing data wrangling.
One of the posts that spurred this interest was by Brodie Gaslam. He mentioned his fantastic post in this tweet:
Might be of interest:https://t.co/mVRsyMk9aK
— BrodieG (@BrodieGaslam) October 12, 2019
The analyses presented herein attempt to show when ordering can help in joins with dplyr
and with data.table
, and when it isn’t so beneficial.
Packages
First, we’ll use the following packages to further understand data.table
and dplyr
joins.
Note that on my machine, data.table
by default uses 4 threads. You can see this by using:
## 4
And don’t forget to set a random number seed so our random number work can be replicated.
Benchmarking Unordered and Ordered Data
In the benchmarking, I will assess both speed and efficiency (or memory use). This is based on 50 iteractions for each approach in each condition. Below, we look at when the ID (“key”) is a numeric vector and when it is a character vector.
Numeric ID
We’ll use the following tibble for the dplyr
joins where it has a randomly ordered (“unordered”) numeric ID variable. Both d1
and d2
are about 80MB.
Below, we benchmark both the regular join—b0
with unordered data—and benchmark both the ordering and the joining—b1
.
Next, we create the data tables used for the benchmarking with data.table
.
Below, I benchmark both the regular join—b2
with unordered data—and benchmark both the ordering and the joining—b3
.
The figures below highlight the results of the benchmarking. First, for the speed of the joins.
Next, for the average memory usage (allocated) of the joins.
Character ID
We’ll use the following tibble for the dplyr
joins where it has a randomly ordered (“unordered”) character ID variable. Again, both d1
and d2
are about 80MB.
Below, we benchmark both the regular join—b0
with unordered data—and benchmark both the ordering and the joining—b1
.
Next, we create the data tables used for the benchmarking with data.table
.
Below, I benchmark both the regular join—b2
with unordered data—and benchmark both the ordering and the joining—b3
.
The figures below highlight the results of the benchmarking. First, for the speed of the joins.
Next, for the average memory usage (allocated) of the joins.
Conclusions
Results here are more straightforward than my previous post looking at the behavior of dplyr
and data.table
when adding a variable, filtering rows, and summarizing data by group. I am still investigating some aspects of that post. In the meantime, I thought it would be worthwhile to look at joining moderately-sized data in conditions where I join the data as unordered or when we order it then join.
Three main takeaways from my perspective:
data.table
joins data extremely quickly, especially when the key is numeric. It joins even faster when we join (including the time it took to order the data). This is close to magic. When the key is character, it is still good but far slower than with a numeric key.- Ordering data can be expensive.
data.table
does a great job of this, and so it reaps the rewards.dplyr
, although it does benefit from ordered data (not shown in this post but figures are shown on Twitter), loses a lot of time ordering the data. So overall, ordering + joining takes more time than simply just joining. This is not the case fordata.table
, where ordering + joining is faster. - Memory usage is roughly the same for
dplyr
anddata.table
overall (except for when we are also ordering the data withdplyr
).
For those that enjoy dplyr
syntax (like myself), we can find similar speed improvements using the dtplyr
package (see this post for more information on these performance improvements).
Until next time!
Session Information
Note the package information for these analyses.
## ─ Session info ──────────────────────────────────────────────────────────
## setting value
## version R version 3.6.1 (2019-07-05)
## os macOS Mojave 10.14.6
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Denver
## date 2019-10-10
##
## ─ Packages ──────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
## backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.0)
## beeswarm 0.2.3 2016-04-25 [1] CRAN (R 3.6.0)
## bench * 1.0.4 2019-09-06 [1] CRAN (R 3.6.0)
## cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.0)
## colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0)
## cowplot * 1.0.0 2019-07-11 [1] CRAN (R 3.6.0)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
## data.table * 1.12.4 2019-10-03 [1] CRAN (R 3.6.1)
## digest 0.6.21 2019-09-20 [1] CRAN (R 3.6.0)
## dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.6.0)
## ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0)
## evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
## ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 3.6.0)
## ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.6.0)
## glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
## htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0)
## knitr 1.25 2019-09-18 [1] CRAN (R 3.6.0)
## labeling 0.3 2014-08-23 [1] CRAN (R 3.6.0)
## lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.6.0)
## lifecycle 0.1.0 2019-08-01 [1] CRAN (R 3.6.0)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
## pillar 1.4.2 2019-06-29 [1] CRAN (R 3.6.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
## profmem 0.5.0 2018-01-30 [1] CRAN (R 3.6.0)
## purrr 0.3.2 2019-03-15 [1] CRAN (R 3.6.0)
## R6 2.4.0 2019-02-14 [1] CRAN (R 3.6.0)
## Rcpp 1.0.2 2019-07-25 [1] CRAN (R 3.6.0)
## rlang 0.4.0 2019-06-25 [1] CRAN (R 3.6.0)
## rmarkdown 1.16 2019-10-01 [1] CRAN (R 3.6.0)
## scales 1.0.0 2018-08-09 [1] CRAN (R 3.6.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
## stringi 1.4.3 2019-03-12 [1] CRAN (R 3.6.0)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
## tibble 2.1.3 2019-06-06 [1] CRAN (R 3.6.0)
## tidyr 1.0.0 2019-09-11 [1] CRAN (R 3.6.0)
## tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.6.0)
## vctrs 0.2.0 2019-07-05 [1] CRAN (R 3.6.0)
## vipor 0.4.5 2017-03-22 [1] CRAN (R 3.6.0)
## viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.6.0)
## withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
## xfun 0.10 2019-10-01 [1] CRAN (R 3.6.0)
## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.0)
## zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.6.0)
##
## [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library