Comparing polars, data.table, and dplyr/pandas
2025-02-02
(“linguistic relativity”)
polars
data.table
dplyr (pandas)
Note: fair benchmarking is difficult.
Comparing based on five related but distinct criteria:
dplyr and polars are built on some shared verbs/synonyms:
Action | polars | dplyr |
---|---|---|
Filter rows | filter() | filter() |
Select columns | select() | select() |
Manipulate columns | with_columns() | mutate() |
Sort/arrange rows | sort() | arrange() |
Rename columns | rename() | rename() |
Summarize by groups | group_by().agg() | group_by() %>% summarise() |
data.table, in contrast, has an implicit syntax: dt[i, j, by]
This produces very concise code, where position within [] and other symbols communicate intention
How does this affect our use of the syntax?
All three packages offer operation chaining
Unlike the others, polars can directly optimize your call behind the scenes (much as database engines do)
WITH_COLUMNS:
[col("foo").str.uppercase()]
DF ["foo", "bar"]; PROJECT */2 COLUMNS; SELECTION: [(col("bar")) > (0)]
foo | bar |
---|---|
str | i64 |
"B" | 1 |
"C" | 2 |
polars: Requires pl.col() for complex operations
data.table: Direct column names within []
dplyr: Direct column names
shape: (5, 3)
┌─────┬─────┬──────────┐
│ x ┆ y ┆ z │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 │
╞═════╪═════╪══════════╡
│ 1 ┆ a ┆ 0.96439 │
│ 2 ┆ a ┆ 0.800441 │
│ 3 ┆ a ┆ 0.55105 │
│ 4 ┆ b ┆ 0.759665 │
│ 5 ┆ b ┆ 0.346401 │
└─────┴─────┴──────────┘
Again, all three packages make this possible
A few nuances:
dplyr's summarise() will message if multiple rows per group are found
polars expects you to use its own functions, whereas data.table and dplyr can use most functions in R
polars prioritizes “a well-structured, typed API that is both expressive and easy to use.”
dplyr prioritizes code legibility using expressive verbs
data.table prioritizes concise, symbol-based code
Of course, familiarity with the syntax matters most. Assuming familiarity, data.table is very fast to write, while code snippets can be very useful for polars and dplyr.
polars (and similar projects like arrow) are likely the future
polars is written in Rust, making it transferable to other languages (e.g. R now has a polars package)
data.table is highly used and will be for what looks like a long time
R packages and production code
dplyr (and the tidyverse) helped structure the grammar