Comparing polars, data.table, and dplyr/pandas
2025-02-02
(“linguistic relativity”)
- polars
- data.table
- dplyr (pandas)

Note: fair benchmarking is difficult.
Comparing based on five related but distinct criteria:
dplyr and polars are built on a shared set of verbs (or close synonyms)
| Action | polars | dplyr |
|---|---|---|
| Filter rows | filter() | filter() |
| Select columns | select() | select() |
| Manipulate columns | with_columns() | mutate() |
| Sort/arrange rows | sort() | arrange() |
| Rename columns | rename() | rename() |
| Summarize by groups | group_by().agg() | group_by() %>% summarise() |
data.table, in contrast, has an implicit syntax
dt[i, j, by]
Read as: take dt, subset rows using i, then compute j, grouped by by.
This produces very concise code in which the position of an argument within [] and other symbols communicate intent
How does this affect our use of the syntax?
All three packages offer operation chaining
Unlike the others, polars can optimize your query behind the scenes when you use its lazy API (similar to a database engine's query planner)
WITH_COLUMNS:
[col("foo").str.uppercase()]
DF ["foo", "bar"]; PROJECT */2 COLUMNS; SELECTION: [(col("bar")) > (0)]
| foo | bar |
|---|---|
| str | i64 |
| "B" | 1 |
| "C" | 2 |
- polars: Requires pl.col() for complex operations
- data.table: Direct column names within []
- dplyr: Direct column names

shape: (5, 3)
┌─────┬─────┬──────────┐
│ x   ┆ y   ┆ z        │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ str ┆ f64      │
╞═════╪═════╪══════════╡
│ 1   ┆ a   ┆ 0.96439  │
│ 2   ┆ a   ┆ 0.800441 │
│ 3   ┆ a   ┆ 0.55105  │
│ 4   ┆ b   ┆ 0.759665 │
│ 5   ┆ b   ┆ 0.346401 │
└─────┴─────┴──────────┘
Again, all three packages make this possible
A few nuances:
- dplyr: summarise() will message if multiple rows per group are found
- polars expects you to use its own functions, whereas data.table and dplyr can use most functions in R

- polars prioritizes “a well-structured, typed API that is both expressive and easy to use.”
- dplyr prioritizes code legibility using expressive verbs
- data.table prioritizes concise, symbol-based code

Of course, familiarity with the syntax matters most. Assuming familiarity, data.table offers very high speed of writing, while code snippets can be very useful for polars and dplyr.
- polars (and similar projects like arrow) are likely the future
  - polars is written in Rust, making it transferable to other languages (e.g. R now has a polars package)
- data.table is highly used and will be for what looks like a long time
  - R packages and production code
- dplyr (and the tidyverse) helped structure the grammar