The Power of Syntax

Comparing polars, data.table, and dplyr/pandas

Tyson S. Barrett, PhD

2025-02-02

Sapir-Whorf Hypothesis

(“linguistic relativity”)

Proposes that the language a person speaks influences how they think, perceive, and experience the world

Color Perception
Spatial Orientation
Time and Grammatical Structures
Gender and Cognition

How does the language and syntax we code in affect our perceptions, plans, and thinking?

Who am I?

Brief History Lesson

The Packages

`polars`

# pip install polars
import polars as pl

`data.table`

# install.packages("data.table")
library(data.table)

`dplyr` (`pandas`)

# install.packages("dplyr")
library(dplyr)

Not Just Speed

https://duckdblabs.github.io/db-benchmark/

Note: fair benchmarking is difficult

Comparing Syntax

Comparing based on five related but distinct criteria:

Verbosity and Readability
Operation Chaining
Column References
Grouping Syntax
Speed of Writing

Verbosity and Readability

dplyr and polars are built on shared verbs (or close synonyms)

Action	`polars`	`dplyr`
Filter rows	`filter()`	`filter()`
Select columns	`select()`	`select()`
Manipulate columns	`with_columns()`	`mutate()`
Sort/arrange rows	`sort()`	`arrange()`
Rename columns	`rename()`	`rename()`
Summarize by groups	`group_by().agg()`	`group_by() %>% summarise()`

Verbosity and Readability

data.table, in contrast, has an implicit syntax: i selects rows, j computes on columns, by groups

dt[i, j, by]

This produces very concise code in which the position within [] and a few symbols communicate intent

dt[var == 1]             # filter
dt[, .(var, var2)]       # select
dt[, new := var + var2]  # mutate/with_columns

Verbosity and Readability

How does this affect our use of the syntax?

Data exploration
Learning Curve
Collaboration
Production code

Operation Chaining

All three packages offer operation chaining

polars

pl3.select(pl.col("x", "y")).filter(pl.col("x") > 1)

data.table

dt3[, .(x, y)][x > 1]

dplyr

df3 |>
  select(x, y) |>
  filter(x > 1)

Operation Chaining

Unlike the others, polars can optimize the whole query behind the scenes when it is run lazily (similar to a database engine)

pl1 = pl.DataFrame({"foo": ["a", "b", "c"], "bar": [0, 1, 2]}).lazy()
pl2 = (
  pl1.with_columns(pl.col("foo").str.to_uppercase())
  .filter(pl.col("bar") > 0)
)
pl2.explain()

WITH_COLUMNS:
  [col("foo").str.uppercase()]
  DF ["foo", "bar"]; PROJECT */2 COLUMNS; SELECTION: [(col("bar")) > (0)]

pl2.collect()

shape: (2, 2)

foo	bar
str	i64
"B"	1
"C"	2

Column References

polars: Requires pl.col() for complex operations
data.table: Direct column names within []
dplyr: Direct column names

import numpy as np
pl3 = pl.DataFrame({
    'x': range(1, 6),
    'y': ['a', 'a', 'a', 'b', 'b'],
    'z': np.random.rand(5)
})

shape: (5, 3)
┌─────┬─────┬──────────┐
│ x   ┆ y   ┆ z        │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ str ┆ f64      │
╞═════╪═════╪══════════╡
│ 1   ┆ a   ┆ 0.96439  │
│ 2   ┆ a   ┆ 0.800441 │
│ 3   ┆ a   ┆ 0.55105  │
│ 4   ┆ b   ┆ 0.759665 │
│ 5   ┆ b   ┆ 0.346401 │
└─────┴─────┴──────────┘

polars

pl3.select(pl.col("x", "y"))

shape: (5, 2)

x	y
i64	str
1	"a"
2	"a"
3	"a"
4	"b"
5	"b"

data.table

dt3[, .(x, y)]

       x      y
   <int> <char>
1:     1      a
2:     2      a
3:     3      a
4:     4      b
5:     5      b

dplyr

df3 %>% select(x, y)

# A tibble: 5 × 2
      x y    
  <int> <chr>
1     1 a    
2     2 a    
3     3 a    
4     4 b    
5     5 b

Grouping Syntax

Again, all three packages make this possible

polars

pl3.group_by('y').agg(pl.sum('x'))

shape: (2, 2)

y	x
str	i64
"a"	6
"b"	9

data.table

dt3[, sum(x), by = y]

        y    V1
   <char> <int>
1:      a     6
2:      b     9

dplyr

df3 %>% 
  group_by(y) %>%            # grouping
  summarise(sum_x = sum(x))

# A tibble: 2 × 2
  y     sum_x
  <chr> <int>
1 a         6
2 b         9

Grouping Syntax

A few nuances:

dplyr: summarise() alerts you when a summary returns more than one row per group
polars: aggregations are largely restricted to polars' own expression functions, whereas data.table and dplyr can use most R functions

Speed of Writing

polars prioritizes “a well-structured, typed API that is both expressive and easy to use.”
dplyr prioritizes code legibility using expressive verbs
data.table prioritizes concise, symbol-based code

Familiarity with the syntax matters most. Assuming familiarity, data.table is very fast to write, while code snippets can speed up writing polars and dplyr

import polars as pl

library(data.table)

library(dplyr)

Closing Thoughts

The way polars (and similar projects like arrow) works is likely the future
- polars is written in Rust, making it portable to other languages (e.g., R now has a polars package)
data.table is heavily used and looks set to remain so for a long time
- used in thousands of R packages and in production code
dplyr (and the tidyverse) helped structure the grammar
- has amazing backends that enable optimization and speed without developing a whole new framework
