The Power of Syntax

Comparing polars, data.table, and dplyr/pandas

Tyson S. Barrett, PhD

2025-02-02

Sapir-Whorf Hypothesis

(“linguistic relativity”)


Proposes that the language a person speaks influences how they think, perceive, and experience the world

  • Color Perception
  • Spatial Orientation
  • Time and Grammatical Structures
  • Gender and Cognition

How does the language and syntax we code in affect our perceptions, plans, and thinking?

Who am I?


Brief History Lesson

The Packages

polars

# pip install polars
import polars as pl

data.table

# install.packages("data.table")
library(data.table)

dplyr (pandas)

# install.packages("dplyr")
library(dplyr)

Not Just Speed

https://duckdblabs.github.io/db-benchmark/

Note: fair benchmarking is difficult

Comparing Syntax


Comparing based on five related but distinct criteria:

  • Verbosity and Readability
  • Operation Chaining
  • Column References
  • Grouping Syntax
  • Speed of Writing

Verbosity and Readability


dplyr and polars are built on some shared verbs/synonyms

Action               polars              dplyr
Filter rows          filter()            filter()
Select columns       select()            select()
Manipulate columns   with_columns()      mutate()
Sort/arrange rows    sort()              arrange()
Rename columns       rename()            rename()
Summarize by groups  group_by().agg()    group_by() %>% summarise()

Verbosity and Readability


data.table, in contrast, has an implicit syntax

dt[i, j, by]

This produces very concise code: position within [] (i picks rows, j picks or computes columns, by sets the groups) and symbols such as := communicate intention

dt[var == 1]             # filter
dt[, .(var, var2)]       # select
dt[, new := var + var2]  # mutate/with_columns

Verbosity and Readability

How does this affect our use of the syntax?


  • Data exploration
  • Learning Curve
  • Collaboration
  • Production code

Operation Chaining


All three packages offer operation chaining

# polars
pl3.select(pl.col("x", "y")).filter(pl.col("x") > 1)
# data.table
dt3[, .(x, y)][x > 1]
# dplyr
df3 |>
  select(x, y) |>
  filter(x > 1)

Operation Chaining

Unlike the others, polars can directly optimize a chained query behind the scenes when it is run lazily (similar to the query planner in a database engine)

pl1 = pl.DataFrame({"foo": ["a", "b", "c"], "bar": [0, 1, 2]}).lazy()
pl2 = (
  pl1.with_columns(pl.col("foo").str.to_uppercase())
  .filter(pl.col("bar") > 0)
)
pl2.explain()
WITH_COLUMNS:
  [col("foo").str.uppercase()]
  DF ["foo", "bar"]; PROJECT */2 COLUMNS; SELECTION: [(col("bar")) > (0)]

pl2.collect()
shape: (2, 2)
foo  bar
str  i64
"B"  1
"C"  2

Column References

  • polars: Column expressions are built with pl.col(); plain strings work only for simple cases such as selecting by name
  • data.table: Direct column names within []
  • dplyr: Direct column names

import numpy as np
pl3 = pl.DataFrame({
    'x': range(1, 6),
    'y': ['a', 'a', 'a', 'b', 'b'],
    'z': np.random.rand(5)
})
shape: (5, 3)
┌─────┬─────┬──────────┐
│ x   ┆ y   ┆ z        │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ str ┆ f64      │
╞═════╪═════╪══════════╡
│ 1   ┆ a   ┆ 0.96439  │
│ 2   ┆ a   ┆ 0.800441 │
│ 3   ┆ a   ┆ 0.55105  │
│ 4   ┆ b   ┆ 0.759665 │
│ 5   ┆ b   ┆ 0.346401 │
└─────┴─────┴──────────┘

pl3.select(pl.col("x", "y"))
shape: (5, 2)
x    y
i64  str
1    "a"
2    "a"
3    "a"
4    "b"
5    "b"
dt3[, .(x, y)]
       x      y
   <int> <char>
1:     1      a
2:     2      a
3:     3      a
4:     4      b
5:     5      b
df3 %>% select(x, y)
# A tibble: 5 × 2
      x y    
  <int> <chr>
1     1 a    
2     2 a    
3     3 a    
4     4 b    
5     5 b    

Grouping Syntax

Again, all three packages support grouped aggregation

pl3.group_by('y').agg(pl.sum('x'))
shape: (2, 2)
y    x
str  i64
"a"  6
"b"  9
dt3[, sum(x), by = y]
        y    V1
   <char> <int>
1:      a     6
2:      b     9
df3 %>% 
  group_by(y) %>%            # grouping
  summarise(sum_x = sum(x))
# A tibble: 2 × 2
  y     sum_x
  <chr> <int>
1 a         6
2 b         9

Grouping Syntax


A few nuances:

  • summarise() emits a message if multiple rows per group are found
  • polars largely restricts you to its own expression functions, whereas data.table and dplyr can use most functions in R

Speed of Writing


  • polars prioritizes “a well-structured, typed API that is both expressive and easy to use.”
  • dplyr prioritizes code legibility using expressive verbs
  • data.table prioritizes concise, symbol-based code

Of course, familiarity with the syntax matters most. Assuming familiarity, data.table offers very high speed of writing, while code snippets can be very useful for polars and dplyr.

Closing Thoughts

  • The way polars (and similar projects like arrow) works is likely the future
    • polars is written in Rust, making it transferable to other languages (e.g. R now has a polars package)
  • data.table is widely used and will be for what looks like a long time
    • used in thousands of R packages and in production code
  • dplyr (and the tidyverse) helped structure the grammar
    • has amazing backends (e.g. dtplyr, dbplyr) that make optimization and speed possible without developing a whole new framework