Two Python DataFrame libraries, compared side by side — from basic operations to advanced analytics, with guidance on when to use which.
The most common operations at a glance. Pandas on the left, Polars on the right.
| Operation | Pandas | Polars |
|---|---|---|
| Create DataFrame | pd.DataFrame({"a": [1,2]}) | pl.DataFrame({"a": [1,2]}) |
| Read CSV | pd.read_csv("f.csv") | pl.read_csv("f.csv") |
| Select columns | df[["a", "b"]] | df.select("a", "b") |
| Filter rows | df[df["a"] > 5] | df.filter(pl.col("a") > 5) |
| Add column | df["c"] = df["a"] + 1 | df.with_columns((pl.col("a") + 1).alias("c")) |
| Group & aggregate | df.groupby("a").agg({"b": "sum"}) | df.group_by("a").agg(pl.col("b").sum()) |
| Sort | df.sort_values("a", ascending=False) | df.sort("a", descending=True) |
| Join | pd.merge(df1, df2, on="k") | df1.join(df2, on="k") |
| Drop nulls | df.dropna() | df.drop_nulls() |
| Rename columns | df.rename(columns={"a": "x"}) | df.rename({"a": "x"}) |
| Shape | df.shape | df.shape |
| Describe | df.describe() | df.describe() |
| Feature | Pandas | Polars |
|---|---|---|
| Index-based | ✓ | ✗ |
| Lazy evaluation | ✗ | ✓ |
| Multi-threaded | ✗ | ✓ |
| Arrow-native | ~ | ✓ |
| Ecosystem maturity | ✓ | ~ |
| Mutation (in-place) | ✓ | ✗ |
| Streaming large files | ✗ | ✓ |
| GPU acceleration | ~ (cuDF) | ✓ (GPU engine) |
Getting started with both libraries. They can coexist in the same environment.
# pip
pip install pandas
# conda
conda install pandas
# With optional dependencies
pip install pandas[all]
# Specific extras
pip install pandas[parquet,excel]
# pip
pip install polars
# conda
conda install polars
# With all optional features
pip install polars[all]
# Specific extras
pip install polars[numpy,pandas,pyarrow]
import pandas as pd
import numpy as np
# Version check
print(pd.__version__)
import polars as pl
# Version check
print(pl.__version__)
# Optional: expression namespace
from polars import col, lit, when
Convert between the two with df_polars = pl.from_pandas(df_pandas) and df_pandas = df_polars.to_pandas().
Polars uses Arrow memory, so conversion to Pandas 2.x with ArrowDtype can be zero-copy.
The fundamental data structures. Both libraries center on the DataFrame, but differ in their approach to indexing and mutability.
# From dict
df = pd.DataFrame({
"name": ["Alice", "Bob", "Carol"],
"age": [30, 25, 35],
"score": [85.5, 92.0, 78.3]
})
# With explicit index
df = pd.DataFrame(
{"x": [1,2,3]},
index=["a","b","c"]
)
# From list of dicts
df = pd.DataFrame([
{"a": 1, "b": 2},
{"a": 3, "b": 4}
])
# From NumPy array
df = pd.DataFrame(
np.random.randn(5, 3),
columns=["x", "y", "z"]
)
# From dict
df = pl.DataFrame({
"name": ["Alice", "Bob", "Carol"],
"age": [30, 25, 35],
"score": [85.5, 92.0, 78.3]
})
# No index concept — add a row index column if needed
df = pl.DataFrame({
"x": [1,2,3]
}).with_row_index("idx")
# From list of dicts
df = pl.DataFrame([
{"a": 1, "b": 2},
{"a": 3, "b": 4}
])
# With explicit schema
df = pl.DataFrame(
{"x": [1,2], "y": [3.0,4.0]},
schema={"x": pl.Int32, "y": pl.Float64}
)
# Create a Series
s = pd.Series([1, 2, 3], name="values")
# Access by label/position
s.loc[0] # label-based
s.iloc[0] # position-based
# String methods
df["name"].str.upper()
df["name"].str.contains("li")
# Vectorized math
(df["score"] * 1.1).round(1)
# Create a Series
s = pl.Series("values", [1, 2, 3])
# Access by position (no label index)
s[0] # integer indexing
s.gather([0, 2]) # multiple indices
# String methods (via expressions)
df.select(pl.col("name").str.to_uppercase())
df.select(pl.col("name").str.contains("li"))
# Vectorized math
df.select((pl.col("score") * 1.1).round(1))
# Select columns
df[["name", "age"]]
# Add / mutate column (in-place)
df["senior"] = df["age"] >= 30
# Add via assign (returns copy)
df = df.assign(
senior=lambda x: x["age"] >= 30,
grade=lambda x: np.where(
x["score"] >= 90, "A", "B"
)
)
# Drop columns
df.drop(columns=["score"])
# Rename
df.rename(columns={"age": "years"})
# Select columns
df.select("name", "age")
df.select(pl.col("name", "age"))
# Add / mutate column (returns new df)
df = df.with_columns(
(pl.col("age") >= 30).alias("senior")
)
# Multiple new columns at once
df = df.with_columns(
(pl.col("age") >= 30).alias("senior"),
pl.when(pl.col("score") >= 90)
.then(pl.lit("A"))
.otherwise(pl.lit("B"))
.alias("grade")
)
# Drop columns
df.drop("score")
# Rename
df.rename({"age": "years"})
| Type | Pandas | Polars | Notes |
|---|---|---|---|
| Integer | int64, Int64 | Int8..Int64 | Polars is strict about width |
| Float | float64 | Float32, Float64 | |
| String | object, string | Utf8 (String) | Polars uses Arrow strings |
| Boolean | bool, boolean | Boolean | |
| Datetime | datetime64[ns] | Datetime | Polars supports ms/us/ns precision (microsecond default) |
| Categorical | category | Categorical, Enum | Polars Enum is strict/ordered |
| Nested | object (dicts/lists) | List, Struct, Array | Polars has first-class nested types |
| Null handling | NaN / pd.NA | null | Polars uses Arrow null bitmask |
Both libraries support CSV, Parquet, JSON, Excel, and databases. Polars adds lazy scanning for files larger than memory.
# Read
df = pd.read_csv("data.csv")
df = pd.read_csv("data.csv",
usecols=["a", "b"],
dtype={"a": "int32"},
nrows=1000,
parse_dates=["date"]
)
# Chunked reading
for chunk in pd.read_csv("big.csv",
chunksize=10000):
process(chunk)
# Write
df.to_csv("out.csv", index=False)
# Read (eager)
df = pl.read_csv("data.csv")
df = pl.read_csv("data.csv",
columns=["a", "b"],
schema_overrides={"a": pl.Int32},  # dtypes= before Polars 1.0
n_rows=1000,
try_parse_dates=True
)
# Lazy scan (larger than memory)
lf = pl.scan_csv("big.csv")
result = lf.filter(
pl.col("a") > 100
).collect()
# Write
df.write_csv("out.csv")
# Requires pyarrow or fastparquet
df = pd.read_parquet("data.parquet")
df = pd.read_parquet("data.parquet",
columns=["a", "b"],
engine="pyarrow"
)
# Write with compression
df.to_parquet("out.parquet",
compression="snappy",
index=False
)
# Read from S3 (with s3fs)
df = pd.read_parquet("s3://bucket/data.parquet")
# Native Rust parquet reader
df = pl.read_parquet("data.parquet")
df = pl.read_parquet("data.parquet",
columns=["a", "b"]
)
# Write with compression
df.write_parquet("out.parquet",
compression="zstd"
)
# Lazy scan — predicate pushdown!
lf = pl.scan_parquet("s3://bucket/data.parquet")
result = lf.filter(
pl.col("date") > "2024-01-01"
).select("a", "b").collect()
| Format | Pandas | Polars |
|---|---|---|
| JSON | pd.read_json() / df.to_json() | pl.read_json() / df.write_json() |
| NDJSON | pd.read_json(lines=True) | pl.read_ndjson() / pl.scan_ndjson() |
| Excel | pd.read_excel() | pl.read_excel() |
| SQL | pd.read_sql(query, conn) | pl.read_database(query, conn) |
| IPC / Arrow | pd.read_feather() | pl.read_ipc() / pl.scan_ipc() |
| Delta Lake | via deltalake pkg | pl.read_delta() / pl.scan_delta() |
Selecting rows and columns. Pandas uses bracket indexing and .loc/.iloc; Polars uses expression-based filter/select.
# Boolean mask
df[df["age"] > 25]
# Multiple conditions
df[(df["age"] > 25) & (df["score"] >= 80)]
# Using .query() (string DSL)
df.query("age > 25 and score >= 80")
# isin
df[df["name"].isin(["Alice", "Bob"])]
# String contains
df[df["name"].str.startswith("A")]
# Between
df[df["age"].between(25, 35)]
# Null filtering
df[df["score"].notna()]
# Head / tail / sample
df.head(5)
df.tail(5)
df.sample(3)
# Expression-based filter
df.filter(pl.col("age") > 25)
# Multiple conditions
df.filter(
(pl.col("age") > 25) & (pl.col("score") >= 80)
)
# Using SQL (optional; returns a LazyFrame)
pl.sql("SELECT * FROM df WHERE age > 25").collect()
# is_in
df.filter(pl.col("name").is_in(["Alice", "Bob"]))
# String contains
df.filter(pl.col("name").str.starts_with("A"))
# is_between
df.filter(pl.col("age").is_between(25, 35))
# Null filtering
df.filter(pl.col("score").is_not_null())
# Head / tail / sample
df.head(5)
df.tail(5)
df.sample(3)
# Label-based indexing
df.loc[0, "name"] # single cell
df.loc[0:2, ["name", "age"]] # slice
# Position-based indexing
df.iloc[0] # first row
df.iloc[0:5, 1:3] # row/col slice
# Setting values
df.loc[0, "age"] = 31
# Conditional setting
df.loc[df["age"] > 30, "senior"] = True
# No label-based index; use row/item
df.row(0) # tuple
df.item(0, "name") # single cell
# Slicing
df.slice(0, 5) # offset, length
df[0:5] # Python slice syntax
# Polars is immutable — use with_columns
df = df.with_columns(
pl.when(pl.col("age") > 30)
.then(pl.lit(True))
.otherwise(pl.lit(False))
.alias("senior")
)
# Gather specific rows
df.gather([0, 3, 7])
# Add an explicit row index column if needed
df.with_row_index("idx")
Split-apply-combine patterns. Both libraries support grouping with multiple aggregation functions, but the syntax differs significantly.
# Single aggregation
df.groupby("dept")["salary"].mean()
# Multiple aggregations
df.groupby("dept").agg({
"salary": ["mean", "max"],
"age": "count"
})
# Named aggregations (recommended)
df.groupby("dept").agg(
avg_salary=("salary", "mean"),
max_salary=("salary", "max"),
headcount=("age", "count")
)
# Group by multiple columns
df.groupby(["dept", "level"]).mean(numeric_only=True)
# Single aggregation
df.group_by("dept").agg(
pl.col("salary").mean()
)
# Multiple aggregations
df.group_by("dept").agg(
pl.col("salary").mean().alias("avg_salary"),
pl.col("salary").max().alias("max_salary"),
pl.col("age").count().alias("headcount"),
)
# Expressions are composable
df.group_by("dept").agg(
(pl.col("salary").max() - pl.col("salary").min())
.alias("salary_range")
)
# Group by multiple columns
df.group_by("dept", "level").agg(
pl.all().mean()
)
# Transform (broadcast back to original shape)
df["z_score"] = df.groupby("dept")["salary"] \
.transform(lambda x: (x - x.mean()) / x.std())
# Filter groups
df.groupby("dept").filter(
lambda g: g["salary"].mean() > 50000
)
# Apply custom function
df.groupby("dept").apply(
lambda g: g.nlargest(3, "salary")
)
# Pivot table
pd.pivot_table(df,
values="salary",
index="dept",
columns="level",
aggfunc="mean"
)
# Over expressions (like SQL window funcs)
df = df.with_columns(
((pl.col("salary") - pl.col("salary").mean().over("dept"))
/ pl.col("salary").std().over("dept"))
.alias("z_score")
)
# Filter groups
df.filter(
pl.col("salary").mean().over("dept") > 50000
)
# Top N per group
df.sort("salary", descending=True) \
.group_by("dept").head(3)
# Pivot
df.pivot(
on="level",
index="dept",
values="salary",
aggregate_function="mean"
)
| Function | Pandas | Polars |
|---|---|---|
| Count | .count() | .count() |
| Sum | .sum() | .sum() |
| Mean | .mean() | .mean() |
| Median | .median() | .median() |
| Std Dev | .std() | .std() |
| Min / Max | .min() / .max() | .min() / .max() |
| First / Last | .first() / .last() | .first() / .last() |
| N unique | .nunique() | .n_unique() |
| Quantile | .quantile(0.75) | .quantile(0.75) |
| Value counts | .value_counts() | .value_counts() |
Combining DataFrames. Pandas uses merge/join/concat; Polars uses join/concat with more explicit join types.
# Inner join
result = pd.merge(orders, customers,
on="customer_id",
how="inner"
)
# Left join
result = pd.merge(orders, customers,
on="customer_id",
how="left"
)
# Join on different column names
result = pd.merge(orders, customers,
left_on="cust_id",
right_on="id"
)
# Multi-key join
result = pd.merge(df1, df2,
on=["year", "month"]
)
# Inner join
result = orders.join(customers,
on="customer_id",
how="inner"
)
# Left join
result = orders.join(customers,
on="customer_id",
how="left"
)
# Join on different column names
result = orders.join(customers,
left_on="cust_id",
right_on="id"
)
# Multi-key join
result = df1.join(df2,
on=["year", "month"]
)
| Join Type | Pandas how= | Polars how= | Description |
|---|---|---|---|
| Inner | "inner" | "inner" | Only matching rows |
| Left | "left" | "left" | All left rows + matching right |
| Right | "right" | "right" (since v1.x) | All right rows + matching left |
| Full outer | "outer" | "full" | All rows from both sides |
| Cross | "cross" | "cross" | Cartesian product |
| Semi | N/A (use isin) | "semi" | Left rows where key exists in right |
| Anti | N/A (use ~isin) | "anti" | Left rows where key NOT in right |
# Vertical stack (union)
result = pd.concat([df1, df2, df3])
# Horizontal (side by side)
result = pd.concat([df1, df2], axis=1)
# Reset index after concat
result = pd.concat([df1, df2],
ignore_index=True
)
# Vertical stack (union)
result = pl.concat([df1, df2, df3])
# Horizontal (side by side)
result = pl.concat([df1, df2], how="horizontal")
# Diagonal (union with different schemas)
result = pl.concat([df1, df2],
how="diagonal"
)
Computing values relative to groups or ordered windows. Polars uses the powerful .over() expression; Pandas uses .transform() and rolling APIs.
# Running total per group
df["cum_sales"] = df.groupby("dept")["sales"] \
.cumsum()
# Rank within group
df["rank"] = df.groupby("dept")["sales"] \
.rank(ascending=False, method="dense")
# Shift (lag/lead)
df["prev_sales"] = df.groupby("dept")["sales"] \
.shift(1)
# Pct change within group
df["growth"] = df.groupby("dept")["sales"] \
.pct_change()
# Running total per group
df = df.with_columns(
pl.col("sales").cum_sum().over("dept")
.alias("cum_sales")
)
# Rank within group
df = df.with_columns(
pl.col("sales").rank(descending=True,
method="dense").over("dept")
.alias("rank")
)
# Shift (lag/lead)
df = df.with_columns(
pl.col("sales").shift(1).over("dept")
.alias("prev_sales")
)
# Pct change within group
df = df.with_columns(
pl.col("sales").pct_change().over("dept")
.alias("growth")
)
# Rolling average
df["ma7"] = df["price"].rolling(7).mean()
# Rolling std with min_periods
df["vol"] = df["price"].rolling(
window=20, min_periods=10
).std()
# Expanding (cumulative) window
df["cum_max"] = df["price"].expanding().max()
# EWM (exponential weighted)
df["ewm"] = df["price"].ewm(span=12).mean()
# Rolling average
df = df.with_columns(
pl.col("price").rolling_mean(window_size=7)
.alias("ma7")
)
# Rolling std with min_periods
df = df.with_columns(
pl.col("price").rolling_std(
window_size=20, min_periods=10
).alias("vol")
)
# Expanding (cumulative) window
df = df.with_columns(
pl.col("price").cum_max().alias("cum_max")
)
# EWM (exponential weighted)
df = df.with_columns(
pl.col("price").ewm_mean(span=12)
.alias("ewm")
)
Polars' killer feature: build a query plan, then let the optimizer rewrite and parallelize it before execution. Pandas has no equivalent.
# Create a LazyFrame from eager DataFrame
lf = df.lazy()
# Or scan directly (recommended for large files)
lf = pl.scan_csv("huge_dataset.csv")
lf = pl.scan_parquet("data/*.parquet") # glob patterns!
lf = pl.scan_ipc("data.arrow")
# Build a query plan (nothing executes yet)
result = (
lf
.filter(pl.col("country") == "US")
.group_by("state")
.agg(
pl.col("revenue").sum().alias("total_rev"),
pl.col("customers").n_unique().alias("unique_cust"),
)
.sort("total_rev", descending=True)
.head(10)
)
# Execute the plan
df_result = result.collect()
# Inspect the optimized plan
print(result.explain(optimized=True))
Filters are pushed as close to the data source as possible, reducing rows read from disk.
# Filter pushed into scan
(
    pl.scan_parquet("data.parquet")
    .filter(pl.col("year") == 2024)
    .collect()
)
# Only reads matching row groups
Only the columns used downstream are read from the source file.
# Only reads "name" and "score" columns
(
    pl.scan_csv("wide_table.csv")
    .select("name", "score")
    .collect()
)
Duplicate computations are detected and computed once.
# mean(x) computed once, reused
lf.with_columns(
(pl.col("x") - pl.col("x").mean()).alias("centered"),
(pl.col("x") / pl.col("x").mean()).alias("normed"),
)
Limits are pushed down to avoid processing unnecessary rows.
# Only processes 10 groups total
(
    lf.group_by("region").agg(
        pl.col("sales").sum()
    )
    .sort("sales", descending=True)
    .head(10)
    .collect()
)
# Process data larger than memory in streaming chunks
result = (
pl.scan_csv("100gb_file.csv")
.filter(pl.col("status") == "active")
.group_by("category")
.agg(pl.col("amount").sum())
.collect(streaming=True)  # newer Polars: collect(engine="streaming")
)
# Sink results directly to file (no memory needed)
(
pl.scan_parquet("input/*.parquet")
.filter(pl.col("date") >= "2024-01-01")
.sink_parquet("output/filtered.parquet")
)
Pandas has no lazy engine: pd.read_csv("f.csv", chunksize=N) lets you iterate in chunks, but you must manage concatenation and aggregation manually, and there is no query optimizer.
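A hedged sketch of that manual pattern, which a Polars streaming group-by replaces; the in-memory CSV stands in for a large file on disk, and the column names are illustrative:

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk
csv_data = io.StringIO(
    "category,amount\n" + "\n".join(f"{'ab'[i % 2]},{i}" for i in range(10))
)

# Manual chunked aggregation: accumulate partial sums per chunk
partials = []
for chunk in pd.read_csv(csv_data, chunksize=4):
    partials.append(chunk.groupby("category")["amount"].sum())

# Combine the per-chunk partials into the final aggregate
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```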
Benchmark comparisons and memory usage patterns. Polars is consistently faster due to its Rust engine, multi-threading, and Arrow memory model.
Relative execution time (lower = faster), based on a typical 10M-row CSV workload on an 8-core machine.
Written in Rust, compiled to native code. No Python GIL overhead for data operations. The Python API is a thin wrapper over the Rust engine.
Automatically parallelizes across all available CPU cores. GroupBy, joins, and aggregations run in parallel without user configuration.
Uses Arrow columnar format natively. Cache-friendly memory layout, zero-copy interop with other Arrow-based tools, and efficient null handling.
Lazy mode rewrites your query plan: eliminates redundant operations, pushes predicates/projections, and fuses operations for minimal passes over data.
| Aspect | Pandas | Polars |
|---|---|---|
| String storage | Python objects (high overhead) | Arrow strings (contiguous, compact) |
| Null handling | NaN sentinel (wastes float64 space) | Arrow null bitmask (1 bit per value) |
| Integer nullability | Upcasts to float64 for NaN | Native nullable integers |
| Copy-on-write | CoW in Pandas 2.x (opt-in) | Immutable by design |
| Peak memory | Often 5-10x dataset size | Usually 2-3x dataset size |
| Larger-than-RAM | Manual chunking only | Streaming engine + sink |
# Pandas: check memory usage
df.info(memory_usage="deep")
df.memory_usage(deep=True).sum() / 1e6 # MB
# Polars: check memory usage
df.estimated_size("mb")
# Pandas: downcast types to save memory
df["id"] = pd.to_numeric(df["id"], downcast="integer")
df["name"] = df["name"].astype("category")
# Polars: cast to smaller types
df = df.with_columns(
pl.col("id").cast(pl.Int32),
pl.col("name").cast(pl.Categorical)
)
Both libraries are excellent. The right choice depends on your dataset size, team familiarity, ecosystem needs, and performance requirements.
Your workflow depends on libraries that expect Pandas DataFrames: scikit-learn, matplotlib, seaborn, statsmodels, Prophet, etc. Most ML and visualization libraries assume Pandas.
Working with datasets under ~1-5 GB that fit comfortably in memory. The performance difference with Polars may not matter at this scale, and Pandas is more familiar.
Your team already knows Pandas well. Over a decade of Stack Overflow answers, tutorials, and institutional knowledge. The switching cost may not be justified.
Jupyter notebook exploratory analysis where mutation (in-place modification) and the index are convenient. Quick prototyping, plotting, and ad-hoc analysis.
Working with large datasets (1GB+), production ETL pipelines, or any scenario where wall-clock time matters. Polars is typically 5-20x faster than Pandas.
Data doesn't fit in RAM. Polars' lazy evaluation with streaming and sink operations can process arbitrarily large datasets with bounded memory.
Building production data pipelines where correctness and performance matter. Strict typing, immutability, and the query optimizer catch bugs early and optimize automatically.
Starting a new project with no legacy constraints. Polars' expression-based API is more consistent and composable than Pandas' mixed bag of methods.
| Scenario | Recommendation | Why |
|---|---|---|
| EDA in Jupyter | Pandas | Better plotting integration, mutation convenience |
| ETL pipeline > 1GB | Polars | Lazy eval, streaming, multi-threaded |
| ML feature engineering | Either | Polars for speed, Pandas for sklearn compat |
| Real-time data processing | Polars | Lower latency, no GIL contention |
| Legacy codebase | Pandas | Migration cost; gradual Polars adoption possible |
| Time-series analysis | Either | Both have rolling/window; Pandas has more ts-specific tools |
| Parquet data lake | Polars | scan_parquet with predicate pushdown is unbeatable |
| New green-field project | Polars | Better API design, faster, growing ecosystem |
import pandas as pd
import polars as pl
# 1. Use Polars for heavy ETL
clean_data = (
pl.scan_parquet("raw/*.parquet")
.filter(pl.col("valid") == True)
.group_by("category")
.agg(
pl.col("revenue").sum(),
pl.col("orders").count(),
)
.sort("revenue", descending=True)
.collect()
)
# 2. Convert to Pandas for ML / visualization
df_pd = clean_data.to_pandas()
# 3. Use with scikit-learn
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(df_pd[["revenue", "orders"]])
# 4. Plot with matplotlib/seaborn
import matplotlib.pyplot as plt
df_pd.plot.bar(x="category", y="revenue")
plt.show()
With Pandas 2.x's ArrowDtype backends, converting between Polars and Pandas can be zero-copy (no data duplication). Use df_polars.to_pandas(use_pyarrow_extension_array=True) to preserve the Arrow memory layout in Pandas.