Polars
Overview
Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.
Quick Start
Installation and Basic Usage
Install Polars:
uv pip install polars
Basic DataFrame creation and operations:
import polars as pl

# Create DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NY", "LA", "SF"],
})

# Select columns
df.select("name", "age")

# Filter rows
df.filter(pl.col("age") > 25)

# Add computed columns
df.with_columns(age_plus_10=pl.col("age") + 10)
Core Concepts
Expressions
Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.
Key principles:
- Use pl.col("column_name") to reference columns
- Chain methods to build complex transformations
- Expressions are lazy and only execute within contexts (select, with_columns, filter, group_by)
Example:
# Expression-based computation
df.select(
    pl.col("name"),
    (pl.col("age") * 12).alias("age_in_months"),
)
Lazy vs Eager Evaluation
Eager (DataFrame): Operations execute immediately
df = pl.read_csv("file.csv")  # Reads immediately
result = df.filter(pl.col("age") > 25)  # Executes immediately
Lazy (LazyFrame): Operations build a query plan, optimized before execution
lf = pl.scan_csv("file.csv")  # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect()  # Now executes optimized query
When to use lazy:
- Working with large datasets
- Complex query pipelines
- When only some columns/rows are needed
- Performance is critical
Benefits of lazy evaluation:
- Automatic query optimization
- Predicate pushdown
- Projection pushdown
- Parallel execution
For detailed concepts, load references/core_concepts.md.
Common Operations
Select
Select and manipulate columns:
# Select specific columns
df.select("name", "age")

# Select with expressions
df.select(
    pl.col("name"),
    (pl.col("age") * 2).alias("double_age"),
)

# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
Filter
Filter rows by conditions:
# Single condition
df.filter(pl.col("age") > 25)

# Multiple conditions (combined with AND; cleaner than using &)
df.filter(
    pl.col("age") > 25,
    pl.col("city") == "NY",
)

# Complex conditions (| is OR; parenthesize each comparison)
df.filter(
    (pl.col("age") > 25) | (pl.col("city") == "LA")
)
With Columns
Add or modify columns while preserving existing ones:
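# Add new columns
df.with_columns(
    age_plus_10=pl.col("age") + 10,
    name_upper=pl.col("name").str.to_uppercase(),
)

# Parallel computation (all columns computed in parallel);
# each output needs a distinct name, otherwise Polars raises a duplicate-column error
df.with_columns(
    (pl.col("value") * 10).alias("value_x10"),
    (pl.col("value") * 100).alias("value_x100"),
)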
Group By and Aggregations
Group data and compute aggregations:
# Basic grouping
df.group_by("city").agg(
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("count"),
)

# Multiple group keys
df.group_by("city", "department").agg(
    pl.col("salary").sum()
)

# Conditional aggregations
df.group_by("city").agg(
    (pl.col("age") > 30).sum().alias("over_30")
)
For detailed operation patterns, load references/operations.md.
Aggregations and Window Functions
Aggregation Functions
Common aggregations within group_by context:
- pl.len() - count rows
- pl.col("x").sum() - sum values
- pl.col("x").mean() - average
- pl.col("x").min() / pl.col("x").max() - extremes
- pl.col("x").first() / pl.col("x").last() - first/last values
Window Functions with over()
Apply aggregations while preserving row count:
# Add group statistics to each row
df.with_columns(
    avg_age_by_city=pl.col("age").mean().over("city"),
    rank_in_city=pl.col("salary").rank().over("city"),
)

# Multiple grouping columns
df.with_columns(
    group_avg=pl.col("value").mean().over("category", "region")
)
Mapping strategies:
- group_to_rows (default): Preserves original row order
- explode: Faster but groups rows together
- join: Creates list columns
Data I/O
Supported Formats
Polars supports reading and writing:
- CSV, Parquet, JSON, Excel
- Databases (via connectors)
- Cloud storage (S3, Azure, GCS)
- Google BigQuery
- Multiple/partitioned files
Common I/O Operations
CSV:
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")

# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
Parquet (recommended for performance):
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
JSON:
df = pl.read_json("file.json")
df.write_json("output.json")
For comprehensive I/O documentation, load references/io_guide.md.
Transformations
Joins
Combine DataFrames:
# Inner join
df1.join(df2, on="id", how="inner")

# Left join
df1.join(df2, on="id", how="left")

# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
Concatenation
Stack DataFrames:
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")

# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")

# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
Pivot and Unpivot
Reshape data:
# Pivot (wide format); since Polars 1.0 the column to spread is passed as `on`
df.pivot(on="product", index="date", values="sales")

# Unpivot (long format)
df.unpivot(index="id", on=["col1", "col2"])
For detailed transformation examples, load references/transformations.md.
Pandas Migration
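Polars is often considerably faster than pandas, largely because it parallelizes work across cores and can optimize lazy queries, and its API is built consistently around expressions. Key differences: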
Conceptual Differences
- No index: Polars uses integer positions only
- Strict typing: No silent type conversions
- Lazy evaluation: Available via LazyFrame
- Parallel by default: Operations parallelized automatically

Common Operation Mappings

| Operation | Pandas | Polars |
| --- | --- | --- |
| Select column | df["col"] | df.select("col") |
| Filter | df[df["col"] > 10] | df.filter(pl.col("col") > 10) |
| Add column | df.assign(x=...) | df.with_columns(x=...) |
| Group by | df.groupby("col").agg(...) | df.group_by("col").agg(...) |
| Window | df.groupby("col").transform(...) | df.with_columns(pl.col("x").mean().over("col")) |

Key Syntax Patterns
Pandas sequential (slow):
df.assign(
    col_a=lambda df_: df_.value * 10,
    col_b=lambda df_: df_.value * 100,
)
Polars parallel (fast):
df.with_columns(
    col_a=pl.col("value") * 10,
    col_b=pl.col("value") * 100,
)
For comprehensive migration guide, load references/pandas_migration.md.
Best Practices
Performance Optimization
Use lazy evaluation for large datasets:
lf = pl.scan_csv("large.csv")  # Don't use read_csv
result = lf.filter(...).select(...).collect()
Avoid Python functions in hot paths:
- Stay within expression API for parallelization
- Use .map_elements() only when necessary
- Prefer native Polars operations
Use streaming for very large data:
lf.collect(engine="streaming")  # older Polars releases: lf.collect(streaming=True)
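Select only needed columns early. In lazy queries projection pushdown usually achieves this automatically, so the ordering below matters most for eager DataFrames:
# Good: select columns early
lf.select("col1", "col2").filter(...)

# Bad: filter on all columns first
lf.filter(...).select("col1", "col2")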
Use appropriate data types:
- Categorical for low-cardinality strings
- Appropriate integer sizes (pl.Int32 vs pl.Int64)
- Date types for temporal data
Expression Patterns
Conditional operations:
pl.when(condition).then(value).otherwise(other_value)
Column operations across multiple columns:
df.select(pl.col("^.*_value$") * 2) # Regex pattern
Null handling:
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
For additional best practices and patterns, load references/best_practices.md.
Resources
This skill includes comprehensive reference documentation:
references/
- core_concepts.md - Detailed explanations of expressions, lazy evaluation, and type system
- operations.md - Comprehensive guide to all common operations with examples
- pandas_migration.md - Complete migration guide from pandas to Polars
- io_guide.md - Data I/O operations for all supported formats
- transformations.md - Joins, concatenation, pivots, and reshaping operations
- best_practices.md - Performance optimization tips and common patterns
Load these references as needed when users require detailed information about specific topics.