Polars Overview

Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.

Quick Start Installation and Basic Usage

Install Polars:

uv pip install polars

Basic DataFrame creation and operations:

import polars as pl

Create DataFrame

df = pl.DataFrame({ "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35], "city": ["NY", "LA", "SF"] })

Select columns

df.select("name", "age")

Filter rows

df.filter(pl.col("age") > 25)

Add computed columns

df.with_columns( age_plus_10=pl.col("age") + 10 )

Core Concepts Expressions

Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.

Key principles:

Use pl.col("column_name") to reference columns Chain methods to build complex transformations Expressions are lazy and only execute within contexts (select, with_columns, filter, group_by)

Example:

Expression-based computation

df.select( pl.col("name"), (pl.col("age") * 12).alias("age_in_months") )

Lazy vs Eager Evaluation

Eager (DataFrame): Operations execute immediately

df = pl.read_csv("file.csv") # Reads immediately result = df.filter(pl.col("age") > 25) # Executes immediately

Lazy (LazyFrame): Operations build a query plan, optimized before execution

lf = pl.scan_csv("file.csv") # Doesn't read yet result = lf.filter(pl.col("age") > 25).select("name", "age") df = result.collect() # Now executes optimized query

When to use lazy:

Working with large datasets Complex query pipelines When only some columns/rows are needed Performance is critical

Benefits of lazy evaluation:

Automatic query optimization Predicate pushdown Projection pushdown Parallel execution

For detailed concepts, load references/core_concepts.md.

Common Operations Select

Select and manipulate columns:

Select specific columns

df.select("name", "age")

Select with expressions

df.select( pl.col("name"), (pl.col("age") * 2).alias("double_age") )

Select all columns matching a pattern

df.select(pl.col("^.*_id$"))

Filter

Filter rows by conditions:

Single condition

df.filter(pl.col("age") > 25)

Multiple conditions (cleaner than using &)

df.filter( pl.col("age") > 25, pl.col("city") == "NY" )

Complex conditions

df.filter( (pl.col("age") > 25) | (pl.col("city") == "LA") )

With Columns

Add or modify columns while preserving existing ones:

Add new columns

df.with_columns( age_plus_10=pl.col("age") + 10, name_upper=pl.col("name").str.to_uppercase() )

Parallel computation (all columns computed in parallel)

df.with_columns( pl.col("value") * 10, pl.col("value") * 100, )

Group By and Aggregations

Group data and compute aggregations:

Basic grouping

df.group_by("city").agg( pl.col("age").mean().alias("avg_age"), pl.len().alias("count") )

Multiple group keys

df.group_by("city", "department").agg( pl.col("salary").sum() )

Conditional aggregations

df.group_by("city").agg( (pl.col("age") > 30).sum().alias("over_30") )

For detailed operation patterns, load references/operations.md.

Aggregations and Window Functions Aggregation Functions

Common aggregations within group_by context:

pl.len() - count rows pl.col("x").sum() - sum values pl.col("x").mean() - average pl.col("x").min() / pl.col("x").max() - extremes pl.first() / pl.last() - first/last values Window Functions with over()

Apply aggregations while preserving row count:

Add group statistics to each row

df.with_columns( avg_age_by_city=pl.col("age").mean().over("city"), rank_in_city=pl.col("salary").rank().over("city") )

Multiple grouping columns

df.with_columns( group_avg=pl.col("value").mean().over("category", "region") )

Mapping strategies:

group_to_rows (default): Preserves original row order explode: Faster but groups rows together join: Creates list columns Data I/O Supported Formats

Polars supports reading and writing:

CSV, Parquet, JSON, Excel Databases (via connectors) Cloud storage (S3, Azure, GCS) Google BigQuery Multiple/partitioned files Common I/O Operations

CSV:

Eager

df = pl.read_csv("file.csv") df.write_csv("output.csv")

Lazy (preferred for large files)

lf = pl.scan_csv("file.csv") result = lf.filter(...).select(...).collect()

Parquet (recommended for performance):

df = pl.read_parquet("file.parquet") df.write_parquet("output.parquet")

JSON:

df = pl.read_json("file.json") df.write_json("output.json")

For comprehensive I/O documentation, load references/io_guide.md.

Transformations Joins

Combine DataFrames:

Inner join

df1.join(df2, on="id", how="inner")

Left join

df1.join(df2, on="id", how="left")

Join on different column names

df1.join(df2, left_on="user_id", right_on="id")

Concatenation

Stack DataFrames:

Vertical (stack rows)

pl.concat([df1, df2], how="vertical")

Horizontal (add columns)

pl.concat([df1, df2], how="horizontal")

Diagonal (union with different schemas)

pl.concat([df1, df2], how="diagonal")

Pivot and Unpivot

Reshape data:

Pivot (wide format)

df.pivot(values="sales", index="date", columns="product")

Unpivot (long format)

df.unpivot(index="id", on=["col1", "col2"])

For detailed transformation examples, load references/transformations.md.

Pandas Migration

Polars offers significant performance improvements over pandas with a cleaner API. Key differences:

Conceptual Differences No index: Polars uses integer positions only Strict typing: No silent type conversions Lazy evaluation: Available via LazyFrame Parallel by default: Operations parallelized automatically Common Operation Mappings Operation Pandas Polars Select column df["col"] df.select("col") Filter df[df["col"] > 10] df.filter(pl.col("col") > 10) Add column df.assign(x=...) df.with_columns(x=...) Group by df.groupby("col").agg(...) df.group_by("col").agg(...) Window df.groupby("col").transform(...) df.with_columns(...).over("col") Key Syntax Patterns

Pandas sequential (slow):

df.assign( col_a=lambda df_: df_.value * 10, col_b=lambda df_: df_.value * 100 )

Polars parallel (fast):

df.with_columns( col_a=pl.col("value") * 10, col_b=pl.col("value") * 100, )

For comprehensive migration guide, load references/pandas_migration.md.

Best Practices Performance Optimization

Use lazy evaluation for large datasets:

lf = pl.scan_csv("large.csv") # Don't use read_csv result = lf.filter(...).select(...).collect()

Avoid Python functions in hot paths:

Stay within expression API for parallelization Use .map_elements() only when necessary Prefer native Polars operations

Use streaming for very large data:

lf.collect(streaming=True)

Select only needed columns early:

Good: Select columns early

lf.select("col1", "col2").filter(...)

Bad: Filter on all columns first

lf.filter(...).select("col1", "col2")

Use appropriate data types:

Categorical for low-cardinality strings Appropriate integer sizes (i32 vs i64) Date types for temporal data Expression Patterns

Conditional operations:

pl.when(condition).then(value).otherwise(other_value)

Column operations across multiple columns:

df.select(pl.col("^.*_value$") * 2) # Regex pattern

Null handling:

pl.col("x").fill_null(0) pl.col("x").is_null() pl.col("x").drop_nulls()

For additional best practices and patterns, load references/best_practices.md.

Resources

This skill includes comprehensive reference documentation:

references/ core_concepts.md - Detailed explanations of expressions, lazy evaluation, and type system operations.md - Comprehensive guide to all common operations with examples pandas_migration.md - Complete migration guide from pandas to Polars io_guide.md - Data I/O operations for all supported formats transformations.md - Joins, concatenation, pivots, and reshaping operations best_practices.md - Performance optimization tips and common patterns

Load these references as needed when users require detailed information about specific topics.

安装