Data Analysis Skill

Comprehensive data analysis toolkit using Polars - a blazingly fast DataFrame library. This skill provides instructions, reference documentation, and ready-to-use scripts for common data analysis tasks.

Iteration Checkpoints

| Step | What to Present | User Input Type |
| --- | --- | --- |
| Data Loading | Shape, columns, sample rows | "Is this the right data?" |
| Data Exploration | Summary stats, data quality issues | "Any columns to focus on?" |
| Transformation | Before/after comparison | "Does this transformation look correct?" |
| Analysis | Key findings, charts | "Should I dig deeper into anything?" |
| Export | Output preview | "Ready to save, or any changes?" |

Quick Start

    import polars as pl
    from polars import col

    # Load data
    df = pl.read_csv("data.csv")

    # Explore
    print(df.shape, df.schema)
    print(df.describe())

    # Transform and analyze
    result = (
        df.filter(col("value") > 0)
        .group_by("category")
        .agg(col("value").sum().alias("total"))
        .sort("total", descending=True)
    )

    # Export
    result.write_csv("output.csv")

When to Use This Skill

- Loading datasets (CSV, JSON, Parquet, Excel, databases)
- Data cleaning, filtering, and transformation
- Aggregations, grouping, and pivot tables
- Statistical analysis and summary statistics
- Time series analysis and resampling
- Joining and merging multiple datasets
- Creating visualizations and charts
- Exporting results to various formats

Skill Contents

Reference Documentation

Detailed API reference and patterns for specific operations:

- reference/loading.md - Loading data from all supported formats
- reference/transformations.md - Column operations, filtering, sorting, type casting
- reference/aggregations.md - Group by, window functions, running totals
- reference/time_series.md - Date parsing, resampling, lag features
- reference/statistics.md - Correlations, distributions, hypothesis testing setup
- reference/visualization.md - Creating charts with matplotlib/plotly

Ready-to-Use Scripts

Executable Python scripts for common tasks:

- scripts/explore_data.py - Quick dataset exploration and profiling
- scripts/summary_stats.py - Generate comprehensive statistics report

Core Patterns

Loading Data

    # CSV (most common)
    df = pl.read_csv("data.csv")

    # Lazy loading for large files
    df = pl.scan_csv("large.csv").filter(col("x") > 0).collect()

    # Parquet (recommended for large datasets)
    df = pl.read_parquet("data.parquet")

    # JSON
    df = pl.read_json("data.json")
    df = pl.read_ndjson("data.ndjson")  # Newline-delimited

Filtering and Selection

    # Select columns
    df.select("col1", "col2")
    df.select(col("name"), col("value") * 2)

    # Filter rows
    df.filter(col("age") > 25)
    df.filter((col("status") == "active") & (col("value") > 100))
    df.filter(col("name").str.contains("Smith"))

Transformations

    # Add/modify columns
    df = df.with_columns(
        (col("price") * col("qty")).alias("total"),
        col("date_str").str.to_date("%Y-%m-%d").alias("date"),
    )

    # Conditional values
    df = df.with_columns(
        pl.when(col("score") >= 90).then(pl.lit("A"))
        .when(col("score") >= 80).then(pl.lit("B"))
        .otherwise(pl.lit("C"))
        .alias("grade")
    )

Aggregations

    # Group by
    df.group_by("category").agg(
        col("value").sum().alias("total"),
        col("value").mean().alias("avg"),
        pl.len().alias("count"),
    )

    # Window functions
    df.with_columns(
        col("value").sum().over("group").alias("group_total"),
        col("value").rank().over("group").alias("rank_in_group"),
    )

Exporting

    df.write_csv("output.csv")
    df.write_parquet("output.parquet")
    # row_oriented applies to Polars < 1.0; from 1.0 on, omit it
    # (write_json always produces row-oriented output there).
    df.write_json("output.json", row_oriented=True)

Best Practices

- Use lazy evaluation for large datasets: pl.scan_csv() + .collect()
- Filter early to reduce data volume before expensive operations
- Select only needed columns to minimize memory usage
- Prefer Parquet for storage - faster I/O, better compression
- Use .explain() to understand and optimize query plans
