Data Analysis Skill Comprehensive data analysis toolkit using Polars - a blazingly fast DataFrame library. This skill provides instructions, reference documentation, and ready-to-use scripts for common data analysis tasks. Iteration Checkpoints Step What to Present User Input Type Data Loading Shape, columns, sample rows "Is this the right data?" Data Exploration Summary stats, data quality issues "Any columns to focus on?" Transformation Before/after comparison "Does this transformation look correct?" Analysis Key findings, charts "Should I dig deeper into anything?" Export Output preview "Ready to save, or any changes?" Quick Start import polars as pl from polars import col
Load data
df
pl . read_csv ( "data.csv" )
Explore
print ( df . shape , df . schema ) df . describe ( )
Transform and analyze
result
( df . filter ( col ( "value" )
0 ) . group_by ( "category" ) . agg ( col ( "value" ) . sum ( ) . alias ( "total" ) ) . sort ( "total" , descending = True ) )
Export
result . write_csv ( "output.csv" ) When to Use This Skill Loading datasets (CSV, JSON, Parquet, Excel, databases) Data cleaning, filtering, and transformation Aggregations, grouping, and pivot tables Statistical analysis and summary statistics Time series analysis and resampling Joining and merging multiple datasets Creating visualizations and charts Exporting results to various formats Skill Contents Reference Documentation Detailed API reference and patterns for specific operations: reference/loading.md - Loading data from all supported formats reference/transformations.md - Column operations, filtering, sorting, type casting reference/aggregations.md - Group by, window functions, running totals reference/time_series.md - Date parsing, resampling, lag features reference/statistics.md - Correlations, distributions, hypothesis testing setup reference/visualization.md - Creating charts with matplotlib/plotly Ready-to-Use Scripts Executable Python scripts for common tasks: scripts/explore_data.py - Quick dataset exploration and profiling scripts/summary_stats.py - Generate comprehensive statistics report Core Patterns Loading Data
CSV (most common)
df
pl . read_csv ( "data.csv" )
Lazy loading for large files
df
pl . scan_csv ( "large.csv" ) . filter ( col ( "x" )
0 ) . collect ( )
Parquet (recommended for large datasets)
df
pl . read_parquet ( "data.parquet" )
JSON
df
pl . read_json ( "data.json" ) df = pl . read_ndjson ( "data.ndjson" )
Newline-delimited
Filtering and Selection
Select columns
df . select ( "col1" , "col2" ) df . select ( col ( "name" ) , col ( "value" ) * 2 )
Filter rows
df . filter ( col ( "age" )
25 ) df . filter ( ( col ( "status" ) == "active" ) & ( col ( "value" )
100 ) ) df . filter ( col ( "name" ) . str . contains ( "Smith" ) ) Transformations
Add/modify columns
df
df . with_columns ( ( col ( "price" ) * col ( "qty" ) ) . alias ( "total" ) , col ( "date_str" ) . str . to_date ( "%Y-%m-%d" ) . alias ( "date" ) , )
Conditional values
df
df . with_columns ( pl . when ( col ( "score" )
= 90 ) . then ( pl . lit ( "A" ) ) . when ( col ( "score" ) = 80 ) . then ( pl . lit ( "B" ) ) . otherwise ( pl . lit ( "C" ) ) . alias ( "grade" ) ) Aggregations
Group by
df . group_by ( "category" ) . agg ( col ( "value" ) . sum ( ) . alias ( "total" ) , col ( "value" ) . mean ( ) . alias ( "avg" ) , pl . len ( ) . alias ( "count" ) , )
Window functions
df . with_columns ( col ( "value" ) . sum ( ) . over ( "group" ) . alias ( "group_total" ) , col ( "value" ) . rank ( ) . over ( "group" ) . alias ( "rank_in_group" ) , ) Exporting df . write_csv ( "output.csv" ) df . write_parquet ( "output.parquet" ) df . write_json ( "output.json" , row_oriented = True ) Best Practices Use lazy evaluation for large datasets: pl.scan_csv() + .collect() Filter early to reduce data volume before expensive operations Select only needed columns to minimize memory usage Prefer Parquet for storage - faster I/O, better compression Use .explain() to understand and optimize query plans