- Data Analysis
- When to use this skill
- Data exploration
-
- Understand a new dataset
- Report generation
-
- Derive data-driven insights
- Quality validation
-
- Check data consistency
- Decision support
- Make data-driven recommendations Instructions Step 1: Load and explore data Python (Pandas) : import pandas as pd import numpy as np
Load CSV
df
pd . read_csv ( 'data.csv' )
Basic info
print ( df . info ( ) ) print ( df . describe ( ) ) print ( df . head ( 10 ) )
Check missing values
print ( df . isnull ( ) . sum ( ) )
Data types
print ( df . dtypes ) SQL : -- Inspect table schema DESCRIBE table_name ; -- Sample data SELECT * FROM table_name LIMIT 10 ; -- Basic stats SELECT COUNT ( * ) as total_rows , COUNT ( DISTINCT column_name ) as unique_values , MIN ( numeric_column ) as min_val , MAX ( numeric_column ) as max_val , AVG ( numeric_column ) as avg_val FROM table_name ; Step 2: Data cleaning
Handle missing values
df [ 'column' ] . fillna ( df [ 'column' ] . mean ( ) , inplace = True ) df . dropna ( subset = [ 'required_column' ] , inplace = True )
Remove duplicates
df . drop_duplicates ( inplace = True )
Type conversions
df [ 'date' ] = pd . to_datetime ( df [ 'date' ] ) df [ 'category' ] = df [ 'category' ] . astype ( 'category' )
Remove outliers (IQR method)
Q1
df [ 'value' ] . quantile ( 0.25 ) Q3 = df [ 'value' ] . quantile ( 0.75 ) IQR = Q3 - Q1 df = df [ ( df [ 'value' ]
= Q1 - 1.5 * IQR ) & ( df [ 'value' ] <= Q3 + 1.5 * IQR ) ] Step 3: Statistical analysis
Descriptive statistics
print ( df [ 'numeric_column' ] . describe ( ) )
Grouped analysis
grouped
df . groupby ( 'category' ) . agg ( { 'value' : [ 'mean' , 'sum' , 'count' ] , 'other' : 'nunique' } ) print ( grouped )
Correlation
correlation
df [ [ 'col1' , 'col2' , 'col3' ] ] . corr ( ) print ( correlation )
Pivot table
pivot
pd . pivot_table ( df , values = 'sales' , index = 'region' , columns = 'month' , aggfunc = 'sum' ) Step 4: Visualization import matplotlib . pyplot as plt import seaborn as sns
Histogram
plt . figure ( figsize = ( 10 , 6 ) ) df [ 'value' ] . hist ( bins = 30 ) plt . title ( 'Distribution of Values' ) plt . savefig ( 'histogram.png' )
Boxplot
plt . figure ( figsize = ( 10 , 6 ) ) sns . boxplot ( x = 'category' , y = 'value' , data = df ) plt . title ( 'Value by Category' ) plt . savefig ( 'boxplot.png' )
Heatmap (correlation)
plt . figure ( figsize = ( 10 , 8 ) ) sns . heatmap ( correlation , annot = True , cmap = 'coolwarm' ) plt . title ( 'Correlation Matrix' ) plt . savefig ( 'heatmap.png' )
Time series
plt . figure ( figsize = ( 12 , 6 ) ) df . groupby ( 'date' ) [ 'value' ] . sum ( ) . plot ( ) plt . title ( 'Time Series of Values' ) plt . savefig ( 'timeseries.png' ) Step 5: Derive insights
Top/bottom analysis
top_10
df . nlargest ( 10 , 'value' ) bottom_10 = df . nsmallest ( 10 , 'value' )
Trend analysis
df [ 'month' ] = df [ 'date' ] . dt . to_period ( 'M' ) monthly_trend = df . groupby ( 'month' ) [ 'value' ] . sum ( ) growth = monthly_trend . pct_change ( ) * 100
Segment analysis
segments
df . groupby ( 'segment' ) . agg ( { 'revenue' : 'sum' , 'customers' : 'nunique' , 'orders' : 'count' } ) segments [ 'avg_order_value' ] = segments [ 'revenue' ] / segments [ 'orders' ] Output format Analysis report structure
Data Analysis Report
1. Dataset overview
Dataset: [name]
Records: X,XXX
Columns: XX
Date range: YYYY-MM-DD ~ YYYY-MM-DD
2. Key findings
Insight 1
Insight 2
Insight 3
- Statistical summary | Metric | Value | |
|
| | Mean | X.XX | | Median | X.XX | | Std dev | X.XX |
-
- Recommendations
- 1.
- [Recommendation 1]
- 2.
- [Recommendation 2]
- Best practices
- Understand the data first
-
- Learn structure and meaning before analysis
- Incremental analysis
-
- Move from simple to complex analyses
- Use visualization
-
- Use a variety of charts to spot patterns
- Validate assumptions
-
- Always verify assumptions about the data
- Reproducibility
- Document analysis code and results Constraints Required rules (MUST) Preserve raw data (work on a copy) Document the analysis process Validate results Prohibited (MUST NOT) Do not expose sensitive personal data Do not draw unsupported conclusions References Pandas Documentation Matplotlib Gallery Seaborn Tutorial Examples Example 1: Basic usage Example 2: Advanced usage