Data Analysis

When to use this skill

Data exploration: Understand a new dataset
Report generation: Derive data-driven insights
Quality validation: Check data consistency
Decision support: Make data-driven recommendations

Instructions

Step 1: Load and explore data

Python (Pandas):

    import pandas as pd
    import numpy as np

    # Load CSV
    df = pd.read_csv('data.csv')

    # Basic info (df.info() prints its summary directly and returns None)
    df.info()
    print(df.describe())
    print(df.head(10))

    # Check missing values
    print(df.isnull().sum())

    # Data types
    print(df.dtypes)

SQL:

    -- Inspect table schema (DESCRIBE is dialect-specific, e.g. MySQL)
    DESCRIBE table_name;

    -- Sample data
    SELECT * FROM table_name LIMIT 10;

    -- Basic stats
    SELECT
        COUNT(*) AS total_rows,
        COUNT(DISTINCT column_name) AS unique_values,
        MIN(numeric_column) AS min_val,
        MAX(numeric_column) AS max_val,
        AVG(numeric_column) AS avg_val
    FROM table_name;

Step 2: Data cleaning
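Before choosing between filling and dropping, it can help to see how much of each column is actually missing. The following is a minimal sketch of such a summary; the `missing_summary` helper and the sample columns are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value count and percentage per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "price": [10.0, np.nan, 12.5, np.nan],
    "city": ["Oslo", "Lima", None, "Rome"],
    "units": [1, 2, 3, 4],
})
report = missing_summary(sample)
print(report)
```

A column that is mostly empty is often a candidate for dropping, while a column with only a few gaps may be worth filling.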

    # Handle missing values (assign the result back; inplace=True on a
    # selected column may not update df in recent pandas versions)
    df['column'] = df['column'].fillna(df['column'].mean())
    df = df.dropna(subset=['required_column'])

    # Remove duplicates
    df = df.drop_duplicates()

    # Type conversions
    df['date'] = pd.to_datetime(df['date'])
    df['category'] = df['category'].astype('category')
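The constraints in this document ask for raw data to be preserved. One way to honor that is to wrap the cleaning operations above in a function that works on a copy. This is a sketch under assumptions: the `clean` function and the column names `date`, `category`, and `value` are illustrative, not part of the original instructions.

```python
import numpy as np
import pandas as pd


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of raw; raw itself is never modified."""
    df = raw.copy()
    # Fill gaps in the numeric column with the column mean
    # (computed before any rows are dropped)
    df["value"] = df["value"].fillna(df["value"].mean())
    # Drop rows that lack the required field, then exact duplicates
    df = df.dropna(subset=["date"]).drop_duplicates()
    # Convert types last, once the set of rows is final
    df = df.assign(
        date=pd.to_datetime(df["date"]),
        category=df["category"].astype("category"),
    )
    return df.reset_index(drop=True)


# Hypothetical raw data, for illustration only
raw = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", None, "2024-01-03"],
    "category": ["a", "a", "b", "b"],
    "value": [10.0, 10.0, 5.0, np.nan],
})
cleaned = clean(raw)
print(cleaned)
```

Because `clean` returns a new DataFrame, the original `raw` stays available for comparison when validating results.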

    # Remove outliers (IQR method)
    Q1 = df['value'].quantile(0.25)
    Q3 = df['value'].quantile(0.75)
    IQR = Q3 - Q1
    df = df[(df['value'] >= Q1 - 1.5 * IQR) & (df['value'] <= Q3 + 1.5 * IQR)]

Step 3: Statistical analysis

    # Descriptive statistics
    print(df['numeric_column'].describe())
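The quartiles reported by `describe()` are the same ones the IQR rule in Step 2 relies on, so the outlier fences can be read off the summary. A small worked sketch follows; the `iqr_bounds` helper and the sample values are hypothetical.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Derive the lower and upper IQR fences from describe() output."""
    stats = series.describe()
    q1, q3 = stats["25%"], stats["75%"]
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Hypothetical values with one obvious outlier
values = pd.Series([1, 2, 3, 4, 100], name="value")
low, high = iqr_bounds(values)
outliers = values[(values < low) | (values > high)]
print(low, high)          # -1.0 7.0
print(outliers.tolist())  # [100]
```

Here Q1 is 2 and Q3 is 4, so the IQR is 2 and the fences are 2 - 3 = -1 and 4 + 3 = 7; only the value 100 falls outside them.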

    # Grouped analysis
    grouped = df.groupby('category').agg({
        'value': ['mean', 'sum', 'count'],
        'other': 'nunique'
    })
    print(grouped)
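Passing a list of functions for one column produces two-level column labels such as `('value', 'mean')`. If flat column names are easier to work with downstream, pandas named aggregation is one alternative. The sketch below uses hypothetical sample data and output names.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b"],
    "value": [10, 20, 5, 5, 20],
    "other": ["x", "y", "x", "x", "z"],
})

# Each keyword is an output column: name=(source column, aggregation)
grouped_flat = sample.groupby("category").agg(
    value_mean=("value", "mean"),
    value_sum=("value", "sum"),
    value_count=("value", "count"),
    other_unique=("other", "nunique"),
)
print(grouped_flat)
```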

    # Correlation
    correlation = df[['col1', 'col2', 'col3']].corr()
    print(correlation)
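A correlation matrix lists every pair twice and includes the trivial diagonal. When there are many columns, it may be easier to rank the distinct pairs by strength. The following is a sketch; `top_correlations` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def top_correlations(corr: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    # Keep only cells strictly above the diagonal so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5],
    "col2": [2, 4, 6, 8, 10],  # exactly 2 * col1
    "col3": [5, 3, 4, 1, 2],
})
strongest = top_correlations(sample.corr(), n=2)
print(strongest)
```

Ranking by absolute value keeps strong negative relationships visible alongside strong positive ones.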

    # Pivot table
    pivot = pd.pivot_table(
        df,
        values='sales',
        index='region',
        columns='month',
        aggfunc='sum'
    )

Step 4: Visualization

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Histogram
    plt.figure(figsize=(10, 6))
    df['value'].hist(bins=30)
    plt.title('Distribution of Values')
    plt.savefig('histogram.png')

    # Boxplot
    plt.figure(figsize=(10, 6))
    sns.boxplot(x='category', y='value', data=df)
    plt.title('Value by Category')
    plt.savefig('boxplot.png')

    # Heatmap (correlation)
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlation, annot=True, cmap='coolwarm')
    plt.title('Correlation Matrix')
    plt.savefig('heatmap.png')
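The charts in this step all follow the same create, title, save pattern. When a script produces many charts, it can help to close each figure after saving so that open figures do not accumulate in memory. The sketch below shows one way to do that with the Matplotlib object-oriented interface; the `save_histogram` helper, the output path, and the choice of the non-interactive `Agg` backend are assumptions for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd


def save_histogram(values: pd.Series, path: str, bins: int = 30) -> str:
    """Draw a histogram of values, save it to path, and close the figure."""
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.hist(values.dropna(), bins=bins)
    ax.set_title("Distribution of Values")
    ax.set_xlabel("value")
    ax.set_ylabel("Count")
    fig.savefig(path)
    plt.close(fig)  # release the figure once it is on disk
    return path


# Hypothetical sample data and output location
out_path = save_histogram(
    pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5]),
    os.path.join(tempfile.gettempdir(), "histogram_example.png"),
    bins=5,
)
print(out_path)
```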

    # Time series
    plt.figure(figsize=(12, 6))
    df.groupby('date')['value'].sum().plot()
    plt.title('Time Series of Values')
    plt.savefig('timeseries.png')

Step 5: Derive insights

    # Top/bottom analysis
    top_10 = df.nlargest(10, 'value')
    bottom_10 = df.nsmallest(10, 'value')

    # Trend analysis
    df['month'] = df['date'].dt.to_period('M')
    monthly_trend = df.groupby('month')['value'].sum()
    growth = monthly_trend.pct_change() * 100
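To make the growth calculation concrete, here is the same trend analysis run on a small, hypothetical dataset (the `sales` frame and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical sample data: two rows in January, one in February, two in March
sales = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-10", "2024-03-01", "2024-03-15"]
    ),
    "value": [100, 100, 250, 150, 150],
})

sales["month"] = sales["date"].dt.to_period("M")
monthly_trend = sales.groupby("month")["value"].sum()  # 200, 250, 300
growth = monthly_trend.pct_change() * 100              # NaN, 25.0, 20.0
print(monthly_trend)
print(growth)
```

The first month has no predecessor, so its growth is NaN. Note also that a month with no rows is absent from `monthly_trend`, in which case `pct_change` compares against the previous month that does have data.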

    # Segment analysis
    segments = df.groupby('segment').agg({
        'revenue': 'sum',
        'customers': 'nunique',
        'orders': 'count'
    })
    segments['avg_order_value'] = segments['revenue'] / segments['orders']

Output format

Analysis report structure

Data Analysis Report

1. Dataset overview

Dataset: [name]

Records: X,XXX

Columns: XX

Date range: YYYY-MM-DD ~ YYYY-MM-DD

2. Key findings

Insight 1

Insight 2

Insight 3

3. Statistical summary

| Metric | Value |
|--------|-------|
| Mean | X.XX |
| Median | X.XX |
| Std dev | X.XX |

4. Recommendations

1. [Recommendation 1]

2. [Recommendation 2]
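The statistical summary table in the report structure above can be generated from a numeric column rather than typed by hand. The following is a minimal sketch; the `summary_table` helper and the sample values are hypothetical, and the two-decimal formatting simply mirrors the `X.XX` placeholders.

```python
import pandas as pd


def summary_table(series: pd.Series) -> str:
    """Render mean, median, and sample standard deviation as a pipe table."""
    rows = [
        ("Mean", series.mean()),
        ("Median", series.median()),
        ("Std dev", series.std()),  # pandas uses the sample std (ddof=1)
    ]
    lines = ["| Metric | Value |", "|--------|-------|"]
    lines += [f"| {name} | {value:.2f} |" for name, value in rows]
    return "\n".join(lines)


# Hypothetical values, for illustration only
table = summary_table(pd.Series([2.0, 4.0, 4.0, 6.0]))
print(table)
```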

Best practices

Understand the data first: Learn structure and meaning before analysis
Incremental analysis: Move from simple to complex analyses
Use visualization: Use a variety of charts to spot patterns
Validate assumptions: Always verify assumptions about the data
Reproducibility: Document analysis code and results

Constraints

Required rules (MUST)

Preserve raw data (work on a copy)
Document the analysis process
Validate results

Prohibited (MUST NOT)

Do not expose sensitive personal data
Do not draw unsupported conclusions

References

Pandas Documentation
Matplotlib Gallery
Seaborn Tutorial

Examples

Example 1: Basic usage

Example 2: Advanced usage
