Data journalism methodology

Systematic approaches for finding, analyzing, and presenting data in journalism.

Data acquisition

Public data sources

Federal data sources

General

Data.gov - Federal open data portal
Census Bureau (census.gov) - Demographics, economic data
BLS (bls.gov) - Employment, inflation, wages
BEA (bea.gov) - GDP, economic accounts
Federal Reserve (federalreserve.gov) - Financial data
SEC EDGAR - Corporate filings

Specific domains

EPA (epa.gov/data) - Environmental data
FDA (fda.gov/data) - Drug approvals, recalls, adverse events
CDC WONDER - Health statistics
NHTSA - Vehicle safety data
DOT - Transportation statistics
FEC - Campaign finance
USASpending.gov - Federal contracts and grants

State and local

State open data portals (search: "[state] open data")
Socrata-powered sites (many cities/states)
OpenStreetMap, municipal GIS portals
State comptroller/auditor reports
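Many Socrata-powered portals expose datasets through the SODA API, which accepts SoQL query parameters such as `$where` and `$limit`. The sketch below only builds a request URL. The helper name `build_soda_url`, the domain `data.example.gov`, and the dataset ID `abcd-1234` are placeholders for illustration, not real endpoints.

```python
from urllib.parse import urlencode


def build_soda_url(domain, dataset_id, where=None, limit=1000):
    """Build a SODA query URL for a Socrata-hosted dataset (hypothetical helper)."""
    params = {"$limit": limit}
    if where:
        params["$where"] = where
    # safe="$" keeps the SoQL parameter names readable in the URL
    query = urlencode(params, safe="$")
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


# Placeholder domain and dataset ID; substitute the portal you are querying
url = build_soda_url("data.example.gov", "abcd-1234", where="year = 2023", limit=500)
print(url)
```

Check the portal's own API documentation for its field names and any row limits before relying on a query like this.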

Data request strategies

Getting data that isn't public

FOIA for datasets

Request databases, not just documents
Ask for data dictionary/schema
Request in native format (CSV, SQL dump)
Specify field-level needs
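Once an agency delivers both the data and its data dictionary, one quick check is whether the two actually agree. The following is a minimal sketch under assumed inputs: the field names and descriptions are invented for illustration, and `check_against_dictionary` is a hypothetical helper.

```python
def check_against_dictionary(columns, data_dictionary):
    """Compare delivered column names with the documented field names."""
    delivered = set(columns)
    documented = set(data_dictionary)
    return {
        "missing_from_data": sorted(documented - delivered),
        "undocumented": sorted(delivered - documented),
    }


# Hypothetical data dictionary and CSV header row, for illustration only
data_dictionary = {
    "inspection_id": "Unique ID for each inspection",
    "inspection_date": "Date of inspection (YYYY-MM-DD)",
    "violation_code": "Code from the agency's violation table",
}
header = ["inspection_id", "inspection_date", "inspector_notes"]

report = check_against_dictionary(header, data_dictionary)
print(report)
```

Fields missing from the delivery are worth a follow-up request; undocumented fields are worth a question to the records officer.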

Building your own dataset

Scraping public information
Crowdsourcing from readers
Systematic document review
Surveys (with proper methodology)
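As a rough illustration of scraping public information, the sketch below pulls rows out of an HTML table using only the standard library. The HTML snippet and the `TableParser` class are made up for this example; real pages usually need more defensive handling, and a site's terms of use should be checked first.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._in_cell:
            self._row[-1] += data.strip()


# Invented sample page, for illustration only
html = """
<table>
  <tr><th>Facility</th><th>Violations</th></tr>
  <tr><td>Plant A</td><td>3</td></tr>
  <tr><td>Plant B</td><td>0</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
header, *records = parser.rows
print(header)
print(records)
```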

Commercial data sources (for newsrooms)

LexisNexis
Refinitiv
Bloomberg
Industry-specific databases

Data cleaning and preparation

Common data problems

import pandas as pd
import numpy as np

Load messy data

df = pd.read_csv('raw_data.csv')

1. INCONSISTENT FORMATTING

Problem: Names in different formats

"SMITH, JOHN" vs "John Smith" vs "smith john"

def standardize_name(name):
    """Standardize name format to 'First Last'."""
    if pd.isna(name):
        return None
    name = str(name).strip().lower()
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name.title()

df['name_clean'] = df['name'].apply(standardize_name)

2. DATE INCONSISTENCIES

Problem: Dates in multiple formats

"01/15/2024", "2024-01-15", "January 15, 2024", "15-Jan-24"

def parse_date(date_str):
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None

    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]

    # Try each known format in order; the first match wins
    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except (ValueError, TypeError):
            continue

    # Fall back to pandas parser
    try:
        return pd.to_datetime(date_str)
    except (ValueError, TypeError):
        return None

df['date_clean'] = df['date'].apply(parse_date)
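Because the format list tries '%m/%d/%Y' first, a value such as "03/04/2024" is read as March 4 even if the source meant 3 April. The sketch below flags slash dates that are valid under both readings so they can be checked against the source. It uses only the standard library, and `is_ambiguous` is a hypothetical helper name.

```python
from datetime import datetime


def is_ambiguous(date_str):
    """Return True if a slash date reads differently as M/D/Y and D/M/Y."""
    readings = set()
    for fmt in ("%m/%d/%Y", "%d/%m/%Y"):
        try:
            readings.add(datetime.strptime(date_str, fmt).date())
        except ValueError:
            pass  # not a valid date under this reading
    return len(readings) > 1


samples = ["01/15/2024", "03/04/2024", "05/05/2024"]
ambiguous = [s for s in samples if is_ambiguous(s)]
print(ambiguous)
```

If a column contains any value like "01/15/2024" that is valid under only one reading, that is a reasonable hint about which convention the whole column follows, though it is worth confirming with whoever produced the data.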

3. MISSING VALUES

Strategy depends on context

Check missing value patterns

print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # Percentage

Options:

- Drop rows with critical missing values

df_clean = df.dropna(subset=['required_field'])

- Fill with appropriate values

df['category'] = df['category'].fillna('Unknown')
df['amount'] = df['amount'].fillna(df['amount'].median())

- Flag as missing (preserve for analysis)

df['amount_missing'] = df['amount'].isna()

4. DUPLICATES

Find and handle duplicates

Exact duplicates

print(f"Exact duplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()

Fuzzy duplicates (similar but not identical)

Use record linkage or manual review

from fuzzywuzzy import fuzz

def find_similar_names(names, threshold=85):
    """Find potentially duplicate names."""
    duplicates = []
    for i, name1 in enumerate(names):
        for j, name2 in enumerate(names[i+1:], i+1):
            score = fuzz.ratio(str(name1).lower(), str(name2).lower())
            if score >= threshold:
                duplicates.append((name1, name2, score))
    return duplicates
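Where installing a fuzzy-matching package is not an option, the standard library's `difflib.SequenceMatcher` gives a comparable similarity score on a 0 to 1 scale. This is a sketch of the same pairwise idea, not a drop-in replacement: the scores and a sensible threshold may differ from `fuzz.ratio`, and the names below are invented.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.85):
    """Return pairs of names whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


# Invented names, for illustration only
names = ["Jon Smith", "John Smith", "Maria Garcia"]
candidates = similar_pairs(names)
print(candidates)
```

Either way, treat the output as candidates for manual review rather than confirmed duplicates; comparing every pair also gets slow on large name lists.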

5. OUTLIERS

Identify potential data entry errors

def flag_outliers(series, method='iqr', threshold=1.5):
    """Flag statistical outliers."""
    if method == 'iqr':
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - threshold * IQR
        upper = Q3 + threshold * IQR
        return (series < lower) | (series > upper)
    elif method == 'zscore':
        z_scores = np.abs((series - series.mean()) / series.std())
        return z_scores > threshold
    # Fail loudly instead of silently returning None
    raise ValueError(f"Unknown method: {method}")

df['amount_outlier'] = flag_outliers(df['amount'])
print(f"Outliers found: {df['amount_outlier'].sum()}")

6. DATA TYPE CORRECTIONS

Ensure proper types for analysis

Convert to numeric (handling errors)

df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

Convert to categorical (saves memory, enables ordering)

df['status'] = pd.Categorical(df['status'], categories=['Pending', 'Active', 'Closed'], ordered=True)

Convert to datetime

df['date'] = pd.to_datetime(df['date'], errors='coerce')
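One caveat with errors='coerce': anything that fails to parse becomes NaN, including values such as "$1,200" that are numeric but formatted as currency. A minimal sketch of a cleaning step that could run before the numeric conversion is shown below. `clean_amount` is a hypothetical helper, and the treatment of parentheses as negatives is an assumption about accounting-style exports.

```python
import re


def clean_amount(raw):
    """Strip currency formatting; return a float, or None if not numeric."""
    if raw is None:
        return None
    text = str(raw).strip()
    # Assumption: accounting-style "(1,200.50)" means -1200.50
    negative = text.startswith("(") and text.endswith(")")
    text = re.sub(r"[,$()\s]", "", text)
    try:
        value = float(text)
    except ValueError:
        return None
    return -value if negative else value


for raw in ["$1,200.50", "(300)", "N/A"]:
    print(repr(raw), "->", clean_amount(raw))
```

Counting how many values come back as None, and looking at a sample of them, shows whether the coercion is hiding a formatting problem or genuinely missing data.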

Data validation checklist

Pre-analysis data validation

Structural checks

[ ] Row count matches expected
[ ] Column count and names correct
[ ] Data types appropriate
[ ] No unexpected null columns

Content checks

[ ] Date ranges make sense
[ ] Numeric values within expected bounds
[ ] Categorical values match expected options
[ ] Geographic data resolves correctly
[ ] IDs are unique where expected

Consistency checks

[ ] Totals add up to expected values
[ ] Cross-tabulations balance
[ ] Related fields are consistent
[ ] Time series is continuous

Source verification

[ ] Can trace back to original source
[ ] Methodology documented
[ ] Known limitations noted
[ ] Update frequency understood
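A few of the structural and content checks above can be automated. The sketch below works on plain Python records rather than a DataFrame to stay self-contained. The field names, date bounds, and `run_checks` helper are assumptions for illustration.

```python
from datetime import date


def run_checks(rows, id_field, date_field, start, end, expected_rows=None):
    """Return a list of human-readable problems found in the records."""
    problems = []

    # Structural check: row count matches what the source said to expect
    if expected_rows is not None and len(rows) != expected_rows:
        problems.append(f"Expected {expected_rows} rows, found {len(rows)}")

    # Content check: IDs are unique
    ids = [r[id_field] for r in rows]
    if len(ids) != len(set(ids)):
        problems.append(f"Duplicate values in {id_field}")

    # Content check: dates fall inside the expected range
    out_of_range = [r[id_field] for r in rows
                    if not (start <= r[date_field] <= end)]
    if out_of_range:
        problems.append(f"Dates outside {start} to {end} for IDs: {out_of_range}")

    return problems


# Invented records, for illustration only
rows = [
    {"id": 1, "filed": date(2023, 3, 1)},
    {"id": 2, "filed": date(2023, 7, 9)},
    {"id": 2, "filed": date(2032, 1, 5)},  # duplicate ID and implausible date
]

problems = run_checks(rows, "id", "filed",
                      date(2023, 1, 1), date(2023, 12, 31), expected_rows=3)
for problem in problems:
    print(problem)
```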

Statistical analysis for journalism

Basic statistics with context

Essential statistics for any dataset

def describe_for_journalism(df, column):
    """Generate journalist-friendly statistics."""
    stats = {
        'count': len(df[column].dropna()),
        'missing': df[column].isna().sum(),
        'min': df[column].min(),
        'max': df[column].max(),
        'mean': df[column].mean(),
        'median': df[column].median(),
        'std': df[column].std(),
    }

    # Percentiles for context
    stats['25th_percentile'] = df[column].quantile(0.25)
    stats['75th_percentile'] = df[column].quantile(0.75)
    stats['90th_percentile'] = df[column].quantile(0.90)
    stats['99th_percentile'] = df[column].quantile(0.99)

    # Distribution shape
    stats['skewness'] = df[column].skew()

    return stats

Example interpretation

stats = describe_for_journalism(df, 'salary')
print(f"""
SALARY ANALYSIS

We analyzed {stats['count']:,} salary records.

The median salary is ${stats['median']:,.0f}, meaning half of workers earn more and half earn less.

The average salary is ${stats['mean']:,.0f}, which is {'higher' if stats['mean'] > stats['median'] else 'lower'} than the median, indicating the distribution is {'right-skewed (pulled up by high earners)' if stats['skewness'] > 0 else 'left-skewed'}.

The top 10% of earners make at least ${stats['90th_percentile']:,.0f}. The top 1% make at least ${stats['99th_percentile']:,.0f}.
""")
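To see why the median is usually the safer headline number for salaries, consider what a single very high earner does to each measure. The figures below are invented for illustration.

```python
from statistics import mean, median

# Invented salaries, for illustration only
salaries = [42_000, 45_000, 48_000, 51_000, 54_000]
with_executive = salaries + [900_000]

mean_before, median_before = mean(salaries), median(salaries)
mean_after, median_after = mean(with_executive), median(with_executive)

print(f"Without executive: mean ${mean_before:,.0f}, median ${median_before:,.0f}")
print(f"With executive:    mean ${mean_after:,.0f}, median ${median_after:,.0f}")
```

Adding one salary moves the mean from $48,000 to $190,000, while the median moves only from $48,000 to $49,500.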

Comparisons and context

Year-over-year change

def calculate_change(current, previous):
    """Calculate change with multiple metrics."""
    absolute = current - previous
    if previous != 0:
        percent = (current - previous) / previous * 100
    else:
        # Percent change from a zero baseline is undefined unless nothing changed
        percent = 0 if current == 0 else None

    return {
        'current': current,
        'previous': previous,
        'absolute_change': absolute,
        'percent_change': percent,
        'direction': 'increased' if absolute > 0 else 'decreased' if absolute < 0 else 'unchanged'
    }

Per capita calculations (essential for fair comparisons)

def per_capita(value, population):
    """Calculate per capita rate."""
    return (value / population) * 100000  # Per 100,000 is standard

Example: Crime rates

city_a = {'crimes': 5000, 'population': 100000}
city_b = {'crimes': 8000, 'population': 500000}

rate_a = per_capita(city_a['crimes'], city_a['population'])
rate_b = per_capita(city_b['crimes'], city_b['population'])

print(f"City A: {rate_a:.1f} crimes per 100,000 residents")
print(f"City B: {rate_b:.1f} crimes per 100,000 residents")

City A actually has a higher crime rate despite fewer total crimes!

Inflation adjustment

def adjust_for_inflation(amount, from_year, to_year, cpi_data):
    """Adjust dollar amounts for inflation."""
    from_cpi = cpi_data[from_year]
    to_cpi = cpi_data[to_year]
    return amount * (to_cpi / from_cpi)

Always adjust when comparing dollars across years!
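A worked example shows how the adjustment can reverse a story. The index values and budget figures below are hypothetical round numbers chosen to keep the arithmetic easy to follow; real analysis should use a published CPI series.

```python
def adjust_for_inflation(amount, from_year, to_year, cpi_data):
    """Express an amount from from_year in to_year dollars."""
    return amount * (cpi_data[to_year] / cpi_data[from_year])


# Hypothetical index values, for illustration only
cpi_data = {2010: 100.0, 2020: 125.0}

budget_2010 = 4_000_000
budget_2020 = 4_600_000

# Restate the 2010 budget in 2020 dollars before comparing
budget_2010_adj = adjust_for_inflation(budget_2010, 2010, 2020, cpi_data)

nominal_change = (budget_2020 - budget_2010) / budget_2010 * 100
real_change = (budget_2020 - budget_2010_adj) / budget_2010_adj * 100

print(f"2010 budget in 2020 dollars: ${budget_2010_adj:,.0f}")
print(f"Nominal change: {nominal_change:+.1f}%")
print(f"Inflation-adjusted change: {real_change:+.1f}%")
```

With these assumed numbers, a budget that appears to have grown 15% has shrunk by 8% once both years are expressed in the same dollars.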

Correlation vs causation

Reporting correlations responsibly

What you CAN say

"X and Y are correlated"
"As X increases, Y tends to increase"
"Areas with higher X also tend to have higher Y"
"X is associated with Y"

What you CANNOT say (without more evidence)

"X causes Y"
"X leads to Y"
"Y happens because of X"

Questions to ask before implying causation

Is there a plausible mechanism?
Does the timing make sense (cause before effect)?
Is there a dose-response relationship?
Has the finding been replicated?
Have confounding variables been controlled?
Are there alternative explanations?

Red flags for spurious correlations

Extremely high correlation (r > 0.95) with unrelated things
No logical connection between variables
Third variable could explain both
Small sample size with high variance
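The "third variable" red flag is easy to demonstrate. In the invented data below, library counts and crime counts both scale with city population, so the raw counts correlate strongly; converting both to per-100,000 rates removes most of that shared effect. The `pearson_r` helper is a plain implementation of the standard Pearson formula, and every number is made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented city data, for illustration only
population = [50_000, 100_000, 200_000, 400_000, 800_000]
libraries = [2, 5, 9, 19, 41]
crimes = [1_500, 3_200, 5_800, 12_400, 23_000]

# Raw counts: both variables mostly track population
r_counts = pearson_r(libraries, crimes)

# Rates per 100,000 residents: population is controlled for
lib_rate = [l / p * 100_000 for l, p in zip(libraries, population)]
crime_rate = [c / p * 100_000 for c, p in zip(crimes, population)]
r_rates = pearson_r(lib_rate, crime_rate)

print(f"Correlation of raw counts: {r_counts:.2f}")
print(f"Correlation of per-capita rates: {r_rates:.2f}")
```

A near-perfect correlation in the raw counts here says little more than "bigger cities have more of everything".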

Data visualization

Chart selection guide

Choosing the right chart

Comparison

Bar chart: Compare categories
Grouped bar: Compare categories across groups
Bullet chart: Actual vs target

Change over time

Line chart: Trends over time
Area chart: Cumulative totals over time
Slope chart: Change between two points

Distribution

Histogram: Distribution of one variable
Box plot: Compare distributions across groups
Violin plot: Detailed distribution shape

Relationship

Scatter plot: Relationship between two variables
Bubble chart: Three variables (x, y, size)
Connected scatter: Change in relationship over time

Composition

Pie chart: Parts of a whole (use sparingly, max 5 slices)
Stacked bar: Parts of whole across categories
Treemap: Hierarchical composition

Geographic

Choropleth: Values by region (use normalized data!)
Dot map: Individual locations
Proportional symbol: Magnitude at locations

Visualization best practices

import matplotlib.pyplot as plt
import seaborn as sns

Journalist-friendly chart defaults

plt.rcParams.update({
    'figure.figsize': (10, 6),
    'font.size': 12,
    'axes.titlesize': 16,
    'axes.labelsize': 12,
    'axes.spines.top': False,
    'axes.spines.right': False,
})

def create_bar_chart(data, title, source, xlabel='', ylabel=''):
    """Create a publication-ready bar chart."""
    fig, ax = plt.subplots()

    # Create bars (convert dict views to lists for matplotlib)
    bars = ax.bar(list(data.keys()), list(data.values()), color='#2c7bb6')

    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:,.0f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    ha='center', va='bottom',
                    fontsize=10)

    # Labels and title
    ax.set_title(title, fontweight='bold', pad=20)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)

    # Add source annotation
    fig.text(0.99, 0.01, f'Source: {source}',
             ha='right', va='bottom', fontsize=9, color='gray')

    plt.tight_layout()
    return fig

Example

data = {'2020': 1200, '2021': 1450, '2022': 1380, '2023': 1620}
fig = create_bar_chart(data, 'Annual Widget Production',
                       'Department of Widgets, 2024',
                       ylabel='Units produced')
fig.savefig('chart.png', dpi=150, bbox_inches='tight')

Avoiding misleading visualizations

Chart integrity checklist

Axes

[ ] Y-axis starts at zero (for bar charts)
[ ] Axis labels are clear
[ ] Scale is appropriate (not truncated to exaggerate)
[ ] Both axes labeled with units

Data representation

[ ] All data points visible
[ ] Colors are distinguishable (including colorblind)
[ ] Proportions are accurate
[ ] 3D effects not distorting perception

Context

[ ] Title describes what's shown, not conclusion
[ ] Time period clearly stated
[ ] Source cited
[ ] Sample size/methodology noted if relevant
[ ] Uncertainty shown where appropriate

Honesty

[ ] Cherry-picking dates avoided
[ ] Outliers explained, not hidden
[ ] Dual axes justified (usually avoid)
[ ] Annotations don't mislead
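The zero-baseline rule for bar charts can be checked with simple arithmetic: compare the ratio of the values with the ratio of the bar heights as drawn. The sketch below uses the 2022 and 2023 figures from the widget example above; `visual_ratio` is a hypothetical helper and the 1,300 baseline is an arbitrary choice for illustration.

```python
def visual_ratio(a, b, baseline=0):
    """Ratio of drawn bar heights when the y-axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)


low, high = 1380, 1620

true_ratio = visual_ratio(low, high)                      # axis starts at zero
truncated_ratio = visual_ratio(low, high, baseline=1300)  # axis starts at 1,300

print(f"Actual ratio of values: {true_ratio:.2f}x")
print(f"Ratio of bar heights with truncated axis: {truncated_ratio:.2f}x")
```

A 17% increase drawn on an axis starting at 1,300 produces a bar four times as tall as its neighbor.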

Story structure for data journalism

Data story framework

The data story arc

1. The hook (nut graf)

What's the key finding?
Why should readers care?
What's the human impact?

2. The evidence

Show the data
Explain the methodology
Acknowledge limitations

3. The context

How does this compare to past?
How does this compare to elsewhere?
What's the trend?

4. The human element

Individual examples that illustrate the data
Expert interpretation
Affected voices

5. The implications

What does this mean going forward?
What questions remain?
What actions could result?

6. The methodology box

Where did data come from?
How was it analyzed?
What are the limitations?
How can readers explore further?

Methodology documentation template

How we did this analysis

Data sources

[List all data sources with links and access dates]

Time period

[Specify exactly what time period is covered]

Definitions

[Define key terms and how you operationalized them]

Analysis steps

[First step of analysis]
[Second step]
[Continue...]

Limitations

[Limitation 1]
[Limitation 2]

What we excluded and why

Verification

[How findings were verified/checked]

Code and data availability

[Link to GitHub repo if sharing code/data]

Contact

[How readers can reach you with questions]

Tools and resources

Essential tools

| Tool | Purpose | Cost |
|---|---|---|
| Python + pandas | Data analysis | Free |
| R + tidyverse | Statistical analysis | Free |
| Excel/Sheets | Quick analysis | Free/Low |
| Datawrapper | Charts for web | Free tier |
| Flourish | Interactive viz | Free tier |
| QGIS | Mapping | Free |
| Tabula | PDF table extraction | Free |
| OpenRefine | Data cleaning | Free |

Learning resources

NICAR (Investigative Reporters & Editors)
Knight Center for Journalism in the Americas
Data Journalism Handbook (datajournalism.com)
Flowing Data (flowingdata.com)
The Pudding (pudding.cool) - examples
