data-journalism


Install

```shell
npx skills add https://github.com/jamditis/claude-skills-journalism --skill data-journalism
```

Data journalism methodology

Systematic approaches for finding, analyzing, and presenting data in journalism.

Data acquisition

Public data sources

Federal data sources

General

  • Data.gov - Federal open data portal
  • Census Bureau (census.gov) - Demographics, economic data
  • BLS (bls.gov) - Employment, inflation, wages
  • BEA (bea.gov) - GDP, economic accounts
  • Federal Reserve (federalreserve.gov) - Financial data
  • SEC EDGAR - Corporate filings

Specific domains

  • EPA (epa.gov/data) - Environmental data
  • FDA (fda.gov/data) - Drug approvals, recalls, adverse events
  • CDC WONDER - Health statistics
  • NHTSA - Vehicle safety data
  • DOT - Transportation statistics
  • FEC - Campaign finance
  • USASpending.gov - Federal contracts and grants

State and local

  • State open data portals (search: "[state] open data")
  • Socrata-powered sites (many cities/states)
  • OpenStreets, municipal GIS portals
  • State comptroller/auditor reports
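Many of the portals above also expose machine-readable APIs. As a minimal sketch, assuming the Census Bureau's standard API URL layout (the `build_census_url` helper is illustrative, and 'B01001_001E' is an example ACS variable code you would look up for your own story):

```python
def build_census_url(year, dataset, variables, geography):
    """Build a Census Bureau API query URL (verify the layout for your dataset)."""
    base = f"https://api.census.gov/data/{year}/{dataset}"
    return f"{base}?get={','.join(variables)}&for={geography}"

# 'B01001_001E' is the ACS total-population estimate
url = build_census_url(2022, "acs/acs5", ["NAME", "B01001_001E"], "state:*")
print(url)
# Fetch with any HTTP client, e.g. requests.get(url).json()
```

The response comes back as a JSON array of rows with a header row first, which loads directly into a DataFrame.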

Data request strategies

Getting data that isn't public

FOIA for datasets

  • Request databases, not just documents
  • Ask for data dictionary/schema
  • Request in native format (CSV, SQL dump)
  • Specify field-level needs

Building your own dataset

  • Scraping public information
  • Crowdsourcing from readers
  • Systematic document review
  • Surveys (with proper methodology)
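For the scraping route, a sketch using only Python's standard library; the HTML snippet and its column layout are hypothetical stand-ins for a real public page (inspect the actual markup, and check the site's terms and robots.txt before scraping):

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td':
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row:
            self.rows.append(self._row)
        elif tag == 'td':
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

html = ("<table><tr><td>Smith</td><td>$1,200</td></tr>"
        "<tr><td>Jones</td><td>$950</td></tr></table>")
scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)  # [['Smith', '$1,200'], ['Jones', '$950']]
```

Real pages usually warrant a library like BeautifulSoup, but the principle is the same: turn markup into rows you can clean and analyze.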

Commercial data sources (for newsrooms)

  • LexisNexis
  • Refinitiv
  • Bloomberg
  • Industry-specific databases

Data cleaning and preparation

Common data problems

```python
import pandas as pd
import numpy as np

# Load messy data
df = pd.read_csv('raw_data.csv')
```

1. INCONSISTENT FORMATTING

Problem: names in different formats, e.g. "SMITH, JOHN" vs "John Smith" vs "smith john".

```python
def standardize_name(name):
    """Standardize name format to 'First Last'."""
    if pd.isna(name):
        return None
    name = str(name).strip().lower()
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name.title()

df['name_clean'] = df['name'].apply(standardize_name)
```

2. DATE INCONSISTENCIES

Problem: dates in multiple formats, e.g. "01/15/2024", "2024-01-15", "January 15, 2024", "15-Jan-24".

```python
def parse_date(date_str):
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None

    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]

    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except ValueError:
            continue

    # Fall back to the pandas parser
    try:
        return pd.to_datetime(date_str)
    except ValueError:
        return None

df['date_clean'] = df['date'].apply(parse_date)
```

3. MISSING VALUES

Strategy depends on context.

```python
# Check missing value patterns
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # Percentage

# Option 1: drop rows with critical missing values
df_clean = df.dropna(subset=['required_field'])

# Option 2: fill with appropriate values
df['category'] = df['category'].fillna('Unknown')
df['amount'] = df['amount'].fillna(df['amount'].median())

# Option 3: flag as missing (preserve for analysis)
df['amount_missing'] = df['amount'].isna()
```

4. DUPLICATES

```python
# Exact duplicates
print(f"Exact duplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()

# Fuzzy duplicates (similar but not identical):
# use record linkage or manual review
from fuzzywuzzy import fuzz

def find_similar_names(names, threshold=85):
    """Find potentially duplicate names."""
    duplicates = []
    for i, name1 in enumerate(names):
        for name2 in names[i + 1:]:
            score = fuzz.ratio(str(name1).lower(), str(name2).lower())
            if score >= threshold:
                duplicates.append((name1, name2, score))
    return duplicates
```

5. OUTLIERS

Identify potential data entry errors:

```python
import numpy as np

def flag_outliers(series, method='iqr', threshold=1.5):
    """Flag statistical outliers."""
    if method == 'iqr':
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - threshold * IQR
        upper = Q3 + threshold * IQR
        return (series < lower) | (series > upper)
    elif method == 'zscore':
        z_scores = np.abs((series - series.mean()) / series.std())
        return z_scores > threshold

df['amount_outlier'] = flag_outliers(df['amount'])
print(f"Outliers found: {df['amount_outlier'].sum()}")
```

6. DATA TYPE CORRECTIONS

Ensure proper types for analysis:

```python
# Convert to numeric (coercing bad values to NaN)
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# Convert to categorical (saves memory, enables ordering)
df['status'] = pd.Categorical(df['status'],
                              categories=['Pending', 'Active', 'Closed'],
                              ordered=True)

# Convert to datetime
df['date'] = pd.to_datetime(df['date'], errors='coerce')
```

Data validation checklist

Pre-analysis data validation

Structural checks

  • [ ] Row count matches expected
  • [ ] Column count and names correct
  • [ ] Data types appropriate
  • [ ] No unexpected null columns

Content checks

  • [ ] Date ranges make sense
  • [ ] Numeric values within expected bounds
  • [ ] Categorical values match expected options
  • [ ] Geographic data resolves correctly
  • [ ] IDs are unique where expected

Consistency checks

  • [ ] Totals add up to expected values
  • [ ] Cross-tabulations balance
  • [ ] Related fields are consistent
  • [ ] Time series is continuous

Source verification

  • [ ] Can trace back to original source
  • [ ] Methodology documented
  • [ ] Known limitations noted
  • [ ] Update frequency understood
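Several of the structural and content checks above can be automated before analysis begins. A minimal sketch in pandas; the column names and the `validate` helper are illustrative, not a fixed API:

```python
import pandas as pd

def validate(df, expected_columns, id_column):
    """Run basic pre-analysis checks; return a list of problem descriptions."""
    problems = []
    missing = set(expected_columns) - set(df.columns)
    if missing:
        problems.append(f"Missing columns: {sorted(missing)}")
    if df[id_column].duplicated().any():
        problems.append(f"Duplicate IDs in '{id_column}'")
    null_cols = [c for c in df.columns if df[c].isna().all()]
    if null_cols:
        problems.append(f"Entirely null columns: {null_cols}")
    return problems

df = pd.DataFrame({'id': [1, 2, 2], 'amount': [10, 20, 30]})
print(validate(df, ['id', 'amount'], 'id'))  # ["Duplicate IDs in 'id'"]
```

Running checks like these as code (rather than eyeballing) means they re-run automatically every time the source data updates.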

Statistical analysis for journalism

Basic statistics with context

Essential statistics for any dataset:

```python
def describe_for_journalism(df, column):
    """Generate journalist-friendly statistics."""
    stats = {
        'count': len(df[column].dropna()),
        'missing': df[column].isna().sum(),
        'min': df[column].min(),
        'max': df[column].max(),
        'mean': df[column].mean(),
        'median': df[column].median(),
        'std': df[column].std(),
    }

    # Percentiles for context
    stats['25th_percentile'] = df[column].quantile(0.25)
    stats['75th_percentile'] = df[column].quantile(0.75)
    stats['90th_percentile'] = df[column].quantile(0.90)
    stats['99th_percentile'] = df[column].quantile(0.99)

    # Distribution shape
    stats['skewness'] = df[column].skew()

    return stats
```

Example interpretation:

```python
stats = describe_for_journalism(df, 'salary')
print(f"""SALARY ANALYSIS

We analyzed {stats['count']:,} salary records.

The median salary is ${stats['median']:,.0f}, meaning half of workers
earn more and half earn less.

The average salary is ${stats['mean']:,.0f}, which is
{'higher' if stats['mean'] > stats['median'] else 'lower'} than the median,
indicating the distribution is
{'right-skewed (pulled up by high earners)' if stats['skewness'] > 0 else 'left-skewed'}.

The top 10% of earners make at least ${stats['90th_percentile']:,.0f}.
The top 1% make at least ${stats['99th_percentile']:,.0f}.
""")
```

Comparisons and context

Year-over-year change:

```python
def calculate_change(current, previous):
    """Calculate change with multiple metrics."""
    absolute = current - previous
    if previous != 0:
        percent = (current - previous) / previous * 100
    else:
        percent = float('inf') if current > 0 else 0

    return {
        'current': current,
        'previous': previous,
        'absolute_change': absolute,
        'percent_change': percent,
        'direction': ('increased' if absolute > 0
                      else 'decreased' if absolute < 0
                      else 'unchanged')
    }
```

Per capita calculations (essential for fair comparisons):

```python
def per_capita(value, population):
    """Calculate a rate per 100,000 residents (the standard denominator)."""
    return (value / population) * 100000

# Example: crime rates
city_a = {'crimes': 5000, 'population': 100000}
city_b = {'crimes': 8000, 'population': 500000}

rate_a = per_capita(city_a['crimes'], city_a['population'])
rate_b = per_capita(city_b['crimes'], city_b['population'])

print(f"City A: {rate_a:.1f} crimes per 100,000 residents")
print(f"City B: {rate_b:.1f} crimes per 100,000 residents")
```

City A actually has the higher crime rate, despite fewer total crimes.

Inflation adjustment:

```python
def adjust_for_inflation(amount, from_year, to_year, cpi_data):
    """Adjust dollar amounts for inflation using CPI."""
    from_cpi = cpi_data[from_year]
    to_cpi = cpi_data[to_year]
    return amount * (to_cpi / from_cpi)
```

Always adjust when comparing dollars across years!

Correlation vs causation

Reporting correlations responsibly

What you CAN say

  • "X and Y are correlated"
  • "As X increases, Y tends to increase"
  • "Areas with higher X also tend to have higher Y"
  • "X is associated with Y"

What you CANNOT say (without more evidence)

  • "X causes Y"
  • "X leads to Y"
  • "Y happens because of X"

Questions to ask before implying causation

  1. Is there a plausible mechanism?
  2. Does the timing make sense (cause before effect)?
  3. Is there a dose-response relationship?
  4. Has the finding been replicated?
  5. Have confounding variables been controlled?
  6. Are there alternative explanations?

Red flags for spurious correlations

  • Extremely high correlation (r > 0.95) with unrelated things
  • No logical connection between variables
  • Third variable could explain both
  • Small sample size with high variance
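Computing the correlation is the easy part; the judgment calls above are the real work. A sketch with hypothetical columns, where a lurking third variable (say, temperature) could plausibly drive both:

```python
import pandas as pd

df = pd.DataFrame({
    'ice_cream_sales': [100, 150, 200, 250, 300],
    'drownings':       [2, 3, 4, 5, 6],
})

# Pearson correlation coefficient between the two columns
r = df['ice_cream_sales'].corr(df['drownings'])
print(f"r = {r:.2f}")  # r = 1.00, yet neither causes the other
```

A near-perfect r with no plausible mechanism is exactly the red-flag case: report the association, name the likely confounder, and stop short of causal language.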

Data visualization

Chart selection guide

Choosing the right chart

Comparison

  • Bar chart: Compare categories
  • Grouped bar: Compare categories across groups
  • Bullet chart: Actual vs target

Change over time

  • Line chart: Trends over time
  • Area chart: Cumulative totals over time
  • Slope chart: Change between two points

Distribution

  • Histogram: Distribution of one variable
  • Box plot: Compare distributions across groups
  • Violin plot: Detailed distribution shape

Relationship

  • Scatter plot: Relationship between two variables
  • Bubble chart: Three variables (x, y, size)
  • Connected scatter: Change in relationship over time

Composition

  • Pie chart: Parts of a whole (use sparingly, max 5 slices)
  • Stacked bar: Parts of whole across categories
  • Treemap: Hierarchical composition

Geographic

  • Choropleth: Values by region (use normalized data!)
  • Dot map: Individual locations
  • Proportional symbol: Magnitude at locations

Visualization best practices

```python
import matplotlib.pyplot as plt

# Journalist-friendly chart defaults
plt.rcParams.update({
    'figure.figsize': (10, 6),
    'font.size': 12,
    'axes.titlesize': 16,
    'axes.labelsize': 12,
    'axes.spines.top': False,
    'axes.spines.right': False,
})

def create_bar_chart(data, title, source, xlabel='', ylabel=''):
    """Create a publication-ready bar chart."""
    fig, ax = plt.subplots()

    # Create bars
    bars = ax.bar(data.keys(), data.values(), color='#2c7bb6')

    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:,.0f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    ha='center', va='bottom',
                    fontsize=10)

    # Labels and title
    ax.set_title(title, fontweight='bold', pad=20)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)

    # Add source annotation
    fig.text(0.99, 0.01, f'Source: {source}',
             ha='right', va='bottom', fontsize=9, color='gray')

    plt.tight_layout()
    return fig

# Example
data = {'2020': 1200, '2021': 1450, '2022': 1380, '2023': 1620}
fig = create_bar_chart(data, 'Annual Widget Production',
                       'Department of Widgets, 2024',
                       ylabel='Units produced')
fig.savefig('chart.png', dpi=150, bbox_inches='tight')
```

Avoiding misleading visualizations

Chart integrity checklist

Axes

  • [ ] Y-axis starts at zero (for bar charts)
  • [ ] Axis labels are clear
  • [ ] Scale is appropriate (not truncated to exaggerate)
  • [ ] Both axes labeled with units

Data representation

  • [ ] All data points visible
  • [ ] Colors are distinguishable (including colorblind)
  • [ ] Proportions are accurate
  • [ ] 3D effects not distorting perception

Context

  • [ ] Title describes what's shown, not conclusion
  • [ ] Time period clearly stated
  • [ ] Source cited
  • [ ] Sample size/methodology noted if relevant
  • [ ] Uncertainty shown where appropriate

Honesty

  • [ ] Cherry-picking dates avoided
  • [ ] Outliers explained, not hidden
  • [ ] Dual axes justified (usually avoid)
  • [ ] Annotations don't mislead
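A few of the axis checks can be enforced in code rather than by eye; a matplotlib sketch (the labels and values are placeholders):

```python
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(['A', 'B', 'C'], [98, 100, 103])

ax.set_ylim(bottom=0)        # Bar charts: keep the zero baseline
ax.set_ylabel('Units sold')  # Label axes with units
ax.set_title('Widget sales by region')  # Describe what's shown, not a conclusion
```

Truncating this chart's y-axis would make a roughly 5% spread look dramatic; pinning the baseline at zero keeps the bar heights honest.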

Story structure for data journalism

Data story framework

The data story arc

1. The hook (nut graf)

  • What's the key finding?
  • Why should readers care?
  • What's the human impact?

2. The evidence

  • Show the data
  • Explain the methodology
  • Acknowledge limitations

3. The context

  • How does this compare to past?
  • How does this compare to elsewhere?
  • What's the trend?

4. The human element

  • Individual examples that illustrate the data
  • Expert interpretation
  • Affected voices

5. The implications

  • What does this mean going forward?
  • What questions remain?
  • What actions could result?

6. The methodology box

  • Where did data come from?
  • How was it analyzed?
  • What are the limitations?
  • How can readers explore further?

Methodology documentation template

How we did this analysis

Data sources

[List all data sources with links and access dates]

Time period

[Specify exactly what time period is covered]

Definitions

[Define key terms and how you operationalized them]

Analysis steps

  1. [First step of analysis]
  2. [Second step]
  3. [Continue...]

Limitations

  • [Limitation 1]
  • [Limitation 2]

What we excluded and why

[Explain any exclusions and the reasoning]

Verification

[How findings were verified/checked]

Code and data availability

[Link to GitHub repo if sharing code/data]

Contact

[How readers can reach you with questions]

Tools and resources

Essential tools

| Tool | Purpose | Cost |
|------|---------|------|
| Python + pandas | Data analysis | Free |
| R + tidyverse | Statistical analysis | Free |
| Excel/Sheets | Quick analysis | Free/Low |
| Datawrapper | Charts for web | Free tier |
| Flourish | Interactive viz | Free tier |
| QGIS | Mapping | Free |
| Tabula | PDF table extraction | Free |
| OpenRefine | Data cleaning | Free |

Learning resources

  • NICAR (Investigative Reporters & Editors)
  • Knight Center for Journalism in the Americas
  • Data Journalism Handbook (datajournalism.com)
  • Flowing Data (flowingdata.com)
  • The Pudding (pudding.cool) - examples
