data-validation

Installs: 258
Rank: #3404

Install

npx skills add https://github.com/anthropics/knowledge-work-plugins --skill data-validation
Data Validation Skill
Pre-delivery QA checklist, common data analysis pitfalls, result sanity checking, and documentation standards for reproducibility.
Pre-Delivery QA Checklist
Run through this checklist before sharing any analysis with stakeholders.
Data Quality Checks
Source verification
Confirmed which tables/data sources were used. Are they the right ones for this question?
Freshness
Data is current enough for the analysis. Noted the "as of" date.
Completeness
No unexpected gaps in time series or missing segments.
Null handling
Checked null rates in key columns. Nulls are handled appropriately (excluded, imputed, or flagged).
Deduplication
Confirmed no double-counting from bad joins or duplicate source records.
Filter verification
All WHERE clauses and filters are correct. No unintended exclusions.
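A minimal SQL sketch of the null-handling and deduplication checks in this list; the orders table and customer_id column are hypothetical stand-ins for your own schema:

-- Null rate in a key column (hypothetical table and column names)
SELECT
    COUNT(*)                                  AS total_rows,
    COUNT(customer_id)                        AS non_null_rows,
    1.0 - COUNT(customer_id) * 1.0 / COUNT(*) AS null_rate
FROM orders;

-- Duplicates: does any id appear more than once?
SELECT id, COUNT(*) AS occurrences
FROM orders
GROUP BY id
HAVING COUNT(*) > 1;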
Calculation Checks
Aggregation logic
GROUP BY includes all non-aggregated columns. Aggregation level matches the analysis grain.
Denominator correctness
Rate and percentage calculations use the right denominator. Denominators are non-zero.
Date alignment
Comparisons use the same time period length. Partial periods are excluded or noted.
Join correctness
JOIN types are appropriate (INNER vs LEFT). Many-to-many joins haven't inflated counts.
Metric definitions
Metrics match how stakeholders define them. Any deviations are noted.
Subtotals sum
Parts add up to the whole where expected. If they don't, explain why (e.g., overlap).
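As a concrete instance of the subtotals check, the sketch below compares the grand total to the sum of per-segment subtotals; orders, amount, and region are hypothetical names:

-- A mismatch usually means overlapping segments, NULL segment values,
-- or a filter applied to one query but not the other
SELECT
    (SELECT SUM(amount) FROM orders) AS grand_total,
    (SELECT SUM(amount)
       FROM (SELECT region, SUM(amount) AS amount
               FROM orders
              GROUP BY region) AS per_region) AS sum_of_subtotals;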
Reasonableness Checks
Magnitude
Numbers are in a plausible range. Revenue isn't negative. Percentages are between 0% and 100%.
Trend continuity
No unexplained jumps or drops in time series.
Cross-reference
Key numbers match other known sources (dashboards, previous reports, finance data).
Order of magnitude
Total revenue is in the right ballpark. User counts match known figures.
Edge cases
What happens at the boundaries? Empty segments, zero-activity periods, new entities.
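The trend-continuity check can be partially automated. A hedged sketch using the standard LAG window function, assuming a hypothetical daily_metrics table with metric_date and metric_value columns:

-- Flag day-over-day swings larger than 50% for manual review
SELECT *
FROM (
    SELECT
        metric_date,
        metric_value,
        metric_value / NULLIF(LAG(metric_value) OVER (ORDER BY metric_date), 0) - 1
            AS pct_change
    FROM daily_metrics
) t
WHERE ABS(pct_change) > 0.5;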
Presentation Checks
Chart accuracy
Bar charts start at zero. Axes are labeled. Scales are consistent across panels.
Number formatting
Appropriate precision. Consistent currency/percentage formatting. Thousands separators where needed.
Title clarity
Titles state the insight, not just the metric. Date ranges are specified.
Caveat transparency
Known limitations and assumptions are stated explicitly.
Reproducibility
Someone else could recreate this analysis from the documentation provided.
Common Data Analysis Pitfalls
Join Explosion
The problem
A many-to-many join silently multiplies rows, inflating counts and sums.
How to detect:

-- Check row count before and after the join
SELECT COUNT(*) FROM table_a;        -- 1,000

SELECT COUNT(*)
FROM table_a a
JOIN table_b b ON a.id = b.a_id;     -- 3,500 (uh oh)
How to prevent:
Always check row counts after joins
If counts increase, investigate the join relationship (is it really 1:1 or 1:many?)
Use COUNT(DISTINCT a.id) instead of COUNT(*) when counting entities through joins (see the sketch below)
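To illustrate the last point, a minimal sketch reusing the table_a/table_b tables from the detection example above:

-- COUNT(*) counts joined rows, which 1:many matches inflate;
-- COUNT(DISTINCT a.id) counts the underlying entities
SELECT
    COUNT(*)             AS joined_rows,
    COUNT(DISTINCT a.id) AS distinct_entities
FROM table_a a
JOIN table_b b ON a.id = b.a_id;

If joined_rows is much larger than distinct_entities, the join relationship deserves a closer look.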
Survivorship Bias
The problem
Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed.
Examples:
Analyzing user behavior of "current users" misses churned users
Looking at "companies using our product" ignores those who evaluated and left
Studying properties of "successful" outcomes without "unsuccessful" ones
How to prevent
Ask "who is NOT in this dataset?" before drawing conclusions.
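A minimal sketch of an inclusive base population, assuming hypothetical users and events tables where churned users remain in users with a churned_at timestamp instead of being deleted:

-- LEFT JOIN keeps zero-activity and churned users in the population;
-- an INNER JOIN on activity would silently drop them
SELECT
    u.id,
    u.churned_at IS NOT NULL AS is_churned,
    COUNT(e.id)              AS event_count
FROM users u
LEFT JOIN events e ON e.user_id = u.id
GROUP BY u.id, u.churned_at;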
Incomplete Period Comparison
The problem
Comparing a partial period to a full period.
Examples:
"January revenue is $500K vs. December's $800K" -- but January isn't over yet
"This week's signups are down" -- checked on Wednesday, comparing to a full prior week
How to prevent
Always filter to complete periods, or compare same-day-of-month / same-number-of-days.
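One way to do this in SQL, sketched with Postgres-style date functions and a hypothetical orders table:

-- Only months that have fully elapsed; the partial current month is dropped
SELECT
    DATE_TRUNC('month', created_at) AS month,
    SUM(amount)                     AS revenue
FROM orders
WHERE created_at < DATE_TRUNC('month', CURRENT_DATE)
GROUP BY 1
ORDER BY 1;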
Denominator Shifting
The problem
The denominator changes between periods, making rates incomparable.
Examples:
Conversion rate improves because you changed how you count "eligible" users
Churn rate changes because the definition of "active" was updated
How to prevent
Use consistent definitions across all compared periods. Note any definition changes.
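One way to enforce this, sketched under assumptions: compute eligibility once in a CTE and reuse that single definition for every period. The users and conversions tables and the signup_completed flag are hypothetical:

-- One explicit, documented definition of "eligible", applied to all months
WITH eligible AS (
    SELECT id, DATE_TRUNC('month', signup_at) AS signup_month
    FROM users
    WHERE signup_completed = TRUE
)
SELECT
    e.signup_month,
    AVG(CASE WHEN EXISTS (SELECT 1 FROM conversions c WHERE c.user_id = e.id)
             THEN 1.0 ELSE 0.0 END) AS conversion_rate
FROM eligible e
GROUP BY e.signup_month;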
Average of Averages
The problem
Averaging pre-computed averages gives wrong results when group sizes differ.
Example:
Group A: 100 users, average revenue $50
Group B: 10 users, average revenue $200
Wrong: Average of averages = ($50 + $200) / 2 = $125
Right: Weighted average = (100 × $50 + 10 × $200) / 110 = $63.64
How to prevent
Always aggregate from raw data. Never average pre-aggregated averages.
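The same contrast in SQL, assuming a hypothetical user_revenue table with one row per user:

-- Right: aggregate from raw rows, so group sizes weight themselves
SELECT AVG(revenue) AS overall_avg
FROM user_revenue;

-- Wrong: averaging per-group averages ignores group sizes
SELECT AVG(group_avg) AS misleading_avg
FROM (
    SELECT group_name, AVG(revenue) AS group_avg
    FROM user_revenue
    GROUP BY group_name
) g;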
Timezone Mismatches
The problem
Different data sources use different timezones, causing misalignment.
Examples:
Event timestamps in UTC vs. user-facing dates in local time
Daily rollups that use different cutoff times
How to prevent
Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used.
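A minimal sketch of a UTC-normalized daily rollup, using Postgres's AT TIME ZONE with a hypothetical events table whose occurred_at column is a timestamptz:

-- Convert once to UTC, then truncate to days; document the choice
SELECT
    DATE_TRUNC('day', occurred_at AT TIME ZONE 'UTC') AS day_utc,
    COUNT(*)                                          AS events
FROM events
GROUP BY 1
ORDER BY 1;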
Selection Bias in Segmentation
The problem
Segments are defined by the outcome you're measuring, creating circular logic.
Examples:
"Users who completed onboarding have higher retention" -- obviously, they self-selected
"Power users generate more revenue" -- they became power users BY generating revenue
How to prevent
Define segments based on pre-treatment characteristics, not outcomes.

Result Sanity Checking

Magnitude Checks
For any key number in your analysis, verify it passes the "smell test":

User counts: Does this match known MAU/DAU figures?
Revenue: Is this in the right order of magnitude vs. known ARR?
Conversion rates: Is this between 0% and 100%? Does it match dashboard figures?
Growth rates: Is 50%+ MoM growth realistic, or is there a data issue?
Averages: Is the average reasonable given what you know about the distribution?
Percentages: Do segment percentages sum to ~100%?

Cross-Validation Techniques
Calculate the same metric two different ways and verify they match
Spot-check individual records: pick a few specific entities and trace their data manually
Compare to known benchmarks: match against published dashboards, finance reports, or prior analyses
Reverse engineer: if total revenue is X, does per-user revenue times user count approximately equal X?
Boundary checks: what happens when you filter to a single day, a single user, or a single category? Are those micro-results sensible?

Red Flags That Warrant Investigation
Any metric that changed by more than 50% period-over-period without an obvious cause
Counts or sums that are exact round numbers (suggests a filter or default-value issue)
Rates exactly at 0% or 100% (may indicate incomplete data)
Results that perfectly confirm the hypothesis (reality is usually messier)
Identical values across time periods or segments (suggests the query is ignoring a dimension)

Documentation Standards for Reproducibility

Analysis Documentation Template
Every non-trivial analysis should include:

Analysis: [Title]

Question
[The specific question being answered]

Data Sources

Table: [schema.table_name] (as of [date])

Table: [schema.other_table] (as of [date])

File: [filename] (source: [where it came from])

Definitions

[Metric A]: [Exactly how it's calculated]
[Segment X]: [Exactly how membership is determined]
[Time period]: [Start date] to [end date], [timezone]

Methodology
1. [Step 1 of the analysis approach]
2. [Step 2]
3. [Step 3]

Assumptions and Limitations

[Assumption 1 and why it's reasonable]

[Limitation 1 and its potential impact on conclusions]

Key Findings
1. [Finding 1 with supporting evidence]
2. [Finding 2 with supporting evidence]

SQL Queries
[All queries used, with comments]

Caveats

[Things the reader should know before acting on this]

Code Documentation
For any code (SQL, Python) that may be reused:

"""
Analysis: Monthly Cohort Retention
Author: [Name]
Date: [Date]
Data Source: events table, users table
Last Validated: [Date] -- results matched dashboard within 2%

Purpose: Calculate monthly user retention cohorts based on first activity date.

Assumptions:
- "Active" means at least one event in the month
- Excludes test/internal accounts (user_type != 'internal')
- Uses UTC dates throughout

Output: Cohort retention matrix with cohort_month rows and months_since_signup
columns. Values are retention rates (0-100%).
"""

Version Control for Analyses
Save queries and code in version control (git) or a shared docs system
Note the date of the data snapshot used
If an analysis is re-run with updated data, document what changed and why
Link to prior versions of recurring analyses for trend comparison
