Data Pipeline Engineer
Expert data engineer specializing in ETL/ELT pipelines, streaming architectures, data warehousing, and modern data stack implementation.
Quick Start Identify sources - data formats, volumes, freshness requirements Choose architecture - Medallion (Bronze/Silver/Gold), Lambda, or Kappa Design layers - staging → intermediate → marts (dbt pattern) Add quality gates - Great Expectations or dbt tests at each layer Orchestrate - Airflow DAGs with sensors and retries Monitor - lineage, freshness, anomaly detection Core Capabilities Capability Technologies Key Patterns Batch Processing Spark, dbt, Databricks Incremental, partitioning, Delta/Iceberg Stream Processing Kafka, Flink, Spark Streaming Watermarks, exactly-once, windowing Orchestration Airflow, Dagster, Prefect DAG design, sensors, task groups Data Modeling dbt, SQL Kimball, Data Vault, SCD Data Quality Great Expectations, dbt tests Validation suites, freshness Architecture Patterns Medallion Architecture (Recommended) BRONZE (Raw) → Exact source copy, schema-on-read, partitioned by ingestion ↓ Cleaning, Deduplication SILVER (Cleansed) → Validated, standardized, business logic applied ↓ Aggregation, Enrichment GOLD (Business) → Dimensional models, aggregates, ready for BI/ML
Lambda vs Kappa Lambda: Batch + Stream layers → merged serving layer (complex but complete) Kappa: Stream-only with replay → simpler but requires robust streaming Reference Examples
Full implementation examples in ./references/:
File Description dbt-project-structure.md Complete dbt layout with staging, intermediate, marts airflow-dag.py Production DAG with sensors, task groups, quality checks spark-streaming.py Kafka-to-Delta processor with windowing great-expectations-suite.json Comprehensive data quality expectation suite Anti-Patterns (10 Critical Mistakes) 1. Full Table Refreshes
Symptom: Truncate and rebuild entire tables every run Fix: Use incremental models with is_incremental(), partition by date
- Tight Coupling to Source Schemas
Symptom: Pipeline breaks when upstream adds/removes columns Fix: Explicit source contracts, select only needed columns in staging
- Monolithic DAGs
Symptom: One 200-task DAG running 8 hours Fix: Domain-specific DAGs, ExternalTaskSensor for dependencies
- No Data Quality Gates
Symptom: Bad data reaches production before detection Fix: Great Expectations or dbt tests at each layer, block on failures
- Processing Before Archiving
Symptom: Raw data transformed without preserving original Fix: Always land raw in Bronze first, make transformations reproducible
- Hardcoded Dates in Queries
Symptom: Manual updates needed for date filters Fix: Use Airflow templating (e.g., ds variable) or dynamic date functions
- Missing Watermarks in Streaming
Symptom: Unbounded state growth, OOM in long-running jobs Fix: Add withWatermark() to handle late-arriving data
- No Retry/Backoff Strategy
Symptom: Transient failures cause DAG failures Fix: retries=3, retry_exponential_backoff=True, max_retry_delay
- Undocumented Data Lineage
Symptom: No one knows where data comes from or who uses it Fix: dbt docs, data catalog integration, column-level lineage
- Testing Only in Production
Symptom: Bugs discovered by stakeholders, not engineers Fix: dbt --target dev, sample datasets, CI/CD for models
Quality Checklist
Pipeline Design:
Incremental processing where possible Idempotent transformations (re-runnable safely) Partitioning strategy defined and documented Backfill procedures documented
Data Quality:
Tests at Bronze layer (schema, nulls, ranges) Tests at Silver layer (business rules, referential integrity) Tests at Gold layer (aggregation checks, trend monitoring) Anomaly detection for volumes and distributions
Orchestration:
Retry and alerting configured SLAs defined and monitored Cross-DAG dependencies use sensors max_active_runs prevents parallel conflicts
Operations:
Data lineage documented Runbooks for common failures Monitoring dashboards for pipeline health On-call procedures defined Validation Script
Run ./scripts/validate-pipeline.sh to check:
dbt project structure and conventions Airflow DAG best practices Spark job configurations Data quality setup External Resources dbt Best Practices Airflow Best Practices Great Expectations Docs Delta Lake Guide Kafka Streams