Incremental Fetch
Build data pipelines that never lose progress and never re-fetch existing data.
The Two Watermarks Pattern
Track TWO cursors to support both forward and backward fetching:
Watermark Purpose API Parameter newest_id Fetch new data since last run since_id oldest_id Backfill older data until_id
A single watermark only fetches forward. Two watermarks enable:
Regular runs: fetch NEW data (since newest_id) Backfill runs: fetch OLD data (until oldest_id) No overlap, no gaps Critical: Data vs Watermark Saving
These are different operations with different timing:
What When to Save Why Data records After EACH page Resilience: interrupted on page 47? Keep 46 pages Watermarks ONCE at end of run Correctness: only commit progress after full success fetch page 1 → save records → fetch page 2 → save records → ... → update watermarks
Workflow Decision Tree First run (no watermarks)? ├── YES → Full fetch (no since_id, no until_id) └── NO → Backfill flag set? ├── YES → Backfill mode (until_id = oldest_id) └── NO → Update mode (since_id = newest_id)
Implementation Checklist Database: Create ingestion_state table (see patterns.md) Fetch loop: Insert records immediately after each API page Watermark tracking: Track newest/oldest IDs seen in this run Watermark update: Save watermarks ONCE at end of successful run Retry: Exponential backoff with jitter Rate limits: Wait for reset or skip and record for next run Pagination Types
This pattern works best with ID-based pagination (numeric IDs that can be compared). For other pagination types:
Type Adaptation Cursor/token Store cursor string instead of ID; can't compare numerically Timestamp Use last_timestamp column; compare as dates Offset/limit Store page number; resume from last saved page
See references/patterns.md for schemas and code examples.