Incremental Fetch

Build data pipelines that never lose progress and never re-fetch existing data.

The Two Watermarks Pattern

Track TWO cursors to support both forward and backward fetching:

| Watermark | Purpose | API Parameter |
|-----------|---------|---------------|
| newest_id | Fetch new data since last run | since_id |
| oldest_id | Backfill older data | until_id |

A single watermark only fetches forward. Two watermarks enable:

- Regular runs: fetch NEW data (since newest_id)
- Backfill runs: fetch OLD data (until oldest_id)
- No overlap, no gaps

Critical: Data vs Watermark Saving

These are different operations with different timing:

| What | When to Save | Why |
|------|--------------|-----|
| Data records | After EACH page | Resilience: interrupted on page 47? Keep 46 pages |
| Watermarks | ONCE at end of run | Correctness: only commit progress after full success |

fetch page 1 → save records → fetch page 2 → save records → ... → update watermarks
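The timing rule above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions, not the schema from patterns.md: the `records` and `ingestion_state` table layouts, the `run_fetch` name, and the record shape (`id`, `body`) are all hypothetical.

```python
import sqlite3
from typing import Iterable, Optional

Page = list[dict]  # assumption: each record has a numeric "id" and a "body"

def run_fetch(conn: sqlite3.Connection,
              pages: Iterable[Page]) -> Optional[tuple[int, int]]:
    """Save records after each page; commit watermarks once, at the end."""
    # Hypothetical schemas, for illustration only.
    conn.execute("CREATE TABLE IF NOT EXISTS records "
                 "(id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS ingestion_state "
                 "(key TEXT PRIMARY KEY, value INTEGER)")
    newest = oldest = None
    for page in pages:  # an exception raised here skips the watermark update
        conn.executemany(
            "INSERT OR IGNORE INTO records (id, body) VALUES (?, ?)",
            [(r["id"], r["body"]) for r in page])
        conn.commit()  # data is durable after EACH page
        for r in page:
            newest = r["id"] if newest is None else max(newest, r["id"])
            oldest = r["id"] if oldest is None else min(oldest, r["id"])
    if newest is None:
        return None  # nothing fetched, leave stored watermarks untouched
    # Reached only after every page succeeded: merge with stored watermarks
    # so an update run never raises oldest_id and a backfill never lowers
    # newest_id.
    state = dict(conn.execute("SELECT key, value FROM ingestion_state"))
    newest = max(newest, state.get("newest_id", newest))
    oldest = min(oldest, state.get("oldest_id", oldest))
    conn.executemany(
        "INSERT OR REPLACE INTO ingestion_state (key, value) VALUES (?, ?)",
        [("newest_id", newest), ("oldest_id", oldest)])
    conn.commit()  # watermarks are written ONCE per run
    return newest, oldest
```

If the page iterator raises partway through, the rows from completed pages stay in `records` while `ingestion_state` is unchanged, so the next run re-requests the same range and `INSERT OR IGNORE` absorbs the duplicates.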

Workflow Decision Tree

First run (no watermarks)?
├── YES → Full fetch (no since_id, no until_id)
└── NO → Backfill flag set?
    ├── YES → Backfill mode (until_id = oldest_id)
    └── NO → Update mode (since_id = newest_id)
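The decision tree maps directly onto a small function that picks the request parameters for a run. The sketch below assumes watermarks are stored as optional integers; the function name `choose_params` and the `backfill` flag are hypothetical, while `since_id` and `until_id` are the parameter names used above.

```python
from typing import Optional

def choose_params(newest_id: Optional[int],
                  oldest_id: Optional[int],
                  backfill: bool = False) -> dict:
    """Return the API query parameters for this run."""
    if newest_id is None and oldest_id is None:
        return {}  # first run: full fetch, no since_id and no until_id
    if backfill:
        return {"until_id": oldest_id}  # backfill mode: older than oldest seen
    return {"since_id": newest_id}  # update mode: newer than newest seen
```

For example, `choose_params(500, 100)` selects update mode, and `choose_params(500, 100, backfill=True)` selects backfill mode.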

Implementation Checklist

- Database: Create ingestion_state table (see patterns.md)
- Fetch loop: Insert records immediately after each API page
- Watermark tracking: Track newest/oldest IDs seen in this run
- Watermark update: Save watermarks ONCE at end of successful run
- Retry: Exponential backoff with jitter
- Rate limits: Wait for reset or skip and record for next run

Pagination Types

This pattern works best with ID-based pagination (numeric IDs that can be compared). For other pagination types:

| Type | Adaptation |
|------|------------|
| Cursor/token | Store cursor string instead of ID; can't compare numerically |
| Timestamp | Use last_timestamp column; compare as dates |
| Offset/limit | Store page number; resume from last saved page |
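For the timestamp adaptation, "compare as dates" matters because string comparison can pick the wrong value when UTC offsets differ. A small sketch, assuming the API returns ISO 8601 strings that all carry an offset (the helper name `later` is hypothetical):

```python
from datetime import datetime

def later(a: str, b: str) -> str:
    """Return whichever ISO 8601 timestamp denotes the later instant."""
    # Assumption: both inputs include a UTC offset. Comparing an
    # offset-aware value with a naive one raises TypeError.
    return a if datetime.fromisoformat(a) > datetime.fromisoformat(b) else b

a = "2024-01-01T12:00:00+02:00"  # 10:00 UTC
b = "2024-01-01T11:00:00+00:00"  # 11:00 UTC
# max(a, b) compares the strings character by character and returns a,
# although b is the later instant; later(a, b) returns b.
```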

See references/patterns.md for schemas and code examples.
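The checklist's retry item, exponential backoff with jitter, can be sketched as follows. This is one common variant ("full jitter", where each wait is a random value between zero and the exponential cap), not necessarily the one in patterns.md. The name `with_backoff` and its defaults are hypothetical, and a real pipeline would catch only the transient errors of its HTTP client rather than every `Exception`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(call: Callable[[], T],
                 max_attempts: int = 5,
                 base_delay: float = 1.0,
                 max_delay: float = 60.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call` until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # assumption: narrow this to transient errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap doubles each attempt: base, 2*base, 4*base, ...
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # random wait spreads out retries
    raise ValueError("max_attempts must be at least 1")
```

Wrapping a single page request, for example `with_backoff(lambda: fetch_page(params))` with a hypothetical `fetch_page`, keeps the retry logic out of the fetch loop itself.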
