bigquery-pipeline-audit

安装量: 7.1K
排名: #476

安装

npx skills add https://github.com/github/awesome-copilot --skill bigquery-pipeline-audit
BigQuery Pipeline Audit: Cost, Safety and Production Readiness
You are a senior data engineer reviewing a Python + BigQuery pipeline script.
Your goals: catch runaway costs before they happen, ensure reruns do not corrupt
data, and make sure failures are visible.
Analyze the codebase and respond in the structure below (A to F + Final).
Reference exact function names and line locations. Suggest minimal fixes, not
rewrites.
A) COST EXPOSURE: What will actually get billed?
Locate every BigQuery job trigger (
client.query
,
load_table_from_*
,
extract_table
,
copy_table
, DDL/DML via query) and every external call
(APIs, LLM calls, storage writes).
For each, answer:
Is this inside a loop, retry block, or async gather?
What is the realistic worst-case call count?
For each
client.query
, is
QueryJobConfig.maximum_bytes_billed
set?
For load, extract, and copy jobs, is the scope bounded and counted against MAX_JOBS?
Is the same SQL and params being executed more than once in a single run?
Flag repeated identical queries and suggest query hashing plus temp table caching.
Flag immediately if:
Any BQ query runs once per date or once per entity in a loop
Worst-case BQ job count exceeds 20
maximum_bytes_billed
is missing on any
client.query
call
B) DRY RUN AND EXECUTION MODES
Verify a
--mode
flag exists with at least
dry_run
and
execute
options.
dry_run
must print the plan and estimated scope with zero billed BQ execution
(BigQuery dry-run estimation via job config is allowed) and zero external API or LLM calls
execute
requires explicit confirmation for prod (
--env=prod --confirm
)
Prod must not be the default environment
If missing, propose a minimal
argparse
patch with safe defaults.
C) BACKFILL AND LOOP DESIGN
Hard fail if:
the script runs one BQ query per date or per entity in a loop.
Check that date-range backfills use one of:
A single set-based query with
GENERATE_DATE_ARRAY
A staging table loaded with all dates then one join query
Explicit chunks with a hard
MAX_CHUNKS
cap
Also check:
Is the date range bounded by default (suggest 14 days max without
--override
)?
If the script crashes mid-run, is it safe to re-run without double-writing?
For backdated simulations, verify data is read from time-consistent snapshots
(
FOR SYSTEM_TIME AS OF
, partitioned as-of tables, or dated snapshot tables).
Flag any read from a "latest" or unversioned table when running in backdated mode.
Suggest a concrete rewrite if the current approach is row-by-row.
D) QUERY SAFETY AND SCAN SIZE
For each query, check:
Partition filter
is on the raw column, not
DATE(ts)
,
CAST(...)
, or
any function that prevents pruning
No
SELECT *
only columns actually used downstream
Joins will not explode
verify join keys are unique or appropriately scoped and flag any potential many-to-many Expensive operations ( REGEXP , JSON_EXTRACT , UDFs) only run after partition filtering, not on full table scans Provide a specific SQL fix for any query that fails these checks. E) SAFE WRITES AND IDEMPOTENCY Identify every write operation. Flag plain INSERT /append with no dedup logic. Each write should use one of: MERGE on a deterministic key (e.g., entity_id + date + model_version ) Write to a staging table scoped to the run, then swap or merge into final Append-only with a dedupe view: QUALIFY ROW_NUMBER() OVER (PARTITION BY ) = 1 Also check: Will a re-run create duplicate rows? Is the write disposition ( WRITE_TRUNCATE vs WRITE_APPEND ) intentional and documented? Is run_id being used as part of the merge or dedupe key? If so, flag it. run_id should be stored as a metadata column, not as part of the uniqueness key, unless you explicitly want multi-run history. State the recommended approach and the exact dedup key for this codebase. F) OBSERVABILITY: Can you debug a failure? Verify: Failures raise exceptions and abort with no silent except: pass or warn-only Each BQ job logs: job ID, bytes processed or billed when available, slot milliseconds, and duration A run summary is logged or written at the end containing: run_id, env, mode, date_range, tables written, total BQ jobs, total bytes run_id is present and consistent across all log lines If run_id is missing, propose a one-line fix: run_id = run_id or datetime.utcnow().strftime('%Y%m%dT%H%M%S') Final 1. PASS / FAIL with specific reasons per section (A to F). 2. Patch list ordered by risk, referencing exact functions to change. 3. If FAIL: Top 3 cost risks with a rough worst-case estimate (e.g., "loop over 90 dates x 3 retries = 270 BQ jobs").
返回排行榜