troubleshooting-dbt-job-errors

安装量: 85
排名: #9315

安装

npx skills add https://github.com/dbt-labs/dbt-agent-skills --skill troubleshooting-dbt-job-errors

Systematically diagnose and resolve dbt Cloud job failures using available MCP tools, CLI commands, and data investigation.

When to Use

  • dbt Cloud / dbt platform job failed and you need to find the root cause

  • Intermittent job failures that are hard to reproduce

  • Error messages that don't clearly indicate the problem

  • Post-merge failures where a recent change may have caused the issue

Not for: Local dbt development errors - use the skill using-dbt-for-analytics-engineering instead

The Iron Rule

Never modify a test to make it pass without understanding why it's failing.

A failing test is evidence of a problem. Changing the test to pass hides the problem. Investigate the root cause first.

Rationalizations That Mean STOP

| "Just make the test pass" | The test is telling you something is wrong. Investigate first.

| "There's a board meeting in 2 hours" | Rushing to a fix without diagnosis creates bigger problems.

| "We've already spent 2 days on this" | Sunk cost doesn't justify skipping proper diagnosis.

| "I'll just update the accepted values" | Are the new values valid business data or bugs? Verify first.

| "It's probably just a flaky test" | "Flaky" means there's an overall issue. Find it. We don't allow flaky tests to stay.

Workflow

flowchart TD
    A[Job failure reported] --> B{MCP Admin API available?}
    B -->|yes| C[Use list_jobs_runs to get history]
    B -->|no| D[Ask user for logs and run_results.json]
    C --> E[Use get_job_run_error for details]
    D --> F[Classify error type]
    E --> F
    F --> G{Error type?}
    G -->|Infrastructure| H[Check warehouse, connections, timeouts]
    G -->|Code/Compilation| I[Check git history for recent changes]
    G -->|Data/Test Failure| J[Use discovering-data skill to investigate]
    H --> K{Root cause found?}
    I --> K
    J --> K
    K -->|yes| L[Create branch, implement fix]
    K -->|no| M[Create findings document]
    L --> N[Add test - prefer unit test]
    N --> O[Create PR with explanation]
    M --> P[Document what was checked and next steps]

Step 1: Gather Job Run Information

If dbt MCP Server Admin API Available

Use these tools first - they provide the most comprehensive data:

| list_jobs_runs | Get recent run history, identify patterns

| get_job_run_error | Get detailed error message and context

# Example: Get recent runs for job 12345
list_jobs_runs(job_id=12345, limit=10)

# Example: Get error details for specific run
get_job_run_error(run_id=67890)

Without MCP Admin API

Ask the user to provide these artifacts:

  • Job run logs from dbt Cloud UI (Debug logs preferred)

  • run_results.json - contains execution status for each node

To get the run_results.json, generate the artifact URL for the user:

https://<DBT_ENDPOINT>/api/v2/accounts/<ACCOUNT_ID>/runs/<RUN_ID>/artifacts/run_results.json?step=<STEP_NUMBER>

Where:

  • <DBT_ENDPOINT> - The dbt Cloud endpoint. e.g

cloud.getdbt.com for the US multi-tenant platform (there are other endpoints for other regions)

  • ACCOUNT_PREFIX.us1.dbt.com for the cell-based platforms (there are different cell endpoints for different regions and cloud providers)

  • <ACCOUNT_ID> - The dbt Cloud account ID

  • <RUN_ID> - The failed job run ID

  • <STEP_NUMBER> - The step that failed (e.g., if step 4 failed, use ?step=4)

Example request:

"I don't have access to the dbt MCP server. Could you provide:

  • The debug logs from dbt Cloud (Job Run → Logs → Download)

  • The run_results.json - open this URL and copy/paste or upload the contents: https://cloud.getdbt.com/api/v2/accounts/12345/runs/67890/artifacts/run_results.json?step=4

Step 2: Classify the Error

| Infrastructure | Connection timeout, warehouse error, permissions | Check warehouse status, connection settings

| Code/Compilation | Undefined macro, syntax error, parsing error | Check git history for recent changes, use LSP tools

| Data/Test Failure | Test failed with N results, schema mismatch | Use discovering-data skill to query actual data

Step 3: Investigate Root Cause

For Infrastructure Errors

  • Check job configuration (timeout settings, execution steps, etc.)

  • Look for concurrent jobs competing for resources

  • Check if failures correlate with time of day or data volume

For Code/Compilation Errors

  • Check git history for recent changes:

If you're not in the dbt project directory, use the dbt MCP server to find the repository:

# Get project details including repository URL and project subdirectory
get_project_details(project_id=<project_id>)

The response includes:

repository - The git repository URL

  • dbt_project_subdirectory - Optional subfolder where the dbt project lives (e.g., dbt/, transform/analytics/)

Then either:

  • Query the repository directly using gh CLI if it's on GitHub

  • Clone to a temporary folder: git clone <repo_url> /tmp/dbt-investigation

Important: If the project is in a subfolder, navigate to it after cloning:

cd /tmp/dbt-investigation/<project_subdirectory>

Once in the project directory:

git log --oneline -20
git diff HEAD~5..HEAD -- models/ macros/
  • Use the CLI and LSP tools from the dbt MCP server or use the dbt CLI to check for errors:

If the dbt MCP server is available, use its tools:

# CLI tools
mcp__dbt_parse()                              # Check for parsing errors
mcp__dbt_list_models()                        # With selectos and `+` for finding models dependencies
mcp__dbt_compile(models="failing_model")      # Check compilation

# LSP tools
mcp__dbt_get_column_lineage()                 # Check column lineage

Otherwise, use the dbt CLI directly:

dbt parse          # Check for parsing errors
dbt list --select +failing_model          # Check for models upstream of the failing model
dbt compile --select failing_model  # Check compilation
  • Search for the error pattern:

Find where the undefined macro/model should be defined

  • Check if a file was deleted or renamed

For Data/Test Failures

Use the discovering-data skill to investigate the actual data.

  • Get the test SQL
dbt compile --select project_name.folder1.folder2.test_unique_name --output json

the full path for the test can be found with a dbt ls --resource-type test command

  • Query the failing test's underlying data:
dbt show --inline "<query_from_the_test_SQL>" --output json
  • Compare to recent git changes:

Did a transformation change introduce new values?

  • Did upstream source data change?

Step 4: Resolution

If Root Cause Is Found

  • Create a new branch:
git checkout -b fix/job-failure-<description>
  • Implement the fix addressing the actual root cause

  • Add a test to prevent recurrence:

Prefer unit tests for logic issues

  • Use data tests for data quality issues

  • Example unit test for transformation logic:

unit_tests:
  - name: test_status_mapping
    model: orders
    given:
      - input: ref('stg_orders')
        rows:
          - {status_code: 1, expected_status: 'pending'}
          - {status_code: 2, expected_status: 'shipped'}
    expect:
      rows:
        - {status: 'pending'}
        - {status: 'shipped'}
  • Create a PR with:

Description of the issue

  • Root cause analysis

  • How the fix resolves it

  • Test coverage added

If Root Cause Is NOT Found

Do not guess. Create a findings document.

Create docs/investigations/job-failure-<date>.md:

# Job Failure Investigation: <Job Name>

**Date:** YYYY-MM-DD
**Job ID:** <id>
**Status:** Unresolved

## Summary
Brief description of the failure and symptoms.

## What Was Checked

### Tools Used
- [ ] list_jobs_runs - findings
- [ ] get_job_run_error - findings
- [ ] git history - findings
- [ ] Data investigation - findings

### Hypotheses Tested
| Hypothesis | Evidence | Result |
|------------|----------|--------|
| Recent code change | No changes to affected models in 7 days | Ruled out |

## Patterns Observed
- Failures occur between 2-4 AM (peak load time?)
- Always fails on model X

## Suggested Next Steps
1. [ ] Check the data ingestion process to see if new data was added
2. [ ] Check if a new version of dbt or of the dbt adapter was released

## Related Resources
- Link to job run logs
- Link to relevant documentation

Commit this document to the repository so findings aren't lost.

Quick Reference

| Get job run history | list_jobs_runs (MCP)

| Get detailed error | get_job_run_error (MCP)

| Check recent git changes | git log --oneline -20

| Parse project | dbt parse

| Compile specific model | dbt compile --select model_name

| Query data | dbt show --inline "SELECT ..." --output json

| Run specific test | dbt test --select test_name

Common Mistakes

Modifying tests to pass without investigation

  • A failing test is a signal, not an obstacle. Understand WHY before changing anything.

Skipping git history review

  • Most failures correlate with recent changes. Always check what changed.

Not documenting when unresolved

  • "I couldn't figure it out" leaves no trail. Document what was checked and what remains.

Making best-guess fixes under pressure

  • A wrong fix creates more problems. Take time to diagnose properly.

Ignoring data investigation for test failures

  • Test failures often reveal data issues. Query the actual data before assuming code is wrong.
返回排行榜