Systematically diagnose and resolve dbt Cloud job failures using available MCP tools, CLI commands, and data investigation.
## When to Use

- A dbt Cloud / dbt platform job failed and you need to find the root cause
- Intermittent job failures that are hard to reproduce
- Error messages that don't clearly indicate the problem
- Post-merge failures where a recent change may have caused the issue

Not for: local dbt development errors; use the `using-dbt-for-analytics-engineering` skill instead.
## The Iron Rule

**Never modify a test to make it pass without understanding why it's failing.**

A failing test is evidence of a problem. Changing the test to pass hides the problem. Investigate the root cause first.
## Rationalizations That Mean STOP

| Rationalization | Reality |
|---|---|
| "Just make the test pass" | The test is telling you something is wrong. Investigate first. |
| "There's a board meeting in 2 hours" | Rushing to a fix without diagnosis creates bigger problems. |
| "We've already spent 2 days on this" | Sunk cost doesn't justify skipping proper diagnosis. |
| "I'll just update the accepted values" | Are the new values valid business data or bugs? Verify first. |
| "It's probably just a flaky test" | "Flaky" means there's an underlying issue. Find it. We don't allow flaky tests to stay. |
## Workflow

```mermaid
flowchart TD
    A[Job failure reported] --> B{MCP Admin API available?}
    B -->|yes| C[Use list_jobs_runs to get history]
    B -->|no| D[Ask user for logs and run_results.json]
    C --> E[Use get_job_run_error for details]
    D --> F[Classify error type]
    E --> F
    F --> G{Error type?}
    G -->|Infrastructure| H[Check warehouse, connections, timeouts]
    G -->|Code/Compilation| I[Check git history for recent changes]
    G -->|Data/Test Failure| J[Use discovering-data skill to investigate]
    H --> K{Root cause found?}
    I --> K
    J --> K
    K -->|yes| L[Create branch, implement fix]
    K -->|no| M[Create findings document]
    L --> N[Add test - prefer unit test]
    N --> O[Create PR with explanation]
    M --> P[Document what was checked and next steps]
```
## Step 1: Gather Job Run Information

### If the dbt MCP Server Admin API Is Available

Use these tools first; they provide the most comprehensive data:

| Tool | Purpose |
|---|---|
| `list_jobs_runs` | Get recent run history, identify patterns |
| `get_job_run_error` | Get detailed error message and context |

```python
# Example: Get recent runs for job 12345
list_jobs_runs(job_id=12345, limit=10)

# Example: Get error details for specific run
get_job_run_error(run_id=67890)
```
### Without MCP Admin API

Ask the user to provide these artifacts:

- Job run logs from the dbt Cloud UI (debug logs preferred)
- `run_results.json` - contains execution status for each node

To get the `run_results.json`, generate the artifact URL for the user:

```
https://<DBT_ENDPOINT>/api/v2/accounts/<ACCOUNT_ID>/runs/<RUN_ID>/artifacts/run_results.json?step=<STEP_NUMBER>
```

Where:

- `<DBT_ENDPOINT>` - the dbt Cloud endpoint, e.g. `cloud.getdbt.com` for the US multi-tenant platform (other regions use other endpoints), or `<ACCOUNT_PREFIX>.us1.dbt.com` for the cell-based platforms (different cell endpoints exist for different regions and cloud providers)
- `<ACCOUNT_ID>` - the dbt Cloud account ID
- `<RUN_ID>` - the failed job run ID
- `<STEP_NUMBER>` - the step that failed (e.g., if step 4 failed, use `?step=4`)

Example request:

"I don't have access to the dbt MCP server. Could you provide:

1. The debug logs from dbt Cloud (Job Run → Logs → Download)
2. The `run_results.json` - open this URL and copy/paste or upload the contents:
   https://cloud.getdbt.com/api/v2/accounts/12345/runs/67890/artifacts/run_results.json?step=4"
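Once the artifact is in hand, the failing nodes can be pulled out programmatically. A minimal sketch, relying only on the standard top-level `results` list with per-node `status` and `message` fields; the sample payload below is illustrative, not from a real run:

```python
import json

def failed_nodes(run_results: dict) -> list[tuple[str, str]]:
    """Return (unique_id, message) for every node that errored or failed."""
    return [
        (r["unique_id"], r.get("message") or "")
        for r in run_results["results"]
        if r["status"] in ("error", "fail")
    ]

# In practice: run_results = json.load(open("run_results.json"))
# Illustrative payload in the shape of run_results.json:
sample = json.loads("""
{"results": [
  {"unique_id": "model.jaffle_shop.orders", "status": "success", "message": null},
  {"unique_id": "test.jaffle_shop.unique_orders_id", "status": "fail",
   "message": "Got 3 results, configured to fail if != 0"}
]}
""")

for node, msg in failed_nodes(sample):
    print(f"{node}: {msg}")
```

This gives you the short list of nodes to investigate before reading the full debug logs.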
## Step 2: Classify the Error

| Error Type | Typical Symptoms | Investigation Approach |
|---|---|---|
| Infrastructure | Connection timeout, warehouse error, permissions | Check warehouse status, connection settings |
| Code/Compilation | Undefined macro, syntax error, parsing error | Check git history for recent changes, use LSP tools |
| Data/Test Failure | Test failed with N results, schema mismatch | Use the discovering-data skill to query actual data |
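When triaging many failures, a first mechanical pass over the error messages can speed things up. A rough sketch of keyword matching against the categories above; the keyword lists are illustrative and deliberately incomplete, so always read the actual message before acting:

```python
# Illustrative keyword lists; extend for your warehouse and adapter
CATEGORIES = {
    "Infrastructure": ("timeout", "connection", "permission denied", "warehouse"),
    "Code/Compilation": ("compilation error", "undefined", "syntax", "parsing"),
    "Data/Test Failure": ("got ", "results", "schema mismatch"),
}

def classify_error(message: str) -> str:
    """Best-effort bucket for a dbt error message; falls back to manual review."""
    text = message.lower()
    for category, keywords in CATEGORIES.items():
        if any(k in text for k in keywords):
            return category
    return "Unclassified"
```

Anything landing in "Unclassified" goes straight to a human read of the debug logs.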
## Step 3: Investigate Root Cause

### For Infrastructure Errors

- Check job configuration (timeout settings, execution steps, etc.)
- Look for concurrent jobs competing for resources
- Check if failures correlate with time of day or data volume
### For Code/Compilation Errors

- Check git history for recent changes.

  If you're not in the dbt project directory, use the dbt MCP server to find the repository:

  ```python
  # Get project details including repository URL and project subdirectory
  get_project_details(project_id=<project_id>)
  ```

  The response includes:

  - `repository` - the git repository URL
  - `dbt_project_subdirectory` - an optional subfolder where the dbt project lives (e.g., `dbt/`, `transform/analytics/`)

  Then either:

  - Query the repository directly using the `gh` CLI if it's on GitHub
  - Clone to a temporary folder:

    ```bash
    git clone <repo_url> /tmp/dbt-investigation
    ```

  Important: if the project is in a subfolder, navigate to it after cloning:

  ```bash
  cd /tmp/dbt-investigation/<project_subdirectory>
  ```

  Once in the project directory:

  ```bash
  git log --oneline -20
  git diff HEAD~5..HEAD -- models/ macros/
  ```
- Use the CLI and LSP tools from the dbt MCP server, or the dbt CLI, to check for errors.

  If the dbt MCP server is available, use its tools:

  ```python
  # CLI tools
  mcp__dbt_parse()                          # Check for parsing errors
  mcp__dbt_list_models()                    # With selectors and `+` for finding model dependencies
  mcp__dbt_compile(models="failing_model")  # Check compilation

  # LSP tools
  mcp__dbt_get_column_lineage()             # Check column lineage
  ```

  Otherwise, use the dbt CLI directly:

  ```bash
  dbt parse                            # Check for parsing errors
  dbt list --select +failing_model     # List models upstream of the failing model
  dbt compile --select failing_model   # Check compilation
  ```
- Search for the error pattern: find where the undefined macro/model should be defined.
- Check if a file was deleted or renamed.
### For Data/Test Failures

Use the discovering-data skill to investigate the actual data.

- Get the test SQL:

  ```bash
  dbt compile --select project_name.folder1.folder2.test_unique_name --output json
  ```

  The full path for the test can be found with a `dbt ls --resource-type test` command.

- Query the failing test's underlying data:

  ```bash
  dbt show --inline "<query_from_the_test_SQL>" --output json
  ```

- Compare to recent git changes: did a transformation change introduce new values?
- Did upstream source data change?
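For failures like accepted-values tests, tallying the offending values makes it obvious whether you're looking at one bad batch or a genuinely new value. A sketch assuming the JSON payload from `dbt show --output json` carries a `show` key listing the rows (verify against your dbt version; the `status` column name is illustrative):

```python
import json
from collections import Counter

def tally_column(show_json: str, column: str) -> Counter:
    """Count occurrences of each value in `column` across the returned rows."""
    rows = json.loads(show_json)["show"]
    return Counter(row[column] for row in rows)

# Illustrative payload from: dbt show --inline "<query_from_the_test_SQL>" --output json
payload = '{"show": [{"status": "returned"}, {"status": "returned"}, {"status": "refunded?"}]}'
print(tally_column(payload, "status"))
```

A single malformed value repeated thousands of times points at an ingestion bug; a handful of a plausible new value points at a business change to verify with the data owner.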
## Step 4: Resolution

### If Root Cause Is Found

- Create a new branch:

  ```bash
  git checkout -b fix/job-failure-<description>
  ```

- Implement the fix addressing the actual root cause.
- Add a test to prevent recurrence:
  - Prefer unit tests for logic issues
  - Use data tests for data quality issues

  Example unit test for transformation logic:
  ```yaml
  unit_tests:
    - name: test_status_mapping
      model: orders
      given:
        - input: ref('stg_orders')
          rows:
            - {status_code: 1}
            - {status_code: 2}
      expect:
        rows:
          - {status: 'pending'}
          - {status: 'shipped'}
  ```
- Create a PR with:
  - Description of the issue
  - Root cause analysis
  - How the fix resolves it
  - Test coverage added
### If Root Cause Is NOT Found

Do not guess. Create a findings document at `docs/investigations/job-failure-<date>.md`:
```markdown
# Job Failure Investigation: <Job Name>

**Date:** YYYY-MM-DD
**Job ID:** <id>
**Status:** Unresolved

## Summary

Brief description of the failure and symptoms.

## What Was Checked

### Tools Used

- [ ] list_jobs_runs - findings
- [ ] get_job_run_error - findings
- [ ] git history - findings
- [ ] Data investigation - findings

### Hypotheses Tested

| Hypothesis | Evidence | Result |
|------------|----------|--------|
| Recent code change | No changes to affected models in 7 days | Ruled out |

## Patterns Observed

- Failures occur between 2-4 AM (peak load time?)
- Always fails on model X

## Suggested Next Steps

1. [ ] Check the data ingestion process to see if new data was added
2. [ ] Check if a new version of dbt or of the dbt adapter was released

## Related Resources

- Link to job run logs
- Link to relevant documentation
```
Commit this document to the repository so findings aren't lost.
## Quick Reference

| Task | Command/Tool |
|---|---|
| Get job run history | `list_jobs_runs` (MCP) |
| Get detailed error | `get_job_run_error` (MCP) |
| Check recent git changes | `git log --oneline -20` |
| Parse project | `dbt parse` |
| Compile specific model | `dbt compile --select model_name` |
| Query data | `dbt show --inline "SELECT ..." --output json` |
| Run specific test | `dbt test --select test_name` |
## Common Mistakes

**Modifying tests to pass without investigation.** A failing test is a signal, not an obstacle. Understand WHY before changing anything.

**Skipping git history review.** Most failures correlate with recent changes. Always check what changed.

**Not documenting when unresolved.** "I couldn't figure it out" leaves no trail. Document what was checked and what remains.

**Making best-guess fixes under pressure.** A wrong fix creates more problems. Take time to diagnose properly.

**Ignoring data investigation for test failures.** Test failures often reveal data issues. Query the actual data before assuming the code is wrong.