- Reduce Unoptimized Query Oracle Test Failure
- Reduce an unoptimized-query-oracle test failure log to the simplest possible
- reproduction case.
- The unoptimized-query-oracle roachtest runs a series of random SQL statements to
- create a random dataset, and then executes a random "Query of Interest" twice,
- with different optimization settings. If the two executions return different
- results, it indicates a bug in CockroachDB.
- When to Use
- Use this skill when:
- You have a test failure from the unoptimized-query-oracle roachtest.
- You need to find the minimal SQL to reproduce the test failure.
- Step 1: Locate artifacts
- Ask the user where the artifacts directory is.
- Find the relevant files in the artifacts directory:
- Test parameters
- :
- params.log
- (the parameters from the roachtest)
- Test log
- :
- test.log
- (the log from the roachtest)
- Failure log
- :
- failure*.log
- (the failure log from the roachtest)
- Full SQL log
- :
- unoptimized-query-oracle*.log
- (the SQL statements that led to failure)
- Query of interest log
- :
- unoptimized-query-oracle*.failure.log
- (containing
- the query of interest and possibly more information about the failure)
- Cockroach log
- :
- logs/1.unredacted/cockroach.log
- or
- logs/unredacted/cockroach.log
- (contains the git commit)
- Step 2: Determine test configuration
- Determine the git commit from
- cockroach.log
- :
- grep
- "binary: CockroachDB"
- cockroach.log
- Look for the commit hash in the version string (e.g.,
- cb94db961b8f55e3473f279d98ae90f0eeb0adcb
- ).
- Determine if runtime assertions are enabled by checking for:
- "runtimeAssertionsBuild": "true"
- in
- params.log
- or
- Runtime assertions enabled
- in
- test.log
- Determine if metamorphic settings apply by looking for:
- lines like these in
- params.log
- :
- "metamorphicBufferedSender": "true",
- "metamorphicWriteBuffering": "true",
- or lines like these in
- test.log
- :
- metamorphically setting "kv.rangefeed.buffered_sender.enabled" to 'true'
- metamorphically setting "kv.transaction.write_buffering.enabled" to 'true'
- Determine environment variables from the beginning of
- cockroach.log
- :
- grep
- -A10
- "using local environment variables:"
- cockroach.log
- Important environment variables include:
- COCKROACH_INTERNAL_CHECK_CONSISTENCY_FATAL
- COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING
- COCKROACH_RANDOM_SEED
- COCKROACH_TESTING_FORCE_RELEASE_BRANCH
- But there might be more important environment variables, so best to get all of
- them.
- Determine if this is a multi-region test or single-region test by checking:
- the test name (e.g.,
- seed-multi-region
- in
- test.log
- indicates multi-region)
- or the presence of
- \connect
- lines in the full SQL log
- If both of these are missing, it's a single-region test.
- Step 3: Check Out and Build
- For a normal build use:
- git
- checkout
- <
- commit-hash
- >
- ./dev build short
- If runtime assertions were enabled, use a test build instead:
- git
- checkout
- <
- commit-hash
- >
- ./dev build short --
- --crdb_test
- Note:
- Only build libgeos if the reproduction uses geospatial functions (BOX2D,
- geometry, geography, etc.):
- ./dev build libgeos
- Step 4: Prepare the Full SQL Log File
- First, check that the following statements are at the top of the full SQL log
- file. If they are not, add them:
- SET
- statement_timeout
- =
- '1m0s'
- ;
- SET
- sql_safe_updates
- =
- false
- ;
- If metamorphic settings were used, also add them to the top of the full SQL log
- file:
- SET
- CLUSTER SETTING kv
- .
- rangefeed
- .
- buffered_sender
- .
- enabled
- =
- true
- ;
- SET
- CLUSTER SETTING kv
- .
- transaction
- .
- write_buffering
- .
- enabled
- =
- true
- ;
- Create an appropriate directory either in the artifacts directory or in the
- repository root for holding temp files.
- Step 5: Initial Reproduction
- Determine the correct demo command based on test type:
- Multi-region test
-
- Use
- --nodes=9
- Single-region test
- Omit
--nodes
option
Use a command like this to try reproducing the test failure from the full SQL
log file. This command could take up to 20 minutes to finish.
<
env vars
./cockroach demo --multitenant = false --nodes = 9 --insecure --set = errexit = false --no-example-database --format = tsv -f < full-sql-log-file
Check that the output reproduces the test failure described in the failure log. There are many possible failure modes. Look for one of the following, which should match the failure log: Different results between the two executions of the "Query of Interest" (which is the randomly generated SELECT statement repeated twice near the end of the log, wrapped in various SET and RESET staements). These different results could take the form of different result sets, or could also be an error in one case and no error in the other case. This is an "oracle" failure . Or, internal error or assertion failure . Note the error message for the reduce step. Or, a panic . Note the error message for the reduce step. Or, a timeout . Note the statement that timed out. Troubleshooting IMPORTANT: Many failures are nondeterministic, especially for multi-region tests. If no failure happens on the first run, try up to 10 times before concluding it doesn't reproduce. It can be helpful at this point to compare the output with the failure*.log which should show the failure from the original test run. If the initial run fails to reproduce after 10 times, pause here and report to the user that the failure cannot be reproduced, and show the command that was tried. The user might have additional instructions. If it looks like it reproduces, it's time to move on to the next step. Step 6: Use the Reduce Tool Build the reduce tool: ./dev build reduce Prepare the Full SQL Log File again For multi-region tests, remove \connect lines (they cause syntax errors in the reduce tool): grep -v '^\connect' < full-sql-log-file
< cleaned-log
Run Reduce IMPORTANT: The reduce tool must be run from the cockroach repository root directory, because it looks for ./cockroach in the current directory. Use the -multi-region option for multi-region tests, or omit it for single-region tests. For "oracle" failures (different results): ./bin/reduce -unoptimized-query-oracle -multi-region -chunk 25 -v -file < cleaned-log
2
&1 | tee reduce-output.log The -unoptimized-query-oracle option checks whether the two executions of the "Query of Interest" produce the same results. For internal errors/assertion failures/panics: ./bin/reduce -contains "
" -multi-region -chunk 25 -v -file < cleaned-log 2
&1 | tee reduce-output.log Use a distinctive part of the error message as the -contains regex (e.g., "nil LeafTxnInputState" ). The reduce tool might take up to an hour to run. Extract the Reduced SQL The reduce tool outputs progress lines followed by the final SQL. Extract just the SQL: grep -A1000 "^reduction: " reduce-output.log | tail -n +2
reduced.sql IMPORTANT: Immediately save a backup of the reduce output before manual simplification: cp reduced.sql reduced_original.sql This provides a recovery point if the working file gets corrupted during simplification. If the reduce tool fails to reproduce, pause here and report this to the user. They might have additional instructions. Occasionally we have to modify the reduce tool itself, if the test failure is not reproducing. Step 7: Create Test Script and Determine Reproduction Rate IMPORTANT: Many bugs are nondeterministic. Before manual simplification, create a reusable test script and determine the reproduction rate. Create a small test script (adjust as needed): cat
test_repro.sh << 'EOF'
!/bin/bash
Test if reduced_v2.sql reproduces the error (exits on first success, up to 10 attempts)
for i in {1..10}; do
if ./cockroach demo --multitenant=false --nodes=9 --insecure \
--set=errexit=false --no-example-database --format=tsv \
-f reduced_v2.sql 2>&1 | grep -q "
Minimal Reproduction
reduced.sql
cat
reduced.sql << 'EOF' CREATE TABLE t (); SET testing_optimizer_random_seed = 1234567890; SET testing_optimizer_disable_rule_probability = 0.5; SELECT ...; EOF
bisect_run.sh
cat
bisect_run.sh << 'EOF'
!/bin/bash
Git bisect run script
Exit codes: 0=good (bug not present), 1=bad (bug present), 125=skip (build failed)
REPO_DIR="/path/to/cockroach" REPRO_SQL="/path/to/reduced.sql" cd "$REPO_DIR" || exit 125 echo "=== Testing commit $(git rev-parse --short HEAD) ==="
Build (use --crdb_test if runtime assertions were enabled in the original test)
if ! ./dev build short -- --crdb_test 2>&1 | grep -q "Successfully built"; then echo "BUILD FAILED - skipping" exit 125 fi
Test for bug (try 3 times for flaky bugs)
for i in {1..3}; do
if ./cockroach demo --multitenant=false --insecure \
--set=errexit=false --no-example-database --format=tsv \
-f "$REPRO_SQL" 2>&1 | grep -q "
Command to reproduce
git checkout < commit-hash
./bisect_run.sh
Command to bisect
git bisect start .. . git bisect run bisect_run.sh
Failure
Repro rate: ~X% (may need multiple attempts)
After showing this output, ask the user if they want to try reproducing the bug on master branch. Optional Step 10: Check if Bug is Fixed on Master Before bisecting, check whether the bug has already been fixed on master. git stash
if needed
git checkout master ./dev build short -- --crdb_test ./cockroach demo --multitenant = false --insecure --set = errexit = false --no-example-database --format = tsv -f reduced.sql Run this a few times to account for flakiness. Note whether the bug reproduces on master or not. Optional Step 11: Bisect If the user wants to find the commit that introduced or fixed the bug, use git bisect . If the Bug is Already Fixed on Master Bisect to find the fix commit (the first commit where the bug no longer reproduces). Use custom terms since the "good" commit (master) is newer than the "bad" commit: git bisect start --first-parent --term-old = broken --term-new = fixed git bisect broken < commit-where-bug-exists
e.g., the original failing commit
git bisect fixed master
master is fixed
git bisect run ./bisect_run.sh
When done
git bisect reset Note: The --first-parent option follows only merge commits on the main branch, avoiding detours into feature branches. The bisect script must return 0 when the bug is NOT present (fixed) and 1 when the bug IS present (broken). If the Bug Still Exists on Master Bisect to find the regression commit (the first commit where the bug was introduced): git bisect start --first-parent git bisect good < known-good-commit
e.g., a previous release tag
git bisect bad master
master has the bug
git bisect run ./bisect_run.sh
When done
git bisect reset The bisect will identify the commit that introduced or fixed the bug. Finding a Good Commit If you don't know a good commit (where the bug doesn't exist), you can jump back in time to find one.
Find a commit from ~6 months ago on the main branch
git rev-list --first-parent -1 --before = "6 months ago" HEAD Test whether the bug exists at that commit. If not, use it as the good commit for bisect. If the bug still exists, try going back further in time, but don't go back further than 1 year. If a known good commit can't be found within 1 year, stop and report this to the user.