ADK Evaluation Guide
Scaffolded project?
If you used
/adk-scaffold
, you already have
make eval
,
tests/eval/evalsets/
, and
tests/eval/eval_config.json
. Start with
make eval
and iterate from there.
Non-scaffolded?
Use
adk eval
directly — see
Running Evaluations
below.
Reference Files
File
Contents
references/criteria-guide.md
Complete metrics reference — all 8 criteria, match types, custom metrics, judge model config
references/user-simulation.md
Dynamic conversation testing — ConversationScenario, user simulator config, compatible metrics
references/builtin-tools-eval.md
google_search and model-internal tools — trajectory behavior, metric compatibility
references/multimodal-eval.md
Multimodal inputs — evalset schema, built-in metric limitations, custom evaluator pattern
The Eval-Fix Loop
Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.
How to iterate
Start small: Begin with 1-2 eval cases, not the full suite Run eval : make eval (or adk eval if no Makefile) Read the scores — identify what failed and why Fix the code — adjust prompts, tool logic, instructions, or the evalset Rerun eval — verify the fix worked Repeat steps 3-5 until the case passes Only then add more eval cases and expand coverage Expect 5-10+ iterations. This is normal — each iteration makes the agent better. What to fix when scores fail Failure What to change tool_trajectory_avg_score low Fix agent instructions (tool ordering), update evalset tool_uses , or switch to IN_ORDER / ANY_ORDER match type response_match_score low Adjust agent instruction wording, or relax the expected response final_response_match_v2 low Refine agent instructions, or adjust expected response — this is semantic, not lexical rubric_based score low Refine agent instructions to address the specific rubric that failed hallucinations_v1 low Tighten agent instructions to stay grounded in tool output Agent calls wrong tools Fix tool descriptions, agent instructions, or tool_config Agent calls extra tools Use IN_ORDER / ANY_ORDER match type, add strict stop instructions, or switch to rubric_based_tool_use_quality_v1 Choosing the Right Criteria Goal Recommended Metric Regression testing / CI/CD (fast, deterministic) tool_trajectory_avg_score + response_match_score Semantic response correctness (flexible phrasing OK) final_response_match_v2 Response quality without reference answer rubric_based_final_response_quality_v1 Validate tool usage reasoning rubric_based_tool_use_quality_v1 Detect hallucinated claims hallucinations_v1 Safety compliance safety_v1 Dynamic multi-turn conversations User simulation + hallucinations_v1 / safety_v1 (see references/user-simulation.md ) Multimodal input (image, audio, file) tool_trajectory_avg_score + custom metric for response quality (see references/multimodal-eval.md ) For the complete metrics reference with config examples, match types, and custom metrics, see references/criteria-guide.md . Running Evaluations

Scaffolded projects:

make eval EVALSET = tests/eval/evalsets/my_evalset.json

Or directly via ADK CLI:

adk eval ./app < path_to_evalset.json

--config_file_path

< path_to_config.json

--print_detailed_results

Run specific eval cases from a set:

adk eval ./app my_evalset.json:eval_1,eval_2

With GCS storage:

adk eval ./app my_evalset.json --eval_storage_uri gs://my-bucket/evals CLI options: --config_file_path , --print_detailed_results , --eval_storage_uri , --log_level Eval set management: adk eval_set create < agent_path

< eval_set_id

adk eval_set add_eval_case < agent_path

< eval_set_id

--scenarios_file < path

--session_input_file < path

Configuration Schema ( eval_config.json ) Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs. Full example { "criteria" : { "tool_trajectory_avg_score" : { "threshold" : 1.0 , "match_type" : "IN_ORDER" } , "final_response_match_v2" : { "threshold" : 0.8 , "judge_model_options" : { "judge_model" : "gemini-2.5-flash" , "num_samples" : 5 } } , "rubric_based_final_response_quality_v1" : { "threshold" : 0.8 , "rubrics" : [ { "rubric_id" : "professionalism" , "rubric_content" : { "text_property" : "The response must be professional and helpful." } } , { "rubric_id" : "safety" , "rubric_content" : { "text_property" : "The agent must NEVER book without asking for confirmation." } } ] } } } Simple threshold shorthand is also valid: "response_match_score": 0.8 For custom metrics, judge_model_options details, and user_simulator_config , see references/criteria-guide.md . EvalSet Schema ( evalset.json ) { "eval_set_id" : "my_eval_set" , "name" : "My Eval Set" , "description" : "Tests core capabilities" , "eval_cases" : [ { "eval_id" : "search_test" , "conversation" : [ { "invocation_id" : "inv_1" , "user_content" : { "parts" : [ { "text" : "Find a flight to NYC" } ] } , "final_response" : { "role" : "model" , "parts" : [ { "text" : "I found a flight for $500. Want to book?" } ] } , "intermediate_data" : { "tool_uses" : [ { "name" : "search_flights" , "args" : { "destination" : "NYC" } } ] , "intermediate_responses" : [ [ "sub_agent_name" , [ { "text" : "Found 3 flights to NYC." } ] ] ] } } ] , "session_input" : { "app_name" : "my_app" , "user_id" : "user_1" , "state" : { } } } ] } Key fields: intermediate_data.tool_uses — expected tool call trajectory (chronological order) intermediate_data.intermediate_responses — expected sub-agent responses (for multi-agent systems) session_input.state — initial session state (overrides Python-level initialization) conversation_scenario — alternative to conversation for user simulation (see references/user-simulation.md ) Common Gotchas The Proactivity Trajectory Gap LLMs often perform extra actions not asked for (e.g., google_search after save_preferences ). This causes tool_trajectory_avg_score failures with EXACT match. Solutions: Use IN_ORDER or ANY_ORDER match type — tolerates extra tool calls between expected ones Include ALL tools the agent might call in your expected trajectory Use rubric_based_tool_use_quality_v1 instead of trajectory matching Add strict stop instructions: "Stop after calling save_preferences. Do NOT search." Multi-turn conversations require tool_uses for ALL turns The tool_trajectory_avg_score evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools. { "conversation" : [ { "invocation_id" : "inv_1" , "user_content" : { "parts" : [ { "text" : "Find me a flight from NYC to London" } ] } , "intermediate_data" : { "tool_uses" : [ { "name" : "search_flights" , "args" : { "origin" : "NYC" , "destination" : "LON" } } ] } } , { "invocation_id" : "inv_2" , "user_content" : { "parts" : [ { "text" : "Book the first option" } ] } , "final_response" : { "role" : "model" , "parts" : [ { "text" : "Booking confirmed!" } ] } , "intermediate_data" : { "tool_uses" : [ { "name" : "book_flight" , "args" : { "flight_id" : "1" } } ] } } ] } App name must match directory name The App object's name parameter MUST match the directory containing your agent:

CORRECT - matches the "app" directory

app

App ( root_agent = root_agent , name = "app" )

WRONG - causes "Session not found" errors

app

App ( root_agent = root_agent , name = "flight_booking_assistant" ) The before_agent_callback Pattern (State Initialization) Always use a callback to initialize session state variables used in your instruction template. This prevents KeyError crashes on the first turn: async def initialize_state ( callback_context : CallbackContext ) -

None : state = callback_context . state if "user_preferences" not in state : state [ "user_preferences" ] = { } root_agent = Agent ( name = "my_agent" , before_agent_callback = initialize_state , instruction = "Based on preferences: {user_preferences}..." , ) Eval-State Overrides (Type Mismatch Danger) Be careful with session_input.state in your evalset. It overrides Python-level initialization: // WRONG — initializes feedback_history as a string, breaks .append() "state" : { "feedback_history" : "" } // CORRECT — matches the Python type (list) "state" : { "feedback_history" : [ ] } // NOTE: Remove these // comments before using — JSON does not support comments. Model thinking mode may bypass tools Models with "thinking" enabled may skip tool calls. Use tool_config with mode="ANY" to force tool usage, or switch to a non-thinking model for predictable tool calling. Common Eval Failure Causes Symptom Cause Fix Missing tool_uses in intermediate turns Trajectory expects match per invocation Add expected tool calls to all turns Agent mentions data not in tool output Hallucination Tighten agent instructions; add hallucinations_v1 metric "Session not found" error App name mismatch Ensure App name matches directory name Score fluctuates between runs Non-deterministic model Set temperature=0 or use rubric-based eval tool_trajectory_avg_score always 0 Agent uses google_search (model-internal) Remove trajectory metric; see references/builtin-tools-eval.md Trajectory fails but tools are correct Extra tools called Switch to IN_ORDER / ANY_ORDER match type LLM judge ignores image/audio in eval get_text_from_content() skips non-text parts Use custom metric with vision-capable judge (see references/multimodal-eval.md ) Deep Dive: ADK Docs For the official evaluation documentation, fetch these pages: Evaluation overview : https://google.github.io/adk-docs/evaluate/index.md Criteria reference : https://google.github.io/adk-docs/evaluate/criteria/index.md User simulation : https://google.github.io/adk-docs/evaluate/user-sim/index.md Debugging Example User says: "tool_trajectory_avg_score is 0, what's wrong?" Check if agent uses google_search — if so, see references/builtin-tools-eval.md Check if using EXACT match and agent calls extra tools — try IN_ORDER Compare expected tool_uses in evalset with actual agent behavior Fix mismatch (update evalset or agent instructions)

adk-eval-guide

安装