# Databricks Skills Testing Framework

Offline YAML-first evaluation with human-in-the-loop review and interactive skill improvement.
## Quick References

- **Scorers** - Available scorers and quality gates
- **YAML Schemas** - Manifest and ground truth formats
- **Python API** - Programmatic usage examples
- **Workflows** - Detailed example workflows
- **Trace Evaluation** - Session trace analysis
## /skill-test Command

The `/skill-test` command provides an interactive CLI for testing Databricks skills with real execution on Databricks.

### Basic Usage

```
/skill-test <skill-name> [subcommand]
```
### Subcommands

| Subcommand | Description |
|------------|-------------|
| `run` | Run evaluation against ground truth (default) |
| `regression` | Compare current results against baseline |
| `init` | Initialize test scaffolding for a new skill |
| `add` | Interactive: prompt -> invoke skill -> test -> save |
| `add --trace` | Add test case with trace evaluation |
| `review` | Review pending candidates interactively |
| `review --batch` | Batch approve all pending candidates |
| `baseline` | Save current results as regression baseline |
| `mlflow` | Run full MLflow evaluation with LLM judges |
| `trace-eval` | Evaluate traces against skill expectations |
| `list-traces` | List available traces (MLflow or local) |
| `scorers` | List configured scorers for a skill |
| `scorers update` | Add/remove scorers or update default guidelines |
| `sync` | Sync YAML to Unity Catalog (Phase 2) |
### Quick Examples

```
/skill-test databricks-spark-declarative-pipelines run
/skill-test databricks-spark-declarative-pipelines add --trace
/skill-test databricks-spark-declarative-pipelines review --batch --filter-success
/skill-test my-new-skill init
```

See **Workflows** for detailed examples of each subcommand.
## Execution Instructions

### Environment Setup

```bash
uv pip install -e .test/
```

Environment variables for Databricks MLflow:

- `DATABRICKS_CONFIG_PROFILE` - Databricks CLI profile (default: `DEFAULT`)
- `MLFLOW_TRACKING_URI` - Set to `databricks` for Databricks MLflow
- `MLFLOW_EXPERIMENT_NAME` - Experiment path (e.g., `/Users/{user}/skill-test`)
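Since the wrapper scripts run under Python, these variables can also be set in-process before anything touches MLflow. A minimal sketch, assuming the values above; the experiment path is a placeholder for your own workspace path:

```python
import os

# Set Databricks/MLflow environment before importing or calling MLflow.
# The experiment path below is a placeholder, not a real workspace path.
os.environ["DATABRICKS_CONFIG_PROFILE"] = "DEFAULT"
os.environ["MLFLOW_TRACKING_URI"] = "databricks"
os.environ["MLFLOW_EXPERIMENT_NAME"] = "/Users/me@example.com/skill-test"

print(os.environ["MLFLOW_TRACKING_URI"])  # databricks
```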
### Running Scripts

All subcommands have corresponding scripts in `.test/scripts/`:

```bash
uv run python .test/scripts/{subcommand}.py {skill_name} [options]
```
| Subcommand | Script |
|------------|--------|
| `run` | `run_eval.py` |
| `regression` | `regression.py` |
| `init` | `init_skill.py` |
| `add` | `add.py` |
| `review` | `review.py` |
| `baseline` | `baseline.py` |
| `mlflow` | `mlflow_eval.py` |
| `scorers` | `scorers.py` |
| `scorers update` | `scorers_update.py` |
| `sync` | `sync.py` |
| `trace-eval` | `trace_eval.py` |
| `list-traces` | `list_traces.py` |
| `_routing mlflow` | `routing_eval.py` |

Use `--help` on any script for available options.
## Command Handler

When `/skill-test` is invoked, parse arguments and execute the appropriate command.

### Argument Parsing

- `args[0]` = skill_name (required)
- `args[1]` = subcommand (optional, default: `run`)
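That parsing logic can be sketched as follows; the function name is illustrative, not part of the package:

```python
def parse_skill_test_args(args: list[str]) -> tuple[str, str]:
    """Parse /skill-test arguments: skill name first, optional subcommand after."""
    if not args:
        raise ValueError("skill_name is required")
    skill_name = args[0]
    subcommand = args[1] if len(args) > 1 else "run"  # default subcommand
    return skill_name, subcommand

print(parse_skill_test_args(["my-new-skill"]))          # ('my-new-skill', 'run')
print(parse_skill_test_args(["my-new-skill", "init"]))  # ('my-new-skill', 'init')
```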
### Subcommand Routing

| Subcommand | Action |
|------------|--------|
| `run` | Execute `run(skill_name, ctx)` and display results |
| `regression` | Execute `regression(skill_name, ctx)` and display comparison |
| `init` | Execute `init(skill_name, ctx)` to create scaffolding |
| `add` | Prompt for test input, invoke skill, run `interactive()` |
| `review` | Execute `review(skill_name, ctx)` to review pending candidates |
| `baseline` | Execute `baseline(skill_name, ctx)` to save as regression baseline |
| `mlflow` | Execute `mlflow_eval(skill_name, ctx)` with MLflow logging |
| `scorers` | Execute `scorers(skill_name, ctx)` to list configured scorers |
| `scorers update` | Execute `scorers_update(skill_name, ctx, ...)` to modify scorers |
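The routing table boils down to a plain dispatch dict. A minimal sketch; the handlers below are stand-ins for the real command functions (`run`, `regression`, `init`, ...):

```python
def dispatch(subcommand, skill_name, ctx, handlers):
    """Look up a subcommand in a handler table and invoke it with (skill_name, ctx)."""
    handler = handlers.get(subcommand)
    if handler is None:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    return handler(skill_name, ctx)

# Stand-in handlers; in the real CLI these come from the skill_test package.
handlers = {
    "run": lambda skill, ctx: f"ran eval for {skill}",
    "init": lambda skill, ctx: f"scaffolded {skill}",
}
print(dispatch("run", "my-new-skill", None, handlers))  # ran eval for my-new-skill
```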
### init Behavior

When running `/skill-test init`:

1. Read the skill's SKILL.md to understand its purpose
2. Create `manifest.yaml` with appropriate scorers and trace_expectations
3. Create empty `ground_truth.yaml` and `candidates.yaml` templates
4. Recommend test prompts based on documentation examples

Follow with `/skill-test add` using recommended prompts.
### Context Setup

Create a CLIContext with MCP tools before calling any command. See **Python API** for details.
## File Locations

**Important:** All test files are stored at the **repository root** level, not relative to this skill's directory.

| File Type | Path |
|-----------|------|
| Ground truth | `{repo_root}/.test/skills/{skill-name}/ground_truth.yaml` |
| Candidates | `{repo_root}/.test/skills/{skill-name}/candidates.yaml` |
| Manifest | `{repo_root}/.test/skills/{skill-name}/manifest.yaml` |
| Routing tests | `{repo_root}/.test/skills/_routing/ground_truth.yaml` |
| Baselines | `{repo_root}/.test/baselines/{skill-name}/baseline.yaml` |
For example, to test `databricks-spark-declarative-pipelines` in this repository:

```
/Users/.../ai-dev-kit/.test/skills/databricks-spark-declarative-pipelines/ground_truth.yaml
```

**Not** relative to the skill definition:

```
/Users/.../ai-dev-kit/.claude/skills/skill-test/skills/...   # WRONG
```
## Directory Structure

```
.test/                          # At REPOSITORY ROOT (not skill directory)
├── pyproject.toml              # Package config (pip install -e ".test/")
├── README.md                   # Contributor documentation
├── SKILL.md                    # Source of truth (synced to .claude/skills/)
├── install_skill_test.sh       # Sync script
├── scripts/                    # Wrapper scripts
│   ├── _common.py              # Shared utilities
│   ├── run_eval.py
│   ├── regression.py
│   ├── init_skill.py
│   ├── add.py
│   ├── baseline.py
│   ├── mlflow_eval.py
│   ├── routing_eval.py
│   ├── trace_eval.py           # Trace evaluation
│   ├── list_traces.py          # List available traces
│   ├── scorers.py
│   ├── scorers_update.py
│   └── sync.py
├── src/
│   └── skill_test/             # Python package
│       ├── cli/                # CLI commands module
│       ├── fixtures/           # Test fixture setup
│       ├── scorers/            # Evaluation scorers
│       ├── grp/                # Generate-Review-Promote pipeline
│       └── runners/            # Evaluation runners
├── skills/                     # Per-skill test definitions
│   ├── _routing/               # Routing test cases
│   └── {skill-name}/           # Skill-specific tests
│       ├── ground_truth.yaml
│       ├── candidates.yaml
│       └── manifest.yaml
├── tests/                      # Unit tests
├── references/                 # Documentation references
└── baselines/                  # Regression baselines
```
## References

- **Scorers** - Available scorers and quality gates
- **YAML Schemas** - Manifest and ground truth formats
- **Python API** - Programmatic usage examples
- **Workflows** - Detailed example workflows
- **Trace Evaluation** - Session trace analysis