# databricks-python-sdk

Installs: 82
Rank: #9630

## Install

```bash
npx skills add https://github.com/databricks-solutions/ai-dev-kit --skill databricks-python-sdk
```

## Databricks Development Guide

This skill provides guidance for the Databricks SDK, Databricks Connect, the Databricks CLI, and the REST API.

- SDK documentation: https://databricks-sdk-py.readthedocs.io/en/latest/
- GitHub repository: https://github.com/databricks/databricks-sdk-py

### Environment Setup

- Use the existing virtual environment at `.venv`, or create one with `uv`
- For Spark operations: `uv pip install databricks-connect`
- For SDK operations: `uv pip install databricks-sdk`
- The Databricks CLI version should be 0.278.0 or higher

### Configuration

- Default profile name: `DEFAULT`
- Config file: `~/.databrickscfg`
- Environment variables: `DATABRICKS_HOST`, `DATABRICKS_TOKEN`

### Databricks Connect (Spark Operations)

Use databricks-connect to run Spark code locally against a Databricks cluster.

```python
from databricks.connect import DatabricksSession

# Auto-detects the 'DEFAULT' profile from ~/.databrickscfg
spark = DatabricksSession.builder.getOrCreate()

# With an explicit profile
spark = DatabricksSession.builder.profile("MY_PROFILE").getOrCreate()

# Use spark as normal
df = spark.sql("SELECT * FROM catalog.schema.table")
df.show()
```

IMPORTANT: Do NOT set `.master("local[*]")` - this will cause issues with Databricks Connect.
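Since Databricks Connect returns ordinary Spark DataFrames, small result sets can be pulled back to the local machine as usual; a minimal sketch (the table name and `LIMIT` are placeholders):

```python
# Bring a bounded result set back to the driver as a pandas DataFrame
pdf = spark.sql("SELECT * FROM catalog.schema.table LIMIT 1000").toPandas()
print(pdf.head())
```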

### Direct REST API Access

For operations not yet in the SDK, or ones that are overly complex through it, call the REST API directly:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Direct API call using the authenticated client
response = w.api_client.do(method="GET", path="/api/2.0/clusters/list")

# POST with a body
response = w.api_client.do(method="POST", path="/api/2.0/jobs/run-now", body={"job_id": 123})
```

When to use: prefer SDK methods when available. Use `api_client.do` for:

- New API endpoints not yet in the SDK
- Complex operations where the SDK abstraction is problematic
- Debugging/testing raw API responses

### Databricks CLI

```bash
# Check version (should be >= 0.278.0)
databricks --version

# Use a specific profile
databricks --profile MY_PROFILE clusters list

# Common commands
databricks clusters list
databricks jobs list
databricks workspace ls /Users/me
```

## SDK Documentation Architecture

The SDK documentation follows a predictable URL pattern (a small helper that builds these URLs is sketched after the table below):

- Base: https://databricks-sdk-py.readthedocs.io/en/latest/
- Workspace APIs: `/workspace/{category}/{service}.html`
- Account APIs: `/account/{category}/{service}.html`
- Authentication: `/authentication.html`
- DBUtils: `/dbutils.html`

### Workspace API Categories

| Category | Services |
| --- | --- |
| compute | clusters, cluster_policies, command_execution, instance_pools, libraries |
| catalog | catalogs, schemas, tables, volumes, functions, storage_credentials, external_locations |
| jobs | jobs |
| sql | warehouses, statement_execution, queries, alerts, dashboards |
| serving | serving_endpoints |
| vectorsearch | vector_search_indexes, vector_search_endpoints |
| pipelines | pipelines |
| workspace | repos, secrets, workspace, git_credentials |
| files | files, dbfs |
| ml | experiments, model_registry |
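As a concrete illustration of the URL pattern above, a tiny helper (hypothetical, not part of the SDK) that builds workspace-API doc links:

```python
BASE = "https://databricks-sdk-py.readthedocs.io/en/latest"

def doc_url(category: str, service: str) -> str:
    # e.g. doc_url("compute", "clusters") -> .../workspace/compute/clusters.html
    return f"{BASE}/workspace/{category}/{service}.html"
```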

## Authentication

Doc: https://databricks-sdk-py.readthedocs.io/en/latest/authentication.html

### Environment Variables

```bash
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=dapi...  # Personal Access Token
```

### Code Patterns

```python
from databricks.sdk import WorkspaceClient

# Auto-detect credentials from the environment
w = WorkspaceClient()

# Explicit token auth
w = WorkspaceClient(host="https://your-workspace.cloud.databricks.com", token="dapi...")

# Azure Service Principal
w = WorkspaceClient(
    host="https://adb-xxx.azuredatabricks.net",
    azure_workspace_resource_id="/subscriptions/.../resourceGroups/.../providers/Microsoft.Databricks/workspaces/...",
    azure_tenant_id="tenant-id",
    azure_client_id="client-id",
    azure_client_secret="secret",
)

# Use a named profile from ~/.databrickscfg
w = WorkspaceClient(profile="MY_PROFILE")
```
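For reference, a minimal `~/.databrickscfg` backing the profile examples above might look like this (hosts and tokens are placeholders):

```ini
[DEFAULT]
host  = https://your-workspace.cloud.databricks.com
token = dapi...

[MY_PROFILE]
host  = https://other-workspace.cloud.databricks.com
token = dapi...
```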

## Core API Reference

### Clusters API

Doc: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/compute/clusters.html

```python
from datetime import timedelta

# List all clusters
for cluster in w.clusters.list():
    print(f"{cluster.cluster_name}: {cluster.state}")

# Get cluster details
cluster = w.clusters.get(cluster_id="0123-456789-abcdef")

# Create a cluster (returns a Wait object)
wait = w.clusters.create(
    cluster_name="my-cluster",
    spark_version=w.clusters.select_spark_version(latest=True),
    node_type_id=w.clusters.select_node_type(local_disk=True),
    num_workers=2,
)
cluster = wait.result()  # Wait for the cluster to be running

# Or use create_and_wait for a blocking call
cluster = w.clusters.create_and_wait(
    cluster_name="my-cluster",
    spark_version="14.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=2,
    timeout=timedelta(minutes=30),
)

# Start/stop/delete
w.clusters.start(cluster_id="...").result()
w.clusters.stop(cluster_id="...")
w.clusters.delete(cluster_id="...")
```

### Jobs API

Doc: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/jobs/jobs.html

```python
from databricks.sdk.service.jobs import Task, NotebookTask

# List jobs
for job in w.jobs.list():
    print(f"{job.job_id}: {job.settings.name}")

# Create a job
created = w.jobs.create(
    name="my-job",
    tasks=[
        Task(
            task_key="main",
            notebook_task=NotebookTask(notebook_path="/Users/me/notebook"),
            existing_cluster_id="0123-456789-abcdef",
        )
    ],
)

# Run a job now
run = w.jobs.run_now_and_wait(job_id=created.job_id)
print(f"Run completed: {run.state.result_state}")

# Get run output
output = w.jobs.get_run_output(run_id=run.run_id)
```

### SQL Statement Execution

Doc: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/sql/statement_execution.html

```python
from databricks.sdk.service.sql import StatementState

# Execute a SQL query
response = w.statement_execution.execute_statement(
    warehouse_id="abc123",
    statement="SELECT * FROM catalog.schema.table LIMIT 10",
    wait_timeout="30s",
)

# Check status and get results
if response.status.state == StatementState.SUCCEEDED:
    for row in response.result.data_array:
        print(row)

# For large results, fetch chunks
chunk = w.statement_execution.get_statement_result_chunk_n(
    statement_id=response.statement_id,
    chunk_index=0,
)
```
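If the statement has not finished within `wait_timeout`, the response comes back in a non-terminal state. Continuing the example above, a minimal polling sketch (the two-second interval is arbitrary; tune as needed):

```python
import time

# Poll until the statement reaches a terminal state
while response.status.state in (StatementState.PENDING, StatementState.RUNNING):
    time.sleep(2)
    response = w.statement_execution.get_statement(statement_id=response.statement_id)
```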

### SQL Warehouses

Doc: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/sql/warehouses.html

```python
# List warehouses
for wh in w.warehouses.list():
    print(f"{wh.name}: {wh.state}")

# Get a warehouse
warehouse = w.warehouses.get(id="abc123")

# Create a warehouse
created = w.warehouses.create_and_wait(
    name="my-warehouse",
    cluster_size="Small",
    max_num_clusters=1,
    auto_stop_mins=15,
)

# Start/stop
w.warehouses.start(id="abc123").result()
w.warehouses.stop(id="abc123").result()
```

### Unity Catalog - Tables

Doc: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/catalog/tables.html

```python
# List tables in a schema
for table in w.tables.list(catalog_name="main", schema_name="default"):
    print(f"{table.full_name}: {table.table_type}")

# Get table info
table = w.tables.get(full_name="main.default.my_table")
print(f"Columns: {[c.name for c in table.columns]}")

# Check if a table exists
exists = w.tables.exists(full_name="main.default.my_table")
```

### Unity Catalog - Catalogs & Schemas

Doc (Catalogs): https://databricks-sdk-py.readthedocs.io/en/latest/workspace/catalog/catalogs.html
Doc (Schemas): https://databricks-sdk-py.readthedocs.io/en/latest/workspace/catalog/schemas.html

```python
# List catalogs
for catalog in w.catalogs.list():
    print(catalog.name)

# Create a catalog
w.catalogs.create(name="my_catalog", comment="Description")

# List schemas
for schema in w.schemas.list(catalog_name="main"):
    print(schema.name)

# Create a schema
w.schemas.create(name="my_schema", catalog_name="main")
```

### Volumes

Doc: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/catalog/volumes.html

```python
from databricks.sdk.service.catalog import VolumeType

# List volumes
for vol in w.volumes.list(catalog_name="main", schema_name="default"):
    print(f"{vol.full_name}: {vol.volume_type}")

# Create a managed volume
w.volumes.create(
    catalog_name="main",
    schema_name="default",
    name="my_volume",
    volume_type=VolumeType.MANAGED,
)

# Read volume info
vol = w.volumes.read(name="main.default.my_volume")
```

### Files API

Doc: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/files/files.html

```python
# Upload a file to a volume
with open("local_file.csv", "rb") as f:
    w.files.upload(file_path="/Volumes/main/default/my_volume/data.csv", contents=f)

# Download a file (the response exposes the stream as .contents)
response = w.files.download(file_path="/Volumes/main/default/my_volume/data.csv")
content = response.contents.read()

# List directory contents
for entry in w.files.list_directory_contents("/Volumes/main/default/my_volume/"):
    print(f"{entry.name}: {entry.is_directory}")

# Upload/download with progress (parallel)
w.files.upload_from(
    file_path="/Volumes/main/default/my_volume/large.parquet",
    source_path="/local/path/large.parquet",
    use_parallel=True,
)
w.files.download_to(
    file_path="/Volumes/main/default/my_volume/large.parquet",
    destination="/local/output/",
    use_parallel=True,
)
```

### Serving Endpoints (Model Serving)

Doc: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/serving/serving_endpoints.html

```python
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

# List endpoints
for ep in w.serving_endpoints.list():
    print(f"{ep.name}: {ep.state}")

# Get an endpoint
endpoint = w.serving_endpoints.get(name="my-endpoint")

# Query an endpoint
response = w.serving_endpoints.query(name="my-endpoint", inputs={"prompt": "Hello, world!"})

# For chat/completions endpoints (messages are typed ChatMessage objects)
response = w.serving_endpoints.query(
    name="my-chat-endpoint",
    messages=[ChatMessage(role=ChatMessageRole.USER, content="Hello!")],
)

# Get an OpenAI-compatible client
openai_client = w.serving_endpoints.get_open_ai_client()
```
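A short sketch of using that client; it assumes "my-chat-endpoint" serves an OpenAI-compatible chat model (the endpoint name is a placeholder):

```python
completion = openai_client.chat.completions.create(
    model="my-chat-endpoint",  # the serving endpoint name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```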

### Vector Search

Doc (Indexes): https://databricks-sdk-py.readthedocs.io/en/latest/workspace/vectorsearch/vector_search_indexes.html
Doc (Endpoints): https://databricks-sdk-py.readthedocs.io/en/latest/workspace/vectorsearch/vector_search_endpoints.html

```python
# List vector search indexes
for idx in w.vector_search_indexes.list_indexes(endpoint_name="my-vs-endpoint"):
    print(idx.name)

# Query an index
results = w.vector_search_indexes.query_index(
    index_name="main.default.my_index",
    columns=["id", "text", "embedding"],
    query_text="search query",
    num_results=10,
)
for doc in results.result.data_array:
    print(doc)
```

### Pipelines (Delta Live Tables)

Doc: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/pipelines/pipelines.html

```python
# List pipelines
for pipeline in w.pipelines.list_pipelines():
    print(f"{pipeline.name}: {pipeline.state}")

# Get a pipeline
pipeline = w.pipelines.get(pipeline_id="abc123")

# Start a pipeline update
w.pipelines.start_update(pipeline_id="abc123")

# Stop a pipeline
w.pipelines.stop_and_wait(pipeline_id="abc123")
```

### Secrets

Doc: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/workspace/secrets.html

```python
# List secret scopes
for scope in w.secrets.list_scopes():
    print(scope.name)

# Create a scope
w.secrets.create_scope(scope="my-scope")

# Put a secret
w.secrets.put_secret(scope="my-scope", key="api-key", string_value="secret123")

# Get a secret (returns a GetSecretResponse with the value)
secret = w.secrets.get_secret(scope="my-scope", key="api-key")

# List secrets in a scope (metadata only, not values)
for s in w.secrets.list_secrets(scope="my-scope"):
    print(s.key)
```
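One caveat worth flagging (verify against the doc above): the underlying REST API returns secret values base64-encoded, so `GetSecretResponse.value` typically needs decoding:

```python
import base64

secret = w.secrets.get_secret(scope="my-scope", key="api-key")
plaintext = base64.b64decode(secret.value).decode("utf-8")  # value arrives base64-encoded
```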

### DBUtils

Doc: https://databricks-sdk-py.readthedocs.io/en/latest/dbutils.html

```python
# Access dbutils through the WorkspaceClient
dbutils = w.dbutils

# File system operations
files = dbutils.fs.ls("/")
dbutils.fs.cp("dbfs:/source", "dbfs:/dest")
dbutils.fs.rm("dbfs:/path", recurse=True)

# Secrets (same as w.secrets, but via the dbutils interface)
value = dbutils.secrets.get(scope="my-scope", key="my-key")
```

## Common Patterns

### CRITICAL: Async Applications (FastAPI, etc.)

The Databricks SDK is fully synchronous: every call blocks the thread. In async applications (FastAPI, asyncio), you MUST wrap SDK calls with `asyncio.to_thread()` to avoid blocking the event loop.

```python
import asyncio

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# WRONG - blocks the event loop
async def get_clusters_bad():
    return list(w.clusters.list())  # BLOCKS!

# CORRECT - runs in a thread pool
async def get_clusters_good():
    return await asyncio.to_thread(lambda: list(w.clusters.list()))

# CORRECT - for simple calls
async def get_cluster(cluster_id: str):
    return await asyncio.to_thread(w.clusters.get, cluster_id)

# CORRECT - FastAPI endpoints
from fastapi import FastAPI

app = FastAPI()

@app.get("/clusters")
async def list_clusters():
    clusters = await asyncio.to_thread(lambda: list(w.clusters.list()))
    return [{"id": c.cluster_id, "name": c.cluster_name} for c in clusters]

@app.post("/query")
async def run_query(sql: str, warehouse_id: str):
    # Wrap the blocking SDK call
    response = await asyncio.to_thread(
        w.statement_execution.execute_statement,
        statement=sql,
        warehouse_id=warehouse_id,
        wait_timeout="30s",
    )
    return response.result.data_array
```

Note: `WorkspaceClient().config.host` is NOT a network call - it just reads config. There is no need to wrap property access.

### Wait for Long-Running Operations

```python
from datetime import timedelta

# Pattern 1: Use *_and_wait methods
cluster = w.clusters.create_and_wait(
    cluster_name="test",
    spark_version="14.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=2,
    timeout=timedelta(minutes=30),
)

# Pattern 2: Use the Wait object
wait = w.clusters.create(...)
cluster = wait.result()  # Blocks until ready

# Pattern 3: Manual polling with a callback
def progress(cluster):
    print(f"State: {cluster.state}")

cluster = w.clusters.wait_get_cluster_running(
    cluster_id="...",
    timeout=timedelta(minutes=30),
    callback=progress,
)
```

### Pagination

```python
# All list methods return iterators that handle pagination automatically
for job in w.jobs.list():  # Fetches all pages
    print(job.settings.name)

# Passing limit tunes the page size; the iterator still follows page tokens
response = w.jobs.list(limit=10)
for job in response:
    print(job)
```

### Error Handling

```python
from databricks.sdk.errors import NotFound, PermissionDenied, ResourceAlreadyExists

try:
    cluster = w.clusters.get(cluster_id="invalid-id")
except NotFound:
    print("Cluster not found")
except PermissionDenied:
    print("Access denied")
```

## When Uncertain

If I'm unsure about a method, I should check the documentation URL pattern:
`https://databricks-sdk-py.readthedocs.io/en/latest/workspace/{category}/{service}.html`

Common categories:

- Clusters: /workspace/compute/clusters.html
- Jobs: /workspace/jobs/jobs.html
- Tables: /workspace/catalog/tables.html
- Warehouses: /workspace/sql/warehouses.html
- Serving: /workspace/serving/serving_endpoints.html

Fetch and verify before providing guidance on parameters or return types.

## Quick Reference Links

| API | Documentation URL |
| --- | --- |
| Authentication | https://databricks-sdk-py.readthedocs.io/en/latest/authentication.html |
| Clusters | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/compute/clusters.html |
| Jobs | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/jobs/jobs.html |
| SQL Warehouses | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/sql/warehouses.html |
| Statement Execution | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/sql/statement_execution.html |
| Tables | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/catalog/tables.html |
| Catalogs | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/catalog/catalogs.html |
| Schemas | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/catalog/schemas.html |
| Volumes | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/catalog/volumes.html |
| Files | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/files/files.html |
| Serving Endpoints | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/serving/serving_endpoints.html |
| Vector Search | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/vectorsearch/vector_search_indexes.html |
| Pipelines | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/pipelines/pipelines.html |
| Secrets | https://databricks-sdk-py.readthedocs.io/en/latest/workspace/workspace/secrets.html |
| DBUtils | https://databricks-sdk-py.readthedocs.io/en/latest/dbutils.html |
