Runpod Flash

Write code locally, test it with flash run (dev server at localhost:8888), and Flash automatically provisions and deploys it to remote GPUs/CPUs in the cloud. Endpoint handles everything: one class covers all three modes described below.

Setup

    pip install runpod-flash          # requires Python >=3.10
    flash login                       # auth option 1: browser-based login (saves token locally)
    # auth option 2: API key via environment variable
    export RUNPOD_API_KEY=your_key
    flash init my-project             # scaffold a new project in ./my-project
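The setup steps name two prerequisites: Python >=3.10 and, for auth option 2, a RUNPOD_API_KEY environment variable. A minimal stdlib-only sketch of a pre-flight check follows. The helper name check_flash_prereqs is hypothetical and is not part of runpod-flash; it only inspects the local interpreter and environment.

```python
import os
import sys


def check_flash_prereqs(version_info=None, environ=None):
    """Return a list of problems found; an empty list means both checks passed.

    Hypothetical helper (not part of runpod-flash). Arguments default to the
    running interpreter and process environment so the function is easy to test.
    """
    version_info = sys.version_info if version_info is None else version_info
    environ = os.environ if environ is None else environ

    problems = []
    # The setup section states that Python >=3.10 is required.
    if tuple(version_info[:2]) < (3, 10):
        problems.append("Python >=3.10 is required")
    # Only relevant for auth option 2; `flash login` stores a token instead.
    if not environ.get("RUNPOD_API_KEY", "").strip():
        problems.append("RUNPOD_API_KEY is not set (fine if you used `flash login`)")
    return problems


if __name__ == "__main__":
    for problem in check_flash_prereqs():
        print("warning:", problem)
```

Running the file prints one warning per missing prerequisite and nothing when both are satisfied.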

CLI

    flash run                              # start local dev server at localhost:8888
    flash run --auto-provision             # same, but pre-provision endpoints (no cold start)
    flash build                            # package artifact for deployment (500MB limit)
    flash build --exclude pkg1,pkg2        # exclude packages from build
    flash deploy                           # build + deploy (auto-selects env if only one)
    flash deploy --env staging             # build + deploy to "staging" environment
    flash deploy --app my-app --env prod   # deploy a specific app to an environment
    flash deploy --preview                 # build + launch local preview in Docker
    flash env list                         # list deployment environments
    flash env create staging               # create "staging" environment
    flash env get staging                  # show environment details + resources
    flash env delete staging               # delete environment + tear down resources
    flash undeploy list                    # list all active endpoints
    flash undeploy my-endpoint             # remove a specific endpoint
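When deploys are driven from a script (CI, for example), it can be convenient to assemble the flash deploy command from parameters. The sketch below uses only the flags listed above (--app, --env, --preview). The helper name flash_deploy_args is hypothetical, and whether these flags can be combined beyond the combinations shown above is an assumption.

```python
import shlex


def flash_deploy_args(app=None, env=None, preview=False):
    """Build an argv list for `flash deploy` from the flags listed above.

    Hypothetical helper; it only assembles arguments and runs nothing.
    """
    args = ["flash", "deploy"]
    if app is not None:
        args += ["--app", app]
    if env is not None:
        args += ["--env", env]
    if preview:
        args.append("--preview")
    return args


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch has no side effects.
    print(shlex.join(flash_deploy_args(app="my-app", env="prod")))
```

To actually run the command, the returned list can be passed to subprocess.run(args, check=True).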

Endpoint: Three Modes

Mode 1: Your Code (Queue-Based Decorator)

One function = one endpoint with its own workers.

    from runpod_flash import Endpoint, GpuGroup

    @Endpoint(name="my-worker", gpu=GpuGroup.AMPERE_80, workers=5, dependencies=["torch"])
    async def compute(data):
        import torch  # MUST import inside function (cloudpickle)
        return {"sum": torch.tensor(data, device="cuda").sum().item()}

    result = await compute([1, 2, 3])

Mode 2: Your Code (Load-Balanced Routes)

Multiple HTTP routes share one pool of workers.

    from runpod_flash import Endpoint, GpuGroup

    api = Endpoint(name="my-api", gpu=GpuGroup.ADA_24, workers=(1, 5), dependencies=["torch"])

    @api.post("/predict")
    async def predict(data: list[float]):
        import torch
        return {"result": torch.tensor(data, device="cuda").sum().item()}

    @api.get("/health")
    async def health():
        return {"status": "ok"}

Mode 3: External Image (Client)

Deploy a pre-built Docker image and call it via HTTP.

    from runpod_flash import Endpoint, GpuGroup, PodTemplate

    server = Endpoint(
        name="my-server",
        image="my-org/my-image:latest",
        gpu=GpuGroup.AMPERE_80,
        workers=1,
        env={"HF_TOKEN": "xxx"},
        template=PodTemplate(containerDiskInGb=100),
    )

    # LB-style
    result = await server.post("/v1/completions", {"prompt": "hello"})
    models = await server.get("/v1/models")

    # QB-style
    job = await server.run({"prompt": "hello"})
    await job.wait()
    print(job.output)

Connect to an existing endpoint by ID (no provisioning):

    ep = Endpoint(id="abc123")
    job = await ep.runsync({"input": "hello"})
    print(job.output)

How Mode Is Determined

| Parameters | Mode |
|------------|------|
| name= only | Decorator (your code) |
| image= set | Client (deploys image, then HTTP calls) |
| id= set | Client (connects to existing, no provisioning) |

Endpoint Constructor

    # Parameter reference, not a literal call: gpu= and workers= each appear
    # twice to show their alternative forms. Pass only one of each.
    Endpoint(
        name="endpoint-name",                        # required (unless id= set)
        id=None,                                     # connect to existing endpoint
        gpu=GpuGroup.AMPERE_80,                      # single GPU type (default: ANY)
        gpu=[GpuGroup.ADA_24, GpuGroup.AMPERE_80],   # or list for auto-select by supply
        cpu=CpuInstanceType.CPU5C_4_8,               # CPU type (mutually exclusive with gpu)
        workers=5,                                   # shorthand for (0, 5)
        workers=(1, 5),                              # explicit (min, max)
        idle_timeout=60,                             # seconds before scale-down (default: 60)
        dependencies=["torch"],                      # pip packages for remote exec
        system_dependencies=["ffmpeg"],              # apt-get packages
        image="org/image:tag",                       # pre-built Docker image (client mode)
        env={"KEY": "val"},                          # environment variables
        volume=NetworkVolume(...),                   # persistent storage
        gpu_count=1,                                 # GPUs per worker
        template=PodTemplate(containerDiskInGb=100),
        flashboot=True,                              # fast cold starts
        execution_timeout_ms=0,                      # max execution time (0 = unlimited)
    )

- gpu= and cpu= are mutually exclusive
- workers=5 means (0, 5). Default is (0, 1)
- idle_timeout default is 60 seconds
- flashboot=True (default) -- enables fast cold starts via snapshot restore
- gpu_count -- GPUs per worker (default 1), use >1 for multi-GPU models

NetworkVolume

    NetworkVolume(name="my-vol", size=100)  # size in GB, default 100
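The workers shorthand and the mode rules above can be restated as two small pure-Python functions. This is only an illustrative sketch of the documented rules, not the library's implementation. The names normalize_workers and endpoint_mode are hypothetical, the error handling is an assumption, and so is the precedence of id= over image= when both are given.

```python
def normalize_workers(workers=None):
    """Expand the workers shorthand into an explicit (min, max) tuple."""
    if workers is None:
        return (0, 1)          # documented default
    if isinstance(workers, int):
        return (0, workers)    # documented shorthand: workers=5 means (0, 5)
    low, high = workers        # explicit (min, max)
    if low < 0 or high < low:
        # Assumption: the docs do not say how invalid ranges are handled.
        raise ValueError(f"invalid workers range: {workers!r}")
    return (low, high)


def endpoint_mode(name=None, image=None, endpoint_id=None):
    """Classify a set of Endpoint arguments using the mode table above.

    `endpoint_id` stands in for the constructor's id= parameter to avoid
    shadowing the Python builtin.
    """
    if endpoint_id is not None:
        return "client (existing endpoint)"
    if image is not None:
        return "client (deployed image)"
    if name is not None:
        return "decorator"
    raise ValueError("name= is required unless id= is set")
```

For example, normalize_workers(5) yields (0, 5), and endpoint_mode(name="my-server", image="my-org/my-image:latest") classifies the Mode 3 example as client mode.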

PodTemplate

    PodTemplate(
        containerDiskInGb=64,   # container disk size (default 64)
        dockerArgs="",          # extra docker arguments
        ports="",               # exposed ports
        startScript="",         # script to run on start
    )

EndpointJob

Returned by ep.run() and ep.runsync() in client mode.

    job = await ep.run({"data": [1, 2, 3]})
    await job.wait(timeout=120)  # poll until done
    print(job.id, job.output, job.error, job.done)
    await job.cancel()

GPU Types (GpuGroup)

| Enum | GPU | VRAM |
|------|-----|------|
| ANY | any | varies |
| AMPERE_16 | RTX A4000 | 16GB |
| AMPERE_24 | RTX A5000/L4 | 24GB |
| AMPERE_48 | A40/A6000 | 48GB |
| AMPERE_80 | A100 | 80GB |
| ADA_24 | RTX 4090 | 24GB |
| ADA_32_PRO | RTX 5090 | 32GB |
| ADA_48_PRO | RTX 6000 Ada | 48GB |
| ADA_80_PRO | H100 PCIe (80GB) / H100 HBM3 (80GB) / H100 NVL (94GB) | 80GB+ |
| HOPPER_141 | H200 | 141GB |

CPU Types (CpuInstanceType)

| Enum | vCPU | RAM | Max Disk | Type |
|------|------|-----|----------|------|
| CPU3G_1_4 | 1 | 4GB | 10GB | General |
| CPU3G_2_8 | 2 | 8GB | 20GB | General |
| CPU3G_4_16 | 4 | 16GB | 40GB | General |
| CPU3G_8_32 | 8 | 32GB | 80GB | General |
| CPU3C_1_2 | 1 | 2GB | 10GB | Compute |
| CPU3C_2_4 | 2 | 4GB | 20GB | Compute |
| CPU3C_4_8 | 4 | 8GB | 40GB | Compute |
| CPU3C_8_16 | 8 | 16GB | 80GB | Compute |
| CPU5C_1_2 | 1 | 2GB | 15GB | Compute (5th gen) |
| CPU5C_2_4 | 2 | 4GB | 30GB | Compute (5th gen) |
| CPU5C_4_8 | 4 | 8GB | 60GB | Compute (5th gen) |
| CPU5C_8_16 | 8 | 16GB | 120GB | Compute (5th gen) |

    from runpod_flash import Endpoint, CpuInstanceType

    @Endpoint(name="cpu-work", cpu=CpuInstanceType.CPU5C_4_8, workers=5, dependencies=["pandas"])
    async def process(data):
        import pandas as pd
        return pd.DataFrame(data).describe().to_dict()

Common Patterns

CPU + GPU Pipeline

    from runpod_flash import Endpoint, GpuGroup, CpuInstanceType

    @Endpoint(name="preprocess", cpu=CpuInstanceType.CPU5C_4_8, workers=5, dependencies=["pandas"])
    async def preprocess(raw):
        import pandas as pd
        return pd.DataFrame(raw).to_dict("records")

    @Endpoint(name="infer", gpu=GpuGroup.AMPERE_80, workers=5, dependencies=["torch"])
    async def infer(clean):
        import torch
        # One row per record, one column per field
        t = torch.tensor([list(r.values()) for r in clean], device="cuda")
        return {"predictions": t.mean(dim=1).tolist()}

    async def pipeline(data):
        return await infer(await preprocess(data))

Parallel Execution

    import asyncio

    results = await asyncio.gather(compute(a), compute(b), compute(c))

Gotchas

1. Imports outside function -- most common error. Put every import inside the decorated function.
2. Forgetting await -- all decorated functions and client methods need await.
3. Missing dependencies -- must list in dependencies=[].
4. gpu/cpu are exclusive -- pick one per Endpoint.
5. idle_timeout is seconds -- default 60s, not minutes.
6. 10MB payload limit -- pass URLs, not large objects.
7. Client vs decorator -- image= or id= set means client mode. Otherwise it is decorator mode.
8. Auto GPU switching requires workers >= 5 -- pass a list of GPU types (e.g. gpu=[GpuGroup.ADA_24, GpuGroup.AMPERE_80]) and set workers=5 or higher. The platform only auto-switches GPU types based on supply when max workers is at least 5.
9. runsync timeout is 60s -- cold starts can exceed 60s. Use ep.runsync(data, timeout=120) for first requests, or use ep.run() + job.wait() instead.
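The 10MB payload limit is easy to hit by accident, so a local size check before calling an endpoint can fail fast. The sketch below is stdlib-only and rests on two loudly stated assumptions: that the limit applies to the JSON-serialized request body, and that "10MB" means 10 * 1024 * 1024 bytes. The helper names are hypothetical and not part of runpod-flash.

```python
import json

# Assumption: "10MB" is read as 10 MiB; the exact byte count is not documented here.
MAX_PAYLOAD_BYTES = 10 * 1024 * 1024


def payload_size_bytes(payload):
    """Size of the payload once serialized to UTF-8 encoded JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_payload_limit(payload, limit=MAX_PAYLOAD_BYTES):
    """True if the serialized payload is within the (assumed) limit."""
    return payload_size_bytes(payload) <= limit


if __name__ == "__main__":
    small = {"url": "https://example.com/big-file.bin"}  # pass a URL, not the bytes
    print(payload_size_bytes(small), fits_payload_limit(small))
```

When the check fails, the gotcha's advice applies: upload the data elsewhere and send only its URL.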
