# varg sdk - declarative ai video orchestration

jsx-based ai video generation. describe scenes declaratively, render videos automatically.
## what is this?
varg sdk is a declarative video orchestration framework. instead of manually calling apis, stitching clips, and managing async workflows, you describe what you want in jsx and the runtime handles:
- parallel generation of images/videos/audio
- automatic caching (re-renders reuse cached assets)
- ffmpeg composition under the hood
- provider abstraction (fal, elevenlabs, replicate)
think of it like react for video - you declare the structure, the engine figures out the execution.
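a minimal sketch of what a scene file looks like (component and prop names follow the examples later in this doc; the exact prop values here are our own, not verified defaults):

```tsx
// minimal scene sketch: one 5-second clip with a generated image and a title overlay
// (illustrative only - props follow the patterns used in the full examples)
export default (
  <Render width={1080} height={1920}>
    <Clip duration={5}>
      <Image prompt="sunrise over misty mountains, golden hour, cinematic" model={fal.imageModel("flux-schnell")} />
      <Title position="bottom" color="#ffffff">hello, varg</Title>
    </Clip>
  </Render>
);
```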
## terminology
| term | meaning |
| --- | --- |
| `Render` | root container - sets dimensions (1080x1920 for tiktok, 1920x1080 for youtube) |
| `Clip` | timeline segment with duration; contains visual/audio layers |
| `Image` | static image - generated from prompt or loaded from file |
| `Video` | video clip - text-to-video OR image-to-video animation |
| `Music` | background audio - generated from prompt or loaded from file |
| `Speech` | text-to-speech with voice selection |
| `Title` / `Subtitle` | text overlays with positioning |
| `Captions` | auto-generated captions from a `Speech` element |
| `Grid` | layout helper for multi-image/video grids |
## core concepts
no imports needed - the render runtime auto-imports all components (Render, Clip, Video, Image, Music, Speech, Title, Grid, etc.) and providers (fal, elevenlabs). just write jsx and export default.
### Video component
Video handles both text-to-video and image-to-video:
```tsx
// text-to-video - generate from scratch (prompt string elided)
<Video prompt="..." model={fal.videoModel("wan-2.5")} />

// image-to-video - animate an image with motion description
<Video prompt={{ text: "...", images: [hero] }} model={fal.videoModel("wan-2.5")} />
```
### character consistency
generate an image first, pass it to Video/Animate for consistent characters across scenes:
```tsx
// generate the character once, then reuse it across clips (prompt elided)
const hero = <Image prompt="..." model={fal.imageModel("flux-pro")} />;
```
for reference-based generation (keeping a real person's likeness):
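a hedged sketch, assuming the `nano-banana-pro/edit` model listed under available models accepts reference images the same way `Video` does (the `prompt={{ text, images }}` shape and the file-path form are assumptions, not confirmed API):

```tsx
// sketch: reference-based image generation to preserve a real person's likeness
// (assumes Image also accepts the { text, images } prompt object with an edit model)
const likeness = (
  <Image
    prompt={{
      text: "portrait in soft window light, photorealistic, shallow depth of field",
      images: ["./reference/portrait.jpg"]
    }}
    model={fal.imageModel("nano-banana-pro/edit")}
  />
);
```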
## prompting guide (CRITICAL)
detailed prompts = better results. the wan 2.5 video model responds dramatically better to rich, structured prompts; vague prompts produce generic, low-quality output.
### the 4-dimensional formula
every video prompt should include these dimensions:
```
[Subject Description] + [Scene Description] + [Motion/Action] + [Cinematic Controls]
```
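the formula is just ordered concatenation; a hypothetical helper (not part of the sdk) makes the structure concrete:

```typescript
// hypothetical helper - joins the four prompt dimensions in order
function buildPrompt(subject: string, scene: string, motion: string, cinematic: string): string {
  return [subject, scene, motion, cinematic].join(", ");
}

const prompt = buildPrompt(
  "a young woman with shoulder-length auburn hair, worn leather jacket",
  "rain-slicked tokyo street at night, neon reflections on wet pavement",
  "walking slowly through the crowd, shoulders hunched against the rain",
  "tracking shot, soft light, cinematic, film grain"
);
console.log(prompt);
```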
### subject description
describe the main focus with rich detail:
- appearance, clothing, features, posture
- materials, textures, colors
- emotional state, expression
bad: "a woman"

good: "a young woman with shoulder-length auburn hair, wearing a worn leather jacket over a white t-shirt, her green eyes reflecting determination, jaw set with quiet resolve"
### scene description
layer the environment details:
- location, background, foreground
- lighting type (soft light, hard light, edge light, top light)
- time period (golden hour, daytime, night, moonlight)
- atmosphere (warm tones, low saturation, high contrast)
bad: "in a city"

good: "on a rain-slicked tokyo street at night, neon signs reflecting pink and blue on wet pavement, steam rising from a nearby food stall, warm yellow light spilling from a ramen shop window"
### motion/action description
specify movement with intensity and manner:
- amplitude (small movements, large movements)
- rate (slowly, quickly, explosively)
- effects (breaking glass, hair flowing, dust particles)
bad: "walking"

good: "walking slowly through the crowd, shoulders hunched against the rain, one hand raised to shield her eyes, water droplets catching the neon light as they fall from her fingertips"
### cinematic controls
use film language for professional results:
camera movements:
- camera push-in / camera pull-out
- tracking shot / dolly shot
- pan left/right / tilt up/down
- static camera / fixed camera
- handheld / steadicam
shot types:
- extreme close-up / close-up / medium close-up
- medium shot / medium wide shot
- wide shot / extreme wide shot
- over-the-shoulder shot
composition:
- center composition / rule of thirds
- left-side composition / right-side composition
- symmetrical composition
- low angle / high angle / eye-level
lighting keywords:
- soft light / hard light / edge light
- rim light / backlight / side light
- golden hour / blue hour / moonlight
- practical light / mixed light

### style keywords
add visual style for artistic direction:
- cinematic / film grain / anamorphic
- cyberpunk / noir / post-apocalyptic
- ghibli style / anime / photorealistic
- vintage / retro / futuristic
- tilt-shift photography / time-lapse

### audio/dialogue (for videos with speech)
wan 2.5 supports native audio generation:
- format dialogue: Character says: "Your line here."
- keep lines short (under 10 words per 5-second clip) for best lip-sync
- specify delivery tone: "speaking quietly", "calling out over wind"
- for silence: include "no dialogue" or "actors not speaking"

### image-to-video prompts
when using reference images, focus on motion + camera:
- the image already defines subject/scene/style
- describe dynamics: "running forward, hair flowing behind"
- add camera movement: "slow push-in on face"
- control intensity: "subtle movement" vs "dramatic action"

### example: detailed prompt
bad prompt (vague) vs. good prompt (detailed):
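an illustrative pair (our own strings, assembled with the four-dimensional formula above, not taken from the model docs):

```text
bad:  "a knight fighting a dragon"

good: "an armored knight with a dented steel breastplate and a torn crimson cape,
standing on a storm-lit cliff edge at blue hour, rain streaking sideways, a dragon
circling overhead, he raises his sword slowly as embers drift past his visor,
low angle wide shot, camera push-in, hard light, cinematic, film grain"
```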
## example 1: cinematic character video
consistent character across multiple dramatic scenes with epic music.
`warrior-princess.tsx` (complete file - no imports needed):
```tsx
// Warrior Princess - Cinematic Character Video
// run: bunx vargai@latest render warrior-princess.tsx --verbose

// note: the original scene prompts are elided; "..." marks the gaps
const hero = (
  <Image prompt="..." model={fal.imageModel("flux-pro")} />
);

export default (
  <Render width={1920} height={1080}>
    {/* clips that reuse `hero` via prompt={{ text: "...", images: [hero] }} */}
  </Render>
);
```
## example 2: tiktok product video with talking head
animated influencer character with speech, captions, and background music.
`skincare-promo.tsx` (complete file - no imports needed):
```tsx
// Skincare Product TikTok - Talking Head with Captions
// run: bunx vargai@latest render skincare-promo.tsx --verbose

// note: the original influencer and speech prompts are elided; "..." marks the gaps
const influencer = (
  <Image prompt="..." model={fal.imageModel("flux-pro")} />
);

const speech = (
  <Speech prompt="..." />
);

export default (
  <Render width={1080} height={1920}>
    <Clip duration={3}>
      <Video
        prompt={{
          text: "eyes widen dramatically in genuine surprise, eyebrows shoot up, mouth opens into excited smile, subtle forward lean toward camera as if sharing a secret. natural blinking, authentic micro-expressions",
          images: [influencer]
        }}
        model={fal.videoModel("wan-2.5")}
      />
      {speech}
    </Clip>
    <Clip duration={4}>
      <Video
        prompt="product shot: sleek skincare bottle rotating slowly on marble surface, soft key light from above, pink and gold gradient background, light rays catching the glass, water droplets on bottle suggesting freshness. smooth 360 rotation, professional product photography, beauty commercial aesthetic"
        model={fal.videoModel("kling-v2.5")}
      />
      <Title position="bottom" color="#ffffff">LINK IN BIO</Title>
    </Clip>
    <Clip duration={3}>
      <Video
        prompt={{
          text: "enthusiastic nodding while speaking, hands come into frame gesturing excitedly, genuine smile reaching her eyes, occasional hair flip, pointing at camera on 'go get it'",
          images: [influencer]
        }}
        model={fal.videoModel("wan-2.5")}
      />
    </Clip>
    <Captions src={speech} style="tiktok" activeColor="#ff00ff" />
  </Render>
);
```
## example 3: multi-scene video grid with elements
4-panel nature video grid showcasing different elements.
`four-elements.tsx` (complete file - no imports needed):
```tsx
// Four Elements - 2x2 Video Grid
// run: bunx vargai@latest render four-elements.tsx --verbose

// note: the original 2x2 <Grid> of element videos is elided
export default (
  <Render width={1920} height={1080}>
    <Music prompt="ambient electronic, peaceful synthesizer pads, gentle nature sounds mixed in, meditative atmosphere, spa music vibes but more cinematic, slow build with subtle percussion" duration={8} />
  </Render>
);
```
## render preview first (recommended workflow)
use --preview to generate only images/thumbnails without rendering full videos. this is faster and cheaper for iteration:
```sh
bunx vargai@latest render video.tsx --preview --verbose
```
once you're happy with the preview frames, render the full video:
```sh
bunx vargai@latest render video.tsx --verbose
```
## open the result
after rendering, open the video with the system default player:
```sh
open output/video.mp4   # macOS; on linux, use xdg-open
```
## setup
requires api keys in .env:
- `FAL_KEY` - image/video generation (https://fal.ai/dashboard/keys)
- `ELEVENLABS_API_KEY` - music/voice (https://elevenlabs.io/app/settings/api-keys)

## available models

```tsx
// image generation
fal.imageModel("flux-schnell")         // fast, good quality
fal.imageModel("flux-pro")             // highest quality
fal.imageModel("nano-banana-pro/edit") // reference-based (character consistency)

// video generation
fal.videoModel("wan-2.5")    // balanced speed/quality
fal.videoModel("kling-v2.5") // highest quality, 10s max
fal.videoModel("sync-v2")    // lipsync

// audio
elevenlabs.speechModel("turbo") // fast TTS
elevenlabs.musicModel()         // music generation
```
## key props reference

| component | key props |
| --- | --- |