- tavily crawl
- Crawl a website and extract content from multiple pages. Supports saving each page as a local markdown file.
- Prerequisites
- Requires the Tavily CLI. See
- tavily-cli
- for install and auth setup.
- Quick install:
- curl -fsSL https://cli.tavily.com/install.sh | bash && tvly login
- When to use
- You need content from many pages on a site (e.g., all
- /docs/
- )
- You want to download documentation for offline use
- Step 4 in the
- workflow
- search → extract → map → crawl → research Quick start
Basic crawl
tvly crawl "https://docs.example.com" --json
Save each page as a markdown file
tvly crawl "https://docs.example.com" --output-dir ./docs/
Deeper crawl with limits
tvly crawl "https://docs.example.com" --max-depth 2 --limit 50 --json
Filter to specific paths
tvly crawl "https://example.com" --select-paths "/api/.,/guides/." --exclude-paths "/blog/.*" --json
Semantic focus (returns relevant chunks, not full pages)
tvly crawl "https://docs.example.com" --instructions "Find authentication docs" --chunks-per-source 3 --json Options Option Description --max-depth Levels deep (1-5, default: 1) --max-breadth Links per page (default: 20) --limit Total pages cap (default: 50) --instructions Natural language guidance for semantic focus --chunks-per-source Chunks per page (1-5, requires --instructions ) --extract-depth basic (default) or advanced --format markdown (default) or text --select-paths Comma-separated regex patterns to include --exclude-paths Comma-separated regex patterns to exclude --select-domains Comma-separated regex for domains to include --exclude-domains Comma-separated regex for domains to exclude --allow-external / --no-external Include external links (default: allow) --include-images Include images --timeout Max wait (10-150 seconds) -o, --output Save JSON output to file --output-dir Save each page as a .md file in directory --json Structured JSON output Crawl for context vs. data collection For agentic use (feeding results to an LLM): Always use --instructions + --chunks-per-source . Returns only relevant chunks instead of full pages — prevents context explosion. tvly crawl "https://docs.example.com" --instructions "API authentication" --chunks-per-source 3 --json For data collection (saving to files): Use --output-dir without --chunks-per-source to get full pages as markdown files. tvly crawl "https://docs.example.com" --max-depth 2 --output-dir ./docs/ Tips Start conservative — --max-depth 1 , --limit 20 — and scale up. Use --select-paths to focus on the section you need. Use map first to understand site structure before a full crawl. Always set --limit to prevent runaway crawls. See also tavily-map — discover URLs before deciding to crawl tavily-extract — extract individual pages tavily-search — find pages when you don't have a URL