# Crawl4AI Overview

Crawl4AI provides comprehensive web crawling and data extraction capabilities. This skill supports both the CLI (recommended for quick tasks) and the Python SDK (for programmatic control).

Choose your interface:
- **CLI (`crwl`)** - Quick, scriptable commands: CLI Guide
- **Python SDK** - Full programmatic control: SDK Guide

## Quick Start

### Installation

```bash
pip install crawl4ai
crawl4ai-setup

# Verify installation
crawl4ai-doctor
```

### CLI (Recommended)
```bash
# Basic crawling - returns markdown
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache

# See more examples
crwl --example
```
### Python SDK

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])

asyncio.run(main())
```
For SDK configuration details: SDK Guide - Configuration (lines 61-150)
## Core Concepts

### Configuration Layers

Both CLI and SDK use the same underlying configuration:

| Concept | CLI | SDK |
|---|---|---|
| Browser settings | `-B browser.yml` or `-b "param=value"` | `BrowserConfig(...)` |
| Crawl settings | `-C crawler.yml` or `-c "param=value"` | `CrawlerRunConfig(...)` |
| Extraction | `-e extract.yml -s schema.json` | `extraction_strategy=...` |
| Content filter | `-f filter.yml` | `markdown_generator=...` |
### Key Parameters

**Browser Configuration:**
- `headless` - Run with/without GUI
- `viewport_width/height` - Browser dimensions
- `user_agent` - Custom user agent
- `proxy_config` - Proxy settings

**Crawler Configuration:**
- `page_timeout` - Max page load time (ms)
- `wait_for` - CSS selector or JS condition to wait for
- `cache_mode` - bypass, enabled, disabled
- `js_code` - JavaScript to execute
- `css_selector` - Focus on a specific element
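A minimal SDK sketch wiring several of these parameters together (the URL, selector, and dimension values here are illustrative placeholders, not defaults):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Browser-level settings: headless mode and window size
    browser_cfg = BrowserConfig(headless=True, viewport_width=1280, viewport_height=800)

    # Per-crawl settings: timeout, wait condition, caching behavior
    run_cfg = CrawlerRunConfig(
        page_timeout=60000,           # max page load time in ms
        wait_for="css:.content",      # wait for this selector before returning
        cache_mode=CacheMode.BYPASS,  # skip the local cache for this run
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://example.com", config=run_cfg)
        print(result.markdown[:300])

asyncio.run(main())
```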
For complete parameters: CLI Config | SDK Config

## Output Content

Every crawl returns:
- `markdown` - Clean, formatted markdown
- `html` - Raw HTML
- `links` - Internal and external links discovered
- `media` - Images, videos, audio found
- `extracted_content` - Structured data (if extraction configured)
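A short sketch of reading these fields from a crawl result (the `internal` and `images` dictionary keys are assumptions about the result layout, used here for illustration):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:200])                 # clean markdown
        print(len(result.html))                      # raw HTML size
        print(result.links.get("internal", [])[:3])  # discovered internal links
        print(result.media.get("images", [])[:3])    # discovered images
        print(result.extracted_content)              # None unless an extraction strategy was configured

asyncio.run(main())
```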
## Markdown Generation (Primary Use Case)

Crawl4AI excels at generating clean, well-formatted markdown.

**CLI**

```bash
# Basic markdown
crwl https://docs.example.com -o markdown

# Filtered markdown (removes noise)
crwl https://docs.example.com -o markdown-fit

# With content filter
crwl https://docs.example.com -f filter_bm25.yml -o markdown-fit
```

Filter configuration:
```yaml
# filter_bm25.yml (relevance-based)
type: "bm25"
query: "machine learning tutorials"
threshold: 1.0
```

**Python SDK**

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

bm25_filter = BM25ContentFilter(user_query="machine learning", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)

result = await crawler.arun(url, config=config)
print(result.markdown.fit_markdown)  # Filtered
print(result.markdown.raw_markdown)  # Original
```
For content filters: Content Processing (lines 2481-3101)

## Data Extraction

### 1. Schema-Based CSS Extraction (Most Efficient)

No LLM required - fast, deterministic, cost-free.

**CLI:**
```bash
# Generate schema once (uses LLM)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Use schema for extraction (no LLM)
crwl https://shop.com -e extract_css.yml -s product_schema.json -o json
```

Schema format:

```json
{
  "name": "products",
  "baseSelector": ".product-card",
  "fields": [
    { "name": "title", "selector": "h2", "type": "text" },
    { "name": "price", "selector": ".price", "type": "text" },
    { "name": "link", "selector": "a", "type": "attribute", "attribute": "href" }
  ]
}
```
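The same schema can be used from the SDK with `JsonCssExtractionStrategy`; a sketch, assuming the generated schema was saved as `product_schema.json`:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Load the schema generated earlier; no LLM is needed at crawl time
    with open("product_schema.json") as f:
        schema = json.load(f)

    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://shop.com", config=config)
        products = json.loads(result.extracted_content)  # list of dicts matching the schema fields
        print(products[:3])

asyncio.run(main())
```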
### 2. LLM-Based Extraction

For complex or irregular content:

**CLI:**

```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4o-mini"
instruction: "Extract product names and prices"
api_token: "your-token"
```

```bash
crwl https://shop.com -e extract_llm.yml -o json
```

For extraction details: Extraction Strategies (lines 4522-5429)

## Advanced Patterns

### Dynamic Content (JavaScript-Heavy Sites)

**CLI:**

```bash
crwl https://example.com -c "wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000"
```

Crawler config:
```yaml
# crawler.yml
wait_for: "css:.ajax-content"
scan_full_page: true
page_timeout: 60000
delay_before_return_html: 2.0
```

### Multi-URL Processing

**CLI (sequential):**

```bash
for url in url1 url2 url3; do crwl "$url" -o markdown; done
```

**Python SDK (concurrent):**

```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await crawler.arun_many(urls, config=config)
```
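A fuller, self-contained sketch of the same concurrent pattern (placeholder URLs; `arun_many()` is treated here as returning one result per URL):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
    config = CrawlerRunConfig(page_timeout=30000)

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config)
        for result in results:
            # Each result carries the URL it came from and a success flag
            if result.success:
                print(result.url, "->", len(result.markdown), "chars of markdown")
            else:
                print("failed:", result.url)

asyncio.run(main())
```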
For batch processing: arun_many() Reference (lines 1057-1224)

### Session & Authentication

**CLI:**

```yaml
# login_crawler.yml
session_id: "user_session"
js_code: |
  document.querySelector('#username').value = 'user';
  document.querySelector('#password').value = 'pass';
  document.querySelector('#submit').click();
wait_for: "css:.dashboard"
```
```bash
# Login
crwl https://site.com/login -C login_crawler.yml

# Access protected content (session reused)
crwl https://site.com/protected -c "session_id=user_session"
```
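An equivalent SDK sketch that reuses a `session_id` across calls (selectors, URLs, and credentials are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    login_js = """
    document.querySelector('#username').value = 'user';
    document.querySelector('#password').value = 'pass';
    document.querySelector('#submit').click();
    """

    # First call logs in and tags the browser session
    login_cfg = CrawlerRunConfig(session_id="user_session", js_code=login_js,
                                 wait_for="css:.dashboard")
    # Later calls reuse the same session
    reuse_cfg = CrawlerRunConfig(session_id="user_session")

    async with AsyncWebCrawler() as crawler:
        await crawler.arun("https://site.com/login", config=login_cfg)
        result = await crawler.arun("https://site.com/protected", config=reuse_cfg)
        print(result.markdown[:300])

asyncio.run(main())
```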
For session management: Advanced Features (lines 5429-5940)

### Anti-Detection & Proxies

**CLI:**

```yaml
# browser.yml
headless: true
proxy_config:
  server: "http://proxy:8080"
  username: "user"
  password: "pass"
user_agent_mode: "random"
```

```bash
crwl https://example.com -B browser.yml
```
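A corresponding SDK sketch (the proxy address and credentials are placeholders; `proxy_config` is assumed here to accept a plain dict mirroring the YAML above):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    browser_cfg = BrowserConfig(
        headless=True,
        # Route all traffic through the proxy defined in browser.yml
        proxy_config={"server": "http://proxy:8080", "username": "user", "password": "pass"},
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])

asyncio.run(main())
```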
## Common Use Cases

### Documentation to Markdown

```bash
crwl https://docs.example.com -o markdown > docs.md
```

### E-commerce Product Monitoring
```bash
# Generate schema once
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Monitor (no LLM costs)
crwl https://shop.com -e extract_css.yml -s schema.json -o json
```

### News Aggregation
```bash
# Multiple sources with filtering
for url in news1.com news2.com news3.com; do
  crwl "https://$url" -f filter_bm25.yml -o markdown-fit
done
```

### Interactive Q&A
```bash
# First view content
crwl https://example.com -o markdown

# Then ask questions
crwl https://example.com -q "What are the main conclusions?"
crwl https://example.com -q "Summarize the key points"
```

## Resources

### Provided Scripts

- `scripts/extraction_pipeline.py` - Schema generation and extraction
- `scripts/basic_crawler.py` - Simple markdown extraction
- `scripts/batch_crawler.py` - Multi-URL processing

### Reference Documentation

| Document | Purpose |
|---|---|
| CLI Guide | Command-line interface reference |
| SDK Guide | Python SDK quick reference |
| Complete SDK Reference | Full API documentation (5900+ lines) |

## Best Practices

- **Start with CLI** for quick tasks, SDK for automation
- **Use schema-based extraction** - 10-100x more efficient than LLM
- **Enable caching** during development - `--bypass-cache` only when needed
- **Set appropriate timeouts** - 30s normal, 60s+ for JS-heavy sites
- **Use content filters** for cleaner, focused markdown
- **Respect rate limits** - add delays between requests

## Troubleshooting

### JavaScript Not Loading

```bash
crwl https://example.com -c "wait_for=css:.dynamic-content,page_timeout=60000"
```

### Bot Detection Issues

```bash
crwl https://example.com -B browser.yml
```
```yaml
# browser.yml
headless: false
viewport_width: 1920
viewport_height: 1080
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
```

### Content Not Extracted
```bash
# Debug: see full output
crwl https://example.com -o all -v

# Try a different wait strategy
crwl https://example.com -c "wait_for=js:document.querySelector('.content')!==null"
```

### Session Issues
```bash
# Verify session
crwl https://site.com -c "session_id=test" -o all | grep -i session
```

For comprehensive API documentation, see Complete SDK Reference.