- Firecrawl Web Scraper Skill
- Status
-
- Production Ready
- Last Updated
-
- 2026-01-20
- Official Docs
- :
- https://docs.firecrawl.dev
- API Version
-
- v2
- SDK Versions
- firecrawl-py 4.13.0+, @mendable/firecrawl-js 4.11.1+ What is Firecrawl? Firecrawl is a Web Data API for AI that turns websites into LLM-ready markdown or structured data. It handles: JavaScript rendering - Executes client-side JavaScript to capture dynamic content Anti-bot bypass - Gets past CAPTCHA and bot detection systems Format conversion - Outputs as markdown, HTML, JSON, screenshots, summaries Document parsing - Processes PDFs, DOCX files, and images Autonomous agents - AI-powered web data gathering without URLs Change tracking - Monitor content changes over time Branding extraction - Extract color schemes, typography, logos API Endpoints Overview Endpoint Purpose Use Case /scrape Single page Extract article, product page /crawl Full site Index docs, archive sites /map URL discovery Find all pages, plan strategy /search Web search + scrape Research with live data /extract Structured data Product prices, contacts /agent Autonomous gathering No URLs needed, AI navigates /batch-scrape Multiple URLs Bulk processing 1. Scrape Endpoint ( /v2/scrape ) Scrapes a single webpage and returns clean, structured content. Basic Usage from firecrawl import Firecrawl import os app = Firecrawl ( api_key = os . environ . get ( "FIRECRAWL_API_KEY" ) )
Basic scrape
doc
app . scrape ( url = "https://example.com/article" , formats = [ "markdown" , "html" ] , only_main_content = True ) print ( doc . markdown ) print ( doc . metadata ) import FirecrawlApp from '@mendable/firecrawl-js' ; const app = new FirecrawlApp ( { apiKey : process . env . FIRECRAWL_API_KEY } ) ; const result = await app . scrapeUrl ( 'https://example.com/article' , { formats : [ 'markdown' , 'html' ] , onlyMainContent : true } ) ; console . log ( result . markdown ) ; Output Formats Format Description markdown LLM-optimized content html Full HTML rawHtml Unprocessed HTML screenshot Page capture (with viewport options) links All URLs on page json Structured data extraction summary AI-generated summary branding Design system data changeTracking Content change detection Advanced Options doc = app . scrape ( url = "https://example.com" , formats = [ "markdown" , "screenshot" ] , only_main_content = True , remove_base64_images = True , wait_for = 5000 ,
Wait 5s for JS
timeout
30000 ,
Location & language
location
{ "country" : "AU" , "languages" : [ "en-AU" ] } ,
Cache control
max_age
0 ,
Fresh content (no cache)
store_in_cache
True ,
Stealth mode for complex sites
stealth
True ,
Custom headers
headers
{ "User-Agent" : "Custom Bot 1.0" } ) Browser Actions Perform interactions before scraping: doc = app . scrape ( url = "https://example.com" , actions = [ { "type" : "click" , "selector" : "button.load-more" } , { "type" : "wait" , "milliseconds" : 2000 } , { "type" : "scroll" , "direction" : "down" } , { "type" : "write" , "selector" : "input#search" , "text" : "query" } , { "type" : "press" , "key" : "Enter" } , { "type" : "screenshot" }
Capture state mid-action
] ) JSON Mode (Structured Extraction)
With schema
doc
app . scrape ( url = "https://example.com/product" , formats = [ "json" ] , json_options = { "schema" : { "type" : "object" , "properties" : { "title" : { "type" : "string" } , "price" : { "type" : "number" } , "in_stock" : { "type" : "boolean" } } } } )
Without schema (prompt-only)
doc
app . scrape ( url = "https://example.com/product" , formats = [ "json" ] , json_options = { "prompt" : "Extract the product name, price, and availability" } ) Branding Extraction Extract design system and brand identity: doc = app . scrape ( url = "https://example.com" , formats = [ "branding" ] )
Returns:
- Color schemes and palettes
- Typography (fonts, sizes, weights)
- Spacing and layout metrics
- UI component styles
- Logo and imagery URLs
- Brand personality traits
- Crawl Endpoint ( /v2/crawl ) Crawls all accessible pages from a starting URL. result = app . crawl ( url = "https://docs.example.com" , limit = 100 , max_depth = 3 , allowed_domains = [ "docs.example.com" ] , exclude_paths = [ "/api/" , "/admin/" ] , scrape_options = { "formats" : [ "markdown" ] , "only_main_content" : True } ) for page in result . data : print ( f"Scraped: { page . metadata . source_url } " ) print ( f"Content: { page . markdown [ : 200] } ..." ) Async Crawl with Webhooks
Start crawl (returns immediately)
job
app . start_crawl ( url = "https://docs.example.com" , limit = 1000 , webhook = "https://your-domain.com/webhook" ) print ( f"Job ID: { job . id } " )
Or poll for status
status
app . check_crawl_status ( job . id ) 3. Map Endpoint ( /v2/map ) Rapidly discover all URLs on a website without scraping content. urls = app . map ( url = "https://example.com" ) print ( f"Found { len ( urls ) } pages" ) for url in urls [ : 10 ] : print ( url ) Use for: sitemap discovery, crawl planning, website audits. 4. Search Endpoint ( /search ) - NEW Perform web searches and optionally scrape the results in one operation.
Basic search
results
app . search ( query = "best practices for React server components" , limit = 10 ) for result in results : print ( f" { result . title } : { result . url } " )
Search + scrape results
results
app . search ( query = "React server components tutorial" , limit = 5 , scrape_options = { "formats" : [ "markdown" ] , "only_main_content" : True } ) for result in results : print ( f" { result . title } " ) print ( result . markdown [ : 500 ] ) Search Options results = app . search ( query = "machine learning papers" , limit = 20 ,
Filter by source type
sources
[ "web" , "news" , "images" ] ,
Filter by category
categories
[ "github" , "research" , "pdf" ] ,
Location
location
{ "country" : "US" } ,
Time filter
tbs
"qdr:m" ,
Past month (qdr:h=hour, qdr:d=day, qdr:w=week, qdr:y=year)
timeout
- 30000
- )
- Cost
- 2 credits per 10 results + scraping costs if enabled. 5. Extract Endpoint ( /v2/extract ) AI-powered structured data extraction from single pages, multiple pages, or entire domains. Single Page from pydantic import BaseModel class Product ( BaseModel ) : name : str price : float description : str in_stock : bool result = app . extract ( urls = [ "https://example.com/product" ] , schema = Product , system_prompt = "Extract product information" ) print ( result . data ) Multi-Page / Domain Extraction
Extract from entire domain using wildcard
result
app . extract ( urls = [ "example.com/*" ] ,
All pages on domain
schema
Product , system_prompt = "Extract all products" )
Enable web search for additional context
result
app . extract ( urls = [ "example.com/products" ] , schema = Product , enable_web_search = True
Follow external links
) Prompt-Only Extraction (No Schema) result = app . extract ( urls = [ "https://example.com/about" ] , prompt = "Extract the company name, founding year, and key executives" )
LLM determines output structure
- Agent Endpoint ( /agent ) - NEW Autonomous web data gathering without requiring specific URLs. The agent searches, navigates, and gathers data using natural language prompts.
Basic agent usage
result
app . agent ( prompt = "Find the pricing plans for the top 3 headless CMS platforms and compare their features" ) print ( result . data )
With schema for structured output
from pydantic import BaseModel from typing import List class CMSPricing ( BaseModel ) : name : str free_tier : bool starter_price : float features : List [ str ] result = app . agent ( prompt = "Find pricing for Contentful, Sanity, and Strapi" , schema = CMSPricing )
Optional: focus on specific URLs
result
app . agent ( prompt = "Extract the enterprise pricing details" , urls = [ "https://contentful.com/pricing" , "https://sanity.io/pricing" ] ) Agent Models Model Best For Cost spark-1-mini (default) Simple extractions, high volume Standard spark-1-pro Complex analysis, ambiguous data 60% more result = app . agent ( prompt = "Analyze competitive positioning..." , model = "spark-1-pro"
For complex tasks
) Async Agent
Start agent (returns immediately)
job
app . start_agent ( prompt = "Research market trends..." )
Poll for results
status
- app
- .
- check_agent_status
- (
- job
- .
- id
- )
- if
- status
- .
- status
- ==
- "completed"
- :
- (
- status
- .
- data
- )
- Note
- Agent is in Research Preview. 5 free daily requests, then credit-based billing. 7. Batch Scrape - NEW Process multiple URLs efficiently in a single operation. Synchronous (waits for completion) results = app . batch_scrape ( urls = [ "https://example.com/page1" , "https://example.com/page2" , "https://example.com/page3" ] , formats = [ "markdown" ] , only_main_content = True ) for page in results . data : print ( f" { page . metadata . source_url } : { len ( page . markdown ) } chars" ) Asynchronous (with webhooks) job = app . start_batch_scrape ( urls = url_list , formats = [ "markdown" ] , webhook = "https://your-domain.com/webhook" )
Webhook receives events: started, page, completed, failed
const job = await app . startBatchScrape ( urls , { formats : [ 'markdown' ] , webhook : 'https://your-domain.com/webhook' } ) ; // Poll for status const status = await app . checkBatchScrapeStatus ( job . id ) ; 8. Change Tracking - NEW Monitor content changes over time by comparing scrapes.
Enable change tracking
doc
app . scrape ( url = "https://example.com/pricing" , formats = [ "markdown" , "changeTracking" ] )
Response includes:
print ( doc . change_tracking . status )
new, same, changed, removed
print ( doc . change_tracking . previous_scrape_at ) print ( doc . change_tracking . visibility )
visible, hidden
Comparison Modes
Git-diff mode (default)
doc
app . scrape ( url = "https://example.com/docs" , formats = [ "markdown" , "changeTracking" ] , change_tracking_options = { "mode" : "diff" } ) print ( doc . change_tracking . diff )
Line-by-line changes
JSON mode (structured comparison)
doc
app . scrape ( url = "https://example.com/pricing" , formats = [ "markdown" , "changeTracking" ] , change_tracking_options = { "mode" : "json" , "schema" : { "type" : "object" , "properties" : { "price" : { "type" : "number" } } } } )
Costs 5 credits per page
Change States : new - Page not seen before same - No changes since last scrape changed - Content modified removed - Page no longer accessible Authentication
Get API key from https://www.firecrawl.dev/app
Store in environment
FIRECRAWL_API_KEY
fc-your-api-key-here Never hardcode API keys! Cloudflare Workers Integration The Firecrawl SDK cannot run in Cloudflare Workers (requires Node.js). Use the REST API directly: interface Env { FIRECRAWL_API_KEY : string ; } export default { async fetch ( request : Request , env : Env ) : Promise < Response
{ const { url } = await request . json < { url : string }
( ) ; const response = await fetch ( 'https://api.firecrawl.dev/v2/scrape' , { method : 'POST' , headers : { 'Authorization' :
Bearer ${ env . FIRECRAWL_API_KEY }, 'Content-Type' : 'application/json' , } , body : JSON . stringify ( { url , formats : [ 'markdown' ] , onlyMainContent : true } ) } ) ; const result = await response . json ( ) ; return Response . json ( result ) ; } } ; Rate Limits & Pricing Warning: Stealth Mode Pricing Change (May 2025) Stealth mode now costs 5 credits per request when actively used. Default behavior uses "auto" mode which only charges stealth credits if basic fails. Recommended pattern :
Use auto mode (default) - only charges 5 credits if stealth is needed
doc
app . scrape ( url , formats = [ "markdown" ] )
Or conditionally enable stealth for specific errors
- if
- error_status_code
- in
- [
- 401
- ,
- 403
- ,
- 500
- ]
- :
- doc
- =
- app
- .
- scrape
- (
- url
- ,
- formats
- =
- [
- "markdown"
- ]
- ,
- proxy
- =
- "stealth"
- )
- Unified Billing (November 2025)
- Credits and tokens merged into single system. Extract endpoint uses credits (15 tokens = 1 credit).
- Pricing Tiers
- Tier
- Credits/Month
- Notes
- Free
- 500
- Good for testing
- Hobby
- 3,000
- $19/month
- Standard
- 100,000
- $99/month
- Growth
- 500,000
- $399/month
- Credit Costs
- :
- Scrape: 1 credit (basic), 5 credits (stealth)
- Crawl: 1 credit per page
- Search: 2 credits per 10 results
- Extract: 5 credits per page (changed from tokens in v2.6.0)
- Agent: Dynamic (complexity-based)
- Change Tracking JSON mode: +5 credits
- Common Issues & Solutions
- Issue
- Cause
- Solution
- Empty content
- JS not loaded
- Add
- wait_for: 5000
- or use
- actions
- Rate limit exceeded
- Over quota
- Check dashboard, upgrade plan
- Timeout error
- Slow page
- Increase
- timeout
- , use
- stealth: true
- Bot detection
- Anti-scraping
- Use
- stealth: true
- , add
- location
- Invalid API key
- Wrong format
- Must start with
- fc-
- Known Issues Prevention
- This skill prevents
- 10
- documented issues:
- Issue #1: Stealth Mode Pricing Change (May 2025)
- Error
-
- Unexpected credit costs when using stealth mode
- Source
- :
- Stealth Mode Docs
- |
- Changelog
- Why It Happens
-
- Starting May 8th, 2025, Stealth Mode proxy requests cost
- 5 credits per request
- (previously included in standard pricing). This is a significant billing change.
- Prevention
- Use auto mode (default) which only charges stealth credits if basic fails
RECOMMENDED: Use auto mode (default)
doc
app . scrape ( url , formats = [ 'markdown' ] )
Auto retries with stealth (5 credits) only if basic fails
Or conditionally enable based on error status
- try
- :
- doc
- =
- app
- .
- scrape
- (
- url
- ,
- formats
- =
- [
- 'markdown'
- ]
- ,
- proxy
- =
- 'basic'
- )
- except
- Exception
- as
- e
- :
- if
- e
- .
- status_code
- in
- [
- 401
- ,
- 403
- ,
- 500
- ]
- :
- doc
- =
- app
- .
- scrape
- (
- url
- ,
- formats
- =
- [
- 'markdown'
- ]
- ,
- proxy
- =
- 'stealth'
- )
- Stealth Mode Options
- :
- auto
- (default): Charges 5 credits only if stealth succeeds after basic fails
- basic
-
- Standard proxies, 1 credit cost
- stealth
-
- 5 credits per request when actively used
- Issue #2: v2.0.0 Breaking Changes - Method Renames
- Error
- :
- AttributeError: 'FirecrawlApp' object has no attribute 'scrape_url'
- Source
- :
- v2.0.0 Release
- |
- Migration Guide
- Why It Happens
-
- v2.0.0 (August 2025) renamed SDK methods across all languages
- Prevention
- Use new method names JavaScript/TypeScript : scrapeUrl() → scrape() crawlUrl() → crawl() or startCrawl() asyncCrawlUrl() → startCrawl() checkCrawlStatus() → getCrawlStatus() Python : scrape_url() → scrape() crawl_url() → crawl() or start_crawl()
OLD (v1)
doc
app . scrape_url ( "https://example.com" )
NEW (v2)
doc
- app
- .
- scrape
- (
- "https://example.com"
- )
- Issue #3: v2.0.0 Breaking Changes - Format Changes
- Error
- :
- 'extract' is not a valid format
- Source
- :
- v2.0.0 Release
- Why It Happens
-
- Old
- "extract"
- format renamed to
- "json"
- in v2.0.0
- Prevention
- Use new object format for JSON extraction
OLD (v1)
doc
app . scrape_url ( url = "https://example.com" , params = { "formats" : [ "extract" ] , "extract" : { "prompt" : "Extract title" } } )
NEW (v2)
doc
app . scrape ( url = "https://example.com" , formats = [ { "type" : "json" , "prompt" : "Extract title" } ] )
With schema
doc
app . scrape ( url = "https://example.com" , formats = [ { "type" : "json" , "prompt" : "Extract product info" , "schema" : { "type" : "object" , "properties" : { "title" : { "type" : "string" } , "price" : { "type" : "number" } } } } ] ) Screenshot format also changed :
NEW: Screenshot as object
formats
- [
- {
- "type"
- :
- "screenshot"
- ,
- "fullPage"
- :
- True
- ,
- "quality"
- :
- 80
- ,
- "viewport"
- :
- {
- "width"
- :
- 1920
- ,
- "height"
- :
- 1080
- }
- }
- ]
- Issue #4: v2.0.0 Breaking Changes - Crawl Options
- Error
- :
- 'allowBackwardCrawling' is not a valid parameter
- Source
- :
- v2.0.0 Release
- Why It Happens
-
- Several crawl parameters renamed or removed in v2.0.0
- Prevention
- Use new parameter names Parameter Changes : allowBackwardCrawling → Use crawlEntireDomain instead maxDepth → Use maxDiscoveryDepth instead ignoreSitemap (bool) → sitemap ("only", "skip", "include")
OLD (v1)
app . crawl_url ( url = "https://docs.example.com" , params = { "allowBackwardCrawling" : True , "maxDepth" : 3 , "ignoreSitemap" : False } )
NEW (v2)
app . crawl ( url = "https://docs.example.com" , crawl_entire_domain = True , max_discovery_depth = 3 , sitemap = "include"
"only", "skip", or "include"
- )
- Issue #5: v2.0.0 Default Behavior Changes
- Error
-
- Stale cached content returned unexpectedly
- Source
- :
- v2.0.0 Release
- Why It Happens
-
- v2.0.0 changed several defaults
- Prevention
- Be aware of new defaults Default Changes : maxAge now defaults to 2 days (cached by default) blockAds , skipTlsVerification , removeBase64Images enabled by default
Force fresh data if needed
doc
app . scrape ( url , formats = [ 'markdown' ] , max_age = 0 )
Disable cache entirely
doc
- app
- .
- scrape
- (
- url
- ,
- formats
- =
- [
- 'markdown'
- ]
- ,
- store_in_cache
- =
- False
- )
- Issue #6: Job Status Race Condition
- Error
- :
- "Job not found"
- when checking crawl status immediately after creation
- Source
- :
- GitHub Issue #2662
- Why It Happens
-
- Database replication delay between job creation and status endpoint availability
- Prevention
- Wait 1-3 seconds before first status check, or implement retry logic import time
Start crawl
job
app . start_crawl ( url = "https://docs.example.com" ) print ( f"Job ID: { job . id } " )
REQUIRED: Wait before first status check
time . sleep ( 2 )
1-3 seconds recommended
Now status check succeeds
status
app . get_crawl_status ( job . id )
Or implement retry logic
- def
- get_status_with_retry
- (
- job_id
- ,
- max_retries
- =
- 3
- ,
- delay
- =
- 1
- )
- :
- for
- attempt
- in
- range
- (
- max_retries
- )
- :
- try
- :
- return
- app
- .
- get_crawl_status
- (
- job_id
- )
- except
- Exception
- as
- e
- :
- if
- "Job not found"
- in
- str
- (
- e
- )
- and
- attempt
- <
- max_retries
- -
- 1
- :
- time
- .
- sleep
- (
- delay
- )
- continue
- raise
- status
- =
- get_status_with_retry
- (
- job
- .
- id
- )
- Issue #7: DNS Errors Return HTTP 200
- Error
-
- DNS resolution failures return
- success: false
- with HTTP 200 status instead of 4xx
- Source
- :
- GitHub Issue #2402
- | Fixed in v2.7.0
- Why It Happens
-
- Changed in v2.7.0 for consistent error handling
- Prevention
-
- Check
- success
- field and
- code
- field, don't rely on HTTP status alone
- const
- result
- =
- await
- app
- .
- scrape
- (
- 'https://nonexistent-domain-xyz.com'
- )
- ;
- // DON'T rely on HTTP status code
- // Response: HTTP 200 with
- // DO check success field
- if
- (
- !
- result
- .
- success
- )
- {
- if
- (
- result
- .
- code
- ===
- 'SCRAPE_DNS_RESOLUTION_ERROR'
- )
- {
- console
- .
- error
- (
- 'DNS resolution failed'
- )
- ;
- }
- throw
- new
- Error
- (
- result
- .
- error
- )
- ;
- }
- Note
-
- DNS resolution errors still charge 1 credit despite failure.
- Issue #8: Bot Detection Still Charges Credits
- Error
-
- Cloudflare error page returned as "successful" scrape, credits charged
- Source
- :
- GitHub Issue #2413
- Why It Happens
-
- Fire-1 engine charges credits even when bot detection prevents access
- Prevention
- Validate content isn't an error page before processing; use stealth mode for protected sites
First attempt without stealth
doc
app . scrape ( url = "https://protected-site.com" , formats = [ "markdown" ] )
Validate content isn't an error page
if "cloudflare" in doc . markdown . lower ( ) or "access denied" in doc . markdown . lower ( ) :
Retry with stealth (costs 5 credits if successful)
doc
- app
- .
- scrape
- (
- url
- ,
- formats
- =
- [
- "markdown"
- ]
- ,
- stealth
- =
- True
- )
- Cost Impact
-
- Basic scrape charges 1 credit even on failure, stealth retry charges additional 5 credits.
- Issue #9: Self-Hosted Anti-Bot Fingerprinting Weakness
- Error
- :
- "All scraping engines failed!"
- (SCRAPE_ALL_ENGINES_FAILED) on sites with anti-bot measures
- Source
- :
- GitHub Issue #2257
- Why It Happens
-
- Self-hosted Firecrawl lacks advanced anti-fingerprinting techniques present in cloud service
- Prevention
- Use Firecrawl cloud service for sites with strong anti-bot measures, or configure proxy
Self-hosted fails on Cloudflare-protected sites
curl -X POST 'http://localhost:3002/v2/scrape' \ -H 'Authorization: Bearer YOUR_API_KEY' \ -d '{ "url": "https://www.example.com/", "pageOptions": { "engine": "playwright" } }'
Error: "All scraping engines failed!"
Workaround: Use cloud service instead
Cloud service has better anti-fingerprinting
- Note
-
- This affects self-hosted v2.3.0+ with default docker-compose setup. Warning present: "⚠️ WARNING: No proxy server provided. Your IP address may be blocked."
- Issue #10: Cache Performance Best Practices (Community-sourced)
- Suboptimal
-
- Not leveraging cache can make requests 500% slower
- Source
- :
- Fast Scraping Docs
- |
- Blog Post
- Why It Matters
-
- Default
- maxAge
- is 2 days in v2+, but many use cases need different strategies
- Prevention
- Use appropriate cache strategy for your content type
Fresh data (real-time pricing, stock prices)
doc
app . scrape ( url , formats = [ "markdown" ] , max_age = 0 )
10-minute cache (news, blogs)
doc
app . scrape ( url , formats = [ "markdown" ] , max_age = 600000 )
milliseconds
Use default cache (2 days) for static content
doc
app . scrape ( url , formats = [ "markdown" ] )
maxAge defaults to 172800000
Don't store in cache (one-time scrape)
doc
app . scrape ( url , formats = [ "markdown" ] , store_in_cache = False )
Require minimum age before re-scraping (v2.7.0+)
doc
app . scrape ( url , formats = [ "markdown" ] , min_age = 3600000 )
1 hour minimum
- Performance Impact
- :
- Cached response: Milliseconds
- Fresh scrape: Seconds
- Speed difference:
- Up to 500%
- Package Versions
- Package
- Version
- Last Checked
- firecrawl-py
- 4.13.0+
- 2026-01-20
- @mendable/firecrawl-js
- 4.11.1+
- 2026-01-20
- API Version
- v2
- Current
- Official Documentation
- Docs
- :
- https://docs.firecrawl.dev
- Python SDK
- :
- https://docs.firecrawl.dev/sdks/python
- Node.js SDK
- :
- https://docs.firecrawl.dev/sdks/node
- API Reference
- :
- https://docs.firecrawl.dev/api-reference
- GitHub
- :
- https://github.com/mendableai/firecrawl
- Dashboard
- :
- https://www.firecrawl.dev/app
- Token Savings
-
- ~65% vs manual integration
- Error Prevention
-
- 10 documented issues (v2 migration, stealth pricing, job status race, DNS errors, bot detection billing, self-hosted limitations, cache optimization)
- Production Ready
-
- Yes
- Last verified
-
- 2026-01-21 |
- Skill version
-
- 2.0.0 |
- Changes
- Added Known Issues Prevention section with 10 documented errors from TIER 1-2 research findings; added v2 migration guidance; documented stealth mode pricing change and unified billing model