firecrawl-scraper

Installs: 526
Rank: #2055

Install

npx skills add https://github.com/jezweb/claude-skills --skill firecrawl-scraper
Firecrawl Web Scraper Skill

Status: Production Ready
Last Updated: 2026-01-20
Official Docs: https://docs.firecrawl.dev
API Version: v2
SDK Versions: firecrawl-py 4.13.0+, @mendable/firecrawl-js 4.11.1+

What is Firecrawl?

Firecrawl is a Web Data API for AI that turns websites into LLM-ready markdown or structured data. It handles:

- JavaScript rendering - Executes client-side JavaScript to capture dynamic content
- Anti-bot bypass - Gets past CAPTCHA and bot detection systems
- Format conversion - Outputs as markdown, HTML, JSON, screenshots, summaries
- Document parsing - Processes PDFs, DOCX files, and images
- Autonomous agents - AI-powered web data gathering without URLs
- Change tracking - Monitor content changes over time
- Branding extraction - Extract color schemes, typography, logos

API Endpoints Overview

| Endpoint | Purpose | Use Case |
|---|---|---|
| /scrape | Single page | Extract article, product page |
| /crawl | Full site | Index docs, archive sites |
| /map | URL discovery | Find all pages, plan strategy |
| /search | Web search + scrape | Research with live data |
| /extract | Structured data | Product prices, contacts |
| /agent | Autonomous gathering | No URLs needed, AI navigates |
| /batch-scrape | Multiple URLs | Bulk processing |

1. Scrape Endpoint (/v2/scrape)

Scrapes a single webpage and returns clean, structured content.

Basic Usage

Python:

    from firecrawl import Firecrawl
    import os

    app = Firecrawl(api_key=os.environ.get("FIRECRAWL_API_KEY"))

    # Basic scrape
    doc = app.scrape(
        url="https://example.com/article",
        formats=["markdown", "html"],
        only_main_content=True
    )

    print(doc.markdown)
    print(doc.metadata)

JavaScript/TypeScript:

    import Firecrawl from '@mendable/firecrawl-js';

    const app = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

    const result = await app.scrape('https://example.com/article', {
      formats: ['markdown', 'html'],
      onlyMainContent: true
    });

    console.log(result.markdown);

Output Formats

| Format | Description |
|---|---|
| markdown | LLM-optimized content |
| html | Full HTML |
| rawHtml | Unprocessed HTML |
| screenshot | Page capture (with viewport options) |
| links | All URLs on page |
| json | Structured data extraction |
| summary | AI-generated summary |
| branding | Design system data |
| changeTracking | Content change detection |

Advanced Options

    doc = app.scrape(
        url="https://example.com",
        formats=["markdown", "screenshot"],
        only_main_content=True,
        remove_base64_images=True,
        wait_for=5000,  # Wait 5s for JS
        timeout=30000,
        # Location & language
        location={"country": "AU", "languages": ["en-AU"]},
        # Cache control
        max_age=0,  # Fresh content (no cache)
        store_in_cache=True,
        # Stealth mode for complex sites
        proxy="stealth",
        # Custom headers
        headers={"User-Agent": "Custom Bot 1.0"}
    )

Browser Actions

Perform interactions before scraping:

    doc = app.scrape(
        url="https://example.com",
        actions=[
            {"type": "click", "selector": "button.load-more"},
            {"type": "wait", "milliseconds": 2000},
            {"type": "scroll", "direction": "down"},
            {"type": "write", "selector": "input#search", "text": "query"},
            {"type": "press", "key": "Enter"},
            {"type": "screenshot"}  # Capture state mid-action
        ]
    )

JSON Mode (Structured Extraction)

    # With schema
    doc = app.scrape(
        url="https://example.com/product",
        formats=[{
            "type": "json",
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "price": {"type": "number"},
                    "in_stock": {"type": "boolean"}
                }
            }
        }]
    )

    # Without schema (prompt-only)
    doc = app.scrape(
        url="https://example.com/product",
        formats=[{
            "type": "json",
            "prompt": "Extract the product name, price, and availability"
        }]
    )

Branding Extraction

Extract design system and brand identity:

    doc = app.scrape(
        url="https://example.com",
        formats=["branding"]
    )

    # Returns:
    # - Color schemes and palettes
    # - Typography (fonts, sizes, weights)
    # - Spacing and layout metrics
    # - UI component styles
    # - Logo and imagery URLs
    # - Brand personality traits

2. Crawl Endpoint (/v2/crawl)

Crawls all accessible pages from a starting URL.

    result = app.crawl(
        url="https://docs.example.com",
        limit=100,
        max_discovery_depth=3,
        allowed_domains=["docs.example.com"],
        exclude_paths=["/api/", "/admin/"],
        scrape_options={
            "formats": ["markdown"],
            "only_main_content": True
        }
    )

    for page in result.data:
        print(f"Scraped: {page.metadata.source_url}")
        print(f"Content: {page.markdown[:200]}...")

Async Crawl with Webhooks

    # Start crawl (returns immediately)
    job = app.start_crawl(
        url="https://docs.example.com",
        limit=1000,
        webhook="https://your-domain.com/webhook"
    )
    print(f"Job ID: {job.id}")

    # Or poll for status
    status = app.get_crawl_status(job.id)

3. Map Endpoint (/v2/map)

Rapidly discover all URLs on a website without scraping content.

    urls = app.map(url="https://example.com")

    print(f"Found {len(urls)} pages")
    for url in urls[:10]:
        print(url)

Use for: sitemap discovery, crawl planning, website audits.

4. Search Endpoint (/search) - NEW

Perform web searches and optionally scrape the results in one operation.

    # Basic search
    results = app.search(
        query="best practices for React server components",
        limit=10
    )

    for result in results:
        print(f"{result.title}: {result.url}")

    # Search + scrape results
    results = app.search(
        query="React server components tutorial",
        limit=5,
        scrape_options={
            "formats": ["markdown"],
            "only_main_content": True
        }
    )

    for result in results:
        print(f"{result.title}")
        print(result.markdown[:500])

Search Options

    results = app.search(
        query="machine learning papers",
        limit=20,
        # Filter by source type
        sources=["web", "news", "images"],
        # Filter by category
        categories=["github", "research", "pdf"],
        # Location
        location={"country": "US"},
        # Time filter
        tbs="qdr:m",  # Past month (qdr:h=hour, qdr:d=day, qdr:w=week, qdr:y=year)
        timeout=30000
    )
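The `tbs` codes listed in the comment above are easy to mistype, so a small lookup can help. The sketch below is a hypothetical helper, not part of the Firecrawl SDK: `TBS_CODES`, `time_filter`, and `build_search_kwargs` are illustrative names, and only the five `qdr:` values come from the text.

```python
# Hypothetical helper (not part of the Firecrawl SDK): maps a readable
# window name to the `tbs` codes listed in the Search Options example.
TBS_CODES = {
    "hour": "qdr:h",
    "day": "qdr:d",
    "week": "qdr:w",
    "month": "qdr:m",
    "year": "qdr:y",
}


def time_filter(window: str) -> str:
    """Return the tbs code for a window name such as 'month'."""
    try:
        return TBS_CODES[window.lower()]
    except KeyError:
        raise ValueError(
            f"unknown window {window!r}; expected one of {sorted(TBS_CODES)}"
        ) from None


def build_search_kwargs(query: str, limit: int = 10, window=None) -> dict:
    """Assemble keyword arguments that could be passed to app.search(**kwargs)."""
    kwargs = {"query": query, "limit": limit}
    if window is not None:
        kwargs["tbs"] = time_filter(window)
    return kwargs
```

The resulting dictionary is meant to be unpacked into a search call, for example `app.search(**build_search_kwargs("machine learning papers", limit=20, window="month"))`.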
Cost: 2 credits per 10 results + scraping costs if enabled.

5. Extract Endpoint (/v2/extract)

AI-powered structured data extraction from single pages, multiple pages, or entire domains.

Single Page

    from pydantic import BaseModel

    class Product(BaseModel):
        name: str
        price: float
        description: str
        in_stock: bool

    result = app.extract(
        urls=["https://example.com/product"],
        schema=Product,
        system_prompt="Extract product information"
    )

    print(result.data)

Multi-Page / Domain Extraction

    # Extract from entire domain using wildcard
    result = app.extract(
        urls=["example.com/*"],  # All pages on domain
        schema=Product,
        system_prompt="Extract all products"
    )

    # Enable web search for additional context
    result = app.extract(
        urls=["example.com/products"],
        schema=Product,
        enable_web_search=True  # Follow external links
    )

Prompt-Only Extraction (No Schema)

    result = app.extract(
        urls=["https://example.com/about"],
        prompt="Extract the company name, founding year, and key executives"
    )
    # LLM determines output structure

6. Agent Endpoint (/agent) - NEW

Autonomous web data gathering without requiring specific URLs. The agent searches, navigates, and gathers data using natural language prompts.

    # Basic agent usage
    result = app.agent(
        prompt="Find the pricing plans for the top 3 headless CMS platforms and compare their features"
    )

    print(result.data)

    # With schema for structured output
    from pydantic import BaseModel
    from typing import List

    class CMSPricing(BaseModel):
        name: str
        free_tier: bool
        starter_price: float
        features: List[str]

    result = app.agent(
        prompt="Find pricing for Contentful, Sanity, and Strapi",
        schema=CMSPricing
    )

    # Optional: focus on specific URLs
    result = app.agent(
        prompt="Extract the enterprise pricing details",
        urls=["https://contentful.com/pricing", "https://sanity.io/pricing"]
    )

Agent Models

| Model | Best For | Cost |
|---|---|---|
| spark-1-mini (default) | Simple extractions, high volume | Standard |
| spark-1-pro | Complex analysis, ambiguous data | 60% more |

    result = app.agent(
        prompt="Analyze competitive positioning...",
        model="spark-1-pro"  # For complex tasks
    )

Async Agent

    # Start agent (returns immediately)
    job = app.start_agent(
        prompt="Research market trends..."
    )

    # Poll for results
    status = app.check_agent_status(job.id)
    if status.status == "completed":
        print(status.data)
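The snippet above checks the job once; in practice the check usually runs in a loop with a timeout. The following is a minimal sketch of such a loop. `poll_until_done` is a hypothetical helper, the status function is passed in rather than taken from the SDK, and treating "failed" as a terminal state alongside "completed" is an assumption.

```python
import time


def poll_until_done(check_status, job_id, interval=2.0, timeout=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call check_status(job_id) until the job reaches a terminal state.

    check_status is any callable returning an object with a `.status`
    attribute, e.g. a bound SDK method. "failed" as a terminal state is
    an assumption; "completed" comes from the example above.
    """
    deadline = clock() + timeout
    while True:
        status = check_status(job_id)
        if getattr(status, "status", None) in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(interval)
```

A call might look like `status = poll_until_done(app.check_agent_status, job.id)`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.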
Note: Agent is in Research Preview. 5 free daily requests, then credit-based billing.

7. Batch Scrape - NEW

Process multiple URLs efficiently in a single operation.

Synchronous (waits for completion)

    results = app.batch_scrape(
        urls=[
            "https://example.com/page1",
            "https://example.com/page2",
            "https://example.com/page3"
        ],
        formats=["markdown"],
        only_main_content=True
    )

    for page in results.data:
        print(f"{page.metadata.source_url}: {len(page.markdown)} chars")

Asynchronous (with webhooks)

Python:

    job = app.start_batch_scrape(
        urls=url_list,
        formats=["markdown"],
        webhook="https://your-domain.com/webhook"
    )
    # Webhook receives events: started, page, completed, failed

JavaScript/TypeScript:

    const job = await app.startBatchScrape(urls, {
      formats: ['markdown'],
      webhook: 'https://your-domain.com/webhook'
    });

    // Poll for status
    const status = await app.getBatchScrapeStatus(job.id);

8. Change Tracking - NEW

Monitor content changes over time by comparing scrapes.

    # Enable change tracking
    doc = app.scrape(
        url="https://example.com/pricing",
        formats=["markdown", "changeTracking"]
    )

    # Response includes:
    print(doc.change_tracking.status)  # new, same, changed, removed
    print(doc.change_tracking.previous_scrape_at)
    print(doc.change_tracking.visibility)  # visible, hidden
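One way to act on the four status values is a small dispatch table. The sketch below is illustrative only: the action names ("index", "skip", "reprocess", "delete") are hypothetical placeholders for whatever a pipeline does next, and only the status strings come from the text.

```python
# Hypothetical mapping from a change-tracking status to a follow-up action.
# The status strings come from the section above; the action names are
# placeholders chosen for this example.
ACTIONS = {
    "new": "index",         # page not seen before
    "same": "skip",         # nothing changed since the last scrape
    "changed": "reprocess",  # content was modified
    "removed": "delete",     # page is no longer accessible
}


def next_action(status: str) -> str:
    """Return the follow-up action for a change-tracking status."""
    if status not in ACTIONS:
        raise ValueError(f"unexpected change-tracking status: {status!r}")
    return ACTIONS[status]
```

With a real response this would be called as `next_action(doc.change_tracking.status)`.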

Comparison Modes

    # Git-diff mode (default)
    doc = app.scrape(
        url="https://example.com/docs",
        formats=["markdown", "changeTracking"],
        change_tracking_options={"mode": "diff"}
    )
    print(doc.change_tracking.diff)  # Line-by-line changes

    # JSON mode (structured comparison)
    doc = app.scrape(
        url="https://example.com/pricing",
        formats=["markdown", "changeTracking"],
        change_tracking_options={
            "mode": "json",
            "schema": {"type": "object", "properties": {"price": {"type": "number"}}}
        }
    )
    # Costs 5 credits per page

Change States:

- new - Page not seen before
- same - No changes since last scrape
- changed - Content modified
- removed - Page no longer accessible

Authentication

    # Get API key from https://www.firecrawl.dev/app
    # Store in environment
    FIRECRAWL_API_KEY=fc-your-api-key-here

Never hardcode API keys!

Cloudflare Workers Integration

The Firecrawl SDK cannot run in Cloudflare Workers (requires Node.js). Use the REST API directly:

    interface Env {
      FIRECRAWL_API_KEY: string;
    }

    export default {
      async fetch(request: Request, env: Env): Promise<Response> {
        const { url } = await request.json<{ url: string }>();

        const response = await fetch('https://api.firecrawl.dev/v2/scrape', {
          method: 'POST',
          headers: {
            'Authorization': `Bearer ${env.FIRECRAWL_API_KEY}`,
            'Content-Type': 'application/json',
          },
          body: JSON.stringify({
            url,
            formats: ['markdown'],
            onlyMainContent: true
          })
        });

        const result = await response.json();
        return Response.json(result);
      }
    };

Rate Limits & Pricing

Warning: Stealth Mode Pricing Change (May 2025)

Stealth mode now costs 5 credits per request when actively used. The default "auto" mode only charges stealth credits if the basic proxy fails.

Recommended pattern:

    # Use auto mode (default) - only charges 5 credits if stealth is needed
    doc = app.scrape(url, formats=["markdown"])

    # Or conditionally enable stealth for specific errors
    if error_status_code in [401, 403, 500]:
        doc = app.scrape(url, formats=["markdown"], proxy="stealth")
Unified Billing (November 2025)

Credits and tokens merged into a single system. Extract endpoint uses credits (15 tokens = 1 credit).

Pricing Tiers

| Tier | Credits/Month | Notes |
|---|---|---|
| Free | 500 | Good for testing |
| Hobby | 3,000 | $19/month |
| Standard | 100,000 | $99/month |
| Growth | 500,000 | $399/month |
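To compare tiers against an expected monthly volume, the table can be encoded directly. This is a rough sketch: `TIERS`, `smallest_tier`, and `usd_per_1000_credits` are hypothetical names, the figures are copied from the table above, and the Free tier is assumed to cost $0.

```python
# Tier data copied from the pricing table; the Free tier price of $0 is
# an assumption (the table only says "Good for testing").
TIERS = [
    ("Free", 500, 0),
    ("Hobby", 3_000, 19),
    ("Standard", 100_000, 99),
    ("Growth", 500_000, 399),
]


def smallest_tier(credits_per_month: int):
    """Return the first tier whose monthly credits cover the need, else None."""
    for name, credits, _price in TIERS:
        if credits_per_month <= credits:
            return name
    return None


def usd_per_1000_credits(tier_name: str) -> float:
    """Price of 1,000 credits on the given tier, in US dollars."""
    for name, credits, price in TIERS:
        if name == tier_name:
            return price * 1000 / credits
    raise ValueError(f"unknown tier: {tier_name!r}")
```

For example, a workload of about 2,500 credits a month fits the Hobby tier, while 3,001 credits already requires Standard.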
Credit Costs:

- Scrape: 1 credit (basic), 5 credits (stealth)
- Crawl: 1 credit per page
- Search: 2 credits per 10 results
- Extract: 5 credits per page (changed from tokens in v2.6.0)
- Agent: Dynamic (complexity-based)
- Change Tracking JSON mode: +5 credits
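The per-operation costs above can be combined into a rough budget estimate. The sketch below is a hypothetical calculator, not an official billing formula: it assumes partial blocks of search results are rounded up to the next 10, treats the change-tracking surcharge as an add-on to a scrape already counted, and leaves out Agent because its cost is dynamic.

```python
import math


def estimate_credits(scrapes=0, stealth_scrapes=0, crawl_pages=0,
                     search_results=0, extract_pages=0,
                     change_tracking_json_pages=0) -> int:
    """Rough credit estimate from the listed per-operation costs.

    Assumptions: search is billed per started block of 10 results, and
    change-tracking JSON pages are surcharges on scrapes counted separately.
    """
    search = 2 * math.ceil(search_results / 10)
    return (
        scrapes * 1
        + stealth_scrapes * 5
        + crawl_pages * 1
        + search
        + extract_pages * 5
        + change_tracking_json_pages * 5
    )
```

As an example, 10 basic scrapes, 2 stealth scrapes, and one search returning 25 results come to 10 + 10 + 6 = 26 credits under these assumptions.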
Common Issues & Solutions

| Issue | Cause | Solution |
|---|---|---|
| Empty content | JS not loaded | Add wait_for: 5000 or use actions |
| Rate limit exceeded | Over quota | Check dashboard, upgrade plan |
| Timeout error | Slow page | Increase timeout, use proxy: "stealth" |
| Bot detection | Anti-scraping | Use proxy: "stealth", add location |
| Invalid API key | Wrong format | Must start with fc- |
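The "Invalid API key" row can be caught before any request is sent. A minimal sketch, assuming the key lives in the `FIRECRAWL_API_KEY` environment variable as described under Authentication; `load_api_key` is a hypothetical helper name.

```python
import os


def load_api_key(env=None) -> str:
    """Read FIRECRAWL_API_KEY and check the documented 'fc-' prefix.

    `env` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if env is None else env
    key = (env.get("FIRECRAWL_API_KEY") or "").strip()
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    if not key.startswith("fc-"):
        raise ValueError("FIRECRAWL_API_KEY must start with 'fc-'")
    return key
```

The returned value can then be passed to the client, for example `Firecrawl(api_key=load_api_key())`.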
Known Issues Prevention

This skill prevents 10 documented issues:

Issue #1: Stealth Mode Pricing Change (May 2025)

Error: Unexpected credit costs when using stealth mode
Source: Stealth Mode Docs | Changelog
Why It Happens: Starting May 8th, 2025, Stealth Mode proxy requests cost 5 credits per request (previously included in standard pricing). This is a significant billing change.
Prevention: Use auto mode (default), which only charges stealth credits if basic fails

    # RECOMMENDED: Use auto mode (default)
    doc = app.scrape(url, formats=['markdown'])
    # Auto retries with stealth (5 credits) only if basic fails

    # Or conditionally enable based on error status
    try:
        doc = app.scrape(url, formats=['markdown'], proxy='basic')
    except Exception as e:
        # Retry with stealth only for status codes that suggest blocking
        if getattr(e, 'status_code', None) in [401, 403, 500]:
            doc = app.scrape(url, formats=['markdown'], proxy='stealth')
        else:
            raise

Stealth Mode Options:

- auto (default): Charges 5 credits only if stealth succeeds after basic fails
- basic: Standard proxies, 1 credit cost
- stealth: 5 credits per request when actively used
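To see how much auto mode can save over always-on stealth, the three options can be turned into an expected-cost estimate. This is a sketch under stated assumptions: `expected_credits` is a hypothetical function, `basic_failure_rate` is a number you would measure yourself, and auto mode is modeled as 1 credit when basic succeeds and 5 credits when it falls back to stealth, following the option list above.

```python
def expected_credits(requests: int, proxy: str = "auto",
                     basic_failure_rate: float = 0.0) -> float:
    """Expected credits for `requests` scrapes under a given proxy mode.

    Assumption: in auto mode a request costs 1 credit when basic succeeds
    and 5 credits when it falls back to stealth.
    """
    if not 0.0 <= basic_failure_rate <= 1.0:
        raise ValueError("basic_failure_rate must be between 0 and 1")
    if proxy == "basic":
        per_request = 1.0
    elif proxy == "stealth":
        per_request = 5.0
    elif proxy == "auto":
        per_request = (1.0 - basic_failure_rate) * 1 + basic_failure_rate * 5
    else:
        raise ValueError(f"unknown proxy mode: {proxy!r}")
    return requests * per_request
```

With 100 requests and a 10% basic failure rate, this model gives roughly 140 credits for auto versus 500 for always-on stealth.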
Issue #2: v2.0.0 Breaking Changes - Method Renames

Error: AttributeError: 'FirecrawlApp' object has no attribute 'scrape_url'
Source: v2.0.0 Release | Migration Guide
Why It Happens: v2.0.0 (August 2025) renamed SDK methods across all languages
Prevention: Use new method names

JavaScript/TypeScript:

- scrapeUrl() → scrape()
- crawlUrl() → crawl() or startCrawl()
- asyncCrawlUrl() → startCrawl()
- checkCrawlStatus() → getCrawlStatus()

Python:

- scrape_url() → scrape()
- crawl_url() → crawl() or start_crawl()

    # OLD (v1)
    doc = app.scrape_url("https://example.com")

    # NEW (v2)
    doc = app.scrape("https://example.com")
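When migrating a larger codebase, it can help to locate the old method names mechanically. The sketch below is a hypothetical scanner, not a tool shipped with Firecrawl: it looks for the v1 names from the rename lists above in a source string and reports the suggested v2 replacement.

```python
import re

# Rename table taken from the lists above (JavaScript and Python names).
V1_TO_V2 = {
    "scrapeUrl": "scrape",
    "crawlUrl": "crawl or startCrawl",
    "asyncCrawlUrl": "startCrawl",
    "checkCrawlStatus": "getCrawlStatus",
    "scrape_url": "scrape",
    "crawl_url": "crawl or start_crawl",
}

# Match ".name(" so only method calls are reported.
_V1_CALL = re.compile(r"\.(" + "|".join(map(re.escape, V1_TO_V2)) + r")\s*\(")


def find_v1_calls(source: str):
    """Return (line_number, old_name, suggested_new_name) for each v1 call."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _V1_CALL.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, V1_TO_V2[old]))
    return hits
```

Running it over a file's contents, for example `find_v1_calls(open(path).read())`, gives a quick checklist of call sites to update. It is a plain text match, so it will also flag occurrences inside comments or strings.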
Issue #3: v2.0.0 Breaking Changes - Format Changes

Error: 'extract' is not a valid format
Source: v2.0.0 Release
Why It Happens: Old "extract" format renamed to "json" in v2.0.0
Prevention: Use new object format for JSON extraction

    # OLD (v1)
    doc = app.scrape_url(
        url="https://example.com",
        params={
            "formats": ["extract"],
            "extract": {"prompt": "Extract title"}
        }
    )

    # NEW (v2)
    doc = app.scrape(
        url="https://example.com",
        formats=[{"type": "json", "prompt": "Extract title"}]
    )

    # With schema
    doc = app.scrape(
        url="https://example.com",
        formats=[{
            "type": "json",
            "prompt": "Extract product info",
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "price": {"type": "number"}
                }
            }
        }]
    )

Screenshot format also changed:

    # NEW: Screenshot as object
    formats=[{
        "type": "screenshot",
        "fullPage": True,
        "quality": 80,
        "viewport": {"width": 1920, "height": 1080}
    }]
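The v1 to v2 format change can be expressed as a small conversion step. The following is an illustrative sketch only: `migrate_formats` is a hypothetical helper that rewrites a v1 `formats` list plus its separate `extract` options into the v2 object form shown above.

```python
def migrate_formats(formats, extract_options=None):
    """Convert a v1 formats list to the v2 shape.

    "extract" becomes {"type": "json", ...} with the old extract options
    (prompt and/or schema) merged in; other formats pass through unchanged.
    """
    migrated = []
    for fmt in formats:
        if fmt == "extract":
            entry = {"type": "json"}
            entry.update(extract_options or {})
            migrated.append(entry)
        else:
            migrated.append(fmt)
    return migrated
```

For the v1 call in the example, `migrate_formats(["extract"], {"prompt": "Extract title"})` yields the same `formats` value as the v2 call.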
Issue #4: v2.0.0 Breaking Changes - Crawl Options

Error: 'allowBackwardCrawling' is not a valid parameter
Source: v2.0.0 Release
Why It Happens: Several crawl parameters renamed or removed in v2.0.0
Prevention: Use new parameter names

Parameter Changes:

- allowBackwardCrawling → Use crawlEntireDomain instead
- maxDepth → Use maxDiscoveryDepth instead
- ignoreSitemap (bool) → sitemap ("only", "skip", "include")

    # OLD (v1)
    app.crawl_url(
        url="https://docs.example.com",
        params={
            "allowBackwardCrawling": True,
            "maxDepth": 3,
            "ignoreSitemap": False
        }
    )

    # NEW (v2)
    app.crawl(
        url="https://docs.example.com",
        crawl_entire_domain=True,
        max_discovery_depth=3,
        sitemap="include"  # "only", "skip", or "include"
    )
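The parameter changes can be applied to an existing v1 `params` dictionary in one pass. This is a sketch, not an official migration tool: `migrate_crawl_params` is a hypothetical name, and mapping `ignoreSitemap=True` to `sitemap="skip"` is an assumption (the example above only shows `False` becoming `"include"`).

```python
def migrate_crawl_params(v1_params: dict) -> dict:
    """Translate v1 crawl params into v2 Python keyword arguments."""
    v2 = {}
    for key, value in v1_params.items():
        if key == "allowBackwardCrawling":
            v2["crawl_entire_domain"] = value
        elif key == "maxDepth":
            v2["max_discovery_depth"] = value
        elif key == "ignoreSitemap":
            # Assumption: True maps to "skip"; False maps to "include"
            # as in the example above.
            v2["sitemap"] = "skip" if value else "include"
        else:
            v2[key] = value  # left untouched; may still need manual review
    return v2
```

The result can be unpacked into the v2 call, for example `app.crawl(url="https://docs.example.com", **migrate_crawl_params(old_params))`.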
Issue #5: v2.0.0 Default Behavior Changes

Error: Stale cached content returned unexpectedly
Source: v2.0.0 Release
Why It Happens: v2.0.0 changed several defaults
Prevention: Be aware of new defaults

Default Changes:

- maxAge now defaults to 2 days (cached by default)
- blockAds, skipTlsVerification, removeBase64Images enabled by default

    # Force fresh data if needed
    doc = app.scrape(url, formats=['markdown'], max_age=0)

    # Disable cache entirely
    doc = app.scrape(url, formats=['markdown'], store_in_cache=False)
Issue #6: Job Status Race Condition

Error: "Job not found" when checking crawl status immediately after creation
Source: GitHub Issue #2662
Why It Happens: Database replication delay between job creation and status endpoint availability
Prevention: Wait 1-3 seconds before first status check, or implement retry logic

    import time

    # Start crawl
    job = app.start_crawl(url="https://docs.example.com")
    print(f"Job ID: {job.id}")

    # REQUIRED: Wait before first status check
    time.sleep(2)  # 1-3 seconds recommended

    # Now status check succeeds
    status = app.get_crawl_status(job.id)

    # Or implement retry logic
    def get_status_with_retry(job_id, max_retries=3, delay=1):
        for attempt in range(max_retries):
            try:
                return app.get_crawl_status(job_id)
            except Exception as e:
                if "Job not found" in str(e) and attempt < max_retries - 1:
                    time.sleep(delay)
                    continue
                raise

    status = get_status_with_retry(job.id)
Issue #7: DNS Errors Return HTTP 200

Error: DNS resolution failures return success: false with HTTP 200 status instead of 4xx
Source: GitHub Issue #2402 | Fixed in v2.7.0
Why It Happens: Changed in v2.7.0 for consistent error handling
Prevention: Check success field and code field, don't rely on HTTP status alone

    const result = await app.scrape('https://nonexistent-domain-xyz.com');

    // DON'T rely on HTTP status code
    // Response: HTTP 200 with success: false

    // DO check success field
    if (!result.success) {
      if (result.code === 'SCRAPE_DNS_RESOLUTION_ERROR') {
        console.error('DNS resolution failed');
      }
      throw new Error(result.error);
    }

Note: DNS resolution errors still charge 1 credit despite failure.
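The same check applies when calling the REST API from Python and parsing the JSON body yourself. A minimal sketch, assuming the body is a dictionary carrying the `success`, `code`, and `error` fields described above; `FirecrawlResponseError` and `ensure_success` are hypothetical names.

```python
class FirecrawlResponseError(Exception):
    """Raised when a response body reports success: false."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def ensure_success(payload: dict) -> dict:
    """Return the payload if it reports success, otherwise raise.

    Inspects the body rather than the HTTP status, since DNS failures
    can arrive with HTTP 200.
    """
    if payload.get("success") is not True:
        raise FirecrawlResponseError(
            payload.get("code"),
            payload.get("error") or "request failed",
        )
    return payload
```

A caller can then branch on `exc.code == "SCRAPE_DNS_RESOLUTION_ERROR"` to tell DNS failures apart from other errors.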
Issue #8: Bot Detection Still Charges Credits

Error: Cloudflare error page returned as "successful" scrape, credits charged
Source: GitHub Issue #2413
Why It Happens: Fire-1 engine charges credits even when bot detection prevents access
Prevention: Validate content isn't an error page before processing; use stealth mode for protected sites

    url = "https://protected-site.com"

    # First attempt without stealth
    doc = app.scrape(url=url, formats=["markdown"])

    # Validate content isn't an error page
    if "cloudflare" in doc.markdown.lower() or "access denied" in doc.markdown.lower():
        # Retry with stealth (costs 5 credits if successful)
        doc = app.scrape(url, formats=["markdown"], proxy="stealth")

Cost Impact: Basic scrape charges 1 credit even on failure; the stealth retry charges an additional 5 credits.
Issue #9: Self-Hosted Anti-Bot Fingerprinting Weakness

Error: "All scraping engines failed!" (SCRAPE_ALL_ENGINES_FAILED) on sites with anti-bot measures
Source: GitHub Issue #2257
Why It Happens: Self-hosted Firecrawl lacks advanced anti-fingerprinting techniques present in cloud service
Prevention: Use Firecrawl cloud service for sites with strong anti-bot measures, or configure proxy

    # Self-hosted fails on Cloudflare-protected sites
    curl -X POST 'http://localhost:3002/v2/scrape' \
      -H 'Authorization: Bearer YOUR_API_KEY' \
      -d '{
        "url": "https://www.example.com/",
        "pageOptions": { "engine": "playwright" }
      }'
    # Error: "All scraping engines failed!"

    # Workaround: Use cloud service instead
    # Cloud service has better anti-fingerprinting

Note: This affects self-hosted v2.3.0+ with default docker-compose setup. Warning present: "⚠️ WARNING: No proxy server provided. Your IP address may be blocked."
Issue #10: Cache Performance Best Practices (Community-sourced)

Suboptimal: Not leveraging cache can make requests 500% slower
Source: Fast Scraping Docs | Blog Post
Why It Matters: Default maxAge is 2 days in v2+, but many use cases need different strategies
Prevention: Use appropriate cache strategy for your content type

    # Fresh data (real-time pricing, stock prices)
    doc = app.scrape(url, formats=["markdown"], max_age=0)

    # 10-minute cache (news, blogs)
    doc = app.scrape(url, formats=["markdown"], max_age=600000)  # milliseconds

    # Use default cache (2 days) for static content
    doc = app.scrape(url, formats=["markdown"])  # maxAge defaults to 172800000

    # Don't store in cache (one-time scrape)
    doc = app.scrape(url, formats=["markdown"], store_in_cache=False)

    # Require minimum age before re-scraping (v2.7.0+)
    doc = app.scrape(url, formats=["markdown"], min_age=3600000)  # 1 hour minimum

Performance Impact:

- Cached response: Milliseconds
- Fresh scrape: Seconds
- Speed difference: Up to 500%
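Because `max_age` and `min_age` are given in milliseconds, raw literals like 600000 are easy to get wrong by a factor of 1000. The helper below is a hypothetical convenience, not part of the SDK, that builds the value from readable units.

```python
def max_age_ms(*, days=0, hours=0, minutes=0, seconds=0) -> int:
    """Convert a duration into milliseconds for max_age / min_age."""
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return int(total_seconds * 1000)
```

It reproduces the values used above: `max_age_ms(minutes=10)` gives 600000, `max_age_ms(hours=1)` gives 3600000, and `max_age_ms(days=2)` gives the 172800000 default.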
Package Versions

| Package | Version | Last Checked |
|---|---|---|
| firecrawl-py | 4.13.0+ | 2026-01-20 |
| @mendable/firecrawl-js | 4.11.1+ | 2026-01-20 |
| API Version | v2 | Current |

Official Documentation

- Docs: https://docs.firecrawl.dev
- Python SDK: https://docs.firecrawl.dev/sdks/python
- Node.js SDK: https://docs.firecrawl.dev/sdks/node
- API Reference: https://docs.firecrawl.dev/api-reference
- GitHub: https://github.com/mendableai/firecrawl
- Dashboard: https://www.firecrawl.dev/app

Token Savings: ~65% vs manual integration
Error Prevention: 10 documented issues (v2 migration, stealth pricing, job status race, DNS errors, bot detection billing, self-hosted limitations, cache optimization)
Production Ready: Yes

Last verified: 2026-01-21 | Skill version: 2.0.0 | Changes: Added Known Issues Prevention section with 10 documented errors from TIER 1-2 research findings; added v2 migration guidance; documented stealth mode pricing change and unified billing model