# firecrawl-scraping

Installs: 154
Rank: #5583

## Install

```shell
npx skills add https://github.com/casper-studios/casper-marketplace --skill firecrawl-scraping
```

## Overview

Scrape individual web pages and convert them to clean, LLM-ready markdown. Handles JavaScript rendering, anti-bot protection, and dynamic content.

## Quick Decision Tree

```
What are you scraping?
│
├── Single page (article, blog, docs)
│   └── references/single-page.md
│       └── Script: scripts/firecrawl_scrape.py
│
└── Entire website (multiple pages, crawling)
    └── references/website-crawler.md
        └── (Use Apify Website Content Crawler for multi-page)
```

## Environment Setup

Required in `.env`:

```
FIRECRAWL_API_KEY=fc-your-api-key-here
```

Get your API key: https://firecrawl.dev/app/api-keys
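A minimal way to load the key in Python without extra dependencies is to parse the `.env` file yourself. This is a sketch; the `load_env` helper below is hypothetical and not part of the skill's scripts (a real project might use `python-dotenv` instead):

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into os.environ.

    Skips blank lines and comments; does not overwrite variables
    that are already set in the environment.
    """
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # fall back to whatever is already in the environment

if __name__ == "__main__":
    load_env()
    key = os.environ.get("FIRECRAWL_API_KEY")
    print("key loaded" if key else "FIRECRAWL_API_KEY is not set")
```

Using `setdefault` means a key exported in the shell takes precedence over the `.env` file, which keeps the "environment variables, not hardcoded values" rule intact.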
## Common Usage

**Simple scrape:**

```shell
python scripts/firecrawl_scrape.py "https://example.com/article"
```

**With options:**

```shell
python scripts/firecrawl_scrape.py "https://wsj.com/article" \
  --proxy stealth \
  --format markdown summary \
  --timeout 60000
```
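Internally, a script like this presumably maps the CLI flags onto a JSON body for Firecrawl's scrape endpoint. The sketch below shows one plausible mapping; the field names (`formats`, `proxy`, `timeout`) are assumptions based on Firecrawl's v1 API, not verified against this skill's script:

```python
def build_scrape_payload(url, formats=("markdown",), proxy="auto", timeout=30000):
    """Map CLI-style options onto a Firecrawl-like scrape request body."""
    allowed_proxies = {"basic", "stealth", "auto"}
    if proxy not in allowed_proxies:
        raise ValueError(f"unknown proxy mode: {proxy!r}")
    return {
        "url": url,
        "formats": list(formats),
        "proxy": proxy,
        "timeout": timeout,  # milliseconds, matching the --timeout flag
    }

payload = build_scrape_payload(
    "https://wsj.com/article",
    formats=("markdown", "summary"),
    proxy="stealth",
    timeout=60000,
)
```

The payload would then be POSTed with the API key in an `Authorization: Bearer` header; validating the proxy mode client-side gives a clearer error than a round-trip to the API.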
## Proxy Modes

| Mode | Use Case |
|------|----------|
| `basic` | Standard sites, fastest |
| `stealth` | Anti-bot protection, premium content (WSJ, NYT) |
| `auto` | Let Firecrawl decide (recommended) |
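If you wanted to resolve the mode client-side instead of relying on `auto`, it could look like the heuristic below. This is purely illustrative: Firecrawl's real `auto` mode decides server-side, and the `PROTECTED_DOMAINS` set is an assumed example list, not an actual allowlist from the service:

```python
from urllib.parse import urlparse

# Illustrative, not exhaustive: sites the table above flags as anti-bot/premium.
PROTECTED_DOMAINS = {"wsj.com", "nytimes.com"}

def pick_proxy(url, mode="auto"):
    """Return an explicit mode unchanged; resolve 'auto' locally by
    choosing stealth for known protected domains and basic otherwise."""
    if mode != "auto":
        return mode
    host = (urlparse(url).hostname or "").removeprefix("www.")
    return "stealth" if host in PROTECTED_DOMAINS else "basic"
```

Defaulting to `basic` keeps costs down, since the cost section below notes that stealth may consume extra credits.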
## Output Formats

- `markdown`: Clean markdown content (default)
- `html`: Raw HTML
- `summary`: AI-generated summary
- `screenshot`: Page screenshot
- `links`: All links on the page
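A wrapper script can reject typos in `--format` before spending credits on a request. A small sketch, assuming the five documented formats above are the complete set (the helper itself is hypothetical):

```python
VALID_FORMATS = {"markdown", "html", "summary", "screenshot", "links"}

def validate_formats(requested):
    """Return the requested formats, defaulting to markdown when empty
    and rejecting anything outside the documented set."""
    formats = list(requested) or ["markdown"]
    unknown = set(formats) - VALID_FORMATS
    if unknown:
        raise ValueError(f"unsupported format(s): {sorted(unknown)}")
    return formats
```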
## Cost

~1 credit per page. Stealth proxy may use additional credits.
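For budgeting a batch job, a rough estimator can be built from the numbers above. The stealth surcharge is deliberately a parameter: the document only says stealth "may use additional credits", so the default of 4 extra credits per page is a placeholder assumption, not Firecrawl's actual pricing:

```python
def estimate_credits(pages, stealth=False, stealth_surcharge=4):
    """Rough cost estimate: ~1 credit per page, plus an assumed per-page
    surcharge when the stealth proxy is used. Check Firecrawl's pricing
    for the real surcharge before relying on this number."""
    per_page = 1 + (stealth_surcharge if stealth else 0)
    return pages * per_page
```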
## Security Notes

### Credential Handling

- Store `FIRECRAWL_API_KEY` in a `.env` file (never commit it to git)
- API keys can be regenerated at https://firecrawl.dev/app/api-keys
- Never log or print API keys in script output
- Use environment variables, not hardcoded values
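One way to honor the "never log or print API keys" rule is to scrub output before it is written anywhere. A small sketch; the `fc-` prefix pattern is an assumption based on the example key format above:

```python
import re

# Assumes Firecrawl keys look like "fc-<alphanumeric>", per the example key.
_KEY_PATTERN = re.compile(r"fc-[A-Za-z0-9_-]+")

def mask_keys(text):
    """Redact Firecrawl-style API keys from text before logging it."""
    return _KEY_PATTERN.sub("fc-***", text)
```

Running every log line through `mask_keys` makes it harder to leak a key via debug output or a pasted traceback.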
### Data Privacy

- Only scrapes publicly accessible web pages
- Scraped content is processed by Firecrawl servers temporarily
- Markdown output is stored locally in the `.tmp/` directory
- Screenshots (if requested) are stored locally
- No persistent data retention by Firecrawl after the request
### Access Scopes

- The API key provides full access to scraping features
- No granular permission scopes are available
- Monitor usage via the Firecrawl dashboard
### Compliance Considerations

- **Robots.txt**: Firecrawl respects robots.txt by default
- **Public content only**: Only scrape publicly accessible pages
- **Terms of Service**: Respect the target site's ToS
- **Rate limiting**: Built-in rate limiting prevents abuse
- **Stealth proxy**: Use stealth mode only when necessary (paywalled news, not auth bypass)
- **GDPR**: Scraped content may contain PII; handle it accordingly
- **Copyright**: Respect the intellectual property rights of scraped content

## Troubleshooting

### Common Issues

**Issue: Credits exhausted**

- Symptoms: API returns "insufficient credits" or a quota-exceeded error
- Cause: Account credits depleted
- Solution:
  - Check your credit balance at https://firecrawl.dev/app
  - Upgrade your plan or purchase additional credits
  - Reduce scraping frequency
  - Use basic proxy mode to conserve credits

**Issue: Page not rendering correctly**

- Symptoms: Empty content or partial HTML returned
- Cause: JavaScript-heavy page not fully loading
- Solution:
  - Enable JavaScript rendering with the `--js-render` flag
  - Increase the timeout with `--timeout 60000` (60 seconds)
  - Try stealth proxy mode for protected sites
  - Wait for specific elements with `--wait-for selector`

**Issue: 403 Forbidden error**

- Symptoms: Script returns a 403 status code
- Cause: Site blocking automated access
- Solution:
  - Enable stealth proxy mode
  - Add a delay between requests
  - Try at different times (some sites rate limit by time of day)
  - Check whether the site requires login (not supported)

**Issue: Empty markdown output**

- Symptoms: Scrape succeeds but the markdown is empty or malformed
- Cause: Dynamic content loaded after page load, or an unusual page structure
- Solution:
  - Increase the wait time for JavaScript to execute
  - Use `--wait-for` to wait for specific content
  - Try the `html` format to inspect the raw content
  - Check whether the content is in an iframe (not always supported)

**Issue: Timeout errors**

- Symptoms: Request times out before completion
- Cause: Slow page load or large page content
- Solution:
  - Increase the timeout value (up to 120000 ms)
  - Use the basic proxy for a faster response
  - Target specific page sections if possible
  - Check whether the site is experiencing issues

## Resources

- references/single-page.md - Single-page scraping details
- references/website-crawler.md - Multi-page website crawling

## Integration Patterns

### Scrape and Analyze

- Skills: firecrawl-scraping → parallel-research
- Use case: Scrape competitor pages, then analyze content strategy
- Flow:
  1. Scrape competitor website pages with Firecrawl
  2. Convert to clean markdown
  3. Use parallel-research to analyze positioning, messaging, and features

### Scrape and Document

- Skills: firecrawl-scraping → content-generation
- Use case: Create summary documents from web research
- Flow:
  1. Scrape multiple article pages on a topic
  2. Combine the markdown content
  3. Generate a summary document via content-generation

### Scrape and Enrich CRM

- Skills: firecrawl-scraping → attio-crm
- Use case: Enrich company records with website data
- Flow:
  1. Scrape the company website (about page, team page, product pages)
  2. Extract key information (funding, team size, products)
  3. Update the company record in Attio CRM with the enriched data
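Several of the troubleshooting fixes (403s, timeouts, rate limits) amount to "retry with a growing delay". A minimal exponential-backoff schedule, as a sketch; the helper name and defaults are illustrative and not part of the skill's scripts:

```python
import random

def backoff_delays(attempts=4, base=1.0, cap=30.0, jitter=False):
    """Exponential backoff schedule in seconds for retrying transient
    failures: base, 2*base, 4*base, ... capped at `cap`. Optional jitter
    spreads out retries so parallel scrapers don't hammer a site in sync.
    """
    delays = []
    for i in range(attempts):
        delay = min(cap, base * (2 ** i))
        if jitter:
            delay *= random.uniform(0.5, 1.5)
        delays.append(delay)
    return delays
```

A caller would `time.sleep(delay)` between attempts and give up after the schedule is exhausted, surfacing the last error to the user.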