# crawler


## Install

```
npx skills add https://github.com/gn00678465/crawler-skill --skill crawler
```

Crawler Skill converts any URL into clean markdown using a robust 3-tier fallback chain.

## Quick start

```
uv run scripts/crawl.py --url https://example.com --output reports/example.md
```

Markdown is saved to the file specified by `--output`. Progress and errors go to stderr. Exit code 0 on success, 1 if all scrapers fail.

## How it works

The script tries each tier in order and returns the first success:

| Tier | Module | Requires |
| --- | --- | --- |
| 1 | Firecrawl (`firecrawl_scraper.py`) | `FIRECRAWL_API_KEY` env var (optional; falls back if missing) |
| 2 | Jina Reader (`jina_reader.py`) | Nothing — free, no key needed |
| 3 | Scrapling (`scrapling_scraper.py`) | Local headless browser (auto-installs via pip) |

## File layout

```
crawler-skill/
├── SKILL.md                     ← this file
├── scripts/
│   ├── crawl.py                 ← main CLI entry point (PEP 723 inline deps)
│   └── src/
│       ├── domain_router.py     ← URL-to-tier routing rules
│       ├── firecrawl_scraper.py ← Tier 1: Firecrawl API
│       ├── jina_reader.py       ← Tier 2: Jina r.jina.ai proxy
│       └── scrapling_scraper.py ← Tier 3: local headless scraper
└── tests/
    └── test_crawl.py            ← 70 pytest tests (all passing)
```
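The tier chain described under "How it works" can be sketched as a simple loop. This is an illustration only; the function and parameter names below are hypothetical, not the actual module API:

```python
from typing import Callable, List, Optional

def crawl(url: str, scrapers: List[Callable[[str], Optional[str]]]) -> Optional[str]:
    """Try each scraper tier in order; return the first successful markdown result.

    A tier that raises (missing API key, network error) or returns a falsy
    value simply falls through to the next tier.
    """
    for scrape in scrapers:
        try:
            markdown = scrape(url)
        except Exception:
            continue  # this tier failed; try the next one
        if markdown:  # content validation happens inside each scraper
            return markdown
    return None  # all tiers failed; the CLI would exit with code 1
```

In this sketch, passing `[firecrawl, jina, scrapling]` (in that order) reproduces the default chain, and domain routing amounts to filtering that list before the loop runs.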

## Usage examples

**Basic fetch** — tries Firecrawl, falls back to Jina, then Scrapling. Always prefer `--output` to avoid terminal encoding issues:

```
uv run scripts/crawl.py --url https://docs.python.org/3/ --output reports/python_docs.md
```

If no `--output` is provided, the markdown goes to stdout (not recommended on Windows):

```
uv run scripts/crawl.py --url https://example.com
```

**With a Firecrawl API key** for best results:

```
FIRECRAWL_API_KEY=fc-... uv run scripts/crawl.py --url https://example.com --output reports/example.md
```

## URL requirements

Only `http://` and `https://` URLs are accepted. Passing any other scheme (`ftp://`, `file://`, `javascript:`, a bare path, etc.) exits with code 1 and prints a clear error — no scraping is attempted.

## Saving reports

When the user asks to save the crawled content or a summary to a file, ALWAYS use the `--output` argument and save the file into the `reports/` directory at the project root (for example, `{project_root}/reports`). If the directory does not exist, the script will create it.

Example: if asked to "save to result.md", run:

```
uv run scripts/crawl.py --url <url> --output reports/result.md
```

**Point at a self-hosted Firecrawl instance:**

```
FIRECRAWL_API_URL=http://localhost:3002 uv run scripts/crawl.py --url https://example.com
```

## Content validation

Each scraper validates its output before returning success:

- Minimum 100 characters of content (rejects empty/error pages)
- Detection of CAPTCHA / bot-verification pages (Firecrawl)
- Detection of Cloudflare interstitial pages (Scrapling — escalates to StealthyFetcher)
- Detection of Jina error page indicators (`Error:`, `Access Denied`, etc.)

## Domain routing

Certain hostnames bypass one or more scraper tiers to avoid known compatibility issues. The logic lives in `scripts/src/domain_router.py`.

| Domain | Skipped tiers | Active chain |
| --- | --- | --- |
| `medium.com` (and subdomains) | firecrawl | jina → scrapling |
| `mp.weixin.qq.com` | firecrawl + jina | scrapling only |
| everything else | — | firecrawl → jina → scrapling |

Sub-domain matching follows a suffix rule: `blog.medium.com` matches the `medium.com` rule because its hostname ends with `.medium.com`. An exact sub-domain like `other.weixin.qq.com` does not match `mp.weixin.qq.com`.

## Running tests

```
uv run pytest tests/ -v
```

All 70 tests use mocking — no network calls, no API keys required.
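The suffix rule above can be sketched as a small predicate. This is an illustration of the stated matching semantics, not the actual code in `scripts/src/domain_router.py`:

```python
from urllib.parse import urlparse

def matches_domain(url: str, domain: str) -> bool:
    """True if the URL's hostname equals `domain` or is a subdomain of it.

    Implements the suffix rule: blog.medium.com matches medium.com because
    its hostname ends with ".medium.com", but other.weixin.qq.com does NOT
    match mp.weixin.qq.com (no suffix relationship).
    """
    hostname = (urlparse(url).hostname or "").lower()
    return hostname == domain or hostname.endswith("." + domain)
```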
## Dependencies (auto-installed by `uv run`)

- `firecrawl-py>=2.0` — Firecrawl Python SDK
- `httpx>=0.27` — HTTP client for Jina Reader
- `scrapling>=0.2` — headless scraping with stealth support
- `html2text>=2024.2.26` — HTML-to-markdown conversion

## When to invoke this skill

Invoke `crawl.py` whenever you need the text content of a web page:

```python
import subprocess

result = subprocess.run(
    ["uv", "run", "scripts/crawl.py", "--url", url],
    capture_output=True,
    text=True,
)
if result.returncode == 0:
    markdown = result.stdout
```

Or simply run it directly from the terminal as shown in Quick start above.
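The content-validation rules listed earlier (minimum length, error-page indicators) can be sketched roughly as follows. The function name and the indicator list are illustrative, not the skill's actual API:

```python
# Illustrative subset of error-page markers; the real scrapers check more
# (CAPTCHA pages, Cloudflare interstitials, Jina error strings).
ERROR_INDICATORS = ("Error:", "Access Denied")

def is_valid_content(markdown: str, min_chars: int = 100) -> bool:
    """Reject empty/short pages and obvious error pages before reporting success."""
    if len(markdown.strip()) < min_chars:
        return False  # below the 100-character minimum
    head = markdown[:500]  # error markers show up near the top of the page
    return not any(indicator in head for indicator in ERROR_INDICATORS)
```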
