# cli-web-scrape

Installs: 34
Rank: #19966

## Install

npx skills add https://github.com/molechowski/claude-skills --skill cli-web-scrape

## Scrapling CLI

Web scraping CLI with browser impersonation, anti-bot bypass, and CSS extraction.

### Prerequisites

Install with all extras (CLI needs click, fetchers need playwright/camoufox)

uv tool install 'scrapling[all]'

Install fetcher browser engines (one-time)

scrapling install

Verify: scrapling --help

### Fetcher Selection

| Tier | Command | Engine | Speed | Stealth | JS | Use When |
|------|---------|--------|-------|---------|----|----------|
| HTTP | extract get/post/put/delete | httpx + TLS impersonation | Fast | Medium | No | Static pages, APIs, most sites |
| Dynamic | extract fetch | Playwright (headless browser) | Medium | Low | Yes | JS-rendered SPAs, wait-for-element |
| Stealthy | extract stealthy-fetch | Camoufox (patched Firefox) | Slow | High | Yes | Cloudflare, aggressive anti-bot |

Default to the HTTP tier; escalate only when the page requires JS rendering or blocks HTTP requests.

### Output Format

Determined by the output file extension:

| Extension | Output | Best For |
|-----------|--------|----------|
| .html | Raw HTML | Parsing, further processing |
| .md | HTML converted to Markdown | Reading, LLM context |
| .txt | Text content only | Clean text extraction |

Always use /tmp/scrapling-*.{md,txt,html} for output files. Read the file after extraction.

### Core Commands

#### HTTP Tier: GET

scrapling extract get URL OUTPUT_FILE [OPTIONS]

| Flag | Purpose | Example |
|------|---------|---------|
| -s, --css-selector | Extract matching elements only | -s ".article-body" |
| --impersonate | Force a specific browser | --impersonate firefox |
| -H, --headers | Custom headers (repeatable) | -H "Authorization: Bearer tok" |
| --cookies | Cookie string | --cookies "session=abc123" |
| --proxy | Proxy URL | --proxy "http://user:pass@host:port" |
| -p, --params | Query params (repeatable) | -p "page=2" -p "limit=50" |
| --timeout | Seconds (default: 30) | --timeout 60 |
| --no-verify | Skip SSL verification | For self-signed certs |
| --no-follow-redirects | Don't follow redirects | For redirect inspection |
| --no-stealthy-headers | Disable stealth headers | For debugging |

Examples:

Basic page fetch as markdown

scrapling extract get "https://example.com" /tmp/scrapling-out.md

Extract only article content

scrapling extract get "https://news.site.com/article" /tmp/scrapling-out.txt -s "article"

Multiple CSS selectors

scrapling extract get "https://hn.com" /tmp/scrapling-out.txt -s ".titleline > a"

With auth header

scrapling extract get "https://api.example.com/data" /tmp/scrapling-out.txt -H "Authorization: Bearer TOKEN"

Impersonate Firefox

scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate firefox

Random browser impersonation from list

scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate "chrome,firefox,safari"

With proxy

scrapling extract get "https://example.com" /tmp/scrapling-out.md --proxy "http://proxy:8080"

#### HTTP Tier: POST

scrapling extract post URL OUTPUT_FILE [OPTIONS]

Additional options over GET:

| Flag | Purpose | Example |
|------|---------|---------|
| -d, --data | Form data | -d "param1=value1&param2=value2" |
| -j, --json | JSON body | -j '{"key": "value"}' |

POST with form data

scrapling extract post "https://api.example.com/search" /tmp/scrapling-out.txt -d "q=test&page=1"

POST with JSON

scrapling extract post "https://api.example.com/query" /tmp/scrapling-out.txt -j '{"query": "test"}'

PUT and DELETE share the same interface as POST and GET respectively.

#### Dynamic Tier: fetch

For JS-rendered pages. Launches a headless Playwright browser.

scrapling extract fetch URL OUTPUT_FILE [OPTIONS]

| Flag | Purpose | Default |
|------|---------|---------|
| --headless/--no-headless | Headless mode | True |
| --disable-resources | Drop images/CSS/fonts for speed | False |
| --network-idle | Wait for network idle | False |
| --timeout | Milliseconds | 30000 |
| --wait | Extra wait after load (ms) | 0 |
| -s, --css-selector | CSS selector extraction | — |
| --wait-selector | Wait for element before proceeding | — |
| --real-chrome | Use installed Chrome instead of bundled | False |
| --proxy | Proxy URL | — |
| -H, --extra-headers | Extra headers (repeatable) | — |

Fetch JS-rendered SPA

scrapling extract fetch "https://spa-app.com" /tmp/scrapling-out.md

Wait for specific element to load

scrapling extract fetch "https://dashboard.com" /tmp/scrapling-out.md --wait-selector ".data-table"

Fast mode: skip images/CSS, wait for network idle

scrapling extract fetch "https://app.com" /tmp/scrapling-out.md --disable-resources --network-idle

Extra wait for slow-loading content

scrapling extract fetch "https://lazy-site.com" /tmp/scrapling-out.md --wait 5000

#### Stealthy Tier: stealthy-fetch

Maximum anti-detection. Uses Camoufox (a patched Firefox).

scrapling extract stealthy-fetch URL OUTPUT_FILE [OPTIONS]

Additional options over fetch:

| Flag | Purpose | Default |
|------|---------|---------|
| --solve-cloudflare | Solve Cloudflare challenges | False |
| --block-webrtc | Block WebRTC (prevents IP leak) | False |
| --hide-canvas | Add noise to canvas fingerprinting | False |
| --block-webgl | Block WebGL fingerprinting | False (allowed) |

Bypass Cloudflare

scrapling extract stealthy-fetch "https://cf-protected.com" /tmp/scrapling-out.md --solve-cloudflare

Maximum stealth

scrapling extract stealthy-fetch "https://aggressive-antibot.com" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl

Stealthy with CSS selector

scrapling extract stealthy-fetch "https://protected.com" /tmp/scrapling-out.txt \
  --solve-cloudflare -s ".content"

### Auto-Escalation Protocol

ALL scrapling usage must follow this protocol. Never use extract get alone; always validate the content and escalate if needed. Consumer skills (res-deep, res-price-compare, doc-daily-digest) MUST use this pattern, not a bare extract get.

#### Step 1: HTTP Tier

scrapling extract get "URL" /tmp/scrapling-out.md

Read /tmp/scrapling-out.md and validate the content before proceeding.

#### Step 2: Validate Content

Check the scraped output for thin-content indicators: signs that the site requires JS rendering.

| Indicator | Pattern | Example |
|-----------|---------|---------|
| JS disabled warning | "JavaScript", "enable JavaScript", "JS wyłączony" | iSpot.pl, many SPAs |
| No product/price data | Output has navigation and footer but no prices, specs, or product names | E-commerce SPAs |
| Mostly nav links | 80%+ of content is menu items, category links, cookie banners | React/Angular/Vue apps |
| Very short content | Fewer than ~20 meaningful lines after stripping nav/footer | Hydration-dependent pages |
| Login/loading wall | "Loading...", "Please wait", skeleton UI text | Dashboard apps |

If ANY indicator is present, escalate to the Dynamic tier. Do NOT treat HTTP 200 with thin content as success.

#### Step 3: Dynamic Tier (if content validation fails)

scrapling extract fetch "URL" /tmp/scrapling-out.md --network-idle --disable-resources

Read and validate again. If the content is now rich, you are done. If still blocked (403, Cloudflare challenge, empty output), escalate.

#### Step 4: Stealthy Tier (if Dynamic tier fails)

scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md --solve-cloudflare

If still blocked, add the maximum-stealth flags:

scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl

### Consumer Skill Integration

When a consumer skill says "retry with scrapling" or "scrapling fallback", it means: follow the full auto-escalation protocol above, not just the HTTP tier.
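The Step 2 thin-content checks can be sketched as a small heuristic over the scraped Markdown. This is an illustrative Python sketch, not part of Scrapling itself: the function name, indicator strings, and thresholds are assumptions drawn from the table above and should be tuned per site.

```python
import re

# Hypothetical helper: heuristic check for "thin" HTTP-tier output.
# Indicator strings and the ~20-line threshold mirror the Step 2 table.
JS_WARNINGS = ("enable javascript", "javascript is disabled", "js wyłączony")
LOADING_WALLS = ("loading...", "please wait")

def looks_thin(markdown: str, min_lines: int = 20) -> bool:
    text = markdown.lower()
    # JS-disabled warnings or loading walls are an immediate escalation signal.
    if any(w in text for w in JS_WARNINGS + LOADING_WALLS):
        return True
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip()]
    # "Meaningful" lines: non-empty lines that are not bare Markdown links
    # (a rough proxy for nav/menu items in converted output).
    meaningful = [ln for ln in lines
                  if not re.fullmatch(r"[-*]?\s*\[.*\]\(.*\)", ln)]
    if len(meaningful) < min_lines:
        return True
    # Mostly nav links: 80%+ of lines are bare links.
    if lines and (len(lines) - len(meaningful)) / len(lines) >= 0.8:
        return True
    return False
```

A consumer skill would run this on /tmp/scrapling-out.md after Step 1 and move to Step 3 whenever it returns True.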
The pattern:

1. extract get → Read → Validate content
2. Content thin? → extract fetch --network-idle --disable-resources → Read → Validate
3. Still blocked? → extract stealthy-fetch --solve-cloudflare → Read
4. All tiers fail? → Skip and label "scrapling blocked"

Known JS-rendered sites (always start at the Dynamic tier):

- iSpot.pl: React SPA; the HTTP tier returns only the nav shell
- Single-page apps with client-side routing (hash or history API URLs)

### Interactive Shell

Launch REPL

scrapling shell

One-liner evaluation

scrapling shell -c 'Fetcher().get("https://example.com").css("title::text")'

### Troubleshooting

| Issue | Fix |
|-------|-----|
| ModuleNotFoundError: click | Reinstall: uv tool install --force 'scrapling[all]' |
| fetch/stealthy-fetch fails | Run scrapling install to install browser engines |
| Cloudflare still blocks | Add --block-webrtc --hide-canvas to stealthy-fetch |
| Timeout | Increase --timeout (seconds for HTTP, milliseconds for fetch/stealthy) |
| SSL error | Add --no-verify (HTTP tier only) |
| Empty output with selector | Try without -s first to verify the page loads, then refine the selector |

### Constraints

- The output file path is required; scrapling writes to a file, not stdout
- CSS selectors return ALL matches, concatenated
- The HTTP tier --timeout is in seconds; the fetch/stealthy-fetch --timeout is in milliseconds
- --impersonate is only available on the HTTP tier (fetch/stealthy handle it internally)
- --solve-cloudflare is only available on the stealthy-fetch tier
- Stealth headers are enabled by default on the HTTP tier; disable with --no-stealthy-headers for debugging
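The escalation ladder itself can be encoded as data, so a consumer skill always tries the tiers in the documented order and knows when to give up. This is a hedged sketch: the function and variable names are illustrative, not part of Scrapling; the command strings mirror the protocol steps above.

```python
# Hypothetical sketch of the auto-escalation ladder. next_command returns
# the next scrapling invocation to try, or None once all tiers failed
# (at which point the source is labeled "scrapling blocked" and skipped).
TIERS = [
    'scrapling extract get "{url}" {out}',
    'scrapling extract fetch "{url}" {out} --network-idle --disable-resources',
    'scrapling extract stealthy-fetch "{url}" {out} --solve-cloudflare',
]

def next_command(url: str, out: str, attempts: int):
    """Return the command for the given attempt number (0-based), or None."""
    if attempts >= len(TIERS):
        return None  # all tiers exhausted: label "scrapling blocked"
    return TIERS[attempts].format(url=url, out=out)
```

A driver loop would call next_command, run the result, read the output file, apply the Step 2 validation, and bump the attempt counter only when the content is thin or blocked.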
