# Website-to-Vite Scraper V2

Multi-provider website scraper with AI-powered extraction for any website type.

## Scraping Methods

| Method | Best For | Anti-Bot | JS Rendering | Cost |
|--------|----------|----------|--------------|------|
| Playwright | General sites, Next.js/React apps | ❌ | ✅ Full | FREE |
| Apify RAG Browser | LLM/RAG-optimized content | ✅ | ✅ Adaptive | Credits |
| Crawl4AI | AI training data, clean extraction | ✅ | ✅ | Credits |
| Firecrawl | Protected sites, anti-bot bypass | ✅✅ | ✅ | $16/mo |

## Quick Start

### GitHub Actions (Recommended)

Go to: Actions → Website Scraper V2 → Run workflow
Options:
- URL: https://www.reventure.app/
- Project name: reventure-clone
- Method: all (tries all providers)
- Deploy: true
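
The same run could also be started without the UI, through GitHub's REST endpoint for `workflow_dispatch` events. The sketch below only builds the request and sends nothing. It assumes the workflow's input names match the Workflow Parameters table in this README; the owner, repository, workflow file name (`website-scraper-v2.yml`), and token shown are hypothetical placeholders.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, workflow_file, token, url, project_name,
                           scrape_method="all", extract_assets=True,
                           deploy_cloudflare=True, ref="main"):
    """Build (but do not send) a workflow_dispatch request for the scraper."""
    # Input names are assumed to match the Workflow Parameters table.
    # Booleans are sent as the strings "true"/"false".
    inputs = {
        "url": url,
        "project_name": project_name,
        "scrape_method": scrape_method,
        "extract_assets": str(extract_assets).lower(),
        "deploy_cloudflare": str(deploy_cloudflare).lower(),
    }
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    endpoint = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical owner, repo, workflow file, and token, for illustration only.
req = build_dispatch_request(
    "your-org", "your-repo", "website-scraper-v2.yml", "YOUR_GITHUB_TOKEN",
    url="https://www.reventure.app/", project_name="reventure-clone",
)
```

Passing `req` to `urllib.request.urlopen` would trigger the run, provided the token has permission to dispatch workflows in that repository.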

## API MEGA LIBRARY Integration

The following APIs from our library enhance this scraper:

| API | Purpose | Status |
|-----|---------|--------|
| `APIFY_API_TOKEN` | RAG Browser, Crawl4AI, Web Scraper | ✅ Configured |
| `FIRECRAWL_API_KEY` | Anti-bot bypass, stealth mode | ✅ Configured |
| `BROWSERLESS_API_KEY` | Alternative headless browser | 🔄 Available |

## MCP Server Integration

Connect Claude Desktop/Cursor to Apify MCP for AI-powered scraping:

```json
{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["@apify/actors-mcp-server"],
      "env": {
        "APIFY_TOKEN": "your-apify-api-token"
      }
    }
  }
}
```

Or use hosted: `https://mcp.apify.com?token=YOUR_TOKEN`

## Apify Actors Used

### apify/rag-web-browser

- Purpose: LLM-optimized web content extraction
- Output: Markdown, HTML, text
- Features:
  - Playwright adaptive (handles JS)
  - Clean content extraction
  - Link following
  - Metadata extraction

### raizen/ai-web-scraper (Crawl4AI)

- Purpose: AI training data collection
- Output: Cleaned markdown, structured links
- Features:
  - Excludes boilerplate (headers, footers, nav)
  - Word count thresholding
  - External link filtering

### Firecrawl

- Purpose: Anti-bot protected sites
- Output: Markdown, HTML, screenshots
- Features:
  - Anti-detection technology
  - JavaScript rendering
  - Main content extraction
  - 5-second wait for dynamic content

## Output Structure

```
project-name/
├── dist/
│   ├── index.html          # Best merged HTML
│   ├── screenshot.png      # Full page capture
│   ├── meta.json           # Scrape metadata
│   └── assets/
│       ├── images/         # Downloaded images
│       ├── css/            # Stylesheets
│       └── js/             # Scripts
└── results/
    ├── playwright/         # Raw Playwright output
    ├── apify-rag/          # RAG Browser output
    ├── crawl4ai/           # Crawl4AI output
    └── firecrawl/          # Firecrawl output
```

## Handling CSR/SPA Sites

Sites like Next.js, React, Vue that render client-side require JavaScript execution:

- Playwright waits for `networkidle` + 5 seconds
- Apify RAG uses adaptive crawler (Playwright when needed)
- Firecrawl has built-in JS rendering

For `__NEXT_DATA__` extraction (Next.js sites):

- Playwright automatically extracts and saves to `next_data.json`
- Can be parsed to reconstruct static pages

## Workflow Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | required | Website URL to scrape |
| `project_name` | string | required | Output folder/Cloudflare project name |
| `scrape_method` | choice | `playwright` | Method to use |
| `extract_assets` | boolean | `true` | Download images/CSS/JS |
| `deploy_cloudflare` | boolean | `true` | Deploy to Cloudflare Pages |

## Cost Optimization

| Scenario | Recommended Method |
|----------|--------------------|
| Simple static site | Playwright (FREE) |
| JS-heavy SPA | Playwright → Apify RAG fallback |
| Protected site (Cloudflare) | Firecrawl |
| AI/RAG pipeline | Apify RAG or Crawl4AI |
| Maximum coverage | `all` method |

## Security Assessment

Per API_MEGA_LIBRARY guidelines:

| API | Security Score | Recommendation |
|-----|----------------|----------------|
| Apify | 85/100 | ✅ ADOPT |
| Firecrawl | 82/100 | ✅ ADOPT |
| Playwright | 90/100 | ✅ ADOPT (local) |

## Troubleshooting

### Site returns blank page

- Try `scrape_method: all` to use multiple providers
- Increase wait time in Playwright
- Check if site blocks datacenter IPs → use Firecrawl

### Assets not downloading

- Some sites block direct asset requests
- Use relative paths from original HTML
- Check for CORS restrictions

### Cloudflare protection detected

- Use Firecrawl (has anti-bot bypass)
- Or use Apify with residential proxies
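
The blank-page case and the "Playwright → Apify RAG fallback" idea can be illustrated with a small provider chain. This is a minimal sketch, not the workflow's actual logic: `looks_blank`, `scrape_with_fallback`, and the 50-character threshold are hypothetical, and each provider is assumed to be a plain callable that takes a URL and returns HTML.

```python
from html.parser import HTMLParser
from typing import Callable, Iterable, List, Optional, Tuple


class _VisibleText(HTMLParser):
    """Collects text found outside <script>, <style>, and <noscript>."""

    _SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0  # greater than 0 while inside a skipped tag
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_blank(html: str, min_chars: int = 50) -> bool:
    """Heuristic: a page with very little visible text is treated as blank."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks) < min_chars


Provider = Callable[[str], str]  # hypothetical: takes a URL, returns HTML


def scrape_with_fallback(
    url: str, providers: Iterable[Tuple[str, Provider]]
) -> Optional[Tuple[str, str]]:
    """Try providers in order; return (name, html) for the first non-blank result."""
    for name, fetch in providers:
        try:
            html = fetch(url)
        except Exception:
            continue  # provider failed or was blocked; try the next one
        if not looks_blank(html):
            return name, html
    return None
```

Ordering the providers from cheapest to most expensive (for example Playwright first, Firecrawl last) keeps paid credits for the sites that actually need them.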