# Web Spider

Crawl and analyze websites.

## Prerequisites

```bash
# curl for fetching
curl --version

# Gemini for analysis
pip install google-generativeai
export GEMINI_API_KEY=your_api_key

# Optional: better HTML parsing
npm install -g puppeteer
pip install beautifulsoup4
```
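If beautifulsoup4 is installed, link extraction is far more robust than grepping raw HTML. A minimal sketch (the URL is a placeholder):

```bash
curl -s "https://example.com" | python3 -c "
from bs4 import BeautifulSoup
import sys

# Parse stdin and print every href attribute on an <a> tag
soup = BeautifulSoup(sys.stdin.read(), 'html.parser')
for a in soup.find_all('a', href=True):
    print(a['href'])
"
```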
## Basic Crawling

### Fetch Page

```bash
# Simple fetch
curl -s "https://example.com"

# With custom headers
curl -s -H "User-Agent: Mozilla/5.0" "https://example.com"

# Follow redirects
curl -sL "https://example.com"

# Save to file
curl -s "https://example.com" -o page.html

# Get headers only
curl -sI "https://example.com"

# Get headers and body
curl -sD - "https://example.com"
```
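When you only care whether a page is reachable, curl can print the status code alone; `-o /dev/null` discards the body and `-w` formats the output:

```bash
# Prints e.g. 200, 301, or 404
curl -s -o /dev/null -w "%{http_code}\n" "https://example.com"
```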
### Extract Links

```bash
# Extract all links from a page
curl -s "https://example.com" | grep -oE 'href="[^"]*"' | sed 's/href="//;s/"$//'

# Filter to specific domain
curl -s "https://example.com" | grep -oE 'href="https?://example.com[^"]*"'

# Unique links
curl -s "https://example.com" | grep -oE 'href="[^"]*"' | sort -u
```
### Quick Site Scan

```bash
#!/bin/bash
URL=$1

echo "=== Scanning: $URL ==="

echo ""
echo "### Response Info ###"
curl -sI "$URL" | head -20

echo ""
echo "### Links Found ###"
curl -s "$URL" | grep -oE 'href="[^"]*"' | sort -u | head -20

echo ""
echo "### Scripts ###"
curl -s "$URL" | grep -oE 'src="[^"]*\.js[^"]*"' | sort -u

echo ""
echo "### Meta Tags ###"
curl -s "$URL" | grep -oE '<meta[^>]*>' | head -10
```
## AI-Powered Analysis

### Page Analysis

```bash
CONTENT=$(curl -s "https://example.com")

gemini -m pro -o text -e "" "Analyze this webpage:

$CONTENT

Provide:
1. What is this page about?
2. Key information/content
3. Navigation structure
4. Technical observations (framework, libraries)
5. SEO observations"
```
### Security Scan

```bash
URL="https://example.com"

# Gather information
HEADERS=$(curl -sI "$URL")
CONTENT=$(curl -s "$URL" | head -1000)

gemini -m pro -o text -e "" "Security scan this website:

URL: $URL

HEADERS:
$HEADERS

CONTENT SAMPLE:
$CONTENT

Check for:
1. Missing security headers (CSP, HSTS, X-Frame-Options)
2. Exposed sensitive information
3. Potential vulnerabilities in scripts/forms
4. Cookie security settings
5. HTTPS configuration"
```
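Before spending model calls, a plain grep over `$HEADERS` catches the obvious gaps; the header list here is a small sample, not exhaustive:

```bash
for h in strict-transport-security content-security-policy x-frame-options x-content-type-options; do
  if echo "$HEADERS" | grep -qi "^$h:"; then
    echo "present: $h"
  else
    echo "MISSING: $h"
  fi
done
```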
### Extract Structured Data

```bash
CONTENT=$(curl -s "https://example.com/products")

gemini -m pro -o text -e "" "Extract structured data from this page:

$CONTENT

Extract into JSON format:
- Product names
- Prices
- Descriptions
- Any available metadata"
```
## Multi-Page Crawling

### Crawl and Analyze

```bash
#!/bin/bash
BASE_URL=$1
MAX_PAGES=${2:-10}

# Get initial links
LINKS=$(curl -s "$BASE_URL" | grep -oE "href=\"$BASE_URL[^\"]*\"" | sed 's/href="//;s/"$//' | sort -u | head -$MAX_PAGES)

echo "Found $(echo "$LINKS" | wc -l) pages"

for link in $LINKS; do
  echo ""
  echo "=== $link ==="
  TITLE=$(curl -s "$link" | grep -oE '<title>[^<]*</title>' | sed 's/<[^>]*>//g')
  echo "Title: $TITLE"
done
```
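If the page list grows, fetching titles sequentially gets slow. One option is parallel fetching with GNU xargs; `-P 4` (four concurrent jobs) is an arbitrary choice, and keep the rate-limiting advice under Best Practices in mind:

```bash
echo "$LINKS" | xargs -P 4 -I{} sh -c \
  'echo "=== {} ==="; curl -s "{}" | grep -oE "<title>[^<]*</title>"'
```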
### Sitemap Processing

```bash
# Fetch and parse sitemap
curl -s "https://example.com/sitemap.xml" | grep -oE '<loc>[^<]*</loc>' | sed 's/<[^>]*>//g'

# Crawl pages from sitemap
curl -s "https://example.com/sitemap.xml" | \
  grep -oE '<loc>[^<]*</loc>' | sed 's/<[^>]*>//g' | \
  while read -r url; do
    echo "Crawling: $url"
    curl -s "$url" -o "$(basename "$url").html"
    sleep 1
  done
```
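A quick way to see how many URLs a sitemap lists before committing to a crawl (`grep -o` emits one match per line, so `wc -l` counts occurrences even in single-line XML):

```bash
curl -s "https://example.com/sitemap.xml" | grep -o '<loc>' | wc -l
```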
## Specific Extractions

### Extract Text Content

```bash
# Remove HTML tags (basic)
curl -s "https://example.com" | sed 's/<[^>]*>//g' | tr -s ' \n'

# Using python
curl -s "https://example.com" | python3 -c "
from html.parser import HTMLParser
import sys

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data.strip())

p = TextExtractor()
p.feed(sys.stdin.read())
print(' '.join(filter(None, p.text)))
"
```
### Extract API Endpoints

```bash
CONTENT=$(curl -s "https://example.com/app.js")

# Find API calls in JS
echo "$CONTENT" | grep -oE "(fetch|axios|http)\(['\"][^'\"]*['\"]" | sort -u

# AI extraction
gemini -m pro -o text -e "" "Extract API endpoints from this JavaScript:

$CONTENT

List all:
- API URLs
- HTTP methods used
- Request patterns"
```
## Monitor Changes

```bash
#!/bin/bash
URL=$1
HASH_FILE="/tmp/page-hash-$(echo "$URL" | md5sum | cut -d' ' -f1)"

CURRENT=$(curl -s "$URL" | md5sum | cut -d' ' -f1)

if [ -f "$HASH_FILE" ]; then
  PREVIOUS=$(cat "$HASH_FILE")
  if [ "$CURRENT" != "$PREVIOUS" ]; then
    echo "Page changed: $URL"
    # Notify or log
  fi
fi

echo "$CURRENT" > "$HASH_FILE"
```
## Best Practices

- **Respect robots.txt** - Check before crawling (see the sketch below)
- **Rate limit** - Don't overwhelm servers
- **Set User-Agent** - Identify your crawler
- **Handle errors** - Sites go down, pages 404
- **Cache responses** - Don't re-fetch unnecessarily
- **Be ethical** - Only crawl what you're allowed to
- **Check ToS** - Some sites prohibit scraping
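A minimal sketch combining the first three practices. The robots.txt check is deliberately naive (a real crawler should parse Allow/Disallow rules per path and per user agent), and all names, paths, and delays are placeholders:

```bash
#!/bin/bash
BASE="https://example.com"
AGENT="my-spider/0.1 (contact@example.com)"

# Naive robots.txt check: bail out only on a blanket "Disallow: /"
if curl -s "$BASE/robots.txt" | grep -qi '^Disallow: */$'; then
  echo "Crawling disallowed by robots.txt" >&2
  exit 1
fi

for path in / /about /contact; do
  # Identify the crawler via User-Agent and save each page locally
  curl -s -A "$AGENT" "$BASE$path" -o "page$(echo "$path" | tr / _).html"
  sleep 2   # rate limit: one request every two seconds (arbitrary)
done
```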