data-scraper-agent

安装量: 90
排名: #8913

安装

npx skills add https://github.com/affaan-m/everything-claude-code --skill data-scraper-agent

Data Scraper Agent Build a production-ready, AI-powered data collection agent for any public data source. Runs on a schedule, enriches results with a free LLM, stores to a database, and improves over time. Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase When to Activate User wants to scrape or monitor any public website or API User says "build a bot that checks...", "monitor X for me", "collect data from..." User wants to track jobs, prices, news, repos, sports scores, events, listings User asks how to automate data collection without paying for hosting User wants an agent that gets smarter over time based on their decisions Core Concepts The Three Layers Every data scraper agent has three layers: COLLECT → ENRICH → STORE │ │ │ Scraper AI (LLM) Database runs on scores/ Notion / schedule summarises Sheets / & classifies Supabase Free Stack Layer Tool Why Scraping requests + BeautifulSoup No cost, covers 80% of public sites JS-rendered sites playwright (free) When HTML scraping fails AI enrichment Gemini Flash via REST API 500 req/day, 1M tokens/day — free Storage Notion API Free tier, great UI for review Schedule GitHub Actions cron Free for public repos Learning JSON feedback file in repo Zero infra, persists in git AI Model Fallback Chain Build agents to auto-fallback across Gemini models on quota exhaustion: gemini-2.0-flash-lite (30 RPM) → gemini-2.0-flash (15 RPM) → gemini-2.5-flash (10 RPM) → gemini-flash-lite-latest (fallback) Batch API Calls for Efficiency Never call the LLM once per item. Always batch:

BAD: 33 API calls for 33 items

for item in items : result = call_ai ( item )

33 calls → hits rate limit

GOOD: 7 API calls for 33 items (batch size 5)

for batch in chunks ( items , size = 5 ) : results = call_ai ( batch )

7 calls → stays within free tier

Workflow Step 1: Understand the Goal Ask the user: What to collect: "What data source? URL / API / RSS / public endpoint?" What to extract: "What fields matter? Title, price, URL, date, score?" How to store: "Where should results go? Notion, Google Sheets, Supabase, or local file?" How to enrich: "Do you want AI to score, summarise, classify, or match each item?" Frequency: "How often should it run? Every hour, daily, weekly?" Common examples to prompt: Job boards → score relevance to resume Product prices → alert on drops GitHub repos → summarise new releases News feeds → classify by topic + sentiment Sports results → extract stats to tracker Events calendar → filter by interest Step 2: Design the Agent Architecture Generate this directory structure for the user: my-agent/ ├── config.yaml # User customises this (keywords, filters, preferences) ├── profile/ │ └── context.md # User context the AI uses (resume, interests, criteria) ├── scraper/ │ ├── init.py │ ├── main.py # Orchestrator: scrape → enrich → store │ ├── filters.py # Rule-based pre-filter (fast, before AI) │ └── sources/ │ ├── init.py │ └── source_name.py # One file per data source ├── ai/ │ ├── init.py │ ├── client.py # Gemini REST client with model fallback │ ├── pipeline.py # Batch AI analysis │ ├── jd_fetcher.py # Fetch full content from URLs (optional) │ └── memory.py # Learn from user feedback ├── storage/ │ ├── init.py │ └── notion_sync.py # Or sheets_sync.py / supabase_sync.py ├── data/ │ └── feedback.json # User decision history (auto-updated) ├── .env.example ├── setup.py # One-time DB/schema creation ├── enrich_existing.py # Backfill AI scores on old rows ├── requirements.txt └── .github/ └── workflows/ └── scraper.yml # GitHub Actions schedule Step 3: Build the Scraper Source Template for any data source:

scraper/sources/my_source.py

""" [Source Name] — scrapes [what] from [where]. Method: [REST API / HTML scraping / RSS feed] """ import requests from bs4 import BeautifulSoup from datetime import datetime , timezone from scraper . filters import is_relevant HEADERS = { "User-Agent" : "Mozilla/5.0 (compatible; research-bot/1.0)" , } def fetch ( ) -

list [ dict ] : """ Returns a list of items with consistent schema. Each item must have at minimum: name, url, date_found. """ results = [ ]

---- REST API source ----

resp

requests . get ( "https://api.example.com/items" , headers = HEADERS , timeout = 15 ) if resp . status_code == 200 : for item in resp . json ( ) . get ( "results" , [ ] ) : if not is_relevant ( item . get ( "title" , "" ) ) : continue results . append ( _normalise ( item ) ) return results def _normalise ( raw : dict ) -

dict : """Convert raw API/HTML data to the standard schema.""" return { "name" : raw . get ( "title" , "" ) , "url" : raw . get ( "link" , "" ) , "source" : "MySource" , "date_found" : datetime . now ( timezone . utc ) . date ( ) . isoformat ( ) ,

add domain-specific fields here

} HTML scraping pattern: soup = BeautifulSoup ( resp . text , "lxml" ) for card in soup . select ( "[class*='listing']" ) : title = card . select_one ( "h2, h3" ) . get_text ( strip = True ) link = card . select_one ( "a" ) [ "href" ] if not link . startswith ( "http" ) : link = f"https://example.com { link } " RSS feed pattern: import xml . etree . ElementTree as ET root = ET . fromstring ( resp . text ) for item in root . findall ( ".//item" ) : title = item . findtext ( "title" , "" ) link = item . findtext ( "link" , "" ) Step 4: Build the Gemini AI Client

ai/client.py

import os , json , time , requests _last_call = 0.0 MODEL_FALLBACK = [ "gemini-2.0-flash-lite" , "gemini-2.0-flash" , "gemini-2.5-flash" , "gemini-flash-lite-latest" , ] def generate ( prompt : str , model : str = "" , rate_limit : float = 7.0 ) -

dict : """Call Gemini with auto-fallback on 429. Returns parsed JSON or {}.""" global _last_call api_key = os . environ . get ( "GEMINI_API_KEY" , "" ) if not api_key : return { } elapsed = time . time ( ) - _last_call if elapsed < rate_limit : time . sleep ( rate_limit - elapsed ) models = [ model ] + [ m for m in MODEL_FALLBACK if m != model ] if model else MODEL_FALLBACK _last_call = time . time ( ) for m in models : url = f"https://generativelanguage.googleapis.com/v1beta/models/ { m } :generateContent?key= { api_key } " payload = { "contents" : [ { "parts" : [ { "text" : prompt } ] } ] , "generationConfig" : { "responseMimeType" : "application/json" , "temperature" : 0.3 , "maxOutputTokens" : 2048 , } , } try : resp = requests . post ( url , json = payload , timeout = 30 ) if resp . status_code == 200 : return _parse ( resp ) if resp . status_code in ( 429 , 404 ) : time . sleep ( 1 ) continue return { } except requests . RequestException : return { } return { } def _parse ( resp ) -

dict : try : text = ( resp . json ( ) . get ( "candidates" , [ { } ] ) [ 0 ] . get ( "content" , { } ) . get ( "parts" , [ { } ] ) [ 0 ] . get ( "text" , "" ) . strip ( ) ) if text . startswith ( "" ) : text = text . split ( "\n" , 1 ) [ - 1 ] . rsplit ( "" , 1 ) [ 0 ] return json . loads ( text ) except ( json . JSONDecodeError , KeyError ) : return { } Step 5: Build the AI Pipeline (Batch)

ai/pipeline.py

import json import yaml from pathlib import Path from ai . client import generate def analyse_batch ( items : list [ dict ] , context : str = "" , preference_prompt : str = "" ) -

list [ dict ] : """Analyse items in batches. Returns items enriched with AI fields.""" config = yaml . safe_load ( ( Path ( file ) . parent . parent / "config.yaml" ) . read_text ( ) ) model = config . get ( "ai" , { } ) . get ( "model" , "gemini-2.5-flash" ) rate_limit = config . get ( "ai" , { } ) . get ( "rate_limit_seconds" , 7.0 ) min_score = config . get ( "ai" , { } ) . get ( "min_score" , 0 ) batch_size = config . get ( "ai" , { } ) . get ( "batch_size" , 5 ) batches = [ items [ i : i + batch_size ] for i in range ( 0 , len ( items ) , batch_size ) ] print ( f" [AI] { len ( items ) } items → { len ( batches ) } API calls" ) enriched = [ ] for i , batch in enumerate ( batches ) : print ( f" [AI] Batch { i + 1 } / { len ( batches ) } ..." ) prompt = build_prompt ( batch , context , preference_prompt , config ) result = generate ( prompt , model = model , rate_limit = rate_limit ) analyses = result . get ( "analyses" , [ ] ) for j , item in enumerate ( batch ) : ai = analyses [ j ] if j < len ( analyses ) else { } if ai : score = max ( 0 , min ( 100 , int ( ai . get ( "score" , 0 ) ) ) ) if min_score and score < min_score : continue enriched . append ( { ** item , "ai_score" : score , "ai_summary" : ai . get ( "summary" , "" ) , "ai_notes" : ai . get ( "notes" , "" ) } ) else : enriched . append ( item ) return enriched def _build_prompt ( batch , context , preference_prompt , config ) : priorities = config . get ( "priorities" , [ ] ) items_text = "\n\n" . join ( f"Item { i + 1 } : { json . dumps ( { k : v for k , v in item . items ( ) if not k . startswith ( '' ) } ) } " for i , item in enumerate ( batch ) ) return f"""Analyse these { len ( batch ) } items and return a JSON object.

Items

{ items_text }

User Context

{ context [ : 800] if context else "Not provided" }

User Priorities

{ chr ( 10 ) . join ( f"- { p } " for p in priorities ) } { preference_prompt }

Instructions

Return: {{"analyses": [{{"score": <0-100>, "summary": "<2 sentences>", "notes": ""}} for each item in order]}} Be concise. Score 90+=excellent match, 70-89=good, 50-69=ok, <50=weak.""" Step 6: Build the Feedback Learning System

ai/memory.py

"""Learn from user decisions to improve future scoring.""" import json from pathlib import Path FEEDBACK_PATH = Path ( file ) . parent . parent / "data" / "feedback.json" def load_feedback ( ) -

dict : if FEEDBACK_PATH . exists ( ) : try : return json . loads ( FEEDBACK_PATH . read_text ( ) ) except ( json . JSONDecodeError , OSError ) : pass return { "positive" : [ ] , "negative" : [ ] } def save_feedback ( fb : dict ) : FEEDBACK_PATH . parent . mkdir ( parents = True , exist_ok = True ) FEEDBACK_PATH . write_text ( json . dumps ( fb , indent = 2 ) ) def build_preference_prompt ( feedback : dict , max_examples : int = 15 ) -

str : """Convert feedback history into a prompt bias section.""" lines = [ ] if feedback . get ( "positive" ) : lines . append ( "# Items the user LIKED (positive signal):" ) for e in feedback [ "positive" ] [ - max_examples : ] : lines . append ( f"- { e } " ) if feedback . get ( "negative" ) : lines . append ( "\n# Items the user SKIPPED/REJECTED (negative signal):" ) for e in feedback [ "negative" ] [ - max_examples : ] : lines . append ( f"- { e } " ) if lines : lines . append ( "\nUse these patterns to bias scoring on new items." ) return "\n" . join ( lines ) Integration with your storage layer: after each run, query your DB for items with positive/negative status and call save_feedback() with the extracted patterns. Step 7: Build Storage (Notion example)

storage/notion_sync.py

import os from notion_client import Client from notion_client . errors import APIResponseError _client = None def get_client ( ) : global _client if _client is None : _client = Client ( auth = os . environ [ "NOTION_TOKEN" ] ) return _client def get_existing_urls ( db_id : str ) -

set [ str ] : """Fetch all URLs already stored — used for deduplication.""" client , seen , cursor = get_client ( ) , set ( ) , None while True : resp = client . databases . query ( database_id = db_id , page_size = 100 , ** { "start_cursor" : cursor } if cursor else { } ) for page in resp [ "results" ] : url = page [ "properties" ] . get ( "URL" , { } ) . get ( "url" , "" ) if url : seen . add ( url ) if not resp [ "has_more" ] : break cursor = resp [ "next_cursor" ] return seen def push_item ( db_id : str , item : dict ) -

bool : """Push one item to Notion. Returns True on success.""" props = { "Name" : { "title" : [ { "text" : { "content" : item . get ( "name" , "" ) [ : 100 ] } } ] } , "URL" : { "url" : item . get ( "url" ) } , "Source" : { "select" : { "name" : item . get ( "source" , "Unknown" ) } } , "Date Found" : { "date" : { "start" : item . get ( "date_found" ) } } , "Status" : { "select" : { "name" : "New" } } , }

AI fields

if item . get ( "ai_score" ) is not None : props [ "AI Score" ] = { "number" : item [ "ai_score" ] } if item . get ( "ai_summary" ) : props [ "Summary" ] = { "rich_text" : [ { "text" : { "content" : item [ "ai_summary" ] [ : 2000 ] } } ] } if item . get ( "ai_notes" ) : props [ "Notes" ] = { "rich_text" : [ { "text" : { "content" : item [ "ai_notes" ] [ : 2000 ] } } ] } try : get_client ( ) . pages . create ( parent = { "database_id" : db_id } , properties = props ) return True except APIResponseError as e : print ( f"[notion] Push failed: { e } " ) return False def sync ( db_id : str , items : list [ dict ] ) -

tuple [ int , int ] : existing = get_existing_urls ( db_id ) added = skipped = 0 for item in items : if item . get ( "url" ) in existing : skipped += 1 ; continue if push_item ( db_id , item ) : added += 1 ; existing . add ( item [ "url" ] ) else : skipped += 1 return added , skipped Step 8: Orchestrate in main.py

scraper/main.py

import os , sys , yaml from pathlib import Path from dotenv import load_dotenv load_dotenv ( ) from scraper . sources import my_source

add your sources

NOTE: This example uses Notion. If storage.provider is "sheets" or "supabase",

replace this import with storage.sheets_sync or storage.supabase_sync and update

the env var and sync() call accordingly.

from storage . notion_sync import sync SOURCES = [ ( "My Source" , my_source . fetch ) , ] def ai_enabled ( ) : return bool ( os . environ . get ( "GEMINI_API_KEY" ) ) def main ( ) : config = yaml . safe_load ( ( Path ( file ) . parent . parent / "config.yaml" ) . read_text ( ) ) provider = config . get ( "storage" , { } ) . get ( "provider" , "notion" )

Resolve the storage target identifier from env based on provider

if provider == "notion" : db_id = os . environ . get ( "NOTION_DATABASE_ID" ) if not db_id : print ( "ERROR: NOTION_DATABASE_ID not set" ) ; sys . exit ( 1 ) else :

Extend here for sheets (SHEET_ID) or supabase (SUPABASE_TABLE) etc.

print ( f"ERROR: provider ' { provider } ' not yet wired in main.py" ) ; sys . exit ( 1 ) config = yaml . safe_load ( ( Path ( file ) . parent . parent / "config.yaml" ) . read_text ( ) ) all_items = [ ] for name , fetch_fn in SOURCES : try : items = fetch_fn ( ) print ( f"[ { name } ] { len ( items ) } items" ) all_items . extend ( items ) except Exception as e : print ( f"[ { name } ] FAILED: { e } " )

Deduplicate by URL

seen , deduped = set ( ) , [ ] for item in all_items : if ( url := item . get ( "url" , "" ) ) and url not in seen : seen . add ( url ) ; deduped . append ( item ) print ( f"Unique items: { len ( deduped ) } " ) if ai_enabled ( ) and deduped : from ai . memory import load_feedback , build_preference_prompt from ai . pipeline import analyse_batch

load_feedback() reads data/feedback.json written by your feedback sync script.

To keep it current, implement a separate feedback_sync.py that queries your

storage provider for items with positive/negative statuses and calls save_feedback().

feedback

load_feedback ( ) preference = build_preference_prompt ( feedback ) context_path = Path ( file ) . parent . parent / "profile" / "context.md" context = context_path . read_text ( ) if context_path . exists ( ) else "" deduped = analyse_batch ( deduped , context = context , preference_prompt = preference ) else : print ( "[AI] Skipped — GEMINI_API_KEY not set" ) added , skipped = sync ( db_id , deduped ) print ( f"Done — { added } new, { skipped } existing" ) if name == "main" : main ( ) Step 9: GitHub Actions Workflow

.github/workflows/scraper.yml

name : Data Scraper Agent on : schedule : - cron : "0 /3 * * "

every 3 hours — adjust to your needs

workflow_dispatch :

allow manual trigger

permissions : contents : write

required for the feedback-history commit step

jobs : scrape : runs-on : ubuntu - latest timeout-minutes : 20 steps : - uses : actions/checkout@v4 - uses : actions/setup - python@v5 with : python-version : "3.11" cache : "pip" - run : pip install - r requirements.txt

Uncomment if Playwright is enabled in requirements.txt

- name: Install Playwright browsers

run: python -m playwright install chromium --with-deps

- name : Run agent env : NOTION_TOKEN : $ { { secrets.NOTION_TOKEN } } NOTION_DATABASE_ID : $ { { secrets.NOTION_DATABASE_ID } } GEMINI_API_KEY : $ { { secrets.GEMINI_API_KEY } } run : python - m scraper.main - name : Commit feedback history run : | git config user.name "github-actions[bot]" git config user.email "github-actions[bot]@users.noreply.github.com" git add data/feedback.json || true git diff --cached --quiet || git commit -m "chore: update feedback history" git push Step 10: config.yaml Template

Customise this file — no code changes needed

What to collect (pre-filter before AI)

filters : required_keywords : [ ]

item must contain at least one

blocked_keywords : [ ]

item must not contain any

Your priorities — AI uses these for scoring

priorities : - "example priority 1" - "example priority 2"

Storage

storage : provider : "notion"

notion | sheets | supabase | sqlite

Feedback learning

feedback : positive_statuses : [ "Saved" , "Applied" , "Interested" ] negative_statuses : [ "Skip" , "Rejected" , "Not relevant" ]

AI settings

ai : enabled : true model : "gemini-2.5-flash" min_score : 0

filter out items below this score

rate_limit_seconds : 7

seconds between API calls

batch_size : 5

items per API call

Common Scraping Patterns Pattern 1: REST API (easiest) resp = requests . get ( url , params = { "q" : query } , headers = HEADERS , timeout = 15 ) items = resp . json ( ) . get ( "results" , [ ] ) Pattern 2: HTML Scraping soup = BeautifulSoup ( resp . text , "lxml" ) for card in soup . select ( ".listing-card" ) : title = card . select_one ( "h2" ) . get_text ( strip = True ) href = card . select_one ( "a" ) [ "href" ] Pattern 3: RSS Feed import xml . etree . ElementTree as ET root = ET . fromstring ( resp . text ) for item in root . findall ( ".//item" ) : title = item . findtext ( "title" , "" ) link = item . findtext ( "link" , "" ) pub_date = item . findtext ( "pubDate" , "" ) Pattern 4: Paginated API page = 1 while True : resp = requests . get ( url , params = { "page" : page , "limit" : 50 } , timeout = 15 ) data = resp . json ( ) items = data . get ( "results" , [ ] ) if not items : break for item in items : results . append ( _normalise ( item ) ) if not data . get ( "has_more" ) : break page += 1 Pattern 5: JS-Rendered Pages (Playwright) from playwright . sync_api import sync_playwright with sync_playwright ( ) as p : browser = p . chromium . launch ( ) page = browser . new_page ( ) page . goto ( url ) page . wait_for_selector ( ".listing" ) html = page . content ( ) browser . close ( ) soup = BeautifulSoup ( html , "lxml" ) Anti-Patterns to Avoid Anti-pattern Problem Fix One LLM call per item Hits rate limits instantly Batch 5 items per call Hardcoded keywords in code Not reusable Move all config to config.yaml Scraping without rate limit IP ban Add time.sleep(1) between requests Storing secrets in code Security risk Always use .env + GitHub Secrets No deduplication Duplicate rows pile up Always check URL before pushing Ignoring robots.txt Legal/ethical risk Respect crawl rules; use public APIs when available JS-rendered sites with requests Empty response Use Playwright or look for the underlying API maxOutputTokens too low Truncated JSON, parse error Use 2048+ for batch responses Free Tier Limits Reference Service Free Limit Typical Usage Gemini Flash Lite 30 RPM, 1500 RPD ~56 req/day at 3-hr intervals Gemini 2.0 Flash 15 RPM, 1500 RPD Good fallback Gemini 2.5 Flash 10 RPM, 500 RPD Use sparingly GitHub Actions Unlimited (public repos) ~20 min/day Notion API Unlimited ~200 writes/day Supabase 500MB DB, 2GB transfer Fine for most agents Google Sheets API 300 req/min Works for small agents Requirements Template requests==2.31.0 beautifulsoup4==4.12.3 lxml==5.1.0 python-dotenv==1.0.1 pyyaml==6.0.2 notion-client==2.2.1 # if using Notion

playwright==1.40.0 # uncomment for JS-rendered sites

Quality Checklist Before marking the agent complete: config.yaml controls all user-facing settings — no hardcoded values profile/context.md holds user-specific context for AI matching Deduplication by URL before every storage push Gemini client has model fallback chain (4 models) Batch size ≤ 5 items per API call maxOutputTokens ≥ 2048 .env is in .gitignore .env.example provided for onboarding setup.py creates DB schema on first run enrich_existing.py backfills AI scores on old rows GitHub Actions workflow commits feedback.json after each run README covers: setup in < 5 minutes, required secrets, customisation Real-World Examples "Build me an agent that monitors Hacker News for AI startup funding news" "Scrape product prices from 3 e-commerce sites and alert when they drop" "Track new GitHub repos tagged with 'llm' or 'agents' — summarise each one" "Collect Chief of Staff job listings from LinkedIn and Cutshort into Notion" "Monitor a subreddit for posts mentioning my company — classify sentiment" "Scrape new academic papers from arXiv on a topic I care about daily" "Track sports fixture results and keep a running table in Google Sheets" "Build a real estate listing watcher — alert on new properties under ₹1 Cr" Reference Implementation A complete working agent built with this exact architecture would scrape 4+ sources, batch Gemini calls, learn from Applied/Rejected decisions stored in Notion, and run 100% free on GitHub Actions. Follow Steps 1–9 above to build your own.

返回排行榜