ai-tech-fulltext-fetch

Installs: 35
Rank: #19666

## Install

```bash
npx skills add https://github.com/tiangong-ai/skills --skill ai-tech-fulltext-fetch
```

# AI Tech Fulltext Fetch

## Core Goal

- Reuse the same SQLite database populated by `ai-tech-rss-fetch`.
- Fetch article body text from each RSS entry URL.
- Persist extraction status and text in a companion table (`entry_content`).
- Support incremental runs and safe retries without creating duplicate fulltext rows.

## Triggering Conditions

- A request to fetch article body/full text for entries already in `ai_rss.db`.
- A request to build a second-stage pipeline after the RSS metadata sync.
- A need for a stable, resumable queue over existing `entries` rows.
- A need for URL-based fulltext persistence before chunking, indexing, or summarization.

## Workflow

1. Ensure the metadata table exists first. Run `ai-tech-rss-fetch` and populate `entries` in SQLite before using this skill; this skill requires the `entries` table to exist. In multi-agent runtimes, pin the DB to the same absolute path used by `ai-tech-rss-fetch`:

   ```bash
   export AI_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/ai_rss.db"
   ```

2. Initialize the fulltext table:

   ```bash
   python3 scripts/fulltext_fetch.py init-db --db "$AI_RSS_DB_PATH"
   ```

3. Run an incremental fulltext sync. By default this fetches rows that are missing full text or are currently failed:

   ```bash
   python3 scripts/fulltext_fetch.py sync \
     --db "$AI_RSS_DB_PATH" \
     --limit 50 \
     --timeout 20 \
     --min-chars 300
   ```

4. Fetch one entry on demand:

   ```bash
   python3 scripts/fulltext_fetch.py fetch-entry \
     --db "$AI_RSS_DB_PATH" \
     --entry-id 1234
   ```

5. Inspect extracted content state:

   ```bash
   python3 scripts/fulltext_fetch.py list-content \
     --db "$AI_RSS_DB_PATH" \
     --status ready \
     --limit 100
   ```

## Data Contract

Reads from the existing `entries` table: `id`, `canonical_url`, `url`, `title`.

Writes to the `entry_content` table:

- `entry_id` (unique, one row per entry)
- `source_url`, `final_url`, `http_status`
- `extractor` (`trafilatura`, `html-parser`, or `none`)
- `content_text`, `content_hash`, `content_length`
- `status` (`ready` or `failed`)
- `retry_count`, `last_error`, timestamps

## Extraction and Update Rules

URL source priority: `canonical_url` first, falling back to `url`.
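The URL source priority rule above can be sketched as a small helper. This is an illustrative sketch, not the skill's actual code: `pick_source_url` is a hypothetical name, and the in-memory table only mirrors the relevant `entries` columns from the data contract.

```python
import sqlite3

def pick_source_url(row):
    """Prefer canonical_url; fall back to url when it is NULL or empty."""
    return row["canonical_url"] or row["url"]

# Minimal demo against an in-memory table shaped like the entries columns this skill reads.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows name-based column access
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, canonical_url TEXT, url TEXT, title TEXT)"
)
conn.execute("INSERT INTO entries VALUES (1, NULL, 'https://example.com/a', 'A')")
conn.execute("INSERT INTO entries VALUES (2, 'https://example.com/canon', 'https://example.com/b', 'B')")

for row in conn.execute("SELECT * FROM entries ORDER BY id"):
    print(row["id"], pick_source_url(row))
# prints:
# 1 https://example.com/a
# 2 https://example.com/canon
```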
- Attempt `trafilatura` extraction when the dependency is available; fall back to the built-in HTML parser.
- Upsert by `entry_id`:
  - Success: write/update the full text and reset `retry_count` to `0`.
  - Failure with existing `ready` content: keep the old text, keep status `ready`, record `last_error`.
  - Failure without `ready` content: status becomes `failed`, `retry_count` is incremented, and `next_retry_at` is set.
- Failed retries are capped by `--max-retries` (default `3`) and paced by `--retry-backoff-minutes`.
- `--force` allows refetching already-`ready` rows.
- `--refetch-days N` allows refreshing rows older than N days.

## Configurable Parameters

- `--db` (or `AI_RSS_DB_PATH`; an absolute path is recommended in multi-agent runtimes)
- `--limit`
- `--force`
- `--only-failed`
- `--refetch-days`
- `--oldest-first`
- `--timeout`
- `--max-bytes`
- `--min-chars`
- `--max-retries`
- `--retry-backoff-minutes`
- `--user-agent`
- `--disable-trafilatura`
- `--fail-on-errors`

## Error Handling

- Missing `entries` table: return an actionable error and stop.
- Network/HTTP/parse errors: store the failure state and continue processing other entries.
- Non-text content types (PDF/image/audio/video/zip): mark that entry as `failed`.
- Extraction shorter than `--min-chars`: treat as a failure to avoid low-quality body text.

## References

- `references/schema.md`
- `references/fetch-rules.md`

## Assets

- `assets/config.example.json`

## Scripts

- `scripts/fulltext_fetch.py`
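The upsert-by-`entry_id` success path described under the update rules can be sketched with SQLite's `ON CONFLICT` clause. This is a minimal sketch under assumptions, not the skill's actual implementation: the column list is trimmed to a subset of the data contract, `upsert_success` is a hypothetical helper, timestamps are epoch floats, and SHA-256 is assumed for `content_hash`.

```python
import hashlib
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE entry_content (
    entry_id       INTEGER PRIMARY KEY,   -- unique: one row per entry
    source_url     TEXT,
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    status         TEXT,                  -- 'ready' or 'failed'
    retry_count    INTEGER DEFAULT 0,
    updated_at     REAL
)""")

def upsert_success(entry_id, source_url, text):
    """Success path: insert or refresh the full text and reset retry_count to 0."""
    conn.execute(
        """INSERT INTO entry_content
               (entry_id, source_url, content_text, content_hash,
                content_length, status, retry_count, updated_at)
           VALUES (?, ?, ?, ?, ?, 'ready', 0, ?)
           ON CONFLICT(entry_id) DO UPDATE SET
               source_url     = excluded.source_url,
               content_text   = excluded.content_text,
               content_hash   = excluded.content_hash,
               content_length = excluded.content_length,
               status         = 'ready',
               retry_count    = 0,
               updated_at     = excluded.updated_at""",
        (entry_id, source_url, text,
         hashlib.sha256(text.encode()).hexdigest(), len(text), time.time()))
    conn.commit()

# Fetching the same entry twice updates in place instead of duplicating the row.
upsert_success(1234, "https://example.com/post", "article body ...")
upsert_success(1234, "https://example.com/post", "article body, refreshed")
print(conn.execute("SELECT COUNT(*), status FROM entry_content").fetchone())
# prints: (1, 'ready')
```

Keying the table on `entry_id` is what makes incremental runs and retries safe: a re-fetch can only overwrite its own row, never add a duplicate.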
