article-extractor

Installs: 39
Rank: #18456

Install

npx skills add https://github.com/michalparkola/tapestry-skills-for-claude-code --skill article-extractor

# Article Extractor

This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter. Saves clean, readable text.

## When to Use This Skill

Activate when the user:

- Provides an article/blog URL and wants the text content
- Asks to "download this article"
- Wants to "extract the content from [URL]"
- Asks to "save this blog post as text"
- Needs clean article text without distractions

## How It Works

Priority order:

1. Check if tools are installed (reader or trafilatura)
2. Download and extract the article using the best available tool
3. Clean up the content (remove extra whitespace, format properly)
4. Save to a file with the article title as the filename
5. Confirm the location and show a preview

## Installation Check

Check for article extraction tools in this order:

### Option 1: reader (Recommended - Mozilla's Readability)

```bash
command -v reader
```

If not installed:

```bash
npm install -g @mozilla/readability-cli
```

or

```bash
npm install -g reader-cli
```

### Option 2: trafilatura (Python-based, very good)

```bash
command -v trafilatura
```

If not installed:

```bash
pip3 install trafilatura
```

### Option 3: Fallback (curl + simple parsing)

If neither tool is available, use basic curl plus text extraction. This is less reliable, but it works without extra dependencies.
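The check order above can be sketched in Python using the standard library's `shutil.which`; the helper name `pick_tool` is illustrative, not part of any tool:

```python
import shutil

def pick_tool():
    """Return the best available extractor, following the skill's priority order."""
    for tool in ("reader", "trafilatura"):
        if shutil.which(tool):  # equivalent of `command -v <tool>`
            return tool
    return "fallback"

print(pick_tool())
```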

## Extraction Methods

### Method 1: Using reader (Best for most articles)

```bash
# Extract article
reader "URL" > article.txt
```

Pros:

- Based on Mozilla's Readability algorithm
- Excellent at removing clutter
- Preserves article structure

### Method 2: Using trafilatura (Best for blogs/news)

```bash
# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
```

Pros:

- Very accurate extraction
- Good with various site structures
- Handles multiple languages

Options:

- `--no-comments`: skip comment sections
- `--no-tables`: skip data tables
- `--precision`: favor precision over recall
- `--recall`: extract more content (may include some noise)

### Method 3: Fallback (curl + basic parsing)

```bash
# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.current_tag = None

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
                self.in_content = True
        self.current_tag = tag

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt
```

Note: This is less reliable but works without dependencies.
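To sanity-check the fallback parser without a network call, you can feed it an inline HTML string; the sample markup below is made up for illustration:

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Same parser as in the fallback above, trimmed for a self-contained demo."""
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}

    def handle_starttag(self, tag, attrs):
        # Note: in_content is never reset, so anything after the first content
        # tag is kept. That is a known limitation of this simple fallback.
        if tag not in self.skip_tags and tag in {'p', 'article', 'main', 'h1', 'h2', 'h3'}:
            self.in_content = True

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

html = "<html><nav>Home | About</nav><article><h1>A Title</h1><p>First paragraph.</p></article></html>"
parser = ArticleExtractor()
parser.feed(html)
print(parser.get_content())  # the nav text is dropped; title and paragraph remain
```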

## Getting Article Title

Extract title for filename:

Using reader:

```bash
# reader outputs markdown with the title at the top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
```

Using trafilatura:

```bash
# Get metadata, including the title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")
```
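The same title logic can be sketched in stdlib Python, mirroring the curl fallback's grep/sed pipeline; `title_from_html` is a hypothetical helper, not part of any tool:

```python
import re

def title_from_html(html):
    """Pull the <title> text and strip a trailing ' - Site' / ' | Site' suffix."""
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if not m:
        return "Article"
    title = m.group(1).strip()
    # Drop a site-name suffix, as the sed commands do
    return re.split(r"\s+[-|]\s+", title)[0]

print(title_from_html("<head><title>My Post - Example Blog</title></head>"))  # prints: My Post
```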

Using curl (fallback):

```bash
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
```

## Filename Creation

Clean the title for the filesystem:

```bash
# Get title
TITLE="Article Title from Website"

# Clean for filesystem (remove special chars, limit length)
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
```

## Complete Workflow

```bash
ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt

        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;
    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")

        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;
    fallback)
        # Get title
        TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}   # Remove site name
        TITLE=${TITLE%% | *}   # Remove site name (alternate)

        # Get content (basic extraction)
        curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main'}:
                self.in_content = True
            if tag in {'h1', 'h2', 'h3'}:
                self.content.append('\n')

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac

# Clean filename
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME}.txt"

# Move to final filename
mv temp_article.txt "$FILENAME"

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"
```

## Error Handling

### Common Issues

1. Tool not installed
   - Try the next tool in priority order (reader → trafilatura → fallback)
   - Offer to install: "Install reader with: npm install -g reader-cli"

2. Paywall or login required
   - Extraction tools may fail
   - Inform the user: "This article requires authentication. Cannot extract."

3. Invalid URL
   - Check the URL format
   - Try with and without redirects

4. No content extracted
   - The site may rely heavily on JavaScript
   - Try the fallback method
   - Inform the user if extraction fails

5. Special characters in the title
   - Clean the title for the filesystem
   - Remove: /, :, ?, ", <, >, |
   - Replace with - or remove

## Output Format

Saved file contains:

- Article title (if available)
- Author (if available from the tool)
- Main article text
- Section headings
- No navigation, ads, or clutter

What gets removed:

- Navigation menus
- Ads and promotional content
- Newsletter signup forms
- Related-articles sidebars
- Comment sections (optional)
- Social media buttons
- Cookie notices

## Tips for Best Results

1. Use reader for most articles
   - Best all-around tool
   - Based on Firefox Reader View
   - Works on most news sites and blogs

2. Use trafilatura for:
   - Academic articles
   - News sites
   - Blogs with complex layouts
   - Non-English content

3. Fallback method limitations:
   - May include some noise
   - Less accurate paragraph detection
   - Better than nothing for simple sites

4. Check extraction quality:
   - Always show a preview to the user
   - Ask if it looks correct
   - Offer to try a different tool if needed

## Example Usage

Simple extraction:

```bash
# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"
```

With error handling:

```bash
if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
```

## Best Practices

- ✅ Always show a preview after extraction (first 10 lines)
- ✅ Verify extraction succeeded before saving
- ✅ Clean the filename for filesystem compatibility
- ✅ Try the fallback method if the primary fails
- ✅ Inform the user which tool was used
- ✅ Keep the filename length reasonable (< 100 chars)

## After Extraction

Display to the user:

- "✓ Extracted: [Article Title]"
- "✓ Saved to: [filename]"
- A preview (first 10-15 lines)
- File size and location

Ask if needed:

- "Would you like me to also create a Ship-Learn-Next plan from this?" (if using the ship-learn-next skill)
- "Should I extract another article?"
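The filename-cleaning rules described above (strip /, :, ?, ", <, >, |; cap the length) can also be sketched in Python; `safe_filename` is an illustrative name, not part of the skill:

```python
import re

def safe_filename(title, max_len=100):
    """Make a title safe for the filesystem, per the rules above."""
    name = title.replace('/', '-').replace(':', '-').replace('|', '-')
    name = re.sub(r'[?"<>]', '', name)   # drop the remaining special chars
    name = name[:max_len].strip()        # limit length, trim stray spaces
    return f"{name}.txt"

print(safe_filename('Why CSS: The Good Parts?'))  # prints: Why CSS- The Good Parts.txt
```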