## Installation

```bash
npx skills add https://github.com/lingzhi227/agent-research-skills --skill github-research
```
# GitHub Research Skill
## Trigger

Activate this skill when the user wants to:

- "Find repos for [topic]", "GitHub research on [topic]"
- "Analyze open-source code for [topic]"
- "Find implementations of [paper/technique]"
- "Which repos implement [algorithm]?"

Uses the `/github-research` slash command.
## Overview

This skill systematically discovers, evaluates, and deeply analyzes GitHub repositories related to a research topic. It reads `deep-research` output (paper database, phase reports, code references) and produces an actionable integration blueprint for reusing open-source code.

- **Installation**: `~/.claude/skills/github-research/` — scripts, references, and this skill definition.
- **Output**: `./github-research-output/{slug}/` relative to the current working directory.

## Input

A deep-research output directory (containing `paper_db.jsonl`, phase reports, `code_repos.md`, etc.)
## 6-Phase Pipeline

1. **Phase 1: Intake** → Extract refs, URLs, keywords from deep-research output
2. **Phase 2: Discovery** → Multi-source broad GitHub search (50-200 repos)
3. **Phase 3: Filtering** → Score & rank → select top 15-30 repos
4. **Phase 4: Deep Dive** → Clone & deeply analyze top 8-15 repos (code reading)
5. **Phase 5: Analysis** → Per-repo reports + cross-repo comparison
6. **Phase 6: Blueprint** → Integration/reuse plan for research topic
## Output Directory Structure

```
github-research-output/{slug}/
├── repo_db.jsonl                 # Master repo database
├── phase1_intake/
│   ├── extracted_refs.jsonl      # URLs, keywords, paper-repo links
│   └── intake_summary.md
├── phase2_discovery/
│   ├── search_results/           # Raw JSONL from each search
│   └── discovery_log.md
├── phase3_filtering/
│   ├── ranked_repos.jsonl        # Scored & ranked subset
│   └── filtering_report.md
├── phase4_deep_dive/
│   ├── repos/                    # Cloned repos (shallow)
│   ├── analyses/                 # Per-repo analysis .md files
│   └── deep_dive_summary.md
├── phase5_analysis/
│   ├── comparison_matrix.md      # Cross-repo comparison
│   ├── technique_map.md          # Paper concept → code mapping
│   └── analysis_report.md
└── phase6_blueprint/
    ├── integration_plan.md       # How to combine repos
    ├── reuse_catalog.md          # Reusable components catalog
    ├── final_report.md           # Complete compiled report
    └── blueprint_summary.md
```
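The master database is a JSONL file with one repo record per line. The exact schema is defined by `repo_db.py`; the record below is a hypothetical sketch — only `repo_id`, the three score fields, and `tags` are referenced elsewhere in this document, and every other field name is an assumption:

```python
import json

# Hypothetical repo_db.jsonl record. Field names other than repo_id,
# relevance_score, quality_score, composite_score, and tags are assumptions.
record = {
    "repo_id": "owner/name",        # dedup key across all searches
    "stars": 1234,
    "language": "Python",
    "relevance_score": 0.85,        # assigned during Phase 3 LLM scoring
    "quality_score": 0.72,
    "composite_score": 0.78,
    "tags": ["relevance:0.85"],
}
line = json.dumps(record)           # one JSON object per line, no pretty-printing
print(line)
```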
## Scripts Reference

All scripts are Python 3, stdlib-only, located in `~/.claude/skills/github-research/scripts/`.

| Script | Purpose | Key Flags |
| --- | --- | --- |
| `extract_research_refs.py` | Parse deep-research output for GitHub URLs, paper refs, keywords | `--research-dir`, `--output` |
| `search_github.py` | Search GitHub repos via `gh api` | `--query`, `--language`, `--min-stars`, `--sort`, `--max-results`, `--topic`, `--output` |
| `search_github_code.py` | Search GitHub code for implementations | `--query`, `--language`, `--filename`, `--max-results`, `--output` |
| `search_paperswithcode.py` | Search Papers With Code for paper→repo mappings | `--paper-title`, `--arxiv-id`, `--query`, `--output` |
| `repo_db.py` | JSONL repo database management | subcommands: `merge`, `filter`, `score`, `search`, `tag`, `stats`, `export`, `rank` |
| `repo_metadata.py` | Fetch detailed metadata via `gh api` | `--repos`, `--input`, `--output`, `--delay` |
| `clone_repo.py` | Shallow-clone repos for analysis | `--repo`, `--output-dir`, `--depth`, `--branch` |
| `analyze_repo_structure.py` | Map file tree, key files, LOC stats | `--repo-dir`, `--output` |
| `extract_dependencies.py` | Extract and parse dependency files | `--repo-dir`, `--output` |
| `find_implementations.py` | Search cloned repo for specific code patterns | `--repo-dir`, `--patterns`, `--output` |
| `repo_readme_fetch.py` | Fetch README without cloning | `--repos`, `--input`, `--output`, `--max-chars` |
| `compare_repos.py` | Generate comparison matrix across repos | `--input`, `--output` |
| `compile_github_report.py` | Assemble final report from all phases | `--topic-dir` |
## Phase 1: Intake

**Goal**: Extract all relevant references, URLs, and keywords from the deep-research output.

### Steps

1. **Create output directory structure**:

```bash
SLUG=$(echo "$TOPIC" | tr '[:upper:]' '[:lower:]' | tr ' ' '-' | tr -cd 'a-z0-9-')
mkdir -p github-research-output/$SLUG/{phase1_intake,phase2_discovery/search_results,phase3_filtering,phase4_deep_dive/{repos,analyses},phase5_analysis,phase6_blueprint}
```
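The same slug transformation as a Python sketch, mirroring the `tr` pipeline above step for step:

```python
import re

def slugify(topic: str) -> str:
    # Lowercase, spaces to hyphens, then drop anything outside [a-z0-9-],
    # mirroring: tr '[:upper:]' '[:lower:]' | tr ' ' '-' | tr -cd 'a-z0-9-'
    return re.sub(r"[^a-z0-9-]", "", topic.lower().replace(" ", "-"))

print(slugify("Multi-Agent LLM Coordination"))  # multi-agent-llm-coordination
```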
2. **Extract references from deep-research output**:

```bash
python ~/.claude/skills/github-research/scripts/extract_research_refs.py \
  --research-dir <deep-research-output-dir> \
  --output github-research-output/$SLUG/phase1_intake/extracted_refs.jsonl
```

3. **Review extracted refs** — read the generated JSONL. Note:
   - GitHub URLs found directly in reports
   - Paper titles and arxiv IDs (for Papers With Code lookup)
   - Research keywords and themes (for GitHub search queries)

4. **Write intake summary** — create `phase1_intake/intake_summary.md` with:
   - Number of direct GitHub URLs found
   - Number of papers with potential code links
   - Key research themes extracted
   - Planned search queries for Phase 2

### Checkpoint

- `extracted_refs.jsonl` exists with entries
- `intake_summary.md` written
- Search strategy documented
## Phase 2: Discovery

**Goal**: Cast a wide net to find 50-200 candidate repos from multiple sources.

### Steps

1. **Search by direct URLs** — fetch metadata for any GitHub URLs from Phase 1:

```bash
python ~/.claude/skills/github-research/scripts/repo_metadata.py \
  --repos owner1/name1 owner2/name2 ... \
  --output github-research-output/$SLUG/phase2_discovery/search_results/direct_urls.jsonl
```

2. **Search Papers With Code** — for each paper with an arxiv ID:

```bash
python ~/.claude/skills/github-research/scripts/search_paperswithcode.py \
  --arxiv-id 2401.12345 \
  --output github-research-output/$SLUG/phase2_discovery/search_results/pwc_2401.12345.jsonl
```

3. **Search GitHub by keywords** (3-8 queries based on research themes):

```bash
python ~/.claude/skills/github-research/scripts/search_github.py \
  --query "multi-agent LLM coordination" \
  --min-stars 10 --sort stars --max-results 50 \
  --output github-research-output/$SLUG/phase2_discovery/search_results/gh_query1.jsonl
```

4. **Search GitHub code** (for specific implementations):

```bash
python ~/.claude/skills/github-research/scripts/search_github_code.py \
  --query "class MultiAgentOrchestrator" \
  --language python --max-results 30 \
  --output github-research-output/$SLUG/phase2_discovery/search_results/code_query1.jsonl
```

5. **Fetch READMEs** for repos that lack descriptions:

```bash
python ~/.claude/skills/github-research/scripts/repo_readme_fetch.py \
  --input <repos.jsonl> \
  --output github-research-output/$SLUG/phase2_discovery/search_results/readmes.jsonl
```

6. **Merge all results** into the master database:

```bash
python ~/.claude/skills/github-research/scripts/repo_db.py merge \
  --inputs github-research-output/$SLUG/phase2_discovery/search_results/*.jsonl \
  --output github-research-output/$SLUG/repo_db.jsonl
```

7. **Write discovery log** — create `phase2_discovery/discovery_log.md` with the search queries used, results per source, and total unique repos found.
### Rate Limits

- GitHub search API: 30 requests/minute (authenticated)
- Papers With Code API: no strict limit, but be respectful (1 req/sec)
- Add `--delay 1.0` to batch operations when needed
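The 30 req/min budget works out to one request every 2 seconds. A minimal throttling sketch (the `do_search` callable is a placeholder for any of the search scripts above):

```python
import time

SEARCH_LIMIT_PER_MIN = 30               # authenticated GitHub search API
DELAY = 60.0 / SEARCH_LIMIT_PER_MIN     # 2.0 seconds between requests

def run_searches(queries, do_search, delay=DELAY):
    """Run searches sequentially, sleeping between calls to stay under the limit."""
    results = []
    for query in queries:
        results.append(do_search(query))
        time.sleep(delay)
    return results
```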
### Checkpoint

- `repo_db.jsonl` populated with 50-200 repos
- `discovery_log.md` with search details
## Phase 3: Filtering

**Goal**: Score and rank repos, then select the top 15-30 for deeper analysis.

### Steps

1. **Enrich metadata** for all repos:

```bash
python ~/.claude/skills/github-research/scripts/repo_metadata.py \
  --input github-research-output/$SLUG/repo_db.jsonl \
  --output github-research-output/$SLUG/repo_db.jsonl \
  --delay 0.5
```

2. **Score repos** (quality + activity scores):

```bash
python ~/.claude/skills/github-research/scripts/repo_db.py score \
  --input github-research-output/$SLUG/repo_db.jsonl \
  --output github-research-output/$SLUG/repo_db.jsonl
```

3. **LLM relevance scoring** — read through the top ~50 repos (by quality_score) and assign a `relevance_score` (0.0-1.0) based on:
   - Direct relevance to the research topic
   - Implementation completeness
   - Code quality signals (from README, description)

   Update the relevance scores:

```bash
python ~/.claude/skills/github-research/scripts/repo_db.py tag \
  --input github-research-output/$SLUG/repo_db.jsonl \
  --ids owner/name --tags "relevance:0.85"
```
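Assuming the `relevance:<value>` tag format shown above, the score can later be recovered with a small helper like this (hypothetical — `repo_db.py` presumably does its own parsing internally):

```python
def relevance_from_tags(tags):
    """Extract a relevance score from tags like 'relevance:0.85'; None if absent."""
    for tag in tags:
        if tag.startswith("relevance:"):
            return float(tag.split(":", 1)[1])
    return None
```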
4. **Compute composite scores and rank**:

```bash
python ~/.claude/skills/github-research/scripts/repo_db.py score \
  --input github-research-output/$SLUG/repo_db.jsonl \
  --output github-research-output/$SLUG/repo_db.jsonl
python ~/.claude/skills/github-research/scripts/repo_db.py rank \
  --input github-research-output/$SLUG/repo_db.jsonl \
  --output github-research-output/$SLUG/phase3_filtering/ranked_repos.jsonl \
  --by composite_score
```

5. **Select top repos** — filter to the top 15-30:

```bash
python ~/.claude/skills/github-research/scripts/repo_db.py filter \
  --input github-research-output/$SLUG/phase3_filtering/ranked_repos.jsonl \
  --output github-research-output/$SLUG/phase3_filtering/ranked_repos.jsonl \
  --max-repos 30 --not-archived
```

6. **Write filtering report** — create `phase3_filtering/filtering_report.md`:
   - Stats before/after filtering
   - Score distributions
   - Top 30 repos with scores and rationale
### Scoring Formula

```
activity_score  = sigmoid((days_since_push < 90) * 0.4 + has_recent_commits * 0.3 + open_issues_ratio * 0.3)
quality_score   = normalize(log(stars+1) * 0.3 + log(forks+1) * 0.2 + has_license * 0.15 + has_readme * 0.15 + not_archived * 0.2)
composite_score = relevance * 0.4 + quality * 0.35 + activity * 0.25
```
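As a sketch, the composite step of the formula above in plain Python (`normalize` and the activity/quality feature extraction are left abstract — only `sigmoid` and the documented weights are shown):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def composite_score(relevance: float, quality: float, activity: float) -> float:
    # Weights as documented: relevance 0.40, quality 0.35, activity 0.25
    return relevance * 0.4 + quality * 0.35 + activity * 0.25
```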
### Checkpoint

- `ranked_repos.jsonl` with 15-30 repos
- `filtering_report.md` with scoring details
## Phase 4: Deep Dive

**Goal**: Clone and deeply analyze the top 8-15 repos.

### Steps

1. **Select repos for deep dive** — take the top 8-15 from the ranked list.

2. **Clone each repo** (shallow):

```bash
python ~/.claude/skills/github-research/scripts/clone_repo.py \
  --repo owner/name \
  --output-dir github-research-output/$SLUG/phase4_deep_dive/repos/
```

3. **Analyze structure** for each cloned repo:

```bash
python ~/.claude/skills/github-research/scripts/analyze_repo_structure.py \
  --repo-dir github-research-output/$SLUG/phase4_deep_dive/repos/name/ \
  --output github-research-output/$SLUG/phase4_deep_dive/analyses/name_structure.json
```

4. **Extract dependencies**:

```bash
python ~/.claude/skills/github-research/scripts/extract_dependencies.py \
  --repo-dir github-research-output/$SLUG/phase4_deep_dive/repos/name/ \
  --output github-research-output/$SLUG/phase4_deep_dive/analyses/name_deps.json
```

5. **Find implementations** — search for key algorithms/concepts from the research:

```bash
python ~/.claude/skills/github-research/scripts/find_implementations.py \
  --repo-dir github-research-output/$SLUG/phase4_deep_dive/repos/name/ \
  --patterns "class Transformer" "def forward" "attention" \
  --output github-research-output/$SLUG/phase4_deep_dive/analyses/name_impls.jsonl
```

6. **Deep code reading** — for each repo, READ the key source files identified by structure analysis. Write a per-repo analysis in `phase4_deep_dive/analyses/{name}_analysis.md`:
   - Architecture overview
   - Key algorithms implemented
   - Code quality assessment
   - API / interface design
   - Dependencies and requirements
   - Strengths and limitations
   - Reusability assessment (how easy it is to extract components)

7. **Write deep dive summary**: `phase4_deep_dive/deep_dive_summary.md`

### IMPORTANT: Actually Read Code

Do NOT just summarize READMEs. You must:

- Read the main source files (entry points, core modules)
- Understand the actual implementation approach
- Identify specific functions/classes that implement research concepts
- Note code patterns, design decisions, and trade-offs

### Checkpoint

- Repos cloned in `repos/`
- Per-repo analysis files in `analyses/`
- `deep_dive_summary.md` written
## Phase 5: Analysis

**Goal**: Cross-repo comparison and technique-to-code mapping.

### Steps

1. **Generate comparison data**:

```bash
python ~/.claude/skills/github-research/scripts/compare_repos.py \
  --input github-research-output/$SLUG/phase4_deep_dive/analyses/ \
  --output github-research-output/$SLUG/phase5_analysis/comparison.json
```

2. **Write comparison matrix** — create `phase5_analysis/comparison_matrix.md`:
   - Table comparing repos across dimensions (language, LOC, stars, framework, license, tests)
   - Dependency overlap analysis
   - Strengths/weaknesses per repo

3. **Write technique map** — create `phase5_analysis/technique_map.md`:
   - Map each paper concept / research technique → specific repo + file + function
   - Identify gaps (techniques with no implementation found)
   - Note alternative implementations of the same concept

4. **Write analysis report** — create `phase5_analysis/analysis_report.md`:
   - Executive summary of findings
   - Key insights from code analysis
   - Recommendations for which repos to use for which purposes

### Checkpoint

- `comparison_matrix.md` with repo comparison table
- `technique_map.md` mapping concepts to code
- `analysis_report.md` with findings
## Phase 6: Blueprint

**Goal**: Produce an actionable integration and reuse plan.

### Steps

1. **Write integration plan** — create `phase6_blueprint/integration_plan.md`:
   - Recommended architecture for combining repos
   - Step-by-step integration approach
   - Dependency resolution strategy
   - Potential conflicts and how to resolve them

2. **Write reuse catalog** — create `phase6_blueprint/reuse_catalog.md`:
   - For each reusable component: source repo, file path, function/class, what it does, how to extract it
   - License compatibility matrix
   - Effort estimates (easy/medium/hard to integrate)

3. **Compile final report**:

```bash
python ~/.claude/skills/github-research/scripts/compile_github_report.py \
  --topic-dir github-research-output/$SLUG/
```

4. **Write blueprint summary** — create `phase6_blueprint/blueprint_summary.md`:
   - One-page executive summary
   - Top 5 repos and why
   - Recommended next steps

### Checkpoint

- `integration_plan.md` complete
- `reuse_catalog.md` with component catalog
- `final_report.md` compiled
- `blueprint_summary.md` as executive summary
## Quality Conventions

- **Repos are ranked by composite score**: relevance × 0.4 + quality × 0.35 + activity × 0.25
- **Deep dive requires reading actual code**, not just READMEs
- **Integration blueprint** must map paper concepts → specific code files/functions
- **Incremental saves** — each phase writes to disk immediately
- **Checkpoint recovery** — can resume from any phase by checking which outputs exist
- **All scripts are stdlib-only Python** — no pip installs needed
- **`gh` CLI is required** for GitHub API access (must be authenticated)
- **Deduplication** by `repo_id` (owner/name) across all searches
- **Rate limit awareness** — respect GitHub search API limits (30 req/min)
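Checkpoint recovery can be sketched by probing for each phase's key artifact. The paths follow the output layout documented above; the choice of which file marks a phase "done" is an assumption:

```python
from pathlib import Path

# One marker artifact per phase (paths from the Output Directory Structure)
PHASE_MARKERS = [
    ("phase1", "phase1_intake/extracted_refs.jsonl"),
    ("phase2", "phase2_discovery/discovery_log.md"),
    ("phase3", "phase3_filtering/ranked_repos.jsonl"),
    ("phase4", "phase4_deep_dive/deep_dive_summary.md"),
    ("phase5", "phase5_analysis/analysis_report.md"),
    ("phase6", "phase6_blueprint/final_report.md"),
]

def next_phase(topic_dir):
    """Return the first phase whose marker output is missing, or None if all done."""
    for phase, marker in PHASE_MARKERS:
        if not (Path(topic_dir) / marker).exists():
            return phase
    return None
```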
## Error Handling

- If `gh` is not installed: warn the user and provide installation instructions
- If a repo is archived/deleted: skip gracefully, note in log
- If a clone fails: skip it, note in log, continue with remaining repos
- If the Papers With Code API is down: skip it and rely on GitHub search only
- Always write partial progress to disk so work is not lost
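A minimal preflight sketch for the `gh` availability check (the authentication check via `gh auth status` is noted in a comment but omitted so the function stays side-effect free and testable):

```python
import shutil

def check_gh(which=shutil.which):
    """Return an error message if the gh CLI is missing, else None.

    Callers should also verify authentication, e.g. by running `gh auth status`.
    """
    if which("gh") is None:
        return ("gh CLI not found. Install it from https://cli.github.com/ "
                "and authenticate with `gh auth login`.")
    return None
```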
## References

- See `references/phase-guide.md` for detailed phase execution guidance
- Deep-research skill: `~/.claude/skills/deep-research/SKILL.md`
- Paper database pattern: `~/.claude/skills/deep-research/scripts/paper_db.py`