tooluniverse-sequence-retrieval

安装量: 1.3K
排名: #1096

安装

npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-sequence-retrieval
Biological Sequence Retrieval
Retrieve DNA, RNA, and protein sequences with proper disambiguation and cross-database handling.
IMPORTANT
Always use English terms in tool calls (gene names, organism names, sequence descriptions), even if the user writes in another language. Only try original-language terms as a fallback if English returns no results. Respond in the user's language. Workflow Overview Phase 0: Clarify (if needed) ↓ Phase 1: Disambiguate Gene/Organism ↓ Phase 2: Search & Retrieve (Internal) ↓ Phase 3: Report Sequence Profile Phase 0: Clarification (When Needed) Ask the user ONLY if: Gene name exists in multiple organisms (e.g., "BRCA1" → human or mouse?) Sequence type unclear (mRNA, genomic, protein?) Strain/isolate matters (e.g., E. coli → K-12, O157:H7, etc.) Skip clarification for: Specific accession numbers (NC_ , NM_ , U*, etc.) Clear organism + gene combinations Complete genome requests with organism specified Phase 1: Gene/Organism Disambiguation 1.1 Resolve Identifiers from tooluniverse import ToolUniverse tu = ToolUniverse ( ) tu . load_tools ( )

Strategy depends on input type

if user_provided_accession :

Direct retrieval based on accession type

accession

user_provided_accession elif user_provided_gene_and_organism :

Search NCBI Nucleotide

result

tu
.
tools
.
NCBI_search_nucleotide
(
operation
=
"search"
,
organism
=
organism
,
gene
=
gene
,
limit
=
10
)
1.2 Accession Type Decision Tree
CRITICAL
Accession prefix determines which tools to use. Prefix Type Use With NC_ RefSeq chromosome NCBI only NM_ RefSeq mRNA NCBI only NR_ RefSeq ncRNA NCBI only NP_ RefSeq protein NCBI only XM_ RefSeq predicted mRNA NCBI only U, M, K, X GenBank NCBI or ENA CP, NZ_* GenBank genome NCBI or ENA EMBL format EMBL ENA preferred 1.3 Identity Resolution Checklist Organism confirmed (scientific name) Gene symbol/name identified Sequence type determined (genomic/mRNA/protein) Strain specified (if relevant) Accession prefix identified → tool selection Phase 2: Data Retrieval (Internal) Retrieve silently. Do NOT narrate the search process. 2.1 Search for Sequences

Search NCBI Nucleotide

result

tu . tools . NCBI_search_nucleotide ( operation = "search" , organism = organism , gene = gene , strain = strain ,

Optional

keywords

keywords ,

Optional

seq_type

seq_type ,

complete_genome, mrna, refseq

limit

10 )

Get accession numbers from UIDs

accessions

tu . tools . NCBI_fetch_accessions ( operation = "fetch_accession" , uids = result [ "data" ] [ "uids" ] ) 2.2 Retrieve Sequence Data

Get sequence in desired format

sequence

tu . tools . NCBI_get_sequence ( operation = "fetch_sequence" , accession = accession , format = "fasta"

or "genbank"

)

GenBank format for annotations

annotations

tu . tools . NCBI_get_sequence ( operation = "fetch_sequence" , accession = accession , format = "genbank" ) 2.3 ENA Alternative (for GenBank/EMBL accessions)

Only for non-RefSeq accessions!

if not accession . startswith ( ( "NC_" , "NM_" , "NR_" , "NP_" , "XM_" , "XR_" ) ) :

ENA entry info

entry

tu . tools . ena_get_entry ( accession = accession )

ENA FASTA

fasta

tu . tools . ena_get_sequence_fasta ( accession = accession )

ENA summary

summary

tu
.
tools
.
ena_get_entry_summary
(
accession
=
accession
)
Fallback Chains
Primary
Fallback
Notes
NCBI_get_sequence
ENA (if GenBank format)
NCBI unavailable
ENA_get_entry
NCBI_get_sequence
ENA doesn't have RefSeq
NCBI_search_nucleotide
Try broader keywords
No results
Critical Rule
Never try ENA tools with RefSeq accessions (NC_, NM_, etc.) - they will return 404 errors. Phase 3: Report Sequence Profile Output Structure Present as a Sequence Profile Report . Hide search process.

Sequence Profile: [Gene/Organism] ** Search Summary ** - Query: [gene] in [organism] - Database: NCBI Nucleotide - Results: [N] sequences found


Primary Sequence

| Attribute | Value | |


|

|
|
**
Accession
**
|
accession
|
|
**
Type
**
|
RefSeq / GenBank
|
|
**
Organism
**
|
[scientific name]
|
|
**
Strain
**
|
[strain if applicable]
|
|
**
Length
**
|
[X,XXX bp / aa]
|
|
**
Molecule
**
|
DNA / mRNA / Protein
|
|
**
Topology
**
|
Linear / Circular
|
**
Curation Level
**
●●● RefSeq (curated) / ●●○ GenBank (submitted) / ●○○ Third-party

Sequence Statistics | Statistic | Value | |


|

| | ** Length ** | [X,XXX] bp | | ** GC Content ** | [XX.X]% | | ** Genes ** | [N] (if genome) | | ** CDS ** | [N] (if annotated) |

Sequence Preview ```fasta

[ accession ] [ definition ] ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA ... [truncated, full sequence in download] Annotations Summary (from GenBank format) Feature Count Examples CDS [N] [gene names] tRNA [N] - rRNA [N] 16S, 23S Regulatory [N] promoters Alternative Sequences Ranked by relevance and curation level: Accession Type Length Description ENA Compatible NC_000913.3 RefSeq 4.6 Mb E. coli K-12 reference ✗ U00096.3 GenBank 4.6 Mb E. coli K-12 ✓ CP001509.3 GenBank 4.6 Mb E. coli DH10B ✓ Cross-Database References Database Accession Link RefSeq [NC_] [NCBI link] GenBank [U] [NCBI link] ENA/EMBL [same as GenBank] [ENA link] BioProject [PRJNA] [link] BioSample [SAMN] [link] Download Options Formats Available Format Description Use Case FASTA Sequence only BLAST, alignment GenBank Sequence + annotations Gene analysis GFF3 Annotations only Genome browsers Direct Commands

FASTA format

tu . tools . NCBI_get_sequence ( operation = "fetch_sequence" , accession = "accession" , format = "fasta" )

GenBank format (with annotations)

tu . tools . NCBI_get_sequence ( operation = "fetch_sequence" , accession = "accession" , format = "genbank" ) Related Sequences Other Strains/Isolates Accession Strain Similarity Notes [acc1] [strain1] 99.9% [notes] [acc2] [strain2] 99.5% [notes] Protein Products (if applicable) Protein Accession Product Name Length [NP_*] [protein name] [X] aa Retrieved: [date] Database: NCBI Nucleotide


Curation Level Tiers

Tier Symbol Accession Prefix Description
RefSeq Reference ●●●● NC_, NM_, NP_ NCBI-curated, gold standard
RefSeq Predicted ●●●○ XM_, XP_, XR_ Computationally predicted
GenBank Validated ●●○○ Various Submitted, some curation
GenBank Direct ●○○○ Various Direct submission
Third Party ○○○○ TPA_ Third-party annotation
Include in report:
```markdown
Curation Level: ●●●● RefSeq Reference
- Curated by NCBI RefSeq project
- Regular updates and validation
- Recommended for reference use
Completeness Checklist
Every sequence report MUST include:
Per Sequence (Required)
Accession number
Organism (scientific name)
Sequence type (DNA/RNA/protein)
Length
Curation level
Database source
Search Summary (Required)
Query parameters
Number of results
Ranking rationale
Include Even If Limited
Alternative sequences (or "Only one sequence found")
Cross-database references (or "No cross-references available")
Download instructions
Common Use Cases
Reference Genome
User: "Get E. coli K-12 complete genome"
result
=
tu
.
tools
.
NCBI_search_nucleotide
(
operation
=
"search"
,
organism
=
"Escherichia coli"
,
strain
=
"K-12"
,
seq_type
=
"complete_genome"
,
limit
=
3
)
# Return NC_000913.3 (RefSeq reference)
Gene Sequence
User: "Find human BRCA1 mRNA"
result
=
tu
.
tools
.
NCBI_search_nucleotide
(
operation
=
"search"
,
organism
=
"Homo sapiens"
,
gene
=
"BRCA1"
,
seq_type
=
"mrna"
,
limit
=
10
)
Specific Accession
User: "Get sequence for NC_045512.2"
→ Direct retrieval with full metadata
Strain Comparison
User: "Compare E. coli K-12 and O157:H7 genomes"
→ Search both strains, provide comparison table
Error Handling
Error
Response
"No search criteria provided"
Add organism, gene, or keywords
"ENA 404 error"
Accession is likely RefSeq → use NCBI only
"No results found"
Broaden search, check spelling, try synonyms
"Sequence too large"
Note size, provide download link instead of preview
"API rate limit"
Tools auto-retry; if persistent, wait briefly
Tool Reference
NCBI Tools (All Accessions)
Tool
Purpose
NCBI_search_nucleotide
Search by gene/organism
NCBI_fetch_accessions
Convert UIDs to accessions
NCBI_get_sequence
Retrieve sequence data
ENA Tools (GenBank/EMBL Only)
Tool
Purpose
ena_get_entry
Entry metadata
ena_get_sequence_fasta
FASTA sequence
ena_get_entry_summary
Summary info
Search Parameters Reference
NCBI_search_nucleotide
Parameter
Description
Example
operation
Always "search"
"search"
organism
Scientific name
"Homo sapiens"
gene
Gene symbol
"BRCA1"
strain
Specific strain
"K-12"
keywords
Free text
"complete genome"
seq_type
Sequence type
"complete_genome", "mrna", "refseq"
limit
Max results
10
NCBI_get_sequence
Parameter
Description
Example
operation
Always "fetch_sequence"
"fetch_sequence"
accession
Accession number
"NC_000913.3"
format
Output format
"fasta", "genbank"
返回排行榜