gget

Install

npx skills add https://github.com/davila7/claude-code-templates --skill gget

gget Overview

gget is a command-line bioinformatics tool and Python package providing unified access to 20+ genomic databases and analysis methods. Look up gene information, run sequence analyses, and retrieve protein structures, expression data, and disease associations through a consistent interface. All gget modules work both as command-line tools and as Python functions.

Important: The databases queried by gget are continuously updated, which sometimes changes their structure. gget modules are tested automatically on a biweekly basis and updated to match new database structures when necessary.

Installation

Install gget in a clean virtual environment to avoid conflicts:

Using uv (recommended)

uv pip install gget

Or using pip

pip install --upgrade gget

In Python/Jupyter

import gget

Quick Start

Basic usage pattern for all modules:

Command-line

gget <module> [arguments] [options]

Python

gget.module(arguments, options)

Most modules return:

Command-line: JSON (default) or CSV with -csv flag
Python: DataFrame or dictionary
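Because the command line prints JSON by default, a script can shell out to gget and parse the result. The following is a minimal sketch of that pattern, not part of gget itself: `build_gget_command` and `run_gget_json` are hypothetical helper names, and the sketch assumes that the JSON arrives on standard output, that gget is on the PATH, and that the chosen module accepts the common flags (-csv, -q, -o) used here.

```python
import json
import subprocess
from typing import List, Optional, Sequence


def build_gget_command(
    module: str,
    args: Sequence[str],
    csv: bool = False,
    quiet: bool = False,
    out: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list such as ['gget', 'search', '-s', 'human', 'gaba']."""
    cmd = ["gget", module, *args]
    if csv:
        cmd.append("-csv")        # CSV instead of the default JSON
    if quiet:
        cmd.append("-q")          # suppress progress information
    if out is not None:
        cmd.extend(["-o", out])   # save results to a file
    return cmd


def run_gget_json(module: str, args: Sequence[str]):
    """Run gget and parse stdout as JSON (assumption: stdout holds only JSON)."""
    cmd = build_gget_command(module, args, quiet=True)
    completed = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with search terms that contain spaces.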

Common flags across modules:

-o/--out: Save results to file
-q/--quiet: Suppress progress information
-csv: Return CSV format (command-line only)

Module Categories

1. Reference & Gene Information

gget ref - Reference Genome Downloads

Retrieve download links and metadata for Ensembl reference genomes.

Parameters:

species: Genus_species format (e.g., 'homo_sapiens', 'mus_musculus'). Shortcuts: 'human', 'mouse'
-w/--which: Specify return types (gtf, cdna, dna, cds, cdrna, pep). Default: all
-r/--release: Ensembl release number (default: latest)
-l/--list_species: List available vertebrate species
-liv/--list_iv_species: List available invertebrate species
-ftp: Return only FTP links
-d/--download: Download files (requires curl)

Examples:

List available species

gget ref --list_species

Get all reference files for human

gget ref homo_sapiens

Download only GTF annotation for mouse

gget ref -w gtf -d mouse

Python

gget.ref("homo_sapiens")
gget.ref("mus_musculus", which="gtf", download=True)

gget search - Gene Search

Locate genes by name or description across species.

Parameters:

searchwords: One or more search terms (case-insensitive)
-s/--species: Target species (e.g., 'homo_sapiens', 'mouse')
-r/--release: Ensembl release number
-t/--id_type: Return 'gene' (default) or 'transcript'
-ao/--andor: 'or' (default) finds ANY searchword; 'and' requires ALL
-l/--limit: Maximum results to return

Returns: ensembl_id, gene_name, ensembl_description, ext_ref_description, biotype, URL

Examples:

Search for GABA-related genes in human

gget search -s human gaba gamma-aminobutyric

Find specific gene, require all terms

gget search -s mouse -ao and pax7 transcription

Python

gget.search(["gaba", "gamma-aminobutyric"], species="homo_sapiens")

gget info - Gene/Transcript Information

Retrieve comprehensive gene and transcript metadata from Ensembl, UniProt, and NCBI.

Parameters:

ens_ids: One or more Ensembl IDs (also supports WormBase, Flybase IDs). Limit: ~1000 IDs
-n/--ncbi: Disable NCBI data retrieval
-u/--uniprot: Disable UniProt data retrieval
-pdb: Include PDB identifiers (increases runtime)

Returns: UniProt ID, NCBI gene ID, primary gene name, synonyms, protein names, descriptions, biotype, canonical transcript

Examples:

Get info for multiple genes

gget info ENSG00000034713 ENSG00000104853 ENSG00000170296

Include PDB IDs

gget info ENSG00000034713 -pdb

Python

gget.info(["ENSG00000034713", "ENSG00000104853"], pdb=True)
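Since gget info accepts roughly 1000 IDs per call, longer ID lists can be split into batches before querying. The sketch below shows one way to do this; `batched` and `info_in_batches` are hypothetical helpers, and the fetch function is passed in as a parameter so that the batching logic does not depend on gget being installed.

```python
from typing import Callable, Iterator, List, Sequence


def batched(ids: Sequence[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def info_in_batches(
    ids: Sequence[str],
    fetch: Callable[[List[str]], object],
    size: int = 1000,
) -> list:
    """Call `fetch` once per batch and collect the per-batch results in order."""
    return [fetch(batch) for batch in batched(ids, size)]
```

With gget installed, `info_in_batches(ids, gget.info)` would return one result per batch; if each result is a DataFrame, the pieces could then be combined with `pandas.concat`.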

gget seq - Sequence Retrieval

Fetch nucleotide or amino acid sequences for genes and transcripts.

Parameters:

ens_ids: One or more Ensembl identifiers
-t/--translate: Fetch amino acid sequences instead of nucleotide
-iso/--isoforms: Return all transcript variants (gene IDs only)

Returns: FASTA format sequences

Examples:

Get nucleotide sequences

gget seq ENSG00000034713 ENSG00000104853

Get all protein isoforms

gget seq -t -iso ENSG00000034713

Python

gget.seq(["ENSG00000034713"], translate=True, isoforms=True)
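FASTA-style results are often easier to work with as a mapping from identifier to sequence. The sketch below assumes the result is available as a flat sequence of lines in which header lines start with '>' and the following lines hold the sequence; whether your gget version returns exactly this shape is an assumption to verify. `fasta_lines_to_dict` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Optional


def fasta_lines_to_dict(lines: Iterable[str]) -> Dict[str, str]:
    """Map the first word of each '>' header to its concatenated sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            header = parts[0] if parts else ""
            records[header] = ""
        elif header is None:
            raise ValueError("sequence line found before any '>' header")
        else:
            records[header] += line  # join sequences wrapped across lines
    return records
```

The same helper works on the lines of a saved FASTA file, for example `fasta_lines_to_dict(open("out.fa"))`.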

2. Sequence Analysis & Alignment

gget blast - BLAST Searches

BLAST nucleotide or amino acid sequences against standard databases.

Parameters:

sequence: Sequence string or path to FASTA/.txt file
-p/--program: blastn, blastp, blastx, tblastn, tblastx (auto-detected)
-db/--database: Nucleotide: nt, refseq_rna, pdbnt; protein: nr, swissprot, pdbaa, refseq_protein
-l/--limit: Max hits (default: 50)
-e/--expect: E-value cutoff (default: 10.0)
-lcf/--low_comp_filt: Enable low complexity filtering
-mbo/--megablast_off: Disable MegaBLAST (blastn only)

Examples:

BLAST protein sequence

gget blast MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR

BLAST from file with specific database

gget blast sequence.fasta -db swissprot -l 10

Python

gget.blast("MKWMFK...", database="swissprot", limit=10)
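The program is auto-detected from the input, but it can help to make the same decision explicitly in a script, for example to pick a matching database. The following is an illustrative heuristic only, not gget's actual detection logic: it treats a sequence made up solely of nucleotide letters as nucleotide and anything else as protein. `guess_blast_program` is a hypothetical name.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_blast_program(sequence: str) -> str:
    """Return 'blastn' for nucleotide-looking input, otherwise 'blastp'."""
    letters = {ch for ch in sequence.upper() if ch.isalpha()}
    if not letters:
        raise ValueError("sequence contains no letters")
    # A peptide built only from A, C, G, T, U, N would be misclassified here.
    return "blastn" if letters <= NUCLEOTIDE_LETTERS else "blastp"
```

When the heuristic is wrong for a given input, passing the program explicitly with -p/--program overrides any guess.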

gget blat - BLAT Searches

Locate genomic positions of sequences using UCSC BLAT.

Parameters:

sequence: Sequence string or path to FASTA/.txt file
-st/--seqtype: 'DNA', 'protein', 'translated%20RNA', 'translated%20DNA' (auto-detected)
-a/--assembly: Target assembly (default: 'human'/hg38; options: 'mouse'/mm39, 'zebrafinch'/taeGut2, etc.)

Returns: genome, query size, alignment positions, matches, mismatches, alignment percentage

Examples:

Find genomic location in human

gget blat ATCGATCGATCGATCG

Search in different assembly

gget blat -a mm39 ATCGATCGATCGATCG

Python

gget.blat("ATCGATCGATCGATCG", assembly="mouse")

gget muscle - Multiple Sequence Alignment

Align multiple nucleotide or amino acid sequences using Muscle5.

Parameters:

fasta: Sequences or path to FASTA/.txt file
-s5/--super5: Use Super5 algorithm for faster processing (large datasets)

Returns: Aligned sequences in ClustalW format or aligned FASTA (.afa)

Examples:

Align sequences from file

gget muscle sequences.fasta -o aligned.afa

Use Super5 for large dataset

gget muscle large_dataset.fasta -s5

Python

gget.muscle("sequences.fasta", save=True)

gget diamond - Local Sequence Alignment

Perform fast local protein or translated DNA alignment using DIAMOND.

Parameters:

query: Sequences (string/list) or FASTA file path
--reference: Reference sequences (string/list) or FASTA file path (required)
--sensitivity: fast, mid-sensitive, sensitive, more-sensitive, very-sensitive (default), ultra-sensitive
--threads: CPU threads (default: 1)
--diamond_db: Save database for reuse
--translated: Enable nucleotide-to-amino acid alignment

Returns: Identity percentage, sequence lengths, match positions, gap openings, E-values, bit scores

Examples:

Align against reference

gget diamond GGETISAWESQME -ref reference.fasta --threads 4

Save database for reuse

gget diamond query.fasta -ref ref.fasta --diamond_db my_db.dmnd

Python

gget.diamond("GGETISAWESQME", reference="reference.fasta", threads=4)

3. Structural & Protein Analysis

gget pdb - Protein Structures

Query RCSB Protein Data Bank for structure and metadata.

Parameters:

pdb_id: PDB identifier (e.g., '7S7U')
-r/--resource: Data type (pdb, entry, pubmed, assembly, entity types)
-i/--identifier: Assembly, entity, or chain ID

Returns: PDB format (structures) or JSON (metadata)

Examples:

Download PDB structure

gget pdb 7S7U -o 7S7U.pdb

Get metadata

gget pdb 7S7U -r entry

Python

gget.pdb("7S7U", save=True)

gget alphafold - Protein Structure Prediction

Predict 3D protein structures using simplified AlphaFold2.

Setup Required:

Install OpenMM first

uv pip install openmm

Then setup AlphaFold

gget setup alphafold

Parameters:

sequence: Amino acid sequence (string), multiple sequences (list), or FASTA file. Multiple sequences trigger multimer modeling
-mr/--multimer_recycles: Recycling iterations (default: 3; recommend 20 for accuracy)
-mfm/--multimer_for_monomer: Apply multimer model to single proteins
-r/--relax: AMBER relaxation for top-ranked model
plot: Python-only; generate interactive 3D visualization (default: True)
show_sidechains: Python-only; include side chains (default: True)

Returns: PDB structure file, JSON alignment error data, optional 3D visualization

Examples:

Predict single protein structure

gget alphafold MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR

Predict multimer with higher accuracy

gget alphafold sequence1.fasta -mr 20 -r

Python with visualization

gget.alphafold("MKWMFK...", plot=True, show_sidechains=True)

Multimer prediction

gget.alphafold(["sequence1", "sequence2"], multimer_recycles=20)

gget elm - Eukaryotic Linear Motifs

Predict Eukaryotic Linear Motifs in protein sequences.

Setup Required:

gget setup elm

Parameters:

sequence: Amino acid sequence or UniProt accession
-u/--uniprot: Indicates that the input is a UniProt accession
-e/--expand: Include protein names, organisms, references
-s/--sensitivity: DIAMOND alignment sensitivity (default: "very-sensitive")
-t/--threads: Number of threads (default: 1)

Returns: Two outputs:

ortholog_df: Linear motifs from orthologous proteins
regex_df: Motifs directly matched in input sequence

Examples:

Predict motifs from sequence

gget elm LIAQSIGQASFV -o results

Use UniProt accession with expanded info

gget elm --uniprot Q02410 -e

Python

ortholog_df, regex_df = gget.elm("LIAQSIGQASFV")

4. Expression & Disease Data

gget archs4 - Gene Correlation & Tissue Expression

Query ARCHS4 database for correlated genes or tissue expression data.

Parameters:

gene: Gene symbol or Ensembl ID (with --ensembl flag)
-w/--which: 'correlation' (default, returns 100 most correlated genes) or 'tissue' (expression atlas)
-s/--species: 'human' (default) or 'mouse' (tissue data only)
-e/--ensembl: Input is Ensembl ID

Returns:

Correlation mode: Gene symbols, Pearson correlation coefficients
Tissue mode: Tissue identifiers, min/Q1/median/Q3/max expression values

Examples:

Get correlated genes

gget archs4 ACE2

Get tissue expression

gget archs4 -w tissue ACE2

Python

gget.archs4("ACE2", which="tissue")

gget cellxgene - Single-Cell RNA-seq Data

Query CZ CELLxGENE Discover Census for single-cell data.

Setup Required:

gget setup cellxgene

Parameters:

--gene (-g): Gene names or Ensembl IDs (case-sensitive! 'PAX7' for human, 'Pax7' for mouse)
--tissue: Tissue type(s)
--cell_type: Specific cell type(s)
--species (-s): 'homo_sapiens' (default) or 'mus_musculus'
--census_version (-cv): Version ("stable", "latest", or dated)
--ensembl (-e): Use Ensembl IDs
--meta_only (-mo): Return metadata only
Additional filters: disease, development_stage, sex, assay, dataset_id, donor_id, ethnicity, suspension_type

Returns: AnnData object with count matrices and metadata (or metadata-only dataframes)

Examples:

Get single-cell data for specific genes and cell types

gget cellxgene --gene ACE2 ABCA1 --tissue lung --cell_type "mucus secreting cell" -o lung_data.h5ad

Metadata only

gget cellxgene --gene PAX7 --tissue muscle --meta_only -o metadata.csv

Python

adata = gget.cellxgene(gene=["ACE2", "ABCA1"], tissue="lung", cell_type="mucus secreting cell")
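Because gene names are case-sensitive here, it can be useful to normalize user input before querying. The sketch below applies the casing shown in the parameter description ('PAX7' for human, 'Pax7' for mouse). It is a naming-convention heuristic under that assumption, not a gget feature, and some symbols do not follow the convention, so results should be checked. `normalize_gene_symbol` is a hypothetical helper.

```python
def normalize_gene_symbol(symbol: str, species: str = "homo_sapiens") -> str:
    """Apply conventional casing: all caps for human, capitalized for mouse."""
    symbol = symbol.strip()
    if species == "homo_sapiens":
        return symbol.upper()
    if species == "mus_musculus":
        return symbol[:1].upper() + symbol[1:].lower()
    raise ValueError(f"no casing convention defined for species {species!r}")
```

For example, `normalize_gene_symbol("pax7", "mus_musculus")` gives the mouse-style spelling that could then be passed as the gene argument.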

gget enrichr - Enrichment Analysis

Perform ontology enrichment analysis on gene lists using Enrichr.

Parameters:

genes: Gene symbols or Ensembl IDs
-db/--database: Reference database (supports shortcuts: 'pathway', 'transcription', 'ontology', 'diseases_drugs', 'celltypes')
-s/--species: human (default), mouse, fly, yeast, worm, fish
-bkg_l/--background_list: Background genes for comparison
-ko/--kegg_out: Save KEGG pathway images with highlighted genes
plot: Python-only; generate graphical results

Database Shortcuts:

'pathway' → KEGG_2021_Human
'transcription' → ChEA_2016
'ontology' → GO_Biological_Process_2021
'diseases_drugs' → GWAS_Catalog_2019
'celltypes' → PanglaoDB_Augmented_2021
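For reproducibility, a script may want to record the full library name rather than the shortcut. The sketch below mirrors the shortcut table above in a dictionary; gget resolves shortcuts itself, so this is only a convenience for logging, and `resolve_enrichr_database` is a hypothetical helper. The mapping reflects the table in this document and may differ in other gget versions.

```python
ENRICHR_SHORTCUTS = {
    "pathway": "KEGG_2021_Human",
    "transcription": "ChEA_2016",
    "ontology": "GO_Biological_Process_2021",
    "diseases_drugs": "GWAS_Catalog_2019",
    "celltypes": "PanglaoDB_Augmented_2021",
}


def resolve_enrichr_database(name: str) -> str:
    """Return the full library name for a shortcut; pass other names through."""
    return ENRICHR_SHORTCUTS.get(name, name)
```

Names that are not shortcuts are returned unchanged, so a full library name can be passed to the same function.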

Examples:

Enrichment analysis for ontology

gget enrichr -db ontology ACE2 AGT AGTR1

Save KEGG pathways

gget enrichr -db pathway ACE2 AGT AGTR1 -ko ./kegg_images/

Python with plot

gget.enrichr(["ACE2", "AGT", "AGTR1"], database="ontology", plot=True)

gget bgee - Orthology & Expression

Retrieve orthology and gene expression data from Bgee database.

Parameters:

ens_id: Ensembl gene ID or NCBI gene ID (for non-Ensembl species). Multiple IDs supported when type=expression
-t/--type: 'orthologs' (default) or 'expression'

Returns:

Orthologs mode: Matching genes across species with IDs, names, taxonomic info
Expression mode: Anatomical entities, confidence scores, expression status

Examples:

Get orthologs

gget bgee ENSG00000169194

Get expression data

gget bgee ENSG00000169194 -t expression

Multiple genes

gget bgee ENSBTAG00000047356 ENSBTAG00000018317 -t expression

Python

gget.bgee("ENSG00000169194", type="orthologs")

gget opentargets - Disease & Drug Associations

Retrieve disease and drug associations from OpenTargets.

Parameters:

Ensembl gene ID (required)
-r/--resource: diseases (default), drugs, tractability, pharmacogenetics, expression, depmap, interactions
-l/--limit: Maximum number of results
Filter arguments (vary by resource):
  drugs: --filter_disease
  pharmacogenetics: --filter_drug
  expression/depmap: --filter_tissue, --filter_anat_sys, --filter_organ
  interactions: --filter_protein_a, --filter_protein_b, --filter_gene_b

Examples:

Get associated diseases

gget opentargets ENSG00000169194 -r diseases -l 5

Get associated drugs

gget opentargets ENSG00000169194 -r drugs -l 10

Get tissue expression

gget opentargets ENSG00000169194 -r expression --filter_tissue brain

Python

gget.opentargets("ENSG00000169194", resource="diseases", limit=5)

gget cbio - cBioPortal Cancer Genomics

Plot cancer genomics heatmaps using cBioPortal data.

Two subcommands:

search - Find study IDs:

gget cbio search breast lung

plot - Generate heatmaps:

Parameters:

-s/--study_ids: Space-separated cBioPortal study IDs (required)
-g/--genes: Space-separated gene names or Ensembl IDs (required)
-st/--stratification: Column to organize data (tissue, cancer_type, cancer_type_detailed, study_id, sample)
-vt/--variation_type: Data type (mutation_occurrences, cna_nonbinary, sv_occurrences, cna_occurrences, Consequence)
-f/--filter: Filter by column value (e.g., 'study_id:msk_impact_2017')
-dd/--data_dir: Cache directory (default: ./gget_cbio_cache)
-fd/--figure_dir: Output directory (default: ./gget_cbio_figures)
-dpi: Resolution (default: 100)
-sh/--show: Display plot in window
-nc/--no_confirm: Skip download confirmations

Examples:

Search for studies

gget cbio search esophag ovary

Create heatmap

gget cbio plot -s msk_impact_2017 -g AKT1 ALK BRAF -st tissue -vt mutation_occurrences

Python

gget.cbio_search(["esophag", "ovary"])
gget.cbio_plot(["msk_impact_2017"], ["AKT1", "ALK"], stratification="tissue")

gget cosmic - COSMIC Database

Search COSMIC (Catalogue Of Somatic Mutations In Cancer) database.

Important: License fees apply for commercial use. Requires COSMIC account credentials.

Parameters:

searchterm: Gene name, Ensembl ID, mutation notation, or sample ID
-ctp/--cosmic_tsv_path: Path to downloaded COSMIC TSV file (required for querying)
-l/--limit: Maximum results (default: 100)

Database download flags:

-d/--download_cosmic: Activate download mode
-gm/--gget_mutate: Create version for gget mutate
-cp/--cosmic_project: Database type (cancer, census, cell_line, resistance, genome_screen, targeted_screen)
-cv/--cosmic_version: COSMIC version
-gv/--grch_version: Human reference genome (37 or 38)
--email, --password: COSMIC credentials

Examples:

First download database

gget cosmic -d --email user@example.com --password xxx -cp cancer

Then query

gget cosmic EGFR -ctp cosmic_data.tsv -l 10

Python

gget.cosmic("EGFR", cosmic_tsv_path="cosmic_data.tsv", limit=10)

5. Additional Tools

gget mutate - Generate Mutated Sequences

Generate mutated nucleotide sequences from mutation annotations.

Parameters:

sequences: FASTA file path or direct sequence input (string/list)
-m/--mutations: CSV/TSV file or DataFrame with mutation data (required)
-mc/--mut_column: Mutation column name (default: 'mutation')
-sic/--seq_id_column: Sequence ID column (default: 'seq_ID')
-mic/--mut_id_column: Mutation ID column
-k/--k: Length of flanking sequences (default: 30 nucleotides)

Returns: Mutated sequences in FASTA format

Examples:

Single mutation

gget mutate ATCGCTAAGCT -m "c.4G>T"

Multiple sequences with mutations from file

gget mutate sequences.fasta -m mutations.csv -o mutated.fasta

Python

import pandas as pd
mutations_df = pd.DataFrame({"seq_ID": ["seq1"], "mutation": ["c.4G>T"]})
gget.mutate(["ATCGCTAAGCT"], mutations=mutations_df)
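To make the mutation notation concrete, the sketch below applies a single substitution such as "c.4G>T" (position 4, reference G, replaced by T, counting from 1) to a sequence. It only illustrates how the notation reads; it is not gget mutate's implementation, which also covers other mutation types and the flanking length set by -k. `apply_substitution` is a hypothetical helper.

```python
import re

# Matches simple substitutions like "c.4G>T": position, reference base, new base.
_SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")


def apply_substitution(sequence: str, mutation: str) -> str:
    """Return `sequence` with the single-base substitution applied."""
    match = _SUBSTITUTION.match(mutation)
    if match is None:
        raise ValueError(f"unsupported mutation notation: {mutation!r}")
    index = int(match.group(1)) - 1  # notation is 1-based, Python is 0-based
    ref, alt = match.group(2), match.group(3)
    if not 0 <= index < len(sequence):
        raise ValueError("mutation position lies outside the sequence")
    if sequence[index].upper() != ref:
        raise ValueError(f"expected {ref} at position {index + 1}, found {sequence[index]}")
    return sequence[:index] + alt + sequence[index + 1:]
```

Checking the reference base before substituting catches mutations that were annotated against a different sequence or coordinate system.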

gget gpt - OpenAI Text Generation

Generate natural language text using OpenAI's API.

Setup Required:

gget setup gpt

Important: Free tier limited to 3 months after account creation. Set monthly billing limits.

Parameters:

prompt: Text input for generation (required)
api_key: OpenAI authentication (required)
Model configuration: temperature, top_p, max_tokens, frequency_penalty, presence_penalty
Default model: gpt-3.5-turbo (configurable)

Examples:

gget gpt "Explain CRISPR" --api_key your_key_here

Python

gget.gpt("Explain CRISPR", api_key="your_key_here")

gget setup - Install Dependencies

Install/download third-party dependencies for specific modules.

Parameters:

module: Module name requiring dependency installation
-o/--out: Output folder path (elm module only)

Modules requiring setup:

alphafold - Downloads ~4GB of model parameters
cellxgene - Installs cellxgene-census (may not support latest Python)
elm - Downloads local ELM database
gpt - Configures OpenAI integration

Examples:

Setup AlphaFold

gget setup alphafold

Setup ELM with custom directory

gget setup elm -o /path/to/elm_data

Python

gget.setup("alphafold")

Common Workflows

Workflow 1: Gene Discovery to Sequence Analysis

Find and analyze genes of interest:

1. Search for genes

results = gget.search(["GABA", "receptor"], species="homo_sapiens")

2. Get detailed information

gene_ids = results["ensembl_id"].tolist()
info = gget.info(gene_ids[:5])

3. Retrieve sequences

sequences = gget.seq(gene_ids[:5], translate=True)

Workflow 2: Sequence Alignment and Structure

Align sequences and predict structures:

1. Align multiple sequences

alignment = gget.muscle("sequences.fasta")

2. Find similar sequences

blast_results = gget.blast(my_sequence, database="swissprot", limit=10)

3. Predict structure

structure = gget.alphafold(my_sequence, plot=True)

4. Find linear motifs

ortholog_df, regex_df = gget.elm(my_sequence)

Workflow 3: Gene Expression and Enrichment

Analyze expression patterns and functional enrichment:

1. Get tissue expression

tissue_expr = gget.archs4("ACE2", which="tissue")

2. Find correlated genes

correlated = gget.archs4("ACE2", which="correlation")

3. Get single-cell data

adata = gget.cellxgene(gene=["ACE2"], tissue="lung", cell_type="epithelial cell")

4. Perform enrichment analysis

gene_list = correlated["gene_symbol"].tolist()[:50]
enrichment = gget.enrichr(gene_list, database="ontology", plot=True)

Workflow 4: Disease and Drug Analysis

Investigate disease associations and therapeutic targets:

1. Search for genes

genes = gget.search(["breast cancer"], species="homo_sapiens")

2. Get disease associations

diseases = gget.opentargets("ENSG00000169194", resource="diseases")

3. Get drug associations

drugs = gget.opentargets("ENSG00000169194", resource="drugs")

4. Query cancer genomics data

study_ids = gget.cbio_search(["breast"])
gget.cbio_plot(study_ids[:2], ["BRCA1", "BRCA2"], stratification="cancer_type")

5. Search COSMIC for mutations

cosmic_results = gget.cosmic("BRCA1", cosmic_tsv_path="cosmic.tsv")

Workflow 5: Comparative Genomics

Compare proteins across species:

1. Get orthologs

orthologs = gget.bgee("ENSG00000169194", type="orthologs")

2. Get sequences for comparison

human_seq = gget.seq("ENSG00000169194", translate=True)
mouse_seq = gget.seq("ENSMUSG00000026091", translate=True)

3. Align sequences

alignment = gget.muscle([human_seq, mouse_seq])

4. Compare structures

human_structure = gget.pdb("7S7U")
mouse_structure = gget.alphafold(mouse_seq)

Workflow 6: Building Reference Indices

Prepare reference data for downstream analysis (e.g., kallisto|bustools):

1. List available species

gget ref --list_species

2. Download reference files

gget ref -w gtf,cdna -d homo_sapiens

3. Build kallisto index

kallisto index -i transcriptome.idx transcriptome.fasta

4. Download genome for alignment

gget ref -w dna -d homo_sapiens

Best Practices

Data Retrieval

Use --limit to control result sizes for large queries
Save results with -o/--out for reproducibility
Check database versions/releases for consistency across analyses
Use --quiet in production scripts to reduce output

Sequence Analysis

For BLAST/BLAT, start with default parameters, then adjust sensitivity
Use gget diamond with --threads for faster local alignment
Save DIAMOND databases with --diamond_db for repeated queries
For multiple sequence alignment, use -s5/--super5 for large datasets

Expression and Disease Data

Gene symbols are case-sensitive in cellxgene (e.g., 'PAX7' vs 'Pax7')
Run gget setup before first use of alphafold, cellxgene, elm, gpt
For enrichment analysis, use database shortcuts for convenience
Cache cBioPortal data with -dd to avoid repeated downloads

Structure Prediction

AlphaFold multimer predictions: use -mr 20 for higher accuracy
Use -r flag for AMBER relaxation of final structures
Visualize results in Python with plot=True
Check PDB database first before running AlphaFold predictions

Error Handling

Database structures change; update gget regularly: uv pip install --upgrade gget
Process max ~1000 Ensembl IDs at once with gget info
For large-scale analyses, implement rate limiting for API queries
Use virtual environments to avoid dependency conflicts

Output Formats

Command-line

Default: JSON
CSV: Add -csv flag
FASTA: gget seq, gget mutate
PDB: gget pdb, gget alphafold
PNG: gget cbio plot

Python

Default: DataFrame or dictionary
JSON: Add json=True parameter
Save to file: Add save=True or specify out="filename"
AnnData: gget cellxgene

Resources

This skill includes reference documentation for detailed module information:

references/
  module_reference.md - Comprehensive parameter reference for all modules
  database_info.md - Information about queried databases and their update frequencies
  workflows.md - Extended workflow examples and use cases

For additional help:

Official documentation: https://pachterlab.github.io/gget/
GitHub issues: https://github.com/pachterlab/gget/issues
Citation: Luebbert, L. & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac836
