tooluniverse-rnaseq-deseq2

安装量: 116
排名: #7368

安装

npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-rnaseq-deseq2
RNA-seq Differential Expression Analysis (DESeq2)
Differential expression analysis of RNA-seq count data using PyDESeq2, with enrichment analysis (gseapy) and gene annotation via ToolUniverse.
BixBench Coverage
Validated on 53 BixBench questions across 15 computational biology projects. Core Principles Data-first - Load and validate count data and metadata BEFORE any analysis Statistical rigor - Proper normalization, dispersion estimation, multiple testing correction Flexible design - Single-factor, multi-factor, and interaction designs Threshold awareness - Apply user-specified thresholds exactly (padj, log2FC, baseMean) Reproducible - Set random seeds, document all parameters Question-driven - Parse what the user is actually asking; extract the specific answer Enrichment integration - Chain DESeq2 results into pathway/GO enrichment when requested When to Use RNA-seq count matrices needing differential expression analysis DESeq2, DEGs, padj, log2FC questions Dispersion estimates or diagnostics GO, KEGG, Reactome enrichment on DEGs Specific gene expression changes between conditions Batch effect correction in RNA-seq Required Packages import pandas as pd , numpy as np from pydeseq2 . dds import DeseqDataSet from pydeseq2 . ds import DeseqStats import gseapy as gp

enrichment (optional)

from tooluniverse import ToolUniverse

annotation (optional)

Analysis Workflow
Step 1: Parse the Question
Extract: data files, thresholds (padj/log2FC/baseMean), design factors, contrast, direction, enrichment type, specific genes. See
question_parsing.md
.
Step 2: Load & Validate Data
Load counts + metadata, ensure samples-as-rows/genes-as-columns, verify integer counts, align sample names, remove zero-count genes. See
data_loading.md
.
Step 2.5: Inspect Metadata (REQUIRED)
List ALL metadata columns and levels. Categorize as biological interest vs batch/block. Build design formula with covariates first, factor of interest last. See
design_formula_guide.md
.
Step 3: Run PyDESeq2
Set reference level via
pd.Categorical
, create
DeseqDataSet
, call
dds.deseq2()
, extract
DeseqStats
with contrast, run Wald test, optionally apply LFC shrinkage. See
pydeseq2_workflow.md
.
Tool boundaries
:
Python (PyDESeq2)
ALL DESeq2 analysis
ToolUniverse
ONLY gene annotation (ID conversion, pathway context)
gseapy
Enrichment analysis (GO/KEGG/Reactome)
Step 4: Filter Results
Apply padj, log2FC, baseMean thresholds. Split by direction if needed. See
result_filtering.md
.
Step 5: Dispersion Analysis (if asked)
Key columns:
genewise_dispersions
,
fitted_dispersions
,
MAP_dispersions
,
dispersions
. See
dispersion_analysis.md
.
Step 6: Enrichment (optional)
Use gseapy
enrich()
with appropriate gene set library. See
enrichment_analysis.md
.
Step 7: Gene Annotation (optional)
Use ToolUniverse for ID conversion and gene context only. See
output_formatting.md
.
Common Patterns
Pattern
Type
Key Operation
1
DEG count
len(results[(padj<0.05) & (abs(lfc)>0.5)])
2
Gene value
results.loc['GENE', 'log2FoldChange']
3
Direction
Filter
log2FoldChange > 0
or
< 0
4
Set ops
degs_A - degs_B
for unique DEGs
5
Dispersion
(dds.var['genewise_dispersions'] < thr).sum()
See
bixbench_examples.md
for all 10 patterns with examples.
Error Quick Reference
Error
Fix
No matching samples
Transpose counts; strip whitespace
Dispersion trend no converge
fit_type='mean'
Contrast not found
Check
metadata['factor'].unique()
Non-integer counts
Round to int OR use t-test
NaN in padj
Independent filtering removed genes
See
troubleshooting.md
for full debugging guide.
Known Limitations
PyDESeq2 vs R DESeq2
Numerical differences exist for very low dispersion genes (<1e-05). For exact R reproducibility, use rpy2.
gseapy vs R clusterProfiler
Results may differ. See r_clusterprofiler_guide.md . Reference Files question_parsing.md - Extract parameters from questions data_loading.md - Data loading and validation design_formula_guide.md - Multi-factor design decision tree pydeseq2_workflow.md - Complete PyDESeq2 code examples result_filtering.md - Advanced filtering and extraction dispersion_analysis.md - Dispersion diagnostics enrichment_analysis.md - GO/KEGG/Reactome workflows output_formatting.md - Format answers correctly bixbench_examples.md - All 10 question patterns troubleshooting.md - Common issues and debugging r_clusterprofiler_guide.md - R clusterProfiler via rpy2 Utility Scripts format_deseq2_output.py - Output formatters load_count_matrix.py - Data loading utilities
返回排行榜