- RNA-seq Differential Expression Analysis (DESeq2)
- Differential expression analysis of RNA-seq count data using PyDESeq2, with enrichment analysis (gseapy) and gene annotation via ToolUniverse.
- BixBench Coverage
- Validated on 53 BixBench questions across 15 computational biology projects. Core Principles Data-first - Load and validate count data and metadata BEFORE any analysis Statistical rigor - Proper normalization, dispersion estimation, multiple testing correction Flexible design - Single-factor, multi-factor, and interaction designs Threshold awareness - Apply user-specified thresholds exactly (padj, log2FC, baseMean) Reproducible - Set random seeds, document all parameters Question-driven - Parse what the user is actually asking; extract the specific answer Enrichment integration - Chain DESeq2 results into pathway/GO enrichment when requested When to Use RNA-seq count matrices needing differential expression analysis DESeq2, DEGs, padj, log2FC questions Dispersion estimates or diagnostics GO, KEGG, Reactome enrichment on DEGs Specific gene expression changes between conditions Batch effect correction in RNA-seq Required Packages import pandas as pd , numpy as np from pydeseq2 . dds import DeseqDataSet from pydeseq2 . ds import DeseqStats import gseapy as gp
enrichment (optional)
from tooluniverse import ToolUniverse
annotation (optional)
- Analysis Workflow
- Step 1: Parse the Question
- Extract: data files, thresholds (padj/log2FC/baseMean), design factors, contrast, direction, enrichment type, specific genes. See
- question_parsing.md
- .
- Step 2: Load & Validate Data
- Load counts + metadata, ensure samples-as-rows/genes-as-columns, verify integer counts, align sample names, remove zero-count genes. See
- data_loading.md
- .
- Step 2.5: Inspect Metadata (REQUIRED)
- List ALL metadata columns and levels. Categorize as biological interest vs batch/block. Build design formula with covariates first, factor of interest last. See
- design_formula_guide.md
- .
- Step 3: Run PyDESeq2
- Set reference level via
- pd.Categorical
- , create
- DeseqDataSet
- , call
- dds.deseq2()
- , extract
- DeseqStats
- with contrast, run Wald test, optionally apply LFC shrinkage. See
- pydeseq2_workflow.md
- .
- Tool boundaries
- :
- Python (PyDESeq2)
-
- ALL DESeq2 analysis
- ToolUniverse
-
- ONLY gene annotation (ID conversion, pathway context)
- gseapy
-
- Enrichment analysis (GO/KEGG/Reactome)
- Step 4: Filter Results
- Apply padj, log2FC, baseMean thresholds. Split by direction if needed. See
- result_filtering.md
- .
- Step 5: Dispersion Analysis (if asked)
- Key columns:
- genewise_dispersions
- ,
- fitted_dispersions
- ,
- MAP_dispersions
- ,
- dispersions
- . See
- dispersion_analysis.md
- .
- Step 6: Enrichment (optional)
- Use gseapy
- enrich()
- with appropriate gene set library. See
- enrichment_analysis.md
- .
- Step 7: Gene Annotation (optional)
- Use ToolUniverse for ID conversion and gene context only. See
- output_formatting.md
- .
- Common Patterns
- Pattern
- Type
- Key Operation
- 1
- DEG count
- len(results[(padj<0.05) & (abs(lfc)>0.5)])
- 2
- Gene value
- results.loc['GENE', 'log2FoldChange']
- 3
- Direction
- Filter
- log2FoldChange > 0
- or
- < 0
- 4
- Set ops
- degs_A - degs_B
- for unique DEGs
- 5
- Dispersion
- (dds.var['genewise_dispersions'] < thr).sum()
- See
- bixbench_examples.md
- for all 10 patterns with examples.
- Error Quick Reference
- Error
- Fix
- No matching samples
- Transpose counts; strip whitespace
- Dispersion trend no converge
- fit_type='mean'
- Contrast not found
- Check
- metadata['factor'].unique()
- Non-integer counts
- Round to int OR use t-test
- NaN in padj
- Independent filtering removed genes
- See
- troubleshooting.md
- for full debugging guide.
- Known Limitations
- PyDESeq2 vs R DESeq2
-
- Numerical differences exist for very low dispersion genes (<1e-05). For exact R reproducibility, use rpy2.
- gseapy vs R clusterProfiler
- Results may differ. See r_clusterprofiler_guide.md . Reference Files question_parsing.md - Extract parameters from questions data_loading.md - Data loading and validation design_formula_guide.md - Multi-factor design decision tree pydeseq2_workflow.md - Complete PyDESeq2 code examples result_filtering.md - Advanced filtering and extraction dispersion_analysis.md - Dispersion diagnostics enrichment_analysis.md - GO/KEGG/Reactome workflows output_formatting.md - Format answers correctly bixbench_examples.md - All 10 question patterns troubleshooting.md - Common issues and debugging r_clusterprofiler_guide.md - R clusterProfiler via rpy2 Utility Scripts format_deseq2_output.py - Output formatters load_count_matrix.py - Data loading utilities