- Single-Cell Genomics and Expression Matrix Analysis
- Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse.
- When to Use This Skill
- Apply when users:
- Have scRNA-seq data (h5ad, 10X, CSV count matrices) and want analysis
- Ask about cell type identification, clustering, or annotation
- Need differential expression analysis by cell type or condition
- Want gene-expression correlation analysis (e.g., gene length vs expression by cell type)
- Ask about PCA, UMAP, t-SNE for expression data
- Need Leiden/Louvain clustering on expression matrices
- Want statistical comparisons between cell types (t-test, ANOVA, fold change)
- Ask about marker genes, batch correction, trajectory, or cell-cell communication
- BixBench Coverage
- 18+ questions across 5 projects (bix-22, bix-27, bix-31, bix-33, bix-36) NOT for (use other skills instead): Bulk RNA-seq DESeq2 only -> tooluniverse-rnaseq-deseq2 Gene enrichment only -> tooluniverse-gene-enrichment VCF/variant analysis -> tooluniverse-variant-analysis Core Principles Data-first - Load, inspect, validate before analysis AnnData-centric - All data flows through anndata objects Cell type awareness - Per-cell-type subsetting when needed Statistical rigor - Normalization, multiple testing correction, effect sizes Question-driven - Parse what the user is actually asking Required Packages import scanpy as sc , anndata as ad , pandas as pd , numpy as np from scipy import stats from scipy . cluster . hierarchy import linkage , fcluster from sklearn . decomposition import PCA from statsmodels . stats . multitest import multipletests import gseapy as gp
enrichment
import harmonypy
batch correction (optional)
Install: pip install scanpy anndata leidenalg umap-learn harmonypy gseapy pandas numpy scipy scikit-learn statsmodels Workflow Decision Tree START: User question about scRNA-seq data | +-- FULL PIPELINE (raw counts -> annotated clusters) | Workflow: QC -> Normalize -> HVG -> PCA -> Cluster -> Annotate -> DE | See: references/scanpy_workflow.md | +-- DIFFERENTIAL EXPRESSION (per-cell-type comparison) | Most common BixBench pattern (bix-33) | See: analysis_patterns.md "Pattern 1" | +-- CORRELATION ANALYSIS (gene property vs expression) | Pattern: Gene length vs expression (bix-22) | See: analysis_patterns.md "Pattern 2" | +-- CLUSTERING & PCA (expression matrix analysis) | See: references/clustering_guide.md | +-- CELL COMMUNICATION (ligand-receptor interactions) | See: references/cell_communication.md | +-- TRAJECTORY ANALYSIS (pseudotime) See: references/trajectory_analysis.md Data format handling : h5ad -> sc.read_h5ad() 10X -> sc.read_10x_mtx() or sc.read_10x_h5() CSV/TSV -> pd.read_csv() -> Convert to AnnData (check orientation!) Data Loading AnnData expects: cells/samples as rows (obs), genes as columns (var) adata = sc . read_h5ad ( "data.h5ad" )
h5ad already oriented
CSV/TSV: check orientation
df
pd . read_csv ( "counts.csv" , index_col = 0 ) if df . shape [ 0 ]
df . shape [ 1 ] * 5 :
genes > samples by 5x => transpose
df
df . T adata = ad . AnnData ( df )
Load metadata
meta
pd . read_csv ( "metadata.csv" , index_col = 0 ) common = adata . obs_names . intersection ( meta . index ) adata = adata [ common ] . copy ( ) for col in meta . columns : adata . obs [ col ] = meta . loc [ common , col ] Quality Control adata . var [ 'mt' ] = adata . var_names . str . startswith ( ( 'MT-' , 'mt-' ) ) sc . pp . calculate_qc_metrics ( adata , qc_vars = [ 'mt' ] , inplace = True ) sc . pp . filter_cells ( adata , min_genes = 200 ) adata = adata [ adata . obs [ 'pct_counts_mt' ] < 20 ] . copy ( ) sc . pp . filter_genes ( adata , min_cells = 3 ) See: references/scanpy_workflow.md for details Complete Pipeline (Quick Reference) import scanpy as sc adata = sc . read_10x_h5 ( "filtered_feature_bc_matrix.h5" )
QC
adata . var [ 'mt' ] = adata . var_names . str . startswith ( 'MT-' ) sc . pp . calculate_qc_metrics ( adata , qc_vars = [ 'mt' ] , inplace = True ) adata = adata [ adata . obs [ 'pct_counts_mt' ] < 20 ] . copy ( ) sc . pp . filter_cells ( adata , min_genes = 200 ) sc . pp . filter_genes ( adata , min_cells = 3 )
Normalize + HVG + PCA
sc . pp . normalize_total ( adata , target_sum = 1e4 ) sc . pp . log1p ( adata ) adata . raw = adata . copy ( ) sc . pp . highly_variable_genes ( adata , n_top_genes = 2000 ) sc . tl . pca ( adata , n_comps = 50 )
Cluster + UMAP
sc . pp . neighbors ( adata , n_pcs = 30 ) sc . tl . leiden ( adata , resolution = 0.5 ) sc . tl . umap ( adata )
Find markers + Annotate + Per-cell-type DE
sc . tl . rank_genes_groups ( adata , groupby = 'leiden' , method = 'wilcoxon' ) Differential Expression Decision Tree Single-Cell DE (many cells per condition): Use: sc.tl.rank_genes_groups(), methods: wilcoxon, t-test, logreg Best for: Per-cell-type DE, marker gene finding Pseudo-Bulk DE (aggregate counts by sample): Use: DESeq2 via PyDESeq2 Best for: Sample-level comparisons with replicates Statistical Tests Only: Use: scipy.stats (ttest_ind, f_oneway, pearsonr) Best for: Correlation, ANOVA, t-tests on summaries Statistical Tests (Quick Reference) from scipy import stats from statsmodels . stats . multitest import multipletests
Pearson/Spearman correlation
r , p = stats . pearsonr ( gene_lengths , mean_expression )
Welch's t-test
t_stat , p_val = stats . ttest_ind ( group1 , group2 , equal_var = False )
ANOVA
f_stat , p_val = stats . f_oneway ( group1 , group2 , group3 )
Multiple testing correction (BH)
- reject
- ,
- pvals_adj
- ,
- _
- ,
- _
- =
- multipletests
- (
- pvals
- ,
- method
- =
- 'fdr_bh'
- )
- Batch Correction (Harmony)
- import
- harmonypy
- sc
- .
- tl
- .
- pca
- (
- adata
- ,
- n_comps
- =
- 50
- )
- ho
- =
- harmonypy
- .
- run_harmony
- (
- adata
- .
- obsm
- [
- 'X_pca'
- ]
- [
- :
- ,
- :
- 30
- ]
- ,
- adata
- .
- obs
- ,
- 'batch'
- ,
- random_state
- =
- 0
- )
- adata
- .
- obsm
- [
- 'X_pca_harmony'
- ]
- =
- ho
- .
- Z_corr
- .
- T
- sc
- .
- pp
- .
- neighbors
- (
- adata
- ,
- use_rep
- =
- 'X_pca_harmony'
- )
- sc
- .
- tl
- .
- leiden
- (
- adata
- ,
- resolution
- =
- 0.5
- )
- sc
- .
- tl
- .
- umap
- (
- adata
- )
- ToolUniverse Integration
- Gene Annotation
- HPA_search_genes_by_query
-
- Cell-type marker gene search
- MyGene_query_genes
- /
- MyGene_batch_query
-
- Gene ID conversion
- ensembl_lookup_gene
-
- Ensembl gene details
- UniProt_get_function_by_accession
-
- Protein function
- Cell-Cell Communication
- OmniPath_get_ligand_receptor_interactions
-
- L-R pairs (CellPhoneDB, CellChatDB)
- OmniPath_get_signaling_interactions
-
- Downstream signaling
- OmniPath_get_complexes
-
- Multi-subunit receptors
- Enrichment (Post-DE)
- PANTHER_enrichment
-
- GO enrichment (BP, MF, CC)
- STRING_functional_enrichment
-
- Network-based enrichment
- ReactomeAnalysis_pathway_enrichment
-
- Reactome pathways
- Scanpy vs Seurat Equivalents
- Operation
- Seurat (R)
- Scanpy (Python)
- Load data
- Read10X()
- sc.read_10x_mtx()
- Normalize
- NormalizeData()
- sc.pp.normalize_total() + sc.pp.log1p()
- Find HVGs
- FindVariableFeatures()
- sc.pp.highly_variable_genes()
- PCA
- RunPCA()
- sc.tl.pca()
- Cluster
- FindClusters()
- sc.tl.leiden()
- UMAP
- RunUMAP()
- sc.tl.umap()
- Find markers
- FindMarkers()
- sc.tl.rank_genes_groups()
- Batch correction
- RunHarmony()
- harmonypy.run_harmony()
- Troubleshooting
- Issue
- Solution
- ModuleNotFoundError: leidenalg
- pip install leidenalg
- Sparse matrix errors
- .toarray()
- :
- X = adata.X.toarray() if issparse(adata.X) else adata.X
- Wrong matrix orientation
- More genes than samples? Transpose
- NaN in correlation
- Filter:
- valid = ~np.isnan(x) & ~np.isnan(y)
- Too few cells for DE
- Need >= 3 cells per condition per cell type
- Memory error
- Use
- sc.pp.highly_variable_genes()
- to reduce features
- Reference Documentation
- Detailed Analysis Patterns
- analysis_patterns.md (per-cell-type DE, correlation, PCA, ANOVA, cell communication) Core Workflows : references/scanpy_workflow.md - Complete scanpy pipeline references/seurat_workflow.md - Seurat to Scanpy translation references/clustering_guide.md - Clustering methods references/marker_identification.md - Marker genes, annotation references/trajectory_analysis.md - Pseudotime references/cell_communication.md - OmniPath/CellPhoneDB workflow references/troubleshooting.md - Detailed error solutions