Single-Cell Genomics and Expression Matrix Analysis
Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse.
When to Use This Skill
Apply when users:
Have scRNA-seq data (h5ad, 10X, CSV count matrices) and want analysis
Ask about cell type identification, clustering, or annotation
Need differential expression analysis by cell type or condition
Want gene-expression correlation analysis (e.g., gene length vs expression by cell type)
Ask about PCA, UMAP, t-SNE for expression data
Need Leiden/Louvain clustering on expression matrices
Want statistical comparisons between cell types (t-test, ANOVA, fold change)
Ask about marker genes, batch correction, trajectory, or cell-cell communication
BixBench Coverage: 18+ questions across 5 projects (bix-22, bix-27, bix-31, bix-33, bix-36)

NOT for (use other skills instead):
Bulk RNA-seq DESeq2 only -> tooluniverse-rnaseq-deseq2
Gene enrichment only -> tooluniverse-gene-enrichment
VCF/variant analysis -> tooluniverse-variant-analysis

Core Principles
Data-first - Load, inspect, validate before analysis
AnnData-centric - All data flows through anndata objects
Cell type awareness - Per-cell-type subsetting when needed
Statistical rigor - Normalization, multiple testing correction, effect sizes
Question-driven - Parse what the user is actually asking

Required Packages

    import scanpy as sc
    import anndata as ad
    import pandas as pd
    import numpy as np
    from scipy import stats
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.decomposition import PCA
    from statsmodels.stats.multitest import multipletests
    import gseapy as gp   # enrichment
    import harmonypy      # batch correction (optional)
Install:

    pip install scanpy anndata leidenalg umap-learn harmonypy gseapy pandas numpy scipy scikit-learn statsmodels

Workflow Decision Tree

    START: User question about scRNA-seq data
    |
    +-- FULL PIPELINE (raw counts -> annotated clusters)
    |   Workflow: QC -> Normalize -> HVG -> PCA -> Cluster -> Annotate -> DE
    |   See: references/scanpy_workflow.md
    |
    +-- DIFFERENTIAL EXPRESSION (per-cell-type comparison)
    |   Most common BixBench pattern (bix-33)
    |   See: analysis_patterns.md "Pattern 1"
    |
    +-- CORRELATION ANALYSIS (gene property vs expression)
    |   Pattern: Gene length vs expression (bix-22)
    |   See: analysis_patterns.md "Pattern 2"
    |
    +-- CLUSTERING & PCA (expression matrix analysis)
    |   See: references/clustering_guide.md
    |
    +-- CELL COMMUNICATION (ligand-receptor interactions)
    |   See: references/cell_communication.md
    |
    +-- TRAJECTORY ANALYSIS (pseudotime)
        See: references/trajectory_analysis.md

Data format handling:
h5ad -> sc.read_h5ad()
10X -> sc.read_10x_mtx() or sc.read_10x_h5()
CSV/TSV -> pd.read_csv() -> Convert to AnnData (check orientation!)

Data Loading
AnnData expects: cells/samples as rows (obs), genes as columns (var)

    adata = sc.read_h5ad("data.h5ad")  # h5ad already oriented

    # CSV/TSV: check orientation
    df = pd.read_csv("counts.csv", index_col=0)
    if df.shape[0] > df.shape[1] * 5:  # genes > samples by 5x => transpose
        df = df.T
    adata = ad.AnnData(df)

    # Load metadata
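For the orientation check and the metadata step, a minimal pandas-only sketch may help. The helper names `orient_counts` and `align_metadata`, the toy cell and gene labels, and the `cell_type` column are hypothetical and chosen for illustration; the 5x ratio is the same heuristic used above, not a general rule.

```python
import pandas as pd


def orient_counts(df: pd.DataFrame, ratio: int = 5) -> pd.DataFrame:
    """Return a cells-by-genes frame, transposing when rows look like genes."""
    n_rows, n_cols = df.shape
    if n_rows > n_cols * ratio:  # heuristic: genes > samples by 5x => transpose
        return df.T
    return df


def align_metadata(counts: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Reorder metadata rows to match the cell order of `counts`."""
    missing = counts.index.difference(meta.index)
    if len(missing) > 0:
        raise ValueError(f"{len(missing)} cells lack metadata, e.g. {list(missing[:3])}")
    return meta.loc[counts.index]


# Toy input: 12 genes x 2 cells, i.e. the orientation AnnData does not expect.
genes_by_cells = pd.DataFrame(
    [[i, i + 1] for i in range(12)],
    index=[f"gene{i}" for i in range(12)],
    columns=["cell_a", "cell_b"],
)
# Metadata deliberately listed in a different cell order.
meta = pd.DataFrame({"cell_type": ["T", "B"]}, index=["cell_b", "cell_a"])

counts = orient_counts(genes_by_cells)
obs = align_metadata(counts, meta)
print(counts.shape)               # (2, 12)
print(obs["cell_type"].tolist())  # ['B', 'T']
```

With the frames aligned this way, something like `ad.AnnData(counts, obs=obs)` would be the natural next step. Failing loudly on cells without metadata is a design choice here; silently dropping them would be an alternative.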

    # QC
    adata.var['mt'] = adata.var_names.str.startswith('MT-')  # human gene symbols; mouse uses 'mt-'
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
    adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)

    # Normalize + HVG + PCA
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    adata.raw = adata.copy()  # keep log-normalized values for all genes
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    sc.tl.pca(adata, n_comps=50)

    # Cluster + UMAP
    sc.pp.neighbors(adata, n_pcs=30)
    sc.tl.leiden(adata, resolution=0.5)
    sc.tl.umap(adata)

    # Find markers + Annotate + Per-cell-type DE
    sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')

Differential Expression Decision Tree

Single-Cell DE (many cells per condition):
  Use: sc.tl.rank_genes_groups(), methods: wilcoxon, t-test, logreg
  Best for: Per-cell-type DE, marker gene finding

Pseudo-Bulk DE (aggregate counts by sample):
  Use: DESeq2 via PyDESeq2
  Best for: Sample-level comparisons with replicates

Statistical Tests Only:
  Use: scipy.stats (ttest_ind, f_oneway, pearsonr)
  Best for: Correlation, ANOVA, t-tests on summaries

Statistical Tests (Quick Reference)

    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    # Pearson/Spearman correlation
    r, p = stats.pearsonr(gene_lengths, mean_expression)
    rho, p_rho = stats.spearmanr(gene_lengths, mean_expression)

    # Welch's t-test
    t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

    # ANOVA
    f_stat, p_val = stats.f_oneway(group1, group2, group3)

    # Multiple testing correction (BH)
    reject, pvals_adj, _, _ = multipletests(pvals, method='fdr_bh')
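To show how the Welch test and the BH correction fit together for a per-gene comparison, here is a small sketch on synthetic data. The function names `bh_adjust` and `welch_de` are hypothetical, the input is assumed to be log-normalized cells-by-genes arrays, and the Benjamini-Hochberg step is written out by hand in NumPy as a stand-in for `multipletests(..., method='fdr_bh')` so the example needs only NumPy and SciPy.

```python
import numpy as np
from scipy import stats


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (manual stand-in for fdr_bh)."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    ranked = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity starting from the largest p-value, then cap at 1.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


def welch_de(group1, group2):
    """Per-gene Welch t-test between two cells-by-genes arrays."""
    t_stat, p_val = stats.ttest_ind(group1, group2, axis=0, equal_var=False)
    # Difference of group means on the log scale, used as a simple effect size.
    mean_diff = group1.mean(axis=0) - group2.mean(axis=0)
    return t_stat, p_val, bh_adjust(p_val), mean_diff


# Synthetic data: 50 cells per group, 4 genes, only gene 0 is shifted.
rng = np.random.default_rng(0)
g1 = rng.normal(0.0, 1.0, size=(50, 4))
g2 = rng.normal(0.0, 1.0, size=(50, 4))
g1[:, 0] += 3.0

t_stat, p_val, p_adj, mean_diff = welch_de(g1, g2)
print(p_adj[0] < 0.05)  # True: the shifted gene survives correction
```

Reporting the mean difference next to the adjusted p-value keeps an effect size alongside each test, in line with the statistical-rigor principle above.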

Batch Correction (Harmony)

    import harmonypy

    sc.tl.pca(adata, n_comps=50)
    ho = harmonypy.run_harmony(adata.obsm['X_pca'][:, :30], adata.obs, 'batch', random_state=0)
    adata.obsm['X_pca_harmony'] = ho.Z_corr.T
    sc.pp.neighbors(adata, use_rep='X_pca_harmony')
    sc.tl.leiden(adata, resolution=0.5)
    sc.tl.umap(adata)

ToolUniverse Integration

Gene Annotation
HPA_search_genes_by_query - Cell-type marker gene search
MyGene_query_genes / MyGene_batch_query - Gene ID conversion
ensembl_lookup_gene - Ensembl gene details
UniProt_get_function_by_accession - Protein function

Cell-Cell Communication
OmniPath_get_ligand_receptor_interactions - L-R pairs (CellPhoneDB, CellChatDB)
OmniPath_get_signaling_interactions - Downstream signaling
OmniPath_get_complexes - Multi-subunit receptors

Enrichment (Post-DE)
PANTHER_enrichment - GO enrichment (BP, MF, CC)
STRING_functional_enrichment - Network-based enrichment
ReactomeAnalysis_pathway_enrichment - Reactome pathways

Scanpy vs Seurat Equivalents
| Operation | Seurat (R) | Scanpy (Python) |
| --- | --- | --- |
| Load data | Read10X() | sc.read_10x_mtx() |
| Normalize | NormalizeData() | sc.pp.normalize_total() + sc.pp.log1p() |
| Find HVGs | FindVariableFeatures() | sc.pp.highly_variable_genes() |
| PCA | RunPCA() | sc.tl.pca() |
| Cluster | FindClusters() | sc.tl.leiden() |
| UMAP | RunUMAP() | sc.tl.umap() |
| Find markers | FindMarkers() | sc.tl.rank_genes_groups() |
| Batch correction | RunHarmony() | harmonypy.run_harmony() |

Troubleshooting
| Issue | Solution |
| --- | --- |
| ModuleNotFoundError: leidenalg | pip install leidenalg |
| Sparse matrix errors | .toarray(): X = adata.X.toarray() if issparse(adata.X) else adata.X |
| Wrong matrix orientation | More genes than samples? Transpose |
| NaN in correlation | Filter: valid = ~np.isnan(x) & ~np.isnan(y) |
| Too few cells for DE | Need >= 3 cells per condition per cell type |
| Memory error | Use sc.pp.highly_variable_genes() to reduce features |

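Two of the fixes above, densifying a sparse matrix and dropping NaN pairs before a correlation, can be wrapped in small helpers. This is a sketch under assumptions: `to_dense` and `safe_pearson` are hypothetical names, the toy values are made up, and the minimum of three valid pairs is an illustrative threshold rather than a requirement of `scipy.stats.pearsonr`.

```python
import numpy as np
from scipy import sparse, stats


def to_dense(X):
    """Return a dense ndarray whether X is a scipy sparse matrix or already dense."""
    return X.toarray() if sparse.issparse(X) else np.asarray(X)


def safe_pearson(x, y, min_pairs=3):
    """Pearson correlation using only pairs where neither value is NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = ~np.isnan(x) & ~np.isnan(y)
    if valid.sum() < min_pairs:
        raise ValueError(f"only {int(valid.sum())} valid pairs, need {min_pairs}")
    return stats.pearsonr(x[valid], y[valid])


# Sparse 3 cells x 2 genes matrix, as adata.X often is after loading 10X data.
X = sparse.csr_matrix(np.array([[0.0, 2.0], [1.0, 0.0], [0.0, 3.0]]))
mean_expr = to_dense(X).mean(axis=0)  # per-gene mean across cells

# Two of the five pairs contain a NaN and are dropped before correlating.
r, p = safe_pearson([1.0, 2.0, np.nan, 4.0, 5.0], [2.0, 4.1, 6.0, np.nan, 9.8])
```

Densifying a large matrix can itself trigger the memory error listed in the table, so `to_dense` is best applied after subsetting to the genes or cells of interest.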
Reference Documentation
Detailed Analysis Patterns: analysis_patterns.md (per-cell-type DE, correlation, PCA, ANOVA, cell communication)
Core Workflows:
references/scanpy_workflow.md - Complete scanpy pipeline
references/seurat_workflow.md - Seurat to Scanpy translation
references/clustering_guide.md - Clustering methods
references/marker_identification.md - Marker genes, annotation
references/trajectory_analysis.md - Pseudotime
references/cell_communication.md - OmniPath/CellPhoneDB workflow
references/troubleshooting.md - Detailed error solutions

tooluniverse-single-cell
