tooluniverse-single-cell

安装量: 36
排名: #19253

安装

npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-single-cell
Single-Cell Genomics and Expression Matrix Analysis
Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse.
When to Use This Skill
Apply when users:
Have scRNA-seq data (h5ad, 10X, CSV count matrices) and want analysis
Ask about cell type identification, clustering, or annotation
Need differential expression analysis by cell type or condition
Want gene-expression correlation analysis (e.g., gene length vs expression by cell type)
Ask about PCA, UMAP, t-SNE for expression data
Need Leiden/Louvain clustering on expression matrices
Want statistical comparisons between cell types (t-test, ANOVA, fold change)
Ask about marker genes, batch correction, trajectory, or cell-cell communication
BixBench Coverage
18+ questions across 5 projects (bix-22, bix-27, bix-31, bix-33, bix-36) NOT for (use other skills instead): Bulk RNA-seq DESeq2 only -> tooluniverse-rnaseq-deseq2 Gene enrichment only -> tooluniverse-gene-enrichment VCF/variant analysis -> tooluniverse-variant-analysis Core Principles Data-first - Load, inspect, validate before analysis AnnData-centric - All data flows through anndata objects Cell type awareness - Per-cell-type subsetting when needed Statistical rigor - Normalization, multiple testing correction, effect sizes Question-driven - Parse what the user is actually asking Required Packages import scanpy as sc , anndata as ad , pandas as pd , numpy as np from scipy import stats from scipy . cluster . hierarchy import linkage , fcluster from sklearn . decomposition import PCA from statsmodels . stats . multitest import multipletests import gseapy as gp

enrichment

import harmonypy

batch correction (optional)

Install: pip install scanpy anndata leidenalg umap-learn harmonypy gseapy pandas numpy scipy scikit-learn statsmodels Workflow Decision Tree START: User question about scRNA-seq data | +-- FULL PIPELINE (raw counts -> annotated clusters) | Workflow: QC -> Normalize -> HVG -> PCA -> Cluster -> Annotate -> DE | See: references/scanpy_workflow.md | +-- DIFFERENTIAL EXPRESSION (per-cell-type comparison) | Most common BixBench pattern (bix-33) | See: analysis_patterns.md "Pattern 1" | +-- CORRELATION ANALYSIS (gene property vs expression) | Pattern: Gene length vs expression (bix-22) | See: analysis_patterns.md "Pattern 2" | +-- CLUSTERING & PCA (expression matrix analysis) | See: references/clustering_guide.md | +-- CELL COMMUNICATION (ligand-receptor interactions) | See: references/cell_communication.md | +-- TRAJECTORY ANALYSIS (pseudotime) See: references/trajectory_analysis.md Data format handling : h5ad -> sc.read_h5ad() 10X -> sc.read_10x_mtx() or sc.read_10x_h5() CSV/TSV -> pd.read_csv() -> Convert to AnnData (check orientation!) Data Loading AnnData expects: cells/samples as rows (obs), genes as columns (var) adata = sc . read_h5ad ( "data.h5ad" )

h5ad already oriented

CSV/TSV: check orientation

df

pd . read_csv ( "counts.csv" , index_col = 0 ) if df . shape [ 0 ]

df . shape [ 1 ] * 5 :

genes > samples by 5x => transpose

df

df . T adata = ad . AnnData ( df )

Load metadata

meta

pd . read_csv ( "metadata.csv" , index_col = 0 ) common = adata . obs_names . intersection ( meta . index ) adata = adata [ common ] . copy ( ) for col in meta . columns : adata . obs [ col ] = meta . loc [ common , col ] Quality Control adata . var [ 'mt' ] = adata . var_names . str . startswith ( ( 'MT-' , 'mt-' ) ) sc . pp . calculate_qc_metrics ( adata , qc_vars = [ 'mt' ] , inplace = True ) sc . pp . filter_cells ( adata , min_genes = 200 ) adata = adata [ adata . obs [ 'pct_counts_mt' ] < 20 ] . copy ( ) sc . pp . filter_genes ( adata , min_cells = 3 ) See: references/scanpy_workflow.md for details Complete Pipeline (Quick Reference) import scanpy as sc adata = sc . read_10x_h5 ( "filtered_feature_bc_matrix.h5" )

QC

adata . var [ 'mt' ] = adata . var_names . str . startswith ( 'MT-' ) sc . pp . calculate_qc_metrics ( adata , qc_vars = [ 'mt' ] , inplace = True ) adata = adata [ adata . obs [ 'pct_counts_mt' ] < 20 ] . copy ( ) sc . pp . filter_cells ( adata , min_genes = 200 ) sc . pp . filter_genes ( adata , min_cells = 3 )

Normalize + HVG + PCA

sc . pp . normalize_total ( adata , target_sum = 1e4 ) sc . pp . log1p ( adata ) adata . raw = adata . copy ( ) sc . pp . highly_variable_genes ( adata , n_top_genes = 2000 ) sc . tl . pca ( adata , n_comps = 50 )

Cluster + UMAP

sc . pp . neighbors ( adata , n_pcs = 30 ) sc . tl . leiden ( adata , resolution = 0.5 ) sc . tl . umap ( adata )

Find markers + Annotate + Per-cell-type DE

sc . tl . rank_genes_groups ( adata , groupby = 'leiden' , method = 'wilcoxon' ) Differential Expression Decision Tree Single-Cell DE (many cells per condition): Use: sc.tl.rank_genes_groups(), methods: wilcoxon, t-test, logreg Best for: Per-cell-type DE, marker gene finding Pseudo-Bulk DE (aggregate counts by sample): Use: DESeq2 via PyDESeq2 Best for: Sample-level comparisons with replicates Statistical Tests Only: Use: scipy.stats (ttest_ind, f_oneway, pearsonr) Best for: Correlation, ANOVA, t-tests on summaries Statistical Tests (Quick Reference) from scipy import stats from statsmodels . stats . multitest import multipletests

Pearson/Spearman correlation

r , p = stats . pearsonr ( gene_lengths , mean_expression )

Welch's t-test

t_stat , p_val = stats . ttest_ind ( group1 , group2 , equal_var = False )

ANOVA

f_stat , p_val = stats . f_oneway ( group1 , group2 , group3 )

Multiple testing correction (BH)

reject
,
pvals_adj
,
_
,
_
=
multipletests
(
pvals
,
method
=
'fdr_bh'
)
Batch Correction (Harmony)
import
harmonypy
sc
.
tl
.
pca
(
adata
,
n_comps
=
50
)
ho
=
harmonypy
.
run_harmony
(
adata
.
obsm
[
'X_pca'
]
[
:
,
:
30
]
,
adata
.
obs
,
'batch'
,
random_state
=
0
)
adata
.
obsm
[
'X_pca_harmony'
]
=
ho
.
Z_corr
.
T
sc
.
pp
.
neighbors
(
adata
,
use_rep
=
'X_pca_harmony'
)
sc
.
tl
.
leiden
(
adata
,
resolution
=
0.5
)
sc
.
tl
.
umap
(
adata
)
ToolUniverse Integration
Gene Annotation
HPA_search_genes_by_query
Cell-type marker gene search
MyGene_query_genes
/
MyGene_batch_query
Gene ID conversion
ensembl_lookup_gene
Ensembl gene details
UniProt_get_function_by_accession
Protein function
Cell-Cell Communication
OmniPath_get_ligand_receptor_interactions
L-R pairs (CellPhoneDB, CellChatDB)
OmniPath_get_signaling_interactions
Downstream signaling
OmniPath_get_complexes
Multi-subunit receptors
Enrichment (Post-DE)
PANTHER_enrichment
GO enrichment (BP, MF, CC)
STRING_functional_enrichment
Network-based enrichment
ReactomeAnalysis_pathway_enrichment
Reactome pathways
Scanpy vs Seurat Equivalents
Operation
Seurat (R)
Scanpy (Python)
Load data
Read10X()
sc.read_10x_mtx()
Normalize
NormalizeData()
sc.pp.normalize_total() + sc.pp.log1p()
Find HVGs
FindVariableFeatures()
sc.pp.highly_variable_genes()
PCA
RunPCA()
sc.tl.pca()
Cluster
FindClusters()
sc.tl.leiden()
UMAP
RunUMAP()
sc.tl.umap()
Find markers
FindMarkers()
sc.tl.rank_genes_groups()
Batch correction
RunHarmony()
harmonypy.run_harmony()
Troubleshooting
Issue
Solution
ModuleNotFoundError: leidenalg
pip install leidenalg
Sparse matrix errors
.toarray()
:
X = adata.X.toarray() if issparse(adata.X) else adata.X
Wrong matrix orientation
More genes than samples? Transpose
NaN in correlation
Filter:
valid = ~np.isnan(x) & ~np.isnan(y)
Too few cells for DE
Need >= 3 cells per condition per cell type
Memory error
Use
sc.pp.highly_variable_genes()
to reduce features
Reference Documentation
Detailed Analysis Patterns
analysis_patterns.md (per-cell-type DE, correlation, PCA, ANOVA, cell communication) Core Workflows : references/scanpy_workflow.md - Complete scanpy pipeline references/seurat_workflow.md - Seurat to Scanpy translation references/clustering_guide.md - Clustering methods references/marker_identification.md - Marker genes, annotation references/trajectory_analysis.md - Pseudotime references/cell_communication.md - OmniPath/CellPhoneDB workflow references/troubleshooting.md - Detailed error solutions
返回排行榜