arboreto


Install

npx skills add https://github.com/davila7/claude-code-templates --skill arboreto

Arboreto Overview

Arboreto is a computational library for inferring gene regulatory networks (GRNs) from gene expression data using parallelized algorithms that scale from single machines to multi-node clusters.

Core capability: Identify which transcription factors (TFs) regulate which target genes based on expression patterns across observations (cells, samples, conditions).

Quick Start

Install arboreto:

uv pip install arboreto

Basic GRN inference:

import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load expression data (observations as rows, genes as columns)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')

    # Infer regulatory network
    network = grnboost2(expression_data=expression_matrix)

    # Save results (TF, target, importance)
    network.to_csv('network.tsv', sep='\t', index=False, header=False)

Critical: Always wrap the script body in an if __name__ == '__main__': guard, because Dask spawns new processes that re-import the main module.

Core Capabilities

1. Basic GRN Inference

Standard GRN inference workflows cover:

- Input data preparation (Pandas DataFrame or NumPy array)
- Running inference with GRNBoost2 or GENIE3
- Filtering by transcription factors
- Output format and interpretation

See: references/basic_inference.md

For standard inference tasks, use the ready-to-run script scripts/basic_grn_inference.py:

python scripts/basic_grn_inference.py expression_data.tsv output_network.tsv --tf-file tfs.txt --seed 777

2. Algorithm Selection

Arboreto provides two algorithms:

GRNBoost2 (Recommended):

- Fast gradient boosting-based inference
- Optimized for large datasets (10k+ observations)
- Default choice for most analyses

GENIE3:

- Random Forest-based inference
- Original multiple regression approach
- Use for comparison or validation

Quick comparison:

from arboreto.algo import grnboost2, genie3

# Fast, recommended
network_grnboost = grnboost2(expression_data=matrix)

# Classic algorithm
network_genie3 = genie3(expression_data=matrix)

For detailed algorithm comparison, parameters, and selection guidance: references/algorithms.md

3. Distributed Computing

Scale inference from local multi-core to cluster environments:

Local (default) - Uses all available cores automatically:

network = grnboost2(expression_data=matrix)

Custom local client - Control resources:

from distributed import LocalCluster, Client

local_cluster = LocalCluster(n_workers=10, memory_limit='8GB')
client = Client(local_cluster)

network = grnboost2(expression_data=matrix, client_or_address=client)

client.close()
local_cluster.close()

Cluster computing - Connect to remote Dask scheduler:

from distributed import Client

client = Client('tcp://scheduler:8786')
network = grnboost2(expression_data=matrix, client_or_address=client)

For cluster setup, performance optimization, and large-scale workflows: references/distributed_computing.md

Installation

uv pip install arboreto

Dependencies: scipy, scikit-learn, numpy, pandas, dask, distributed

Common Use Cases

Single-Cell RNA-seq Analysis

import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load single-cell expression matrix (cells x genes)
    sc_data = pd.read_csv('scrna_counts.tsv', sep='\t')

    # Infer cell-type-specific regulatory network
    network = grnboost2(expression_data=sc_data, seed=42)

    # Filter high-confidence links (0.5 is an illustrative cutoff; tune it per dataset)
    high_confidence = network[network['importance'] > 0.5]
    high_confidence.to_csv('grn_high_confidence.tsv', sep='\t', index=False)

Bulk RNA-seq with TF Filtering

import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load data
    expression_data = pd.read_csv('rnaseq_tpm.tsv', sep='\t')
    tf_names = load_tf_names('human_tfs.txt')

    # Infer with TF restriction
    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        seed=123
    )

    network.to_csv('tf_target_network.tsv', sep='\t', index=False)

Comparative Analysis (Multiple Conditions)

import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Infer networks for different conditions
    conditions = ['control', 'treatment_24h', 'treatment_48h']

    for condition in conditions:
        data = pd.read_csv(f'{condition}_expression.tsv', sep='\t')
        network = grnboost2(expression_data=data, seed=42)
        network.to_csv(f'{condition}_network.tsv', sep='\t', index=False)
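Once per-condition networks exist, a natural next step is to compare them. The following is a minimal sketch, not part of arboreto: the helpers `edge_set` and `jaccard` are hypothetical names introduced here, and the two small DataFrames are invented stand-ins for networks in the TF / target / importance layout.

```python
import pandas as pd

def edge_set(net, min_importance=0.0):
    """Return the set of (TF, target) pairs whose importance exceeds a cutoff."""
    kept = net[net['importance'] > min_importance]
    return set(zip(kept['TF'], kept['target']))

def jaccard(a, b):
    """Jaccard similarity of two edge sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Invented toy networks standing in for two condition-specific results.
control = pd.DataFrame({
    'TF': ['TF1', 'TF2', 'TF3'],
    'target': ['geneA', 'geneB', 'geneC'],
    'importance': [5.0, 2.0, 1.0],
})
treated = pd.DataFrame({
    'TF': ['TF1', 'TF2', 'TF4'],
    'target': ['geneA', 'geneB', 'geneC'],
    'importance': [4.0, 3.0, 2.0],
})

control_edges = edge_set(control)
treated_edges = edge_set(treated)

shared = control_edges & treated_edges   # links present in both conditions
gained = treated_edges - control_edges   # links only seen after treatment
lost = control_edges - treated_edges     # links only seen in control
overlap = jaccard(control_edges, treated_edges)

print(f'shared={len(shared)} gained={len(gained)} lost={len(lost)} jaccard={overlap:.2f}')
```

In practice the edge sets would usually be built from pruned networks (for example after an importance cutoff), since unfiltered output tends to contain a link for nearly every TF and target pair.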

Output Interpretation

Arboreto returns a DataFrame with regulatory links:

| Column | Description |
|---|---|
| TF | Transcription factor (regulator) |
| target | Target gene |
| importance | Regulatory importance score (higher = stronger) |
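To make this layout concrete, here is a small sketch of one way to prune such a DataFrame: keeping the top N links per target gene. The function name `top_links_per_target` is hypothetical and the values are invented for illustration; they are not real GRNBoost2 output.

```python
import pandas as pd

# Toy network in the TF / target / importance layout.
# The numbers are made up for the example.
network = pd.DataFrame({
    'TF': ['TF1', 'TF2', 'TF3', 'TF1', 'TF2'],
    'target': ['geneA', 'geneA', 'geneA', 'geneB', 'geneB'],
    'importance': [12.0, 3.5, 0.2, 1.1, 8.4],
})

def top_links_per_target(net, n):
    """Keep the n highest-importance links for each target gene."""
    ranked = net.sort_values('importance', ascending=False)
    # head(n) per group preserves the descending-importance row order.
    return ranked.groupby('target', sort=False).head(n).reset_index(drop=True)

top2 = top_links_per_target(network, n=2)
print(top2)
```

A per-target cutoff like this avoids relying on a single global importance threshold, which may not transfer between datasets.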

Filtering strategy:

- Top N links per target gene
- Importance threshold (e.g., > 0.5)
- Statistical significance testing (permutation tests)

Integration with pySCENIC

Arboreto is a core component of the SCENIC pipeline for single-cell regulatory network analysis:

# Step 1: Use arboreto for GRN inference
from arboreto.algo import grnboost2

network = grnboost2(expression_data=sc_data, tf_names=tf_list)

# Step 2: Use pySCENIC for regulon identification and activity scoring
# (See pySCENIC documentation for downstream analysis)

Reproducibility

Always set a seed for reproducible results:

network = grnboost2(expression_data=matrix, seed=777)

Run multiple seeds for robustness analysis:

from distributed import LocalCluster, Client
from arboreto.algo import grnboost2

if __name__ == '__main__':
    client = Client(LocalCluster())

    seeds = [42, 123, 777]
    networks = []

    for seed in seeds:
        net = grnboost2(expression_data=matrix, client_or_address=client, seed=seed)
        networks.append(net)

    # Combine networks and filter consensus links
    # (analyze_consensus is a user-supplied helper, not part of arboreto)
    consensus = analyze_consensus(networks)

    client.close()
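The `analyze_consensus` call above is not defined in this document and is not known to be an arboreto function. One possible implementation, offered only as a sketch, keeps links that appear in at least `min_runs` of the seeded runs and averages their importance. The `min_runs` parameter, the `n_runs` and `mean_importance` column names, and the toy data are all assumptions made for this example.

```python
import pandas as pd

def analyze_consensus(networks, min_runs=None):
    """Keep (TF, target) links found in at least min_runs networks.

    Assumes each network lists a given (TF, target) pair at most once.
    By default a link must appear in every run.
    """
    if min_runs is None:
        min_runs = len(networks)
    combined = pd.concat(networks, ignore_index=True)
    summary = combined.groupby(['TF', 'target'], as_index=False).agg(
        n_runs=('importance', 'count'),
        mean_importance=('importance', 'mean'),
    )
    consensus = summary[summary['n_runs'] >= min_runs]
    return consensus.sort_values('mean_importance', ascending=False).reset_index(drop=True)

# Invented results standing in for three runs with different seeds.
run1 = pd.DataFrame({'TF': ['TF1', 'TF2'], 'target': ['geneA', 'geneA'], 'importance': [10.0, 4.0]})
run2 = pd.DataFrame({'TF': ['TF1', 'TF3'], 'target': ['geneA', 'geneB'], 'importance': [12.0, 1.0]})
run3 = pd.DataFrame({'TF': ['TF1', 'TF2'], 'target': ['geneA', 'geneA'], 'importance': [8.0, 6.0]})

strict = analyze_consensus([run1, run2, run3])               # link must appear in all runs
relaxed = analyze_consensus([run1, run2, run3], min_runs=2)  # link must appear in 2+ runs
print(relaxed)
```

Averaging importance is only one choice; rank-based aggregation is another option when importance scales differ between runs.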

Troubleshooting

Memory errors: Reduce dataset size by filtering low-variance genes or use distributed computing

Slow performance: Use GRNBoost2 instead of GENIE3, enable distributed client, filter TF list

Dask errors: Ensure the if __name__ == '__main__': guard is present in scripts

Empty results: Check data format (genes as columns), verify TF names match gene names
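Two of the checks above can be run before inference starts: reducing the matrix to its most variable genes, and confirming that the TF names occur among the gene columns. The sketch below uses plain pandas; `keep_most_variable_genes` and `split_tf_names` are hypothetical helper names, and the expression values are invented.

```python
import pandas as pd

def keep_most_variable_genes(expression, n_genes):
    """Keep the n_genes columns (genes) with the highest variance across rows."""
    variances = expression.var(axis=0)
    top = variances.nlargest(n_genes).index
    # isin() keeps the original column order.
    return expression.loc[:, expression.columns.isin(top)]

def split_tf_names(expression, tf_names):
    """Split a TF list into names present in / absent from the gene columns."""
    columns = set(expression.columns)
    present = [tf for tf in tf_names if tf in columns]
    missing = [tf for tf in tf_names if tf not in columns]
    return present, missing

# Invented matrix: observations as rows, genes as columns.
expression = pd.DataFrame({
    'TF1': [1.0, 5.0, 9.0, 2.0],
    'geneA': [3.0, 3.1, 2.9, 3.0],
    'geneB': [0.0, 10.0, 0.0, 10.0],
    'geneC': [2.0, 2.0, 2.0, 2.0],
})

filtered = keep_most_variable_genes(expression, n_genes=2)
present, missing = split_tf_names(expression, ['TF1', 'tf2'])

print(list(filtered.columns))
print('present:', present, 'missing:', missing)
```

Name matching here is case-sensitive, so a mismatch such as `tf2` versus `TF2` shows up in `missing`. Note that variance filtering can also drop TF columns, so it may be worth running the name check again on the filtered matrix.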
