tooluniverse-statistical-modeling

Installs: 116
Rank: #7412

Install

npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-statistical-modeling
Statistical Modeling for Biomedical Data Analysis
Comprehensive statistical modeling skill for fitting regression models, survival models, and mixed-effects models to biomedical data. Produces publication-quality statistical summaries with odds ratios, hazard ratios, confidence intervals, and p-values.
Features
Linear Regression - OLS for continuous outcomes with diagnostic tests
Logistic Regression - Binary, ordinal, and multinomial models with odds ratios
Survival Analysis - Cox proportional hazards and Kaplan-Meier curves
Mixed-Effects Models - LMM/GLMM for hierarchical/repeated measures data
ANOVA - One-way/two-way ANOVA, per-feature ANOVA for omics data
Model Diagnostics - Assumption checking, fit statistics, residual analysis
Statistical Tests - t-tests, chi-square, Mann-Whitney, Kruskal-Wallis, etc.
When to Use
Apply this skill when the user asks:
"What is the odds ratio of X associated with Y?"
"What is the hazard ratio for treatment?"
"Fit a linear regression of Y on X1, X2, X3"
"Perform ordinal logistic regression for severity outcome"
"What is the Kaplan-Meier survival estimate at time T?"
"What is the percentage reduction in odds ratio after adjusting for confounders?"
"Run a mixed-effects model with random intercepts"
"Compute the interaction term between A and B"
"What is the F-statistic from ANOVA comparing groups?"
"Test if gene/miRNA expression differs across cell types"
Model Selection Decision Tree
START: What type of outcome variable?
|
+-- CONTINUOUS (height, blood pressure, score)
|   +-- Independent observations -> Linear Regression (OLS)
|   +-- Repeated measures -> Mixed-Effects Model (LMM)
|   +-- Count data -> Poisson/Negative Binomial
|
+-- BINARY (yes/no, disease/healthy)
|   +-- Independent observations -> Logistic Regression
|   +-- Repeated measures -> Logistic Mixed-Effects (GLMM/GEE)
|   +-- Rare events -> Firth logistic regression
|
+-- ORDINAL (mild/moderate/severe, stages I/II/III/IV)
|   +-- Ordinal Logistic Regression (Proportional Odds)
|
+-- MULTINOMIAL (>2 unordered categories)
|   +-- Multinomial Logistic Regression
|
+-- TIME-TO-EVENT (survival time + censoring)
    +-- Regression -> Cox Proportional Hazards
    +-- Survival curves -> Kaplan-Meier
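The branches of the tree can be sketched as a small lookup. This is only an illustration: `MODEL_BY_OUTCOME` and `choose_model` are hypothetical names, not part of the skill, and treating count data as its own outcome type (rather than a sub-branch of continuous) is an assumption made here for clarity.

```python
# Hypothetical lookup mirroring the decision tree above.
# Keys are (outcome_type, repeated_measures); labels are copied from the tree.
MODEL_BY_OUTCOME = {
    ("continuous", False): "Linear Regression (OLS)",
    ("continuous", True): "Mixed-Effects Model (LMM)",
    ("count", False): "Poisson/Negative Binomial",  # assumption: separate key for counts
    ("binary", False): "Logistic Regression",
    ("binary", True): "Logistic Mixed-Effects (GLMM/GEE)",
    ("ordinal", False): "Ordinal Logistic Regression (Proportional Odds)",
    ("multinomial", False): "Multinomial Logistic Regression",
    ("time-to-event", False): "Cox Proportional Hazards",
}


def choose_model(outcome_type: str, repeated_measures: bool = False) -> str:
    """Return the model family suggested by the tree for this outcome type."""
    key = (outcome_type.lower(), repeated_measures)
    if key not in MODEL_BY_OUTCOME:
        raise ValueError(f"No branch in the decision tree for {key}")
    return MODEL_BY_OUTCOME[key]


print(choose_model("binary"))
print(choose_model("continuous", repeated_measures=True))
```

Special cases in the tree, such as Firth logistic regression for rare events, still need a judgment call that a lookup like this does not capture.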
Workflow
Phase 0: Data Validation
Goal: Load data, identify variable types, and check for missing values.
CRITICAL: Identify the Outcome Variable First
Before any analysis, verify what you are actually predicting:
Read the full question - Look for "predict [outcome]", "model [outcome]", or "dependent variable"
Examine available columns - List all columns in the dataset
Match question to data - Find the column that matches the described outcome
Verify outcome exists - Don't create outcome variables from predictors
Common mistake: The question mentions "obesity", so the outcome is assumed to be BMI >= 30, which is circular when BMI is also a predictor. Always check the data columns first with print(df.columns.tolist()).
import pandas as pd
import numpy as np

# Load the dataset and print a one-line overview
df = pd.read_csv('data.csv')
print(f"Observations: {len(df)}, Variables: {len(df.columns)}, Missing: {df.isnull().sum().sum()}")

# Classify each column by its number of unique values and dtype
for col in df.columns:
    n_unique = df[col].nunique()
    if n_unique == 2:
        print(f"{col}: binary")
    elif n_unique <= 10 and df[col].dtype == 'object':
        print(f"{col}: categorical ({n_unique} levels)")
    elif df[col].dtype in ['float64', 'int64']:
        print(f"{col}: continuous (mean={df[col].mean():.2f})")
Phase 1: Model Fitting
Goal: Fit the appropriate model for the outcome type.
Use the decision tree above to select the model type, then refer to the matching reference file for detailed code:
Linear Regression: references/linear_models.md
Logistic Regression (binary): references/logistic_regression.md
Ordinal Logistic: references/ordinal_logistic.md
Cox Proportional Hazards: references/cox_regression.md
ANOVA / Statistical Tests: anova_and_tests.md
Quick reference for key models:
import statsmodels.formula.api as smf
import numpy as np

# Linear regression
model = smf.ols('outcome ~ predictor1 + predictor2', data=df).fit()

# Logistic regression (odds ratios)
model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)
ors = np.exp(model.params)
ci = np.exp(model.conf_int())

# Cox proportional hazards
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age']], duration_col='time', event_col='event')
hr = cph.hazard_ratios_['treatment']
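The quick reference covers Cox regression but not the Kaplan-Meier branch of the decision tree. As a rough sketch of what "the Kaplan-Meier survival estimate at time T" means, the product-limit estimator can be written in plain Python. The function names and the toy data below are made up for illustration; in practice a library estimator such as lifelines' `KaplanMeierFitter` would normally be used instead.

```python
def kaplan_meier(durations, events):
    """Return [(time, S(time))] at each distinct event time (product-limit estimator).

    events[i] is 1 if the event was observed for subject i, 0 if censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share this follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t.
        at_risk -= removed
    return curve


def survival_at(curve, t):
    """Step-function lookup: survival probability at time t."""
    s = 1.0
    for time, surv in curve:
        if time <= t:
            s = surv
        else:
            break
    return s


# Toy data (illustrative only): follow-up times and event indicators.
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(durations, events)
print(curve)
print(survival_at(curve, 4))
```

Subjects censored at the same time as an event are counted as still at risk at that time, which is the usual convention.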
Phase 1b: ANOVA for Multi-Feature Data
When data has multiple features (genes, miRNAs, metabolites), use per-feature ANOVA (not aggregate). This is the most common pattern in genomics.
See anova_and_tests.md for the full decision tree, both methods, and worked examples.
Default for gene expression data: Per-feature ANOVA (Method B).
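To make the per-feature idea concrete, here is a minimal sketch that computes one F-statistic per feature instead of pooling all features into a single test. The `one_way_f` helper and the toy expression values are hypothetical, and this is not necessarily the exact "Method B" code in anova_and_tests.md. With SciPy available, `scipy.stats.f_oneway` returns the same F-statistic together with a p-value.

```python
def one_way_f(groups):
    """F-statistic for a one-way ANOVA across a list of groups (lists of floats)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: group size times squared deviation of group mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared deviation of each value from its group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Toy data (illustrative only): feature -> cell type -> replicate measurements.
expression = {
    "geneA": {"T": [1.0, 2.0, 3.0], "B": [2.0, 3.0, 4.0], "NK": [3.0, 4.0, 5.0]},
    "geneB": {"T": [5.0, 6.0, 7.0], "B": [5.0, 6.0, 7.0], "NK": [5.0, 6.0, 7.0]},
}

# One F-statistic per feature, never one test over all features combined.
f_stats = {gene: one_way_f(list(by_type.values())) for gene, by_type in expression.items()}
print(f_stats)
```

Here geneA shifts across cell types and gets a non-zero F, while geneB has identical groups and gets F = 0; an aggregate test would blur the two together.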
Phase 2: Model Diagnostics
Goal: Check model assumptions and fit quality.
Key diagnostics by model type:
OLS: Shapiro-Wilk (normality), Breusch-Pagan (heteroscedasticity), VIF (multicollinearity)
Cox: Proportional hazards test via cph.check_assumptions(training_df), passing the DataFrame used for fitting
Logistic: Hosmer-Lemeshow, ROC/AUC
See references/troubleshooting.md for diagnostic code and common issues.
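For the OLS diagnostics named above, a minimal sketch could look like the following. The data is synthetic and the column names (`age`, `bmi`, `sbp`) are invented for the example; this is not the skill's own scripts/model_diagnostics.py, whose contents are not shown here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.normal(50, 10, 200), "bmi": rng.normal(27, 4, 200)})
df["sbp"] = 90 + 0.5 * df["age"] + 0.8 * df["bmi"] + rng.normal(0, 5, 200)

model = smf.ols("sbp ~ age + bmi", data=df).fit()

# Shapiro-Wilk on residuals: a small p-value suggests non-normal residuals.
shapiro_stat, shapiro_p = shapiro(model.resid)

# Breusch-Pagan: a small p-value suggests heteroscedasticity.
bp_lm, bp_lm_p, bp_f, bp_f_p = het_breuschpagan(model.resid, model.model.exog)

# VIF per predictor, skipping the intercept column at index 0.
exog = model.model.exog
names = model.model.exog_names
vif = {names[i]: variance_inflation_factor(exog, i) for i in range(1, exog.shape[1])}

print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"Breusch-Pagan p = {bp_lm_p:.3f}")
print({name: round(value, 2) for name, value in vif.items()})
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity, though the threshold is a convention rather than a hard rule.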
Phase 3: Interpretation
Goal: Generate publication-quality summary.
For every result, report: effect size (OR/HR/coefficient), 95% CI, p-value, and model fit statistic. See bixbench_patterns_summary.md for common question-answer patterns.
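One way to keep reporting consistent is a small formatting helper. The sketch below is hypothetical (`format_effect` is not the skill's scripts/format_statistical_output.py) and simply assembles the effect size, 95% CI, and p-value listed above into one string.

```python
def format_effect(label, estimate, ci_low, ci_high, p_value, decimals=2):
    """Format an effect size with its 95% CI and p-value for a results sentence."""
    # Very small p-values are reported as a bound rather than as 0.000.
    p_text = "p < 0.001" if p_value < 0.001 else f"p = {p_value:.3f}"
    return (
        f"{label} = {estimate:.{decimals}f} "
        f"(95% CI {ci_low:.{decimals}f} to {ci_high:.{decimals}f}), {p_text}"
    )


# Illustrative numbers only, not results from any dataset.
print(format_effect("OR", 1.8234, 1.2049, 2.7601, 0.0042))
# OR = 1.82 (95% CI 1.20 to 2.76), p = 0.004
```

The `decimals` argument exists so the output can match whatever precision a question asks for.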
Common BixBench Patterns
| Pattern | Question Type | Key Steps |
|---|---|---|
| 1 | Odds ratio from ordinal regression | Fit OrderedModel, exp(coef) |
| 2 | Percentage reduction in OR | Compare crude vs adjusted model |
| 3 | Interaction effects | Fit A * B, extract A:B coef |
| 4 | Hazard ratio | Cox PH model, exp(coef) |
| 5 | Multi-feature ANOVA | Per-feature F-stats (not aggregate) |

See bixbench_patterns_summary.md for solution code for each pattern.
See references/bixbench_patterns.md for 15+ detailed question patterns.
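Pattern 2 comes down to simple arithmetic once both models are fitted. The sketch below assumes the reduction is measured relative to the crude odds ratio; questions sometimes define it differently (for example on the log-odds scale), so the definition should be checked against the wording. The function name and the numbers are illustrative.

```python
def percent_reduction_in_or(or_crude, or_adjusted):
    """Percentage drop in the odds ratio after adjustment, relative to the crude OR."""
    return (or_crude - or_adjusted) / or_crude * 100.0


# Illustrative values: crude OR from the unadjusted model, adjusted OR after adding confounders.
or_crude = 2.50
or_adjusted = 2.00
print(round(percent_reduction_in_or(or_crude, or_adjusted), 2))
```

With the statsmodels quick reference, `or_crude` and `or_adjusted` would each be `np.exp(model.params['exposure'])` taken from the crude and the adjusted fit respectively.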
Statsmodels vs Scikit-learn
| Use Case | Library | Reason |
|---|---|---|
| Inference (p-values, CIs, ORs) | statsmodels | Full statistical output |
| Prediction (accuracy, AUC) | scikit-learn | Better prediction tools |
| Mixed-effects models | statsmodels | Only option of the two (scikit-learn has none) |
| Regularization (LASSO, Ridge) | scikit-learn | Better optimization |
| Survival analysis | lifelines | Specialized library |

General rule: Use statsmodels for BixBench questions (they ask for p-values, ORs, HRs).
Python Package Requirements
statsmodels>=0.14.0
scikit-learn>=1.3.0
lifelines>=0.27.0
pandas>=2.0.0
numpy>=1.24.0
scipy>=1.10.0
Key Principles
Data-first approach - Always inspect and validate data before modeling
Model selection by outcome type - Use decision tree above
Assumption checking - Verify model assumptions (linearity, proportional hazards, etc.)
Complete reporting - Always report effect sizes, CIs, p-values, and model fit statistics
Confounder awareness - Adjust for confounders when specified or clinically relevant
Reproducible analysis - All code must be deterministic and reproducible
Robust error handling - Graceful handling of convergence failures, separation, collinearity
Round correctly - Match the precision requested (typically 2-4 decimal places)
Completeness Checklist
Before finalizing any statistical analysis:
Outcome variable identified: Verified which column is the actual outcome
Data validated: N, missing values, variable types confirmed
Multi-feature data identified: If multiple features, use per-feature approach
Model appropriate: Outcome type matches model family
Assumptions checked: Relevant diagnostics performed
Effect sizes reported: OR/HR/Cohen's d with CIs
P-values reported: With appropriate correction if needed
Model fit assessed: R-squared, AIC/BIC, concordance
Results interpreted: Plain-language interpretation
Precision correct: Numbers rounded appropriately
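The checklist item on p-value correction matters most when a per-feature analysis produces many p-values. As one possible correction (the skill does not prescribe a specific method here), the Benjamini-Hochberg procedure can be sketched in plain Python; `benjamini_hochberg` is a hypothetical helper, and `statsmodels.stats.multitest.multipletests(..., method='fdr_bh')` provides a library implementation.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative p-values, e.g. one per gene from a per-feature ANOVA.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(p, 4) for p in adj])
```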
File Structure
tooluniverse-statistical-modeling/
+-- SKILL.md                          # This file (workflow guide)
+-- QUICK_START.md                    # 8 quick examples
+-- EXAMPLES.md                       # Legacy examples
+-- TOOLS_REFERENCE.md                # ToolUniverse tool catalog
+-- anova_and_tests.md                # ANOVA decision tree and code
+-- bixbench_patterns_summary.md      # Common BixBench solution patterns
+-- test_skill.py                     # Test suite
+-- references/
|   +-- logistic_regression.md        # Detailed logistic examples
|   +-- ordinal_logistic.md           # Ordinal logit guide
|   +-- cox_regression.md             # Survival analysis guide
|   +-- linear_models.md              # OLS and mixed-effects
|   +-- bixbench_patterns.md          # 15+ question patterns
|   +-- troubleshooting.md            # Diagnostic issues
+-- scripts/
    +-- format_statistical_output.py  # Format results for reporting
    +-- model_diagnostics.py          # Automated diagnostics
ToolUniverse Integration
While this skill is primarily computational, ToolUniverse tools can provide data:
| Use Case | Tools |
|---|---|
| Clinical trial data | clinical_trials_search |
| Drug safety outcomes | FAERS_calculate_disproportionality |
| Gene-disease associations | OpenTargets_target_disease_evidence |
| Biomarker data | fda_pharmacogenomic_biomarkers |

See TOOLS_REFERENCE.md for complete tool catalog.
References
statsmodels: https://www.statsmodels.org/
lifelines: https://lifelines.readthedocs.io/
scikit-learn: https://scikit-learn.org/
Ordinal models: statsmodels.miscmodels.ordinal_model.OrderedModel
Support
For detailed examples and troubleshooting:
Logistic regression: references/logistic_regression.md
Ordinal models: references/ordinal_logistic.md
Survival analysis: references/cox_regression.md
Linear/mixed models: references/linear_models.md
BixBench patterns: references/bixbench_patterns.md
ANOVA and tests: anova_and_tests.md
Diagnostics: references/troubleshooting.md