- Statistical Modeling for Biomedical Data Analysis
- Comprehensive statistical modeling skill for fitting regression models, survival models, and mixed-effects models to biomedical data. Produces publication-quality statistical summaries with odds ratios, hazard ratios, confidence intervals, and p-values.
- Features
- Linear Regression - OLS for continuous outcomes with diagnostic tests
- Logistic Regression - Binary, ordinal, and multinomial models with odds ratios
- Survival Analysis - Cox proportional hazards and Kaplan-Meier curves
- Mixed-Effects Models - LMM/GLMM for hierarchical/repeated measures data
- ANOVA - One-way/two-way ANOVA, per-feature ANOVA for omics data
- Model Diagnostics - Assumption checking, fit statistics, residual analysis
- Statistical Tests - t-tests, chi-square, Mann-Whitney, Kruskal-Wallis, etc.
- When to Use
- Apply this skill when the user asks:
- "What is the odds ratio of X associated with Y?"
- "What is the hazard ratio for treatment?"
- "Fit a linear regression of Y on X1, X2, X3"
- "Perform ordinal logistic regression for severity outcome"
- "What is the Kaplan-Meier survival estimate at time T?"
- "What is the percentage reduction in odds ratio after adjusting for confounders?"
- "Run a mixed-effects model with random intercepts"
- "Compute the interaction term between A and B"
- "What is the F-statistic from ANOVA comparing groups?"
- "Test if gene/miRNA expression differs across cell types"
- Model Selection Decision Tree
    START: What type of outcome variable?
    |
    +-- CONTINUOUS (height, blood pressure, score)
    |   +-- Independent observations -> Linear Regression (OLS)
    |   +-- Repeated measures -> Mixed-Effects Model (LMM)
    |
    +-- COUNT (number of events, hospital visits)
    |   +-- Poisson/Negative Binomial
    |
    +-- BINARY (yes/no, disease/healthy)
    |   +-- Independent observations -> Logistic Regression
    |   +-- Repeated measures -> Logistic Mixed-Effects (GLMM/GEE)
    |   +-- Rare events -> Firth logistic regression
    |
    +-- ORDINAL (mild/moderate/severe, stages I/II/III/IV)
    |   +-- Ordinal Logistic Regression (Proportional Odds)
    |
    +-- MULTINOMIAL (>2 unordered categories)
    |   +-- Multinomial Logistic Regression
    |
    +-- TIME-TO-EVENT (survival time + censoring)
        +-- Regression -> Cox Proportional Hazards
        +-- Survival curves -> Kaplan-Meier
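The decision tree can also be expressed as a small lookup function. The following is a minimal sketch, not part of the skill's scripts: the name `choose_model` and its flags are hypothetical, and checking repeated measures before rare events is an assumption made here for illustration.

```python
def choose_model(outcome_type, repeated_measures=False, rare_events=False, want_curves=False):
    """Return the model family the decision tree suggests for an outcome type.

    Hypothetical helper: the argument names are illustrative assumptions.
    """
    kind = outcome_type.strip().lower()
    if kind == "continuous":
        return "Mixed-Effects Model (LMM)" if repeated_measures else "Linear Regression (OLS)"
    if kind == "count":
        return "Poisson/Negative Binomial"
    if kind == "binary":
        # Assumption: repeated measures take precedence over rare events.
        if repeated_measures:
            return "Logistic Mixed-Effects (GLMM/GEE)"
        if rare_events:
            return "Firth logistic regression"
        return "Logistic Regression"
    if kind == "ordinal":
        return "Ordinal Logistic Regression (Proportional Odds)"
    if kind == "multinomial":
        return "Multinomial Logistic Regression"
    if kind == "time-to-event":
        return "Kaplan-Meier" if want_curves else "Cox Proportional Hazards"
    raise ValueError(f"Unknown outcome type: {outcome_type!r}")


suggestion = choose_model("binary", repeated_measures=True)
```

Raising on an unknown outcome type, rather than returning a default, keeps a mislabeled outcome from silently being fitted with the wrong model family.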
- Workflow
- Phase 0: Data Validation
- Goal: Load data, identify variable types, check for missing values.
- CRITICAL: Identify the Outcome Variable First
- Before any analysis, verify what you're actually predicting:
- Read the full question - Look for "predict [outcome]", "model [outcome]", or "dependent variable"
- Examine available columns - List all columns in the dataset
- Match question to data - Find the column that matches the described outcome
- Verify outcome exists - Don't create outcome variables from predictors
- Common mistake: Question mentions "obesity" -> Assumed outcome = BMI >= 30 (circular logic with BMI predictor). Always check data columns first: print(df.columns.tolist())

    import pandas as pd
    import numpy as np

    df = pd.read_csv('data.csv')
    print(f"Observations: {len(df)}, Variables: {len(df.columns)}, Missing: {df.isnull().sum().sum()}")

    # Classify each column by its number of unique values and dtype
    for col in df.columns:
        n_unique = df[col].nunique()
        if n_unique == 2:
            print(f"  {col}: binary")
        elif n_unique <= 10 and df[col].dtype == 'object':
            print(f"  {col}: categorical ({n_unique} levels)")
        elif df[col].dtype in ['float64', 'int64']:
            print(f"  {col}: continuous (mean={df[col].mean():.2f})")
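The "verify outcome exists" step can be made explicit with a small guard that fails loudly instead of letting an outcome be derived from a predictor. This is a minimal sketch: `find_outcome_column` is a hypothetical helper, and matching column names by keyword is an assumption that will not suit every dataset.

```python
import pandas as pd


def find_outcome_column(df, keywords):
    """Return the single column whose name contains one of the keywords.

    Hypothetical helper: raises instead of guessing when zero or several
    columns match, so the outcome is never invented from a predictor.
    """
    matches = [
        col for col in df.columns
        if any(key.lower() in col.lower() for key in keywords)
    ]
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one outcome column for {keywords}, "
            f"found {matches}. Available columns: {df.columns.tolist()}"
        )
    return matches[0]


# Illustrative data only: the outcome is recorded directly, not derived from BMI.
example = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "bmi": [22.5, 31.0, 27.4],
    "obesity_dx": [0, 1, 0],
})
outcome = find_outcome_column(example, ["obes"])
```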
- Phase 1: Model Fitting
- Goal: Fit appropriate model based on outcome type.
- Use the decision tree above to select model type, then refer to the appropriate reference file for detailed code:
- Linear Regression: references/linear_models.md
- Logistic Regression (binary): references/logistic_regression.md
- Ordinal Logistic: references/ordinal_logistic.md
- Cox Proportional Hazards: references/cox_regression.md
- ANOVA / Statistical Tests: anova_and_tests.md
- Quick reference for key models:

    import statsmodels.formula.api as smf
    import numpy as np

    # Linear regression
    model = smf.ols('outcome ~ predictor1 + predictor2', data=df).fit()

    # Logistic regression (odds ratios)
    model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)
    ors = np.exp(model.params)
    ci = np.exp(model.conf_int())

    # Cox proportional hazards
    from lifelines import CoxPHFitter
    cph = CoxPHFitter()
    cph.fit(df[['time', 'event', 'treatment', 'age']], duration_col='time', event_col='event')
    hr = cph.hazard_ratios_['treatment']
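The quick reference covers the regression models; the Kaplan-Meier branch of the decision tree answers "survival estimate at time T" questions. To show what that estimate is, here is a hand-rolled product-limit sketch in NumPy. The function name and toy data are illustrative assumptions; in practice lifelines' `KaplanMeierFitter` would normally be used instead.

```python
import numpy as np


def kaplan_meier_at(durations, events, t):
    """Product-limit estimate of S(t).

    Multiplies (1 - d_i / n_i) over distinct event times <= t, where d_i is
    the number of events at that time and n_i the number still at risk.
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    survival = 1.0
    for time in np.unique(durations[events]):  # sorted distinct event times
        if time > t:
            break
        at_risk = np.sum(durations >= time)
        died = np.sum((durations == time) & events)
        survival *= 1.0 - died / at_risk
    return survival


# Toy data (made up for illustration): event = 1 means observed, 0 means censored.
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
s10 = kaplan_meier_at(durations, events, 10)  # (5/6) * (4/5) = 2/3
```

Censored subjects leave the risk set without contributing an event, which is why the observation censored at time 8 still counts as at risk at time 8 but not at time 12.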
- Phase 1b: ANOVA for Multi-Feature Data
- When data has multiple features (genes, miRNAs, metabolites), use per-feature ANOVA (not aggregate). This is the most common pattern in genomics.
- See anova_and_tests.md for the full decision tree, both methods, and worked examples.
- Default for gene expression data: Per-feature ANOVA (Method B).
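As a rough illustration of the per-feature idea, the sketch below runs one one-way ANOVA per feature column with `scipy.stats.f_oneway`. It is one plausible reading of "per-feature ANOVA", not the code from anova_and_tests.md; the helper name, the samples-by-features layout, and the simulated data are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated expression matrix (assumption: rows are samples, columns are features).
rng = np.random.default_rng(0)
groups = np.repeat(["A", "B", "C"], 10)
expr = pd.DataFrame(rng.normal(size=(30, 3)), columns=["gene1", "gene2", "gene3"])
expr.loc[groups == "C", "gene1"] += 3.0  # make gene1 differ in group C


def per_feature_anova(expr, groups):
    """Run a separate one-way ANOVA for every feature column."""
    labels = pd.Series(groups, index=expr.index)
    rows = []
    for feature in expr.columns:
        samples = [expr.loc[labels == g, feature].to_numpy() for g in labels.unique()]
        f_stat, p_value = stats.f_oneway(*samples)
        rows.append({"feature": feature, "F": f_stat, "p_value": p_value})
    return pd.DataFrame(rows).set_index("feature")


anova_table = per_feature_anova(expr, groups)
```

Each feature gets its own F-statistic and p-value, so a question about "the F-statistic for gene X" can be answered by indexing the table rather than pooling all features into a single test.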
- Phase 2: Model Diagnostics
- Goal: Check model assumptions and fit quality.
- Key diagnostics by model type:
- OLS: Shapiro-Wilk (normality), Breusch-Pagan (heteroscedasticity), VIF (multicollinearity)
- Cox: Proportional hazards test via cph.check_assumptions(df), which takes the training DataFrame
- Logistic: Hosmer-Lemeshow, ROC/AUC
- See references/troubleshooting.md for diagnostic code and common issues.
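The three OLS diagnostics named above can be sketched as follows. This is an illustrative example on simulated data, not the code from references/troubleshooting.md; the column names and the data-generating model are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated data (assumption): two independent predictors and a linear outcome.
rng = np.random.default_rng(1)
n = 200
data = pd.DataFrame({"age": rng.normal(50, 10, n), "dose": rng.normal(5, 2, n)})
data["outcome"] = 2.0 + 0.3 * data["age"] + 1.5 * data["dose"] + rng.normal(0, 1, n)

model = smf.ols("outcome ~ age + dose", data=data).fit()

# Normality of residuals (Shapiro-Wilk)
shapiro_stat, shapiro_p = stats.shapiro(model.resid)

# Heteroscedasticity (Breusch-Pagan): returns LM stat, LM p-value, F stat, F p-value
bp_lm, bp_lm_p, bp_f, bp_f_p = het_breuschpagan(model.resid, model.model.exog)

# Multicollinearity (VIF) for each predictor, skipping the intercept column
exog = model.model.exog
vif = {
    name: variance_inflation_factor(exog, i)
    for i, name in enumerate(model.model.exog_names)
    if name != "Intercept"
}
```

Small p-values from Shapiro-Wilk or Breusch-Pagan suggest a violated assumption; what VIF value counts as problematic is a convention that varies by field.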
- Phase 3: Interpretation
- Goal: Generate publication-quality summary.
- For every result, report: effect size (OR/HR/coefficient), 95% CI, p-value, and model fit statistic. See bixbench_patterns_summary.md for common question-answer patterns.
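For a logistic model, the four reporting items can be collected into one table. The sketch below uses simulated data, and the variable names, the choice of McFadden pseudo R-squared and AIC as fit statistics, and rounding to 4 decimals are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated cohort (assumption): binary exposure, continuous age, binary disease.
rng = np.random.default_rng(2)
n = 500
data = pd.DataFrame({
    "exposure": rng.integers(0, 2, n),
    "age": rng.normal(55, 10, n),
})
linear_predictor = -1.0 + 0.8 * data["exposure"] + 0.03 * (data["age"] - 55)
data["disease"] = rng.binomial(1, 1 / (1 + np.exp(-linear_predictor)))

model = smf.logit("disease ~ exposure + age", data=data).fit(disp=0)

# Exponentiate coefficients and CI bounds to move from log-odds to odds ratios.
ci = np.exp(model.conf_int())
report = pd.DataFrame({
    "OR": np.exp(model.params),
    "CI_low": ci[0],
    "CI_high": ci[1],
    "p_value": model.pvalues,
}).round(4)

fit_stats = {"pseudo_R2": model.prsquared, "AIC": model.aic, "n": int(model.nobs)}
```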
- Common BixBench Patterns
| Pattern | Question Type | Key Steps |
|---|---|---|
| 1 | Odds ratio from ordinal regression | Fit OrderedModel, exp(coef) |
| 2 | Percentage reduction in OR | Compare crude vs adjusted model |
| 3 | Interaction effects | Fit A * B, extract A:B coef |
| 4 | Hazard ratio | Cox PH model, exp(coef) |
| 5 | Multi-feature ANOVA | Per-feature F-stats (not aggregate) |

- See bixbench_patterns_summary.md for solution code for each pattern.
- See references/bixbench_patterns.md for 15+ detailed question patterns.
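Pattern 2 can be sketched by fitting a crude and an adjusted logistic model and comparing the exposure odds ratios. The data below are simulated with age as a confounder. The formula `(OR_crude - OR_adjusted) / OR_crude * 100` is one common definition and is an assumption here; some questions instead measure the reduction in excess odds, `(OR_crude - OR_adjusted) / (OR_crude - 1)`, so check the wording.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (assumption): age raises both exposure and disease risk.
rng = np.random.default_rng(3)
n = 2000
age = rng.normal(0, 1, n)
exposure = rng.binomial(1, 1 / (1 + np.exp(-age)))
disease = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.5 * exposure + 1.0 * age))))
data = pd.DataFrame({"age": age, "exposure": exposure, "disease": disease})

crude = smf.logit("disease ~ exposure", data=data).fit(disp=0)
adjusted = smf.logit("disease ~ exposure + age", data=data).fit(disp=0)

or_crude = float(np.exp(crude.params["exposure"]))
or_adjusted = float(np.exp(adjusted.params["exposure"]))

# Assumed definition: relative change in the odds ratio itself.
pct_reduction = (or_crude - or_adjusted) / or_crude * 100
```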
- Statsmodels vs Scikit-learn
| Use Case | Library | Reason |
|---|---|---|
| Inference (p-values, CIs, ORs) | statsmodels | Full statistical output |
| Prediction (accuracy, AUC) | scikit-learn | Better prediction tools |
| Mixed-effects models | statsmodels | Only option |
| Regularization (LASSO, Ridge) | scikit-learn | Better optimization |
| Survival analysis | lifelines | Specialized library |

- General rule: Use statsmodels for BixBench questions (they ask for p-values, ORs, HRs).
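To illustrate the mixed-effects row, here is a minimal random-intercept model fitted with the statsmodels formula API. The data are simulated, and the column names (`subject`, `visit`, `score`) and effect sizes are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated repeated measures (assumption): 30 subjects, 5 visits each,
# each subject with its own baseline shift.
rng = np.random.default_rng(4)
n_subjects, n_visits = 30, 5
subject = np.repeat(np.arange(n_subjects), n_visits)
visit = np.tile(np.arange(n_visits), n_subjects)
subject_effect = rng.normal(0, 2.0, n_subjects)[subject]
score = 10.0 + 1.5 * visit + subject_effect + rng.normal(0, 1.0, subject.size)
data = pd.DataFrame({"subject": subject, "visit": visit, "score": score})

# Random intercept per subject: `groups` identifies the clustering variable.
model = smf.mixedlm("score ~ visit", data=data, groups=data["subject"]).fit()

slope = float(model.params["visit"])  # fixed effect of visit
random_intercept_var = float(np.asarray(model.cov_re)[0, 0])  # between-subject variance
```

The fixed-effect slope answers "how does the score change per visit on average", while the random-intercept variance quantifies how much subjects differ at baseline.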
- Python Package Requirements
- statsmodels>=0.14.0
- scikit-learn>=1.3.0
- lifelines>=0.27.0
- pandas>=2.0.0
- numpy>=1.24.0
- scipy>=1.10.0
- Key Principles
- Data-first approach - Always inspect and validate data before modeling
- Model selection by outcome type - Use decision tree above
- Assumption checking - Verify model assumptions (linearity, proportional hazards, etc.)
- Complete reporting - Always report effect sizes, CIs, p-values, and model fit statistics
- Confounder awareness - Adjust for confounders when specified or clinically relevant
- Reproducible analysis - All code must be deterministic and reproducible
- Robust error handling - Graceful handling of convergence failures, separation, collinearity
- Round correctly - Match the precision requested (typically 2-4 decimal places)
- Completeness Checklist
- Before finalizing any statistical analysis:
- Outcome variable identified: Verified which column is the actual outcome
- Data validated: N, missing values, variable types confirmed
- Multi-feature data identified: If multiple features, use per-feature approach
- Model appropriate: Outcome type matches model family
- Assumptions checked: Relevant diagnostics performed
- Effect sizes reported: OR/HR/Cohen's d with CIs
- P-values reported: With appropriate correction if needed
- Model fit assessed: R-squared, AIC/BIC, concordance
- Results interpreted: Plain-language interpretation
- Precision correct: Numbers rounded appropriately
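The "appropriate correction" item matters most when many features are tested at once. The sketch below applies a Benjamini-Hochberg adjustment with statsmodels' `multipletests`. The p-values are made up for illustration, and choosing FDR control over a family-wise method such as Bonferroni is an assumption that depends on the question being asked.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Made-up raw p-values, e.g. one per feature from a per-feature test.
p_values = np.array([0.001, 0.008, 0.039, 0.041, 0.60])

# Benjamini-Hochberg false discovery rate control at alpha = 0.05.
rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```

Here the two smallest p-values remain significant after adjustment, while 0.039 and 0.041, which pass an unadjusted 0.05 threshold, do not.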
- File Structure
    tooluniverse-statistical-modeling/
    +-- SKILL.md                          # This file (workflow guide)
    +-- QUICK_START.md                    # 8 quick examples
    +-- EXAMPLES.md                       # Legacy examples
    +-- TOOLS_REFERENCE.md                # ToolUniverse tool catalog
    +-- anova_and_tests.md                # ANOVA decision tree and code
    +-- bixbench_patterns_summary.md      # Common BixBench solution patterns
    +-- test_skill.py                     # Test suite
    +-- references/
    |   +-- logistic_regression.md        # Detailed logistic examples
    |   +-- ordinal_logistic.md           # Ordinal logit guide
    |   +-- cox_regression.md             # Survival analysis guide
    |   +-- linear_models.md              # OLS and mixed-effects
    |   +-- bixbench_patterns.md          # 15+ question patterns
    |   +-- troubleshooting.md            # Diagnostic issues
    +-- scripts/
        +-- format_statistical_output.py  # Format results for reporting
        +-- model_diagnostics.py          # Automated diagnostics
- ToolUniverse Integration
- While this skill is primarily computational, ToolUniverse tools can provide data:
| Use Case | Tools |
|---|---|
| Clinical trial data | clinical_trials_search |
| Drug safety outcomes | FAERS_calculate_disproportionality |
| Gene-disease associations | OpenTargets_target_disease_evidence |
| Biomarker data | fda_pharmacogenomic_biomarkers |

- See TOOLS_REFERENCE.md for complete tool catalog.
- References
- statsmodels: https://www.statsmodels.org/
- lifelines: https://lifelines.readthedocs.io/
- scikit-learn: https://scikit-learn.org/
- Ordinal models: statsmodels.miscmodels.ordinal_model.OrderedModel
- Support
- For detailed examples and troubleshooting:
- Logistic regression: references/logistic_regression.md
- Ordinal models: references/ordinal_logistic.md
- Survival analysis: references/cox_regression.md
- Linear/mixed models: references/linear_models.md
- BixBench patterns: references/bixbench_patterns.md
- ANOVA and tests: anova_and_tests.md
- Diagnostics: references/troubleshooting.md