Statistical Modeling for Biomedical Data Analysis

Comprehensive statistical modeling skill for fitting regression models, survival models, and mixed-effects models to biomedical data. Produces publication-quality statistical summaries with odds ratios, hazard ratios, confidence intervals, and p-values.

Features

- Linear Regression - OLS for continuous outcomes with diagnostic tests
- Logistic Regression - Binary, ordinal, and multinomial models with odds ratios
- Survival Analysis - Cox proportional hazards and Kaplan-Meier curves
- Mixed-Effects Models - LMM/GLMM for hierarchical/repeated measures data
- ANOVA - One-way/two-way ANOVA, per-feature ANOVA for omics data
- Model Diagnostics - Assumption checking, fit statistics, residual analysis
- Statistical Tests - t-tests, chi-square, Mann-Whitney, Kruskal-Wallis, etc.

When to Use

Apply this skill when the user asks questions such as:

- "What is the odds ratio of X associated with Y?"
- "What is the hazard ratio for treatment?"
- "Fit a linear regression of Y on X1, X2, X3"
- "Perform ordinal logistic regression for severity outcome"
- "What is the Kaplan-Meier survival estimate at time T?"
- "What is the percentage reduction in odds ratio after adjusting for confounders?"
- "Run a mixed-effects model with random intercepts"
- "Compute the interaction term between A and B"
- "What is the F-statistic from ANOVA comparing groups?"
- "Test if gene/miRNA expression differs across cell types"

Model Selection Decision Tree

START: What type of outcome variable?
|
+-- CONTINUOUS (height, blood pressure, score)
|   +-- Independent observations -> Linear Regression (OLS)
|   +-- Repeated measures -> Mixed-Effects Model (LMM)
|   +-- Count data -> Poisson/Negative Binomial
|
+-- BINARY (yes/no, disease/healthy)
|   +-- Independent observations -> Logistic Regression
|   +-- Repeated measures -> Logistic Mixed-Effects (GLMM/GEE)
|   +-- Rare events -> Firth logistic regression
|
+-- ORDINAL (mild/moderate/severe, stages I/II/III/IV)
|   +-- Ordinal Logistic Regression (Proportional Odds)
|
+-- MULTINOMIAL (>2 unordered categories)
|   +-- Multinomial Logistic Regression
|
+-- TIME-TO-EVENT (survival time + censoring)
    +-- Regression -> Cox Proportional Hazards
    +-- Survival curves -> Kaplan-Meier
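
The branches of the tree above can be written as a small lookup function. This is a minimal sketch: the function name `suggest_model` and its flag parameters are hypothetical and are not part of the skill's scripts.

```python
def suggest_model(outcome_type, repeated_measures=False,
                  rare_events=False, count_data=False):
    """Return the model family the decision tree suggests (hypothetical helper)."""
    kind = outcome_type.lower()
    if kind == "continuous":
        if count_data:
            return "Poisson/Negative Binomial"
        if repeated_measures:
            return "Mixed-Effects Model (LMM)"
        return "Linear Regression (OLS)"
    if kind == "binary":
        if repeated_measures:
            return "Logistic Mixed-Effects (GLMM/GEE)"
        if rare_events:
            return "Firth logistic regression"
        return "Logistic Regression"
    if kind == "ordinal":
        return "Ordinal Logistic Regression (Proportional Odds)"
    if kind == "multinomial":
        return "Multinomial Logistic Regression"
    if kind == "time-to-event":
        return "Cox Proportional Hazards (regression) or Kaplan-Meier (survival curves)"
    raise ValueError(f"Unknown outcome type: {outcome_type!r}")


print(suggest_model("continuous", repeated_measures=True))
```

Raising on an unknown outcome type, rather than returning a default, keeps a mislabeled outcome from silently selecting the wrong model family.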

Workflow

Phase 0: Data Validation

Goal: Load data, identify variable types, and check for missing values.

CRITICAL: Identify the Outcome Variable First

Before any analysis, verify what you're actually predicting:

1. Read the full question - Look for "predict [outcome]", "model [outcome]", or "dependent variable"
2. Examine available columns - List all columns in the dataset
3. Match question to data - Find the column that matches the described outcome
4. Verify outcome exists - Don't create outcome variables from predictors

Common mistake

Question mentions "obesity" -> Assumed outcome = BMI >= 30 (circular logic with BMI predictor). Always check data columns first:

print(df.columns.tolist())

    import pandas as pd
    import numpy as np

    # Load the data and summarize its size and missingness
    df = pd.read_csv('data.csv')
    print(f"Observations: {len(df)}, Variables: {len(df.columns)}, Missing: {df.isnull().sum().sum()}")

    # Classify each column by a simple heuristic on unique values and dtype
    for col in df.columns:
        n_unique = df[col].nunique()
        if n_unique == 2:
            print(f"{col}: binary")
        elif n_unique <= 10 and df[col].dtype == 'object':
            print(f"{col}: categorical ({n_unique} levels)")
        elif df[col].dtype in ['float64', 'int64']:
            print(f"{col}: continuous (mean={df[col].mean():.2f})")

Phase 1: Model Fitting
Goal: Fit the appropriate model based on outcome type.

Use the decision tree above to select the model type, then refer to the appropriate reference file for detailed code:

- Linear Regression: references/linear_models.md
- Logistic Regression (binary): references/logistic_regression.md
- Ordinal Logistic: references/ordinal_logistic.md
- Cox Proportional Hazards: references/cox_regression.md
- ANOVA / Statistical Tests: anova_and_tests.md

Quick reference for key models:

    import statsmodels.formula.api as smf
    import numpy as np

    # Linear regression
    model = smf.ols('outcome ~ predictor1 + predictor2', data=df).fit()

    # Logistic regression (odds ratios)
    model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)
    ors = np.exp(model.params)        # odds ratios
    ci = np.exp(model.conf_int())     # 95% CIs on the odds-ratio scale

    # Cox proportional hazards
    from lifelines import CoxPHFitter
    cph = CoxPHFitter()
    # Select columns with a list (double brackets) so a DataFrame is passed
    cph.fit(df[['time', 'event', 'treatment', 'age']], duration_col='time', event_col='event')
    hr = cph.hazard_ratios_['treatment']

Phase 1b: ANOVA for Multi-Feature Data

When data has multiple features (genes, miRNAs, metabolites), use per-feature ANOVA (not aggregate). This is the most common pattern in genomics.

See anova_and_tests.md for the full decision tree, both methods, and worked examples.

Default for gene expression data: Per-feature ANOVA (Method B).
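
As an illustration of the per-feature approach, the sketch below runs one one-way ANOVA per feature column with `scipy.stats.f_oneway`. The data layout (rows are samples, columns are features, with a separate group label array) and all names are assumptions; the reference file may organize the data differently.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Simulated expression matrix (hypothetical): 15 samples x 3 features
groups = np.repeat(["T_cell", "B_cell", "NK_cell"], 5)
expr = pd.DataFrame(rng.normal(size=(15, 3)),
                    columns=["gene_a", "gene_b", "gene_c"])
expr.loc[groups == "B_cell", "gene_a"] += 5.0  # make gene_a differ by cell type

# One ANOVA per feature: each feature gets its own F-statistic and p-value
rows = []
for feature in expr.columns:
    samples = [expr.loc[groups == g, feature].to_numpy()
               for g in np.unique(groups)]
    f_stat, p_value = stats.f_oneway(*samples)
    rows.append({"feature": feature, "F": f_stat, "p": p_value})

anova_table = pd.DataFrame(rows).set_index("feature")
print(anova_table.round(4))
```

With many features, the resulting p-values usually need a multiple-testing correction before they are reported.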

Phase 2: Model Diagnostics

Goal: Check model assumptions and fit quality.

Key diagnostics by model type:

- OLS: Shapiro-Wilk (normality), Breusch-Pagan (heteroscedasticity), VIF (multicollinearity)
- Cox: Proportional hazards test via cph.check_assumptions()
- Logistic: Hosmer-Lemeshow, ROC/AUC

See references/troubleshooting.md for diagnostic code and common issues.
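
The three OLS diagnostics listed above can be sketched as follows. The simulated data and variable names are assumptions for illustration, and the code in references/troubleshooting.md may differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated data (hypothetical): two independent predictors, normal errors
rng = np.random.default_rng(42)
n = 200
sim = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
sim["y"] = 1.0 + 2.0 * sim["x1"] - 0.5 * sim["x2"] + rng.normal(size=n)

model = smf.ols("y ~ x1 + x2", data=sim).fit()

# Normality of residuals (Shapiro-Wilk)
shapiro_stat, shapiro_p = stats.shapiro(model.resid)

# Heteroscedasticity (Breusch-Pagan): returns LM stat, LM p, F stat, F p
bp_lm, bp_lm_p, bp_f, bp_f_p = het_breuschpagan(model.resid, model.model.exog)

# Multicollinearity: one VIF per predictor, skipping the intercept column
exog = model.model.exog
vif = {name: variance_inflation_factor(exog, i)
       for i, name in enumerate(model.model.exog_names)
       if name != "Intercept"}

print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"Breusch-Pagan p = {bp_lm_p:.3f}")
print({name: round(value, 2) for name, value in vif.items()})
```

Small p-values in the first two tests indicate a violated assumption. The VIF level that counts as problematic is a convention, so state the threshold used.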

Phase 3: Interpretation

Goal: Generate publication-quality summary.

For every result, report: effect size (OR/HR/coefficient), 95% CI, p-value, and model fit statistic. See bixbench_patterns_summary.md for common question-answer patterns.
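
A reporting line for a ratio measure can be built from a log-scale coefficient and its standard error. The helper below is a hypothetical sketch, not the interface of scripts/format_statistical_output.py, and it assumes a Wald-type 95% interval of exp(coef ± 1.96·SE).

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def format_ratio(coef, se, p_value, label="OR", decimals=2):
    """Format exp(coef) with a Wald 95% CI and p-value (hypothetical helper)."""
    estimate = math.exp(coef)
    lower = math.exp(coef - Z_95 * se)
    upper = math.exp(coef + Z_95 * se)
    p_text = "p < 0.001" if p_value < 0.001 else f"p = {p_value:.3f}"
    return (f"{label} = {estimate:.{decimals}f} "
            f"(95% CI {lower:.{decimals}f}-{upper:.{decimals}f}), {p_text}")


print(format_ratio(coef=0.6931, se=0.25, p_value=0.0056))
print(format_ratio(coef=-0.3567, se=0.12, p_value=0.0004, label="HR"))
```

The `decimals` argument lets the output match the precision a question asks for.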

Common BixBench Patterns

| Pattern | Question Type | Key Steps |
|---------|---------------|-----------|
| 1 | Odds ratio from ordinal regression | Fit OrderedModel, exp(coef) |
| 2 | Percentage reduction in OR | Compare crude vs adjusted model |
| 3 | Interaction effects | Fit A * B, extract A:B coef |
| 4 | Hazard ratio | Cox PH model, exp(coef) |
| 5 | Multi-feature ANOVA | Per-feature F-stats (not aggregate) |

See bixbench_patterns_summary.md for solution code for each pattern.

See references/bixbench_patterns.md for 15+ detailed question patterns.
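
Pattern 2 (percentage reduction in OR) compares a crude model against an adjusted one. The sketch below uses simulated data and assumes the reduction is defined as (OR_crude - OR_adjusted) / OR_crude × 100. Some analyses use the excess odds ratio (OR - 1) instead, so confirm which definition the question intends. All variable names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (hypothetical): age raises both exposure and disease risk,
# so it confounds the exposure-disease association
rng = np.random.default_rng(1)
n = 2000
age = rng.normal(50, 10, n)
exposure = rng.binomial(1, 1 / (1 + np.exp(-(age - 50) / 10)))
log_odds = -1.0 + 0.5 * exposure + 0.05 * (age - 50)
disease = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))
sim = pd.DataFrame({"disease": disease, "exposure": exposure, "age": age})

# Crude model: exposure only. Adjusted model: exposure plus the confounder.
crude = smf.logit("disease ~ exposure", data=sim).fit(disp=0)
adjusted = smf.logit("disease ~ exposure + age", data=sim).fit(disp=0)

or_crude = np.exp(crude.params["exposure"])
or_adjusted = np.exp(adjusted.params["exposure"])

# Assumed definition: relative change in the OR itself
pct_reduction = (or_crude - or_adjusted) / or_crude * 100

print(f"Crude OR = {or_crude:.3f}, adjusted OR = {or_adjusted:.3f}")
print(f"Percentage reduction in OR = {pct_reduction:.1f}%")
```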

Statsmodels vs Scikit-learn

| Use Case | Library | Reason |
|----------|---------|--------|
| Inference (p-values, CIs, ORs) | statsmodels | Full statistical output |
| Prediction (accuracy, AUC) | scikit-learn | Better prediction tools |
| Mixed-effects models | statsmodels | Only option of the two (scikit-learn has no mixed-effects models) |
| Regularization (LASSO, Ridge) | scikit-learn | Better optimization |
| Survival analysis | lifelines | Specialized library |

General rule: Use statsmodels for BixBench questions (they ask for p-values, ORs, HRs).
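
For the mixed-effects row of the table, a random-intercept model can be fit with the statsmodels formula interface. This is a minimal sketch on simulated repeated-measures data; the column names `subject`, `visit`, and `score` are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated repeated measures (hypothetical): 30 subjects, 4 visits each
rng = np.random.default_rng(7)
n_subjects, n_visits = 30, 4
subject = np.repeat(np.arange(n_subjects), n_visits)
visit = np.tile(np.arange(n_visits), n_subjects)
subject_effect = rng.normal(0, 2.0, n_subjects)[subject]  # random intercepts
score = 10 + 1.5 * visit + subject_effect + rng.normal(0, 1.0, subject.size)
sim = pd.DataFrame({"subject": subject, "visit": visit, "score": score})

# Random intercept per subject; visit enters as a fixed effect
lmm = smf.mixedlm("score ~ visit", data=sim, groups=sim["subject"]).fit()

slope = lmm.params["visit"]
ci_low, ci_high = lmm.conf_int().loc["visit"]
print(f"visit effect = {slope:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

Passing only `groups` gives random intercepts. Random slopes would additionally need the `re_formula` argument.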

Python Package Requirements

statsmodels>=0.14.0

scikit-learn>=1.3.0

lifelines>=0.27.0

pandas>=2.0.0

numpy>=1.24.0

scipy>=1.10.0

Key Principles

1. Data-first approach - Always inspect and validate data before modeling
2. Model selection by outcome type - Use decision tree above
3. Assumption checking - Verify model assumptions (linearity, proportional hazards, etc.)
4. Complete reporting - Always report effect sizes, CIs, p-values, and model fit statistics
5. Confounder awareness - Adjust for confounders when specified or clinically relevant
6. Reproducible analysis - All code must be deterministic and reproducible
7. Robust error handling - Graceful handling of convergence failures, separation, collinearity
8. Round correctly - Match the precision requested (typically 2-4 decimal places)

Completeness Checklist

Before finalizing any statistical analysis:

- Outcome variable identified: Verified which column is the actual outcome
- Data validated: N, missing values, variable types confirmed
- Multi-feature data identified: If multiple features, use per-feature approach
- Model appropriate: Outcome type matches model family
- Assumptions checked: Relevant diagnostics performed
- Effect sizes reported: OR/HR/Cohen's d with CIs
- P-values reported: With appropriate correction if needed
- Model fit assessed: R-squared, AIC/BIC, concordance
- Results interpreted: Plain-language interpretation
- Precision correct: Numbers rounded appropriately

File Structure
tooluniverse-statistical-modeling/
+-- SKILL.md                           # This file (workflow guide)
+-- QUICK_START.md                     # 8 quick examples
+-- EXAMPLES.md                        # Legacy examples
+-- TOOLS_REFERENCE.md                 # ToolUniverse tool catalog
+-- anova_and_tests.md                 # ANOVA decision tree and code
+-- bixbench_patterns_summary.md       # Common BixBench solution patterns
+-- test_skill.py                      # Test suite
+-- references/
|   +-- logistic_regression.md         # Detailed logistic examples
|   +-- ordinal_logistic.md            # Ordinal logit guide
|   +-- cox_regression.md              # Survival analysis guide
|   +-- linear_models.md               # OLS and mixed-effects
|   +-- bixbench_patterns.md           # 15+ question patterns
|   +-- troubleshooting.md             # Diagnostic issues
+-- scripts/
    +-- format_statistical_output.py   # Format results for reporting
    +-- model_diagnostics.py           # Automated diagnostics

ToolUniverse Integration
While this skill is primarily computational, ToolUniverse tools can provide data:

| Use Case | Tools |
|----------|-------|
| Clinical trial data | clinical_trials_search |
| Drug safety outcomes | FAERS_calculate_disproportionality |
| Gene-disease associations | OpenTargets_target_disease_evidence |
| Biomarker data | fda_pharmacogenomic_biomarkers |

See TOOLS_REFERENCE.md for complete tool catalog.

References

- statsmodels: https://www.statsmodels.org/
- lifelines: https://lifelines.readthedocs.io/
- scikit-learn: https://scikit-learn.org/
- Ordinal models: statsmodels.miscmodels.ordinal_model.OrderedModel

Support

For detailed examples and troubleshooting:

- Logistic regression: references/logistic_regression.md
- Ordinal models: references/ordinal_logistic.md
- Survival analysis: references/cox_regression.md
- Linear/mixed models: references/linear_models.md
- BixBench patterns: references/bixbench_patterns.md
- ANOVA and tests: anova_and_tests.md
- Diagnostics: references/troubleshooting.md
