Variant Analysis and Annotation Production-ready VCF processing and variant annotation skill combining local bioinformatics computation with ToolUniverse database integration. Designed to answer bioinformatics analysis questions about VCF data, mutation classification, variant filtering, and clinical annotation. When to Use This Skill Triggers : User provides a VCF file (SNV/indel or SV) and asks questions about its contents Questions about variant allele frequency (VAF) filtering Mutation type classification queries (missense, nonsense, synonymous, etc.) Structural variant interpretation requests (deletions, duplications, CNVs) Variant annotation requests (ClinVar, gnomAD, CADD, dbSNP) CNV pathogenicity assessment using ClinGen dosage sensitivity Cohort comparison questions Population frequency filtering (SNVs or SVs) Intronic/intergenic variant filtering Gene dosage sensitivity queries Example Questions : "What fraction of variants with VAF < 0.3 are annotated as missense mutations?" "After filtering intronic/intergenic variants, how many non-reference variants remain?" "What is the clinical significance of this deletion affecting BRCA1?" "Which dosage-sensitive genes overlap this 500kb duplication on chr17?" "How many variants have clinical significance annotations?" "Compare variant counts between samples" Core Capabilities Capability Description VCF Parsing Pure Python + cyvcf2 parsers. VCF 4.x, gzipped, multi-sample, SNV/indel/SV Mutation Classification Maps SO terms, SnpEff ANN, VEP CSQ, GATK Funcotator to standard types VAF Extraction Handles AF, AD, AO/RO, NR/NV, INFO AF formats Filtering VAF, depth, quality, PASS, variant type, mutation type, consequence, chromosome, SV size Statistics Ti/Tv ratio, per-sample VAF/depth stats, mutation type distribution, SV size distribution Annotation MyVariant.info (aggregates ClinVar, dbSNP, gnomAD, CADD, SIFT, PolyPhen) SV/CNV Analysis gnomAD SV population frequencies, DGVa/dbVar known SVs, ClinGen dosage sensitivity Clinical Interpretation ACMG/ClinGen CNV pathogenicity classification using haploinsufficiency/triplosensitivity scores DataFrame Convert to pandas for advanced analytics Reporting Markdown reports with tables and statistics, SV clinical reports Workflow Overview Input VCF File (SNVs/indels or SVs) | v Phase 1: Parse VCF |-- Pure Python parser (any VCF 4.x) |-- cyvcf2 parser (faster, C-based) |-- Extract: CHROM, POS, REF, ALT, QUAL, FILTER, INFO, FORMAT, samples |-- Extract per-sample: GT, VAF, depth |-- Extract annotations from INFO (ANN, CSQ, FUNCOTATION) |-- Detect variant class: SNV/indel vs SV/CNV | v Phase 2: Classify Variants |-- Variant type: SNV, INS, DEL, MNV, COMPLEX, SV |-- Mutation type: missense, nonsense, synonymous, frameshift, splice, etc. |-- Impact: HIGH, MODERATE, LOW, MODIFIER |-- SV type: DEL, DUP, INV, BND, CNV (if structural variant) | v Phase 3: Apply Filters |-- VAF range (min/max) |-- Read depth minimum |-- Quality threshold |-- PASS only |-- Variant/mutation type inclusion/exclusion |-- Consequence exclusion (intronic, intergenic) |-- Population frequency range |-- Chromosome selection |-- SV size range (for structural variants) | v Phase 4: Compute Statistics |-- Variant type distribution |-- Mutation type distribution |-- Impact distribution |-- Chromosome distribution |-- Ti/Tv ratio (for SNVs) |-- Per-sample VAF/depth stats |-- Gene mutation counts |-- SV size distribution (for structural variants) | v Phase 5: Annotate with ToolUniverse (optional) |-- MyVariant.info: ClinVar, dbSNP, gnomAD, CADD, SIFT, PolyPhen |-- dbSNP: Population frequencies, gene associations |-- gnomAD: Population allele frequencies |-- Ensembl VEP: Consequence prediction | v Phase 6: Generate Report / Answer Question |-- Markdown report with tables |-- Direct answer to specific question |-- DataFrame for downstream analysis | v Phase 7: Structural Variant & CNV Analysis (if SV/CNV detected) |-- Annotate with gnomAD SV population frequencies |-- Query DGVa/dbVar for known SVs (Ensembl) |-- Identify affected genes |-- Query ClinGen dosage sensitivity (HI/TS scores) |-- Classify pathogenicity (Pathogenic/Likely Pathogenic/VUS/Benign) |-- Generate SV clinical report with ACMG/ClinGen guidelines Phase Summaries Phase 1: VCF Parsing Use pandas for : Reading VCF as structured data Quick exploratory analysis When you need to manipulate columns and rows Use python_implementation tools for : Production parsing with annotation extraction Multi-sample VCF handling VAF extraction from FORMAT fields Large file streaming Key functions : vcf_data = parse_vcf ( "input.vcf" )

Pure Python (always works)

vcf_data

parse_vcf_cyvcf2 ( "input.vcf" )

Fast C-based (if installed)

df

variants_to_dataframe ( vcf_data . variants , sample = "TUMOR" )

For pandas

Phase 2: Variant Classification
Automatic classification from annotations
:
SnpEff ANN field
VEP CSQ field
GATK Funcotator FUNCOTATION field
Standard INFO keys: EFFECT, EFF, TYPE
Mutation types supported: missense, nonsense, synonymous, frameshift, splice_site, splice_region, inframe_insertion, inframe_deletion, intronic, intergenic, UTR_5, UTR_3, upstream, downstream, stop_lost, start_lost See references/mutation_classification_guide.md for full details Phase 3: Filtering Common filtering patterns :

Somatic-like variants

criteria

FilterCriteria ( min_vaf = 0.05 , max_vaf = 0.95 , min_depth = 20 , pass_only = True , exclude_consequences = [ "intronic" , "intergenic" , "upstream" , "downstream" ] )

High-confidence germline

criteria

FilterCriteria ( min_vaf = 0.25 , min_depth = 30 , pass_only = True , chromosomes = [ "1" , "2" , . . . , "22" , "X" , "Y" ] )

Rare pathogenic candidates

criteria

FilterCriteria

(

min_depth

=

20

,

pass_only

=

True

,

mutation_types

=

[

"missense"

,

"nonsense"

,

"frameshift"

]

)

See references/vcf_filtering.md for all filter options

Phase 4: Statistics

Use pandas for

:

Complex aggregations (groupby, pivot tables)

Custom statistical tests

Data exploration

Use python_implementation for

:

Standard variant statistics (Ti/Tv, type distribution)

Per-sample VAF/depth summary

Quick mutation type counts

Phase 5: ToolUniverse Annotation

When to use ToolUniverse annotation tools

:

ClinVar clinical significance

Use MyVariant.info or dbSNP tools

Population frequencies

Use MyVariant.info (aggregates gnomAD, ExAC, 1000G)

Pathogenicity scores

Use MyVariant.info (aggregates CADD, SIFT, PolyPhen)

Consequence prediction

Use Ensembl VEP tools

Best practices

:

Annotate variants with rsIDs first (most reliable)

Use MyVariant.info for batch annotation (aggregates multiple sources)

Limit to top variants (max_annotate=50-100) to respect rate limits

Query dbSNP/gnomAD directly for specific use cases

Key tools

:

MyVariant_query_variants

Batch annotation (ClinVar, dbSNP, gnomAD, CADD)

dbsnp_get_variant_by_rsid

Population frequencies

gnomad_get_variant

Basic variant metadata
EnsemblVEP_annotate_rsid: Consequence prediction See references/annotation_guide.md for detailed examples Phase 6: Report Generation Report includes : Summary Statistics (total variants, type counts, Ti/Tv) Mutation Type Distribution (table with counts and percentages) Impact Distribution Chromosome Distribution VAF Distribution (per-sample) Clinical Significance Top Mutated Genes Variant Annotations (ClinVar-annotated variants) Phase 7: Structural Variant & CNV Analysis When VCF contains SV calls (SVTYPE=DEL/DUP/INV/BND): Identify affected genes (from VCF annotation or coordinate overlap) Query ClinGen dosage sensitivity : clingen = ClinGen_dosage_by_gene ( gene_symbol = "BRCA1" )

Returns: haploinsufficiency_score, triplosensitivity_score

Check population frequency : gnomad_sv = gnomad_get_sv_by_gene ( gene_symbol = "BRCA1" )

Returns: SVs with AF, AC, AN

Classify pathogenicity

:

Pathogenic: Deletion + HI score = 3, AF < 0.0001

Likely Pathogenic: Deletion + HI score = 2, AF < 0.001

VUS: HI/TS score = 0-1, AF 0.001-0.01

Benign: AF > 0.01

ClinGen dosage score interpretation

:

3

Sufficient evidence for dosage pathogenicity (HIGH impact)

2

Some evidence (MODERATE impact)

1

Little evidence (LOW impact)

0

No evidence (MINIMAL impact)

40

Dosage sensitivity unlikely
See references/sv_cnv_analysis.md for full SV workflow
Answering BixBench Questions
Pattern 1: VAF + Mutation Type Fraction
Question: "What fraction of variants with VAF < X are annotated as Y mutations?" result = answer_vaf_mutation_fraction ( vcf_path = "input.vcf" , max_vaf = 0.3 , mutation_type = "missense" , sample = "TUMOR" )

Returns: fraction, total_below_vaf, matching_mutation_type

Pattern 2: Cohort Comparison
Question: "What is the difference in mutation frequency between cohorts?" result = answer_cohort_comparison ( vcf_paths = [ "cohort1.vcf" , "cohort2.vcf" ] , mutation_type = "missense" , cohort_names = [ "Treatment" , "Control" ] )

Returns: cohorts, frequency_difference

Pattern 3: Filter and Count
Question: "After filtering X, how many Y remain?" result = answer_non_reference_after_filter ( vcf_path = "input.vcf" , exclude_intronic_intergenic = True )

Returns: total_input, non_reference, remaining

ToolUniverse Tools Reference
SNV/Indel Annotation
Tool
When to Use
Parameters
Response
MyVariant_query_variants
Batch annotation
query
(rsID/HGVS)
ClinVar, dbSNP, gnomAD, CADD
dbsnp_get_variant_by_rsid
Population frequencies
rsid
Frequencies, clinical significance
gnomad_get_variant
gnomAD metadata
variant_id
(CHR-POS-REF-ALT)
Basic variant info
EnsemblVEP_annotate_rsid
Consequence prediction
variant_id
(rsID)
Transcript impact
Structural Variant Annotation
Tool
When to Use
Parameters
Response
gnomad_get_sv_by_gene
SV population frequency
gene_symbol
SVs with AF, AC, AN
gnomad_get_sv_by_region
Regional SV search
chrom
,
start
,
end
SVs in region
ClinGen_dosage_by_gene
Dosage sensitivity
gene_symbol
HI/TS scores, disease
ClinGen_dosage_region_search
Dosage-sensitive genes in region
chromosome
,
start
,
end
All genes with HI/TS scores
ensembl_get_structural_variants
Known SVs from DGVa/dbVar
chrom
,
start
,
end
,
species
Clinical significance
See references/annotation_guide.md for detailed tool usage examples
Common Use Patterns
Pattern 1: Quick VCF Summary
Parse VCF, compute statistics, generate report.
report
=
variant_analysis_pipeline
(
"input.vcf"
,
output_file
=
"report.md"
)
Pattern 2: Filtered Analysis
Parse VCF, apply multi-criteria filter, compute statistics on filtered set.
report
=
variant_analysis_pipeline
(
vcf_path
=
"input.vcf"
,
filters
=
FilterCriteria
(
min_vaf
=
0.1
,
min_depth
=
20
,
pass_only
=
True
)
,
output_file
=
"filtered_report.md"
)
Pattern 3: Annotated Report
Parse VCF, annotate top variants with ClinVar/gnomAD/CADD, generate clinical report.
report
=
variant_analysis_pipeline
(
vcf_path
=
"input.vcf"
,
annotate
=
True
,
max_annotate
=
50
,
output_file
=
"annotated_report.md"
)
Pattern 4: BixBench Question Answering
Parse VCF, apply specific filters, compute targeted statistics to answer precise questions.
result
=
answer_vaf_mutation_fraction
(
vcf_path
=
"input.vcf"
,
max_vaf
=
0.3
,
mutation_type
=
"missense"
)
Pattern 5: Cohort Comparison
Parse multiple VCFs, compare mutation frequencies across cohorts.
result
=
answer_cohort_comparison
(
vcf_paths
=
[
"cohort1.vcf"
,
"cohort2.vcf"
]
,
mutation_type
=
"missense"
)
When to Use pandas vs python_implementation
Use pandas when
:
You need to read VCF as a flat table
You want to do custom aggregations (groupby, pivot)
You need to join with other data
You're doing exploratory data analysis
You want to export to CSV/Excel
Use python_implementation when
:
You need production-grade VCF parsing
You need to extract INFO annotations (ANN, CSQ)
You need per-sample VAF/depth extraction
You need to classify mutation types
You need standard variant statistics (Ti/Tv)
You need to integrate with ToolUniverse annotation
Best approach: Use python_implementation for parsing/classification, then convert to DataFrame for custom analysis:

Parse and classify

vcf_data

parse_vcf ( "input.vcf" ) passing , failing = filter_variants ( vcf_data . variants , criteria )

Convert to DataFrame for custom analysis

df

variants_to_dataframe ( passing , sample = "TUMOR" )

Now use pandas

missense_high_vaf

df

[

(

df

[

'mutation_type'

]

==

'missense'

)

&

(

df

[

'vaf'

]

>=

0.3

)

]

Limitations

VCF annotation required for mutation classification

If VCF has no ANN/CSQ/FUNCOTATION in INFO, mutation types will be "unknown" until ToolUniverse annotation is applied

Multi-allelic variants

Parser takes first ALT allele for type classification

ToolUniverse annotation rate

API-based, limited to ~100 variants per batch by default to respect rate limits

gnomAD tool

Returns basic metadata only (not full allele frequencies); use MyVariant.info for gnomAD AF

Large VCFs

Pure Python parser streams line-by-line; cyvcf2 is recommended for files with >100K variants

Reference Documentation

references/vcf_filtering.md

Complete filter options and examples

references/mutation_classification_guide.md

Detailed mutation type classification rules

references/annotation_guide.md

ToolUniverse annotation workflows with examples

references/sv_cnv_analysis.md

Complete SV/CNV interpretation workflow

Utility Scripts

scripts/parse_vcf.py

Standalone VCF parsing script

scripts/filter_variants.py

Command-line variant filtering
scripts/annotate_variants.py: Batch variant annotation Quick Start See QUICK_START.md for: Python SDK examples (pipeline, question functions, individual tools) MCP conversational examples Common recipes (somatic analysis, clinical screening, population frequency) Expected output formats Troubleshooting guide

tooluniverse-variant-analysis

安装

Pure Python (always works)

vcf_data

Fast C-based (if installed)

df

For pandas

Somatic-like variants

criteria

High-confidence germline

criteria

Rare pathogenic candidates

criteria

Returns: haploinsufficiency_score, triplosensitivity_score

Returns: SVs with AF, AC, AN

Returns: fraction, total_below_vaf, matching_mutation_type

Returns: cohorts, frequency_difference

Returns: total_input, non_reference, remaining

Parse and classify

vcf_data

Convert to DataFrame for custom analysis

df

Now use pandas

missense_high_vaf