- Small Molecule Binder Discovery Strategy
- Systematic discovery of novel small molecule binders using 60+ ToolUniverse tools across druggability assessment, known ligand mining, similarity expansion, ADMET filtering, and synthesis feasibility.
- KEY PRINCIPLES
- :
- Report-first approach
- - Create report file FIRST, then populate progressively
- Target validation FIRST
- - Confirm druggability before compound searching
- Multi-strategy approach
- - Combine structure-based and ligand-based methods
- ADMET-aware filtering
- - Eliminate poor compounds early
- Evidence grading
- - Grade candidates by supporting evidence
- Actionable output
- - Provide prioritized candidates with rationale
- English-first queries
- - Always use English terms in tool calls, even if the user writes in another language. Only try original-language terms as a fallback. Respond in the user's language
- Critical Workflow Requirements
- 1. Report-First Approach (MANDATORY)
- DO NOT
- show search process or tool outputs to the user. Instead:
- Create the report file FIRST
- - Before any data collection:
- File name:
- [TARGET]_binder_discovery_report.md
- Initialize with all section headers from the template (see REPORT_TEMPLATE.md)
- Add placeholder text:
- [Researching...]
- in each section
- Progressively update the report
- - As you gather data:
- Update each section with findings immediately
- The user sees the report growing, not the search process
- Output separate data files
- :
- [TARGET]_candidate_compounds.csv
- - Prioritized compounds with SMILES, scores
- [TARGET]_bibliography.json
- - Literature references (optional)
- 2. Citation Requirements (MANDATORY)
- Every piece of information MUST include its source:
- *
- Source: ChEMBL via
ChEMBL_get_target_activities- (CHEMBL203)
- *
- *
- Source: PDB via
get_protein_metadata_by_pdb_id- (1M17)
- *
- *
- Source: ADMET-AI via
ADMETAI_predict_toxicity- *
- *
- Source: NVIDIA NIM via
NvidiaNIM_alphafold2- (pLDDT: 90.94)
- *
- Workflow Overview
- Phase 0: Tool Verification (check parameter names)
- |
- Phase 1: Target Validation
- |- 1.1 Resolve identifiers (UniProt, Ensembl, ChEMBL target ID)
- |- 1.2 Assess druggability/tractability
- | +- 1.2a GPCRdb integration (for GPCR targets)
- | +- 1.2.5 Check therapeutic antibodies (Thera-SAbDab)
- |- 1.3 Identify binding sites
- +- 1.4 Predict structure (NvidiaNIM_alphafold2/esmfold)
- |
- Phase 2: Known Ligand Mining
- |- ChEMBL bioactivity data
- |- GtoPdb interactions
- |- Chemical probes (Open Targets)
- |- BindingDB affinity data (Ki/IC50/Kd)
- |- PubChem BioAssay HTS data (screening hits)
- +- SAR analysis from known actives
- |
- Phase 3: Structure Analysis
- |- PDB structures with ligands
- |- EMDB cryo-EM structures (for membrane targets)
- |- Binding pocket analysis
- +- Key interactions
- |
- Phase 3.5: Docking Validation (NvidiaNIM_diffdock/boltz2)
- |- Dock reference inhibitor
- +- Validate binding pocket geometry
- |
- Phase 4: Compound Expansion
- |- 4.1-4.3 Similarity/substructure search
- +- 4.4 De novo generation (NvidiaNIM_genmol/molmim)
- |
- Phase 5: ADMET Filtering
- |- Physicochemical properties (Lipinski, QED)
- |- Bioavailability, toxicity, CYP interactions
- +- Structural alerts (PAINS)
- |
- Phase 6: Candidate Docking & Prioritization
- |- Dock all candidates (NvidiaNIM_diffdock/boltz2)
- |- Score by docking (40%) + ADMET (30%) + similarity (20%) + novelty (10%)
- |- Assess synthesis feasibility
- +- Generate final ranked list (top 20)
- |
- Phase 6.5: Literature Evidence
- |- PubMed (peer-reviewed SAR studies)
- |- EuropePMC preprints (source='PPR')
- +- OpenAlex citation analysis
- |
- Phase 7: Report Synthesis & Delivery
- Phase 0: Tool Verification
- CRITICAL
-
- Verify tool parameters before calling unfamiliar tools.
- tool_info
- =
- tu
- .
- tools
- .
- get_tool_info
- (
- tool_name
- =
- "ChEMBL_get_target_activities"
- )
- Known Parameter Corrections
- Tool
- WRONG Parameter
- CORRECT Parameter
- OpenTargets_*
- ensembl_id
- ensemblId
- (camelCase)
- ChEMBL_get_target_activities
- chembl_target_id
- target_chembl_id
- ChEMBL_search_similar_molecules
- smiles
- molecule
- (accepts SMILES, ChEMBL ID, or name)
- alphafold_get_prediction
- uniprot
- accession
- ADMETAI_*
- smiles="..."
- smiles=["..."]
- (must be list)
- NvidiaNIM_alphafold2
- seq
- sequence
- NvidiaNIM_genmol
- smiles="C..."
- smiles="C...[*{1-3}]..."
- (must have mask)
- NvidiaNIM_boltz2
- sequence="..."
- polymers=[{"molecule_type": "protein", "sequence": "..."}]
- Phase 1: Target Validation
- 1.1 Identifier Resolution
- Resolve all IDs upfront and store for downstream queries:
- 1. UniProt_search(query=target_name, organism="human") -> UniProt accession
- 2. MyGene_query_genes(q=gene_symbol, species="human") -> Ensembl gene ID
- 3. ChEMBL_search_targets(query=target_name, organism="Homo sapiens") -> ChEMBL target ID
- 4. GtoPdb_get_targets(query=target_name) -> GtoPdb ID (if GPCR/channel/enzyme)
- 1.2 Druggability Assessment
- Use multi-source triangulation:
- OpenTargets_get_target_tractability_by_ensemblID(ensemblId)
- - tractability bucket
- DGIdb_get_gene_druggability(genes=[gene_symbol])
- - druggability categories
- OpenTargets_get_target_classes_by_ensemblID(ensemblId)
- - target class
- For GPCRs:
- GPCRdb_get_protein
- +
- GPCRdb_get_ligands
- +
- GPCRdb_get_structures
- For antibody landscape:
- TheraSAbDab_search_by_target(target=target_name)
- Decision Point
- If druggability < 2 stars, warn user about challenges.
1.3 Binding Site Analysis
ChEMBL_search_binding_sites(target_chembl_id)
get_binding_affinity_by_pdb_id(pdb_id)
for co-crystallized ligands
InterPro_get_protein_domains(accession)
for domain architecture
1.4 Structure Prediction (NVIDIA NIM)
Requires
NVIDIA_API_KEY
. Two options:
AlphaFold2
:
NvidiaNIM_alphafold2(sequence, algorithm="mmseqs2")
- high accuracy, 5-15 min
ESMFold
:
NvidiaNIM_esmfold(sequence)
- fast (~30s), max 1024 AA
Always report pLDDT confidence scores (>=90 very high, 70-90 confident, <70 caution).
Phase 2: Known Ligand Mining
Tools (in order of priority)
Source
Tool
Strengths
ChEMBL
ChEMBL_get_target_activities
Curated, SAR-ready
BindingDB
BindingDB_get_ligands_by_uniprot
Direct Ki/Kd, literature links
GtoPdb
GtoPdb_get_target_interactions
Pharmacology focus (GPCRs, channels)
PubChem
PubChem_search_assays_by_target_gene
HTS screens, novel scaffolds
Open Targets
OpenTargets_get_chemical_probes_by_target_ensemblID
Validated probes
Key Steps
Get all bioactivities: filter to IC50/Ki/Kd < 10 uM
Get molecule details for top actives:
ChEMBL_get_molecule
Identify chemical probes and approved drugs
Analyze SAR: common scaffolds, key modifications
Check off-target selectivity:
BindingDB_get_targets_by_compound
Phase 3: Structure Analysis
Tools
PDB_search_similar_structures(query=uniprot, type="sequence")
- find PDB entries
get_protein_metadata_by_pdb_id(pdb_id)
- resolution, method
get_binding_affinity_by_pdb_id(pdb_id)
- co-crystal ligand affinities
get_ligand_smiles_by_chem_comp_id(chem_comp_id)
- ligand SMILES from PDB
emdb_search(query)
- cryo-EM structures (prefer for GPCRs, ion channels)
alphafold_get_prediction(accession)
- AlphaFold DB fallback
Phase 3.5: Docking Validation (NVIDIA NIM)
Situation
Tool
Input
Have PDB + SDF
NvidiaNIM_diffdock
protein=PDB, ligand=SDF, num_poses=10
Have sequence + SMILES
NvidiaNIM_boltz2
polymers=[...], ligands=[...]
Dock a known reference inhibitor first to validate the binding pocket.
Phase 4: Compound Expansion
4.1-4.3 Search-Based Expansion
Use 3-5 diverse actives as seeds, similarity threshold 70-85%:
ChEMBL_search_similar_molecules(molecule=SMILES, similarity=70)
PubChem_search_compounds_by_similarity(smiles, threshold=0.7)
ChEMBL_search_substructure(smiles=core_scaffold)
STITCH_get_chemical_protein_interactions(identifier=gene, species=9606)
4.4 De Novo Generation (NVIDIA NIM)
GenMol
- scaffold hopping with masked regions:
NvidiaNIM_genmol(smiles="...core...[{3-8}]...tail...[]...", num_molecules=100, temperature=2.0, scoring="QED")
Mask syntax:
[*{min-max}]
specifies atom count range.
MolMIM
- controlled analog generation:
NvidiaNIM_molmim(smi=reference_smiles, num_molecules=50, algorithm="CMA-ES")
Phase 5: ADMET Filtering
Apply filters sequentially (all take
smiles=[list]
):
Step
Tool
Filter Criteria
Physicochemical
ADMETAI_predict_physicochemical_properties
Lipinski <= 1, QED > 0.3, MW 200-600
Bioavailability
ADMETAI_predict_bioavailability
Oral bioavailability > 0.3
Toxicity
ADMETAI_predict_toxicity
AMES < 0.5, hERG < 0.5, DILI < 0.5
CYP
ADMETAI_predict_CYP_interactions
Flag CYP3A4 inhibitors
Alerts
ChEMBL_search_compound_structural_alerts
No PAINS
Include a filter funnel table in the report showing pass/fail counts at each stage.
Phase 6: Candidate Docking & Prioritization
Scoring Framework
Dimension
Weight
Source
Docking confidence
40%
NvidiaNIM_diffdock/boltz2
ADMET score
30%
ADMETAI predictions
Similarity to known active
20%
Tanimoto coefficient
Novelty
10%
Not in ChEMBL + novel scaffold bonus
Evidence Tiers
Tier
Criteria
T0 (4 stars)
Docking score > reference inhibitor
T1 (3 stars)
Experimental IC50/Ki < 100 nM
T2 (2 stars)
Docking within 5% of reference OR IC50 100-1000 nM
T3 (1 star)
80% similarity to T1 compound T4 (0 stars) 70-80% similarity, scaffold match T5 (empty) Generated molecule, ADMET-passed, no docking Deliver top 20 candidates with: Rank, ID, SMILES, Docking score, ADMET score, overall score, source, evidence tier. Phase 6.5: Literature Evidence PubMed_search_articles(query="[TARGET] inhibitor SAR") - peer-reviewed EuropePMC_search_articles(query, source="PPR") - preprints (not peer-reviewed) openalex_search_works(query) - citation analysis Fallback Chains Target ID: ChEMBL_search_targets -> GtoPdb_get_targets -> "Not in databases" Druggability: OpenTargets tractability -> DGIdb druggability -> target class proxy Bioactivity: ChEMBL -> BindingDB -> GtoPdb -> PubChem BioAssay -> "No data" Structure: PDB -> EMDB (membrane) -> NvidiaNIM_alphafold2 -> NvidiaNIM_esmfold -> AlphaFold DB -> "None" Similarity: ChEMBL similar -> PubChem similar -> "Search failed" Docking: NvidiaNIM_diffdock -> NvidiaNIM_boltz2 -> similarity-based scoring Generation: NvidiaNIM_genmol -> NvidiaNIM_molmim -> similarity search only Literature: PubMed -> EuropePMC (preprints) -> OpenAlex GPCR data: GPCRdb_get_protein -> GtoPdb_get_targets NVIDIA NIM Runtime Reference Tool Runtime Notes NvidiaNIM_alphafold2 5-15 min Async, max ~2000 AA NvidiaNIM_esmfold ~30 sec Max 1024 AA NvidiaNIM_diffdock ~1-2 min Per ligand NvidiaNIM_boltz2 ~2-5 min End-to-end complex NvidiaNIM_genmol ~1-3 min Depends on num_molecules NvidiaNIM_molmim ~1-2 min Close analog generation Always check: import os; nvidia_available = bool(os.environ.get("NVIDIA_API_KEY")) Rate Limiting Database Limit Strategy ChEMBL ~10 req/sec Batch queries PubChem ~5 req/sec Batch endpoints ADMET-AI No strict limit Batch SMILES in lists NVIDIA NIM API key quota Cache results For large expansions (>500 compounds): batch in chunks of 100, prioritize top candidates for docking. Reference Files For detailed protocols, examples, and templates, see: File Contents WORKFLOW_DETAILS.md Phase-by-phase procedures, code patterns, screening protocols, fallback chain details TOOLS_REFERENCE.md Complete tool reference with parameters, usage examples, and fallback chains REPORT_TEMPLATE.md Report file template, evidence grading system, section formatting examples EXAMPLES.md End-to-end workflow examples (EGFR, novel target, lead optimization, NVIDIA NIM) CHECKLIST.md Pre-delivery verification checklist for report quality
tooluniverse-binder-discovery
安装
npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-binder-discovery