# Vector Search Designer

Installs: 46
Rank: #15920

## Install

```bash
npx skills add https://github.com/eddiebe147/claude-settings --skill 'Vector Search Designer'
```

The Vector Search Designer skill helps you architect and implement vector similarity search systems that power semantic search, recommendation engines, and AI applications. It guides you through selecting the right vector database, designing index structures, optimizing query performance, and scaling to millions or billions of vectors.

Vector search has become foundational to modern AI systems, from RAG pipelines to product recommendations. This skill covers the full stack: understanding approximate nearest neighbor (ANN) algorithms, choosing between database options, tuning recall vs. latency tradeoffs, and implementing production-ready search infrastructure. Whether you are building on Pinecone, Weaviate, Qdrant, or pgvector, or implementing your own solution, this skill ensures your vector search system meets your performance and accuracy requirements.

## Core Workflows

### Workflow 1: Select Vector Database

1. Gather requirements:
   - Scale: how many vectors?
   - Query patterns: single vs. batch? Filters needed?
   - Latency requirements: real-time vs. batch?
   - Update frequency: static vs. dynamic?
   - Infrastructure: managed vs. self-hosted?
2. Compare options:

   | Database | Scale | Managed | Features | Best For |
   |----------|-------|---------|----------|----------|
   | Pinecone | Billion+ | Yes | Hybrid, namespaces | Production, zero-ops |
   | Weaviate | 100M+ | Both | GraphQL, modules | Multi-modal, complex queries |
   | Qdrant | 100M+ | Both | Rust performance, filtering | High performance, self-hosted |
   | Milvus | Billion+ | Both | GPU support, clustering | Large scale, ML teams |
   | pgvector | 10M | No | PostgreSQL native | Existing Postgres, small scale |
   | Chroma | 1M | No | Simple API | Prototyping, embedded |

3. Evaluate candidates with your own data.
4. Document the decision rationale.

### Workflow 2: Design Index Architecture

1. Choose an ANN algorithm:
   - HNSW: best recall/speed tradeoff, memory intensive
   - IVF: good for very large datasets, requires training
   - PQ: compression for scale, some accuracy loss
   - Flat: exact search, small datasets only
2. Configure index parameters (mapped onto a real library in the sketch after this list):

```python
# HNSW example configuration
hnsw_config = {
    "M": 16,               # Connections per node (higher = better recall, more memory)
    "efConstruction": 200, # Build-time search depth
    "efSearch": 100,       # Query-time search depth
}
```

```python
# IVF example configuration
ivf_config = {
    "nlist": 1024, # Number of clusters
    "nprobe": 32,  # Clusters to search at query time
}
```
3. Plan a sharding strategy for scale.
4. Design a metadata schema for filtering.
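
To make the parameters above concrete, here is a minimal sketch of how they map onto FAISS (assuming the faiss-cpu package and a random stand-in corpus; other engines expose the same knobs under similar names):

```python
import faiss
import numpy as np

dim = 768
vectors = np.random.rand(50_000, dim).astype("float32")  # stand-in corpus

# HNSW: M and efConstruction are fixed at build time; efSearch is tunable later
hnsw = faiss.IndexHNSWFlat(dim, 16)  # M = 16
hnsw.hnsw.efConstruction = 200
hnsw.add(vectors)
hnsw.hnsw.efSearch = 100

# IVF: train cluster centroids first (real corpora want far more training
# points than nlist), then tune nprobe at query time
quantizer = faiss.IndexFlatL2(dim)
ivf = faiss.IndexIVFFlat(quantizer, dim, 1024)  # nlist = 1024
ivf.train(vectors)
ivf.add(vectors)
ivf.nprobe = 32

query = np.random.rand(1, dim).astype("float32")
distances, ids = hnsw.search(query, 10)  # top-10 approximate neighbors
```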
### Workflow 3: Optimize Search Performance

1. Benchmark baseline performance:
   - Queries per second (QPS)
   - Latency (p50, p95, p99)
   - Recall@k accuracy (see the measurement sketch after this list)
2. Identify bottlenecks:
   - Index loading time
   - Search latency
   - Filter computation
   - Network overhead
3. Apply optimizations:
   - Tune index parameters (ef, nprobe)
   - Implement query caching
   - Optimize filter expressions
   - Consider quantization to cut memory
4. Validate the recall vs. latency tradeoff.
5. Document the optimal configuration.
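
Recall is easy to measure directly: compare your ANN results against exact brute-force results on a sample of queries. A minimal numpy sketch, where `ann_search` is a stand-in for your index's query call:

```python
import numpy as np

def recall_at_k(ann_search, corpus, queries, k=10):
    """Average fraction of the exact top-k neighbors that the ANN index also returns."""
    hits = 0.0
    for q in queries:
        # Exact ground truth: full scan by L2 distance
        exact = np.argsort(np.linalg.norm(corpus - q, axis=1))[:k]
        approx = ann_search(q, k)  # stand-in: returns k candidate ids
        hits += len(set(exact.tolist()) & set(approx)) / k
    return hits / len(queries)
```

Running this at several ef or nprobe settings gives you the recall/latency curve to tune against.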
## Quick Reference

| Action | Command/Trigger |
|--------|-----------------|
| Choose database | "Which vector database for [use case]" |
| Design index | "Design vector index for [scale]" |
| Optimize search | "Speed up vector search" |
| Add filtering | "Add metadata filters to vector search" |
| Scale vectors | "Scale to [N] million vectors" |
| Benchmark search | "Benchmark vector search performance" |
## Best Practices

### Right-Size Your Database

- Don't over-engineer for scale you don't need.
- <1M vectors: pgvector or Chroma is often sufficient.
- 1-100M vectors: Qdrant, Weaviate, or Pinecone.
- 100M+ vectors: Milvus or Pinecone with careful design.
### Understand the Recall vs. Speed Tradeoff

- ANN is approximate by design.
- Higher recall means slower queries.
- Tune based on your accuracy requirements.
- Measure actual recall; don't assume.
### Use Hybrid Search When Needed

- Combine vector and keyword search.
- Vector: semantic similarity.
- Keyword (BM25): exact terms, names, codes.
- Typically improves results for mixed queries (one common fusion method is sketched below).
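
A common, model-free way to combine the two ranked result lists is reciprocal rank fusion (RRF). A minimal sketch, where k=60 is the conventional RRF constant and the input lists are ranked doc IDs:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists of doc IDs; higher fused score means better."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: fuse BM25 and vector results, keep the top 10
# fused = reciprocal_rank_fusion([bm25_ids, vector_ids])[:10]
```

RRF needs only ranks, not scores, so it sidesteps the problem of BM25 and cosine scores living on different scales.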
### Design Metadata for Filtering

- Plan your filter strategy upfront.
- Index the fields you filter on frequently.
- Avoid filtering on high-cardinality fields.
- Weigh pre-filter vs. post-filter tradeoffs.
### Batch Operations When Possible

- Reduce network overhead.
- Batch upserts during ingestion (a minimal sketch follows).
- Batch queries when latency allows.
- Use async operations.
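
A minimal batching sketch; the `client.upsert` call is a stand-in for whatever bulk API your database exposes:

```python
def batched_upsert(client, items, batch_size=100):
    """Send vectors in chunks instead of one request per vector."""
    for start in range(0, len(items), batch_size):
        client.upsert(items[start:start + batch_size])  # hypothetical bulk API
```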
Monitor and Alert
Production search needs observability Query latency percentiles Index size and memory usage Recall degradation over time Advanced Techniques Multi-Vector Search Handle documents with multiple representations: class MultiVectorIndex : def init ( self ) : self . title_index = VectorIndex ( dim = 768 ) self . content_index = VectorIndex ( dim = 768 ) self . summary_index = VectorIndex ( dim = 768 ) def search ( self , query_embedding , weights = None ) : weights = weights or { "title" : 0.3 , "content" : 0.5 , "summary" : 0.2 } results = { } for field , weight in weights . items ( ) : index = getattr ( self , f" { field } _index" ) field_results = index . search ( query_embedding , k = 20 ) for doc_id , score in field_results : results [ doc_id ] = results . get ( doc_id , 0 ) + score * weight return sorted ( results . items ( ) , key = lambda x : x [ 1 ] , reverse = True ) [ : 10 ] Filtered Vector Search Strategies Optimize search with filters: def filtered_search ( query_embedding , filters , k = 10 ) :

### Filtered Vector Search Strategies

Optimize search with filters:

```python
def filtered_search(query_embedding, filters, k=10):
    # Strategy 1: Pre-filter (for selective filters)
    if estimate_selectivity(filters) < 0.1:
        candidate_ids = apply_filters(filters)
        return vector_search_subset(query_embedding, candidate_ids, k)
    # Strategy 2: Post-filter (for non-selective filters)
    elif estimate_selectivity(filters) > 0.5:
        results = vector_search(query_embedding, k * 3)  # over-fetch, then filter
        filtered = [r for r in results if matches_filters(r, filters)]
        return filtered[:k]
    # Strategy 3: Hybrid (general case)
    else:
        return vector_search_with_filters(query_embedding, filters, k)
```

### Quantization for Scale

Reduce memory with acceptable accuracy loss:

```python
# Product Quantization configuration
pq_config = {
    "nbits": 8, # Bits per sub-quantizer
    "m": 16,    # Number of sub-quantizers
}
# 768-dim * 4 bytes = ~3 KB/vector -> 16 sub-quantizers * 1 byte = 16 bytes/vector

# Binary quantization (extreme compression)
binary_config = {
    "threshold": 0, # Values > 0 -> 1, else -> 0
}
# 768-dim * 4 bytes = ~3 KB/vector -> 768 bits = 96 bytes/vector
```
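
As a concrete illustration of the binary case, a numpy sketch (assuming raw float32 embeddings): `np.packbits` produces the 96-byte codes, and XOR plus popcount gives the Hamming distance used for ranking:

```python
import numpy as np

def binarize(vectors):
    """Threshold at 0 and pack to bits: 768 floats -> 96 bytes per vector."""
    return np.packbits(vectors > 0, axis=1)

def hamming_distance(codes, query_code):
    """Popcount of XOR between packed codes; lower = more similar."""
    xor = np.bitwise_xor(codes, query_code)
    return np.unpackbits(xor, axis=1).sum(axis=1)

vecs = np.random.randn(1000, 768).astype("float32")
codes = binarize(vecs)  # shape (1000, 96), dtype uint8
query = binarize(np.random.randn(1, 768).astype("float32"))
dists = hamming_distance(codes, query)
```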

### Incremental Index Updates

Handle dynamic data efficiently:

```python
class DynamicVectorIndex:
    def __init__(self, rebuild_threshold=10000):
        self.main_index = build_optimized_index()
        self.delta_index = []  # Recent additions, searched by brute force
        self.rebuild_threshold = rebuild_threshold

    def add(self, vector, metadata):
        self.delta_index.append((vector, metadata))
        if len(self.delta_index) >= self.rebuild_threshold:
            self.rebuild()

    def search(self, query, k):
        main_results = self.main_index.search(query, k)
        delta_results = brute_force_search(self.delta_index, query, k)
        return merge_results(main_results, delta_results, k)

    def rebuild(self):
        all_data = self.main_index.get_all() + self.delta_index
        self.main_index = build_optimized_index(all_data)
        self.delta_index = []
```

## Common Pitfalls to Avoid

- Using exact (flat) search at scale instead of ANN
- Not measuring actual recall and assuming ANN is "good enough"
- Over-indexing metadata fields, slowing down updates
- Ignoring the memory requirements of HNSW indexes
- Not planning for index rebuilds and maintenance windows
- Assuming vector databases handle all use cases (sometimes Elasticsearch is a better fit)
- Forgetting to normalize vectors for cosine similarity (a one-line guard is sketched below)
- Mixing embeddings from different models in the same index
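
On the normalization pitfall specifically: with unit-length vectors, cosine similarity reduces to a dot product, which is what most indexes actually compute. A minimal numpy guard:

```python
import numpy as np

def normalize(vectors, eps=1e-12):
    """Scale each row to unit length so dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)
```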
