- Knowledge Graph Builder
- Build structured knowledge graphs for enhanced AI system performance through relational knowledge.
- Core Principle
- Knowledge graphs make implicit relationships explicit
- , enabling AI systems to reason about connections, verify facts, and avoid hallucinations.
- When to Use Knowledge Graphs
- Use Knowledge Graphs When:
- ✅ Complex entity relationships are central to your domain
- ✅ Need to verify AI-generated facts against structured knowledge
- ✅ Semantic search and relationship traversal required
- ✅ Data has rich interconnections (people, organizations, products)
- ✅ Need to answer "how are X and Y related?" queries
- ✅ Building recommendation systems based on relationships
- ✅ Fraud detection or pattern recognition across connected data
- Don't Use Knowledge Graphs When:
- ❌ Simple tabular data (use relational DB)
- ❌ Purely document-based search (use RAG with vector DB)
- ❌ No significant relationships between entities
- ❌ Team lacks graph modeling expertise
- ❌ Read-heavy workload with no traversal (use traditional DB)
- 6-Phase Knowledge Graph Implementation
- Phase 1: Ontology Design
- Goal
- Define entities, relationships, and properties for your domain Entity Types (Nodes): Person, Organization, Location, Product, Concept, Event, Document Relationship Types (Edges): Hierarchical: IS_A, PART_OF, REPORTS_TO Associative: WORKS_FOR, LOCATED_IN, AUTHORED_BY, RELATED_TO Temporal: CREATED_ON, OCCURRED_BEFORE, OCCURRED_AFTER Properties (Attributes): Node properties: id, name, type, created_at, metadata Edge properties: type, confidence, source, timestamp Example Ontology :
RDF/Turtle format
@prefix : < http://example.org/ontology#
. : Person a owl : Class ; rdfs : label "Person" . : Organization a owl : Class ; rdfs : label "Organization" . : worksFor a owl : ObjectProperty ; rdfs : domain : Person ; rdfs : range : Organization ; rdfs : label "works for" . Validation : Entities cover all domain concepts Relationships capture key connections Ontology reviewed with domain experts Classification hierarchy defined (is-a relationships) Phase 2: Graph Database Selection Decision Matrix : Neo4j (Recommended for most): Pros: Mature, Cypher query language, graph algorithms, excellent visualization Cons: Licensing costs for enterprise, scaling complexity Use when: Complex queries, graph algorithms, team can learn Cypher Amazon Neptune : Pros: Managed service, supports Gremlin and SPARQL, AWS integration Cons: Vendor lock-in, more expensive than self-hosted Use when: AWS infrastructure, need managed service, compliance requirements ArangoDB : Pros: Multi-model (graph + document + key-value), JavaScript queries Cons: Smaller community, fewer graph-specific features Use when: Need document DB + graph in one system TigerGraph : Pros: Best performance for deep traversals, parallel processing Cons: Complex setup, higher learning curve Use when: Massive graphs (billions of edges), real-time analytics Technology Stack : graph_database : 'Neo4j Community'
or Enterprise for production
vector_integration : 'Pinecone'
For hybrid search
embeddings : 'text-embedding-3-large'
OpenAI
etl : 'Apache Airflow'
For data pipelines
- Neo4j Schema Setup
- :
- // Create constraints for uniqueness
- CREATE
- CONSTRAINT
- person_id IF
- NOT
- EXISTS
- FOR
- (
- p
- :
- Person
- )
- REQUIRE
- p
- .
- id
- IS
- UNIQUE
- ;
- CREATE
- CONSTRAINT
- org_name IF
- NOT
- EXISTS
- FOR
- (
- o
- :
- Organization
- )
- REQUIRE
- o
- .
- name
- IS
- UNIQUE
- ;
- // Create indexes for performance
- CREATE
- INDEX
- entity_search IF
- NOT
- EXISTS
- FOR
- (
- e
- :
- Entity
- )
- ON
- (
- e
- .
- name
- ,
- e
- .
- type
- )
- ;
- CREATE
- INDEX
- relationship_type IF
- NOT
- EXISTS
- FOR
- (
- )
- -
- [
- r
- :
- RELATED_TO
- ]
- -
- (
- )
- ON
- (
- r
- .
- type
- ,
- r
- .
- confidence
- )
- ;
- Phase 3: Entity Extraction & Relationship Building
- Goal
- Extract entities and relationships from data sources Data Sources : Structured: Databases, APIs, CSV files Unstructured: Documents, web content, text files Semi-structured: JSON, XML, knowledge bases Entity Extraction Pipeline : class EntityExtractionPipeline : def init ( self ) : self . ner_model = load_ner_model ( )
spaCy, Hugging Face
self . entity_linker = EntityLinker ( ) self . deduplicator = EntityDeduplicator ( ) def process_text ( self , text : str ) -
List [ Entity ] :
1. Extract named entities
entities
self . ner_model . extract ( text )
2. Link to existing entities (entity resolution)
linked_entities
self . entity_linker . link ( entities )
3. Deduplicate and resolve conflicts
resolved_entities
self . deduplicator . resolve ( linked_entities ) return resolved_entities Relationship Extraction : class RelationshipExtractor : def extract_relationships ( self , entities : List [ Entity ] , text : str ) -
List [ Relationship ] : relationships = [ ]
Use dependency parsing or LLM for extraction
doc
self . nlp ( text ) for sent in doc . sents : rels = self . extract_from_sentence ( sent , entities ) relationships . extend ( rels )
Validate against ontology
valid_relationships
- self
- .
- validate_relationships
- (
- relationships
- )
- return
- valid_relationships
- LLM-Based Extraction
- (for complex relationships):
- def
- extract_with_llm
- (
- text
- :
- str
- )
- -
- >
- List
- [
- Relationship
- ]
- :
- prompt
- =
- f"""
- Extract entities and relationships from this text:
- {
- text
- }
- Format: (Entity1, Relationship, Entity2, Confidence)
- Only extract factual relationships.
- """
- response
- =
- llm
- .
- generate
- (
- prompt
- )
- relationships
- =
- parse_llm_response
- (
- response
- )
- return
- relationships
- Validation
- :
- Entity extraction accuracy >85%
- Entity deduplication working
- Relationships validated against ontology
- Confidence scores assigned
- Phase 4: Hybrid Knowledge-Vector Architecture
- Goal
- Combine structured graph with semantic vector search Architecture : class HybridKnowledgeSystem : def init ( self ) : self . graph_db = Neo4jConnection ( ) self . vector_db = PineconeClient ( ) self . embedding_model = OpenAIEmbeddings ( ) def store_entity ( self , entity : Entity ) :
Store structured data in graph
self . graph_db . create_node ( entity )
Store embeddings in vector database
embedding
self . embedding_model . embed ( entity . description ) self . vector_db . upsert ( id = entity . id , values = embedding , metadata = entity . metadata ) def hybrid_search ( self , query : str , top_k : int = 10 ) -
SearchResults :
1. Vector similarity search
query_embedding
self . embedding_model . embed ( query ) vector_results = self . vector_db . query ( vector = query_embedding , top_k = 100 )
2. Graph traversal from vector results
entity_ids
[ r . id for r in vector_results . matches ] graph_results = self . graph_db . get_subgraph ( entity_ids , max_hops = 2 )
3. Merge and rank results
merged
- self
- .
- merge_results
- (
- vector_results
- ,
- graph_results
- )
- return
- merged
- [
- :
- top_k
- ]
- Benefits of Hybrid Approach
- :
- Vector search: Semantic similarity, flexible queries
- Graph traversal: Relationship-based reasoning, context expansion
- Combined: Best of both worlds
- Phase 5: Query Patterns & API Design
- Common Query Patterns
- :
- 1. Find Entity
- :
- MATCH
- (
- e
- :
- Entity
- {
- id
- :
- $entity_id
- }
- )
- RETURN
- e
- 2. Find Relationships
- :
- MATCH
- (
- source
- :
- Entity
- {
- id
- :
- $entity_id
- }
- )
- -
- [
- r
- ]
- -
- (
- target
- )
- RETURN
- source
- ,
- r
- ,
- target
- LIMIT
- 20
- 3. Path Between Entities
- :
- MATCH
- path
- =
- shortestPath
- (
- (
- source
- :
- Person
- {
- id
- :
- $source_id
- }
- )
- -
- [
- *
- ..
- 5
- ]
- -
- (
- target
- :
- Person
- {
- id
- :
- $target_id
- }
- )
- )
- RETURN
- path
- 4. Multi-Hop Traversal
- :
- MATCH
- (
- p
- :
- Person
- {
- name
- :
- $name
- }
- )
- -
- [
- :
- WORKS_FOR
- ]
- ->
- (
- o
- :
- Organization
- )
- -
- [
- :
- LOCATED_IN
- ]
- ->
- (
- l
- :
- Location
- )
- RETURN
- p
- .
- name
- ,
- o
- .
- name
- ,
- l
- .
- city
- 5. Recommendation Query
- :
- // Find people similar to this person based on shared organizations
- MATCH
- (
- p1
- :
- Person
- {
- id
- :
- $person_id
- }
- )
- -
- [
- :
- WORKS_FOR
- ]
- ->
- (
- o
- :
- Organization
- )
- <-
- [
- :
- WORKS_FOR
- ]
- -
- (
- p2
- :
- Person
- )
- WHERE
- p1
- <>
- p2
- RETURN
- p2
- ,
- COUNT
- (
- o
- )
- AS
- shared_orgs
- ORDER
- BY
- shared_orgs
- DESC
- LIMIT
- 10
- Knowledge Graph API
- :
- class
- KnowledgeGraphAPI
- :
- def
- init
- (
- self
- ,
- graph_db
- )
- :
- self
- .
- graph
- =
- graph_db
- def
- find_entity
- (
- self
- ,
- entity_name
- :
- str
- )
- -
- >
- Entity
- :
- """Find entity by name with fuzzy matching"""
- query
- =
- """
- MATCH (e:Entity)
- WHERE e.name CONTAINS $name
- RETURN e
- ORDER BY apoc.text.levenshtein(e.name, $name)
- LIMIT 1
- """
- return
- self
- .
- graph
- .
- run
- (
- query
- ,
- name
- =
- entity_name
- )
- .
- single
- (
- )
- def
- find_relationships
- (
- self
- ,
- entity_id
- :
- str
- ,
- relationship_type
- :
- str
- =
- None
- ,
- max_hops
- :
- int
- =
- 2
- )
- -
- >
- List
- [
- Relationship
- ]
- :
- """Find relationships within specified hops"""
- query
- =
- f"""
- MATCH (source:Entity {{id: $entity_id}})
- MATCH path = (source)-[r*1..
- {
- max_hops
- }
- ]-(target)
- RETURN path, relationships(path) AS rels
- LIMIT 100
- """
- return
- self
- .
- graph
- .
- run
- (
- query
- ,
- entity_id
- =
- entity_id
- )
- .
- data
- (
- )
- def
- get_subgraph
- (
- self
- ,
- entity_ids
- :
- List
- [
- str
- ]
- ,
- max_hops
- :
- int
- =
- 2
- )
- -
- >
- Subgraph
- :
- """Get connected subgraph for multiple entities"""
- query
- =
- f"""
- MATCH (e:Entity)
- WHERE e.id IN $entity_ids
- CALL apoc.path.subgraphAll(e, {{maxLevel:
- {
- max_hops
- }
- }})
- YIELD nodes, relationships
- RETURN nodes, relationships
- """
- return
- self
- .
- graph
- .
- run
- (
- query
- ,
- entity_ids
- =
- entity_ids
- )
- .
- data
- (
- )
- Phase 6: AI Integration & Hallucination Prevention
- Goal
- Use knowledge graph to ground LLM responses and detect hallucinations
Knowledge Graph RAG
:
class
KnowledgeGraphRAG
:
def
init
(
self
,
kg_api
,
llm_client
)
:
self
.
kg
=
kg_api
self
.
llm
=
llm_client
def
retrieve_context
(
self
,
query
:
str
)
-
str :
Extract entities from query
entities
self . extract_entities_from_query ( query )
Retrieve relevant subgraph
subgraph
self . kg . get_subgraph ( [ e . id for e in entities ] , max_hops = 2 )
Format subgraph for LLM
context
self . format_subgraph_for_llm ( subgraph ) return context def generate_with_grounding ( self , query : str ) -
GroundedResponse : context = self . retrieve_context ( query ) prompt = f""" Context from knowledge graph: { context } User query: { query } Answer based only on the provided context. Include source entities. """ response = self . llm . generate ( prompt ) return GroundedResponse ( response = response , sources = self . extract_sources ( context ) , confidence = self . calculate_confidence ( response , context ) ) Hallucination Detection : class HallucinationDetector : def init ( self , knowledge_graph ) : self . kg = knowledge_graph def verify_claim ( self , claim : str ) -
VerificationResult :
Parse claim into (subject, predicate, object)
parsed_claim
self . parse_claim ( claim )
Query knowledge graph for evidence
evidence
self . kg . find_evidence ( parsed_claim . subject , parsed_claim . predicate , parsed_claim . object ) if evidence : return VerificationResult ( is_supported = True , evidence = evidence , confidence = evidence . confidence )
Check for contradictory evidence
contradiction
self . kg . find_contradiction ( parsed_claim ) return VerificationResult ( is_supported = False , is_contradicted = bool ( contradiction ) , contradiction = contradiction ) Key Principles 1. Start with Ontology Define your schema before ingesting data. Changing ontology later is expensive. 2. Entity Resolution is Critical Deduplicate entities aggressively. "Apple Inc", "Apple", "Apple Computer" → same entity. 3. Confidence Scores on Everything Every relationship should have a confidence score (0.0-1.0) and source. 4. Incremental Building Don't try to model entire domain at once. Start with core entities and expand. 5. Hybrid Architecture Wins Combine graph traversal (structured) with vector search (semantic) for best results. Common Use Cases 1. Question Answering : Extract entities from question Traverse graph to find answer Return path as explanation 2. Recommendation : Find similar entities via shared relationships Rank by relationship strength Return top-K recommendations 3. Fraud Detection : Model transactions as graph Find suspicious patterns (cycles, anomalies) Flag for review 4. Knowledge Discovery : Identify implicit relationships Suggest missing connections Validate with domain experts 5. Semantic Search : Hybrid vector + graph search Expand context via relationships Return rich connected results Technology Recommendations For MVPs (<10K entities) : Neo4j Community Edition (free) SQLite for metadata OpenAI embeddings FastAPI for API layer For Production (10K-1M entities) : Neo4j Enterprise or ArangoDB Pinecone for vector search Airflow for ETL GraphQL API For Enterprise (1M+ entities) : Neo4j Enterprise or TigerGraph Distributed vector DB (Pinecone, Weaviate) Kafka for streaming Kubernetes deployment Validation Checklist Ontology designed and validated with domain experts Graph database selected and set up Entity extraction pipeline tested (>85% accuracy) Relationship extraction validated Hybrid search (graph + vector) implemented Query API created and documented AI integration tested (RAG or hallucination detection) Performance benchmarks met (query <100ms for common patterns) Data quality monitoring in place Backup and recovery tested Related Resources