TEGR: Typed-Edge Graph Retrieval for Multi-Hop RAG
A chunk-level approach to GraphRAG, plus a contradiction-detection trick I haven't seen anywhere else.
Most RAG systems retrieve text chunks by how similar they are to the question. That works for simple lookups. It falls apart on multi-hop queries and when sources contradict each other.
I've been building Knowledge Engine — a multi-tenant knowledge base platform — and hit the same wall. The fix I landed on combines three techniques. Two are extensions of published work. One, as far as I can tell, is new.
I'm calling the stack TEGR — Typed-Edge Graph Retrieval.
Where existing approaches sit
- Flat vector RAG: chunks, no relationships. Embedding similarity only. Fails on anything multi-hop.
- Microsoft's GraphRAG: extracts a knowledge graph of entities from documents, runs community summarization. Entity-level, not chunk-level. Strong on global queries, weak on specific passage retrieval.
- SAGE: chunk-level graph with similarity-based edges (document-document, column similarity, entity overlap). No semantic relationship types.
- Anthropic's Contextual Retrieval: enriches chunks with context sentences before embedding. Complementary to TEGR, not competing.
None of them combine chunks as graph nodes with typed semantic edges (contradicts, elaborates, supersedes, example-of, defines, sequence, synthesizes) at retrieval time.
What TEGR does
1. Chunks as nodes. Every page is split into 300-500 token chunks at ingest. Each chunk is a node.
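A minimal sketch of the ingest-time splitter. Whitespace words stand in for real tokens (a production pipeline would count with the embedding model's tokenizer), and the 50-word overlap is my assumption, not something TEGR specifies:

```python
def chunk_page(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split a page into ~size-token chunks; each chunk becomes a graph node.

    Words approximate tokens here. `overlap` keeps sentence context across
    chunk boundaries (assumed parameter, requires overlap < size).
    """
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```
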
2. Typed edges between chunks. Page-level edges are classified by an LLM at ingest (A contradicts B, A elaborates B, etc.). Those types are inherited down to the chunk level by finding the top-K chunk pairs via embedding similarity + keyword overlap. No LLM calls per chunk pair — pure vector math.
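The inheritance step can be sketched like this. The 50/50 blend of cosine similarity and Jaccard keyword overlap is an assumed scoring rule, and `inherit_edges` is an illustrative name, not the production code:

```python
import numpy as np

def keyword_overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two chunks' keyword sets (assumed metric)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def inherit_edges(chunks_a, chunks_b, edge_type: str, top_k: int = 3):
    """Push a page-level edge type down to the top_k best-matching chunk pairs.

    chunks_x: list of (embedding, keyword_set) per chunk.
    Returns (i, j, edge_type, score) tuples — no LLM calls, pure vector math.
    """
    scored = []
    for i, (ea, ka) in enumerate(chunks_a):
        for j, (eb, kb) in enumerate(chunks_b):
            cos = float(np.dot(ea, eb) / (np.linalg.norm(ea) * np.linalg.norm(eb)))
            score = 0.5 * cos + 0.5 * keyword_overlap(ka, kb)  # assumed 50/50 blend
            scored.append((score, i, j))
    scored.sort(reverse=True)
    return [(i, j, edge_type, s) for s, i, j in scored[:top_k]]
```
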
3. Contradiction-inversion scoring. This is the piece I haven't found elsewhere. For contradicts edges:
`score = keyword_overlap × (1.5 - cosine_similarity)`
The insight: chunks that agree have high similarity. Chunks that contradict share topic vocabulary (high keyword overlap) but diverge on claims (lower similarity). Inverting the similarity term surfaces the second case. No specialized training needed.
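As a concrete check of the formula, here it is as a two-line function. A pair that shares vocabulary but diverges on claims (high overlap, lower similarity) outscores a pair that simply agrees:

```python
def contradiction_score(keyword_overlap: float, cosine_similarity: float) -> float:
    """Inverted-similarity scoring for contradicts edges:
    shared topic vocabulary + diverging embeddings => likely contradiction."""
    return keyword_overlap * (1.5 - cosine_similarity)
```
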
4. Similarity-weighted typed-edge graph walk. BFS from seed chunks, propagating scores through:
`new_score = score × decay × type_weight × edge_similarity`
Standard GraphRAG uses type weights. TEGR also multiplies by the actual strength of each specific chunk-pair connection.
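A sketch of the walk. The specific type weights, the 0.7 decay, and the pruning thresholds are placeholder values I've chosen for illustration; only the propagation rule itself comes from the formula above:

```python
from collections import deque

# Placeholder weights per edge type — illustrative, not TEGR's actual values.
TYPE_WEIGHTS = {"contradicts": 1.0, "supersedes": 1.0, "elaborates": 0.9,
                "synthesizes": 0.9, "defines": 0.8, "example-of": 0.7, "sequence": 0.6}

def graph_walk(seeds, edges, decay=0.7, max_hops=2, min_score=0.05):
    """BFS from seed chunks, propagating retrieval scores along typed edges.

    seeds: {chunk_id: score} from vector search.
    edges: {chunk_id: [(neighbor_id, edge_type, edge_similarity)]}.
    new_score = score * decay * type_weight * edge_similarity.
    """
    scores = dict(seeds)
    queue = deque((c, s, 0) for c, s in seeds.items())
    while queue:
        chunk, score, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for nbr, etype, sim in edges.get(chunk, []):
            new_score = score * decay * TYPE_WEIGHTS.get(etype, 0.5) * sim
            # Only follow the edge if it improves on the neighbor's best score.
            if new_score > min_score and new_score > scores.get(nbr, 0.0):
                scores[nbr] = new_score
                queue.append((nbr, new_score, hops + 1))
    return scores
```
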
5. LLM-as-retriever on a vectorized index. Instead of flat cosine search, an LLM reads the top-20 vector-similar index entries and picks 5-8 pages. Combines reasoning with scale.
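A testable sketch of the selection step. The LLM is injected as a plain callable so the surrounding logic stands on its own; the prompt format and the `llm_retrieve` name are assumptions:

```python
def llm_retrieve(question, index_entries, llm, top_n=20, pick=(5, 8)):
    """LLM-as-retriever over a vectorized index.

    index_entries: [(page_id, summary, vector_score)], pre-sorted by vector score.
    llm: callable(prompt) -> list of chosen page_ids (injected for testability).
    The LLM reads the top_n vector-similar entries and picks 5-8 pages.
    """
    candidates = index_entries[:top_n]
    prompt = (f"Question: {question}\n"
              + "\n".join(f"[{pid}] {summary}" for pid, summary, _ in candidates)
              + f"\nPick {pick[0]}-{pick[1]} page ids that best answer the question.")
    chosen = llm(prompt)
    # Guard against hallucinated ids and over-long selections.
    valid = {pid for pid, _, _ in candidates}
    return [pid for pid in chosen if pid in valid][:pick[1]]
```
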
Benchmark
RAGAS on 20 hand-curated questions across factual / multi-hop / synthesis types. Judge: Claude Sonnet 4 (same model as generator — worth disclosing).
| Metric | TEGR | Industry "Good" |
|---|---|---|
| Faithfulness | 0.895 | >0.75 ✅ |
| Context Recall | 0.925 | >0.80 ✅ |
| Context Precision | 0.783 | ~0.70 ✅ |
| Answer Relevancy | 0.586 | ⚠️ See caveat |
| End-to-end latency | 9.3s | — |
Honest caveats
Small N. 20 questions is a real dataset, but it's not CRAG or RAGBench. Running against those is the next step.
Same-model judging. Using Sonnet 4 as both generator and judge likely inflates scores by ~0.05. A cross-model judge pass is on the roadmap.
Answer relevancy below threshold. RAGAS measures this by reverse-generating questions from the answer and scoring embedding similarity. TEGR produces wiki-style answers with typed-relationship context, which a strict reverse-generator scores lower than a terse chatbot-style answer. I tested whether [[wikilink]] syntax caused it: stripping citations moved the score by 0.003, so that isn't the cause. I suspect this is a real characteristic of wiki-style answers, not a retrieval failure.
Prior-art disclosure. I haven't done an exhaustive arXiv search beyond SAGE, GraphRAG, GNN-Ret, SparseCL, and Anthropic's Contextual Retrieval. If similar combinations exist, I'd want to know.
What's next
- Run against standardized benchmarks: CRAG, MultiHop-RAG, LegalBench-RAG
- Cross-model judge validation (GPT-4o)
- Per-customer multi-tenant deployment (architecture is ready, GTM isn't)
Talk to me
Happy to trade notes with anyone building in applied AI or retrieval. Reach out via LinkedIn.