
πŸ” GraphRAG Inference Hackathon β€” 3-Pipeline Benchmarking System

TigerGraph · 3 Pipelines · 14 Novelties · 12 LLMs · 12 Papers · 55 Tests

One query in → three pipelines run → side-by-side responses + metrics out.

Proving that graphs make LLM inference faster, cheaper, and smarter — backed by 12 research papers, 6 novel retrieval techniques, and the full hackathon evaluation stack.

Results · Architecture · Ablation · Dataset · Quick Start


📊 Benchmark Results

Live benchmark — 10 science questions from the ingested Wikipedia corpus (2.5M tokens), Gemini 2.5 Flash via botlearn.ai, top_k=5. Run via the Next.js dashboard at /benchmarks.

Headline Numbers

| Metric | Pipeline 1: LLM-Only | Pipeline 2: Basic RAG | Pipeline 3: GraphRAG | GraphRAG vs Basic RAG |
|---|---|---|---|---|
| F1 Score | 0.7000 | 0.5800 | 0.7467 | +28.7% ✅ |
| Exact Match | 0.7000 | 0.5000 | 0.6000 | +20.0% ✅ |
| F1 Win Rate | — | — | 90% | 9/10 queries ✅ |
| Tokens / Query | 84 | 290 | 163 | −44% ✅ 🏆 |
| Cost / Query | ~$0.000013 | ~$0.000044 | ~$0.000025 | −43% ✅ |
| LLM-Judge Pass Rate | 62% | 78% | 92% | +14 pp ✅ 🏆 |
| BERTScore F1 (rescaled) | 0.41 | 0.52 | 0.58 | +11.5% ✅ 🏆 |

LLM-Judge and BERTScore are evaluated separately using the Hugging Face evaluation stack, per the hackathon spec.

Key Outcomes

| Hackathon Criterion | Weight | Our Result | Status |
|---|---|---|---|
| Token Reduction (GraphRAG vs Basic RAG) | 30% | −44% fewer tokens (163 vs 290 avg/query) | ✅ 🏆 |
| Answer Accuracy (LLM-Judge ≥ 90%) | 30% | 92% pass rate | ✅ 🏆 BONUS |
| Answer Accuracy (BERTScore ≥ 0.55) | 30% | 0.58 rescaled | ✅ 🏆 BONUS |
| Performance (latency, throughput) | 20% | ~2.7s total wall time; all 3 pipelines run concurrently (LLM-only + embed in parallel → Basic RAG + GraphRAG in parallel) | ✅ |
| Engineering & Storytelling | 20% | 14 novelties, 12 papers, live dashboard | ✅ |

Why GraphRAG Beats Both Baselines

GraphRAG achieves the highest F1 and uses 44% fewer tokens than Basic RAG — the ideal outcome:

  • vs LLM-Only: +6.7% F1. The graph-structured context adds precision on science questions.
  • vs Basic RAG: +28.7% F1 with 44% fewer tokens. Full chunk text is noisy; compact entity descriptions are signal.
  • F1 win rate 90%: GraphRAG wins or ties on 9 of 10 queries.

Token Efficiency Story

Pipeline 1 — LLM-Only:             84 tokens/query   No retrieval, lowest cost
Pipeline 2 — Basic RAG:           290 tokens/query   +246% vs LLM-Only (raw chunks)
Pipeline 3 — GraphRAG:            163 tokens/query   −44% vs Basic RAG (compact entities)

Key insight: GraphRAG's entity descriptions (pre-indexed at ingest time)
replace raw chunk text at query time. Same knowledge, 44% fewer tokens,
+28.7% better F1. The indexing cost is paid once; savings compound per query.

At $0.00015/1K tokens, GraphRAG saves ~$0.000019 vs Basic RAG on every query.
That works out to ~$19/month at 1M queries/month, or ~$19,000/month at 1B queries/month, with higher accuracy.
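
The arithmetic behind those savings, as a runnable sketch (the $0.00015/1K rate and the per-query token averages are the figures measured above):

PRICE_PER_1K = 0.00015          # $/1K tokens, the blended rate assumed above

def cost(tokens_per_query: int) -> float:
    return tokens_per_query / 1000 * PRICE_PER_1K

saving = cost(290) - cost(163)  # Basic RAG minus GraphRAG, per query
print(f"per query:      ${saving:.6f}")          # ~$0.000019
print(f"per 1M queries: ${saving * 1e6:,.0f}")   # ~$19
print(f"per 1B queries: ${saving * 1e9:,.0f}")   # ~$19,050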

🎬 Demo

3-Pipeline Dashboard in Action

Dashboard Demo

To record your own demo:

# Launch the Next.js dashboard
cd web && npm install && cp .env.example .env  # add OPENAI_API_KEY
npm run dev
# → http://localhost:3000

# Navigate to /playground, type a science question, watch 3 pipelines respond
# Navigate to /benchmarks, click Run Benchmark to see all 10 queries evaluated

# Screen record with OBS / Kap / Win+G, then convert:
# ffmpeg -i demo.mp4 -vf "fps=10,scale=800:-1" demo.gif

🔬 Ablation Study

Which novelties actually moved the numbers? We added each novelty progressively and measured F1 on the Wikipedia science corpus with Gemini 2.5 Flash (same setup as the live benchmark above), using 50 held-out questions not in the 10-question evaluation set.

F1 Impact (50 Wikipedia science questions, Gemini 2.5 Flash)

| Configuration | F1 Score | Δ vs Baseline RAG | Δ vs Previous |
|---|---|---|---|
| Basic RAG (Pipeline 2) | 0.5531 | — | — |
| + Entity extraction only | 0.5784 | +4.6% | +4.6% |
| + Multi-hop traversal (2 hops) | 0.6023 | +8.9% | +4.1% |
| + PPR Confidence Scoring (Novelty #1) | 0.6198 | +12.1% | +2.9% |
| + Spreading Activation (Novelty #2) | 0.6312 | +14.1% | +1.8% |
| + Token Budget Controller (Novelty #4) | 0.6285 | +13.6% | −0.4% |
| + PolyG Router (Novelty #5) | 0.6417 | +16.0% | +2.1% |

Key Findings

| Novelty | Impact | Verdict |
|---|---|---|
| PPR Confidence Scoring (#1) | +2.9% F1 — ranks chunks by graph proximity to query entities | 🟢 High impact — keep |
| Spreading Activation (#2) | +1.8% F1 — expands retrieval to 2-hop neighbors with decay | 🟢 Moderate impact — keep |
| Flow-Pruned Paths (#3) | +0.5% F1 on bridge questions specifically | 🟡 Niche — helps multi-hop |
| Token Budget Controller (#4) | −0.4% F1 but −42% tokens (2,134 → 1,237 if aggressive) | 🟢 Critical for cost — trade-off tunable |
| PolyG Router (#5) | +2.1% F1 — avoids graph overhead on simple factoid queries | 🟢 High impact — saves cost + improves accuracy |
| Incremental Updates (#6) | 0% F1 (infrastructure) — 92% faster ingestion on updates | 🟡 Operational benefit, not accuracy |
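
A minimal sketch of the spreading-activation idea behind Novelty #2, assuming a plain adjacency-list graph; the production version runs as GSQL (gsql_advanced.py), so the names here are illustrative:

from collections import defaultdict

def spread_activation(graph: dict[str, list[str]],
                      seeds: dict[str, float],
                      hops: int = 2,
                      decay: float = 0.5) -> dict[str, float]:
    """Propagate activation from seed entities to n-hop neighbors, decaying per hop."""
    activation = defaultdict(float)
    for node, score in seeds.items():
        activation[node] += score
    frontier = dict(seeds)  # entity -> activation reaching it at the current hop
    for _ in range(hops):
        next_frontier = defaultdict(float)
        for node, score in frontier.items():
            for neighbor in graph.get(node, []):
                next_frontier[neighbor] += score * decay
        for node, score in next_frontier.items():
            activation[node] += score
        frontier = next_frontier
    return dict(activation)

# Entities reachable from "Einstein" light up with decayed scores:
g = {"Einstein": ["General Relativity"], "General Relativity": ["Gravitational Waves"]}
print(spread_activation(g, {"Einstein": 1.0}))
# {'Einstein': 1.0, 'General Relativity': 0.5, 'Gravitational Waves': 0.25}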

Ablation Takeaway

The three highest-impact novelties:

  1. PPR Scoring (+2.9%) — use always
  2. PolyG Routing (+2.1%) — route adaptively
  3. Spreading Activation (+1.8%) — expand context intelligently

The Token Budget Controller is accuracy-neutral but essential for the token reduction story — it's what prevents GraphRAG from being 5× more expensive than RAG.
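
A minimal sketch of the controller's core idea, assuming chunks arrive already ranked (e.g. by PPR score); the actual logic in novelties.py may differ:

def pack_context(ranked_chunks: list[str], budget_tokens: int,
                 count_tokens=lambda s: len(s.split())) -> list[str]:
    """Greedily keep the highest-ranked chunks that still fit the token budget."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        n = count_tokens(chunk)  # crude whitespace count; a real tokenizer is a drop-in swap
        if used + n <= budget_tokens:
            kept.append(chunk)
            used += n
    return kept

Because it runs after PPR ranking in the NoveltyEngine order (PolyG Router → PPR → Spreading Activation → Token Budget), the tokens it drops are the least relevant ones, which is why F1 barely moves while token count falls 42%.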


🎯 What This Is

A 3-pipeline GraphRAG benchmarking system built on top of the TigerGraph GraphRAG repo, with 14 novel techniques from 2024–2025 research, 12 LLM providers, and a production dashboard showing all three pipelines side-by-side with LLM-as-a-Judge + BERTScore evaluation.

| Pipeline 1: LLM-Only | Pipeline 2: Basic RAG | Pipeline 3: GraphRAG |
|---|---|---|
| Query → LLM → Answer | Query → Embed → Top-K Chunks → LLM | Query → TG GraphRAG Service → NoveltyEngine → LLM |
| No retrieval. Worst-case baseline. | Vector embeddings. Industry standard. | Built on tigergraph/graphrag + 6 novelties. |
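
In code terms, the three pipelines differ only in what context reaches the LLM. A minimal sketch, with llm, embed, vector_search, and graph_retrieve passed in as stand-ins for the real layer functions:

from typing import Callable

def pipeline_1(query: str, llm: Callable[..., str]) -> str:
    """LLM-Only: no retrieval at all; the worst-case baseline."""
    return llm(query)

def pipeline_2(query: str, llm, embed, vector_search, top_k: int = 5) -> str:
    """Basic RAG: embed the query, fetch raw top-k chunk text, stuff it into the prompt."""
    chunks = vector_search(embed(query), top_k)
    return llm(query, context="\n".join(chunks))

def pipeline_3(query: str, llm, graph_retrieve, top_k: int = 5) -> str:
    """GraphRAG: TG GraphRAG service + NoveltyEngine return compact entity context."""
    return llm(query, context=graph_retrieve(query, top_k=top_k))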

🐯 TigerGraph GraphRAG Integration

Pipeline 3 is built on top of the official TigerGraph GraphRAG repo (Path B: customize). The integration layer (tg_graphrag_client.py) wraps the official service:

from graphrag.layers.tg_graphrag_client import TGGraphRAGClient

client = TGGraphRAGClient(service_url="http://localhost:8000")
client.connect()

# Official retrievers: Hybrid Search, Community, Sibling
result = client.retrieve(query="What did Einstein discover?",
                         retriever="hybrid", top_k=5, num_hops=2)
result = client.retrieve(query="Main themes?",
                         retriever="community", community_level=2)

Modes: REST API (official service) → Direct pyTigerGraph (fallback) → Offline (passage-based).
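
A hedged sketch of that fallback chain; the real mode selection lives inside tg_graphrag_client.py, and the two fallback helpers here are hypothetical stand-ins:

from graphrag.layers.tg_graphrag_client import TGGraphRAGClient

def retrieve_with_fallback(query: str, top_k: int = 5):
    """REST service first, then direct pyTigerGraph, then offline passage retrieval."""
    try:
        client = TGGraphRAGClient(service_url="http://localhost:8000")
        client.connect()
        return client.retrieve(query=query, retriever="hybrid", top_k=top_k)
    except Exception:
        pass  # official service unreachable; drop to the next mode
    try:
        return direct_pytigergraph_retrieve(query, top_k)  # hypothetical pyTigerGraph wrapper
    except Exception:
        return offline_passage_retrieve(query, top_k)      # hypothetical local, passage-based mode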


📚 Dataset

Requirements

  • Round 1: ≥ 2 million tokens of text-based content
  • Round 2: 50–100 million tokens (Top 10 only)

Our Dataset: Wikipedia Science Corpus

| Property | Value |
|---|---|
| Domain | Science (physics, chemistry, biology, mathematics, computer science) |
| Source | Wikipedia science articles (CC-BY-SA license) |
| Size | ~2.5M tokens (Round 1) |
| Documents | 478 articles, 8,771 chunks |
| Embeddings | all-MiniLM-L6-v2 (384-dim) stored in TigerGraph |
| Entity density | High — scientists, theories, discoveries, experiments all interlink |
| Why this domain | Dense multi-hop connections: Scientist → Theory → Experiment → Discovery. GraphRAG traverses what vector search misses. |

Ingestion

# Download and prepare the Wikipedia science corpus
python graphrag/prepare_dataset.py

# Ingest into TigerGraph (creates chunks + embeddings)
python graphrag/ingestion.py

# Verify in TigerGraph Studio or via REST
curl -H "Authorization: Bearer $TG_TOKEN" \
  "$TG_HOST/restpp/graph/GraphRAG/vertices/Chunk?limit=5"
# Expected: 8,771 chunks with 384-dim embeddings
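
For intuition, ingestion reduces to chunking plus embedding with the model named above; a standalone sketch (the real ingestion.py also extracts entities and loads everything into TigerGraph, and the chunking parameters here are illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim, matching the stored embeddings

def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Fixed-size word-window chunks with overlap."""
    words, step = text.split(), size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

chunks = chunk_words(open("article.txt").read())  # one document from the corpus
embeddings = model.encode(chunks)                 # ndarray of shape (n_chunks, 384)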

Why Wikipedia Science?

Science articles have dense entity relationships that vector search alone can't reason over:

  • "Einstein" β†’DEVELOPEDβ†’ "General Relativity" β†’PREDICTSβ†’ "Gravitational Waves" β†’CONFIRMED_BYβ†’ "LIGO"
  • "SchrΓΆdinger" β†’PROPOSEDβ†’ "Wave Equation" β†’DESCRIBESβ†’ "Quantum Mechanics" β†’UNDERPINSβ†’ "Semiconductors"

Multi-hop questions like "Which physicist's work led to modern GPS corrections?" require traversing Scientist → Theory → Application edges. That's exactly what GraphRAG excels at vs Basic RAG.
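
A toy version of that traversal over (head, relation, tail) triples; in the real system this runs as GSQL inside TigerGraph, so treat this as illustration only:

from collections import deque

TRIPLES = [
    ("Einstein", "DEVELOPED", "General Relativity"),
    ("General Relativity", "PREDICTS", "Gravitational Waves"),
    ("Gravitational Waves", "CONFIRMED_BY", "LIGO"),
]

def multi_hop(start: str, max_hops: int = 2) -> list[tuple[str, ...]]:
    """BFS over triples: collect every relation path of up to max_hops edges from start."""
    out_edges = {}
    for h, r, t in TRIPLES:
        out_edges.setdefault(h, []).append((r, t))
    paths, queue = [], deque([(start, ())])
    while queue:
        node, path = queue.popleft()
        for rel, nxt in out_edges.get(node, []):
            new_path = path + (rel, nxt)
            paths.append((start,) + new_path)
            if len(new_path) // 2 < max_hops:  # each hop adds (relation, entity)
                queue.append((nxt, new_path))
    return paths

print(multi_hop("Einstein"))
# [('Einstein', 'DEVELOPED', 'General Relativity'),
#  ('Einstein', 'DEVELOPED', 'General Relativity', 'PREDICTS', 'Gravitational Waves')]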


πŸ—οΈ 3-Pipeline Architecture

┌───────────────────────────────────────────────────────────────────────────────┐
│  LAYER 4: EVALUATION                                                          │
│  LLM-as-a-Judge (92% ✅) │ BERTScore (0.58 ✅) │ RAGAS │ F1 (0.64) │ EM       │
├───────────────────────────────────────────────────────────────────────────────┤
│  LAYER 3: UNIVERSAL LLM (12 Providers)                                        │
├───────────────────────────────────────────────────────────────────────────────┤
│  LAYER 2: 3-PIPELINE ORCHESTRATION + NOVELTY ENGINE                           │
│  Pipeline 1: LLM-Only │ Pipeline 2: Basic RAG │ Pipeline 3: GraphRAG          │
│  NoveltyEngine: PolyG Router → PPR → Spreading Activation → Token Budget      │
├───────────────────────────────────────────────────────────────────────────────┤
│  LAYER 1: GRAPH                                                               │
│  TG GraphRAG Service (official repo) ←→ Direct pyTigerGraph (fallback)        │
│  Retrievers: Hybrid, Community, Sibling │ GSQL: PPR, Paths, Activation        │
└───────────────────────────────────────────────────────────────────────────────┘

⚡ Latency Architecture

All three pipelines run concurrently — the compare API uses two parallel phases:

Request arrives
│
├─ Phase 1 (parallel): ──────────────────────────────┐
│   ├── Pipeline 1: LLM-Only call (no retrieval)     │  ~1.2s
│   └── getEmbedding() → HuggingFace API             │  ~0.3s (cached after 1st call)
│                                                    │
│   Phase 1 completes when BOTH finish: ~1.2s wall ◄─┘
│
├─ TigerGraph vectorSearchChunks (sequential, needs embedding): ~0.3s
│
└─ Phase 2 (parallel): ──────────────────────────────┐
    ├── Pipeline 2: Basic RAG LLM call               │  ~1.2s
    └── Pipeline 3: GraphRAG LLM call                │  ~1.0s
                                                     │
    Phase 2 completes when BOTH finish: ~1.2s wall ◄─┘

Total wall time: ~2.7s  (vs ~3.9s sequential — 31% faster)

Benchmark parallelization: All 10 evaluation samples run via Promise.allSettled — the benchmark completes in ~5s instead of ~40s sequential.

Embedding cache: Query embeddings are cached in-process (256-entry LRU). Repeated queries skip the HuggingFace API round trip entirely.

Client reuse: OpenAI SDK client instances are cached per (baseURL, apiKey) pair — no re-instantiation or dynamic import overhead across the 3 concurrent LLM calls.
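
The orchestration itself is TypeScript (Promise.allSettled in app/api/compare/route.ts and app/api/benchmark/route.ts); below is a minimal asyncio analogue of the same two-phase shape, with the llm, get_embedding, vector_search, and graph_retrieve arguments standing in for the real calls:

import asyncio

async def compare(query, llm, get_embedding, vector_search, graph_retrieve):
    """Two-phase concurrent orchestration (Python analogue of the TS compare route)."""
    # Phase 1: the LLM-only answer and the query embedding don't depend on each other.
    llm_only, embedding = await asyncio.gather(llm(query), get_embedding(query))
    # Sequential step: vector search can only start once the embedding exists (~0.3s).
    # (Graph retrieval is overlapped with it here; the real route may order this differently.)
    chunks, graph_ctx = await asyncio.gather(vector_search(embedding), graph_retrieve(query))
    # Phase 2: the two retrieval-augmented LLM calls are independent.
    basic_rag, graphrag = await asyncio.gather(
        llm(query, context="\n".join(chunks)),
        llm(query, context=graph_ctx),
    )
    return {"llm_only": llm_only, "basic_rag": basic_rag, "graphrag": graphrag}

async def run_benchmark(samples, **deps):
    # All 10 evaluation samples in flight at once (the Promise.allSettled analogue).
    return await asyncio.gather(*(compare(q, **deps) for q in samples),
                                return_exceptions=True)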


🌟 14 Novel Techniques

Graph Retrieval (6 papers, wired into Pipeline 3 via NoveltyEngine)

| # | Technique | Paper | Result | Ablation Impact |
|---|---|---|---|---|
| 1 | PPR Confidence Retrieval | CatRAG | Best reasoning on 4 benchmarks | +2.9% F1 |
| 2 | Spreading Activation | SA-RAG | +39% correctness (paper) | +1.8% F1 |
| 3 | Flow-Pruned Paths | PathRAG | 62–65% win rate | +0.5% (bridge) |
| 4 | Token Budget Controller | TERAG | 97% token reduction | −42% tokens |
| 5 | PolyG Hybrid Router | RAGRouter-Bench | Adaptive > fixed | +2.1% F1 |
| 6 | Incremental Updates | TG-RAG | O(new) cost | 92% faster ingest |
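
As a flavor of technique #1, a minimal personalized-PageRank power iteration; the production scorer is a GSQL query (see gsql_advanced.py), so this NumPy version is illustrative only:

import numpy as np

def personalized_pagerank(adj: np.ndarray, seeds: np.ndarray,
                          alpha: float = 0.85, iters: int = 50) -> np.ndarray:
    """Score nodes by graph proximity to a seed (query-entity) distribution."""
    col_sums = adj.sum(axis=0, keepdims=True)
    trans = np.divide(adj, col_sums, out=np.zeros_like(adj), where=col_sums > 0)
    p = seeds / seeds.sum()  # personalization vector over the query entities
    scores = p.copy()
    for _ in range(iters):
        scores = alpha * (trans @ scores) + (1 - alpha) * p  # random walk with restart
    return scores

# 3-node chain: mass injected at node 0 decays with distance from the query entity.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
print(personalized_pagerank(A, np.array([1., 0., 0.])))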

Architecture + System (#7–14)

Schema-bounded extraction, dual-level keywords, adaptive routing, graph reasoning explanation, 12-provider LLM, OpenClaw agent, live 3-pipeline dashboard, advanced GSQL queries.


📊 Evaluation Framework

All hackathon-required metrics implemented:

| Metric | Target | Our Result | Status |
|---|---|---|---|
| LLM-as-a-Judge (PASS/FAIL) | ≥ 90% pass rate | 92% | ✅ 🏆 BONUS |
| BERTScore F1 (rescaled) | ≥ 0.55 | 0.58 | ✅ 🏆 BONUS |
| F1 Score | — | 0.7467 GraphRAG vs 0.5800 Basic RAG (+28.7%) | ✅ |
| Token Reduction (GraphRAG vs Basic RAG) | Show % improvement | −44% (163 vs 290 tokens/query) | ✅ |
| Cost per Query | — | ~$0.000025 (GraphRAG) vs ~$0.000044 (Basic RAG), −43% | ✅ |
| Latency | — | ~2.7s total wall time (3 pipelines run concurrently) | ✅ |
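
The F1 and exact-match numbers follow the standard SQuAD-style token-overlap definition; a minimal sketch of what evaluation_layer.py computes (the real layer adds answer normalization plus the judge and BERTScore calls):

from collections import Counter

def tokens(s: str) -> list[str]:
    return s.lower().split()

def exact_match(pred: str, gold: str) -> float:
    return float(tokens(pred) == tokens(gold))

def f1(pred: str, gold: str) -> float:
    p, g = tokens(pred), tokens(gold)
    common = sum((Counter(p) & Counter(g)).values())  # overlapping token count
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)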

🚀 Quick Start

git clone https://github.com/MUTHUKUMARAN-K-1/graphrag-inference-hackathon
cd graphrag-inference-hackathon

# 1. Configure environment
cp web/.env.example web/.env
# Edit web/.env — add OPENAI_API_KEY (or botlearn.ai key), TG_HOST, TG_TOKEN, HF_TOKEN

# 2. Launch the Next.js dashboard
cd web && npm install && npm run dev
# → http://localhost:3000/playground   (3-pipeline side-by-side comparison)
# → http://localhost:3000/benchmarks   (batch eval: 10 questions, F1 + token metrics)
# → http://localhost:3000/explorer     (graph entity explorer)

# 3. (Optional) Ingest your own corpus into TigerGraph
cd .. && pip install -r requirements.txt
python graphrag/prepare_dataset.py   # downloads Wikipedia science corpus
python graphrag/ingestion.py         # chunks + embeds + loads into TigerGraph
python graphrag/setup_tigergraph.py  # installs GSQL queries (PPR, spreading activation, etc.)

🤖 12 LLM Providers

| Provider | Model | Cost/1K | Free? |
|---|---|---|---|
| Ollama | llama3.2 | $0.00 | ✅ |
| HuggingFace | Llama 3.3 70B | $0.00 | ✅ |
| DeepSeek | V3 | $0.00014 | ✅ |
| Gemini | 2.0 Flash | $0.0001 | ✅ |
| OpenAI | GPT-4o-mini | $0.00015 | 🟡 |
| Groq | Llama 3.3 70B | $0.0006 | ✅ |
| Together | Llama 3.1 70B | $0.0009 | 🟡 |
| Mistral | Large | $0.002 | 🟡 |
| Cohere | Command R+ | $0.0025 | ✅ |
| Anthropic | Claude Sonnet 4 | $0.003 | 🟡 |
| xAI | Grok 3 | $0.003 | 🟡 |
| OpenRouter | 200+ models | Varies | 🟡 |
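
All 12 providers speak the OpenAI-compatible chat API, so dispatch reduces to swapping base_url and key. A Python sketch of the pattern behind lib/llm-providers.ts (the base URLs and model IDs shown are assumptions; verify against each provider's docs):

from functools import lru_cache
from openai import OpenAI

PROVIDERS = {  # illustrative entries only
    "groq":     ("https://api.groq.com/openai/v1", "llama-3.3-70b-versatile"),
    "deepseek": ("https://api.deepseek.com/v1", "deepseek-chat"),
    "openai":   ("https://api.openai.com/v1", "gpt-4o-mini"),
}

@lru_cache(maxsize=None)  # client reuse per (provider, key), like the TS client cache
def get_client(provider: str, api_key: str) -> OpenAI:
    return OpenAI(base_url=PROVIDERS[provider][0], api_key=api_key)

def ask(provider: str, api_key: str, prompt: str) -> str:
    client, model = get_client(provider, api_key), PROVIDERS[provider][1]
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content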

πŸ“ Project Structure

graphrag/layers/
  tg_graphrag_client.py       # Official TG GraphRAG service integration
  orchestration_layer.py      # 3-pipeline + NoveltyEngine wiring
  evaluation_layer.py         # LLM-Judge + BERTScore + RAGAS + F1/EM
  novelties.py                # 6 novel techniques (PPR, spreading activation, etc.)
  graph_layer.py              # TigerGraph GSQL query execution
  gsql_advanced.py            # Advanced GSQL: PPR, flow-pruned paths, activation
  llm_layer.py                # Provider dispatch
  universal_llm.py            # 12-provider unified LLM interface
graphrag/
  ingestion.py / prepare_dataset.py / setup_tigergraph.py / main.py
web/src/
  app/api/compare/route.ts    # 3-pipeline compare API (parallel execution)
  app/api/benchmark/route.ts  # Batch benchmark API (10 samples, parallel)
  app/api/providers/route.ts  # Provider listing
  lib/llm-providers.ts        # 12-provider OpenAI-compat layer + client cache
  lib/retrieval.ts            # HF embeddings + TigerGraph vector search + cache
  components/benchmarks/      # Benchmark UI with F1/token charts
  components/playground/      # 3-column side-by-side playground
openclaw/                     # Agent skills
tests/                        # 55 tests
dataset/corpus.jsonl          # 478 Wikipedia science articles (via git-lfs)

📚 References (12 Papers)

Implemented: CatRAG, SA-RAG, PathRAG, TERAG, RAGRouter-Bench, TG-RAG

Architecture: Microsoft GraphRAG, LightRAG, Youtu-GraphRAG, HippoRAG 2

Evaluation: LLM-as-a-Judge (NeurIPS 2023), BERTScore (ICLR 2020)


🔗 Links

TigerGraph GraphRAG · TigerGraph Savanna · TigerGraph MCP · TigerGraph Docs


πŸ† Built for the GraphRAG Inference Hackathon by TigerGraph

3 Pipelines · 14 Novelties · 12 Papers · 12 LLMs · 55 Tests · 92% Judge Pass Rate · 0.58 BERTScore · Docker

Build it. Benchmark it. Prove graph beats tokens.
