Project Overview
LegalRAG combines no-OCR multimodal indexing, knowledge-graph retrieval, hybrid search, and a RAGAS-gated evaluation harness into a single legal-document intelligence system.
Objective
Index a contract corpus with ColPali, populate a Neo4j knowledge graph from extracted entities, and build a routed hybrid retrieval pipeline gated by RAGAS faithfulness scores.
Scope
- ColPali multimodal page indexing (no OCR).
- A Neo4j knowledge graph with nodes for contracts, parties, clauses, obligations, dates, and amounts.
- Hybrid retrieval that fuses BM25 and dense vector search with Reciprocal Rank Fusion (RRF).
- Cross-encoder reranking over fused candidates.
- Adaptive query routing across ColPali, BM25, Neo4j, and the fused stack.
Datasets
- Commercial-contract corpora with clause categories.
- Optional multi-jurisdiction legal text.
Stack
- A multi-vector index (Qdrant) with MaxSim late-interaction scoring.
- BM25 sparse retrieval (Elasticsearch).
- A graph database (Neo4j) with LLM + Pydantic entity extraction.
- Cross-encoder reranker.
- PII masking and input/output guardrails.
- FastAPI + LangChain LCEL + Docker for serving.
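The MaxSim scoring named in the stack can be illustrated without any vector database. The sketch below assumes each query and each page is represented as a list of small embedding vectors. The two-dimensional vectors and the function names are made up for the example. In the actual stack, this computation would be delegated to the multi-vector index.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: for every query vector, take its best match
    among the page vectors, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy embeddings (assumed values): two query token vectors, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.3, 0.3], [0.4, 0.1]]

ranked = sorted(
    [("page_a", maxsim_score(query, page_a)), ("page_b", maxsim_score(query, page_b))],
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)
```

Unlike single-vector search, every query vector is matched independently, so a page can score well when different patches of it answer different parts of the query.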
Evaluation
- RAGAS metrics: faithfulness, answer relevancy, context precision, and context recall.
- A faithfulness gate: the pipeline passes only if its RAGAS faithfulness score on the golden QA pairs meets the required threshold.
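One way the gate could be wired into a test or CI step is sketched below. It assumes per-question faithfulness scores have already been produced by a RAGAS run over the golden QA pairs. The threshold value, the sample scores, and the function name are placeholders, since this overview does not state them.

```python
from statistics import mean

# Hypothetical threshold; the real gate value is not specified in this overview.
GATE_THRESHOLD = 0.85


def passes_faithfulness_gate(scores, threshold=GATE_THRESHOLD):
    """Return (passed, average) for a list of per-question faithfulness scores."""
    if not scores:
        raise ValueError("no faithfulness scores to evaluate")
    average = mean(scores)
    return average >= threshold, average


# Placeholder scores standing in for the output of an evaluation run.
example_scores = [0.9, 0.8, 1.0]
passed, average = passes_faithfulness_gate(example_scores)
print(f"mean faithfulness = {average:.2f}, gate passed = {passed}")
```

Gating on the mean is only one possible policy. A stricter variant might also require every individual score to clear a lower floor.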
Deliverables
- A corpus indexed across Qdrant, Elasticsearch (BM25), and Neo4j.
- Operational hybrid retrieval with RRF and cross-encoder reranking.
- A working adaptive query router.
- Integrated PII masking and guardrails.
- A RAGAS report meeting the faithfulness gate.
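The adaptive query router among the deliverables could start as a simple rule-based dispatcher before moving to a learned or LLM-based classifier. The sketch below is only a starting point under stated assumptions: the route names mirror the four back ends listed in the scope, while the keyword patterns and the order in which they are checked are invented for illustration.

```python
import re
from enum import Enum


class Route(Enum):
    COLPALI = "colpali"  # visual / layout questions about page images
    BM25 = "bm25"        # exact terms, quoted phrases, clause numbers
    NEO4J = "neo4j"      # relationship questions over the knowledge graph
    FUSED = "fused"      # default: hybrid retrieval with fusion and reranking


# Hypothetical heuristics; a real router would be tuned on logged queries.
VISUAL = re.compile(r"\b(table|figure|signature|stamp|chart)\b")
GRAPH = re.compile(r"\b(which parties|who is|obligations of|between)\b")
EXACT = re.compile(r'"[^"]+"|\b(clause|section)\s+\d+(\.\d+)*')


def route_query(query: str) -> Route:
    """Pick a retrieval back end from surface features of the query."""
    q = query.lower()
    if VISUAL.search(q):
        return Route.COLPALI
    if GRAPH.search(q):
        return Route.NEO4J
    if EXACT.search(q):
        return Route.BM25
    return Route.FUSED


for example in [
    "Show the signature block on the last page",
    "Which parties owe obligations to Acme?",
    'Find "force majeure" in section 12.3',
    "What are the termination rights?",
]:
    print(f"{route_query(example).value:8s} <- {example}")
```

Falling back to the fused stack whenever no rule fires keeps the router safe by default: a misrouted query costs some latency rather than recall.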
Prerequisites
Modules 14–19 (embeddings, RAG basics, advanced RAG, quantization & multimodal RAG, graph/caching/security).