Advanced Hybrid Search and Reranking for Agentic RAG

Plain vector search often falls short on financial and technical documents. Exact numbers, fiscal quarters, and company names can get blurred when text is compressed into a dense embedding. So we use a hybrid search strategy instead. It combines four things: dense embeddings, sparse term-frequency vectors (BM25), metadata filters, and Cross-Encoder reranking. Together they make retrieval more precise than dense search alone.

In this post, we build this retrieval pipeline with Qdrant, LangChain, and a Cross-Encoder reranker.

Dense, Sparse, and Hybrid Retrieval

A strong RAG system indexes the same chunks in more than one way:

Dense retrieval: encodes each text chunk into a long vector (e.g., 3072 numbers). This is ideal for search by meaning and for matching synonyms (e.g., matching "cash on hand" to "liquidity").
Sparse retrieval: represents text by weighted term frequencies (like BM25). This is ideal for matching exact terms, model names, numbers, or unique IDs (e.g., matching "Apple Q1 2024").
Hybrid retrieval: merges the sparse and dense result lists with a rank-fusion method such as Reciprocal Rank Fusion (RRF). A chunk that ranks well in either list can surface, and a chunk that ranks well in both rises to the top.
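To make the fusion step concrete, here is a minimal pure-Python sketch of Reciprocal Rank Fusion. It is an illustration only, not the code Qdrant runs internally. The function name, the chunk IDs, and the two input lists are made up for the example, and k=60 is simply the constant commonly used with RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents that rank well in several lists accumulate more score.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to keep the order stable
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical result lists from a dense and a sparse retriever
dense_hits = ["chunk_a", "chunk_b", "chunk_c"]
sparse_hits = ["chunk_c", "chunk_a", "chunk_d"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Here chunk_a and chunk_c appear in both lists, so they outrank chunk_b and chunk_d, which each appear in only one. RRF uses ranks rather than raw scores, which is why it can merge dense and sparse results even though their scores are on different scales.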

Let's connect to the existing Qdrant collection in hybrid mode:

PYTHON

from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_qdrant import QdrantVectorStore, RetrievalMode, FastEmbedSparse

load_dotenv()

COLLECTION_NAME = "financial_docs"

# Configure dense and sparse embedding models
dense_embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")
sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25")

# Open connection to the existing Qdrant collection
vector_store = QdrantVectorStore.from_existing_collection(
    embedding=dense_embeddings,
    sparse_embedding=sparse_embeddings,
    collection_name=COLLECTION_NAME,
    url="http://localhost:6333",
    retrieval_mode=RetrievalMode.HYBRID
)

Extracting Query Metadata Filters with LLMs

When a question targets one filing, we must avoid pulling chunks from unrelated years or companies. So we extract metadata filters straight from the user's plain-English query.

Gemini parses the natural-language query into a structured Pydantic object, which supplies the filter parameters for Qdrant.

Define the schema in scripts/schema.py:

PYTHON

from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field

class DocType(str, Enum):
    TEN_K = "10-k"
    TEN_Q = "10-q"
    EIGHT_K = "8-k"

class FiscalQuarter(str, Enum):
    Q1 = "q1"
    Q2 = "q2"
    Q3 = "q3"
    Q4 = "q4"

class ChunkMetadata(BaseModel):
    company_name: Optional[str] = Field(default=None, description="Company name (lowercase, e.g. 'amazon', 'apple')")
    doc_type: Optional[DocType] = Field(default=None, description="Document type (10-k, 10-q, 8-k)")
    fiscal_year: Optional[str] = Field(default=None, description="Fiscal year (e.g. '2024')")
    fiscal_quarter: Optional[FiscalQuarter] = Field(default=None, description="Fiscal quarter (q1-q4)")

    model_config = {"use_enum_values": True}

Construct the metadata filter parser:

PYTHON

# Assumes the schema above is importable as scripts/schema.py
from scripts.schema import ChunkMetadata

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")
# Bind the Pydantic schema once so every call returns a ChunkMetadata object
structured_llm = llm.with_structured_output(ChunkMetadata)

def extract_filters(user_query: str) -> dict:
    """Return only the metadata fields that the query mentions."""
    prompt = f"""
    Extract metadata filters from the query. Return None for fields not mentioned.

    <USER_QUERY>
    {user_query}
    </USER_QUERY>

    #### GUIDELINES
    COMPANY MAPPINGS:
    - Amazon/AMZN -> amazon
    - Google/Alphabet/GOOGL/GOOG -> google
    - Apple/AAPL -> apple

    DOC TYPE:
    - Annual report -> 10-k
    - Quarterly report -> 10-q

    EXAMPLES:
    "Amazon Q3 2024 revenue" -> {{"company_name": "amazon", "doc_type": "10-q", "fiscal_year": "2024", "fiscal_quarter": "q3"}}
    "Apple 2023 annual report" -> {{"company_name": "apple", "doc_type": "10-k", "fiscal_year": "2023"}}

    Extract metadata based on the user query only:
    """
    metadata = structured_llm.invoke(prompt)

    # Drop unset fields so they do not become filter conditions
    if metadata:
        return metadata.model_dump(exclude_none=True)
    return {}

# Verify filter parsing logic
print(extract_filters("What was Amazon's profit in Q1 2023?"))

OUTPUT

{'company_name': 'amazon', 'doc_type': '10-q', 'fiscal_year': '2023', 'fiscal_quarter': 'q1'}

Dynamic Qdrant Metadata Filtering

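We turn the extracted metadata dictionary into typed Qdrant filter objects. Each field becomes a `must` condition, so a chunk has to match every one of them. Qdrant applies the filter during the search, so chunks from other companies or periods never enter the candidate set.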

PYTHON

from qdrant_client.models import Filter, FieldCondition, MatchValue

def hybrid_search(query: str, k: int = 5):
    filters = extract_filters(query)
    qdrant_filter = None

    if filters:
        conditions = [
            FieldCondition(key=f"metadata.{key}", match=MatchValue(value=value))
            for key, value in filters.items()
        ]
        qdrant_filter = Filter(must=conditions)

    # Execute dynamic filtered vector search
    results = vector_store.similarity_search(query=query, k=k, filter=qdrant_filter)
    return results

# Test execution with target metadata constraints
results = hybrid_search("What is Amazon's cash flow in Q1 2024?", k=3)
for idx, doc in enumerate(results):
    print(f"[{idx}] Source: {doc.metadata['source_file']} (Page {doc.metadata['page']})")

OUTPUT

[0] Source: amazon 10-q q1 2024.md (Page 28)
[1] Source: amazon 10-q q1 2024.md (Page 26)
[2] Source: amazon 10-q q1 2024.md (Page 12)
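The `must` semantics are easy to check by hand. The sketch below mimics them in plain Python on a few hypothetical chunk payloads; `matches_all` and the sample data are made up for illustration, and Qdrant evaluates the real filter on the server.

```python
def matches_all(payload: dict, filters: dict) -> bool:
    """Return True only if every filter key equals the payload value (AND semantics)."""
    return all(payload.get(key) == value for key, value in filters.items())

# Hypothetical chunk payloads, shaped like the fields in ChunkMetadata
chunks = [
    {"company_name": "amazon", "doc_type": "10-q", "fiscal_year": "2024", "fiscal_quarter": "q1"},
    {"company_name": "amazon", "doc_type": "10-k", "fiscal_year": "2023"},
    {"company_name": "apple", "doc_type": "10-q", "fiscal_year": "2024", "fiscal_quarter": "q1"},
]

# Filters as extract_filters might return them for an Amazon Q1 2024 question
filters = {"company_name": "amazon", "fiscal_year": "2024", "fiscal_quarter": "q1"}

candidates = [chunk for chunk in chunks if matches_all(chunk, filters)]
print(candidates)
```

Only the Amazon Q1 2024 chunk survives. Note the flip side: a chunk that lacks a filtered field is excluded too, so an over-eager extraction (for example, a guessed quarter) can hide valid documents. An empty filter dictionary matches everything, which is why `hybrid_search` falls back to an unfiltered search when nothing is extracted.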

Cross-Encoder Reranking

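Embedding models are great at finding candidate chunks, but less precise at ordering them. A Bi-Encoder embeds the query and each document separately and compares the two vectors afterwards. A Cross-Encoder instead reads the query and the document together and scores the pair directly, so it can pick up interactions between their words that separate embeddings miss. The cost is speed: it needs one model pass per pair, which is why we run it only on the short candidate list.

Candidate documents retrieved from the vector database are re-scored by a Cross-Encoder model to select the top matches.

Running a Cross-Encoder after retrieval typically improves the ordering of the final results, and with it the context the LLM sees.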

PYTHON

import torch
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

RERANKER_MODEL = "BAAI/bge-reranker-base"

# Load the cross-encoder once, on the GPU when one is available and on the CPU otherwise
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
reranker = HuggingFaceCrossEncoder(model_name=RERANKER_MODEL, model_kwargs={"device": DEVICE})

def rerank_results(query: str, documents: list, top_k: int = 5):
    if not documents:
        return []

    # Pair the query with each document text
    query_doc_pairs = [(query, doc.page_content) for doc in documents]
    scores = reranker.score(query_doc_pairs)

    # Sort documents based on similarity score
    reranked = sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)
    
    # Return the top K sorted documents
    return [doc for score, doc in reranked[:top_k]]
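The sorting and truncation logic can be checked without downloading a model by swapping in a toy scorer. The sketch below does that; `Doc`, `overlap_score`, and `rerank_with` are hypothetical stand-ins written for this example, and the word-overlap score is only a placeholder for a real Cross-Encoder.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str

def overlap_score(query: str, text: str) -> float:
    """Toy scorer: fraction of query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)

def rerank_with(score_fn, query: str, documents: list, top_k: int = 5) -> list:
    """Score every (query, document) pair, sort by score, keep the top_k documents."""
    scores = [score_fn(query, doc.page_content) for doc in documents]
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]

docs = [
    Doc("Risk factors and legal proceedings"),
    Doc("Apple net sales revenue for 2024"),
    Doc("Apple services revenue grew"),
]

top = rerank_with(overlap_score, "apple revenue 2024", docs, top_k=2)
for doc in top:
    print(doc.page_content)
```

The chunk that contains all three query words comes first, the one with two comes second, and the unrelated chunk is cut by top_k. Sorting on the score alone (the `key` argument) also avoids comparing document objects when two scores tie.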

Run the complete pipeline to retrieve and rerank documents:

PYTHON

query = "what is the revenue of apple in 2024?"
retrieved_docs = hybrid_search(query, k=10)
reranked_docs = rerank_results(query, retrieved_docs, top_k=3)

for idx, doc in enumerate(reranked_docs):
    print(f"Rank {idx+1}: {doc.metadata['source_file']} (Page {doc.metadata['page']})")
    print(doc.page_content[:200].strip())
    print("-" * 50)

OUTPUT

Rank 1: apple 10-k 2024.md (Page 26)
## Products and Services Performance
The following table shows net sales by category for 2024, 2023 and 2022 (dollars in millions):
--------------------------------------------------
Rank 2: apple 10-k 2024.md (Page 28)
Operating income for 2024 grew by 8% compared to the prior year period, driven by expansion of our Services segment sales and sustained margins.
--------------------------------------------------
Rank 3: apple 10-k 2024.md (Page 12)
Selected Financial Data: Net Sales was $ 391,035 million for the fiscal year ended September 28, 2024.
--------------------------------------------------

Here, the reranker put Apple's net-sales and operating-income pages from the 2024 10-K at the top, all of which bear directly on the revenue question. That is the full retrieval pipeline: we combined dense and sparse search, added LLM-driven metadata filters to narrow the candidates, and then reranked them with a Cross-Encoder to pick the final matches.