This notebook focuses on the systematic evaluation of retrieval performance. It proceeds in three stages:

  1. A synthetic legal question dataset is constructed by transforming a legal corpus into diverse natural-language queries, each paired with its reference law article(s).

  2. The retrieval process is illustrated using a sampled query, demonstrating how candidate evidence is retrieved, expanded, and ranked in a step-by-step manner. This example provides intuition for the quantitative evaluations that follow.

  3. Using the synthetic dataset, retrieval performance is evaluated comprehensively and quantitatively using retrieval metrics, including Recall@K, Hit@K, NDCG, and MRR.

1 Generate Synthetic Evaluation Dataset

Evaluating legal retrievers requires a ground-truth dataset of question–reference article pairs.

Generating synthetic legal questions by transforming the legal corpus into natural-language queries makes it possible to evaluate retrieval behavior at scale.

Load the law corpus and sample an example
import json
from pathlib import Path

import pandas as pd

# Read the JSONL law corpus: one law chunk (article) per line.
corpus = Path(cfg.paths.law_jsonl)
chunks = [json.loads(line) for line in corpus.open("r", encoding="utf-8") if line.strip()]
df_chunks = pd.DataFrame(chunks)

print("Example law chunk:")
display(df_chunks.drop(columns=["subpart", "section", "article_key", "source"]).sample(1, random_state=1))
print("Number of law chunks:", len(df_chunks))
Example law chunk:
  index:      1179
  id:         minfadian.txt::一千一百八十
  law_name:   中华人民共和国民法典
  part:       七编 侵权责任
  chapter:    二章 损害赔偿
  article_no: 第一千一百八十条
  article_id: 1180
  text:       第一千一百八十条 因同一侵权行为造成多人死亡的,可以以相同数额确定死亡赔偿金。
Number of law chunks: 1260
Generate synthetic ground-truth legal query–evidence evaluation dataset
from scripts.generate_synthetic_data import build_ground_truth_queries

df_queries = build_ground_truth_queries(
    df_chunks,
    per_article=2,
    max_articles=200,
    total_queries=150,
    logger=logger,
    zh_ratio=0.9,
    generator_llm=llm,
    judge_llm=llm,
    embedding_model=embedding_model
)

df_sample = df_queries.sample(5, random_state=1)
display(df_sample.drop(columns=["lang", "round", "rewritten", "score"]))
print("Generated queries count:", len(df_queries))
Generated queries count: 150
idx | query | role | law_name | article_no | article_id
14 | 听起来挺复杂的,那如果我在签署合同时已经意识到这些风险了,是否还能要求赔偿损失? | user | 中华人民共和国民法典 | 第七百三十八条 | 738
98 | 最后一个问题了,这个规定适用于所有的技术咨询服务合同吗? | user | 中华人民共和国民法典 | 第八百八十条 | 880
75 | 认购书或者订购书等是否属于预约合同? | user | 中华人民共和国民法典 | 第四百九十五条 | 495
16 | When all parties sign, seal, or affix their fi... | inhouse | 中华人民共和国民法典 | 第四百九十三条 | 493
131 | 当合同无效是由于承租人的不当行为引起时,租赁物的所有权归谁,承租人需不需要向出租人进行经济补偿? | lawyer | 中华人民共和国民法典 | 第七百六十条 | 760

2 Illustrate Multi-channel Multi-stage Retrieval

The retrieval pipeline in Legal-RAG is designed as a multi-stage, hybrid, and graph-aware process, balancing recall, precision, and interpretability. Rather than relying on a single retrieval signal, the retriever progressively refines candidates through multi-channel coarse retrieval, score fusion, graph expansion, and reranking.

Sample a synthetic query to illustrate the retrieval pipeline
sample = df_sample[["query", "article_id"]].sample(1)
display(sample)
question = sample["query"].values[0]
idx | query | article_id
75 | 认购书或者订购书等是否属于预约合同? | 495

Sparse lexical retrieval

Sparse lexical retrieval (BM25) preserves exact keyword and statutory-phrase matching, providing a purely text-based retrieval signal.
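As a rough standalone illustration (not the pipeline's actual channel code), a BM25 pass over the law chunks could be sketched as follows, assuming the rank_bm25 and jieba packages:

# Illustrative BM25 sketch over df_chunks; jieba handles Chinese tokenization.
import jieba
from rank_bm25 import BM25Okapi

tokenized_corpus = [list(jieba.cut(t)) for t in df_chunks["text"]]
bm25 = BM25Okapi(tokenized_corpus)

scores = bm25.get_scores(list(jieba.cut(question)))
for i in scores.argsort()[::-1][:5]:  # five highest-scoring chunks
    print(f"{scores[i]:.2f}", df_chunks.iloc[i]["article_id"], df_chunks.iloc[i]["text"][:30])

The pipeline's actual BM25 channel returns the top candidates shown below.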

rank | score | article_id | chapter | preview
0 | 37.36 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。...
1 | 8.09 | 1134 | 三章 遗嘱继承和遗赠 | 第一千一百三十四条 自书遗嘱由遗嘱人亲笔书写,签名,注明年、月、日。...
2 | 7.53 | 501 | 三章 合同的效力 | 第五百零一条 当事人在订立合同过程中知悉的商业秘密或者其他应当保密的信息,无论合同是否成立,...
3 | 7.47 | 250 | 五章 国家所有权和集体所有权、私人所有权 | 第二百五十条 森林、山岭、草原、荒地、滩涂等自然资源,属于国家所有,但是法律规定属于集体所有...
4 | 7.03 | 254 | 五章 国家所有权和集体所有权、私人所有权 | 第二百五十四条 国防资产属于国家所有。 铁路、公路、电力设施、电信设施和油气管道等基础设施,...

Dense semantic retrieval

Dense retrieval (BGE embedding-based FAISS search) captures semantic similarity and paraphrased intent.
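As an illustrative sketch (not the pipeline's index-building code), dense retrieval with a BGE encoder and FAISS could look like the following; the checkpoint name is an assumption:

# Illustrative dense-retrieval sketch; the BGE checkpoint name is an assumption.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-zh-v1.5")
doc_emb = encoder.encode(list(df_chunks["text"]), normalize_embeddings=True)

# Inner product over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(np.asarray(doc_emb, dtype="float32"))

q_emb = np.asarray(encoder.encode([question], normalize_embeddings=True), dtype="float32")
scores, ids = index.search(q_emb, 5)
for s, i in zip(scores[0], ids[0]):
    print(f"{s:.2f}", df_chunks.iloc[i]["article_id"])

The dense channel's actual top candidates are shown below.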

rank | score | article_id | chapter | preview
0 | 0.65 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。...
1 | 0.53 | 491 | 二章 合同的订立 | 第四百九十一条 当事人采用信件、数据电文等形式订立合同要求签订确认书的,签订确认书时合同成立...
2 | 0.53 | 471 | 二章 合同的订立 | 第四百七十一条 当事人订立合同,可以采取要约、承诺方式或者其他方式。...
3 | 0.51 | 493 | 二章 合同的订立 | 第四百九十三条 当事人采用合同书形式订立合同的,最后签名、盖章或者按指印的地点为合同成立的地...
4 | 0.51 | 483 | 二章 合同的订立 | 第四百八十三条 承诺生效时合同成立,但是法律另有规定或者当事人另有约定的除外。...

Late-interaction retrieval

Late-interaction retrieval (ColBERT) encodes each token into a contextualized embedding and computes relevance through fine-grained token–token interactions at query time. This design preserves detailed lexical signals while retaining strong semantic generalization.
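The scoring idea can be illustrated with a toy MaxSim computation; a real ColBERT index precomputes and compresses document token embeddings, and the encoder below is only a stand-in:

# Toy late-interaction (MaxSim) sketch; bert-base-chinese is a stand-in
# encoder, not the pipeline's ColBERT checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-chinese"
tok = AutoTokenizer.from_pretrained(name)
enc = AutoModel.from_pretrained(name)

@torch.no_grad()
def token_embeddings(text: str) -> torch.Tensor:
    hidden = enc(**tok(text, return_tensors="pt")).last_hidden_state[0]
    return torch.nn.functional.normalize(hidden, dim=-1)

def maxsim(query: str, doc: str) -> float:
    q, d = token_embeddings(query), token_embeddings(doc)
    # (|q| x |d|) token-similarity matrix; each query token keeps its best match.
    return (q @ d.T).max(dim=1).values.sum().item()

print(maxsim(question, df_chunks.iloc[0]["text"]))

The ColBERT channel's actual top candidates are shown below.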

rank | score | article_id | chapter | preview
0 | 22.08 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。...
1 | 19.97 | 502 | 三章 合同的效力 | 第五百零二条 依法成立的合同,自成立时生效,但是法律另有规定或者当事人另有约定的除外。 依照...
2 | 19.92 | 493 | 二章 合同的订立 | 第四百九十三条 当事人采用合同书形式订立合同的,最后签名、盖章或者按指印的地点为合同成立的地...
3 | 19.88 | 888 | 二十一章 保管合同 | 第八百八十八条 保管合同是保管人保管寄存人交付的保管物,并返还该物的合同。 寄存人到保管人处...
4 | 19.84 | 984 | 二十九章 不当得利 | 第九百八十四条 管理人管理事务经受益人事后追认的,从管理事务开始时起,适用委托合同的有关规定...

Multi-channel Retrieval Fusion

To integrate heterogeneous retrieval signals, Legal-RAG applies a fusion step over the multi-channel results. Two fusion strategies are supported:

  • Reciprocal Rank Fusion (RRF): candidates are ranked by the inverse of their rank positions across channels, emphasizing consensus among retrievers.

  • Normalized score blending (norm_blend): raw scores from different retrievers are normalized into a comparable range and linearly combined.

This fusion step yields a unified ranked list of top candidates, mitigating the weaknesses of any single retrieval method.
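For intuition, RRF fits in a few lines; the per-channel rankings of article_ids below are hypothetical:

# Minimal RRF sketch; k=60 is the conventional damping constant.
from collections import defaultdict

def rrf(ranked_lists: dict, k: int = 60) -> list:
    fused = defaultdict(float)
    for channel, ranking in ranked_lists.items():
        for rank, article_id in enumerate(ranking, start=1):
            fused[article_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

print(rrf({
    "bm25":    ["495", "1134", "501"],
    "dense":   ["495", "491", "471"],
    "colbert": ["495", "502", "493"],
})[:3])  # "495" dominates: it is rank 1 in all three channels

The pipeline's fused top candidates are shown below.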

rank | score | article_id | chapter | preview | channel
0 | 1.18 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。... | [dense, bm25, colbert]
1 | 0.32 | 491 | 二章 合同的订立 | 第四百九十一条 当事人采用信件、数据电文等形式订立合同要求签订确认书的,签订确认书时合同成立... | [dense, colbert]
2 | 0.31 | 471 | 二章 合同的订立 | 第四百七十一条 当事人订立合同,可以采取要约、承诺方式或者其他方式。... | [dense, colbert]
3 | 0.30 | 493 | 二章 合同的订立 | 第四百九十三条 当事人采用合同书形式订立合同的,最后签名、盖章或者按指印的地点为合同成立的地... | [dense, colbert]
4 | 0.22 | 469 | 二章 合同的订立 | 第四百六十九条 当事人订立合同,可以采用书面形式、口头形式或者其他形式。 书面形式是合同书、... | [colbert, dense]

Graph-Augmented Retrieval

To capture structural legal relationships, such as:

  • General–specific rule hierarchies
  • Cross-article references
  • Implicit doctrinal dependencies

a graph-based enhancement stage is applied. Using the fusion top candidates as seeds, a graph walk augments the candidate set with structurally relevant but textually under-retrieved articles.
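A one-hop neighborhood expansion over a hypothetical article graph illustrates the idea; the pipeline's actual graph construction and edge types are not shown here:

# Illustrative graph-walk sketch; law_graph is a hypothetical networkx graph
# whose nodes are article_ids and whose edges encode cross-references and hierarchy.
import networkx as nx

def expand_seeds(law_graph: nx.Graph, seeds: list, hops: int = 1) -> set:
    expanded, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        # pull in direct neighbors of the current frontier, skipping known nodes
        frontier = {n for s in frontier for n in law_graph.neighbors(s)} - expanded
        expanded |= frontier
    return expanded

# e.g. expand_seeds(law_graph, seeds=["495", "491"], hops=1)

The graph-augmented candidates produced by the pipeline are shown below.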

rank | score | article_id | chapter | preview
0 | 0.36 | 490 | 二章 合同的订立 | 第四百九十条 当事人采用合同书形式订立合同的,自当事人均签名、盖章或者按指印时合同成立。在签...
1 | 0.34 | 494 | 二章 合同的订立 | 第四百九十四条 国家根据抢险救灾、疫情防控或者其他需要下达国家订货任务、指令性任务的,有关民...
2 | 0.34 | 472 | 二章 合同的订立 | 第四百七十二条 要约是希望与他人订立合同的意思表示,该意思表示应当符合下列条件: (一)内容...
3 | 0.34 | 470 | 二章 合同的订立 | 第四百七十条 合同的内容由当事人约定,一般包括下列条款: (一)当事人的姓名或者名称和住所;...
4 | 0.33 | 492 | 二章 合同的订立 | 第四百九十二条 承诺生效的地点为合同成立的地点。 采用数据电文形式订立合同的,收件人的主营业...

Reranking

A final reranking stage, using cross-encoder or LLM-based scoring, refines the ordering. Reranking prioritizes answer-critical provisions and demotes peripheral or weakly connected articles based on higher-resolution signals such as the two below (an illustrative cross-encoder sketch follows the list):

  • Query–article semantic alignment
  • Coverage of legally salient terms
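One possible realization with a cross-encoder; the checkpoint name is an assumption, and the pipeline may instead use an LLM-based scorer:

# Illustrative cross-encoder reranking sketch; the checkpoint is an assumption.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

candidates = list(df_chunks["text"].head(20))  # stand-in candidate set
scores = reranker.predict([(question, doc) for doc in candidates])

for s, doc in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)[:3]:
    print(f"{s:.2f}", doc[:30])

The pipeline's reranked top candidates are shown below.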
rank | score | article_id | chapter | preview | channel
0 | 1.11 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。... | [dense, bm25, colbert]
1 | 0.21 | 491 | 二章 合同的订立 | 第四百九十一条 当事人采用信件、数据电文等形式订立合同要求签订确认书的,签订确认书时合同成立... | [dense, colbert]
2 | 0.20 | 471 | 二章 合同的订立 | 第四百七十一条 当事人订立合同,可以采取要约、承诺方式或者其他方式。... | [dense, colbert]
3 | 0.19 | 493 | 二章 合同的订立 | 第四百九十三条 当事人采用合同书形式订立合同的,最后签名、盖章或者按指印的地点为合同成立的地... | [dense, colbert]
4 | 0.15 | 469 | 二章 合同的订立 | 第四百六十九条 当事人订立合同,可以采用书面形式、口头形式或者其他形式。 书面形式是合同书、... | [colbert, dense]

3 Evaluating Retrieval Pipelines with Synthetic Questions

The generated synthetic dataset is used to systematically evaluate retrieval performance at each stage:

  • Multi-Channel Retrieval, combining dense semantic search, sparse lexical matching, and late-interaction models.

  • Graph-Augmented Retrieval, with legal graph expansion.

  • Reranking, producing the final ordering with a cross-encoder or LLM.

Define Retrieval Evaluation Metrics
import math
from typing import List, Set


def _hit_ids(hits: List[RetrievalHit]) -> List[str]:
    # compare at article_id level for legal retrieval
    out = []
    for h in hits:
        aid = str(getattr(h.chunk, "article_id", "") or "")
        if aid:
            out.append(aid)
    return out

def hit_at_k(pred: List[str], gold: Set[str], k: int) -> float:
    top_hits = pred[:k]
    return int(any(h.strip() in gold for h in top_hits))

def recall_at_k(pred: List[str], gold: Set[str], k: int) -> float:
    if not gold:
        return 0.0
    return len(set(pred[:k]) & gold) / len(gold)

def mrr_at_k(pred: List[str], gold: Set[str], k: int) -> float:
    for i, x in enumerate(pred[:k], start=1):
        if x in gold:
            return 1.0 / i
    return 0.0

def ndcg_at_k(pred: List[str], gold: Set[str], k: int) -> float:
    # binary relevance
    def dcg(xs: List[str]) -> float:
        s = 0.0
        for i, x in enumerate(xs[:k], start=1):
            rel = 1.0 if x in gold else 0.0
            s += rel / math.log2(i + 1)
        return s
    ideal = dcg(list(gold))
    if ideal <= 1e-12:
        return 0.0
    return dcg(pred) / ideal
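A quick sanity check of the metric definitions on hypothetical ids:

# Hypothetical example: the single gold article sits at rank 1.
pred, gold = ["495", "491", "471"], {"495"}
assert hit_at_k(pred, gold, k=3) == 1
assert recall_at_k(pred, gold, k=3) == 1.0
assert mrr_at_k(pred, gold, k=3) == 1.0           # reciprocal rank of position 1
assert abs(ndcg_at_k(pred, gold, k=3) - 1.0) < 1e-9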
Evaluate on the Synthetic Dataset
subset = df_queries.sample(100, random_state=0).reset_index(drop=True)

results = []
for _, row in subset.iterrows():
    # evaluate_one is assumed from earlier in the notebook: it runs each
    # retrieval pipeline for one query and computes the metrics defined above.
    out = evaluate_one(row["query"], [row["article_id"]], top_k=10)
    results.append({
        "query": row["query"],
        "positives": row["article_id"],
        "hits": out["hits"],
        "metrics": out["metrics"],
    })
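The per-method table below aggregates the per-query metrics. A minimal aggregation sketch, assuming out["metrics"] maps each method name (bm25, dense, colbert, fusion, hybrid) to a dict of per-query metric values:

# Aggregation sketch under an assumed result structure: average each metric per method.
rows = [
    {"method": method, **metrics}
    for r in results
    for method, metrics in r["metrics"].items()
]
df_metrics = (
    pd.DataFrame(rows)
    .groupby("method")
    .mean(numeric_only=True)
    .sort_values("R@10", ascending=False)  # assumed metric column name
)
display(df_metrics)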
method | R@5 | R@10 | MRR@10 | nDCG@10 | hit@3 | hit@10
fusion | 0.82 | 0.85 | 0.650857 | 0.699083 | 0.70 | 0.85
hybrid | 0.82 | 0.84 | 0.690524 | 0.727520 | 0.75 | 0.84
colbert | 0.79 | 0.81 | 0.663024 | 0.699208 | 0.71 | 0.81
dense | 0.66 | 0.75 | 0.528690 | 0.582100 | 0.60 | 0.75
bm25 | 0.52 | 0.57 | 0.443345 | 0.473788 | 0.49 | 0.57

4 Key observations

  1. Neural retrieval consistently outperforms lexical BM25. Both dense and ColBERT retrievers achieve substantial gains over BM25 across all metrics (e.g., R@10, MRR@10, nDCG@10), indicating that semantic matching is essential for legal queries with abstract or paraphrased formulations.

  2. Late-interaction retrieval (ColBERT) further improves ranking quality. Compared to standard dense retrieval, ColBERT yields higher nDCG@10 and hit@k scores, suggesting better fine-grained alignment between queries and statutory text.

  3. Multi-retriever fusion provides robust recall gains. The fusion variant, which combines BM25, dense, and ColBERT results, achieves the strongest recall-oriented scores in the table (R@5, R@10, hit@10), benefiting from complementary retrieval signals.

  4. Hybrid retrieval with augmentation and reranking achieves the best ranking quality. The hybrid retriever, built on top of fusion and further enhanced with optional graph-based augmentation and reranking, attains the highest MRR@10, nDCG@10, and hit@3, with recall on par with fusion. This confirms the effectiveness of combining heterogeneous retrievers with structural and ranking enhancements.

Overall, the results demonstrate that progressively enriched retrieval pipelines—from lexical to semantic, from single retriever to fused and augmented retrievers—lead to systematic and measurable improvements in legal retrieval quality.