This notebook focuses on the systematic evaluation of retrieval performance. It proceeds in three stages:

  1. A synthetic legal question dataset is constructed by transforming a legal corpus into diverse natural-language queries, each paired with its reference law article(s).

  2. The retrieval process is illustrated using a sampled query, demonstrating how candidate evidence is retrieved, expanded, and ranked in a step-by-step manner. This example provides intuition for the quantitative evaluations that follow.

  3. Using the synthetic dataset, retrieval performance is evaluated comprehensively and quantitatively using retrieval metrics, including Recall@K, Hit@K, NDCG, and MRR.

1 Generate Synthetic Evaluation Dataset

Evaluating legal retrievers requires a ground-truth dataset of question–reference article pairs.

Generating synthetic legal questions by transforming the legal corpus into natural-language queries makes it possible to evaluate retrieval behavior at scale.

Load the law corpus and sample an example
import json
from pathlib import Path

import pandas as pd

# Read the JSONL law corpus: one law chunk (article) per line.
corpus = Path(cfg.paths.law_jsonl)
chunks = [json.loads(line) for line in corpus.open("r", encoding="utf-8") if line.strip()]
df_chunks = pd.DataFrame(chunks)

print("Example law chunk:")
display(df_chunks.drop(columns=["subpart", "section", "article_key", "source"]).sample(1, random_state=1))
print("Number of law chunks:", len(df_chunks))
Example law chunk:
  index:      1179
  id:         minfadian.txt::一千一百八十
  law_name:   中华人民共和国民法典
  part:       七编 侵权责任
  chapter:    二章 损害赔偿
  article_no: 第一千一百八十条
  article_id: 1180
  text:       第一千一百八十条 因同一侵权行为造成多人死亡的,可以以相同数额确定死亡赔偿金。
Number of law chunks: 1260
Generate synthetic ground-truth legal query–evidence evaluation dataset
from scripts.generate_synthetic_data import build_ground_truth_queries

df_queries = build_ground_truth_queries(
    df_chunks,
    per_article=2,
    max_articles=200,
    total_queries=150,
    logger=logger,
    zh_ratio=0.9,
    generator_llm=llm,
    judge_llm=llm,
    embedding_model=embedding_model
)

df_sample = df_queries.sample(5, random_state=1)
display(df_sample.drop(columns=["lang", "round", "rewritten", "score"]))
print("Generated queries count:", len(df_queries))
Generated queries count: 150
idx | query | role | law_name | article_no | article_id
14 | 听起来挺复杂的,那如果我在签署合同时已经意识到这些风险了,是否还能要求赔偿损失? | user | 中华人民共和国民法典 | 第七百三十八条 | 738
98 | 最后一个问题了,这个规定适用于所有的技术咨询服务合同吗? | user | 中华人民共和国民法典 | 第八百八十条 | 880
75 | 认购书或者订购书等是否属于预约合同? | user | 中华人民共和国民法典 | 第四百九十五条 | 495
16 | When all parties sign, seal, or affix their fi... | inhouse | 中华人民共和国民法典 | 第四百九十三条 | 493
131 | 当合同无效是由于承租人的不当行为引起时,租赁物的所有权归谁,承租人需不需要向出租人进行经济补偿? | lawyer | 中华人民共和国民法典 | 第七百六十条 | 760

2 Illustrate Multi-channel Multi-stage Retrieval

The retrieval pipeline in Legal-RAG is designed as a multi-stage, hybrid, and graph-aware process, balancing recall, precision, and interpretability. Rather than relying on a single retrieval signal, the retriever progressively refines candidates through multi-channel coarse retrieval, score fusion, graph expansion, and reranking.

Sample a synthetic query to illustrate the retrieval pipeline
sample = df_sample[["query", "article_id"]].sample(1)
display(sample)
question = sample["query"].values[0]
idx | query | article_id
75 | 认购书或者订购书等是否属于预约合同? | 495

Sparse lexical retrieval

Sparse lexical retrieval (BM25) preserves exact keyword and statutory-phrase matching, providing a purely text-based retrieval signal.
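As a rough standalone illustration (not the pipeline's actual channel code), a BM25 pass over the law chunks could be sketched as follows, assuming the rank_bm25 and jieba packages:

# Illustrative BM25 sketch over df_chunks; jieba handles Chinese tokenization.
import jieba
from rank_bm25 import BM25Okapi

tokenized_corpus = [list(jieba.cut(t)) for t in df_chunks["text"]]
bm25 = BM25Okapi(tokenized_corpus)

scores = bm25.get_scores(list(jieba.cut(question)))
for i in scores.argsort()[::-1][:5]:  # five highest-scoring chunks
    print(f"{scores[i]:.2f}", df_chunks.iloc[i]["article_id"], df_chunks.iloc[i]["text"][:30])

The pipeline's actual BM25 channel returns the top candidates shown below.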

rank | score | article_id | chapter | preview
0 | 37.36 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。...
1 | 8.09 | 1134 | 三章 遗嘱继承和遗赠 | 第一千一百三十四条 自书遗嘱由遗嘱人亲笔书写,签名,注明年、月、日。...
2 | 7.53 | 501 | 三章 合同的效力 | 第五百零一条 当事人在订立合同过程中知悉的商业秘密或者其他应当保密的信息,无论合同是否成立,...
3 | 7.47 | 250 | 五章 国家所有权和集体所有权、私人所有权 | 第二百五十条 森林、山岭、草原、荒地、滩涂等自然资源,属于国家所有,但是法律规定属于集体所有...
4 | 7.03 | 254 | 五章 国家所有权和集体所有权、私人所有权 | 第二百五十四条 国防资产属于国家所有。 铁路、公路、电力设施、电信设施和油气管道等基础设施,...

Dense semantic retrieval

Dense retrieval (BGE embedding-based FAISS search) captures semantic similarity and paraphrased intent.
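As an illustrative sketch (not the pipeline's index-building code), dense retrieval with a BGE encoder and FAISS could look like the following; the checkpoint name is an assumption:

# Illustrative dense-retrieval sketch; the BGE checkpoint name is an assumption.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-zh-v1.5")
doc_emb = encoder.encode(list(df_chunks["text"]), normalize_embeddings=True)

# Inner product over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(np.asarray(doc_emb, dtype="float32"))

q_emb = np.asarray(encoder.encode([question], normalize_embeddings=True), dtype="float32")
scores, ids = index.search(q_emb, 5)
for s, i in zip(scores[0], ids[0]):
    print(f"{s:.2f}", df_chunks.iloc[i]["article_id"])

The dense channel's actual top candidates are shown below.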

rank | score | article_id | chapter | preview
0 | 0.65 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。...
1 | 0.53 | 491 | 二章 合同的订立 | 第四百九十一条 当事人采用信件、数据电文等形式订立合同要求签订确认书的,签订确认书时合同成立...
2 | 0.53 | 471 | 二章 合同的订立 | 第四百七十一条 当事人订立合同,可以采取要约、承诺方式或者其他方式。...
3 | 0.51 | 493 | 二章 合同的订立 | 第四百九十三条 当事人采用合同书形式订立合同的,最后签名、盖章或者按指印的地点为合同成立的地...
4 | 0.51 | 483 | 二章 合同的订立 | 第四百八十三条 承诺生效时合同成立,但是法律另有规定或者当事人另有约定的除外。...

Late-interaction retrieval

Late-interaction retrieval (ColBERT) encodes each token into a contextualized embedding and computes relevance through fine-grained token–token interactions at query time. This design preserves detailed lexical signals while retaining strong semantic generalization.
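The scoring idea can be illustrated with a toy MaxSim computation; a real ColBERT index precomputes and compresses document token embeddings, and the encoder below is only a stand-in:

# Toy late-interaction (MaxSim) sketch; bert-base-chinese is a stand-in
# encoder, not the pipeline's ColBERT checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-chinese"
tok = AutoTokenizer.from_pretrained(name)
enc = AutoModel.from_pretrained(name)

@torch.no_grad()
def token_embeddings(text: str) -> torch.Tensor:
    hidden = enc(**tok(text, return_tensors="pt")).last_hidden_state[0]
    return torch.nn.functional.normalize(hidden, dim=-1)

def maxsim(query: str, doc: str) -> float:
    q, d = token_embeddings(query), token_embeddings(doc)
    # (|q| x |d|) token-similarity matrix; each query token keeps its best match.
    return (q @ d.T).max(dim=1).values.sum().item()

print(maxsim(question, df_chunks.iloc[0]["text"]))

The ColBERT channel's actual top candidates are shown below.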

rank | score | article_id | chapter | preview
0 | 22.08 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。...
1 | 19.97 | 502 | 三章 合同的效力 | 第五百零二条 依法成立的合同,自成立时生效,但是法律另有规定或者当事人另有约定的除外。 依照...
2 | 19.92 | 493 | 二章 合同的订立 | 第四百九十三条 当事人采用合同书形式订立合同的,最后签名、盖章或者按指印的地点为合同成立的地...
3 | 19.88 | 888 | 二十一章 保管合同 | 第八百八十八条 保管合同是保管人保管寄存人交付的保管物,并返还该物的合同。 寄存人到保管人处...
4 | 19.84 | 984 | 二十九章 不当得利 | 第九百八十四条 管理人管理事务经受益人事后追认的,从管理事务开始时起,适用委托合同的有关规定...

Multi-channel Retrieval Fusion

To integrate heterogeneous retrieval signals, Legal-RAG applies a fusion step over the multi-channel results. Two fusion strategies are supported:

  • Reciprocal Rank Fusion (RRF): candidates are ranked by the inverse of their rank positions across channels, emphasizing consensus among retrievers.

  • Normalized score blending (norm_blend): raw scores from different retrievers are normalized into a comparable range and linearly combined.

This fusion step yields a unified ranked list of top candidates, mitigating the weaknesses of any single retrieval method.
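For intuition, RRF fits in a few lines; the per-channel rankings of article_ids below are hypothetical:

# Minimal RRF sketch; k=60 is the conventional damping constant.
from collections import defaultdict

def rrf(ranked_lists: dict, k: int = 60) -> list:
    fused = defaultdict(float)
    for channel, ranking in ranked_lists.items():
        for rank, article_id in enumerate(ranking, start=1):
            fused[article_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

print(rrf({
    "bm25":    ["495", "1134", "501"],
    "dense":   ["495", "491", "471"],
    "colbert": ["495", "502", "493"],
})[:3])  # "495" dominates: it is rank 1 in all three channels

The pipeline's fused top candidates are shown below.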

rank | score | article_id | chapter | preview | channel
0 | 1.18 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。... | [dense, bm25, colbert]
1 | 0.32 | 491 | 二章 合同的订立 | 第四百九十一条 当事人采用信件、数据电文等形式订立合同要求签订确认书的,签订确认书时合同成立... | [dense, colbert]
2 | 0.31 | 471 | 二章 合同的订立 | 第四百七十一条 当事人订立合同,可以采取要约、承诺方式或者其他方式。... | [dense, colbert]
3 | 0.30 | 493 | 二章 合同的订立 | 第四百九十三条 当事人采用合同书形式订立合同的,最后签名、盖章或者按指印的地点为合同成立的地... | [dense, colbert]
4 | 0.22 | 469 | 二章 合同的订立 | 第四百六十九条 当事人订立合同,可以采用书面形式、口头形式或者其他形式。 书面形式是合同书、... | [colbert, dense]

Graph-Augmented Retrieval

To capture structural legal relationships, such as:

  • General–specific rule hierarchies
  • Cross-article references
  • Implicit doctrinal dependencies

a graph-based enhancement stage is applied. Using the fusion top candidates as seeds, a graph walk augments the candidate set with structurally relevant but textually under-retrieved articles.
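A one-hop neighborhood expansion over a hypothetical article graph illustrates the idea; the pipeline's actual graph construction and edge types are not shown here:

# Illustrative graph-walk sketch; law_graph is a hypothetical networkx graph
# whose nodes are article_ids and whose edges encode cross-references and hierarchy.
import networkx as nx

def expand_seeds(law_graph: nx.Graph, seeds: list, hops: int = 1) -> set:
    expanded, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        # pull in direct neighbors of the current frontier, skipping known nodes
        frontier = {n for s in frontier for n in law_graph.neighbors(s)} - expanded
        expanded |= frontier
    return expanded

# e.g. expand_seeds(law_graph, seeds=["495", "491"], hops=1)

The graph-augmented candidates produced by the pipeline are shown below.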

rank | score | article_id | chapter | preview
0 | 0.36 | 490 | 二章 合同的订立 | 第四百九十条 当事人采用合同书形式订立合同的,自当事人均签名、盖章或者按指印时合同成立。在签...
1 | 0.34 | 494 | 二章 合同的订立 | 第四百九十四条 国家根据抢险救灾、疫情防控或者其他需要下达国家订货任务、指令性任务的,有关民...
2 | 0.34 | 472 | 二章 合同的订立 | 第四百七十二条 要约是希望与他人订立合同的意思表示,该意思表示应当符合下列条件: (一)内容...
3 | 0.34 | 470 | 二章 合同的订立 | 第四百七十条 合同的内容由当事人约定,一般包括下列条款: (一)当事人的姓名或者名称和住所;...
4 | 0.33 | 492 | 二章 合同的订立 | 第四百九十二条 承诺生效的地点为合同成立的地点。 采用数据电文形式订立合同的,收件人的主营业...

Reranking

A final reranking stage, using cross-encoder or LLM-based scoring, refines the ordering. Reranking prioritizes answer-critical provisions and demotes peripheral or weakly connected articles based on higher-resolution signals such as the two below (an illustrative cross-encoder sketch follows the list):

  • Query–article semantic alignment
  • Coverage of legally salient terms
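One possible realization with a cross-encoder; the checkpoint name is an assumption, and the pipeline may instead use an LLM-based scorer:

# Illustrative cross-encoder reranking sketch; the checkpoint is an assumption.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

candidates = list(df_chunks["text"].head(20))  # stand-in candidate set
scores = reranker.predict([(question, doc) for doc in candidates])

for s, doc in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)[:3]:
    print(f"{s:.2f}", doc[:30])

The pipeline's reranked top candidates are shown below.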
rank | score | article_id | chapter | preview | channel
0 | 1.11 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。... | [dense, bm25, colbert]
1 | 0.21 | 491 | 二章 合同的订立 | 第四百九十一条 当事人采用信件、数据电文等形式订立合同要求签订确认书的,签订确认书时合同成立... | [dense, colbert]
2 | 0.20 | 471 | 二章 合同的订立 | 第四百七十一条 当事人订立合同,可以采取要约、承诺方式或者其他方式。... | [dense, colbert]
3 | 0.19 | 493 | 二章 合同的订立 | 第四百九十三条 当事人采用合同书形式订立合同的,最后签名、盖章或者按指印的地点为合同成立的地... | [dense, colbert]
4 | 0.15 | 469 | 二章 合同的订立 | 第四百六十九条 当事人订立合同,可以采用书面形式、口头形式或者其他形式。 书面形式是合同书、... | [colbert, dense]

3 Evaluating Retrieval Pipelines with Synthetic Questions

The generated synthetic dataset is used to systematically evaluate retrieval performance at each stage:

  • Multi-Channel Retrieval, combining dense semantic search, sparse lexical matching, and late-interaction models.

  • Graph-Augmented Retrieval, with legal graph expansion.

  • Reranking, producing the final ordering with a cross-encoder or LLM.

Define Retrieval Evaluation Metrics
import math
from typing import List, Set


def _hit_ids(hits: List[RetrievalHit]) -> List[str]:
    # compare at article_id level for legal retrieval
    out = []
    for h in hits:
        aid = str(getattr(h.chunk, "article_id", "") or "")
        if aid:
            out.append(aid)
    return out

def hit_at_k(pred: List[str], gold: Set[str], k: int) -> float:
    top_hits = pred[:k]
    return int(any(h.strip() in gold for h in top_hits))

def recall_at_k(pred: List[str], gold: Set[str], k: int) -> float:
    if not gold:
        return 0.0
    return len(set(pred[:k]) & gold) / len(gold)

def mrr_at_k(pred: List[str], gold: Set[str], k: int) -> float:
    for i, x in enumerate(pred[:k], start=1):
        if x in gold:
            return 1.0 / i
    return 0.0

def ndcg_at_k(pred: List[str], gold: Set[str], k: int) -> float:
    # binary relevance
    def dcg(xs: List[str]) -> float:
        s = 0.0
        for i, x in enumerate(xs[:k], start=1):
            rel = 1.0 if x in gold else 0.0
            s += rel / math.log2(i + 1)
        return s
    ideal = dcg(list(gold))
    if ideal <= 1e-12:
        return 0.0
    return dcg(pred) / ideal
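A quick sanity check of the metric definitions on hypothetical ids:

# Hypothetical example: the single gold article sits at rank 1.
pred, gold = ["495", "491", "471"], {"495"}
assert hit_at_k(pred, gold, k=3) == 1
assert recall_at_k(pred, gold, k=3) == 1.0
assert mrr_at_k(pred, gold, k=3) == 1.0           # reciprocal rank of position 1
assert abs(ndcg_at_k(pred, gold, k=3) - 1.0) < 1e-9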
Evaluate on the Synthetic Dataset
subset = df_queries.sample(100, random_state=0).reset_index(drop=True)

results = []
for _, row in subset.iterrows():
    # evaluate_one is assumed from earlier in the notebook: it runs each
    # retrieval pipeline for one query and computes the metrics defined above.
    out = evaluate_one(row["query"], [row["article_id"]], top_k=10)
    results.append({
        "query": row["query"],
        "positives": row["article_id"],
        "hits": out["hits"],
        "metrics": out["metrics"],
    })
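The per-method table below aggregates the per-query metrics. A minimal aggregation sketch, assuming out["metrics"] maps each method name (bm25, dense, colbert, fusion, hybrid) to a dict of per-query metric values:

# Aggregation sketch under an assumed result structure: average each metric per method.
rows = [
    {"method": method, **metrics}
    for r in results
    for method, metrics in r["metrics"].items()
]
df_metrics = (
    pd.DataFrame(rows)
    .groupby("method")
    .mean(numeric_only=True)
    .sort_values("R@10", ascending=False)  # assumed metric column name
)
display(df_metrics)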
method | R@5 | R@10 | MRR@10 | nDCG@10 | hit@3 | hit@10
fusion | 0.82 | 0.85 | 0.650857 | 0.699083 | 0.70 | 0.85
hybrid | 0.82 | 0.84 | 0.690524 | 0.727520 | 0.75 | 0.84
colbert | 0.79 | 0.81 | 0.663024 | 0.699208 | 0.71 | 0.81
dense | 0.66 | 0.75 | 0.528690 | 0.582100 | 0.60 | 0.75
bm25 | 0.52 | 0.57 | 0.443345 | 0.473788 | 0.49 | 0.57

4 Key observations

  1. Neural retrieval consistently outperforms lexical BM25. Both dense and ColBERT retrievers achieve substantial gains over BM25 across all metrics (e.g., R@10, MRR@10, nDCG@10), indicating that semantic matching is essential for legal queries with abstract or paraphrased formulations.

  2. Late-interaction retrieval (ColBERT) further improves ranking quality. Compared to standard dense retrieval, ColBERT yields higher nDCG@10 and hit@k scores, suggesting better fine-grained alignment between queries and statutory text.

  3. Multi-retriever fusion provides robust recall gains. The fusion variant, which combines BM25, dense, and ColBERT results, achieves the strongest recall-oriented scores in the table (R@5, R@10, hit@10), benefiting from complementary retrieval signals.

  4. Hybrid retrieval with augmentation and reranking achieves the best ranking quality. The hybrid retriever, built on top of fusion and further enhanced with optional graph-based augmentation and reranking, attains the highest MRR@10, nDCG@10, and hit@3, with recall on par with fusion. This confirms the effectiveness of combining heterogeneous retrievers with structural and ranking enhancements.

Overall, the results demonstrate that progressively enriched retrieval pipelines—from lexical to semantic, from single retriever to fused and augmented retrievers—lead to systematic and measurable improvements in legal retrieval quality.