This notebook focuses on the systematic evaluation of retrieval performance. It proceeds in three stages:

1. A synthetic legal question dataset is constructed by transforming a legal corpus into diverse natural-language queries, each paired with its reference law article(s).
2. The retrieval process is illustrated using a sampled query, demonstrating how candidate evidence is retrieved, expanded, and ranked in a step-by-step manner. This example provides intuition for the quantitative evaluations that follow.
3. Using the synthetic dataset, retrieval performance is evaluated comprehensively and quantitatively with standard retrieval metrics, including Recall@K, Hit@K, NDCG, and MRR.
1 Generate Synthetic Evaluation Dataset
Evaluating legal retrievers requires a ground-truth dataset of question–reference article pairs. Generating synthetic legal questions by transforming corpus articles into natural-language queries makes it possible to evaluate retrieval behavior at scale.
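The shape of such a dataset can be sketched as follows. In practice the question would be produced by an LLM that paraphrases each article into a user-style query; `make_question` below is a hypothetical template stub standing in for that call, so only the output structure (question plus gold article ids) should be taken literally.

```python
# Sketch: build (question, reference article) pairs from law chunks.
# `make_question` is a placeholder for an LLM call that rewrites an
# article into a natural-language query; here it is a trivial template.
def make_question(article: dict) -> str:
    # A real pipeline would prompt an LLM with the article text instead.
    return f"Which provision governs: {article['text'][:40]}...?"

def build_synthetic_dataset(chunks: list[dict]) -> list[dict]:
    dataset = []
    for chunk in chunks:
        dataset.append({
            "question": make_question(chunk),
            # Gold labels: the article(s) the question was derived from.
            "reference_article_ids": [chunk["article_id"]],
        })
    return dataset

pairs = build_synthetic_dataset([
    {"article_id": 495, "text": "当事人约定在将来一定期限内订立合同..."},
])
```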
Load the law corpus and sample an example
```python
import json
from pathlib import Path

import pandas as pd

corpus = Path(cfg.paths.law_jsonl)
chunks = [json.loads(line) for line in corpus.open("r", encoding="utf-8") if line.strip()]
df_chunks = pd.DataFrame(chunks)

print("Example law chunk:")
display(df_chunks.drop(columns=["subpart", "section", "article_key", "source"]).sample(1, random_state=1))
print("Number of law chunks:", len(df_chunks))
```
2 Illustrate the Retrieval Pipeline with a Sampled Query

The retrieval pipeline in Legal-RAG is a multi-stage, hybrid, graph-aware process that balances recall, precision, and interpretability. Rather than relying on a single retrieval signal, the retriever progressively refines candidates through multi-channel coarse retrieval, score fusion, graph expansion, and reranking.
Sample a synthetic query to illustrate the retrieval pipeline
Late-interaction retrieval (ColBERT) encodes each token into a contextualized embedding and computes relevance through fine-grained token–token interactions at query time. This design preserves detailed lexical signals while retaining strong semantic generalization.
|   | score | article_id | chapter | preview |
|---|-------|------------|---------|---------|
| 0 | 22.08 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。... |
| 1 | 19.97 | 502 | 三章 合同的效力 | 第五百零二条 依法成立的合同,自成立时生效,但是法律另有规定或者当事人另有约定的除外。 依照... |
| 2 | 19.92 | 493 | 二章 合同的订立 | 第四百九十三条 当事人采用合同书形式订立合同的,最后签名、盖章或者按指印的地点为合同成立的地... |
| 3 | 19.88 | 888 | 二十一章 保管合同 | 第八百八十八条 保管合同是保管人保管寄存人交付的保管物,并返还该物的合同。 寄存人到保管人处... |
| 4 | 19.84 | 984 | 二十九章 不当得利 | 第九百八十四条 管理人管理事务经受益人事后追认的,从管理事务开始时起,适用委托合同的有关规定... |
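The late-interaction (MaxSim) scoring described above can be sketched in a few lines of NumPy: each query token is matched against its best document token, and the per-token maxima are summed. This is a minimal CPU sketch of the scoring rule only; the actual ColBERT implementation batches these interactions over a pre-built token index.

```python
import numpy as np

def maxsim_score(q_emb: np.ndarray, d_emb: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token, take its best-matching
    document token similarity, then sum over query tokens."""
    # Normalize token embeddings so dot products are cosine similarities.
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_q_tokens, num_d_tokens)
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query tokens
```

Because the maximum is taken per query token, exact lexical matches (near-identical token embeddings) contribute strongly, which is why this design preserves fine-grained lexical signals.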
Multi-channel Retrieval Fusion
To integrate heterogeneous retrieval signals, Legal-RAG applies a fusion step over the multi-channel results. Two fusion strategies are supported:
- **Reciprocal Rank Fusion (RRF):** candidates are ranked by the inverse of their rank positions across channels, emphasizing consensus among retrievers.
- **Normalized score blending (norm_blend):** raw scores from different retrievers are normalized into a comparable range and linearly combined.
This fusion step yields a unified ranked list of top candidates, mitigating the weaknesses of any single retrieval method.
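Both fusion strategies can be sketched compactly. The RRF constant `k = 60` and the min-max normalization in `norm_blend` are common defaults, not necessarily the exact values Legal-RAG uses.

```python
def rrf_fuse(rankings: dict[str, list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Reciprocal Rank Fusion: score(doc) = sum over channels of 1/(k + rank)."""
    scores: dict[int, float] = {}
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def norm_blend(channel_scores: dict[str, dict[int, float]],
               weights: dict[str, float]) -> dict[int, float]:
    """Min-max normalize each channel's scores, then combine linearly."""
    blended: dict[int, float] = {}
    for channel, scores in channel_scores.items():
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against a constant-score channel
        for doc_id, s in scores.items():
            blended[doc_id] = blended.get(doc_id, 0.0) + weights[channel] * (s - lo) / span
    return blended
```

A document retrieved by several channels (e.g. article 495 below, found by dense, bm25, and colbert) accumulates contributions from each, which is exactly the consensus effect RRF rewards.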
|   | score | article_id | chapter | preview | channel |
|---|-------|------------|---------|---------|---------|
| 0 | 1.18 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。... | [dense, bm25, colbert] |
| 1 | 0.32 | 491 | 二章 合同的订立 | 第四百九十一条 当事人采用信件、数据电文等形式订立合同要求签订确认书的,签订确认书时合同成立... | [dense, colbert] |
| 2 | 0.31 | 471 | 二章 合同的订立 | 第四百七十一条 当事人订立合同,可以采取要约、承诺方式或者其他方式。... | [dense, colbert] |
| 3 | 0.30 | 493 | 二章 合同的订立 | 第四百九十三条 当事人采用合同书形式订立合同的,最后签名、盖章或者按指印的地点为合同成立的地... | [dense, colbert] |
| 4 | 0.22 | 469 | 二章 合同的订立 | 第四百六十九条 当事人订立合同,可以采用书面形式、口头形式或者其他形式。 书面形式是合同书、... | [colbert, dense] |
Graph-Augmented Retrieval
To capture structural legal relationships, such as:

- General–specific rule hierarchies
- Cross-article references
- Implicit doctrinal dependencies

a graph-based enhancement stage is applied. Using the fusion top candidates as seeds, a graph walk augments the candidate set with structurally relevant but textually under-retrieved articles.
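The expansion step can be sketched as a bounded breadth-first walk from the fusion seeds. The adjacency structure and the `max_hops`/`per_seed` limits are illustrative assumptions; the actual graph in Legal-RAG may weight edges by relation type.

```python
def graph_expand(seeds: list[int], graph: dict[int, list[int]],
                 max_hops: int = 1, per_seed: int = 3) -> set[int]:
    """Bounded walk from fusion seeds over an article graph.

    `graph` maps article_id -> referenced/related article_ids (edges from
    cross-references, shared chapters, doctrinal links, etc.)."""
    frontier, visited = list(seeds), set(seeds)
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, [])[:per_seed]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited - set(seeds)  # newly added, structurally related articles
```

This is how neighboring provisions such as articles 490, 492, and 494 below can enter the candidate pool even when their text overlaps little with the query.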
|   | score | article_id | chapter | preview |
|---|-------|------------|---------|---------|
| 0 | 0.36 | 490 | 二章 合同的订立 | 第四百九十条 当事人采用合同书形式订立合同的,自当事人均签名、盖章或者按指印时合同成立。在签... |
| 1 | 0.34 | 494 | 二章 合同的订立 | 第四百九十四条 国家根据抢险救灾、疫情防控或者其他需要下达国家订货任务、指令性任务的,有关民... |
| 2 | 0.34 | 472 | 二章 合同的订立 | 第四百七十二条 要约是希望与他人订立合同的意思表示,该意思表示应当符合下列条件: (一)内容... |
| 3 | 0.34 | 470 | 二章 合同的订立 | 第四百七十条 合同的内容由当事人约定,一般包括下列条款: (一)当事人的姓名或者名称和住所;... |
| 4 | 0.33 | 492 | 二章 合同的订立 | 第四百九十二条 承诺生效的地点为合同成立的地点。 采用数据电文形式订立合同的,收件人的主营业... |
Reranking
A final reranking stage, using cross-encoder or LLM-based scoring, refines the ordering. Reranking prioritizes answer-critical provisions and demotes peripheral or weakly connected articles based on higher-resolution signals, such as:

- Query–article semantic alignment
- Coverage of legally salient terms
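Structurally, reranking is just re-sorting candidates under a more expensive scorer. In the sketch below, `overlap_score` is a toy character-overlap stand-in; a real pipeline would replace it with a cross-encoder forward pass or an LLM judgment over each query–article pair.

```python
def rerank(query: str, candidates: list[dict], score_fn) -> list[dict]:
    """Re-order fused candidates with a higher-resolution scorer.

    `score_fn(query, text) -> float` stands in for a cross-encoder or
    LLM-based judge; any callable with that signature works here."""
    scored = [(score_fn(query, c["preview"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]

# Toy scorer: fraction of article characters shared with the query.
# Purely illustrative -- it captures "coverage of salient terms" crudely.
def overlap_score(query: str, text: str) -> float:
    q_terms = set(query)
    return sum(1 for ch in text if ch in q_terms) / max(len(text), 1)
```

Even this crude scorer shows the mechanism: the fused list's order can change once each candidate is judged jointly with the query rather than by retrieval-stage scores alone.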
|   | score | article_id | chapter | preview | channel |
|---|-------|------------|---------|---------|---------|
| 0 | 1.11 | 495 | 二章 合同的订立 | 第四百九十五条 当事人约定在将来一定期限内订立合同的认购书、订购书、预订书等,构成预约合同。... | [dense, bm25, colbert] |
| 1 | 0.21 | 491 | 二章 合同的订立 | 第四百九十一条 当事人采用信件、数据电文等形式订立合同要求签订确认书的,签订确认书时合同成立... | [dense, colbert] |
| 2 | 0.20 | 471 | 二章 合同的订立 | 第四百七十一条 当事人订立合同,可以采取要约、承诺方式或者其他方式。... | [dense, colbert] |
| 3 | 0.19 | 493 | 二章 合同的订立 | 第四百九十三条 当事人采用合同书形式订立合同的,最后签名、盖章或者按指印的地点为合同成立的地... | [dense, colbert] |
| 4 | 0.15 | 469 | 二章 合同的订立 | 第四百六十九条 当事人订立合同,可以采用书面形式、口头形式或者其他形式。 书面形式是合同书、... | [colbert, dense] |
3 Evaluating Retrieval Pipelines with Synthetic Questions
The generated synthetic dataset is used to systematically evaluate retrieval performance at each stage. The main findings:
Neural retrieval consistently outperforms lexical BM25. Both dense and ColBERT retrievers achieve substantial gains over BM25 across all metrics (e.g., R@10, MRR@10, nDCG@10), indicating that semantic matching is essential for legal queries with abstract or paraphrased formulations.
Late-interaction retrieval (ColBERT) further improves ranking quality. Compared to standard dense retrieval, ColBERT yields higher nDCG@10 and hit@k scores, suggesting better fine-grained alignment between queries and statutory text.
Multi-retriever fusion provides robust and consistent improvements. The fusion variant, which combines BM25, dense, and ColBERT retrieval results, achieves the strongest overall recall and ranking performance, benefiting from complementary retrieval signals.
Hybrid retrieval with augmentation and reranking achieves the best performance. The hybrid retriever—built on top of fusion and further enhanced with optional graph-based augmentation and reranking—consistently attains the highest scores across nearly all metrics. In particular, it shows clear gains over pure BM25 and pure dense retrieval, confirming the effectiveness of combining heterogeneous retrievers with structural and ranking enhancements.
Overall, the results demonstrate that progressively enriched retrieval pipelines—from lexical to semantic, from single retriever to fused and augmented retrievers—lead to systematic and measurable improvements in legal retrieval quality.
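For reference, the metrics reported above can be computed as follows, in a minimal binary-relevance sketch (each article is simply relevant or not, which matches the question–reference-article setup).

```python
import math

def recall_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    """Fraction of gold articles retrieved in the top-k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def hit_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    """1.0 if any gold article appears in the top-k, else 0.0."""
    return float(bool(set(ranked[:k]) & relevant))

def mrr_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    """Reciprocal rank of the first gold article within the top-k."""
    for i, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Averaging each metric over all synthetic questions yields the per-retriever scores compared in the findings above.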