This notebook demonstrates the end-to-end behavior of the Legal RAG system, covering data preprocessing, query understanding, routing, hybrid retrieval, and evidence-backed response generation.

1 Repository Installation

First, pull the project source code and initialize the execution environment.

git clone https://github.com/Fan-Luo/Legal-RAG.git
cd Legal-RAG
pip install -e .

2 Data Preprocessing

Before the LegalRAG pipeline can serve queries, the legal corpus must be transformed into structured, searchable representations.

In the offline preprocessing stage, raw legal texts are converted into normalized data artifacts, retrieval indices, and a legal knowledge graph through the following steps:

  1. Preprocess law files

    Raw legal documents are parsed, cleaned, and normalized into a unified JSONL format. Each legal provision is assigned a stable identifier and enriched with structural metadata such as law name, part, chapter, article number, and source file.

    python -m scripts.preprocess_law
  2. Build retrieval indices

    Multiple complementary retrieval indices are constructed to support lexical, dense, and late-interaction search paradigms:

    • BM25 for sparse keyword-based retrieval
    • FAISS for dense vector similarity search
    • ColBERT for token-level late-interaction retrieval

    python -m scripts.build_index
  3. Construct the legal knowledge graph

    Legal articles are connected via structured relationships (e.g., citation, reference, dependency), forming a directed legal knowledge graph that enables graph-based expansion and contextual reasoning.

    python -m scripts.build_graph
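The graph built in step 3 can be pictured as a directed adjacency structure over article identifiers. The sketch below is illustrative only: the edge list, article ids, and the `expand` helper are hypothetical, not the repository's actual graph code, but they show how hop-limited expansion over citation edges pulls in related articles.

```python
from collections import defaultdict

# Hypothetical citation edges between article ids (src cites dst).
edges = [
    ("art_1", "art_9"),
    ("art_1", "art_12"),
    ("art_9", "art_30"),
]

# Directed adjacency list: article -> articles it cites.
graph = defaultdict(list)
for src, dst in edges:
    graph[src].append(dst)

def expand(seed_ids, hops=1):
    """Expand a set of retrieved article ids by following citation edges."""
    frontier, seen = set(seed_ids), set(seed_ids)
    for _ in range(hops):
        frontier = {nbr for node in frontier for nbr in graph.get(node, [])} - seen
        seen |= frontier
    return seen

print(sorted(expand({"art_1"}, hops=1)))  # ['art_1', 'art_12', 'art_9']
```

With `hops=2` the expansion would also reach `art_30` through the intermediate citation, which is the kind of contextual neighborhood a graph-based retriever can feed to the generator.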

Generated Data Artifacts:

Lang  Category   Path                                      Exists  Size (MB)
en    Processed  data/processed/law_en.jsonl               True    0.982
en    Index      data/index/en/bm25.pkl                    True    1.796
en    Index      data/index/en/faiss/faiss.index           True    1.884
en    Index      data/index/en/faiss/faiss_meta.jsonl      True    0.962
en    Index      data/index/en/colbert/colbert_meta.jsonl  True    0.987
en    Graph      data/graph/law_graph_en.jsonl             True    0.228
zh    Processed  data/processed/law_zh.jsonl               True    0.658
zh    Index      data/index/zh/bm25.pkl                    True    0.992
zh    Index      data/index/zh/faiss/faiss.index           True    4.018
zh    Index      data/index/zh/faiss/faiss_meta.jsonl      True    0.601
zh    Index      data/index/zh/colbert/colbert_meta.jsonl  True    0.654
zh    Graph      data/graph/law_graph_zh.jsonl             True    0.405
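The Exists and Size (MB) columns above can be reproduced with a short helper. This is a sketch, not the repository's reporting code, and it assumes decimal megabytes (1 MB = 10^6 bytes); the demonstration uses a temporary file rather than the repo's data paths.

```python
from pathlib import Path
import os
import tempfile

def artifact_status(path):
    """Return (exists, size in MB) for a generated artifact, as in the table above.

    Assumes decimal megabytes: bytes / 1e6, rounded to three places.
    """
    p = Path(path)
    exists = p.exists()
    size_mb = round(p.stat().st_size / 1e6, 3) if exists else 0.0
    return exists, size_mb

# Demonstrate on a temporary file (the repo paths are not assumed to exist here).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1000)
tmp = f.name
print(artifact_status(tmp))  # (True, 0.001)
os.remove(tmp)
```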

Example entries from law_zh.jsonl:

{"id": "minfadian.txt::一", "law_name": "中华人民共和国民法典", "lang": "zh", "article_no": "第一条", "article_id": "1", "text": "第一条 为了保护民事主体的合法权益,调整民事关系,维护社会和经济秩序,适应中国特色社会主义发...", "source": "minfadian.txt"}
{"id": "minfadian.txt::二", "law_name": "中华人民共和国民法典", "lang": "zh", "article_no": "第二条", "article_id": "2", "text": "第二条 民法调整平等主体的自然人、法人和非法人组织之间的人身关系和财产关系。", "source": "minfadian.txt"}
{"id": "minfadian.txt::三", "law_name": "中华人民共和国民法典", "lang": "zh", "article_no": "第三条", "article_id": "3", "text": "第三条 民事主体的人身权利、财产权利以及其他合法权益受法律保护,任何组织或者个人不得侵犯。", "source": "minfadian.txt"}

Example entries from law_en.jsonl:

{"id": "ucc_1.txt::1-101", "law_name": "Uniform Commercial Code", "lang": "en", "article_no": "§ 1-101", "article_id": "1-101", "text": "§ 1-101. Short Titles.(a) This [Act] may be ci...", "source": "ucc_1.txt"}
{"id": "ucc_1.txt::1-102", "law_name": "Uniform Commercial Code", "lang": "en", "article_no": "§ 1-102", "article_id": "1-102", "text": "§ 1-102. Scope of Article.This article applies...", "source": "ucc_1.txt"}
{"id": "ucc_1.txt::1-103", "law_name": "Uniform Commercial Code", "lang": "en", "article_no": "§ 1-103", "article_id": "1-103", "text": "§ 1-103. Construction of Uniform Commercial Co...", "source": "ucc_1.txt"}

Field       Description
id          Internal record identifier (source file plus article key)
law_name    Name of the law
lang        Language of the law article
article_no  Article number (human-readable)
article_id  Stable article identifier
text        Full article text
source      Original source file
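The schema can be seen concretely with a JSONL round-trip using the standard json module. The record below mirrors the first UCC entry shown above (text truncated as in the display); one JSON object per line is what the processed files contain.

```python
import io
import json

# A record following the schema above (text truncated as in the example display).
record = {
    "id": "ucc_1.txt::1-101",
    "law_name": "Uniform Commercial Code",
    "lang": "en",
    "article_no": "§ 1-101",
    "article_id": "1-101",
    "text": "§ 1-101. Short Titles.(a) This [Act] may be ci...",
    "source": "ucc_1.txt",
}

# JSONL: serialize one JSON object per line (ensure_ascii=False keeps CJK text readable).
buf = io.StringIO()
buf.write(json.dumps(record, ensure_ascii=False) + "\n")

# Reading it back yields the same dict.
buf.seek(0)
loaded = [json.loads(line) for line in buf]
assert loaded[0] == record
```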

3 Initialize the RAG pipeline

This section sets up the RAG pipeline and configures the underlying language model. The pipeline is constructed from a centralized configuration, ensuring that retrieval, ranking, and generation components are consistently parameterized.

from legalrag.pipeline.rag_pipeline import RagPipeline
from legalrag.config import AppConfig

cfg = AppConfig.load(None)
cfg.llm.provider = "qwen-local"
cfg.llm.model = "Qwen/Qwen2.5-3B-Instruct"
pipeline = RagPipeline(cfg)

The default configuration uses a local Qwen model for generation; it can be switched to an OpenAI model:

cfg.llm.provider = "openai"
cfg.llm.model = "gpt-4.1-mini"   
# Requires setting OPENAI_API_KEY in environment variables
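The official OpenAI client reads OPENAI_API_KEY from the environment when no key is passed explicitly, so the key can also be set from within the notebook before switching providers (the value below is a placeholder, not a real key):

```python
import os

# Placeholder value; substitute a real key before switching cfg.llm.provider.
os.environ["OPENAI_API_KEY"] = "sk-..."
```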

4 Query Understanding and Routing

This step distinguishes between different query types and selects a retrieval mode accordingly.

Accurate query understanding is critical, as it directly influences downstream retrieval strategies and answer generation behavior.

from legalrag.routing.router import QueryRouter
from legalrag.llm.client import LLMClient

llm = LLMClient.from_config(cfg)
router = QueryRouter(llm_client=llm, llm_based=cfg.routing.llm_based)
question = '已经有两个亲生孩子的家庭可以再收养一个孩子吗?'  # "Can a family that already has two biological children adopt another child?"
decision = router.route(question)
print('Issue Type: ', decision.issue_type)
print('Task Type: ', decision.task_type)
print('mode: ', decision.mode)
Issue Type:  IssueType.MARRIAGE_FAMILY
Task Type:  TaskType.JUDGE_STYLE
mode:  RoutingMode.RAG
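For intuition, the routing decision above can be mimicked with a small keyword-based fallback. Everything in this sketch is illustrative, not the repository's QueryRouter: the enum members, the keyword lists, and the `route` function are assumptions, and the TaskType dimension is omitted for brevity.

```python
from enum import Enum

class IssueType(Enum):
    MARRIAGE_FAMILY = "marriage_family"
    CONTRACT = "contract"
    OTHER = "other"

class RoutingMode(Enum):
    RAG = "rag"        # retrieve evidence, then generate
    DIRECT = "direct"  # answer without retrieval

# Illustrative keyword lists for classifying the issue type (hypothetical).
KEYWORDS = {
    IssueType.MARRIAGE_FAMILY: ["收养", "婚姻", "adopt", "marriage"],
    IssueType.CONTRACT: ["合同", "contract"],
}

def route(question: str):
    """Pick an issue type by keyword match; recognized legal questions go through RAG."""
    for issue, words in KEYWORDS.items():
        if any(w in question for w in words):
            return issue, RoutingMode.RAG
    return IssueType.OTHER, RoutingMode.DIRECT

issue, mode = route("已经有两个亲生孩子的家庭可以再收养一个孩子吗?")
print(issue, mode)
```

An LLM-based router (as enabled by cfg.routing.llm_based) generalizes beyond fixed keyword lists, which is why the pipeline treats query understanding as a first-class step rather than a lookup table.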