This notebook demonstrates the end-to-end behavior of the Legal RAG system, covering data preprocessing, query understanding, routing, hybrid retrieval, and evidence-backed response generation.

1 Repository Installation

First, pull the project source code and initialize the execution environment.

git clone https://github.com/Fan-Luo/Legal-RAG.git
cd Legal-RAG
pip install -e .

2 Data Preprocessing

Before the LegalRAG pipeline can serve queries, the legal corpus must be transformed into structured, searchable representations.

In the offline preprocessing stage, raw legal texts are converted into normalized data artifacts, retrieval indices, and a legal knowledge graph through the following steps:

  1. Preprocess law files

    Raw legal documents are parsed, cleaned, and normalized into a unified JSONL format. Each legal provision is assigned a stable identifier and enriched with structural metadata such as law name, part, chapter, article number, and source file.

    python -m scripts.preprocess_law
  2. Build retrieval indices

    Multiple complementary retrieval indices are constructed to support lexical, dense, and late-interaction search paradigms:

    • BM25 for sparse keyword-based retrieval
    • FAISS for dense vector similarity search
    • ColBERT for token-level late-interaction retrieval

    python -m scripts.build_index
  3. Construct the legal knowledge graph

    Legal articles are connected via structured relationships (e.g., citation, reference, dependency), forming a directed legal knowledge graph that enables graph-based expansion and contextual reasoning.

    python -m scripts.build_graph
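The graph built in step 3 can be pictured as a directed adjacency structure over article identifiers. The sketch below is illustrative only: the edge list, article ids, and the `expand` helper are hypothetical, not the repository's actual graph code, but they show how hop-limited expansion over citation edges pulls in related articles.

```python
from collections import defaultdict

# Hypothetical citation edges between article ids (src cites dst).
edges = [
    ("art_1", "art_9"),
    ("art_1", "art_12"),
    ("art_9", "art_30"),
]

# Directed adjacency list: article -> articles it cites.
graph = defaultdict(list)
for src, dst in edges:
    graph[src].append(dst)

def expand(seed_ids, hops=1):
    """Expand a set of retrieved article ids by following citation edges."""
    frontier, seen = set(seed_ids), set(seed_ids)
    for _ in range(hops):
        frontier = {nbr for node in frontier for nbr in graph.get(node, [])} - seen
        seen |= frontier
    return seen

print(sorted(expand({"art_1"}, hops=1)))  # ['art_1', 'art_12', 'art_9']
```

With `hops=2` the expansion would also reach `art_30` through the intermediate citation, which is the kind of contextual neighborhood a graph-based retriever can feed to the generator.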

Generated Data Artifacts:

Lang  Category   Path                                      Exists  Size (MB)
en    Processed  data/processed/law_en.jsonl               True    0.982
en    Index      data/index/en/bm25.pkl                    True    1.796
en    Index      data/index/en/faiss/faiss.index           True    1.884
en    Index      data/index/en/faiss/faiss_meta.jsonl      True    0.962
en    Index      data/index/en/colbert/colbert_meta.jsonl  True    0.987
en    Graph      data/graph/law_graph_en.jsonl             True    0.228
zh    Processed  data/processed/law_zh.jsonl               True    0.658
zh    Index      data/index/zh/bm25.pkl                    True    0.992
zh    Index      data/index/zh/faiss/faiss.index           True    4.018
zh    Index      data/index/zh/faiss/faiss_meta.jsonl      True    0.601
zh    Index      data/index/zh/colbert/colbert_meta.jsonl  True    0.654
zh    Graph      data/graph/law_graph_zh.jsonl             True    0.405
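The Exists and Size (MB) columns above can be reproduced with a short helper. This is a sketch, not the repository's reporting code, and it assumes decimal megabytes (1 MB = 10^6 bytes); the demonstration uses a temporary file rather than the repo's data paths.

```python
from pathlib import Path
import os
import tempfile

def artifact_status(path):
    """Return (exists, size in MB) for a generated artifact, as in the table above.

    Assumes decimal megabytes: bytes / 1e6, rounded to three places.
    """
    p = Path(path)
    exists = p.exists()
    size_mb = round(p.stat().st_size / 1e6, 3) if exists else 0.0
    return exists, size_mb

# Demonstrate on a temporary file (the repo paths are not assumed to exist here).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1000)
tmp = f.name
print(artifact_status(tmp))  # (True, 0.001)
os.remove(tmp)
```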

Example entries from law_zh.jsonl:

{"id": "minfadian.txt::一", "law_name": "中华人民共和国民法典", "lang": "zh", "article_no": "第一条", "article_id": "1", "text": "第一条 为了保护民事主体的合法权益,调整民事关系,维护社会和经济秩序,适应中国特色社会主义发...", "source": "minfadian.txt"}
{"id": "minfadian.txt::二", "law_name": "中华人民共和国民法典", "lang": "zh", "article_no": "第二条", "article_id": "2", "text": "第二条 民法调整平等主体的自然人、法人和非法人组织之间的人身关系和财产关系。", "source": "minfadian.txt"}
{"id": "minfadian.txt::三", "law_name": "中华人民共和国民法典", "lang": "zh", "article_no": "第三条", "article_id": "3", "text": "第三条 民事主体的人身权利、财产权利以及其他合法权益受法律保护,任何组织或者个人不得侵犯。", "source": "minfadian.txt"}

Example entries from law_en.jsonl:

{"id": "ucc_1.txt::1-101", "law_name": "Uniform Commercial Code", "lang": "en", "article_no": "§ 1-101", "article_id": "1-101", "text": "§ 1-101. Short Titles.(a) This [Act] may be ci...", "source": "ucc_1.txt"}
{"id": "ucc_1.txt::1-102", "law_name": "Uniform Commercial Code", "lang": "en", "article_no": "§ 1-102", "article_id": "1-102", "text": "§ 1-102. Scope of Article.This article applies...", "source": "ucc_1.txt"}
{"id": "ucc_1.txt::1-103", "law_name": "Uniform Commercial Code", "lang": "en", "article_no": "§ 1-103", "article_id": "1-103", "text": "§ 1-103. Construction of Uniform Commercial Co...", "source": "ucc_1.txt"}

Field       Description
id          Internal record identifier (source file plus article key)
law_name    Name of the law
lang        Language of the law article
article_no  Article number (human-readable)
article_id  Stable article identifier
text        Full article text
source      Original source file
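The schema can be seen concretely with a JSONL round-trip using the standard json module. The record below mirrors the first UCC entry shown above (text truncated as in the display); one JSON object per line is what the processed files contain.

```python
import io
import json

# A record following the schema above (text truncated as in the example display).
record = {
    "id": "ucc_1.txt::1-101",
    "law_name": "Uniform Commercial Code",
    "lang": "en",
    "article_no": "§ 1-101",
    "article_id": "1-101",
    "text": "§ 1-101. Short Titles.(a) This [Act] may be ci...",
    "source": "ucc_1.txt",
}

# JSONL: serialize one JSON object per line (ensure_ascii=False keeps CJK text readable).
buf = io.StringIO()
buf.write(json.dumps(record, ensure_ascii=False) + "\n")

# Reading it back yields the same dict.
buf.seek(0)
loaded = [json.loads(line) for line in buf]
assert loaded[0] == record
```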

3 Initialize the RAG pipeline

This section sets up the RAG pipeline and configures the underlying language model. The pipeline is constructed from a centralized configuration, ensuring that retrieval, ranking, and generation components are consistently parameterized.

from legalrag.pipeline.rag_pipeline import RagPipeline
from legalrag.config import AppConfig

cfg = AppConfig.load(None)
cfg.llm.provider = "qwen-local"
cfg.llm.model = "Qwen/Qwen2.5-3B-Instruct"
pipeline = RagPipeline(cfg)

The default configuration uses a local Qwen model for generation; it can be switched to an OpenAI model:

cfg.llm.provider = "openai"
cfg.llm.model = "gpt-4.1-mini"   
# Requires setting OPENAI_API_KEY in environment variables
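The official OpenAI client reads OPENAI_API_KEY from the environment when no key is passed explicitly, so the key can also be set from within the notebook before switching providers (the value below is a placeholder, not a real key):

```python
import os

# Placeholder value; substitute a real key before switching cfg.llm.provider.
os.environ["OPENAI_API_KEY"] = "sk-..."
```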

4 Query Understanding and Routing

This step distinguishes between different query types and selects a retrieval mode accordingly.

Accurate query understanding is critical, as it directly influences downstream retrieval strategies and answer generation behavior.

from legalrag.routing.router import QueryRouter
from legalrag.llm.client import LLMClient

llm = LLMClient.from_config(cfg)
router = QueryRouter(llm_client=llm, llm_based=cfg.routing.llm_based)
question = '已经有两个亲生孩子的家庭可以再收养一个孩子吗?'  # "Can a family that already has two biological children adopt another child?"
decision = router.route(question)
print('Issue Type: ', decision.issue_type)
print('Task Type: ', decision.task_type)
print('mode: ', decision.mode)
Issue Type:  IssueType.MARRIAGE_FAMILY
Task Type:  TaskType.JUDGE_STYLE
mode:  RoutingMode.RAG
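For intuition, the routing decision above can be mimicked with a small keyword-based fallback. Everything in this sketch is illustrative, not the repository's QueryRouter: the enum members, the keyword lists, and the `route` function are assumptions, and the TaskType dimension is omitted for brevity.

```python
from enum import Enum

class IssueType(Enum):
    MARRIAGE_FAMILY = "marriage_family"
    CONTRACT = "contract"
    OTHER = "other"

class RoutingMode(Enum):
    RAG = "rag"        # retrieve evidence, then generate
    DIRECT = "direct"  # answer without retrieval

# Illustrative keyword lists for classifying the issue type (hypothetical).
KEYWORDS = {
    IssueType.MARRIAGE_FAMILY: ["收养", "婚姻", "adopt", "marriage"],
    IssueType.CONTRACT: ["合同", "contract"],
}

def route(question: str):
    """Pick an issue type by keyword match; recognized legal questions go through RAG."""
    for issue, words in KEYWORDS.items():
        if any(w in question for w in words):
            return issue, RoutingMode.RAG
    return IssueType.OTHER, RoutingMode.DIRECT

issue, mode = route("已经有两个亲生孩子的家庭可以再收养一个孩子吗?")
print(issue, mode)
```

An LLM-based router (as enabled by cfg.routing.llm_based) generalizes beyond fixed keyword lists, which is why the pipeline treats query understanding as a first-class step rather than a lookup table.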