LightRAG 源码简要分享

Guo, Z., Xia, L., Yu, Y., Ao, T., & Huang, C. (2024). Lightrag: Simple and fast retrieval-augmented generation.

大体流程:

  • 用 LLM 提取 chunks 中的实体和关系, 并存成一个图
  • 用 LLM 从 query 中提取关键词, 根据关键词召回实体或关系, 再找到最相关的 chunks, 最后把所有东西都拼起来给 LLM 输出答案

提取实体和关系并存为图

提示词位于 lightrag/prompt.py. 文档分片后, 让 LLM 按照特定格式提取实体和关系 (以及关键词), 再把 output 解析出来存储. 看代码下面的 step 3 content_keywords 好像全程都没用到.

1. Identify all entities. 
...
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
...
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)

...

5. When finished, output {completion_delimiter}
Example 1:

Entity_types: [person, technology, mission, organization, location]
Text:
...
################
Output:
("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}
("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}
...
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}"conflict resolution, mutual respect"{tuple_delimiter}8){record_delimiter}
...
("content_keywords"{tuple_delimiter}"power dynamics, ideological conflict, discovery, rebellion"){completion_delimiter}

实体存储如下, 后面会把同名的实体合并 (description 也合并, 过长就 LLM 摘要), source_id 是所有来源的 chunk id. 根据 dp["entity_name"] + dp["description"] 做 embedding.

dict(
    entity_name=entity_name,
    entity_type=entity_type,
    description=entity_description,
    source_id=entity_source_id,
)

关系存储如下, 其中 edge_keywords 和 weight 分别为之前 LLM 生成的 relationship_keywordsrelationship_strength. 之后合并时 weight 会相加. 根据 dp["keywords"] + dp["src_id"] + dp["tgt_id"] + dp["description"] 做 embedding.

dict(
    src_id=source,
    tgt_id=target,
    weight=weight,
    description=edge_description,
    keywords=edge_keywords,
    source_id=edge_source_id,
    metadata={"created_at": time.time()},
)

把实体和关系分别作为节点和边, 存为图.

召回

用 LLM 从用户 query 中提取 high-level 与 low-level 关键词.

Given the query, list both high-level and low-level keywords. High-level keywords focus on overarching concepts or themes, while low-level keywords focus on specific entities, details, or concrete terms.
Example 1:

Query: "How does international trade influence global economic stability?"
################
Output:
{
  "high_level_keywords": ["International trade", "Global economic stability", "Economic impact"],
  "low_level_keywords": ["Trade agreements", "Tariffs", "Currency exchange", "Imports", "Exports"]
}
if query_param.mode == "local":
    entities_context, relations_context, text_units_context = await _get_node_data(
        ll_keywords,
        knowledge_graph_inst,
        entities_vdb,
        text_chunks_db,
        query_param,
    )
elif query_param.mode == "global":
    entities_context, relations_context, text_units_context = await _get_edge_data(
        hl_keywords,
        knowledge_graph_inst,
        relationships_vdb,
        text_chunks_db,
        query_param,
    )

先看所谓的 local 召回.

  1. 把 low-level 关键词拼接成字符串 (是的, 虽然抽出来格式是列表, 但没什么意义), 比如 Trade agreements, Tariffs, Currency exchange, Imports, Exports, 代码中记为 query (lightrag/operate.py_get_node_data 函数).
  2. 根据 query, 从实体向量库中召回 top k 实体.
  3. _find_most_related_text_unit_from_entities. 找出之前召回实体的所有边 (关系), 把所有 chunks 按照这些边的数量 (relation_counts) 从大到小排序, 根据限制 token 数截取前若干个 chunks.
  4. _find_most_related_edges_from_entities. 找出之前召回实体的所有边 (关系), 根据 tuple(实体节点的 degree 代码记为 rank, 边的 weight 即 relationship_strength) 从大到小排序, 根据限制 token 数截取前若干个关系的 description.
  5. 最后将实体, 关系, 以及 chunks 信息以 csv 的格式拼接起来给 LLM 推理.

entities_context

id,entity,type,description,rank
0,"""A CHRISTMAS CAROL""","""EVENT""","""A Christmas Carol is a literary event, being a classic story written by Charles Dickens and published in various editions.""",12

relations_context

id,source,target,description,keywords,weight,rank,created_at
0,"""A CHRISTMAS CAROL""","""CHARLES DICKENS""","""Charles Dickens is the author of 'A Christmas Carol,' making him the creator of this literary work.""","""authorship, literary creation""",10.0,13,UNKNOWN

text_units_context (chunks)

id,content
0,"The Project Gutenberg eBook of A Christmas Carol..."

再看 global 召回, 和之前类似, 从略.

  1. 用 high-level 关键词拼成字符串.
  2. 召回边 (关系).
  3. _find_related_text_unit_from_relationships
  4. _find_most_related_entities_from_relationships
  5. 最终拼接