Guo, Z., Xia, L., Yu, Y., Ao, T., & Huang, C. (2024). Lightrag: Simple and fast retrieval-augmented generation.
大体流程:
- 用 LLM 提取 chunks 中的实体和关系, 并存成一个图
- 用 LLM 从 query 中提取关键词, 根据关键词召回实体或关系, 再找到最相关的 chunks, 最后把所有东西都拼起来给 LLM 输出答案
提取实体和关系并存为图
提示词位于 lightrag/prompt.py
. 文档分片后, 让 LLM 按照特定格式提取实体和关系 (以及关键词), 再把 output 解析出来存储. 看代码下面的 step 3 content_keywords
好像全程都没用到.
1. Identify all entities.
...
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
...
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)
3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)
...
5. When finished, output {completion_delimiter}
Example 1:
Entity_types: [person, technology, mission, organization, location]
Text:
...
################
Output:
("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}
("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}
...
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}"conflict resolution, mutual respect"{tuple_delimiter}8){record_delimiter}
...
("content_keywords"{tuple_delimiter}"power dynamics, ideological conflict, discovery, rebellion"){completion_delimiter}
实体存储如下, 后面会把同名的实体合并 (description 也合并, 过长就 LLM 摘要), source_id
是所有来源的 chunk id. 根据 dp["entity_name"] + dp["description"]
做 embedding.
dict(
entity_name=entity_name,
entity_type=entity_type,
description=entity_description,
source_id=entity_source_id,
)
关系存储如下, 其中 edge_keywords
和 weight 分别为之前 LLM 生成的 relationship_keywords
和 relationship_strength
. 之后合并时 weight 会相加. 根据 dp["keywords"] + dp["src_id"] + dp["tgt_id"] + dp["description"]
做 embedding.
dict(
src_id=source,
tgt_id=target,
weight=weight,
description=edge_description,
keywords=edge_keywords,
source_id=edge_source_id,
metadata={"created_at": time.time()},
)
把实体和关系分别作为节点和边, 存为图.
召回
用 LLM 从用户 query 中提取 high-level 与 low-level 关键词.
Given the query, list both high-level and low-level keywords. High-level keywords focus on overarching concepts or themes, while low-level keywords focus on specific entities, details, or concrete terms.
Example 1:
Query: "How does international trade influence global economic stability?"
################
Output:
{
"high_level_keywords": ["International trade", "Global economic stability", "Economic impact"],
"low_level_keywords": ["Trade agreements", "Tariffs", "Currency exchange", "Imports", "Exports"]
}
if query_param.mode == "local":
entities_context, relations_context, text_units_context = await _get_node_data(
ll_keywords,
knowledge_graph_inst,
entities_vdb,
text_chunks_db,
query_param,
)
elif query_param.mode == "global":
entities_context, relations_context, text_units_context = await _get_edge_data(
hl_keywords,
knowledge_graph_inst,
relationships_vdb,
text_chunks_db,
query_param,
)
先看所谓的 local 召回.
- 把 low-level 关键词拼接成字符串 (是的, 虽然抽出来格式是列表, 但没什么意义), 比如
Trade agreements, Tariffs, Currency exchange, Imports, Exports
, 代码中记为 query (lightrag/operate.py
的_get_node_data
函数). - 根据 query, 从实体向量库中召回 top k 实体.
_find_most_related_text_unit_from_entities
. 找出之前召回实体的所有边 (关系), 把所有 chunks 按照这些边的数量 (relation_counts) 从大到小排序, 根据限制 token 数截取前若干个 chunks._find_most_related_edges_from_entities
. 找出之前召回实体的所有边 (关系), 根据 tuple(实体节点的 degree 代码记为 rank, 边的 weight 即relationship_strength
) 从大到小排序, 根据限制 token 数截取前若干个关系的 description.- 最后将实体, 关系, 以及 chunks 信息以 csv 的格式拼接起来给 LLM 推理.
entities_context
id,entity,type,description,rank
0,"""A CHRISTMAS CAROL""","""EVENT""","""A Christmas Carol is a literary event, being a classic story written by Charles Dickens and published in various editions.""",12
relations_context
id,source,target,description,keywords,weight,rank,created_at
0,"""A CHRISTMAS CAROL""","""CHARLES DICKENS""","""Charles Dickens is the author of 'A Christmas Carol,' making him the creator of this literary work.""","""authorship, literary creation""",10.0,13,UNKNOWN
text_units_context (chunks)
id,content
0,"The Project Gutenberg eBook of A Christmas Carol..."
再看 global 召回, 和之前类似, 从略.
- 用 high-level 关键词拼成字符串.
- 召回边 (关系).
_find_related_text_unit_from_relationships
_find_most_related_entities_from_relationships
- 最终拼接