There are many implementations in the ChatDoc family; the typical pipeline is as follows (a minimal code sketch follows the list):
- Parse the document into text, split the text into small chunks, and store each chunk's embedding as an index
- Embed the user's query and retrieve (kNN) the relevant chunks from the chunk-embedding index
- Assemble the query and the retrieved chunks into a prompt and call an LLM to get the answer
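A minimal sketch of this pipeline. Here embed_fn and llm_fn are placeholders for a real embedding model and LLM call, and the kNN step is brute-force cosine similarity; none of this is any specific project's API.

import numpy as np

def build_index(text, embed_fn, chunk_size=500):
    # Split the document text into fixed-size chunks and embed each one.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    vecs = np.array([embed_fn(c) for c in chunks], dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize for cosine similarity
    return chunks, vecs

def answer(query, chunks, vecs, embed_fn, llm_fn, k=3):
    q = np.asarray(embed_fn(query), dtype=float)
    q /= np.linalg.norm(q)
    top = np.argsort(vecs @ q)[::-1][:k]  # brute-force kNN by cosine similarity
    context = "\n\n".join(chunks[i] for i in top)
    return llm_fn(f"{context}\n\nQuery: {query}\nAnswer:")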
Below are some open-source projects and how their implementations differ.
pdfGPT
Embedding
Sentence embeddings use universal-sentence-encoder, which only supports English.
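universal-sentence-encoder is typically loaded via tensorflow_hub (a sketch, not pdfGPT's exact code):

import tensorflow_hub as hub

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = encoder(["What is attention?", "Attention weighs token interactions."])  # 512-d embeddings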
Retrieval
kNN
Prompt
prompt = ""
prompt += 'search results:\n\n'
for c in topn_chunks:
prompt += c + '\n\n'
prompt += (
"Instructions: Compose a comprehensive reply to the query using the search results given. "
"Cite each reference using [ Page Number] notation (every result has this number at the beginning). "
"Citation should be done at the end of each sentence. If the search results mention multiple subjects "
"with the same name, create separate answers for each. Only include information found in the results and "
"don't add any additional information. Make sure the answer is correct and don't output false content. "
"If the text does not relate to the query, simply state 'Text Not Found in PDF'. Ignore outlier "
"search results which has nothing to do with the question. Only answer what is asked. The "
"answer should be short and concise. Answer step-by-step. \n\nQuery: {question}\nAnswer: "
)
prompt += f"Query: {question}\nAnswer:"
The most interesting part is that this prompt asks the LLM to cite a source number at the end of every sentence of its answer.
Observing the sources returned by PandaGPT (not open source): there are always exactly four, and some are irrelevant, so presumably it simply reports the positions of all retrieved chunks. ChatDOC returns a variable number of sources, also including irrelevant ones, so it probably does the same, perhaps ordered by relevance.
ChatFiles
Built on llama_index, a package that wraps the typical ChatDoc pipeline described at the top.
Embedding
Calls the OpenAI API by default.
Retrieval
Several retrieval modes are configurable; the default is kNN.
Prompt
DEFAULT_PROMPT = (
    "We have provided context information below: \n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given this information, Please answer my question in the same language that I used to ask you.\n"
    "Please answer the question: {query_str}\n"
)
llama_index has other prompts for different use cases. If the prompt would be too long (because the retrieved chunks are too long), it is split into multiple prompts; each subsequent prompt includes the previous answer and asks the LLM to refine it (a sketch follows the templates below).
Some of llama_index's original prompts:
DEFAULT_TEXT_QA_PROMPT_TMPL = (
    "Context information is below. \n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)
DEFAULT_REFINE_PROMPT_TMPL = (
    "The original question is as follows: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "Given the new context, refine the original answer to better "
    "answer the question. "
    "If the context isn't useful, return the original answer."
)
DEFAULT_SUMMARY_PROMPT_TMPL = (
    "Write a summary of the following. Try to use only the "
    "information provided. "
    "Try to include as many key details as possible.\n"
    "\n"
    "\n"
    "{context_str}\n"
    "\n"
    "\n"
    'SUMMARY:"""\n'
)
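A sketch of the split-and-refine flow described earlier, assuming the two templates above and a placeholder llm callable; this is not llama_index's actual implementation:

def answer_with_refine(llm, question, chunks):
    # Answer the first chunk with the plain QA template.
    answer = llm(DEFAULT_TEXT_QA_PROMPT_TMPL.format(
        context_str=chunks[0], query_str=question))
    # Each later chunk feeds the previous answer back for refinement.
    for chunk in chunks[1:]:
        answer = llm(DEFAULT_REFINE_PROMPT_TMPL.format(
            query_str=question, existing_answer=answer, context_msg=chunk))
    return answer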
langchain-ChatGLM
imClumsyPanda/langchain-ChatGLM
Built on langchain, optimized for Chinese.
Embedding
embedding_model_dict = {
    "ernie-tiny": "nghuyong/ernie-3.0-nano-zh",
    "ernie-base": "nghuyong/ernie-3.0-base-zh",
    "text2vec-base": "shibing624/text2vec-base-chinese",
    "text2vec": "GanymedeNil/text2vec-large-chinese",
}
langchain's HuggingFaceEmbeddings defaults to sentence-transformers/all-mpnet-base-v2, where "The all-* models are only trained on English. The all means 'all-training-datasets'" (#1232), so it is likely a poor fit for Chinese.
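One of the Chinese models above can be plugged in instead; a minimal sketch assuming the 2023-era langchain API (requires sentence-transformers installed):

from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="GanymedeNil/text2vec-large-chinese")
vector = embeddings.embed_query("什么是注意力机制?")  # embed a Chinese query for retrieval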
Prompt
PROMPT_TEMPLATE = """已知信息:
{context}
根据上述已知信息,简洁和专业的来回答用户的问题。如果无法从中得到答案,请说 “根据已知信息无法回答该问题” 或 “没有提供足够的相关信息”,不允许在答案中添加编造成分,答案请使用中文。 问题是:{question}"""
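Roughly in English: "Known information: {context}. Based on the known information above, answer the user's question concisely and professionally. If the answer cannot be derived from it, say 'This question cannot be answered from the known information' or 'Not enough relevant information was provided'; fabricated content must not be added to the answer; answer in Chinese. The question is: {question}"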
Misc
General prompt design
The examples above basically share this layout:
- context
- instruction
- query
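Stitched together, the layout is just (all names here are placeholders):

prompt = f"{context}\n\n{instruction}\n\nQuery: {query}\nAnswer:"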
Retrieval
Summarize each chunk, or expand it into questions (ask the LLM what questions the chunk could answer), and store that information alongside the chunk, in the hope of improving recall.
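A sketch of that idea, with hypothetical llm and embed_fn callables: each chunk is indexed under the questions an LLM thinks it can answer, while retrieval still returns the original chunk.

def index_with_questions(chunks, llm, embed_fn):
    index = []
    for chunk in chunks:
        # Ask the LLM what questions this chunk could answer.
        questions = llm(f"List three questions this text can answer:\n\n{chunk}")
        for q in questions.splitlines():
            if q.strip():
                # A user query is matched against the generated questions,
                # but the original chunk is what gets returned.
                index.append((embed_fn(q), chunk))
    return index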
Making it more like a search engine
- LLM+Embedding构建问答系统的局限性及优化方案
- 2023-06-13 Large Language Models and Search
- RAG探索之路的血泪史及曙光 (I haven't read this yet)