ChatDoc 系列简要

ChatDoc 系列有很多实现, 典型流程如下:

  • 解析文档获得文本, 将文本切成小块, 存储各文本块的 embedding 作为索引
  • 获得用户 query 的 embedding, 从文本块 embedding 索引中召回 (kNN) 相关文本块
  • 将 query 和召回的文本块组装成 prompt 调用 LLM 获得答案

下面是一些开源项目及其实现方式区别

pdfGPT

bhaskatripathi/pdfGPT

Embedding

句子 embedding 用了 universal-sentence-encoder, 只支持英文.

Retrieval

kNN

Prompt

prompt = ""
prompt += 'search results:\n\n'
for c in topn_chunks:
    prompt += c + '\n\n'

prompt += (
    "Instructions: Compose a comprehensive reply to the query using the search results given. "
    "Cite each reference using [ Page Number] notation (every result has this number at the beginning). "
    "Citation should be done at the end of each sentence. If the search results mention multiple subjects "
    "with the same name, create separate answers for each. Only include information found in the results and "
    "don't add any additional information. Make sure the answer is correct and don't output false content. "
    "If the text does not relate to the query, simply state 'Text Not Found in PDF'. Ignore outlier "
    "search results which has nothing to do with the question. Only answer what is asked. The "
    "answer should be short and concise. Answer step-by-step. \n\nQuery: {question}\nAnswer: "
)

prompt += f"Query: {question}\nAnswer:"

最有趣的点是这个 prompt 让 LLM 回答时每个句末给出来源编号.

观察 PandaGPT (非开源) 给出的 sources, 固定是四个, 而且包含无关来源, 可以推断其实现方式应该是给出所有召回文本块的位置. ChatDOC 给出的 sources 数量不固定, 但也有无关来源, 估计也是这么做的, 而且可能是按相关度排序的.

ChatFiles

guangzhengli/ChatFiles

基于 llama_index 开发, 这个包封装了开头说的 ChatDoc 典型流程.

Embedding

默认调 openai 接口

Retrieval

可以配置多种方式, 默认是 kNN

Prompt

DEFAULT_PROMPT = (
    "We have provided context information below: \n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given this information, Please answer my question in the same language that I used to ask you.\n"
    "Please answer the question: {query_str}\n"
)

llama_index 中根据使用场景还有别的 prompt; 如果 prompt 过长 (因为召回的文本块过长), 则会切分成多个 prompts, 后续的 prompt 中加入之前的回答, 并让 LLM 优化之前的回答.

llama_index 的一些原始 prompt

DEFAULT_TEXT_QA_PROMPT_TMPL = (
    "Context information is below. \n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)
DEFAULT_REFINE_PROMPT_TMPL = (
    "The original question is as follows: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "Given the new context, refine the original answer to better "
    "answer the question. "
    "If the context isn't useful, return the original answer."
)

DEFAULT_SUMMARY_PROMPT_TMPL = (
    "Write a summary of the following. Try to use only the "
    "information provided. "
    "Try to include as many key details as possible.\n"
    "\n"
    "\n"
    "{context_str}\n"
    "\n"
    "\n"
    'SUMMARY:"""\n'
)

langchain-ChatGLM

imClumsyPanda/langchain-ChatGLM

基于 langchain 开发, 针对中文优化.

Embedding

embedding_model_dict = {
    "ernie-tiny": "nghuyong/ernie-3.0-nano-zh",
    "ernie-base": "nghuyong/ernie-3.0-base-zh",
    "text2vec-base": "shibing624/text2vec-base-chinese",
    "text2vec": "GanymedeNil/text2vec-large-chinese",
}

langchain 的 HuggingFaceEmbeddings 的默认模型是 sentence-transformers/all-mpnet-base-v2, 其中 “The all-* models are only trained on English. The all means ‘all-training-datasets’” (#1232) 所以中文可能不够好.

Prompt

PROMPT_TEMPLATE = """已知信息:
{context} 

根据上述已知信息,简洁和专业的来回答用户的问题。如果无法从中得到答案,请说 “根据已知信息无法回答该问题” 或 “没有提供足够的相关信息”,不允许在答案中添加编造成分,答案请使用中文。 问题是:{question}"""

Miscs

General prompt design

上面几个例子基本上是这样的

context
instruction
query

Retrieval

对每个文本块总结或者扩写问题 (问 LLM 这个文本块可以问什么问题), 再把这些信息和文本块一起存储起来, 期待改善召回.

搜索引擎化