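On a read-through, this piece comes across as overselling Hermes. The memory design points the author credits to it are things Claude Code has been doing for a while. The author's reverse engineering of Claude Code's memory was done last year and is not based on the leaked code.
The decision points on whether an AI product should ship a memory feature are still a useful reference.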
Reverse Engineering ChatGPT, Claude, OpenClaw, and Hermes Convinced Me Most AI Products Shouldn’t Ship Memory
Apr 22, 2026 · Manthan Gupta
The first time I asked ChatGPT what it remembered about me, it listed 33 facts. Name, career goals, fitness routine, names of side projects I had mentioned weeks earlier, a throwaway line about my sabbatical from a completely unrelated chat. I was impressed. I spent the next few weeks reverse-engineering how it actually works, then did the same for Claude, OpenClaw, and Hermes. Somewhere around the third system, I stopped noticing how clever the designs were and started noticing how often memory was degrading my own outputs.
Explicit state beats implicit memory surprisingly often. Memory is not a default feature; it is a product and systems tax, and most AI products have not earned the right to pay it.
What I Saw Inside Four Memory Systems
I want to start with the synthesis, because it is the part I could not have written without doing the reverse-engineering work.
Each of the four systems I mapped takes a fundamentally different approach:
- ChatGPT uses an injected profile. A long-term fact store (33 facts, in my case), plus pre-computed summaries of recent chats, plus session metadata, all glued into every prompt. It is a curated block that rides along on every turn.
- Claude uses on-demand retrieval. A small <userMemories> block is always present, but past conversations are not injected by default. The model can invoke conversation_search or recent_chats as tools when it decides context is relevant.
- OpenClaw uses a Markdown workspace. Everything is plain files on disk: MEMORY.md for durable knowledge and memory/YYYY-MM-DD.md for daily logs, indexed for hybrid semantic + keyword search. The agent searches its own notes on demand.
- Hermes uses a hot/cold split: a tiny frozen prompt memory (MEMORY.md capped at 2,200 characters and USER.md at 1,375 characters, about 1,300 tokens combined), plus a SQLite-backed session_search for episodic recall, plus a skills system for procedural memory, plus an optional user-modeling layer.
Four systems, four different answers. Put them next to each other and a pattern becomes obvious. The simplest approach, the injected profile, is also the one with the worst failure mode: every old fact competes for attention on every future prompt, and you have no principled story for when to stop. It is also the one most product teams copy, because it is the easiest to demo and the easiest to ship. That is how most AI products ended up with memory that degrades their own outputs.
The most deliberate of the four is Hermes, and the design choice holding it together is a single sentence from the source: keep the prompt stable for caching, and push everything else to tools. Memory stops being ambient and becomes a choice the model has to make. That is the direction I think most teams should copy, and it is the opposite of what the easy path gives you.
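A minimal sketch of a bounded hot tier, assuming only the two character caps the post reports for Hermes. The class name, the method names, and the reject-on-overflow behavior are illustrative assumptions, not Hermes's actual code.

```python
from dataclasses import dataclass, field

# Character caps reported in the post; everything else here is hypothetical.
MEMORY_CAP = 2200  # MEMORY.md
USER_CAP = 1375    # USER.md


@dataclass
class HotMemory:
    """Always-injected block with a hard character cap."""
    cap: int
    entries: list[str] = field(default_factory=list)

    def render(self) -> str:
        # The exact text that would be placed in the prompt.
        return "\n".join(self.entries)

    def add(self, entry: str) -> bool:
        # Refuse the write if the rendered block would exceed the cap,
        # so the caller has to evict or consolidate deliberately.
        candidate = "\n".join(self.entries + [entry])
        if len(candidate) > self.cap:
            return False
        self.entries.append(entry)
        return True


memory = HotMemory(MEMORY_CAP)
user = HotMemory(USER_CAP)
```

Rejecting the write, rather than silently truncating, is one way to keep growth of the always-injected tier an explicit decision. Together the two caps allow 3,575 characters, which the post puts at about 1,300 tokens.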
Storage is Easy, Retrieval Policy is Hard.
Once I had looked inside four of these systems, one observation would not go away. The storage side is actually not the hard part. Storing facts is easy. You can do it with a JSON blob, a Markdown file, a SQLite table, or a vector index. The interesting part is the one that decides whether memory helps or hurts the output: the retrieval policy, the heuristic that decides which remembered thing gets pulled into which future prompt.
Walk through the four systems through that lens and the picture gets clearer:
- ChatGPT’s retrieval policy is “always inject.” Every stored fact rides along on every prompt. Cheap, fast, and the reason old context keeps shaping new answers whether it is relevant or not.
- Claude’s retrieval policy is “model decides.” The model has to recognize when past conversations matter and call a tool. Cleaner prompts when it works, but dependent on the model getting the “do I need to search?” call right.
- OpenClaw’s retrieval policy is “agent issues semantic + keyword search.” Better than always-inject, but the more notes you accumulate, the harder the search has to work, and the more likely you are to pull stale or redundant material.
- Hermes’s retrieval policy is tiered and explicit. A tiny hot set for durable facts, a separate cold store for episodic history, a separate skills index for procedural knowledge, and clear rules about what belongs where. (“Save user preferences, environment facts, recurring corrections, stable conventions. Do not save task progress, session outcomes, temporary TODO state.”)
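The quoted save rule can be written down as an explicit allowlist. The sketch below is one hypothetical encoding: the category names come from the quote, while the enum, the function names, and the assumption that candidates arrive already labeled are mine.

```python
from enum import Enum


class Kind(Enum):
    USER_PREFERENCE = "user preference"
    ENVIRONMENT_FACT = "environment fact"
    RECURRING_CORRECTION = "recurring correction"
    STABLE_CONVENTION = "stable convention"
    TASK_PROGRESS = "task progress"
    SESSION_OUTCOME = "session outcome"
    TEMPORARY_TODO = "temporary TODO state"


# Kinds the quoted rule says to save; anything else is not persisted.
SAVE = frozenset({
    Kind.USER_PREFERENCE,
    Kind.ENVIRONMENT_FACT,
    Kind.RECURRING_CORRECTION,
    Kind.STABLE_CONVENTION,
})


def should_save(kind: Kind) -> bool:
    return kind in SAVE


def filter_candidates(candidates: list[tuple[Kind, str]]) -> list[str]:
    # Keep only the text of candidates whose kind is on the allowlist.
    return [text for kind, text in candidates if should_save(kind)]
```

The hard part in practice is the labeling step this sketch assumes away. Making the allowlist explicit at least turns "what belongs where" into something reviewable.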
Every failure I saw, and most of the failures the rest of this post describes, comes from a weak retrieval policy and not a weak storage layer. ChatGPT’s failure is architectural: the policy is “always inject,” which is why stale context keeps bleeding through. A better storage scheme would not fix that. Only a better retrieval policy would.
This matters because most teams shipping memory are spending their complexity budget on the wrong half. They compare vector stores, they design embedding pipelines, they debate chunk sizes. The retrieval policy gets one sentence in the design doc: “we’ll retrieve the top-k relevant items and inject them.” That one sentence is where the product quality lives. Most teams are flying blind on it.
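To make the storage/policy split concrete, here is a small sketch in which the store is a plain list and the retrieval policy is a swappable function. The names are hypothetical, and the keyword-overlap gate is a deliberately crude stand-in for a real relevance check, not how any of the four systems work.

```python
from typing import Callable

# A policy maps (query, stored facts) to the facts that enter the prompt.
Policy = Callable[[str, list[str]], list[str]]


def always_inject(query: str, store: list[str]) -> list[str]:
    # Every stored fact rides along, relevant or not.
    return list(store)


def keyword_gated(query: str, store: list[str]) -> list[str]:
    # Only facts sharing at least one word with the query are recalled.
    words = set(query.lower().split())
    return [fact for fact in store if words & set(fact.lower().split())]


def build_prompt(query: str, store: list[str], policy: Policy) -> str:
    recalled = policy(query, store)
    if not recalled:
        return f"User: {query}"
    memory_block = "\n".join(f"- {fact}" for fact in recalled)
    return f"Known about the user:\n{memory_block}\n\nUser: {query}"
```

With the same store, swapping the policy changes what the model sees: a debugging question carries every personal fact under always_inject and none under keyword_gated.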
The Best Case for Memory
I want to take the opposite view seriously before arguing against it.
The strongest case for default memory is friction reduction. Not having to re-enter preferences every session is genuinely nice, especially for casual users. “Remember I’m vegetarian” should not need to be said twice.
The next strongest case is continuity for inherently longitudinal products. Meeting tools like Granola, personal knowledge products like Reflect and Mem.ai, relationship products like Replika, therapy companions. For these, memory is not a feature, it is the product.
The third is the retention wedge. Mike Taylor’s piece notes that the “it knows me so well” feeling is exactly what locks ChatGPT users in. That is real. Users do not switch to Gemini or Claude partly because they do not want to rebuild the profile. Memory makes your product stickier whether or not it makes the outputs better.
The fourth is low-stakes drift. For casual tasks like recipe ideas, travel suggestions, and chit-chat, being slightly wrong because of stale memory does not really hurt the user.
Each of these has a counter. Friction reduction does not require implicit memory; a settings panel does it without the tax. The longitudinal case is exactly the one this post concedes: for those products, ship memory. The retention wedge is a business case, not a quality case; you are trading output quality for stickiness, which is a legitimate choice but should be made consciously. And low-stakes drift assumes your product only serves casual tasks, which is rarely true: the same ChatGPT user doing recipe lookups is also doing code reviews and performance reviews and therapy-adjacent venting, and stale memory does not know which of those it is in.
The strongest version of the pro-memory argument is real. It is just much narrower than the scope most products ship memory at.
Where Memory Goes Wrong in Practice
Once you ship memory with a weak retrieval policy, it fails in predictable, documented ways.
Output quality degrades. Mike Taylor’s Why I Turned Off ChatGPT’s Memory is the best user-side documentation of this. He put a Kanye quote about “dopeness” into his custom instructions, and ChatGPT started claiming it had built a collapsible website section “as dope as possible”, applying the same quote to interior decor, marketing plans, and Python debugging. When he turned memory back on to write the piece, a request for barbecue rib advice came back as “Hoboken Dinner Upgrade Ideas” because the assistant knew he had just moved. OP-Bench shows the pattern at benchmark scale: memory-augmented agents retrieve user details even when unnecessary, then over-attend to them until the details overshadow the actual query.
Debugging gets dramatically harder. I have spent enough time inside agent codebases to know the signature of a memory bug: the live trace is clean. Prompt, retrieval, tool logs are all fine. The weird behavior still happens. The reason is architectural.
REQUEST PIPELINE (what your logs see)
─────────────────────────────────────────────────
user msg → system prompt → retrieval → tool calls → response
▲
│ logs end here
MEMORY PIPELINE (what your logs do NOT see)
─────────────────────────────────────────────────
session end → summarizer → memory store
│
└──→ next session's system prompt
The thing shaping tomorrow's answer lives in a
pipeline you are not logging.
The Unit 42 writeup on Amazon Bedrock Agents describes exactly this shape: memory is produced by a separate session summarization process that runs at session end and merges into the next session’s system prompt. Every memory system I reverse-engineered has some version of this split. With memory, you are not debugging a request, you are debugging a relationship, and the relationship is logged somewhere you are not looking.
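One way to close that logging gap is to record provenance on every memory write, so a strange answer can be traced back to the session and component that produced the remembered text. This is a sketch under my own assumptions: the class, field, and method names are hypothetical and do not correspond to Bedrock Agents or any of the four systems.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MemoryWrite:
    session_id: str    # session whose content produced this memory
    writer: str        # component that wrote it, e.g. "summarizer"
    content: str
    written_at: float  # Unix timestamp


class AuditedMemoryStore:
    def __init__(self) -> None:
        self._writes: list[MemoryWrite] = []

    def write(self, session_id: str, writer: str, content: str,
              now: Optional[float] = None) -> MemoryWrite:
        stamp = time.time() if now is None else now
        record = MemoryWrite(session_id, writer, content, stamp)
        self._writes.append(record)
        return record

    def snapshot_for_prompt(self) -> str:
        # What the next session's system prompt would receive.
        return "\n".join(w.content for w in self._writes)

    def provenance(self, needle: str) -> list[MemoryWrite]:
        # Which writes introduced text containing `needle`?
        return [w for w in self._writes if needle in w.content]

    def dump_log(self) -> str:
        # One JSON object per line, ready to ship to the usual log sink.
        return "\n".join(json.dumps(asdict(w)) for w in self._writes)
```

Given a suspicious phrase in a response, provenance("phrase") points at the session and writer that introduced it, which the request-side trace alone cannot do.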
Context rot. The context window is not free intelligence, it is scarce working memory. Chroma’s Context Rot research evaluates 18 leading models on deliberately controlled tasks, holding task difficulty constant and varying only input length. Performance degrades with input length across every model they test, and distractors hurt more as context grows. A Databricks study shows accuracy dropping well before the window is full, sometimes as early as 32k tokens. A Microsoft/Salesforce paper shows that splitting a prompt into a multi-turn conversation instead of one shot drops performance by 39% on average. Most memory systems are context inflation mechanisms in disguise.
Privacy gets weird fast. I built BYOM because, once you look at a memory system clearly, it stops being a convenience feature and becomes a persistent user-profiling system. The CIMemories paper calls out the specific failure mode: memory-augmented LLMs often pick the right domain to talk about but cannot tell which details inside that domain are relevant. Right domain (life logistics), wrong granularity (a therapy schedule bleeding into a work email draft). Personalization and contextual integrity are not the same thing.
A brand new attack surface. The Unit 42 proof-of-concept against Bedrock Agents shows what “poisoned memory” actually looks like. An attacker hides a prompt injection in a webpage. The victim asks their travel agent to read the URL. Nothing goes wrong in the live session: the payload is crafted to target the session summarization prompt, not the orchestration prompt. When the session closes, the summarizer writes the attacker’s instructions into memory as a normal-looking topic. Days later, the user returns, books a trip, and the agent exfiltrates the booking to an attacker-controlled domain by calling its own scrape_url tool. Prompt injection against a stateless chat is transient. Prompt injection into memory is persistent. The MINJA paper shows the attacker does not even always need access to the memory store; user-style interaction alone can land the payload.
Personality drift. PersistBench reports median failure rates of 53% on cross-domain leakage and 97% on sycophancy samples in long-term-memory systems. The sycophancy number is the scary one. If what the assistant remembers about you nudges it toward agreement and accommodation instead of honest judgment, you do not have a memory problem, you have a judgment problem. And it will still feel personal to the user.
The Hermes Pattern: Memory as a Tool, Not Ambient Context
If the previous section is the symptom list, Hermes is the design that treats the disease.
Three principles hold it together. First, it separates hot memory from cold recall. A tiny always-injected block for durable facts, a searchable cold store for episodic history, a skills index for procedural memory. Nothing in the “always-injected” tier is allowed to grow unbounded, because prompt memory is a cache-sensitive working set, not a diary.
Second, it treats prompt stability as a first-class constraint. Memory is frozen into a snapshot at session start and not mutated mid-session. Writes go to disk immediately, but the prompt stays stable until a natural rebuild point (new session, post-compression). Most agent systems ship memory without thinking about caching. Hermes does.
Third, it acknowledges that memory is plural. Facts, episodes, skills, and deeper user modeling are distinct retrieval problems with distinct policies. One store does not solve them all.
This is what memory done right looks like in practice. Not a bigger vector DB. Not smarter auto-promotion. Fewer things in the system prompt, more things in tools, explicit rules about what belongs where. Most AI products that ship memory could cut their memory surface by 80% and end up with better outputs.
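The second principle, freezing memory into a snapshot at session start while writing through to disk, can be sketched in a few lines. This illustrates the idea only: the Session class, its methods, and the single-file layout are assumptions, not Hermes's implementation.

```python
from pathlib import Path


class Session:
    """Reads memory once at start; later writes do not alter this session's prompt."""

    def __init__(self, memory_path: Path) -> None:
        self._path = memory_path
        # Frozen snapshot: taken once, never refreshed during the session.
        self._snapshot = (
            memory_path.read_text(encoding="utf-8") if memory_path.exists() else ""
        )

    def system_prompt(self) -> str:
        # Byte-identical on every turn, which keeps a prompt cache prefix valid.
        return "You are a helpful assistant.\n\n" + self._snapshot

    def remember(self, line: str) -> None:
        # Write-through to disk immediately; visible only to the next session.
        with self._path.open("a", encoding="utf-8") as f:
            f.write(line + "\n")
```

A fact saved mid-session lands on disk right away but shows up in the prompt only when the next Session is constructed, which is the "natural rebuild point" the post describes.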
Before You Ship Memory, Answer These
The synthesis above turns into a checklist. Before shipping memory in your product, answer these honestly:
- Is your product inherently longitudinal? Do users get less value from session one than from session ten? If no, you do not need memory.
- Can you draw a clear line between your storage system and your retrieval policy? If no, you are about to ship the ChatGPT failure mode and call it personalization.
- Will the stored state be visible to users and directly editable by them? If no, you are building implicit profiling, not memory.
- Can you scope recall to a specific task, project, or explicit tool invocation? If no, ambient memory will bleed across contexts.
- Is your team willing to own the privacy, security, and debugging tax for the next three years? If no, you are not shipping memory, you are shipping a liability.
If the honest answer to most of these is no, do not ship memory. Ship visible settings, scoped project state, and explicit task briefs instead. Cursor’s .cursorrules and AGENTS.md, Claude Projects, Zed’s .rules, ChatGPT Custom Instructions, Linear task context: all of these work because they are legible, editable, and scoped. None of them need a memory layer to do their job.
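As a rough illustration of "legible, editable, and scoped" state, the sketch below keeps user-written rules per project and injects only the rules of the project currently in use. All names are hypothetical; it is not modeled on how Cursor, Zed, or Claude Projects store their rules.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectState:
    name: str
    rules: list[str] = field(default_factory=list)  # user-visible, user-editable


class Workspace:
    def __init__(self) -> None:
        self._projects: dict[str, ProjectState] = {}

    def project(self, name: str) -> ProjectState:
        # Create the project on first use; state never crosses project names.
        return self._projects.setdefault(name, ProjectState(name))

    def context_for(self, name: str) -> str:
        # Only the named project's rules reach the prompt.
        state = self._projects.get(name)
        return "\n".join(state.rules) if state else ""
```

Because rules live under an explicit project key and nothing is inferred, a preference written for one project cannot leak into another, and the user can read or delete every line that affects the output.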
Conclusion
Memory sounds like intelligence because humans associate memory with understanding. Product memory is not human memory. It is stored context with retrieval rules, summarization errors, privacy trade-offs, security exposure, and a constant tendency to turn old signals into future bias. That does not make it useless. It makes it expensive.
If your AI product still struggles with basic workflow design, explicit settings, clean state management, and reliable task execution, adding memory will not make it smarter. It will make it harder to understand when it fails, harder to debug when it drifts, and harder to trust when it confidently carries the wrong things forward.
The truth is that most AI products do not need better memory. They need better product design.
References
- Why I Turned Off ChatGPT’s Memory - Mike Taylor, Every
- Context Rot: How Increasing Input Tokens Impacts LLM Performance - Chroma
- Long Context RAG Performance of LLMs - Databricks
- When AI Remembers Too Much: Persistent Behaviors in Agents’ Memory - Unit 42
- CIMemories: A Compositional Benchmark for Contextual Integrity of Persistent Memory in LLMs
- PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?
- OP-Bench: Benchmarking Over-Personalization in Memory-Augmented Conversational Agents
- MINJA: Memory Injection Attacks on LLM Agents via Query-Only Interaction