The short version: MemPalace is the lightweight pick at 170 tokens startup and 96.6% LongMemEval retrieval. Hindsight wins at scale with 64.1% on BEAM at 10M tokens. Stash offers the deepest consolidation pipeline (8 stages) for structured knowledge extraction. agentmemory ships the most tools (53) for coding-agent workflows. None of them replace a context layer for cross-system institutional knowledge. Your choice depends on whether you need session memory (any of these) or organizational context (a different category entirely).
You install your third memory MCP server this month. MemPalace promises 170 tokens at startup. Hindsight claims scale to ten million tokens. agentmemory ships 53 tools. Every GitHub README reads like the last one you'll ever need. But none of them tell you which to actually use, or what they can't do at all. The memory MCP server space now spans 21 frameworks and 20 vector stores, according to Mem0's State of AI Agent Memory report (2026). This comparison tests four open-source options on benchmarks, token cost, and architecture, then names what all four leave on the table.
What Does a Memory MCP Server Actually Do?
A memory MCP server gives an AI agent persistent recall across sessions. Without one, every conversation starts from zero. The Mem0 research team published the first broad comparison of ten memory approaches at ECAI 2025, finding that token-efficient retrieval is the critical differentiator for production use (Mem0, 2025).
The concept is straightforward. Your agent finishes a session, the memory server extracts key facts, and it stores them in a vector database or structured file. Next session, the server injects relevant memories into the context window so the agent picks up where it left off. Processing a 10M-token context at 2026 prices costs roughly $5 per inference call (Mem0, 2026). Memory servers exist to avoid paying that bill by retrieving only what matters.
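The store-then-inject loop described above can be sketched in a few lines. This is a toy illustration, not any of the four servers' actual code: the `MemoryStore` class and its method names are hypothetical, and word overlap stands in for the embedding similarity a real server would use.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy stand-in for a memory server's storage and retrieval."""

    facts: list[str] = field(default_factory=list)

    def remember(self, fact: str) -> None:
        # End of session: persist a fact extracted from the conversation.
        self.facts.append(fact)

    def recall(self, query: str, k: int = 2) -> list[str]:
        # Next session: rank stored facts by the number of words shared
        # with the query (assumption: a real server ranks by embeddings).
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(fact.lower().split())), fact)
            for fact in self.facts
        ]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        # Drop facts with no overlap so irrelevant memories cost no tokens.
        return [fact for score, fact in ranked[:k] if score > 0]


store = MemoryStore()
store.remember("auth module was refactored to use OAuth")
store.remember("user prefers pytest over unittest")
store.remember("deploy target is a staging cluster")

# Only the matching memory is injected into the next session's context.
relevant = store.recall("which auth module approach did we pick", k=1)
prompt_prefix = "Relevant memories:\n" + "\n".join(f"- {m}" for m in relevant)
```

The point of the sketch is the shape of the trade: three facts are stored, but only one line reaches the context window.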
Worth saying upfront: these servers remember what happened in your conversations. They have no idea what your team discussed in Slack, decided in Jira, or documented in Confluence. That's a different problem, and we'll get to it.
Pricing: What Does Each Memory MCP Server Cost?
Every open-source option in this comparison is free to self-host, which makes the real cost operational, not licensing. Mem0's managed cloud starts at $0.01 per memory operation, with 1,000 free operations monthly (Mem0, 2026).
| Tool | Starting Price | Free Tier | Contract Minimum |
| --- | --- | --- | --- |
| MemPalace | Free (open-source) | Yes, fully local, no API costs | None |
| Hindsight | Free (open-source) | Yes, self-hosted | None |
| Stash | Free (open-source) | Yes, self-hosted (Postgres + Docker) | None |
| agentmemory | Free (open-source) | Yes, self-hosted | None |
| Mem0 (managed cloud) | $0.01/memory operation | 1,000 free ops/month | None |
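For the one paid row, the arithmetic is easy to sketch. The function below assumes the 1,000 free operations are deducted before billing starts and that pricing is flat per operation; both are assumptions for illustration, not confirmed billing rules, and `monthly_cost_usd` is a hypothetical name.

```python
FREE_OPS_PER_MONTH = 1_000  # free tier quoted in the table
PRICE_PER_OP_CENTS = 1      # $0.01 per memory operation, kept in cents


def monthly_cost_usd(ops: int) -> float:
    """Estimate a monthly managed-cloud bill for a given operation count."""
    # Assumption: only operations beyond the free tier are billed.
    billable = max(0, ops - FREE_OPS_PER_MONTH)
    return billable * PRICE_PER_OP_CENTS / 100


light_usage = monthly_cost_usd(500)     # stays inside the free tier
heavy_usage = monthly_cost_usd(25_000)  # 24,000 billable operations
```

Under those assumptions, an agent making 25,000 memory operations a month would land at $240, which is the figure to weigh against the operational cost of self-hosting.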
The hidden cost isn't the server. It's the tokens each server consumes from your context window on every session. MemPalace loads in roughly 170 tokens. agentmemory's 53-tool surface area can consume significantly more. If you're running multiple MCP servers alongside memory, that overhead compounds fast against the context tax you're already paying.
How Do These Servers Compare on Benchmarks?
MemPalace scores 96.6% on LongMemEval retrieval using its ChromaDB baseline (MemPalace docs, 2026). Hindsight scores 91.4% on LongMemEval and 64.1% on BEAM at 10M tokens (Vectorize, 2026). Those numbers need caveats before they mean anything useful, though.
Here's why direct comparison is tricky. MemPalace's 96.6% is recall_any@5, a retrieval-only metric. It measures whether the right memory chunk surfaces in the top five results, not whether the agent produces a correct end-to-end answer. LongMemEval-S and LoCoMo are different benchmarks entirely, so scores across them aren't apples-to-apples. Mem0 reports 93.4% on LongMemEval and 91.6 on LoCoMo (Mem0, 2026), but those numbers sit on different scales.
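To make the retrieval-only point concrete, here is a minimal sketch of a recall_any@k style metric as described above: a query counts as a hit if any relevant memory appears in the top k results. The function names and sample IDs are hypothetical, and this is not taken from MemPalace's evaluation code.

```python
def recall_any_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Return 1.0 if any relevant id appears in the top k results, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def mean_recall_any_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average the per-query hit/miss score over all queries."""
    scores = [recall_any_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores)


runs = [
    (["m7", "m2", "m9", "m4", "m1"], {"m4"}),        # relevant item at rank 4: hit
    (["m3", "m8", "m5", "m6", "m0", "m2"], {"m2"}),  # relevant item at rank 6: miss
    (["m1", "m2", "m3", "m4", "m5"], {"m1", "m9"}),  # one of two relevant items: still a hit
]
score = mean_recall_any_at_k(runs)  # two hits out of three queries
```

Note what the metric never checks: whether the agent then answered correctly. The third query scores a full hit even though one of its two relevant memories was never retrieved.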
Of the four servers compared here, only Hindsight has published results on BEAM, the benchmark that evaluates memory at 10 million tokens, roughly a year of daily agent conversations. Its 64.1% score there beats the next-best system on that benchmark by a 58% margin (Vectorize, 2026). If you're evaluating for long-running production workloads, BEAM is the only benchmark cited here that tests at that scale.
Why Does MemPalace Load in Just 170 Tokens?
MemPalace loads its L0 and L1 layers in approximately 170 tokens at startup, the lowest startup cost of the four servers compared here (MemPalace docs, 2026). For teams already fighting context-tax bloat, that number matters.
What makes it light
The entire project ships as 21 Python files with two runtime dependencies: ChromaDB and PyYAML. No Docker. No Postgres. No cloud API. It runs fully offline, which means zero ongoing cost and zero data leaving your machine. The architecture uses a tiered memory model: L0 for hot context, L1 for warm references, deeper layers for long-term storage. Only the first two tiers load at startup.
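A tiered model like this can be sketched as a filter over memories tagged with a tier number. The `Memory` class, `STARTUP_TIERS`, and the word-count token proxy below are all hypothetical illustrations of the idea, not MemPalace's actual data structures or tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    tier: int  # 0 = hot context, 1 = warm references, 2+ = long-term storage
    text: str


STARTUP_TIERS = {0, 1}  # only these tiers are loaded eagerly


def startup_context(memories: list[Memory]) -> list[str]:
    """Return the texts injected at startup, hottest tier first."""
    eager = [m for m in memories if m.tier in STARTUP_TIERS]
    return [m.text for m in sorted(eager, key=lambda m: m.tier)]


def rough_token_count(texts: list[str]) -> int:
    # Crude proxy: whitespace-separated words. Real tokenizers count differently.
    return sum(len(text.split()) for text in texts)


memories = [
    Memory(2, "full transcript of the March design review"),
    Memory(0, "current task: migrate auth to OAuth"),
    Memory(1, "project uses Postgres 16"),
]
loaded = startup_context(memories)
startup_cost = rough_token_count(loaded)
```

The long-term transcript stays out of the context window until a query actually asks for it, which is how startup cost stays flat while stored history grows.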
Where it falls short
The 96.6% LongMemEval figure is retrieval-only (recall_any@5), not end-to-end QA accuracy. That's an important distinction. Retrieving the right chunk doesn't guarantee the agent produces the right answer. The project is also young, with a small community. And there are no BEAM results, so its behavior at scale beyond a few thousand memories is unproven.
Best for: developers who need lightweight session memory without token overhead.
Can Hindsight Handle 10 Million Tokens?
Hindsight is the only memory system in this comparison with results on BEAM, a benchmark evaluating memory at 10 million tokens (Vectorize, 2026). Its 64.1% score beats the next-best system on that benchmark by a 58% margin. If your agents accumulate months of conversation history, this is the server built for that problem.
Why scale matters
Memory servers generally work fine at a few hundred memories. What happens at 50,000? Or 200,000? BEAM simulates roughly a year of daily agent conversations, compressed into a single evaluation. Hindsight's architecture is designed to handle that volume without retrieval quality collapsing. It also scores 91.4% on LongMemEval, so it doesn't sacrifice short-session recall to win at scale.
The trade-offs
Hindsight requires more infrastructure than MemPalace. Setup is heavier, dependencies are broader, and the startup token cost is higher. For a weekend side project, that's overkill. For a team running coding agents across months of accumulated context, the infrastructure cost pays for itself in retrieval quality at scale.
Best for: teams running agents across months of accumulated context.
Stash: The Structured Knowledge Extractor
Stash processes memories through an 8-stage consolidation pipeline: episodic extraction, relationship mapping, causal linking, contradiction detection, confidence decay, goal inference, failure detection, and hypothesis scanning (Stash GitHub, 2026). Where other servers store memories, Stash refines them.
The pipeline difference
Other memory servers treat storage as append-only. New memory goes in, old memory gets retrieved by vector similarity. Stash takes a different approach. It actively processes stored memories to detect contradictions, decay confidence on stale information, and infer relationships between facts. The result is structured knowledge, not a pile of embeddings.
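Two of those ideas, confidence decay and contradiction detection, are simple enough to sketch. The version below assumes exponential decay with a 30-day half-life and treats two facts as contradictory when they share a subject and attribute but differ in value. Both choices, and every name in the code, are illustrative assumptions rather than Stash's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    confidence: float  # confidence when the fact was recorded, 0.0 to 1.0
    age_days: float


HALF_LIFE_DAYS = 30.0  # hypothetical: confidence halves every 30 days


def decayed_confidence(fact: Fact) -> float:
    """Apply exponential decay so stale facts carry less weight."""
    return fact.confidence * 0.5 ** (fact.age_days / HALF_LIFE_DAYS)


def find_contradictions(facts: list[Fact]) -> list[tuple[Fact, Fact]]:
    """Pair up facts that disagree about the same subject and attribute."""
    seen: dict[tuple[str, str], list[Fact]] = {}
    conflicts: list[tuple[Fact, Fact]] = []
    for fact in facts:
        key = (fact.subject, fact.attribute)
        for earlier in seen.get(key, []):
            if earlier.value != fact.value:
                conflicts.append((earlier, fact))
        seen.setdefault(key, []).append(fact)
    return conflicts


facts = [
    Fact("api", "auth", "SAML", confidence=0.9, age_days=60),
    Fact("api", "auth", "OAuth", confidence=0.8, age_days=0),
    Fact("db", "engine", "Postgres", confidence=1.0, age_days=30),
]
stale = decayed_confidence(facts[0])  # two half-lives: 0.9 * 0.25
conflicts = find_contradictions(facts)
```

In this sketch the 60-day-old SAML fact both loses most of its confidence and gets flagged against the newer OAuth fact, which is the kind of signal an append-only store never produces.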
Storage runs on Postgres with pgvector, and the consolidation pipeline executes asynchronously via Docker. One command gets you running: docker compose up. That's heavier than MemPalace's two-dependency, no-Docker setup, but it's still self-hosted and free.
When it's too much
If you just need "remember what I said last session," Stash's 8-stage pipeline is overbuilt. The Postgres requirement also means you're running infrastructure that needs monitoring. For structured knowledge extraction across long-running workflows, though, no other server in this category offers that depth of post-processing.
Best for: teams that need structured knowledge extraction, not just raw recall.
agentmemory: The Coding-Agent Toolkit
agentmemory ships 53 tools, 6 resources, 3 prompts, and 4 skills, making it the broadest MCP memory toolkit in this comparison and one purpose-built for coding agents (agentmemory GitHub, 2026). Where other servers offer a handful of memory operations, agentmemory gives you a full Swiss Army knife.
The breadth advantage
The tool surface covers code-specific memory operations that the other servers don't attempt. You get semantic search across memories, code-snippet recall, project-context persistence, and workflow-specific memory types. For a developer who wants fine-grained control over what the agent remembers and how it retrieves, this is the most capable option.
The cost of 53 tools
Here's the catch. Every tool definition costs tokens. A 53-tool MCP server adds significant context-window overhead, and it feeds the tool-overload problem in which agent accuracy degrades once the total tool count passes roughly 50. If you're already running GitHub MCP, Slack MCP, and a few others, adding 53 more tool definitions can push your context tax past the point where your agent starts making mistakes.
Does that mean you shouldn't use it? Not necessarily. But it means you need to be deliberate about which tools you enable and which you leave inactive.
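One way to be deliberate is to filter tool definitions through an allowlist before they reach the agent and compare the estimated token cost. The sketch below uses made-up tool names and a rough four-characters-per-token heuristic. It does not reflect agentmemory's real tool list or any configuration option it provides.

```python
import json


def estimate_tokens(tool: dict) -> int:
    # Rough heuristic (assumption): about 4 characters per token
    # for the JSON-serialized tool definition.
    return len(json.dumps(tool)) // 4


def select_tools(tools: list[dict], allowlist: set[str]) -> list[dict]:
    """Keep only the tool definitions whose names are explicitly enabled."""
    return [tool for tool in tools if tool["name"] in allowlist]


# Hypothetical 53-tool surface with placeholder names and descriptions.
all_tools = [
    {"name": f"memory_tool_{i}", "description": "placeholder description " * 8}
    for i in range(53)
]
enabled = select_tools(all_tools, {"memory_tool_0", "memory_tool_1", "memory_tool_2"})

full_cost = sum(estimate_tokens(tool) for tool in all_tools)
trimmed_cost = sum(estimate_tokens(tool) for tool in enabled)
```

Whether your MCP client supports per-tool enablement varies, so treat this as a way to reason about the budget rather than a drop-in setting.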
Best for: coding-agent power users who want maximum memory control.
What Don't These Servers Solve?
Every server in this comparison solves the same problem: giving your agent recall across sessions. None of them solve the harder problem, which is giving your agent access to what your team already knows.
Session memory and institutional memory are fundamentally different problems. Session memory is personal. It remembers that you refactored the auth module last Tuesday. Institutional memory knows why the team chose OAuth over SAML, that the decision was discussed across three Slack threads, documented in a Notion RFC, and linked to a Jira epic that shipped six months ago.
Gustavo Alvarez, Software Engineer at Sixfold, put it this way:
"My biggest use of Unblocked MCP has been AI governance - searching across Slack, fourteen Notion docs, S3, trying to understand where data lives and where the gaps are. There is no other way to humanly accomplish this task. It's an absolute godsend for getting context out of sources that don't talk to each other."
That's the gap. You can pair MemPalace or Hindsight with a context layer that surfaces what you didn't know to look for. They're complementary, not competing. A memory server remembers your sessions. A context engine connects code, discussions, tickets, and docs across your entire organization.
How Should You Choose a Memory MCP Server?
The right choice depends on your workload, not on benchmark headlines. A solo developer on a weekend project has different needs than a team running agents across six months of accumulated context. Here's a decision framework.
Match the server to the problem
If your constraint is context-window budget, start with MemPalace. Its 170-token startup means you can add persistent memory without meaningfully increasing your token cost. If your constraint is long-running scale, Hindsight's BEAM-tested architecture is the only option with proven results at 10M tokens. If you need structured knowledge output, Stash's 8-stage pipeline is unique. If you need maximum tool coverage for coding workflows, agentmemory has the broadest surface.
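The same framework reads naturally as a lookup table. The enum and function names below are hypothetical. The mapping simply restates the paragraph above.

```python
from enum import Enum


class Constraint(Enum):
    CONTEXT_BUDGET = "context-window budget"
    LONG_RUNNING_SCALE = "long-running scale"
    STRUCTURED_KNOWLEDGE = "structured knowledge output"
    TOOL_COVERAGE = "tool coverage for coding workflows"


# One recommendation per primary constraint, as stated in the text.
RECOMMENDATION = {
    Constraint.CONTEXT_BUDGET: "MemPalace",
    Constraint.LONG_RUNNING_SCALE: "Hindsight",
    Constraint.STRUCTURED_KNOWLEDGE: "Stash",
    Constraint.TOOL_COVERAGE: "agentmemory",
}


def recommend(constraint: Constraint) -> str:
    """Map a team's primary constraint to the server this article suggests."""
    return RECOMMENDATION[constraint]


choice = recommend(Constraint.LONG_RUNNING_SCALE)
```

Forcing yourself to name a single primary constraint is the useful part. If two constraints feel equally pressing, that is a sign to prototype with both servers before committing.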
What about using more than one?
You can stack memory servers, but every additional server adds to your context tax. The practical answer is: pick one server for session recall and pair it with whatever organizational context source your team actually needs.
Frequently Asked Questions
Can I use a memory MCP server alongside Unblocked?
Yes. They solve different problems. A memory server handles session recall, remembering what happened in your agent conversations. Unblocked handles organizational context, surfacing knowledge from Slack, Jira, Notion, Confluence, and other tools your team already uses. Running both gives your agent personal memory and institutional memory.
Which memory server has the lowest token overhead?
MemPalace, at approximately 170 tokens at startup (MemPalace docs, 2026). But startup cost is only one part of the picture. Per-query token overhead, storage format, and retrieval granularity all affect total cost. A server that loads cheaply but retrieves poorly will cost more in wasted turns.
Are these benchmarks directly comparable?
Partially. LongMemEval-S and LoCoMo are different benchmarks testing different capabilities, so scores across them aren't apples-to-apples. MemPalace's 96.6% is recall_any@5, a retrieval-only metric. Mem0's 93.4% LongMemEval comes from their own evaluation framework. Among the servers compared here, only Hindsight has BEAM results at the 10M-token scale, making it the only one tested at production-length memory volumes.
Do I need a memory server if my agent has a large context window?
Yes. Context windows are temporary. Close the session and everything disappears. Persistent memory survives across sessions, which is the entire point. And stuffing everything into the context window gets expensive fast: at 10M tokens, full-context processing costs roughly $5 per inference call (Mem0, 2026).
Pick Your Memory Architecture
Four servers, each built around a different architecture tradeoff.
Lightweight session memory with minimal token cost: MemPalace. Long-running scale across months of agent history: Hindsight. Structured knowledge extraction with contradiction detection and confidence decay: Stash. Maximum tool coverage for coding-agent workflows: agentmemory.
Session memory only covers half the problem, though. Your agent can remember every conversation you've had and still miss the Slack thread where your team decided to deprecate the API you're building on. That's institutional memory, the context that lives across systems, teams, and months of organizational decisions. Surfacing it is the job of a context engine that connects code, discussions, tickets, and docs.
Pick the memory server that fits your workload. Then ask whether your agent also needs to know what the rest of your team already knows.



