
Does AI Memory Actually Work for Coding Agents?

A Controlled Benchmark of Persistent Memory in Production Codebase Tasks

Markus Sandelin — Independent Research, February 2026

Three Key Findings

1. Memory doesn't improve code quality. The quality ceiling is a property of the model (84-96% across all conditions).

2. Memory reduces exploration overhead. 28-40% fewer turns and 22-32% lower cost on complex tasks.

3. There's a complexity threshold. On trivial codebases, memory is pure overhead.

What Everyone Else Claims

Industry claims vs. what they actually measure.

| System | Claimed Saving | Actually Measures | Benchmark |
|---|---|---|---|
| Mem0 | 90% tokens | Memory compression | LOCOMO |
| A-Mem | 85-93% tokens | Per-operation cost | Dialogue QA |
| MemMachine | 80% tokens | Recall accuracy | LOCOMO |
| Zep | 94.8% accuracy | Retrieval precision | Custom |
| Letta | N/A (honest) | Agent capability | Terminal-Bench |
| Stompy | 15-28% | Task efficiency | Coding tasks |

Methodology

  • System: MCP-based, PostgreSQL, VoyageAI embeddings
  • Codebase: 4,895 lines Python/FastAPI, 158 source files
  • Three conditions: stompy (MCP recall), file (static CONTEXT.md), nomemory (cold start)
  • Three tasks of increasing complexity
  • Scoring: 25-point rubric (5 criteria × 5 points)
  • All runs: Claude Opus 4.6, identical codebase snapshot
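
The 25-point rubric can be sketched as a simple scoring helper. The criterion names below are illustrative placeholders, not the study's actual rubric; only the shape (5 criteria, 0-5 points each, summed to 25) comes from the methodology above.

```python
# Hypothetical rubric helper: 5 criteria × 5 points each = 25-point ceiling.
# Criterion names are assumptions for illustration only.
CRITERIA = ["correctness", "completeness", "code_quality",
            "convention_adherence", "test_coverage"]

def score_run(points: dict) -> int:
    """Sum per-criterion scores into the 25-point total."""
    assert set(points) == set(CRITERIA), "score every criterion exactly once"
    assert all(0 <= p <= 5 for p in points.values()), "each criterion is 0-5"
    return sum(points.values())
```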

Results

Table 4 — Per-Task Results

| Task | Stompy (quality, cost, turns) | File | NoMemory |
|---|---|---|---|
| Task 1 (Moderate) | 23/25, $1.90, 35 | 23/25, $1.80, 29 | 24/25, $1.17, 19 |
| Task 2 (High) | 21/25, $1.33, 31 | 22/25, $3.22, 51 | 21/25, $3.51, 58 |
| Task 3 (Very High) | 23/25, $3.52, 47 | 23/25, $3.16, 44 | 22/25, $3.18, 54 |

Table 5 — Aggregate

| Condition | Quality | Cost | Turns | Cost/Point |
|---|---|---|---|---|
| Stompy | 67/75 | $6.75 | 113 | $0.101/pt |
| File | 68/75 | $8.18 | 124 | $0.120/pt |
| NoMemory | 67/75 | $7.86 | 131 | $0.117/pt |
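
The cost-per-point column follows directly from the cost and quality columns; a quick sketch reproducing it from the aggregate numbers above:

```python
# Reproduce Table 5's cost-per-point column from the raw aggregates.
aggregates = {
    "stompy":   {"quality": 67, "cost": 6.75, "turns": 113},
    "file":     {"quality": 68, "cost": 8.18, "turns": 124},
    "nomemory": {"quality": 67, "cost": 7.86, "turns": 131},
}

for condition, agg in aggregates.items():
    cost_per_point = agg["cost"] / agg["quality"]
    print(f"{condition}: ${cost_per_point:.3f}/pt")
# stompy comes out cheapest per quality point despite nomemory's lower raw cost.
```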

Table 6 — Complexity Gradient

| Task | Winner | Turn Savings | Cost Savings |
|---|---|---|---|
| Task 1 (Moderate) | nomemory | 35% fewer turns | 20% lower cost |
| Task 2 (High) | stompy | 28% fewer turns | 22% lower cost |
| Task 3 (Very High) | stompy | 18% fewer turns | 22% lower cost, 32% less wall-clock time |

When Memory Hurts

Phase 1: Toy Codebase Results

nomemory won: 70.3% quality vs 59.5% for stompy.

We're showing this because cherry-picking is dishonest.

There is a complexity threshold below which memory is pure overhead. On a trivial 800-line Express/TypeScript codebase, the model can hold the entire context in its window. Memory retrieval adds latency and noise without reducing exploration — because there is nothing to explore. The breakeven point appears to be around 2,000-3,000 lines of meaningful code with non-obvious architecture.
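
The threshold above suggests a rough gating heuristic: count meaningful source lines and only enable memory above the breakeven point. This is a sketch under assumed conventions (the cutoff is the midpoint of the observed 2,000-3,000 line range, and the line-count proxy ignores architectural non-obviousness entirely).

```python
from pathlib import Path

# Midpoint of the observed 2,000-3,000 line breakeven range (an assumption,
# not a measured constant).
MEMORY_BREAKEVEN_LOC = 2500

def meaningful_loc(root: str, suffixes=(".py", ".ts", ".js")) -> int:
    """Count non-blank, non-comment lines across source files under root."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            for line in path.read_text(errors="ignore").splitlines():
                stripped = line.strip()
                if stripped and not stripped.startswith(("#", "//")):
                    total += 1
    return total

def should_enable_memory(root: str) -> bool:
    """Below the breakeven point, memory retrieval is pure overhead."""
    return meaningful_loc(root) >= MEMORY_BREAKEVEN_LOC
```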

Use Cases

Multi-Agent Swarms

Problem: 6 agents independently explore codebase. 6x redundant exploration.

How Stompy helps: The lead agent locks the architecture; workers recall it. Phase 3: $2.34 vs $3.30 (29% savings), 40/40 quality.

```python
# Lead agent stores architecture
lock_context("service_layer: PostgreSQLAdapter pattern, execute_query for reads, execute_update for writes...")
# Worker agents recall before coding
recall_context("service_layer")
```
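
The lock/recall semantics can be sketched with an in-memory stand-in. This is a hypothetical illustration of the contract only; the real Stompy tools persist to PostgreSQL with VoyageAI embeddings, as described in the methodology.

```python
# Hypothetical in-memory stand-in for the lock/recall contract.
_context_store = {}

def lock_context(entry: str) -> None:
    """Store a 'topic: content' entry so later agents/sessions can recall it."""
    topic, _, content = entry.partition(":")
    _context_store[topic.strip()] = content.strip()

def recall_context(topic: str):
    """Return the locked content for a topic, or None if nothing was locked."""
    return _context_store.get(topic)
```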

Ticketing & Project Management

Problem: Re-explain ticket schema every session.

How Stompy helps: Lock conventions once. All ticket tools benefit.

```python
# Lock ticket conventions once; every ticket tool benefits
lock_context("ticketing_workflow: states=[open,in_progress,review,done], priorities=[critical,high,medium,low]...")
ticket(action="create", title="Fix auth bug", type="bug")
ticket_board(status="in_progress")
```

Admin & Operations

Problem: Infra decisions live in Slack threads and engineers' heads, unavailable during a 3am incident.

How Stompy helps: Lock operational knowledge.

```python
# Lock operational knowledge once, recall it during incidents
lock_context("deployment: DO App Platform, NYC region, auto-deploy on main...")
recall_context("deployment")
db_query("SELECT * FROM mcp_global.mcp_sessions WHERE status = 'active'")
```

Cross-Session Development

Problem: Monday morning, you explain the project for the 14th time.

How Stompy helps: Stompy remembers across sessions.

```python
# Lock conventions once; recall or search them in any later session
lock_context("api_conventions: REST endpoints at /api/v1/, Pydantic models...")
recall_context("api_conventions")
context_search("how do we handle auth")
```
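
A minimal sketch of what context_search does, using keyword overlap as a hypothetical stand-in for the embedding-based ranking the real system uses:

```python
# Hypothetical keyword-overlap sketch of context_search; the real tool
# ranks with VoyageAI embeddings rather than shared tokens.
def context_search(query: str, store: dict) -> list:
    """Return stored topics ranked by word overlap with the query."""
    words = set(query.lower().split())
    scored = [
        (len(words & set((topic + " " + content).lower().split())), topic)
        for topic, content in store.items()
    ]
    return [topic for score, topic in sorted(scored, reverse=True) if score > 0]
```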

Codebase Onboarding

Problem: New dev asks same questions previous dev already answered.

How Stompy helps: Previous dev's knowledge persists.

```python
# New dev pulls the project brief and the previous dev's locked knowledge
project_brief()
recall_batch(topics=["service_layer", "database_conventions", "test_patterns"])
```

Limitations

  • Single model (Opus 4.6), single codebase, N=1 per cell
  • Our own system on our own codebase
  • Pilot study, not statistical proof

What's Next

  • Multi-agent swarm benchmark (Phase 3 — early results: 29% savings)
  • Multi-model validation (Sonnet, GPT-5-Codex, Gemini 2.5 Pro)
  • TOON serialization format efficiency
  • Longitudinal study: 27 sessions on sustained development

We built a memory system. We tested it honestly. The results were modest. We think modesty, grounded in controlled measurement, is what this field needs.