
Does AI Memory Actually Work for Coding Agents?

A Controlled Benchmark of Persistent Memory in Production Codebase Tasks

Markus Sandelin — Independent Research, February 2026

Three Key Findings

1. Memory doesn't improve code quality. The quality ceiling is a property of the model (84-96% across all conditions).

2. Memory reduces exploration overhead. 28-40% fewer turns and 22-32% lower cost on complex tasks.

3. There's a complexity threshold. On trivial codebases, memory is pure overhead.

What Everyone Else Claims

Industry claims vs. what they actually measure.

| System | Claimed Saving | Actually Measures | Benchmark |
|---|---|---|---|
| Mem0 | 90% tokens | Memory compression | LOCOMO |
| A-Mem | 85-93% tokens | Per-operation cost | Dialogue QA |
| MemMachine | 80% tokens | Recall accuracy | LOCOMO |
| Zep | 94.8% accuracy | Retrieval precision | Custom |
| Letta | N/A (honest) | Agent capability | Terminal-Bench |
| Stompy | 15-28% | Task efficiency | Coding tasks |

Methodology

  • System: MCP-based, PostgreSQL, VoyageAI embeddings
  • Codebase: 4,895 lines Python/FastAPI, 158 source files
  • Three conditions: stompy (MCP recall), file (static CONTEXT.md), nomemory (cold start)
  • Three tasks of increasing complexity
  • Scoring: 25-point rubric (5 criteria × 5 points)
  • All runs: Claude Opus 4.6, identical codebase snapshot
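
The 25-point rubric can be sketched as a simple scoring helper. The criterion names below are illustrative placeholders, not the study's actual rubric; only the shape (5 criteria, 0-5 points each, summed to 25) comes from the methodology above.

```python
# Hypothetical rubric helper: 5 criteria × 5 points each = 25-point ceiling.
# Criterion names are assumptions for illustration only.
CRITERIA = ["correctness", "completeness", "code_quality",
            "convention_adherence", "test_coverage"]

def score_run(points: dict) -> int:
    """Sum per-criterion scores into the 25-point total."""
    assert set(points) == set(CRITERIA), "score every criterion exactly once"
    assert all(0 <= p <= 5 for p in points.values()), "each criterion is 0-5"
    return sum(points.values())
```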

Results

Table 4 — Per-Task Results

| Task | Stompy (quality, cost, turns) | File | NoMemory |
|---|---|---|---|
| Task 1 (Moderate) | 23/25, $1.90, 35 | 23/25, $1.80, 29 | 24/25, $1.17, 19 |
| Task 2 (High) | 21/25, $1.33, 31 | 22/25, $3.22, 51 | 21/25, $3.51, 58 |
| Task 3 (Very High) | 23/25, $3.52, 47 | 23/25, $3.16, 44 | 22/25, $3.18, 54 |

Table 5 — Aggregate

| Condition | Quality | Cost | Turns | Cost/Point |
|---|---|---|---|---|
| Stompy | 67/75 | $6.75 | 113 | $0.101/pt |
| File | 68/75 | $8.18 | 124 | $0.120/pt |
| NoMemory | 67/75 | $7.86 | 131 | $0.117/pt |
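
The cost-per-point column follows directly from the cost and quality columns; a quick sketch reproducing it from the aggregate numbers above:

```python
# Reproduce Table 5's cost-per-point column from the raw aggregates.
aggregates = {
    "stompy":   {"quality": 67, "cost": 6.75, "turns": 113},
    "file":     {"quality": 68, "cost": 8.18, "turns": 124},
    "nomemory": {"quality": 67, "cost": 7.86, "turns": 131},
}

for condition, agg in aggregates.items():
    cost_per_point = agg["cost"] / agg["quality"]
    print(f"{condition}: ${cost_per_point:.3f}/pt")
# stompy comes out cheapest per quality point despite nomemory's lower raw cost.
```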

Table 6 — Complexity Gradient

| Task | Winner | Turn Savings | Cost Savings |
|---|---|---|---|
| Task 1 (Moderate) | nomemory | 35% fewer turns | 20% lower cost |
| Task 2 (High) | stompy | 28% fewer turns | 22% lower cost |
| Task 3 (Very High) | stompy | 18% fewer turns | 22% lower cost, 32% less wall-clock time |

When Memory Hurts

Phase 1: Toy Codebase Results

nomemory won: 70.3% quality vs 59.5% for stompy.

We're showing this because cherry-picking is dishonest.

There is a complexity threshold below which memory is pure overhead. On a trivial 800-line Express/TypeScript codebase, the model can hold the entire context in its window. Memory retrieval adds latency and noise without reducing exploration — because there is nothing to explore. The breakeven point appears to be around 2,000-3,000 lines of meaningful code with non-obvious architecture.
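
The threshold above suggests a rough gating heuristic: count meaningful source lines and only enable memory above the breakeven point. This is a sketch under assumed conventions (the cutoff is the midpoint of the observed 2,000-3,000 line range, and the line-count proxy ignores architectural non-obviousness entirely).

```python
from pathlib import Path

# Midpoint of the observed 2,000-3,000 line breakeven range (an assumption,
# not a measured constant).
MEMORY_BREAKEVEN_LOC = 2500

def meaningful_loc(root: str, suffixes=(".py", ".ts", ".js")) -> int:
    """Count non-blank, non-comment lines across source files under root."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            for line in path.read_text(errors="ignore").splitlines():
                stripped = line.strip()
                if stripped and not stripped.startswith(("#", "//")):
                    total += 1
    return total

def should_enable_memory(root: str) -> bool:
    """Below the breakeven point, memory retrieval is pure overhead."""
    return meaningful_loc(root) >= MEMORY_BREAKEVEN_LOC
```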

Use Cases

Multi-Agent Swarms

Problem: 6 agents independently explore codebase. 6x redundant exploration.

How Stompy helps: The lead agent locks the architecture; workers recall it. Phase 3: $2.34 vs $3.30 (29% savings), 40/40 quality.

```python
# Lead agent stores architecture
lock_context("service_layer: PostgreSQLAdapter pattern, execute_query for reads, execute_update for writes...")
# Worker agents recall before coding
recall_context("service_layer")
```
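
The lock/recall semantics can be sketched with an in-memory stand-in. This is a hypothetical illustration of the contract only; the real Stompy tools persist to PostgreSQL with VoyageAI embeddings, as described in the methodology.

```python
# Hypothetical in-memory stand-in for the lock/recall contract.
_context_store = {}

def lock_context(entry: str) -> None:
    """Store a 'topic: content' entry so later agents/sessions can recall it."""
    topic, _, content = entry.partition(":")
    _context_store[topic.strip()] = content.strip()

def recall_context(topic: str):
    """Return the locked content for a topic, or None if nothing was locked."""
    return _context_store.get(topic)
```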

Ticketing & Project Management

Problem: Re-explain ticket schema every session.

How Stompy helps: Lock conventions once. All ticket tools benefit.

```python
# Lock ticket conventions once; every ticket tool benefits
lock_context("ticketing_workflow: states=[open,in_progress,review,done], priorities=[critical,high,medium,low]...")
ticket(action="create", title="Fix auth bug", type="bug")
ticket_board(status="in_progress")
```

Admin & Operations

Problem: Infra decisions live in Slack threads and engineers' heads, unavailable during a 3am incident.

How Stompy helps: Lock operational knowledge.

```python
# Lock operational knowledge once, recall it during incidents
lock_context("deployment: DO App Platform, NYC region, auto-deploy on main...")
recall_context("deployment")
db_query("SELECT * FROM mcp_global.mcp_sessions WHERE status = 'active'")
```

Cross-Session Development

Problem: Monday morning, you explain the project for the 14th time.

How Stompy helps: Stompy remembers across sessions.

```python
# Lock conventions once; recall or search them in any later session
lock_context("api_conventions: REST endpoints at /api/v1/, Pydantic models...")
recall_context("api_conventions")
context_search("how do we handle auth")
```
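
A minimal sketch of what context_search does, using keyword overlap as a hypothetical stand-in for the embedding-based ranking the real system uses:

```python
# Hypothetical keyword-overlap sketch of context_search; the real tool
# ranks with VoyageAI embeddings rather than shared tokens.
def context_search(query: str, store: dict) -> list:
    """Return stored topics ranked by word overlap with the query."""
    words = set(query.lower().split())
    scored = [
        (len(words & set((topic + " " + content).lower().split())), topic)
        for topic, content in store.items()
    ]
    return [topic for score, topic in sorted(scored, reverse=True) if score > 0]
```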

Codebase Onboarding

Problem: New dev asks same questions previous dev already answered.

How Stompy helps: Previous dev's knowledge persists.

```python
# New dev pulls the project brief and the previous dev's locked knowledge
project_brief()
recall_batch(topics=["service_layer", "database_conventions", "test_patterns"])
```

Limitations

  • Single model (Opus 4.6), single codebase, N=1 per cell
  • Our own system on our own codebase
  • Pilot study, not statistical proof

What's Next

  • Multi-agent swarm benchmark (Phase 3 — early results: 29% savings)
  • Multi-model validation (Sonnet, GPT-5-Codex, Gemini 2.5 Pro)
  • TOON serialization format efficiency
  • Longitudinal study: 27 sessions on sustained development

We built a memory system. We tested it honestly. The results were modest. We think modesty, grounded in controlled measurement, is what this field needs.