Does AI Memory Actually Work for Coding Agents?
A Controlled Benchmark of Persistent Memory in Production Codebase Tasks
Markus Sandelin — Independent Research, February 2026
Four Key Findings
Memory doesn't improve code quality
Quality ceiling is a property of the model (84-96% across all conditions).
Memory reduces exploration overhead
28-40% fewer turns, 22-32% lower cost on complex tasks.
There's a complexity threshold
On trivial codebases, memory is pure overhead.
Memory enables smaller models
Haiku 4.5 scored 0/40 without memory, 39/40 with it. Memory isn't just efficiency — it's capability for smaller models.
What Everyone Else Claims
Industry claims vs. what they actually measure.
| System | Claimed Saving | Actually Measures | Benchmark |
|---|---|---|---|
| Mem0 | 90% tokens | Memory compression | LOCOMO |
| A-Mem | 85-93% tokens | Per-operation cost | Dialogue QA |
| MemMachine | 80% tokens | Recall accuracy | LOCOMO |
| Zep | 94.8% accuracy | Retrieval precision | Custom |
| Letta | N/A (honest) | Agent capability | Terminal-Bench |
| Stompy | 15-28% | Task efficiency | Coding tasks |
Methodology
- System: MCP-based, PostgreSQL, VoyageAI embeddings
- Codebase: 4,895 lines of Python/FastAPI across 158 source files
- Three conditions: stompy (MCP recall), file (static CONTEXT.md), nomemory (cold start)
- Three tasks of increasing complexity
- Scoring: 25-point rubric (5 criteria × 5 points)
- All runs: Claude Opus 4.6, identical codebase snapshot
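The rubric aggregation is straightforward; a minimal sketch, assuming five equally weighted criteria (the criterion names here are illustrative, since the report does not enumerate them):

```python
# Hypothetical sketch of the 25-point rubric: 5 criteria, 0-5 points each.
# Criterion names are illustrative -- the report does not list them.
RUBRIC_CRITERIA = ["correctness", "completeness", "code_quality",
                   "convention_adherence", "test_coverage"]

def score_task(points_per_criterion: dict) -> int:
    """Sum per-criterion points (each capped at 5) into a 0-25 task score."""
    assert set(points_per_criterion) == set(RUBRIC_CRITERIA)
    return sum(min(p, 5) for p in points_per_criterion.values())

# Example: a run that loses one point on completeness and one on tests.
example = {"correctness": 5, "completeness": 4, "code_quality": 5,
           "convention_adherence": 5, "test_coverage": 4}
print(score_task(example))  # 23
```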
Results
Table 4 — Per-Task Results
| Task | Stompy (score, cost, turns) | File (score, cost, turns) | NoMemory (score, cost, turns) |
|---|---|---|---|
| Task 1 (Moderate) | 23/25 $1.90 35t | 23/25 $1.80 29t | 24/25 $1.17 19t |
| Task 2 (High) | 21/25 $1.33 31t | 22/25 $3.22 51t | 21/25 $3.51 58t |
| Task 3 (Very High) | 23/25 $3.52 47t | 23/25 $3.16 44t | 22/25 $3.18 54t |
Table 5 — Aggregate
| Condition | Quality | Cost | Turns | Cost/Point |
|---|---|---|---|---|
| Stompy | 67/75 | $6.75 | 113 | $0.101/pt |
| File | 68/75 | $8.18 | 124 | $0.120/pt |
| NoMemory | 67/75 | $7.86 | 131 | $0.117/pt |
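The Cost/Point column follows directly from the Quality and Cost columns; a quick arithmetic check (minimal sketch, numbers copied from Table 5):

```python
# Recompute Table 5's cost-per-point column from its Quality and Cost columns.
aggregate = {
    "Stompy":   {"quality": 67, "cost": 6.75},
    "File":     {"quality": 68, "cost": 8.18},
    "NoMemory": {"quality": 67, "cost": 7.86},
}

for condition, row in aggregate.items():
    cost_per_point = row["cost"] / row["quality"]
    print(f"{condition}: ${cost_per_point:.3f}/pt")
# Matches the table: 0.101, 0.120, 0.117
```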
Table 6 — Complexity Gradient
| Task | Winner | Turn Savings | Cost Savings |
|---|---|---|---|
| Task 1 (Moderate) | nomemory | 35% fewer turns | 20% lower cost |
| Task 2 (High) | stompy | 28% fewer turns | 22% lower cost |
| Task 3 (Very High) | stompy | 18% fewer turns | 22% lower cost, 32% less time |
Table 7 — Phase 3: Multi-Agent Swarm Results (6 agents, full-stack booking feature)
| Model | Condition | Score | Cost | Turns | Time |
|---|---|---|---|---|---|
| Sonnet 4.6 | stompy | 40/40 | $3.98 | 2 | 6.5m |
| Sonnet 4.6 | nomemory | 40/40 | $7.04 | 4 | 9.6m |
| Opus 4.6 | stompy | 40/40 | $4.34 | 29 | 9.6m |
| Opus 4.6 | nomemory | 40/40 | $7.65 | 70 | 10.0m |
| Haiku 4.5 | stompy | 39/40 | $4.95 | 2 | 7.5m |
| Haiku 4.5 | nomemory | 0/40 | $3.97 | 3 | 5.8m |
Table 8 — Phase 3: Cost Savings Summary
| Model | With Memory | Without | Savings |
|---|---|---|---|
| Sonnet 4.6 | $3.98 | $7.04 | 43% |
| Opus 4.6 | $4.34 | $7.65 | 43% |
| Haiku 4.5 | $4.95 (39/40) | $3.97 (0/40) | Memory enables capability |
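The savings percentages above are one minus the ratio of with-memory to without-memory cost; a minimal sketch verifying the Table 8 figures (Haiku is excluded because its no-memory run produced no usable result to compare against):

```python
# Percent cost savings in Table 8: 1 - (with_memory / without_memory).
runs = {
    "Sonnet 4.6": (3.98, 7.04),
    "Opus 4.6":   (4.34, 7.65),
}

for model, (with_mem, without_mem) in runs.items():
    savings = 1 - with_mem / without_mem
    print(f"{model}: {savings:.0%}")
# Both models round to 43%, as reported.
```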
When Memory Hurts
Phase 1: Toy Codebase Results
nomemory won: 70.3% quality vs 59.5% for stompy.
We're showing this because cherry-picking is dishonest.
There is a complexity threshold below which memory is pure overhead. On a trivial 800-line Express/TypeScript codebase, the model can hold the entire context in its window. Memory retrieval adds latency and noise without reducing exploration — because there is nothing to explore. The breakeven point appears to be around 2,000-3,000 lines of meaningful code with non-obvious architecture.
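As a decision heuristic only, the breakeven described above could be encoded like this (the function and exact thresholds are illustrative, not part of Stompy's API):

```python
# Hypothetical heuristic encoding the breakeven observed in the report:
# below ~2,000 lines memory is pure overhead; above ~3,000 it tends to pay off.
LOWER, UPPER = 2_000, 3_000

def memory_recommendation(loc: int) -> str:
    """Rough guidance from lines of meaningful code. Illustrative only."""
    if loc < LOWER:
        return "skip memory (codebase fits in the context window)"
    if loc <= UPPER:
        return "borderline (measure both conditions)"
    return "enable memory (exploration savings likely)"

print(memory_recommendation(800))    # the toy Express codebase from Phase 1
print(memory_recommendation(4_895))  # the Python/FastAPI codebase from Phase 2
```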
Use Cases
Multi-Agent Swarms
Problem: 6 agents independently explore codebase. 6x redundant exploration.
How Stompy helps: Lead locks architecture, workers recall. 43% cost savings (Sonnet & Opus). Haiku: 0/40 → 39/40 with memory.
```python
# Lead agent stores architecture
lock_context("service_layer: PostgreSQLAdapter pattern, execute_query for reads, execute_update for writes...")

# Worker agents recall before coding
recall_context("my_app/service_layer")  # deeplink syntax
```

Ticketing & Project Management
Problem: Re-explain ticket schema every session.
How Stompy helps: Lock conventions once. All ticket tools benefit.
```python
lock_context("ticketing_workflow: states=[open,in_progress,review,done], priorities=[critical,high,medium,low]...")
ticket(action="create", title="Fix auth bug", type="bug")
ticket_board(status="in_progress")
```

Admin & Operations
Problem: Infra decisions live in Slack and in people's heads, which is no help during a 3am incident.
How Stompy helps: Lock operational knowledge.
```python
lock_context("deployment: DO App Platform, NYC region, auto-deploy on main...")
recall_context("my_app/deployment")
db_query("SELECT * FROM mcp_global.mcp_sessions WHERE status = 'active'")
```

Cross-Session Development
Problem: Monday morning, explain project for 14th time.
How Stompy helps: Stompy remembers across sessions.
```python
lock_context("api_conventions: REST endpoints at /api/v1/, Pydantic models...")
recall_context("my_app/api_conventions")
context_search("how do we handle auth")
```

Codebase Onboarding
Problem: New dev asks same questions previous dev already answered.
How Stompy helps: Previous dev's knowledge persists.
```python
project_brief()
recall_batch(topics=["my_app/service_layer", "my_app/database_conventions", "my_app/test_patterns"])
```

Limitations
- Three models (Opus 4.6, Sonnet 4.6, Haiku 4.5), single codebase, N=1 per cell
- Our own system on our own codebase
- Pilot study, not statistical proof
What's Next
Completed
- Phase 3: Multi-agent swarm — 43% cost savings across Opus & Sonnet, memory enables Haiku (0/40 → 39/40)
Upcoming
- Multi-model validation (GPT-5-Codex, Gemini 2.5 Pro)
- TOON serialization format efficiency
- Longitudinal study: 27 sessions on sustained development
We built a memory system. We tested it honestly. The results were modest. We think modesty, grounded in controlled measurement, is what this field needs.