Does AI Memory Actually Work for Coding Agents?
A Controlled Benchmark of Persistent Memory in Production Codebase Tasks
Markus Sandelin — Independent Research, February 2026
Three Key Findings
Memory doesn't improve code quality
Quality ceiling is a property of the model (84-96% across all conditions).
Memory reduces exploration overhead
28-40% fewer turns, 22-32% lower cost on complex tasks.
There's a complexity threshold
On trivial codebases, memory is pure overhead.
What Everyone Else Claims
Industry claims vs. what they actually measure.
| System | Claimed Saving | Actually Measures | Benchmark |
|---|---|---|---|
| Mem0 | 90% tokens | Memory compression | LOCOMO |
| A-Mem | 85-93% tokens | Per-operation cost | Dialogue QA |
| MemMachine | 80% tokens | Recall accuracy | LOCOMO |
| Zep | 94.8% accuracy | Retrieval precision | Custom |
| Letta | N/A (honest) | Agent capability | Terminal-Bench |
| Stompy | 15-28% | Task efficiency | Coding tasks |
Methodology
- System: MCP-based, PostgreSQL, VoyageAI embeddings
- Codebase: 4,895 lines Python/FastAPI, 158 source files
- Three conditions: stompy (MCP recall), file (static CONTEXT.md), nomemory (cold start)
- Three tasks of increasing complexity
- Scoring: 25-point rubric (5 criteria × 5 points)
- All runs: Claude Opus 4.6, identical codebase snapshot
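The design is a full 3 × 3 matrix: three conditions crossed with three tasks, nine runs total, each scored against the same 25-point rubric. A minimal sketch of that layout (the criterion names are illustrative placeholders, not the actual rubric items):

```python
from itertools import product

conditions = ["stompy", "file", "nomemory"]
tasks = ["task1_moderate", "task2_high", "task3_very_high"]

# Hypothetical criterion names for illustration; each is scored 0-5.
criteria = ["correctness", "completeness", "code_quality", "tests", "conventions"]

runs = list(product(conditions, tasks))
print(len(runs))          # 9 runs (3 conditions x 3 tasks)
print(len(criteria) * 5)  # 25-point ceiling per task
```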
Results
Table 4 — Per-Task Results (score / cost / turns)
| Task | Stompy | File | NoMemory |
|---|---|---|---|
| Task 1 (Moderate) | 23/25, $1.90, 35 turns | 23/25, $1.80, 29 turns | 24/25, $1.17, 19 turns |
| Task 2 (High) | 21/25, $1.33, 31 turns | 22/25, $3.22, 51 turns | 21/25, $3.51, 58 turns |
| Task 3 (Very High) | 23/25, $3.52, 47 turns | 23/25, $3.16, 44 turns | 22/25, $3.18, 54 turns |
Table 5 — Aggregate
| Condition | Quality | Cost | Turns | Cost/Point |
|---|---|---|---|---|
| Stompy | 67/75 | $6.75 | 113 | $0.101/pt |
| File | 68/75 | $8.18 | 124 | $0.120/pt |
| NoMemory | 67/75 | $7.86 | 131 | $0.117/pt |
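The aggregate row values, including cost per point, can be reproduced directly from Table 4. A quick check in Python, with the per-task tuples transcribed from the table above:

```python
# (quality, cost_usd, turns) per task, transcribed from Table 4.
results = {
    "stompy":   [(23, 1.90, 35), (21, 1.33, 31), (23, 3.52, 47)],
    "file":     [(23, 1.80, 29), (22, 3.22, 51), (23, 3.16, 44)],
    "nomemory": [(24, 1.17, 19), (21, 3.51, 58), (22, 3.18, 54)],
}

for condition, runs in results.items():
    quality = sum(q for q, _, _ in runs)   # out of 75
    cost = sum(c for _, c, _ in runs)      # USD
    turns = sum(t for _, _, t in runs)
    print(f"{condition}: {quality}/75, ${cost:.2f}, {turns} turns, "
          f"${cost / quality:.3f}/pt")
# First line -> stompy: 67/75, $6.75, 113 turns, $0.101/pt (matches Table 5)
```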
Table 6 — Complexity Gradient
| Task | Winner | Turn Savings | Cost Savings |
|---|---|---|---|
| Task 1 (Moderate) | nomemory | 35% fewer turns | 20% lower cost |
| Task 2 (High) | stompy | 28% fewer turns | 22% lower cost |
| Task 3 (Very High) | stompy | 18% fewer turns | 22% lower cost, 32% less time |
When Memory Hurts
Phase 1: Toy Codebase Results
nomemory won: 70.3% quality vs 59.5% for stompy.
We're showing this because cherry-picking is dishonest.
There is a complexity threshold below which memory is pure overhead. On a trivial 800-line Express/TypeScript codebase, the model can hold the entire context in its window. Memory retrieval adds latency and noise without reducing exploration — because there is nothing to explore. The breakeven point appears to be around 2,000-3,000 lines of meaningful code with non-obvious architecture.
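One way to operationalize this threshold is to gate memory retrieval on whether the codebase plausibly fits in the model's working context. A minimal sketch, where the tokens-per-line ratio and context budget are illustrative assumptions chosen to put the breakeven near the 2,000-3,000-line range observed above, not measured values:

```python
def memory_worth_enabling(loc: int, tokens_per_line: float = 12.0,
                          context_budget: int = 30_000) -> bool:
    """Heuristic gate: skip memory retrieval when the whole codebase
    fits in the working-context budget.

    Both parameters are assumptions for illustration; calibrate them
    per model and per codebase before relying on this.
    """
    estimated_tokens = loc * tokens_per_line
    return estimated_tokens > context_budget

# The 800-line Express app from Phase 1 falls below the threshold;
# the 4,895-line FastAPI codebase lands above it.
print(memory_worth_enabling(800))    # False -> memory is pure overhead
print(memory_worth_enabling(4_895))  # True  -> memory can pay off
```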
Use Cases
Multi-Agent Swarms
Problem: Six agents independently explore the codebase, producing 6× redundant exploration.
How Stompy helps: The lead agent locks the architecture; workers recall it. Phase 3 result: $2.34 vs $3.30 (29% cheaper), 40/40 quality.
```python
# Lead agent stores architecture
lock_context("service_layer: PostgreSQLAdapter pattern, execute_query for reads, execute_update for writes...")

# Worker agents recall before coding
recall_context("service_layer")
```
Ticketing & Project Management
Problem: Re-explain ticket schema every session.
How Stompy helps: Lock conventions once. All ticket tools benefit.
```python
lock_context("ticketing_workflow: states=[open,in_progress,review,done], priorities=[critical,high,medium,low]...")
ticket(action="create", title="Fix auth bug", type="bug")
ticket_board(status="in_progress")
```
Admin & Operations
Problem: Infra decisions live in Slack threads and in people's heads, which is exactly where you can't find them during a 3 a.m. incident.
How Stompy helps: Lock operational knowledge.
```python
lock_context("deployment: DO App Platform, NYC region, auto-deploy on main...")
recall_context("deployment")
db_query("SELECT * FROM mcp_global.mcp_sessions WHERE status = 'active'")
```
Cross-Session Development
Problem: It's Monday morning and you're explaining the project for the 14th time.
How Stompy helps: Stompy remembers across sessions.
```python
lock_context("api_conventions: REST endpoints at /api/v1/, Pydantic models...")
recall_context("api_conventions")
context_search("how do we handle auth")
```
Codebase Onboarding
Problem: New dev asks same questions previous dev already answered.
How Stompy helps: Previous dev's knowledge persists.
```python
project_brief()
recall_batch(topics=["service_layer", "database_conventions", "test_patterns"])
```
Limitations
- Single model (Opus 4.6), single codebase, N=1 per cell
- Our own system on our own codebase
- Pilot study, not statistical proof
What's Next
- Multi-agent swarm benchmark (Phase 3 — early results: 29% savings)
- Multi-model validation (Sonnet, GPT-5-Codex, Gemini 2.5 Pro)
- TOON serialization format efficiency
- Longitudinal study: 27 sessions on sustained development
We built a memory system. We tested it honestly. The results were modest. We think modesty, grounded in controlled measurement, is what this field needs.