Groq+Stompy

Lightning-fast inference with persistent context

Fastest inference, persistent memory

The Problem

Groq is speed incarnate. Its LPU (Language Processing Unit) inference makes GPUs look slow, streaming responses at 300+ tokens per second, so fast they feel instant. When latency matters, when you're building real-time applications, when users expect immediate responses, Groq delivers.

But speed without memory is just fast forgetting.

Here's what happens in practice: Your Groq-powered chatbot responds instantly. User asks a follow-up. You send the conversation history. User asks about something from yesterday. Gone. They mention a preference from last week. Gone. Your blazing-fast application can't remember anything beyond the current context window.

The problem isn't Groq—it's the stateless API paradigm. Every API call is independent. You can include conversation history in the messages array, but that's just the current session. Long-term project knowledge, user preferences, decisions made weeks ago—none of that persists.
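To make the limitation concrete, here is a minimal sketch of the stateless pattern using Groq's Python SDK. The conversation contents are invented; only what you pack into the messages array ever reaches the model.

from groq import Groq

groq = Groq()  # reads GROQ_API_KEY from the environment

# Every call stands alone: the model only sees what you put in `messages`.
history = [
    {"role": "user", "content": "We decided to migrate auth to OAuth last week."},
    {"role": "assistant", "content": "Got it: migrating auth to OAuth."},
]

response = groq.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=history + [{"role": "user", "content": "What did we decide about auth?"}],
)

# Drop `history` from the next call and that decision is gone for good.
print(response.choices[0].message.content)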

When you're building for production, you need more than speed. You need context-aware speed. You need responses that are both instant AND informed by everything your application has learned.

Ultra-fast inference deserves persistent context.

How Stompy Helps

Stompy gives Groq the memory layer that matches its speed.

Your ultra-fast inference gains true long-term memory:

- **Sub-second context retrieval**: Stompy's semantic search returns relevant context in milliseconds—fast enough to not bottleneck Groq's inference
- **Beyond conversation history**: Not just recent messages, but project context, user preferences, and domain knowledge that compounds over time
- **Persistent learning**: Every Groq conversation can save insights that inform future responses
- **Speed without sacrifice**: Add memory without adding latency—Groq stays fast, but becomes smart

Your real-time applications finally remember. Chatbots that recall user preferences from last month. Support agents that know the customer's full history. Code assistants that remember your entire project context.

Blazing fast and fully contextual—that's what AI should be.

Integration Walkthrough

1

Create a memory-enabled Groq client

Wrap Groq with Stompy context retrieval that keeps pace with LPU speed.

from groq import Groq
import httpx
import asyncio
import os


class StompyGroq:
    def __init__(self):
        self.groq = Groq()
        self.stompy_url = "https://mcp.stompy.ai/sse"
        self.headers = {"Authorization": f"Bearer {os.environ['STOMPY_TOKEN']}"}

    async def get_context(self, topics: list[str]) -> str:
        """Retrieve relevant context in parallel - fast enough for Groq."""
        async with httpx.AsyncClient() as client:
            tasks = [
                client.post(self.stompy_url, headers=self.headers,
                            json={"tool": "recall_context", "topic": t})
                for t in topics
            ]
            responses = await asyncio.gather(*tasks)
        context_parts = [r.json().get("content", "") for r in responses if r.is_success]
        return "\n---\n".join(context_parts)


client = StompyGroq()
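Before wiring retrieval into the request path, it can be worth a quick latency sanity check. This is just an illustrative script using the client defined above; the topic names are hypothetical and should match whatever you have actually saved in Stompy.

import time

async def check_retrieval_latency():
    # Hypothetical topics - substitute the ones your project actually uses.
    start = time.perf_counter()
    context = await client.get_context(["project_context", "coding_standards"])
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"Retrieved {len(context)} characters of context in {elapsed_ms:.0f} ms")

asyncio.run(check_retrieval_latency())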
2

Ultra-fast context-aware completions

Get instant Groq responses that are informed by persistent project knowledge.

async def fast_completion(user_message: str, user_id: str | None = None) -> str:
    """Lightning-fast response with full project context."""
    # Parallel context retrieval (doesn't bottleneck Groq)
    context_topics = ["project_context", "coding_standards"]
    if user_id:
        context_topics.append(f"user_{user_id}_preferences")
    project_context = await client.get_context(context_topics)

    # Groq inference with context
    response = client.groq.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": f"""You are a helpful assistant with access to project memory.
PROJECT CONTEXT:
{project_context}
Use this context to provide informed, contextual responses."""},
            {"role": "user", "content": user_message}
        ],
        temperature=0.7,
        max_tokens=2048
    )
    return response.choices[0].message.content


# Real-time, context-aware response (call from within an async context)
answer = await fast_completion("What's the status of the auth refactor?")
3

Save insights from fast conversations

Even at Groq speed, capture important decisions and learnings.

async def complete_and_learn(user_message: str, conversation_id: str) -> str:
    """Fast response + persistent learning."""
    response = await fast_completion(user_message)

    # Async save - doesn't block the response
    asyncio.create_task(save_if_significant(
        conversation_id=conversation_id,
        user_message=user_message,
        response=response
    ))
    return response


async def save_if_significant(conversation_id: str, user_message: str, response: str):
    """Background task to save important insights."""
    # Quick heuristic: save if it looks like a decision or preference
    decision_indicators = ["decided to", "let's go with", "prefer", "agreed"]
    if any(indicator in response.lower() for indicator in decision_indicators):
        async with httpx.AsyncClient() as http:
            await http.post(
                "https://mcp.stompy.ai/sse",
                headers={"Authorization": f"Bearer {os.environ['STOMPY_TOKEN']}"},
                json={
                    "tool": "lock_context",
                    "topic": f"conversation_{conversation_id}_decision",
                    "content": f"User: {user_message}\n\nDecision: {response}",
                    "priority": "important"
                }
            )
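To tie the pieces together, here is one way to drive complete_and_learn from a short-lived script; the conversation ID and message are placeholders. In a long-running service the event loop stays alive, so the fire-and-forget save needs no special handling.

async def main():
    # Hypothetical conversation ID - use whatever your app tracks per session.
    reply = await complete_and_learn(
        "Let's go with JWT refresh tokens for the mobile client.",
        conversation_id="mobile-auth-demo",
    )
    print(reply)

    # In a short-lived script, wait for the background save task before the
    # loop shuts down; otherwise the insight may never reach Stompy.
    pending = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    if pending:
        await asyncio.gather(*pending)

asyncio.run(main())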

What You Get

  • Sub-second context retrieval: Stompy's semantic search is fast enough that it won't bottleneck Groq inference
  • Persistent user preferences: Users get personalized responses without re-explaining preferences every session
  • Real-time + long-term: Combine instant inference with institutional memory for production applications
  • Works with all Groq models: Llama 3.3, Mixtral, Gemma—any model gains persistent memory
  • Async learning: Save insights in background without blocking lightning-fast responses

Ready to give Groq a memory?

Join the waitlist and be the first to know when Stompy is ready. Your Groq projects will never forget again.