Open models with persistent memory
Open models, persistent context
The Problem
Together AI has democratized access to the best open-source models. Llama 3.1 405B, Mixtral 8x22B, Qwen 72B, DeepSeek Coder—run state-of-the-art open models at scale without managing GPU clusters. One API for the entire open-source AI ecosystem.
But hosted models are still stateless.
Here's the limitation: You're using Llama 3.1 405B for code generation. It writes excellent code for your project. A user asks for a new feature; the model has no idea what it wrote yesterday. A different user asks about the codebase; the model doesn't know your architecture exists.
Together AI solves the infrastructure problem—you don't need to manage GPUs. But they can't solve the statefulness problem—that's inherent to how APIs work. Every request is independent. Context windows help for single conversations, but long-term project knowledge, team decisions, accumulated learnings? None of that persists.
Open-source models have caught up to proprietary ones in capability. They deserve the same memory infrastructure.
Open models need open memory.
How Stompy Helps
Stompy gives your Together AI models the persistent memory they deserve.
Your open-source models gain enterprise-grade memory:

- **Full model flexibility**: Same memory layer works with Llama, Mixtral, Qwen, DeepSeek—switch models, keep memory
- **Cross-model knowledge**: Insights from one model session are available to any model in your stack
- **Project-wide context**: Every Together AI call has access to your project's accumulated knowledge
- **Team collaboration**: Multiple team members using different models share the same persistent context
Run Llama for code generation, Mixtral for analysis, Qwen for multilingual—all with shared, persistent memory. Your open-source AI stack finally has the same memory capabilities as proprietary solutions.
Open-source power with persistent memory.
Integration Walkthrough
Create a memory-enabled Together AI client
Build a wrapper that gives any open-source model persistent project context.
```python
import asyncio
import os

import httpx
from together import Together


class StompyTogether:
    def __init__(self):
        self.together = Together()
        self.stompy_url = "https://mcp.stompy.ai/sse"
        self.headers = {"Authorization": f"Bearer {os.environ['STOMPY_TOKEN']}"}

    async def get_context(self, query: str = None, topics: list[str] = None) -> str:
        """Retrieve relevant context via semantic search or direct topic recall."""
        async with httpx.AsyncClient() as client:
            if query:
                # Semantic search for relevant context
                response = await client.post(
                    self.stompy_url,
                    headers=self.headers,
                    json={"tool": "context_search", "query": query, "limit": 5}
                )
                results = response.json().get("results", [])
                return "\n---\n".join([r["content"] for r in results])
            elif topics:
                # Direct topic recall
                responses = await asyncio.gather(*[
                    client.post(
                        self.stompy_url,
                        headers=self.headers,
                        json={"tool": "recall_context", "topic": t}
                    )
                    for t in topics
                ])
                return "\n---\n".join([r.json().get("content", "") for r in responses])
            return ""


client = StompyTogether()
```
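You can also call the wrapper directly before composing a prompt. A quick sketch, assuming you're in an async context (the query string and topic name below are placeholders):

```python
# Semantic search across stored project memory
auth_notes = await client.get_context(query="How do we handle token refresh?")

# Direct topic recall (topic name is an example)
standards = await client.get_context(topics=["coding_standards"])
```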
Context-aware completions with any open model
Use Llama, Mixtral, or any model with full project context.
```python
async def open_model_completion(
    user_message: str,
    model: str = "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"
) -> str:
    """Open-source model completion with persistent project memory."""
    # Semantic search for relevant context based on the question
    relevant_context = await client.get_context(query=user_message)

    # Also get explicit project context
    project_context = await client.get_context(topics=["project_overview", "coding_standards"])

    combined_context = f"""RELEVANT CONTEXT:
{relevant_context}

PROJECT CONTEXT:
{project_context}"""

    response = client.together.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": f"""You are a helpful assistant with access to project memory.

{combined_context}

Use this context to provide informed, accurate responses."""},
            {"role": "user", "content": user_message}
        ],
        max_tokens=2048
    )
    return response.choices[0].message.content


# Works with any model
answer = await open_model_completion("How does our auth system work?")
code_answer = await open_model_completion(
    "Write a function to validate user tokens",
    model="deepseek-ai/DeepSeek-Coder-33B-Instruct"
)
```
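Because the wrapper is model-agnostic, switching models is just a parameter change. A minimal sketch (the Qwen identifier below is illustrative; check Together AI's model catalog for the exact name):

```python
# Same persistent memory, different open model (model name is illustrative)
multilingual_answer = await open_model_completion(
    "Summarize our project overview in Japanese",
    model="Qwen/Qwen2-72B-Instruct"
)
```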
Save learnings across model sessions
Capture insights that benefit all future model interactions.
```python
async def save_model_insight(
    topic: str,
    content: str,
    model_used: str,
    priority: str = "reference"
):
    """Save insights from any model for cross-model knowledge sharing."""
    async with httpx.AsyncClient() as http:
        await http.post(
            "https://mcp.stompy.ai/sse",
            headers={"Authorization": f"Bearer {os.environ['STOMPY_TOKEN']}"},
            json={
                "tool": "lock_context",
                "topic": topic,
                "content": f"[Generated by {model_used}]\n\n{content}",
                "tags": f"model:{model_used},auto-generated",
                "priority": priority
            }
        )


# Example: Save code analysis from DeepSeek for future Llama sessions
await save_model_insight(
    topic="codebase_analysis_auth",
    content="""Authentication Flow Analysis:
1. JWT tokens generated in /api/auth/login
2. Middleware validates in _middleware.ts
3. Token refresh handled by /api/auth/refresh
4. User context stored in React context (AuthProvider)""",
    model_used="deepseek-ai/DeepSeek-Coder-33B-Instruct",
    priority="important"
)
```
What You Get
- Model-agnostic memory: Same Stompy context works with Llama, Mixtral, Qwen, DeepSeek—switch models freely
- Cross-model knowledge sharing: Insights from DeepSeek code analysis available to Llama conversations
- Open-source + enterprise features: Get persistent memory without vendor lock-in to proprietary APIs
- Cost optimization: Use cheaper models for simple queries while maintaining full context (see the routing sketch after this list)
- Future-proof: As new open models emerge, they inherit your existing project memory instantly
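To illustrate the cost-optimization point, you can route simple queries to a smaller model and reserve the large model for complex ones, while both paths share the same Stompy context. This is a minimal sketch; the routing heuristic and helper are assumptions, not part of the Stompy or Together AI APIs, and the model names are illustrative:

```python
# Hypothetical routing helper: cheap model for short prompts, large model otherwise.
# Both paths reuse open_model_completion, so they see the same persistent memory.
CHEAP_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"    # illustrative
LARGE_MODEL = "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo"  # illustrative

async def routed_completion(user_message: str) -> str:
    # Naive heuristic: treat short questions as "simple"; swap in your own logic
    model = CHEAP_MODEL if len(user_message) < 200 else LARGE_MODEL
    return await open_model_completion(user_message, model=model)
```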
Ready to give Together AI a memory?
Join the waitlist and be the first to know when Stompy is ready. Your Together AI projects will never forget again.