Open models with persistent memory
Open models, persistent context
The Problem
Together AI has democratized access to the best open-source models. Llama 3.1 405B, Mixtral 8x22B, Qwen 72B, DeepSeek Coder—run state-of-the-art open models at scale without managing GPU clusters. One API for the entire open-source AI ecosystem.
But hosted models are still stateless.
Here's the limitation: You're using Llama 3.1 405B for code generation. It writes excellent code for your project. A user asks for a new feature; the model has no idea what it wrote yesterday. A different user asks about the codebase; the model doesn't know your architecture exists.
Together AI solves the infrastructure problem—you don't need to manage GPUs. But they can't solve the statefulness problem—that's inherent to how APIs work. Every request is independent. Context windows help for single conversations, but long-term project knowledge, team decisions, accumulated learnings? None of that persists.
Open-source models have caught up to proprietary ones in capability. They deserve the same memory infrastructure.
Open models need open memory.
How Stompy Helps
Stompy gives your Together AI models the persistent memory they deserve.
Your open-source models gain enterprise-grade memory:

- **Full model flexibility**: Same memory layer works with Llama, Mixtral, Qwen, DeepSeek—switch models, keep memory
- **Cross-model knowledge**: Insights from one model session are available to any model in your stack
- **Project-wide context**: Every Together AI call has access to your project's accumulated knowledge
- **Team collaboration**: Multiple team members using different models share the same persistent context
Run Llama for code generation, Mixtral for analysis, Qwen for multilingual—all with shared, persistent memory. Your open-source AI stack finally has the same memory capabilities as proprietary solutions.
Open-source power with persistent memory.
Integration Walkthrough
Create a memory-enabled Together AI client
Build a wrapper that gives any open-source model persistent project context.
```python
import asyncio
import os

import httpx
from together import Together


class StompyTogether:
    def __init__(self):
        self.together = Together()
        self.stompy_url = "https://mcp.stompy.ai/sse"
        self.headers = {"Authorization": f"Bearer {os.environ['STOMPY_TOKEN']}"}

    async def get_context(self, query: str = None, topics: list[str] = None) -> str:
        """Retrieve relevant context via semantic search or direct topic recall."""
        async with httpx.AsyncClient() as client:
            if query:
                # Semantic search for relevant context
                response = await client.post(
                    self.stompy_url,
                    headers=self.headers,
                    json={"tool": "context_search", "query": query, "limit": 5}
                )
                results = response.json().get("results", [])
                return "\n---\n".join([r["content"] for r in results])
            elif topics:
                # Direct topic recall
                responses = await asyncio.gather(*[
                    client.post(
                        self.stompy_url,
                        headers=self.headers,
                        json={"tool": "recall_context", "topic": t}
                    )
                    for t in topics
                ])
                return "\n---\n".join([r.json().get("content", "") for r in responses])
            return ""


client = StompyTogether()
```
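You can also call the wrapper directly before composing a prompt. A quick sketch, assuming you're in an async context (the query string and topic name below are placeholders):

```python
# Semantic search across stored project memory
auth_notes = await client.get_context(query="How do we handle token refresh?")

# Direct topic recall (topic name is an example)
standards = await client.get_context(topics=["coding_standards"])
```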
Context-aware completions with any open model
Use Llama, Mixtral, or any model with full project context.
```python
async def open_model_completion(
    user_message: str,
    model: str = "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"
) -> str:
    """Open-source model completion with persistent project memory."""
    # Semantic search for relevant context based on the question
    relevant_context = await client.get_context(query=user_message)

    # Also get explicit project context
    project_context = await client.get_context(topics=["project_overview", "coding_standards"])

    combined_context = f"""RELEVANT CONTEXT:
{relevant_context}

PROJECT CONTEXT:
{project_context}"""

    response = client.together.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": f"""You are a helpful assistant with access to project memory.

{combined_context}

Use this context to provide informed, accurate responses."""},
            {"role": "user", "content": user_message}
        ],
        max_tokens=2048
    )
    return response.choices[0].message.content


# Works with any model
answer = await open_model_completion("How does our auth system work?")
code_answer = await open_model_completion(
    "Write a function to validate user tokens",
    model="deepseek-ai/DeepSeek-Coder-33B-Instruct"
)
```
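Because the wrapper is model-agnostic, switching models is just a parameter change. A minimal sketch (the Qwen identifier below is illustrative; check Together AI's model catalog for the exact name):

```python
# Same persistent memory, different open model (model name is illustrative)
multilingual_answer = await open_model_completion(
    "Summarize our project overview in Japanese",
    model="Qwen/Qwen2-72B-Instruct"
)
```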
Save learnings across model sessions
Capture insights that benefit all future model interactions.
```python
async def save_model_insight(
    topic: str,
    content: str,
    model_used: str,
    priority: str = "reference"
):
    """Save insights from any model for cross-model knowledge sharing."""
    async with httpx.AsyncClient() as http:
        await http.post(
            "https://mcp.stompy.ai/sse",
            headers={"Authorization": f"Bearer {os.environ['STOMPY_TOKEN']}"},
            json={
                "tool": "lock_context",
                "topic": topic,
                "content": f"[Generated by {model_used}]\n\n{content}",
                "tags": f"model:{model_used},auto-generated",
                "priority": priority
            }
        )


# Example: Save code analysis from DeepSeek for future Llama sessions
await save_model_insight(
    topic="codebase_analysis_auth",
    content="""Authentication Flow Analysis:
1. JWT tokens generated in /api/auth/login
2. Middleware validates in _middleware.ts
3. Token refresh handled by /api/auth/refresh
4. User context stored in React context (AuthProvider)""",
    model_used="deepseek-ai/DeepSeek-Coder-33B-Instruct",
    priority="important"
)
```
What You Get
- Model-agnostic memory: Same Stompy context works with Llama, Mixtral, Qwen, DeepSeek—switch models freely
- Cross-model knowledge sharing: Insights from DeepSeek code analysis available to Llama conversations
- Open-source + enterprise features: Get persistent memory without vendor lock-in to proprietary APIs
- Cost optimization: Use cheaper models for simple queries while maintaining full context (see the routing sketch after this list)
- Future-proof: As new open models emerge, they inherit your existing project memory instantly
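To illustrate the cost-optimization point, you can route simple queries to a smaller model and reserve the large model for complex ones, while both paths share the same Stompy context. This is a minimal sketch; the routing heuristic and helper are assumptions, not part of the Stompy or Together AI APIs, and the model names are illustrative:

```python
# Hypothetical routing helper: cheap model for short prompts, large model otherwise.
# Both paths reuse open_model_completion, so they see the same persistent memory.
CHEAP_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"    # illustrative
LARGE_MODEL = "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo"  # illustrative

async def routed_completion(user_message: str) -> str:
    # Naive heuristic: treat short questions as "simple"; swap in your own logic
    model = CHEAP_MODEL if len(user_message) < 200 else LARGE_MODEL
    return await open_model_completion(user_message, model=model)
```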
Ready to give Together AI a memory?
Join the waitlist and be the first to know when Stompy is ready. Your Together AI projects will never forget again.