
Zero-Inference Design

SuperModel’s core innovation is achieving complete zero-inference operation on the server side by using MCP sampling for ALL decision-making points, from request routing to UI generation.

The Traditional Problem

Most generative UI systems depend on expensive server-side LLM inference, which creates several problems:
  • High inference costs for every request
  • Need to maintain LLM infrastructure
  • Scaling costs increase linearly with usage
  • Complex model management and versioning

SuperModel’s Solution

SuperModel flips the model: the client’s LLM handles all reasoning through MCP sampling, which brings several benefits:
  • $0 server inference costs
  • No LLM infrastructure needed
  • Scaling adds no inference cost (it stays at $0)
  • Client controls LLM choice and quality

How Zero-Inference Works

1. Request Routing via Sampling

In a traditional architecture, the server calls its own LLM to decide which tool should handle each request:
// Server pays for LLM inference
const routingDecision = await serverLLM.analyze({
  prompt: "Which tool should handle this request?",
  context: userRequest
});

// Cost: $0.01-0.10 per request
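SuperModel delegates the same decision to the client’s LLM via MCP sampling. A minimal sketch of what that routing call could look like; the sample helper and the tool names are illustrative assumptions, not a published API:
// SuperModel: the client's LLM makes the routing decision via MCP sampling
// (`tool.sample`, the tool names, and the message shape are illustrative assumptions)
const routingDecision = await tool.sample({
  messages: [{
    role: "user",
    content: {
      type: "text",
      text: `Which tool should handle this request?\n\nRequest: "${userRequest}"\n\nAvailable tools: product-search-ui, checkout-ui, bundle-builder-ui`
    }
  }],
  systemPrompt: "Return only the name of the single best tool.",
  maxTokens: 50
});

// Cost to the server: $0 - the client pays for its own inference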

2. UI Generation via Sampling

Similarly, UI generation delegates all creative work to the client:
{
  "method": "sampling/createMessage",
  "params": {
    "messages": [{
      "role": "user", 
      "content": {
        "type": "text",
        "text": "Generate a React component for product search with these requirements:\n- Grid layout with images\n- Price and category filters\n- Sort dropdown\n- Add to cart buttons\n\nUse AG-UI event system for interactions.\n\nProduct data: [...products]"
      }
    }],
    "systemPrompt": "You are an expert React developer. Generate clean, functional AG-UI components.",
    "maxTokens": 2000
  }
}
The server never interprets or modifies the generated code - it simply packages whatever the client’s LLM returns.
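A rough sketch of that packaging step, assuming a text-bearing sampling response and a simple UI resource wrapper; both shapes are assumptions here, not SuperModel’s published types:
// The server packages whatever the client's LLM returned - no interpretation.
// Types below are assumptions for illustration only.
interface SamplingResponse { content: { type: "text"; text: string } }
interface UIResource { uri: string; mimeType: string; text: string }

function packageUI(response: SamplingResponse): UIResource {
  return {
    uri: "ui://product-search/results", // assumed URI scheme
    mimeType: "text/html",
    text: response.content.text,        // generated code passed through verbatim
  };
}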

3. Context Management via Sampling

Even complex multi-step workflows use sampling for decision-making:
{
  "method": "sampling/createMessage",
  "params": {
    "messages": [{
      "role": "user",
      "content": {
        "type": "text", 
        "text": "User wants to complete their shopping journey. Previous context:\n\n{\"selected_products\": [\"headphones-1\"], \"budget\": \"$200\", \"use_case\": \"work_from_home\"}\n\nUser just said: 'Add a carrying case and checkout'\n\nAvailable tools:\n- bundle-builder-ui: Add complementary products\n- checkout-ui: Complete purchase\n- product-search-ui: Find more products\n\nWhich tool should handle this next step?"
      }
    }],
    "systemPrompt": "Consider the user journey and context. Return the best next tool."
  }
}
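On the server side, the client’s answer is treated as data rather than reasoning to re-run: the returned tool name is simply dispatched. A rough sketch, with the response shape, the tool registry, and the request variable all assumed for illustration:
// Dispatch on whatever tool name the client's LLM returned.
// (`tool.sample`, `nextStepRequest`, the response shape, and `registeredTools`
// are illustrative assumptions.)
const response = await tool.sample(nextStepRequest);
const choice = response.content.text.trim(); // e.g. "bundle-builder-ui"

// Fall back to a safe default if the client names an unknown tool
const next = registeredTools.get(choice) ?? registeredTools.get("product-search-ui");
return next.process(userRequest, context);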

Cost Comparison

Scenario           | Traditional | SuperModel | Savings
Simple Calculator  | $0.05       | $0.00      | 100%
E-commerce Search  | $0.15       | $0.00      | 100%
Multi-App Workflow | $0.50       | $0.00      | 100%
1,000 Requests/Day | $150/day    | $0/day     | $54,750/year
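The annual figure follows directly from the daily one: at roughly $0.15 per request, 1,000 requests/day costs about $150/day on the traditional side, and $150/day × 365 days = $54,750/year that SuperModel does not spend.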

Implementation Guarantees

SuperModel enforces zero-inference through architectural constraints.
The framework’s TypeScript interfaces give tools no direct LLM access; the only way a tool can reach a model is through sampling:
interface SuperModelTool {
  // No direct LLM access allowed
  process(request: Request, context: Context): Promise<UIResource>;
  
  // Only sampling is available
  sample(prompt: SamplingRequest): Promise<SamplingResponse>;
}
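A tool written against this interface can only reach a model through sample(). A minimal sketch of an implementation under these constraints; the Request, Context, UIResource, and sampling shapes are assumed for illustration:
// Sketch of a tool that satisfies SuperModelTool using only MCP sampling.
// All referenced shapes (Request, Context, UIResource, SamplingRequest,
// SamplingResponse) are assumptions, not SuperModel's published types.
class ProductSearchTool implements SuperModelTool {
  constructor(private mcpClient: { sample(req: SamplingRequest): Promise<SamplingResponse> }) {}

  async process(request: Request, context: Context): Promise<UIResource> {
    const generated = await this.sample({
      messages: [{
        role: "user",
        content: {
          type: "text",
          text: `Generate a product search UI for: ${JSON.stringify(request)}\nContext: ${JSON.stringify(context)}`
        }
      }],
      systemPrompt: "You are an expert React developer. Generate clean, functional AG-UI components.",
      maxTokens: 2000
    });

    // Package the client's output verbatim - no server-side interpretation
    return { uri: "ui://product-search", mimeType: "text/html", text: generated.content.text };
  }

  sample(prompt: SamplingRequest): Promise<SamplingResponse> {
    return this.mcpClient.sample(prompt); // forwarded to the MCP client, never to an LLM API
  }
}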
SuperModel can optionally monitor for unexpected LLM API calls:
// Optional: Block all outbound LLM API calls
gateway.enableInferenceMonitoring({
  blockOpenAI: true,
  blockAnthropic: true, 
  blockOllama: true,
  onViolation: (call) => {
    throw new Error(`Unexpected LLM call detected: ${call.url}`);
  }
});
SuperModel servers can run in environments with no LLM API access to prove zero-inference:
# Deploy with no internet access to LLM APIs
docker run --network=isolated supermodel-server

# Still functions perfectly with MCP sampling

Performance Implications

Latency Considerations

Slight Latency Increase: Zero-inference comes with 500-2000ms additional latency for MCP sampling round-trips.
Traditional
  • Server LLM: 500-1500ms
  • Total: 500-1500ms
SuperModel
  • MCP Sampling: 1000-3000ms
  • Total: 1000-3000ms

Optimization Strategies

1. Parallel Sampling: Execute routing and context analysis in parallel when possible (a combined sketch with caching follows this list).
2. Caching: Cache common routing decisions and UI patterns.
3. Streaming: Stream UI generation for immediate user feedback.
4. Preloading: Preload likely next tools based on user journey patterns.
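A rough sketch of the first two strategies combined, running independent sampling calls in parallel and caching routing decisions; the sample helper and message shapes are illustrative assumptions:
// Run independent sampling calls concurrently and reuse cached routing decisions.
// (`tool.sample` and the message/response shapes are illustrative assumptions.)
const routingCache = new Map<string, string>();

async function routeRequest(userRequest: string): Promise<string> {
  const cached = routingCache.get(userRequest);
  if (cached) return cached; // skip a sampling round-trip entirely

  const userMessage = (text: string) => ({
    role: "user" as const,
    content: { type: "text" as const, text }
  });

  // Routing and context analysis are independent, so sample them in parallel
  const [routing, contextAnalysis] = await Promise.all([
    tool.sample({ messages: [userMessage(`Which tool should handle: "${userRequest}"?`)], maxTokens: 50 }),
    tool.sample({ messages: [userMessage(`Summarize the relevant context for: "${userRequest}"`)], maxTokens: 200 })
  ]);

  // contextAnalysis.content.text would typically feed the next sampling step
  const choice = routing.content.text.trim();
  routingCache.set(userRequest, choice);
  return choice;
}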

When Zero-Inference Makes Sense

High Volume Applications

Applications with thousands of daily requests where inference costs would be significant.

Cost-Sensitive Deployments

Startups, open-source projects, or applications with tight budgets.

Client-Controlled Quality

When you want users to control their LLM choice and quality settings.

Regulatory Compliance

When data cannot leave the client environment for LLM processing.

Trade-offs to Consider

SuperModel trades some latency (500-2000ms) for complete cost elimination. Consider if this trade-off makes sense for your use case.
UI quality depends on the client’s LLM capability. A client with a weak LLM will generate lower-quality UIs.
Requires reliable client-server communication for sampling. Network issues affect functionality.
Only works with MCP clients that support sampling. Traditional REST API clients cannot use SuperModel.

Next Steps

Gateway Pattern

Learn how SuperModel implements intelligent routing without inference.

Hello World Example

See zero-inference in action with a step-by-step example.