
Zero-Inference Design

SuperModel’s core innovation is achieving complete zero-inference operation on the server side by using MCP sampling for ALL decision-making points, from request routing to UI generation.

The Traditional Problem

Most generative UI systems require expensive server-side LLM inference. Problems:
  • High inference costs for every request
  • Need to maintain LLM infrastructure
  • Scaling costs increase linearly with usage
  • Complex model management and versioning

SuperModel’s Solution

SuperModel flips this model by using the client’s LLM for all reasoning through MCP sampling. Benefits:
  • $0 server inference costs
  • No LLM infrastructure needed
  • Infinite scaling (costs stay at $0)
  • Client controls LLM choice and quality

How Zero-Inference Works

1. Request Routing via Sampling

Instead of using a server-side LLM to determine routing, SuperModel asks the client. The traditional approach looks like this:
// Server pays for LLM inference
const routingDecision = await serverLLM.analyze({
  prompt: "Which tool should handle this request?",
  context: userRequest
});

// Cost: $0.01-0.10 per request
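
SuperModel instead relays the same question to the client over MCP sampling. A minimal sketch, assuming a generic mcpConnection.request transport helper and the example tool names used later in this page:
// SuperModel: the client's LLM makes the routing decision via MCP sampling.
// (mcpConnection.request is an assumed transport helper, not a published SDK call)
const routingDecision = await mcpConnection.request({
  method: "sampling/createMessage",
  params: {
    messages: [{
      role: "user",
      content: {
        type: "text",
        text: `Which tool should handle this request?\n\n${userRequest}\n\nAvailable tools: product-search-ui, bundle-builder-ui, checkout-ui`
      }
    }],
    systemPrompt: "Return only the name of the best tool.",
    maxTokens: 50
  }
});

// Cost to the server: $0.00 (the client pays for and controls the inference)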

2. UI Generation via Sampling

Similarly, UI generation delegates all creative work to the client:
{
  "method": "sampling/createMessage",
  "params": {
    "messages": [{
      "role": "user", 
      "content": {
        "type": "text",
        "text": "Generate a React component for product search with these requirements:\n- Grid layout with images\n- Price and category filters\n- Sort dropdown\n- Add to cart buttons\n\nUse AG-UI event system for interactions.\n\nProduct data: [...products]"
      }
    }],
    "systemPrompt": "You are an expert React developer. Generate clean, functional AG-UI components.",
    "maxTokens": 2000
  }
}
The server never interprets or modifies the generated code - it simply packages whatever the client’s LLM returns.
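As a minimal illustration of that pass-through (the UIResource fields and the response shape here are assumptions, not the framework’s exact types):
// Hypothetical packaging step: wrap whatever the client's LLM returned, unmodified.
function packageUIResource(samplingResponse: { content: { type: "text"; text: string } }) {
  return {
    uri: "ui://product-search/generated",  // illustrative URI
    mimeType: "text/html",                 // depends on how the client renders components
    text: samplingResponse.content.text,   // passed through verbatim, never rewritten
  };
}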

3. Context Management via Sampling

Even complex multi-step workflows use sampling for decision-making:
{
  "method": "sampling/createMessage",
  "params": {
    "messages": [{
      "role": "user",
      "content": {
        "type": "text", 
        "text": "User wants to complete their shopping journey. Previous context:\n\n{\"selected_products\": [\"headphones-1\"], \"budget\": \"$200\", \"use_case\": \"work_from_home\"}\n\nUser just said: 'Add a carrying case and checkout'\n\nAvailable tools:\n- bundle-builder-ui: Add complementary products\n- checkout-ui: Complete purchase\n- product-search-ui: Find more products\n\nWhich tool should handle this next step?"
      }
    }],
    "systemPrompt": "Consider the user journey and context. Return the best next tool."
  }
}
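The reply is then matched against the server’s known tool names as plain text, so no server-side inference is needed to act on it. A hedged sketch (the response handling and fallback choice are illustrative):
// Pick the next tool from the client's sampling reply (no server-side LLM involved).
const knownTools = ["bundle-builder-ui", "checkout-ui", "product-search-ui"];

function pickNextTool(replyText: string): string {
  const match = knownTools.find((tool) => replyText.includes(tool));
  // Fall back to product search if the reply doesn't name a known tool.
  return match ?? "product-search-ui";
}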

Cost Comparison

Scenario            | Traditional | SuperModel | Savings
Simple Calculator   | $0.05       | $0.00      | 100%
E-commerce Search   | $0.15       | $0.00      | 100%
Multi-App Workflow  | $0.50       | $0.00      | 100%
1000 Requests/Day   | $150/day    | $0/day     | $54,750/year

Implementation Guarantees

SuperModel enforces zero-inference through architectural constraints.
The SuperModel framework’s TypeScript interfaces expose no direct LLM access, so sampling is the only reasoning primitive available to a tool:
interface SuperModelTool {
  // No direct LLM access allowed
  process(request: Request, context: Context): Promise<UIResource>;
  
  // Only sampling is available
  sample(prompt: SamplingRequest): Promise<SamplingResponse>;
}
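Under that interface, a tool’s process() method can only reason by round-tripping through sample(). A sketch of what an implementation might look like (the placeholder types and the ProductSearchTool class are illustrative, not the framework’s actual code):
// Minimal placeholder types; the real framework types are richer.
type SamplingRequest = {
  messages: { role: string; content: { type: "text"; text: string } }[];
  systemPrompt?: string;
  maxTokens?: number;
};
type SamplingResponse = { content: { type: "text"; text: string } };
type UIResource = { uri: string; mimeType: string; text: string };

class ProductSearchTool {
  // The sampler is injected by the gateway and forwards requests to the MCP client.
  constructor(private sampler: (req: SamplingRequest) => Promise<SamplingResponse>) {}

  sample(req: SamplingRequest): Promise<SamplingResponse> {
    return this.sampler(req);
  }

  async process(userQuery: string): Promise<UIResource> {
    // The only "thinking" happens on the client, via sampling.
    const generated = await this.sample({
      messages: [{
        role: "user",
        content: { type: "text", text: `Generate a product search UI for: ${userQuery}` },
      }],
      systemPrompt: "You are an expert React developer. Generate clean, functional AG-UI components.",
      maxTokens: 2000,
    });

    // Package whatever came back; never interpret or modify it.
    return { uri: "ui://product-search/result", mimeType: "text/html", text: generated.content.text };
  }
}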
SuperModel can optionally monitor for unexpected LLM API calls:
// Optional: Block all outbound LLM API calls
gateway.enableInferenceMonitoring({
  blockOpenAI: true,
  blockAnthropic: true, 
  blockOllama: true,
  onViolation: (call) => {
    throw new Error(`Unexpected LLM call detected: ${call.url}`);
  }
});
SuperModel servers can run in environments with no LLM API access to prove zero-inference:
# Deploy with no internet access to LLM APIs
docker network create --internal isolated   # internal network: no outbound connectivity
docker run --network=isolated supermodel-server

# Still functions perfectly with MCP sampling

Performance Implications

Latency Considerations

Slight Latency Increase: Zero-inference comes with 500-2000ms additional latency for MCP sampling round-trips.
Traditional:
  • Server LLM: 500-1500ms
  • Total: 500-1500ms
SuperModel:
  • MCP Sampling: 1000-3000ms
  • Total: 1000-3000ms

Optimization Strategies

1. Parallel Sampling: Execute routing and context analysis in parallel when possible (a sketch combining this and the next strategy follows this list).
2. Caching: Cache common routing decisions and UI patterns.
3. Streaming: Stream UI generation for immediate user feedback.
4. Preloading: Preload likely next tools based on user journey patterns.
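
As a rough sketch of how the first two strategies might combine (the sampleText helper and the cache keying are illustrative assumptions, not SuperModel APIs):
// Illustrative helper: send one text prompt to the client and read back plain text.
declare function sampleText(prompt: string): Promise<string>;

// Memoize routing decisions so repeated request shapes skip a sampling round-trip.
const routingCache = new Map<string, string>();

async function routeRequest(userRequest: string): Promise<string> {
  const cached = routingCache.get(userRequest);
  if (cached) return cached;  // cache hit: no added latency

  // Routing and context analysis are independent, so run both sampling calls in parallel.
  const [decision] = await Promise.all([
    sampleText(`Which tool should handle this request?\n\n${userRequest}`),
    sampleText(`Summarize the relevant context for:\n\n${userRequest}`),
  ]);

  routingCache.set(userRequest, decision);
  return decision;
}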

When Zero-Inference Makes Sense

High Volume Applications

Applications with thousands of daily requests where inference costs would be significant.

Cost-Sensitive Deployments

Startups, open-source projects, or applications with tight budgets.

Client-Controlled Quality

When you want users to control their LLM choice and quality settings.

Regulatory Compliance

When data cannot leave the client environment for LLM processing.

Trade-offs to Consider

  • Latency: SuperModel trades some latency (500-2000ms) for complete cost elimination. Consider whether this trade-off makes sense for your use case.
  • UI quality: Output quality depends on the client’s LLM capability. A client with a weak LLM will generate lower-quality UIs.
  • Connectivity: Sampling requires reliable client-server communication; network issues affect functionality.
  • Client support: Only MCP clients that support sampling can use SuperModel. Traditional REST API clients cannot.

Next Steps