
Zero-Inference Design

SuperModel’s core innovation is achieving complete zero-inference operation on the server side by using MCP sampling for ALL decision-making points, from request routing to UI generation.

The Traditional Problem

Most generative UI systems depend on expensive server-side LLM inference, which creates several problems:
  • High inference costs for every request
  • Need to maintain LLM infrastructure
  • Scaling costs increase linearly with usage
  • Complex model management and versioning

SuperModel’s Solution

SuperModel flips the model: the client’s LLM handles all reasoning through MCP sampling, which brings several benefits:
  • $0 server inference costs
  • No LLM infrastructure needed
  • Scaling adds no inference cost (it stays at $0)
  • Client controls LLM choice and quality

How Zero-Inference Works

1. Request Routing via Sampling

In a traditional architecture, the server calls its own LLM to decide which tool should handle each request:
// Server pays for LLM inference
const routingDecision = await serverLLM.analyze({
  prompt: "Which tool should handle this request?",
  context: userRequest
});

// Cost: $0.01-0.10 per request
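SuperModel delegates the same decision to the client’s LLM via MCP sampling. A minimal sketch of what that routing call could look like; the sample helper and the tool names are illustrative assumptions, not a published API:
// SuperModel: the client's LLM makes the routing decision via MCP sampling
// (`tool.sample`, the tool names, and the message shape are illustrative assumptions)
const routingDecision = await tool.sample({
  messages: [{
    role: "user",
    content: {
      type: "text",
      text: `Which tool should handle this request?\n\nRequest: "${userRequest}"\n\nAvailable tools: product-search-ui, checkout-ui, bundle-builder-ui`
    }
  }],
  systemPrompt: "Return only the name of the single best tool.",
  maxTokens: 50
});

// Cost to the server: $0 - the client pays for its own inference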

2. UI Generation via Sampling

Similarly, UI generation delegates all creative work to the client:
{
  "method": "sampling/createMessage",
  "params": {
    "messages": [{
      "role": "user", 
      "content": {
        "type": "text",
        "text": "Generate a React component for product search with these requirements:\n- Grid layout with images\n- Price and category filters\n- Sort dropdown\n- Add to cart buttons\n\nUse AG-UI event system for interactions.\n\nProduct data: [...products]"
      }
    }],
    "systemPrompt": "You are an expert React developer. Generate clean, functional AG-UI components.",
    "maxTokens": 2000
  }
}
The server never interprets or modifies the generated code - it simply packages whatever the client’s LLM returns.
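A rough sketch of that packaging step, assuming a text-bearing sampling response and a simple UI resource wrapper; both shapes are assumptions here, not SuperModel’s published types:
// The server packages whatever the client's LLM returned - no interpretation.
// Types below are assumptions for illustration only.
interface SamplingResponse { content: { type: "text"; text: string } }
interface UIResource { uri: string; mimeType: string; text: string }

function packageUI(response: SamplingResponse): UIResource {
  return {
    uri: "ui://product-search/results", // assumed URI scheme
    mimeType: "text/html",
    text: response.content.text,        // generated code passed through verbatim
  };
}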

3. Context Management via Sampling

Even complex multi-step workflows use sampling for decision-making:
{
  "method": "sampling/createMessage",
  "params": {
    "messages": [{
      "role": "user",
      "content": {
        "type": "text", 
        "text": "User wants to complete their shopping journey. Previous context:\n\n{\"selected_products\": [\"headphones-1\"], \"budget\": \"$200\", \"use_case\": \"work_from_home\"}\n\nUser just said: 'Add a carrying case and checkout'\n\nAvailable tools:\n- bundle-builder-ui: Add complementary products\n- checkout-ui: Complete purchase\n- product-search-ui: Find more products\n\nWhich tool should handle this next step?"
      }
    }],
    "systemPrompt": "Consider the user journey and context. Return the best next tool."
  }
}
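On the server side, the client’s answer is treated as data rather than reasoning to re-run: the returned tool name is simply dispatched. A rough sketch, with the response shape, the tool registry, and the request variable all assumed for illustration:
// Dispatch on whatever tool name the client's LLM returned.
// (`tool.sample`, `nextStepRequest`, the response shape, and `registeredTools`
// are illustrative assumptions.)
const response = await tool.sample(nextStepRequest);
const choice = response.content.text.trim(); // e.g. "bundle-builder-ui"

// Fall back to a safe default if the client names an unknown tool
const next = registeredTools.get(choice) ?? registeredTools.get("product-search-ui");
return next.process(userRequest, context);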

Cost Comparison

Scenario           | Traditional | SuperModel | Savings
Simple Calculator  | $0.05       | $0.00      | 100%
E-commerce Search  | $0.15       | $0.00      | 100%
Multi-App Workflow | $0.50       | $0.00      | 100%
1,000 Requests/Day | $150/day    | $0/day     | $54,750/year
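The annual figure follows directly from the daily one: at roughly $0.15 per request, 1,000 requests/day costs about $150/day on the traditional side, and $150/day × 365 days = $54,750/year that SuperModel does not spend.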

Implementation Guarantees

SuperModel enforces zero-inference through architectural constraints.
The framework’s TypeScript interfaces give tools no direct LLM access; the only way a tool can reach a model is through sampling:
interface SuperModelTool {
  // No direct LLM access allowed
  process(request: Request, context: Context): Promise<UIResource>;
  
  // Only sampling is available
  sample(prompt: SamplingRequest): Promise<SamplingResponse>;
}
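A tool written against this interface can only reach a model through sample(). A minimal sketch of an implementation under these constraints; the Request, Context, UIResource, and sampling shapes are assumed for illustration:
// Sketch of a tool that satisfies SuperModelTool using only MCP sampling.
// All referenced shapes (Request, Context, UIResource, SamplingRequest,
// SamplingResponse) are assumptions, not SuperModel's published types.
class ProductSearchTool implements SuperModelTool {
  constructor(private mcpClient: { sample(req: SamplingRequest): Promise<SamplingResponse> }) {}

  async process(request: Request, context: Context): Promise<UIResource> {
    const generated = await this.sample({
      messages: [{
        role: "user",
        content: {
          type: "text",
          text: `Generate a product search UI for: ${JSON.stringify(request)}\nContext: ${JSON.stringify(context)}`
        }
      }],
      systemPrompt: "You are an expert React developer. Generate clean, functional AG-UI components.",
      maxTokens: 2000
    });

    // Package the client's output verbatim - no server-side interpretation
    return { uri: "ui://product-search", mimeType: "text/html", text: generated.content.text };
  }

  sample(prompt: SamplingRequest): Promise<SamplingResponse> {
    return this.mcpClient.sample(prompt); // forwarded to the MCP client, never to an LLM API
  }
}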
SuperModel can optionally monitor for unexpected LLM API calls:
// Optional: Block all outbound LLM API calls
gateway.enableInferenceMonitoring({
  blockOpenAI: true,
  blockAnthropic: true, 
  blockOllama: true,
  onViolation: (call) => {
    throw new Error(`Unexpected LLM call detected: ${call.url}`);
  }
});
SuperModel servers can run in environments with no LLM API access to prove zero-inference:
# Deploy with no internet access to LLM APIs
docker run --network=isolated supermodel-server

# Still functions perfectly with MCP sampling

Performance Implications

Latency Considerations

Slight Latency Increase: Zero-inference comes with 500-2000ms additional latency for MCP sampling round-trips.
Traditional
  • Server LLM: 500-1500ms
  • Total: 500-1500ms
SuperModel
  • MCP Sampling: 1000-3000ms
  • Total: 1000-3000ms

Optimization Strategies

1. Parallel Sampling: Execute routing and context analysis in parallel when possible (a combined sketch with caching follows this list).
2. Caching: Cache common routing decisions and UI patterns.
3. Streaming: Stream UI generation for immediate user feedback.
4. Preloading: Preload likely next tools based on user journey patterns.
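A rough sketch of the first two strategies combined, running independent sampling calls in parallel and caching routing decisions; the sample helper and message shapes are illustrative assumptions:
// Run independent sampling calls concurrently and reuse cached routing decisions.
// (`tool.sample` and the message/response shapes are illustrative assumptions.)
const routingCache = new Map<string, string>();

async function routeRequest(userRequest: string): Promise<string> {
  const cached = routingCache.get(userRequest);
  if (cached) return cached; // skip a sampling round-trip entirely

  const userMessage = (text: string) => ({
    role: "user" as const,
    content: { type: "text" as const, text }
  });

  // Routing and context analysis are independent, so sample them in parallel
  const [routing, contextAnalysis] = await Promise.all([
    tool.sample({ messages: [userMessage(`Which tool should handle: "${userRequest}"?`)], maxTokens: 50 }),
    tool.sample({ messages: [userMessage(`Summarize the relevant context for: "${userRequest}"`)], maxTokens: 200 })
  ]);

  // contextAnalysis.content.text would typically feed the next sampling step
  const choice = routing.content.text.trim();
  routingCache.set(userRequest, choice);
  return choice;
}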

When Zero-Inference Makes Sense

High Volume Applications

Applications with thousands of daily requests where inference costs would be significant.

Cost-Sensitive Deployments

Startups, open-source projects, or applications with tight budgets.

Client-Controlled Quality

When you want users to control their LLM choice and quality settings.

Regulatory Compliance

When data cannot leave the client environment for LLM processing.

Trade-offs to Consider

SuperModel trades some latency (500-2000ms) for complete cost elimination. Consider if this trade-off makes sense for your use case.
UI quality depends on the client’s LLM capability. A client with a weak LLM will generate lower-quality UIs.
Requires reliable client-server communication for sampling. Network issues affect functionality.
Only works with MCP clients that support sampling. Traditional REST API clients cannot use SuperModel.

Next Steps

Gateway Pattern

Learn how SuperModel implements intelligent routing without inference.

Hello World Example

See zero-inference in action with a step-by-step example.