April 20, 2026 · 2 min read
Cost routing across model tiers in multi-agent systems
The practical levers for controlling spend when you're running multiple Claude instances simultaneously — without hand-rolling infrastructure.
In a multi-agent system, the orchestrator decides which worker model handles each subtask. Most teams treat this as a routing problem solved once at architecture time — pick a model, hardcode the endpoint, ship it. That works until you're running twenty parallel agents and the bill arrives.
The real cost lever isn't which model you choose. It's when you use a large model versus a small one, and how that decision scales across thousands of daily invocations.
Two-tier routing
The simplest effective pattern is two tiers: a capable frontier model for tasks that require reasoning, synthesis, or judgment; a fast, cheap model for tasks that are structural — extraction, classification, formatting, retrieval re-ranking.
Most orchestration frameworks make this look harder than it is. In practice, routing logic fits in a single function:
def route_task(task: Task) -> str:
    if task.type in REASONING_TASKS:
        return "claude-opus-4-6"
    return "claude-haiku-4-5-20251001"
The catch is maintaining REASONING_TASKS. Teams tend to over-classify tasks as requiring reasoning because the failure mode of under-classifying is visible (bad output) while over-classifying is invisible (unnecessary spend).
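One way to keep the allowlist honest is to start it deliberately small and count routing decisions, so frontier-model traffic shows up in metrics rather than only on the invoice. A minimal sketch, where the task-type names and the Task shape are illustrative assumptions, not a fixed taxonomy:

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative task-type names; a real system has its own taxonomy.
REASONING_TASKS = {"synthesis", "code_generation", "agent_evaluation"}

@dataclass
class Task:
    type: str

# Visibility into how often each tier is chosen — over-classification
# stops being invisible once this is on a dashboard.
route_counts: Counter = Counter()

def route_task(task: Task) -> str:
    model = ("claude-opus-4-6" if task.type in REASONING_TASKS
             else "claude-haiku-4-5-20251001")
    route_counts[model] += 1
    return model

for t in ["extraction", "synthesis", "classification", "formatting"]:
    route_task(Task(type=t))
# With this sample workload, 3 of 4 tasks land on the small model.
```

If the counter shows most traffic going to the frontier model, that is the signal to re-audit the allowlist, not to accept the spend.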
What actually needs a large model
In practice, only a handful of task types genuinely require frontier-level reasoning:
- Synthesizing conflicting evidence across long documents
- Writing code that requires understanding architectural context
- Handling edge cases outside the training distribution of smaller models
- Evaluating the output quality of other agents
Everything else — summarization of well-structured content, entity extraction, yes/no classification, template filling — runs fine on smaller models at a fraction of the cost.
Note
The 80/20 here is real. In most agentic pipelines, fewer than 20% of tasks benefit from the frontier model. Routing the rest to a smaller model typically cuts costs by 60–70% with no measurable quality regression on end-to-end benchmarks.
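The arithmetic behind that figure is easy to check. Assuming, purely for illustration, that a frontier-model task costs about 10x a small-model task:

```python
# Back-of-the-envelope for the 80/20 claim.
frontier_cost = 1.0   # normalized cost per frontier task
small_cost = 0.1      # assumed 10x cheaper — an illustrative ratio

tasks = 10_000
all_frontier = tasks * frontier_cost                             # baseline: everything on the frontier model
routed = 0.2 * tasks * frontier_cost + 0.8 * tasks * small_cost  # 20% frontier, 80% small

savings = 1 - routed / all_frontier
print(f"savings: {savings:.0%}")  # 72% under these assumptions
```

A 10x price gap and an 80/20 split land at 72% savings; with a smaller gap or a more cautious split, the number drifts down into the 60–70% range quoted above.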
Token budget enforcement
Routing is necessary but not sufficient. The second lever is token budgets per task type. Frontier models are expensive per token, but they're also verbose by default — without a budget, they'll fill their context window when a 200-token response would do.
Set explicit max_tokens per task type, not per model. This enforces discipline at the task level and makes cost behavior predictable when you swap models underneath.
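In code, that discipline is a lookup keyed on task type, with a conservative fallback. The budget values and task-type names here are illustrative assumptions:

```python
# Per-task-type output budgets (values are illustrative, tune per workload).
TOKEN_BUDGETS = {
    "synthesis": 1200,
    "extraction": 300,
    "classification": 50,
    "formatting": 200,
}
DEFAULT_BUDGET = 400  # conservative fallback for unlisted task types

def request_params(task_type: str, model: str) -> dict:
    # Budget is keyed on the task, not the model: swapping models
    # underneath leaves cost behavior predictable.
    return {
        "model": model,
        "max_tokens": TOKEN_BUDGETS.get(task_type, DEFAULT_BUDGET),
    }
```

The returned dict can be passed straight into whatever client call issues the request, so the budget travels with the task rather than living in per-model configuration.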
For a production system running 10,000 tasks per day, the difference between unconstrained output and budgeted output is routinely 3–5x in spend. That's not a rounding error — it's the difference between a sustainable system and one that gets shut down after the first month's invoice.
A good reference for the tradeoffs involved is the Anthropic model overview, which documents the capability and cost differences across tiers clearly enough to inform a routing policy.