2026-06-23

The router, by the numbers

L0 to L3 in 0.6 ms, then cheapest capable silicon across the mesh. ~75% energy reduction at flat quality on real mixed traffic.

The router is the part of Joule Cloud most customers don't think about, which is exactly the goal. You call model: "auto"; we put your request on the cheapest capable silicon currently available across the mesh. This is what's happening when you do.

L0 to L3, in 0.6 ms

Each inference request hits the gateway and is classified by a small distilled model trained to predict task difficulty from the prompt. Classification takes about 0.6 ms. The output is a tier:

| Tier | Bucket | What goes here | Typical J/req |
| --- | --- | --- | --- |
| L0 | lookup | cache hits, small embeddings, key-value access | ~0.01 |
| L1 | extraction | short summarization, classification, NER, sentiment | ~0.05 |
| L2 | aggregation | RAG, mid-context summarization, structured generation | ~0.3 |
| L3 | reasoning | long-context reasoning, code gen, planning, multi-step | ~6 |

The classifier is intentionally conservative: when it's uncertain, it picks the higher tier. We'd rather pay the extra joules of a too-strong model than risk a quality regression a user notices.
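To make the "upgrade when uncertain" rule concrete, here is a minimal sketch. It assumes the classifier emits one probability per tier and that a single confidence threshold decides when to bump up; the `resolve_tier` name, the probability format, and the 0.8 threshold are all hypothetical, not the production logic.

```python
TIERS = ["L0", "L1", "L2", "L3"]

def resolve_tier(probs, min_confidence=0.8):
    """Pick the most likely tier; move one tier up when confidence is low.

    probs: per-tier probabilities in TIERS order (hypothetical classifier output).
    min_confidence: hypothetical threshold, not a documented value.
    """
    if len(probs) != len(TIERS):
        raise ValueError("expected one probability per tier")
    best = max(range(len(TIERS)), key=lambda i: probs[i])
    if probs[best] < min_confidence:
        # Uncertain: prefer a too-strong tier over a too-weak one, capped at L3.
        best = min(best + 1, len(TIERS) - 1)
    return TIERS[best]

print(resolve_tier([0.05, 0.90, 0.04, 0.01]))  # confident L1 stays L1
print(resolve_tier([0.10, 0.55, 0.30, 0.05]))  # uncertain L1 is upgraded to L2
```

The asymmetry is the point: a wrong guess upward costs some joules, while a wrong guess downward costs quality.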

Then: cheapest capable silicon

For the resolved tier, the router maintains a live ranking across every node in the mesh that can serve it. The score is a weighted sum:

The top-ranked node gets the call. If a health check fails mid-flight, the request fails over automatically to the next-ranked node. The routing decision is recorded in the X-Routed-To response header and on the receipt.
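As a rough illustration of rank-then-failover, the sketch below scores nodes with a weighted sum and walks the ranking until it finds a healthy one. The factors (energy per request, latency), the weights, and the node names are invented for the example and are not the router's actual scoring terms. It also simplifies failover to skipping unhealthy nodes at selection time rather than retrying mid-flight.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    max_tier: int          # highest tier this node can serve (0..3)
    joules_per_req: float  # hypothetical scoring factor
    latency_ms: float      # hypothetical scoring factor
    healthy: bool = True

# Hypothetical weights; a lower score is better.
WEIGHTS = {"joules_per_req": 1.0, "latency_ms": 0.01}

def score(node):
    """Weighted sum over the (assumed) scoring factors."""
    return (WEIGHTS["joules_per_req"] * node.joules_per_req
            + WEIGHTS["latency_ms"] * node.latency_ms)

def rank(nodes, tier):
    """All nodes capable of serving the tier, cheapest score first."""
    capable = [n for n in nodes if n.max_tier >= tier]
    return sorted(capable, key=score)

def route(nodes, tier):
    """Return the first healthy node in rank order."""
    for node in rank(nodes, tier):
        if node.healthy:
            return node.name
    raise RuntimeError(f"no healthy node can serve L{tier}")

mesh = [
    Node("edge-a", max_tier=1, joules_per_req=0.05, latency_ms=8),
    Node("gpu-b", max_tier=3, joules_per_req=0.30, latency_ms=20),
    Node("gpu-c", max_tier=3, joules_per_req=0.35, latency_ms=12),
]
print(route(mesh, tier=1))  # edge-a has the lowest score
mesh[0].healthy = False
print(route(mesh, tier=1))  # falls through to gpu-c, the next-ranked node
```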

What this actually saves

For a sample week of mixed customer traffic (anonymised), "auto" vs pinning llama-3.3-70b-instruct:

| Quantity | Pinned 70B | auto |
| --- | --- | --- |
| L0 lookups | 2.1 M × 0.31 J = 651 kJ | 2.1 M × 0.012 J = 25 kJ |
| L1 extractions | 800k × 0.31 J = 248 kJ | 800k × 0.052 J = 42 kJ |
| L2 aggregations | 140k × 0.31 J = 43 kJ | 140k × 0.29 J = 41 kJ |
| L3 reasoning | 22k × 0.31 J = 7 kJ | 22k × 5.9 J = 130 kJ |
| Total | 949 kJ | 237 kJ |

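~75% energy reduction at flat quality (judge-model evaluated). The savings come from L0 and L1, which make up about 95% of requests and where pinning 70B was using a sledgehammer on a thumbtack. On L3, auto actually spends more than the pinned 70B (130 kJ vs 7 kJ); the cheap tiers more than pay for it.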
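If you want to check the arithmetic, the totals can be recomputed from the request counts and per-request energies in the table. The numbers below are copied from it; only the variable names are ours.

```python
# (requests, J/req on pinned 70B, J/req on auto), copied from the table above
traffic = {
    "L0": (2_100_000, 0.31, 0.012),
    "L1": (800_000, 0.31, 0.052),
    "L2": (140_000, 0.31, 0.29),
    "L3": (22_000, 0.31, 5.9),
}

# Sum joules per tier, then convert to kilojoules.
pinned_kj = sum(n * pinned for n, pinned, _ in traffic.values()) / 1000
auto_kj = sum(n * auto for n, _, auto in traffic.values()) / 1000
reduction = 1 - auto_kj / pinned_kj

print(f"pinned: {pinned_kj:.0f} kJ, auto: {auto_kj:.0f} kJ, reduction: {reduction:.0%}")
# pinned: 949 kJ, auto: 237 kJ, reduction: 75%
```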

When NOT to use auto

Three scenarios where you should pin:

  1. Brand voice. If your product's UX depends on the chosen model's response style (Claude-y, GPT-y, etc.), pin.
  2. Determinism for evals. Pin the model when running benchmarks so comparisons are apples-to-apples.
  3. Specialty. Pin Qwen for Chinese, DeepSeek for math, FLUX-dev for image quality.

For everything else, "auto" is the default for a reason.

The override

If you disagree with the classifier on a specific call, force the tier:

# this call will be routed to L3-capable silicon regardless of classification
X-Force-Tier: L3

Every override is audit-logged. Useful for "this looks easy but I know it isn't" cases (intentionally adversarial prompts, very long contexts that the classifier underestimates).
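A minimal sketch of attaching the override from client code. Only the `X-Force-Tier` header and `model: "auto"` come from this post; the `build_request` helper, the JSON body shape, and the `prompt` field are assumptions for illustration, so check the API reference for the real request format.

```python
import json

VALID_TIERS = {"L0", "L1", "L2", "L3"}

def build_request(prompt, force_tier=None):
    """Return (headers, body) for a model="auto" call, optionally forcing a tier."""
    headers = {"Content-Type": "application/json"}
    if force_tier is not None:
        if force_tier not in VALID_TIERS:
            raise ValueError(f"unknown tier: {force_tier!r}")
        headers["X-Force-Tier"] = force_tier
    # The body shape ("model", "prompt") is assumed for illustration.
    body = json.dumps({"model": "auto", "prompt": prompt})
    return headers, body

headers, body = build_request(
    "Audit this 200-page contract for conflicts.", force_tier="L3"
)
print(headers)
print(body)
```

Validating the tier client-side keeps a typo like "L4" from reaching the gateway, where the behavior for unknown values isn't specified here.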

What's coming

The model "auto" is doing more under the hood than the docs make obvious. We make it look simple because that's the point.