Model routing: Haiku / Sonnet / Opus per task

הבעיה: ברירת המחדל היא תמיד "המודל החזק"The problem: the default is always "the strong model"

כשמתחילים לבנות סוכן, הפיתוי לבחור את המודל החזק ביותר לכל דבר הוא אדיר. זה עובד, זה מרגיש בטוח, וזה מסתיר באגים. מה שלא רואים בזמן הפיתוח: הקצב שבו הסוכן הזה שורף כסף בפרודקשן.

סוכן רקע אצלנו מבצע בממוצע 180 קריאות ביום למשתמש פעיל. נניח 8K טוקני input ו-1.2K טוקני output לקריאה. עם Opus זה ~$1.30 ליום למשתמש = $39 לחודש. כפול 100 משתמשים = $3,900. עם Sonnet זה ~$0.32 ליום = $9.60 לחודש = $960 ל-100 משתמשים. עם Haiku + caching זה $0.015 ליום = $0.45 לחודש = $45 ל-100 משתמשים.

הפער בין הקצוות הוא פי 87. השאלה אינה "האם לחסוך", השאלה היא איזה חלק מהעבודה באמת דורש Opus. במערכת שלנו זה התברר כפחות מ-15% מהקריאות.

When building a new agent, the temptation to reach for the strongest model for everything is huge. It works, it feels safe, and it hides bugs. What you do not see in development is the rate at which that agent burns money in production.

A typical background agent of ours does 180 calls per day per active user. Call it 8K input tokens and 1.2K output tokens per call. On Opus that is ~$1.30/day per user = $39/month. Times 100 users = $3,900. On Sonnet, ~$0.32/day = $9.60/month = $960 for 100 users. On Haiku with caching, $0.015/day = $0.45/month = $45 for 100 users.

The spread end-to-end is 87x. The question is not "should we economize." The question is which fraction of the work actually requires Opus. In our system that turned out to be under 15% of calls.

חלוקה לפי משימה, לא לפי לקוחTier by task type, not by customer tier

הפיתוי הראשון של מנהל מוצר הוא לתת ל"לקוחות פרימיום" את Opus ולכל השאר את Haiku. זאת טעות. לקוח פרימיום שמבקש לסווג לאיזה צבע שייכת תמונה לא צריך Opus יותר מאשר לקוח חינמי. החלוקה הנכונה היא לפי מהות המשימה.

Haiku: סיווג כוונה, ניתוב לסוכן הנכון, פילטור spam, הוצאת ישות מטקסט קצר, תקציר של בלוק קוד.
Sonnet: כתיבת קוד למסך אחד, תיקון באג בקובץ נתון, עיצוב landing page, ניסוח מענה ללקוח.
Opus: סוכן build מקצה לקצה (refactor של 10 קבצים), debugging של מערכת רב-שירותית, החלטות אדריכליות עם trade-offs.

אצלנו ה-Builder המרכזי רץ על Opus 4.6 כי הוא עושה plan-then-execute על 10+ קבצים בו-זמנית. אבל ה-router שמחליט "האם זה build חדש או המשך לפרויקט קיים" רץ על Haiku, וה-classifier שמחליט אם הודעת waApp היא תמיכה או מכירה רץ גם הוא על Haiku.

The product-manager instinct is to give "premium" customers Opus and everyone else Haiku. That is wrong. A premium user asking to classify the dominant color of an image does not need Opus more than a free user does. The right axis is task nature.

Haiku: intent classification, routing, spam filtering, entity extraction from short text, summary of a code block.
Sonnet: writing one screen of code, fixing a bug in a known file, designing a landing page, drafting a customer reply.
Opus: end-to-end build agent (refactor 10 files), multi-service debugging, architectural decisions with trade-offs.

Our main Builder runs on Opus 4.6 because it does plan-then-execute across 10+ files in one turn. But the router that decides "is this a new build or a continuation of an existing project" runs on Haiku, and the classifier that decides whether an inbound WhatsApp message is support or sales is also Haiku.

דוגמה: ראוטר זעיר שבוחר מודלExample: a tiny router that picks the model

הראוטר עצמו לא צריך להיות חכם. רוב המידע נמצא בסוג ה-task, לא בתוכנו. אנחנו עושים סיווג בשתי שכבות: כלל מבוסס regex לטיפול במקרים ברורים, ואז Haiku אם נדרש שיפוט.

const MODELS = {
  cheap: "claude-haiku-4-6",
  mid:   "claude-sonnet-4-6",
  hard:  "claude-opus-4-6"
};

function routeByTask(task) {
  // Layer 1: deterministic
  if (task.kind === "classify" || task.kind === "extract") return MODELS.cheap;
  if (task.kind === "summarize" && task.input.length < 4000) return MODELS.cheap;
  if (task.kind === "draft_reply" && !task.requires_code) return MODELS.cheap;

  if (task.kind === "code_edit" && task.files.length <= 2) return MODELS.mid;
  if (task.kind === "design_page") return MODELS.mid;

  if (task.kind === "build_project") return MODELS.hard;
  if (task.kind === "refactor" && task.files.length > 3) return MODELS.hard;
  if (task.kind === "debug" && task.severity >= 4) return MODELS.hard;

  // Layer 2: when in doubt, ask Haiku
  return MODELS.cheap;
}

async function run(task) {
  const model = routeByTask(task);
  return client.messages.create({ model, ...buildRequest(task) });
}

שימו לב לכלל הסיום: "כשבספק, Haiku", לא "כשבספק, Opus". אם Haiku נכשל, אפשר לעשות retry עם Sonnet — escalation זול יותר משריפת Opus על מקרה שלא היה צריך אותו.

The router itself does not need to be smart. Most of the signal lives in the task kind, not in its content. We do two layers: deterministic rules for clear cases, then Haiku for judgment calls.

const MODELS = {
  cheap: "claude-haiku-4-6",
  mid:   "claude-sonnet-4-6",
  hard:  "claude-opus-4-6"
};

function routeByTask(task) {
  // Layer 1: deterministic
  if (task.kind === "classify" || task.kind === "extract") return MODELS.cheap;
  if (task.kind === "summarize" && task.input.length < 4000) return MODELS.cheap;
  if (task.kind === "draft_reply" && !task.requires_code) return MODELS.cheap;

  if (task.kind === "code_edit" && task.files.length <= 2) return MODELS.mid;
  if (task.kind === "design_page") return MODELS.mid;

  if (task.kind === "build_project") return MODELS.hard;
  if (task.kind === "refactor" && task.files.length > 3) return MODELS.hard;
  if (task.kind === "debug" && task.severity >= 4) return MODELS.hard;

  // Layer 2: when in doubt, ask Haiku
  return MODELS.cheap;
}

async function run(task) {
  const model = routeByTask(task);
  return client.messages.create({ model, ...buildRequest(task) });
}

Note the fallback: "when in doubt, Haiku", not "when in doubt, Opus." If Haiku fails, we retry on Sonnet — escalation is cheaper than burning Opus on a task that did not need it.

תבנית escalation: התחל זול, עלה לפי הצורךEscalation pattern: start cheap, climb on demand

השדרוג של הראוטר הבסיסי הוא escalation מבוסס תוצאה. הסוכן מנסה את המודל הזול קודם, ואם התוצאה לא עוברת ולידציה, הוא עולה למודל החזק יותר. עובד יפה כשהוולידציה זולה (regex, JSON parse, schema check).

async function withEscalation(task) {
  const tiers = [MODELS.cheap, MODELS.mid, MODELS.hard];
  let lastErr;
  for (const model of tiers) {
    const out = await client.messages.create({ model, ...buildRequest(task) });
    const check = validate(out, task.expected_schema);
    if (check.ok) return { result: out, modelUsed: model };
    lastErr = check.reason;
    // log so we can later raise the floor for this task kind
    metrics.incr(`route.escalate.${task.kind}.${model}`, { reason: lastErr });
  }
  throw new Error(`All tiers failed: ${lastErr}`);
}

אחרי שבועיים של escalation logs, רואים בבירור איזה task kinds צריכים להתחיל בשכבה גבוהה יותר. אם 60% מ-extract_invoice נופל ב-Haiku ועובר ב-Sonnet, מעלים את הברירת מחדל ל-Sonnet ומוחקים את שכבת ה-Haiku מהמסלול הזה. ה-router שלך משתפר עם הזמן בלי בני אדם.

Winהקטגוריה הכי גדולה של escalations אצלנו הייתה tone_check על מענים בעברית. Haiku החמיץ דקויות. הזזנו אותה לברירת מחדל Sonnet והעלאות נחתכו ב-90%.

The natural upgrade to a basic router is escalation: try the cheap model first, escalate when the output fails validation. This works beautifully when validation is itself cheap — regex, JSON parse, schema check.

async function withEscalation(task) {
  const tiers = [MODELS.cheap, MODELS.mid, MODELS.hard];
  let lastErr;
  for (const model of tiers) {
    const out = await client.messages.create({ model, ...buildRequest(task) });
    const check = validate(out, task.expected_schema);
    if (check.ok) return { result: out, modelUsed: model };
    lastErr = check.reason;
    metrics.incr(`route.escalate.${task.kind}.${model}`, { reason: lastErr });
  }
  throw new Error(`All tiers failed: ${lastErr}`);
}

After two weeks of escalation logs, the task kinds that should have started one tier higher are obvious. If 60% of extract_invoice fails on Haiku and passes on Sonnet, promote the default to Sonnet and remove Haiku from that path. The router improves itself over time without humans tuning it.

WinOur largest escalation category was tone_check on Hebrew replies. Haiku missed nuance. Moved it to Sonnet default and escalations dropped 90%.

מספרים אמיתיים: מה לפני ומה אחריReal numbers: what changed when we tiered

החודש הראשון שבו הפעלנו ניתוב לפי משימה הסתיים עם החיסכון הבא, על אותו עומס בדיוק (שבוע 14 לעומת שבוע 18, 2026):

נפח קריאות: 1.42M (לפני) vs 1.51M (אחרי) — עלייה קלה.
חלוקה: לפני — 92% Sonnet, 8% Opus. אחרי — 71% Haiku, 22% Sonnet, 7% Opus.
חשבון: $2,140 (לפני) vs $310 (אחרי). חיסכון של 85.5%.
איכות: ה-CSAT score לא זז (4.6 לפני ואחרי). שיעור ה-bounce על תשובות הסוכן עלה ב-0.4% — בתחום הרעש.

הדבר היחיד שהורגש לרעה היה זמן תגובה במקרי escalation — קריאה שנכשלה ב-Haiku, עברה ל-Sonnet, ולפעמים גם ל-Opus, יכולה לקחת 8-12 שניות במקום 3. הוספנו indicator של "הסוכן חושב על זה רגע" ל-UI כדי להפחית חיכוך, ושמרנו על אחוז escalation מתחת ל-9% כקו אדום.

The first full month after we routed by task ended with this delta on the same workload (week 14 vs week 18, 2026):

Call volume: 1.42M (before) vs 1.51M (after). Slight increase in usage.
Mix: before — 92% Sonnet, 8% Opus. After — 71% Haiku, 22% Sonnet, 7% Opus.
Bill: $2,140 (before) vs $310 (after). 85.5% reduction.
Quality: CSAT did not move (4.6 before and after). Bounce rate on agent replies went up 0.4% — within noise.

The one negative was tail latency on escalation cases. A call that fails on Haiku, retries on Sonnet, and occasionally climbs to Opus can take 8-12 seconds instead of 3. We added a "agent is thinking" indicator in the UI to soften that, and we hold the escalation rate under 9% as a hard line.

מתי לא לנתבWhen not to tier

ניתוב מודלים לא תמיד מנצח. סיטואציות שבהן עדיף לזרוק הכל על מודל אחד:

כשהוולידציה יקרה. אם בדיקת התוצאה דורשת בעצמה קריאה למודל, ה-escalation אוכל את החיסכון.
כשהאיכות חשובה יותר מהעלות. סוכן רפואי שמייעץ לרופאים? אל תחסוך. שמור על Opus לכל קריאה.
בפיילוט מוקדם. אופטימיזציה לפני שיש product-market fit היא הסחת דעת. הריצו על Sonnet, מדדו, ואז נתבו כשהעומס מצדיק זאת — לרוב מעל $500/חודש בחיובי API.
כשהעיקרון "אותה תשובה לכל המשתמשים" קריטי. ב-deterministic systems (compliance, חוזים), אי-עקביות בין שכבות יוצרת באגים שקשה לאתר.

שאלת המפתח לפני שמטמיעים: האם המשימה הזאת חוזרת מספיק כדי שהאופטימיזציה תחזיר את עלות הניתוח? אם לא — תשאירו על Sonnet, וזה בסדר.

Tiering does not always win. Cases where it is wiser to send everything to one model:

When validation is expensive. If checking the output itself requires a model call, escalation eats the savings.
When quality dominates cost. A medical advisor agent for clinicians? Do not economize. Stay on Opus per call.
Early in the pilot. Optimizing before product-market fit is a distraction. Run on Sonnet, measure, then route when load justifies it — typically past $500/month in API spend.
When "every user gets the same answer" matters. In deterministic domains (compliance, contracts), inconsistency between tiers creates bugs that are hard to trace.

The question to ask before tiering: does this task recur often enough that the optimization repays the analysis cost? If not — leave it on Sonnet. That is fine.