Schema Overload: more tools, less precision

TL;DR

When large models exceed the schema budget, they stop emitting native tool_use and start dumping JSON inside plain text. For us, Opus 4.6 over OAuth flipped this way once we passed 40 tools; at 12 it stayed in protocol. The fix is tool tiering (core/build/ops) plus a recovery parser that detects JSON-as-text and converts it to a synthetic tool_use event.

The symptom: JSON pasted inside text

We built a builder with 47 tools. A todo list, file writes, deploys, image fetching, colors, fonts, OG tags, and more. Each tool had a short, clear schema. This was supposed to be straightforward. On day one in production we saw a new trick: the model wasn't calling the tools. It was describing the call.

{
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Now I will write the file.\n\n```json\n{\n  \"tool\": \"write_file\",\n  \"path\": \"index.html\",\n  \"content\": \"<!doctype html>...\"\n}\n```"
    }
  ]
}

No tool_use block. No input. Just text. The runtime, which waits for type:"tool_use", sees type:"text", ends the turn, and ships the raw text to the browser. The user reads "Now I will write the file..." and waits. Nothing was actually written.

In the logs this appeared as a wave of stop_reason: "end_turn" responses with no tool_use block before them, even though the prompt explicitly says to use tools. The volume jumped the moment we crossed 40 tools and dropped when we kept the list small.

Why it happens: the schema budget

There is no magic number in the API, but there is a practical limit: every schema you send takes up context, competes for attention, and, as far as we can tell, eats into whatever capacity the model has for planning. When 40 schemas arrive, some models (especially Opus 4.6 over OAuth) stop treating tool_use as the path and fall back to a pattern that is more common in training data: "let me write JSON inside an explanation".

This isn't a bug, and it isn't lack of capability. It's the behavior of an overloaded model. The evidence: with the same prompt, same model, same user, drop the tool count to 12 and it returns to protocol. Push it to 50 and it slips. Our empirical thresholds:

≤12 tools: 99%+ correct tool_use
13-25 tools: 92-98%
26-40 tools: 70-90%, very prompt-sensitive
40+ tools: ≤60%, recovering on some turns and failing on others

The OAuth vs raw API key difference is not negligible. Under OAuth we saw the threshold drop earlier. We suspect injected pre-prompts contribute, but we couldn't measure that directly.
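
The bands above can be encoded as a small lookup, for example to flag a tool configuration before it ships. This is a minimal sketch: the names `TOOL_USE_BANDS` and `expectedToolUseBand` are hypothetical, and the numbers are only the empirical ranges quoted above for this one deployment, not limits of the API.

```javascript
// Empirical bands quoted above (one deployment, not an API limit).
// min/max are the observed share of turns with correct native tool_use.
// The lower bound for the last band was not reported, so it is left null.
const TOOL_USE_BANDS = [
  { maxTools: 12, min: 0.99, max: 1.0 },
  { maxTools: 25, min: 0.92, max: 0.98 },
  { maxTools: 40, min: 0.70, max: 0.90 },
  { maxTools: Infinity, min: null, max: 0.60 },
];

// Hypothetical helper: return the band a given active-tool count falls into.
function expectedToolUseBand(toolCount) {
  if (!Number.isInteger(toolCount) || toolCount < 0) {
    throw new RangeError('toolCount must be a non-negative integer');
  }
  return TOOL_USE_BANDS.find(band => toolCount <= band.maxTools);
}
```

Calling `expectedToolUseBand(47)` returns the last band, which is the range we were operating in before tiering.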

First answer: tool tiering (core / build / ops)

Instead of shipping 47 tools every turn, we split them into three groups and ship only the groups the current phase needs. The router picks the groups from the session state and the previous turn's stop_reason. Our groups:

core (5 tools): todo_write, todo_complete, ask_user, read_session, finish_turn. Always loaded.
build (8 tools): fetch_image, set_colors, set_fonts, write_file, list_files, read_file, delete_file, deploy.
ops (6 tools): fetch_url_live, screenshot, check_broken_images, regen_og, set_meta, publish.

function getToolsFor(session) {
  // Core tools are always loaded.
  const tools = [...CORE_TOOLS];
  // At most one phase tier is added on top, never both.
  if (session.phase === 'building') tools.push(...BUILD_TOOLS);
  if (session.phase === 'verifying') tools.push(...OPS_TOOLS);
  return tools;
}

Each combined set is ≤13 tools. core+build = 13. core+ops = 11. core+build+ops never happens — no turn needs all of it. Phase transitions are explicit: todo_complete("build") moves us to verifying.
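
The size ceiling and the explicit transition can be sketched together. The tool names are the ones listed above; `MAX_ACTIVE_TOOLS`, `toolNamesFor`, `nextPhase`, and the `{ id: 'build' }` input shape for todo_complete are assumptions made for illustration, not our production router.

```javascript
const CORE_TOOLS = ['todo_write', 'todo_complete', 'ask_user', 'read_session', 'finish_turn'];
const BUILD_TOOLS = ['fetch_image', 'set_colors', 'set_fonts', 'write_file',
  'list_files', 'read_file', 'delete_file', 'deploy'];
const OPS_TOOLS = ['fetch_url_live', 'screenshot', 'check_broken_images',
  'regen_og', 'set_meta', 'publish'];
const MAX_ACTIVE_TOOLS = 13; // assumed ceiling, matching core+build

// Names of the tools active in a phase; throws if a tier edit breaks the ceiling.
function toolNamesFor(phase) {
  const names = [...CORE_TOOLS];
  if (phase === 'building') names.push(...BUILD_TOOLS);
  if (phase === 'verifying') names.push(...OPS_TOOLS);
  if (names.length > MAX_ACTIVE_TOOLS) {
    throw new Error(`${names.length} active tools in phase "${phase}"`);
  }
  return names;
}

// Only an explicit todo_complete for the build todo advances the phase.
// Assumes the tool input looks like { id: 'build' }.
function nextPhase(phase, toolCall) {
  const completesBuild = toolCall.name === 'todo_complete'
    && toolCall.input != null
    && toolCall.input.id === 'build';
  return phase === 'building' && completesBuild ? 'verifying' : phase;
}
```

Running the ceiling check wherever tiers are edited makes a 14th tool fail loudly instead of quietly eroding tool_use_rate.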

One more touch: when the model calls finish_turn we don't finish immediately. We first verify that a done todo exists, that at least one write_file fired if this was a build phase, and that a final deploy ran. Mandatory tool enforcement is the same principle: no "done" without a checklist of tool calls in the trail.
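
A minimal sketch of that checklist follows. The session shape (`phase`, `todos`, `trail`) and the function name `checkFinish` are hypothetical, and "a final deploy" is interpreted here as a deploy call that comes after the last write_file.

```javascript
// Assumed session shape:
//   phase: 'building' | 'verifying' | ...
//   todos: [{ id, status }]
//   trail: [{ name, input }]  (tool calls made so far, oldest first)
function checkFinish(session) {
  const missing = [];
  const calls = session.trail.map(call => call.name);

  if (!session.todos.some(todo => todo.status === 'done')) {
    missing.push('a todo marked done');
  }
  if (session.phase === 'building' && !calls.includes('write_file')) {
    missing.push('at least one write_file call');
  }
  // Interpretation: the deploy must not be older than the last file write.
  const lastWrite = calls.lastIndexOf('write_file');
  const lastDeploy = calls.lastIndexOf('deploy');
  if (lastDeploy === -1 || lastDeploy < lastWrite) {
    missing.push('a deploy after the last write_file');
  }
  return { ok: missing.length === 0, missing };
}
```

When `ok` is false, the `missing` list can be returned to the model as the finish_turn result, so it knows which calls are still owed.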

Detecting schema overload in production

If you aren't measuring, you don't know when this started. Three metrics we watch:

tool_use_rate: the share of turns that ended with stop_reason="tool_use" out of those that should have (building phase, say). Below 95% — alarm.
json_in_text_rate: regex over text content that looks for fenced code blocks (tagged json or untagged) containing "tool":. Any hit is a red flag.
fence-only stops: turns that end with end_turn while the text trails off inside an open code block. Almost always means the model "wrote" a tool call into prose.

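function detectJsonAsText(content) {
  // Join all text blocks: the described call is not always in the first one.
  const text = content.filter(b => b.type === 'text').map(b => b.text).join('\n');
  // Fenced blocks, with or without a "json" language tag.
  const blocks = text.match(/```(?:json)?\s*\n([\s\S]+?)\n```/g) || [];
  // Keep only fences that look like a described tool call.
  return blocks.filter(b => /"tool"\s*:/.test(b));
}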
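
The third metric, fence-only stops, can be detected in a similar way. This sketch assumes that an odd number of triple-backtick markers means the text ends inside an open code block; the function names are hypothetical, and the message shape (`stop_reason`, `content`) follows the assistant message shown earlier.

```javascript
// True when the text contains an unclosed ``` fence (odd number of markers).
function endsInsideOpenFence(text) {
  const markers = text.match(/```/g) || [];
  return markers.length % 2 === 1;
}

// A fence-only stop: the turn ended normally, but the prose trails off in a code block.
function isFenceOnlyStop(message) {
  if (message.stop_reason !== 'end_turn') return false;
  const text = message.content
    .filter(block => block.type === 'text')
    .map(block => block.text)
    .join('\n');
  return endsInsideOpenFence(text);
}
```

The parity check is deliberately crude: nested or inline triple backticks will confuse it, which is acceptable for a rate metric but not for anything that executes tools.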

We log all three metrics per turn, group by tool_count in context, and plot a heatmap. The moment the green line (tool_use_rate) drops and the red line (json_in_text_rate) rises, crossing at x=40, the overload is visible to the eye.
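
A sketch of the grouping step, assuming each logged turn is a record like `{ toolCount, stopReason, jsonInText }` and that the input has already been filtered to turns that should have ended in a tool call. The record shape and the name `ratesByToolCount` are hypothetical; the 95% alarm line is the one stated above.

```javascript
// Aggregate per-turn logs into rates, one row per active-tool count.
function ratesByToolCount(turns) {
  const buckets = new Map();
  for (const turn of turns) {
    const bucket = buckets.get(turn.toolCount) || { total: 0, toolUse: 0, jsonInText: 0 };
    bucket.total += 1;
    if (turn.stopReason === 'tool_use') bucket.toolUse += 1;
    if (turn.jsonInText) bucket.jsonInText += 1;
    buckets.set(turn.toolCount, bucket);
  }
  return [...buckets.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([toolCount, bucket]) => {
      const toolUseRate = bucket.toolUse / bucket.total;
      return {
        toolCount,
        tool_use_rate: toolUseRate,
        json_in_text_rate: bucket.jsonInText / bucket.total,
        alarm: toolUseRate < 0.95, // alarm threshold from the text
      };
    });
}
```

Each row is one x position on the plot, so the crossing point can be read straight off the output.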

Note: This is not exclusive to one model. We saw the same pattern at different intensities across all major models. The threshold shifts. The behavior doesn't.

The recovery patch: parse JSON-as-text into a synthetic tool_use

Even with tiering, a small slice of turns still slips into JSON-as-text. Instead of dropping the turn, we have a recovery layer. If the model emitted a JSON block with a tool field, we convert it to a synthetic tool_use block, run the tool, and send the result back as if it had been native protocol.

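function recoverToolUse(assistantMessage, validTools) {
  // Collect all text blocks; the described call can appear in any of them.
  const text = assistantMessage.content
    .filter(b => b.type === 'text')
    .map(b => b.text)
    .join('\n');
  // First fenced block, with or without a "json" tag.
  const fence = text.match(/```(?:json)?\s*\n([\s\S]+?)\n```/);
  if (!fence) return null;
  let parsed;
  try { parsed = JSON.parse(fence[1]); } catch { return null; }
  // JSON.parse can return null, arrays, or primitives; only plain objects qualify.
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) return null;
  // Refuse names outside the current tier, including invented ones.
  if (typeof parsed.tool !== 'string' || !validTools.has(parsed.tool)) return null;
  // The tool name is routing metadata, not an argument: keep it out of input.
  const { tool, ...input } = parsed;
  return {
    type: 'tool_use',
    id: `synthetic_${Date.now()}`,
    name: tool,
    input
  };
}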

Two values matter: the id is prefixed with synthetic_ so we can measure how many turns were rescued this way, and validTools is the set of names in the current tier — if the model "invented" a name, we refuse to call it.
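
Once a synthetic block exists and the tool has run, the conversation has to be repaired so the next request looks like native protocol. A sketch, assuming the Messages API shape in which a tool_use block in the assistant turn is answered by a tool_result block with a matching tool_use_id in the next user turn. `appendRecoveredCall` is a hypothetical helper, and keeping the original text block in place is a choice made here, not a requirement.

```javascript
// Returns a new history with the repaired assistant turn and the tool result appended.
// Nothing passed in is mutated.
function appendRecoveredCall(history, assistantMessage, syntheticBlock, toolOutput, isError = false) {
  const repairedAssistant = {
    role: 'assistant',
    // Keep the model's prose and add the tool_use block it should have emitted.
    content: [...assistantMessage.content, syntheticBlock],
  };
  const toolResult = {
    role: 'user',
    content: [{
      type: 'tool_result',
      tool_use_id: syntheticBlock.id, // must match the synthetic id
      content: String(toolOutput),
      is_error: isError,
    }],
  };
  return [...history, repairedAssistant, toolResult];
}
```

The returned array replaces the history for the next request, and the synthetic_ prefix on the id is what lets the logs tell rescued turns from native ones.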

The parser saves about 3% of production turns. Not huge, but in those 3% the user would have seen a stall. We log each recovery with recovered=true and count — if it crosses 5%, the threshold has shifted and tiering needs another look.
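
The 5% tripwire is a one-liner over the same per-turn logs. A sketch, assuming each logged turn carries the recovered flag mentioned above; `recoveryRate` and the `revisitTiering` field are hypothetical names.

```javascript
// Share of turns rescued by the recovery parser, with the 5% tripwire from the text.
function recoveryRate(turns, threshold = 0.05) {
  if (turns.length === 0) return { rate: 0, revisitTiering: false };
  const recovered = turns.filter(turn => turn.recovered === true).length;
  const rate = recovered / turns.length;
  return { rate, revisitTiering: rate > threshold };
}
```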

Pitfalls to know about up front

Three mistakes we made on the way that are worth skipping:

Recovery as a crutch: once we had the parser, we were tempted to add more tools because "we have a safety net". That dragged tool_use_rate down. Use recovery to handle the 3% edge, not to break the ceiling.
Mid-turn tier swaps: we tried sending build+ops together inside a long turn. The model got confused about which tool to call and occasionally invoked deploy before there was anything to deploy. Tier transitions happen between turns only.
One overloaded tool: a single tool with 22 optional fields costs as much as 3-4 ordinary tools. We split build_and_deploy into build and deploy the moment we noticed field count alone was loading the model.

What worked: our running rule is 12 active tools per turn, each with ≤8 fields, descriptions under 200 words, and the recovery parser as a last-resort fallback. Over the last 30 days, our tool_use_rate has been 99.1%.
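
The running rule is easy to enforce mechanically. A sketch of a lint pass, assuming tool definitions shaped like `{ name, description, input_schema: { properties } }`; the `LIMITS` object and `lintToolSet` are hypothetical names, and the limits are the ones stated above.

```javascript
const LIMITS = { maxTools: 12, maxFields: 8, maxDescriptionWords: 200 };

// Returns a list of human-readable violations; an empty list means the set passes.
function lintToolSet(tools, limits = LIMITS) {
  const problems = [];
  if (tools.length > limits.maxTools) {
    problems.push(`${tools.length} active tools (limit ${limits.maxTools})`);
  }
  for (const tool of tools) {
    const fields = Object.keys(tool.input_schema?.properties ?? {}).length;
    if (fields > limits.maxFields) {
      problems.push(`${tool.name}: ${fields} fields (limit ${limits.maxFields})`);
    }
    const words = (tool.description ?? '').trim().split(/\s+/).filter(Boolean).length;
    // "Under 200 words" means 200 itself is already a violation.
    if (words >= limits.maxDescriptionWords) {
      problems.push(`${tool.name}: ${words}-word description (must be under ${limits.maxDescriptionWords})`);
    }
  }
  return problems;
}
```

Run against the list built for each phase, this catches both a tier that has grown too large and a single tool that has quietly accumulated optional fields.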

Your instinct when a model "insists" on writing JSON inside text will be to harden the prompt. Don't. It won't help. Drop tools.