Mandatory Tools: why agents "lie" about execution

The hallucinated-done problem

The problem is not the agent doing something wrong. The problem is the agent doing nothing and saying "done". Before we added the gate this was our top failure mode: the builder accepts a request to ship a site, runs todo_write, and immediately emits end_turn with "site is ready".

In May 2025 we audited 412 consecutive builds. 23% of them claimed "ready" with no successful deploy behind the claim, and some of those had not called write_file even once. One agent finished in two turns that contained only todo_write and fetch_image, then declared "deployed" without writing a single file.
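
An audit of this kind can be approximated by replaying each build's tool log and flagging builds that claimed readiness without an ok deploy. The sketch below assumes a hypothetical record shape ({ id, claimedReady, calls }); the field names and the sample data are illustrative, not the real log schema or the real audit results.

```javascript
// Hypothetical build record: { id, claimedReady, calls: [{ name, result }] }.
function auditBuilds(builds) {
  const flagged = builds.filter(b => {
    if (!b.claimedReady) return false;
    // A claim is only valid if some deploy call returned ok: true.
    const deployedOk = b.calls.some(c => c.name === 'deploy' && c.result?.ok === true);
    return !deployedOk;
  });
  return {
    total: builds.length,
    flagged: flagged.map(b => b.id),
    rate: builds.length === 0 ? 0 : flagged.length / builds.length,
  };
}

// Made-up sample data, for illustration only.
const sampleBuilds = [
  { id: 'a', claimedReady: true, calls: [{ name: 'todo_write', result: {} }, { name: 'fetch_image', result: {} }] },
  { id: 'b', claimedReady: true, calls: [{ name: 'write_file', result: {} }, { name: 'deploy', result: { ok: true } }] },
  { id: 'c', claimedReady: true, calls: [{ name: 'deploy', result: { ok: false } }] },
  { id: 'd', claimedReady: false, calls: [] },
];

const auditReport = auditBuilds(sampleBuilds);
console.log(auditReport.flagged, auditReport.rate);
```

Build 'a' mirrors the two-turn case described above: planning and an image fetch, a "ready" claim, and no deploy at all.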

This is not a prompt problem. We added "you must deploy before saying done" three separate times, each iteration more emphatic. The agent agreed in text, then skipped it anyway. Verbal instruction is not coupled to execution. Only runtime enforcement closes the gap between told and done.

Note: This is not dishonesty. It is an optimization toward early end_turn, baked in when training rewarded fast completions. Enforcement is the only way to shift the equilibrium.

The checklist as a contract

The checklist is a contract between the task type and the tools that must be called before end_turn. Ours is builder-specific:

todo_write — at least once. Forces planning before action.
fetch_image — at least once. Confirms a hero image exists, or that one was deliberately skipped.
set_colors — once. Picks a palette based on the domain.
write_file — at least three times. A site with fewer than three files is not a site.
deploy — must return ok: true. Without a green deploy, there is no completion.

Order matters less than presence. Most agents run them in that order anyway, but the order is not required. What is required: every item must be checked before end_turn, otherwise the turn converts into a tool_result naming what is still missing.

Why not just deploy? Because a deploy can succeed against an empty site. We saw it: a git push of the default template, a green deploy, and a customer staring at a site with no name and no colors. The checklist verifies that real input reached deploy, not just that deploy ran.

Implementing the gate

The gate is middleware that sits between the model's response and the code that acts on its stop_reason. When stop_reason is end_turn and the content frames completion, we run the checker. On failure we convert the turn into a synthetic tool_result with an error message and let the loop continue.

const CHECKLIST = {
  builder: [
    { tool: 'todo_write',  min: 1 },
    { tool: 'fetch_image', min: 1 },
    { tool: 'set_colors',  min: 1 },
    { tool: 'write_file',  min: 3 },
    { tool: 'deploy',      min: 1, mustOk: true },
  ],
};

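// Returns the checklist entries that `history` does not yet satisfy.
// countToolUses (not shown) maps each tool name to its call count.
function missingFromChecklist(role, history) {
  const reqs = CHECKLIST[role] || [];   // roles without a checklist are never blocked
  const counts = countToolUses(history);
  return reqs.filter(r => {
    const hits = counts[r.tool] || 0;
    if (hits < r.min) return true;      // not called often enough
    if (r.mustOk) {
      // Called, but at least one call must have returned ok: true.
      return !history.some(h => h.name === r.tool && h.result?.ok === true);
    }
    return false;
  });
}

// Passes the turn through unless the agent is ending with unmet requirements.
async function gateEndTurn(turn, history, role) {
  if (turn.stop_reason !== 'end_turn') return turn;
  const missing = missingFromChecklist(role, history);
  if (missing.length === 0) return turn;
  return synthesizeBlock(missing);      // re-enters the loop instead of finishing
}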

The gateEndTurn call runs immediately after the LLM returns stop_reason: 'end_turn', before we tell the user the task is complete. On a miss we do not surface an error to the user — we hand the agent back to the loop with a message that names exactly what is missing.
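
The listing above calls countToolUses without showing it. Here is a minimal sketch of what such a helper might look like. It assumes history is a flat array of { name, result } entries, the same shape the mustOk check reads. That shape is an assumption, not the actual implementation.

```javascript
// Assumed history entry shape: { name: string, result?: object }.
function countToolUses(history) {
  const counts = {};
  for (const entry of history) {
    // Skip anything that is not a tool call (e.g. plain text entries).
    if (!entry || typeof entry.name !== 'string') continue;
    counts[entry.name] = (counts[entry.name] || 0) + 1;
  }
  return counts;
}

// Calls deliberately listed out of the usual order.
const sampleHistory = [
  { name: 'write_file', result: { ok: true } },
  { name: 'todo_write', result: { ok: true } },
  { name: 'write_file', result: { ok: true } },
];

const sampleCounts = countToolUses(sampleHistory);
console.log(sampleCounts);
```

Because the helper only counts, the order of the calls never affects the result, which is consistent with the rule that presence matters more than order.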

Error messages that teach

The error message is a new prompt. Write "checklist failed" and the agent will come back with a second end_turn that says "fine, done without the checklist". The message must say exactly what is missing and what to do next.

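// Builds the synthetic message that sends the agent back into the loop.
function synthesizeBlock(missing) {
  // One short, actionable phrase per unmet requirement.
  const items = missing.map(m => {
    if (m.mustOk) return `${m.tool} must succeed (ok:true)`;
    return `${m.tool} called at least ${m.min} time(s)`;
  }).join('; ');

  return {
    role: 'user',
    content: [{
      type: 'tool_result',
      // Synthetic id: no real tool_use carries it. If your API validates
      // tool_use_id against the previous assistant turn, send the same
      // text as a plain user message instead.
      tool_use_id: 'gate_check',
      is_error: true,
      content:
        `You declared completion, but the deliverable is not verified. ` +
        `Required and not yet satisfied: ${items}. ` +
        `Run the missing tools, then end_turn.`,
    }],
  };
}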

Warning: Do not list the tools that did run. The agent will summarize them and re-declare done. List only what is missing.
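
To make the missing-only rule concrete, the sketch below renders the feedback for the two-turn case from the audit, where the agent ran only todo_write and fetch_image. The gateText helper and the calls array are hypothetical. The requirements are copied from the builder checklist.

```javascript
// Requirements copied from the builder checklist in this article.
const BUILDER_REQS = [
  { tool: 'todo_write', min: 1 },
  { tool: 'fetch_image', min: 1 },
  { tool: 'set_colors', min: 1 },
  { tool: 'write_file', min: 3 },
  { tool: 'deploy', min: 1, mustOk: true },
];

// Hypothetical helper: render only the unmet requirements as text.
function gateText(reqs, calls) {
  const unmet = reqs.filter(r => {
    const hits = calls.filter(c => c.name === r.tool);
    if (hits.length < r.min) return true;
    return r.mustOk ? !hits.some(c => c.result?.ok === true) : false;
  });
  const items = unmet.map(r =>
    r.mustOk
      ? `${r.tool} must succeed (ok:true)`
      : `${r.tool} called at least ${r.min} time(s)`
  ).join('; ');
  return `Required and not yet satisfied: ${items}.`;
}

// The agent planned and fetched an image, then claimed completion.
const earlyExitCalls = [
  { name: 'todo_write', result: {} },
  { name: 'fetch_image', result: {} },
];

const feedbackText = gateText(BUILDER_REQS, earlyExitCalls);
console.log(feedbackText);
```

The rendered text names set_colors, write_file and deploy and says nothing about the two tools that already ran, so the agent has nothing to summarize into a second "done".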

The first week with the gate enabled, invalid completions dropped from 23% to 2.4%. The remaining 2.4% were mostly real deploy failures — a different bug, an infrastructure one, not a hallucination.

The false-negative tradeoff

The gate is false-negative-friendly: it would rather block a real completion than admit a hallucinated one. That bias has a cost, because there are legitimate requests where the builder checklist fires when it should not:

Minor edit — "change the button color". No new fetch_image, no three write_files. Just a change and a deploy.
Information request — "how many pages does the site have?". No deliverable, only an answer.
Task decomposition — an agent doing planning only, returning a plan that needs human approval before execution.

The fix: do not enforce the same checklist on every role. Builder has one; Q&A does not; planner does not. The router decides which role the current agent is playing, and the gate applies that role's checklist, if any.

const role = inferRoleFromContext(messages);
// roles: 'builder' | 'editor' | 'qa' | 'planner'
const gated = await gateEndTurn(turn, history, role);

For edge cases, e.g. "add one sentence to the page", we switch to an editor role that requires only write_file.min = 1 and deploy.mustOk. A weaker gate beats no gate: with no gate at all, even a one-line edit can be declared done without being made.
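
Put together, the per-role setup might look like the sketch below. The builder and editor entries come from the text above. The empty lists for qa and planner, and the requirementsFor helper, are one assumed way to encode "no checklist".

```javascript
// Per-role requirements. An empty list means the gate never blocks that role.
const CHECKLIST_BY_ROLE = {
  builder: [
    { tool: 'todo_write', min: 1 },
    { tool: 'fetch_image', min: 1 },
    { tool: 'set_colors', min: 1 },
    { tool: 'write_file', min: 3 },
    { tool: 'deploy', min: 1, mustOk: true },
  ],
  editor: [
    { tool: 'write_file', min: 1 },
    { tool: 'deploy', min: 1, mustOk: true },
  ],
  qa: [],
  planner: [],
};

// Hypothetical lookup. The own-property test keeps inherited keys such as
// 'constructor' from being mistaken for a role.
function requirementsFor(role) {
  return Object.prototype.hasOwnProperty.call(CHECKLIST_BY_ROLE, role)
    ? CHECKLIST_BY_ROLE[role]
    : [];
}

console.log(requirementsFor('editor').map(r => r.tool));
console.log(requirementsFor('qa').length);
```

Falling back to an empty list leaves an unrecognized role ungated. A stricter setup could default unknown roles to the builder checklist instead; which default is safer depends on how reliable the router is.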

When NOT to enable the gate

The gate is not a panacea. Two ways to misuse it:

When not to enable

Conversational agents. If the entire turn is an answer to the user with no deliverable, there is nothing to enforce.
Plan-then-act stages. The plan stage does not complete a deliverable. It completes a plan. The checklist applies only to the act stage.
Agents already blocked. If a task requires a tool the user revoked (no deploy permission), the gate creates an infinite loop.
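
One way to catch the blocked-agent case is to compare the checklist against the tools the agent may actually call before the gate ever runs, and fail fast when a required tool is unavailable. This is a sketch, assuming a hypothetical allowedTools set supplied by whatever layer manages permissions.

```javascript
// Returns the names of required tools the agent is not permitted to call.
function unreachableRequirements(reqs, allowedTools) {
  return reqs.filter(r => !allowedTools.has(r.tool)).map(r => r.tool);
}

const editorReqs = [
  { tool: 'write_file', min: 1 },
  { tool: 'deploy', min: 1, mustOk: true },
];

// Hypothetical permission set: deploy has been revoked by the user.
const allowedTools = new Set(['todo_write', 'write_file']);

const blockedTools = unreachableRequirements(editorReqs, allowedTools);
if (blockedTools.length > 0) {
  // Report to the user up front instead of letting the gate loop.
  console.log(`Cannot complete this task: ${blockedTools.join(', ')} is not permitted`);
}
```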

How not to configure

Do not list a tool that can fail for external reasons (third-party rate limit). The agent will retry it forever, burning context.
Do not set min: 5 when the average is 3. You force imaginary work — the agent will call write_file twice on the same file just to satisfy the count.
Do not forget a retry limit. If the agent is stuck in a loop of "missing X" → "still missing X", abort after 3 attempts and surface a real error to the user.

// gateRetries counts failed gate checks for this task.
if (gateRetries >= 3) {
  throw new TaskAbort('gate_loop',
    `Agent failed checklist 3 times. Missing: ${missing.map(m => m.tool).join(',')}`);
}
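
For context, here is a sketch of where that retry counter could live in the agent loop. Everything here is illustrative: the TaskAbort class, the runWithGate driver, and the step and check callbacks are hypothetical stand-ins for the model call and the gate. The loop is synchronous for brevity, whereas a real model call would be awaited.

```javascript
// Hypothetical error type carrying a machine-readable code.
class TaskAbort extends Error {
  constructor(code, message) {
    super(message);
    this.name = 'TaskAbort';
    this.code = code;
  }
}

// step(feedback) returns the agent's next turn.
// check(turn) returns the names of tools still missing.
function runWithGate(step, check, maxRetries = 3) {
  let gateRetries = 0;
  let feedback = null;
  for (;;) {
    const turn = step(feedback);
    const missing = check(turn);
    if (missing.length === 0) return turn; // gate passed, the task is done
    gateRetries += 1;
    if (gateRetries >= maxRetries) {
      throw new TaskAbort('gate_loop',
        `Agent failed checklist ${gateRetries} times. Missing: ${missing.join(',')}`);
    }
    // Feed only the missing items back to the agent.
    feedback = `Required and not yet satisfied: ${missing.join('; ')}.`;
  }
}

// Stub agent that never deploys, to exercise the abort path.
let attempts = 0;
let aborted = null;
try {
  runWithGate(() => { attempts += 1; return { text: 'done' }; }, () => ['deploy']);
} catch (err) {
  aborted = err;
}
console.log(attempts, aborted.code);
```

Resetting the counter per task, rather than per process, keeps one stubborn task from using up the retry budget of the next one.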

The gate is the right fix for hallucinated completion, but it demands precise per-role configuration. A blanket setup turns it from a quality bar into a roadblock.