Eval Gates: before you say "done"

Why DoD must be machine-checkable

An agent that closes a turn by saying "done" is an unsolved problem. Without a gate that forces verification, it invents successes. We saw it in our own dogfooding: in an older build flow, a meaningful share of "completed actions" failed the e2e that ran a minute later. The model was not lying on purpose — it just did not know what it did not know.

The fix is to make Definition of Done an API the agent must call before declaring completion:

Status: gates passed / failed / not applicable.
On failure: full stdout of the failing gate.
On pass: a run hash for the audit log.

The agent cannot close a turn with stop_reason: "end_turn" unless the final gate returned passed: true. Enforcement happens in our router, not in the model's hands.
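As a rough illustration of the three response shapes listed above, a DoD result could be assembled like this. The helper name `summarizeDoD`, the field names (`status`, `failures`, `runHash`), and the choice of SHA-256 are assumptions made for this sketch, not the actual contract of our router.

```javascript
const { createHash } = require("crypto");

// Hypothetical helper: turn raw gate results into the status object the agent sees.
// `results` is an array of { name, passed, out }; an empty array means no gate applied.
function summarizeDoD(results) {
  if (results.length === 0) {
    // Assumption: a run where no gate applies counts as passed.
    return { status: "not_applicable", passed: true };
  }
  const failed = results.filter(r => !r.passed);
  if (failed.length > 0) {
    // On failure, hand back the full output of every failing gate.
    return {
      status: "failed",
      passed: false,
      failures: failed.map(r => ({ name: r.name, output: r.out })),
    };
  }
  // On pass, a hash of the results gives the audit log something stable to record.
  const runHash = createHash("sha256").update(JSON.stringify(results)).digest("hex");
  return { status: "passed", passed: true, runHash };
}
```

The point of the shape is that the router only ever has to look at `passed`; everything else is there for the model (failure output) or for humans (the hash).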

The three gates we run

Our DoD is built from three gates, each conditional on the scope of the change:

node --check on every touched JS file. Catches syntax errors before they reach production. ~80ms per file. Always runs when a JS file was touched.
fix-loop (npm run fix-loop): a local suite of about 40 fast checks running 30-90 seconds total. Lint + unit tests + smoke contracts. Always runs.
e2e (npm run e2e): full suite that triggers a live build in an isolated sandbox. ~6-8 minutes. Runs only when sandbox/, build/, or agent/trading/ was touched.

One file — tools/gate-runner.js — drives all three. It inspects which directories were touched and runs only what is required. That separation is important: if the only available gate were the full e2e, the agent would never want to run it for a docs change and would route around DoD with "I know it is just markdown".

A tiny gate runner

The core runner is short. It takes the list of files touched in this turn, infers which gates apply, runs them sequentially, and returns a structured result.

const { execFileSync } = require("child_process");
const path = require("path");

// Map the touched file paths to the set of gates that must run.
function touchedScopes(files) {
  const scopes = new Set();
  for (const f of files) {
    if (f.endsWith(".js")) scopes.add("js");
    // Only these directories trigger the full e2e suite.
    if (f.startsWith("sandbox/") || f.startsWith("build/") ||
        f.startsWith("agent/trading/")) scopes.add("e2e");
  }
  scopes.add("fix"); // fix-loop always runs
  return scopes;
}

// Run one gate synchronously and return a structured result.
function runGate(name, cmd, args, opts = {}) {
  const start = Date.now();
  try {
    const out = execFileSync(cmd, args, { encoding: "utf8", ...opts });
    // On success, keep only the last 2000 characters of output.
    return { name, passed: true, ms: Date.now() - start, out: out.slice(-2000) };
  } catch (e) {
    // Non-zero exit or spawn failure: return everything the tool printed,
    // falling back to the error message if it printed nothing.
    return { name, passed: false, ms: Date.now() - start,
             out: ((e.stdout || "") + (e.stderr || "")) || e.message };
  }
}

// Declared async so callers get a Promise; the gates themselves run
// synchronously, one after another, cheapest first.
async function runGates(touched) {
  const scopes = touchedScopes(touched);
  const results = [];
  if (scopes.has("js")) {
    // One node --check result per touched JS file.
    for (const f of touched.filter(x => x.endsWith(".js"))) {
      results.push(runGate(`node-check:${path.basename(f)}`, "node", ["--check", f]));
    }
  }
  if (scopes.has("fix")) results.push(runGate("fix-loop", "npm", ["run", "fix-loop"]));
  if (scopes.has("e2e")) results.push(runGate("e2e", "npm", ["run", "e2e"]));
  // The run passes only if every gate that ran passed.
  const passed = results.every(r => r.passed);
  return { passed, results, ranAt: new Date().toISOString() };
}

module.exports = { runGates };

The agent calls runGates(touched) as a tool. If passed: false, the router rejects any attempt to close the turn with the message "DoD failed: see results.{name}.out". The model learns very fast that there is no shortcut.
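To make the rejection concrete, here is one way the router could turn a failed `runGates` result into the text it sends back to the model. `buildRejection` and its `maxChars` cap are hypothetical choices for this sketch; only the result shape (`passed`, `results[].name`, `results[].out`) comes from the runner.

```javascript
// Hypothetical helper: format a failed runGates() result for the model.
// Returns null when there is nothing to reject.
function buildRejection(gateResult, maxChars = 2000) {
  if (gateResult.passed) return null;
  const sections = gateResult.results
    .filter(r => !r.passed)
    // Keep the tail of each failing gate's output, capped at maxChars.
    .map(r => `DoD failed: ${r.name}\n${r.out.slice(-maxChars)}`);
  return sections.join("\n\n");
}
```

Naming the failing gate on the first line keeps the message useful even when the output itself is noisy.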

Enforcement: the gate is a lock on end_turn

The important part: DoD does not live inside the system prompt. It lives in the router. When the model returns stop_reason: "end_turn", the router checks:

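// Decide whether the router may accept end_turn for this session.
// Note: Array.prototype.findLast needs Node 18 or newer.
function canCloseTurn(session) {
  // Most recent gate run recorded in the session, if any.
  const lastGate = session.events.findLast(e => e.type === "gate_run");
  if (!lastGate) {
    return { ok: false, reason: "no gate run this turn: call runGates() first" };
  }
  if (!lastGate.result.passed) {
    // Name every failing gate so the model knows where to look.
    return { ok: false, reason: `gate failed: ${
      lastGate.result.results.filter(r => !r.passed).map(r => r.name).join(", ")}` };
  }
  // A pass older than five minutes may predate later edits, so require a re-run.
  if (Date.now() - new Date(lastGate.result.ranAt).getTime() > 5 * 60_000) {
    return { ok: false, reason: "stale gate run (older than 5 min): re-run" };
  }
  return { ok: true };
}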

Three conditions: somebody ran the gates, the gates passed, and the run was recent. If any of them fails, the router injects a synthetic tool_result message with the reason and lets the model keep running instead of closing. It feels like a compiler feedback loop: you cannot "ship" with errors open.
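A minimal sketch of that injection step might look like the following. The hook name `onModelResponse`, the `action` return values, and the shape of the synthetic event are all assumptions for illustration; the check function is passed in so the sketch stays self-contained.

```javascript
// Hypothetical router hook, called each time the model returns a response.
// `canCloseTurn` is injected; it must return { ok, reason }.
function onModelResponse(session, response, canCloseTurn) {
  if (response.stop_reason !== "end_turn") {
    return { action: "continue" }; // the model is still working
  }
  const verdict = canCloseTurn(session);
  if (verdict.ok) {
    return { action: "close" };
  }
  // Refuse to close: record a synthetic tool_result so the model sees why.
  session.events.push({
    type: "tool_result",
    synthetic: true,
    content: `DoD check refused end_turn: ${verdict.reason}`,
  });
  return { action: "continue" };
}
```

Because the decision is made outside the model, no wording in the prompt can talk the router into closing the turn.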

One discipline: do not give the agent an escape flag like force_close: true. If you open the door, one day a model walks through it. Instead, when human intervention is genuinely needed, expose a request_human_override(reason) tool that posts to Slack and pauses the agent until a human approves manually.
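One possible shape for such a tool is sketched below. `createOverrideTool`, `decide`, and the injected `notify` callback are hypothetical; `notify` stands in for whatever actually posts to Slack, so the sketch makes no claim about a real Slack API.

```javascript
// Hypothetical override tool with a human-only approval path.
function createOverrideTool(notify) {
  const pending = new Map(); // request id -> resolver for the waiting agent
  let nextId = 1;

  // Called by the agent. The returned promise settles only after a human decides.
  function requestHumanOverride(reason) {
    const id = nextId++;
    notify({ id, reason });
    return new Promise(resolve => pending.set(id, resolve));
  }

  // Called from the human side (for example a button handler), never by the agent.
  function decide(id, approved, approver) {
    const resolve = pending.get(id);
    if (!resolve) return false; // unknown or already decided
    pending.delete(id);
    resolve({ approved, approver, decidedAt: new Date().toISOString() });
    return true;
  }

  return { requestHumanOverride, decide };
}
```

The agent only ever receives `requestHumanOverride`; keeping `decide` out of its tool list is what makes the override audited rather than self-service.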

The Sunday cron: a weekly safety net

Per-turn gates catch regressions the agent itself just introduced. They do not catch regressions that were already there but never tripped a guard. For that we run a weekly cron at 0 5 * * 0 in UTC, which is Sunday 08:00 Israel time during daylight saving time (07:00 in winter, because a UTC schedule does not follow the local clock). It runs e2e-builder against all critical paths and posts the result into the Sunday morning digest.

# crontab
0 5 * * 0  cd /opt/hive && node tools/e2e-builder.js --full --report

# tools/e2e-builder.js does:
# 1. spin up a fresh sandbox container
# 2. run 12 canonical "build a site" scenarios end-to-end
# 3. take a live screenshot of each result and run critic loop
# 4. write summary row to weekly_digest table
# 5. emit a webhook to the morning digest if any scenario failed

The run takes 35-50 minutes, so the report lands in Slack roughly 50 minutes after the cron fires (around 08:50 Israel time when the job starts at 08:00). It shows green/red per scenario and a diff against the prior week: median build time, critic-pass rate, and retry count.
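The week-over-week diff can be as simple as subtracting one summary row from another. In this sketch the function name `diffMetrics`, the metric keys, and the sample numbers are all made up for illustration; they are not taken from the real weekly_digest table.

```javascript
// Hypothetical digest helper: compare this week's metrics with last week's.
// Only keys that are numeric in both rows are included.
function diffMetrics(prev, curr) {
  const diff = {};
  for (const key of Object.keys(curr)) {
    if (typeof prev[key] !== "number" || typeof curr[key] !== "number") continue;
    diff[key] = { prev: prev[key], curr: curr[key], delta: curr[key] - prev[key] };
  }
  return diff;
}

// Illustrative values only.
const lastWeek = { medianBuildMs: 41000, criticPassRate: 0.92, retries: 7 };
const thisWeek = { medianBuildMs: 45500, criticPassRate: 0.92, retries: 4 };
console.log(diffMetrics(lastWeek, thisWeek));
```

Reporting the delta next to both raw values lets a reader see at a glance whether a red scenario came with a broader slowdown.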

Note: The cron does not replace the per-turn DoD. It is a safety net for things nobody touched: an external dependency that shifted, a certificate that renewed, a node version that bumped. If the cron is red on a week with no deploys, the world changed, not your code.

Pitfalls: slow gates, false positives, and escape hatches

Gates do more harm than good when they are built wrong. Five things to watch:

A slow gate is a gate that will not run. The agent will burn six minutes on e2e once and stop calling it. Keep fix-loop under 90 seconds. Reserve e2e for changes that require it.
A flaky test is a dead gate. A check that passes 95% of the time trains the agent (and humans) to ignore failures. One reliable test beats ten flaky ones. Run fix-loop 50 times in CI before it enters DoD; if it flakes, fix it or remove it.
Do not open an escape hatch. A "just this once" override becomes "every time". If you need intervention, make it human and audited.
Track per-gate latency. If fix-loop drifts from 45s to 180s, you want to know before the agent starts routing around it.
Curate failure output. When a gate fails, the model sees the last 2K bytes of stdout/stderr. Make sure that window holds the relevant part: select the failing section deliberately instead of cutting at a fixed offset.
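Two of the checks in this list lend themselves to small scripts: the repeated-run flake check and the latency-drift alarm. The sketch below is one way to write them; `flakeReport`, `latencyDrift`, and the default drift factor of 2 are assumptions for illustration, not part of our tooling.

```javascript
const { spawnSync } = require("child_process");

// Run one command `runs` times and report how often it failed.
function flakeReport(cmd, args, runs = 50) {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    const res = spawnSync(cmd, args, { stdio: "ignore" });
    if (res.status !== 0) failures++; // non-zero exit, signal, or spawn error
  }
  return { runs, failures, failureRate: failures / runs, stable: failures === 0 };
}

// Median of a non-empty list of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Flag a gate whose latest duration exceeds `factor` times its historical median.
function latencyDrift(historyMs, latestMs, factor = 2) {
  if (historyMs.length === 0) return { drifted: false, baselineMs: null };
  const baselineMs = median(historyMs);
  return { drifted: latestMs > baselineMs * factor, baselineMs };
}

// Usage (assumes an npm script named fix-loop, as in the gates above):
// const report = flakeReport("npm", ["run", "fix-loop"]);
// if (!report.stable) process.exitCode = 1;
```

The `ms` field that each gate result already carries is a natural input for `latencyDrift`, so tracking latency needs no extra instrumentation.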

The most important meta-rule: the agent has to believe the gates help, not block. If they obstruct without adding value, the model will route around them, and the whole apparatus is worth zero.