Critic loop: ground truth is the live URL, not the DB

The quiet lie of the DB

This bit us during one internal dogfooding run. The builder wrote 14 files, called deploy, the process exited 0, and the row flipped to build_jobs.status='deployed'. The agent told the user "site is live at hiveagent.co/p/florist-tlv". The user clicked. 502.

The container's nginx was up, but the app inside crashed on boot because process.env.PORT was unset. The deploy script waited only on the docker-run exit code, not a health check. The DB saw success. Reality disagreed.
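The missing step was a readiness gate between `docker run` and the status write. A minimal sketch of one, assuming a hypothetical `waitForHealthy` helper with illustrative `attempts`, `delayMs`, and `fetchImpl` parameters (none of these come from the actual deploy script):

```javascript
// Hypothetical helper: poll the live URL until it answers 200, or give up.
// `attempts`, `delayMs`, and `fetchImpl` are illustrative parameters.
async function waitForHealthy(url, { attempts = 10, delayMs = 500, fetchImpl = fetch } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetchImpl(url, { signal: AbortSignal.timeout(2000) });
      if (res.status === 200) return true;
    } catch {
      // Connection refused or timeout: the app may still be booting.
    }
    // Pause between attempts, but not after the last one.
    if (i < attempts - 1) await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return false;
}
```

Under this sketch, the deploy step would write `status='deployed'` only after `waitForHealthy` returns `true`, so a container that boots and then crashes never produces a green row.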

That was the moment a rule went into our system prompt: ground truth is a live HTTP response, never a table row. If the agent claims "done", something has to verify. That something we call the Critic.

Warn: Every status field that mirrors external state (a DB row, a Redis flag, a cached value) has, at some point, lied. Without exception.

What the Critic actually checks

The Critic does not look at the DB. It does not read the build queue. It is handed a URL and runs a sequence of checks, every one of them grounded in a real HTTP response.

HTTP status: must be 200. Any 5xx is a hard fail. 3xx is followed up to three hops, then the target is checked.
Baseline content: body longer than 500 bytes. An empty page is a failure. <title> must be present. <h1> preferred.
Broken images: regex over src="", src="undefined", src="null". Each is a known LLM bug — generating HTML without checking the actual asset.
Missing alt: every <img> without alt is counted. Above a threshold this is a fail on accessibility grounds.
JS errors: headless Chrome loads the page and harvests console.error and uncaught exceptions. One error is a warning, three a failure.
Screenshot: 1280x800, saved. If the page is essentially blank (mean pixel value > 245) — fail.
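The two threshold rules at the end of the list can be sketched as pure functions. This assumes the screenshot has already been decoded into a flat RGBA byte buffer (the decoding step is not described here), and it assumes two JS errors count as a warning, since the list only pins down one and three. The function names are hypothetical.

```javascript
// Mean brightness over an RGBA buffer (4 bytes per pixel, alpha ignored).
function meanBrightness(rgba) {
  let sum = 0;
  let channels = 0;
  for (let i = 0; i + 3 < rgba.length; i += 4) {
    sum += rgba[i] + rgba[i + 1] + rgba[i + 2];
    channels += 3;
  }
  return channels === 0 ? 0 : sum / channels;
}

// Threshold from the checklist: a mean above 245 counts as a blank page.
function isBlankScreenshot(rgba, threshold = 245) {
  return meanBrightness(rgba) > threshold;
}

// Assumption: one error warns and three fail; two is treated as a warning here.
function classifyJsErrors(count) {
  if (count >= 3) return 'fail';
  if (count >= 1) return 'warning';
  return 'ok';
}
```

A mean is cheap but coarse: a mostly white page with a small header can still clear 245, so the threshold is worth tuning against real screenshots.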

Checks run in parallel under an 8-second budget. Anything that misses is a timeout, which on its own is a failure.
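One way to express "parallel under a budget" is to race each check against a shared timer. The sketch below is an assumption about shape, not the production runner: `runChecks` and its `{ ok, kind }` result format are hypothetical.

```javascript
// Hypothetical runner: `checks` maps a check name to an async function
// that resolves to an object with at least an `ok` boolean.
async function runChecks(checks, budgetMs = 8000) {
  const entries = Object.entries(checks).map(async ([name, check]) => {
    let timer;
    const timeout = new Promise(resolve => {
      timer = setTimeout(() => resolve({ ok: false, kind: 'timeout' }), budgetMs);
    });
    try {
      // Whichever settles first wins; a thrown check is reported as a failure.
      const outcome = await Promise.race([check(), timeout]);
      return [name, outcome];
    } catch (err) {
      return [name, { ok: false, kind: 'error', message: String(err) }];
    } finally {
      clearTimeout(timer);
    }
  });
  const results = Object.fromEntries(await Promise.all(entries));
  return { ok: Object.values(results).every(r => r.ok), results };
}
```

Note that `Promise.race` only stops waiting; it does not cancel the slow check. A check that holds a browser or socket open should also receive an abort signal so it can clean up.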

Implementation: the agent looking at itself

The code is simpler than it sounds. Here is the core, minus the screenshot path — just the HTTP-level checks:

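// Core Critic: HTTP-level checks only. Returns { url, ok, issues }.
async function critique(url) {
  const result = { url, ok: true, issues: [] };
  const fail = (issue) => { result.ok = false; result.issues.push(issue); };

  let res, html;
  try {
    // One 8-second budget covers the request and the body read.
    res = await fetch(url, { redirect: 'follow', signal: AbortSignal.timeout(8000) });
    if (res.status !== 200) {
      fail({ kind: 'http_status', value: res.status });
      return result;
    }
    html = await res.text();
  } catch (err) {
    // Timeout, DNS failure, or refused connection: a failure, not a crash.
    fail({ kind: 'fetch_error', value: err.name });
    return result;
  }

  // The rule is in bytes; String.length counts UTF-16 code units.
  const bytes = new TextEncoder().encode(html).length;
  if (bytes < 500) fail({ kind: 'empty_body', value: bytes });

  // src="", src="undefined", src="null", in either quote style.
  const broken = html.match(/<img\b[^>]*\ssrc=(["'])(?:undefined|null|)\1/gi) || [];
  if (broken.length) fail({ kind: 'broken_images', count: broken.length });

  // <img> tags with no alt attribute; more than two counts as a fail.
  const noAlt = (html.match(/<img\b(?![^>]*\salt=)[^>]*>/gi) || []).length;
  if (noAlt > 2) fail({ kind: 'missing_alt', count: noAlt });

  // Allow attributes on <title>, but require non-empty text.
  if (!/<title\b[^>]*>[^<]+<\/title>/i.test(html)) fail({ kind: 'no_title' });

  return result;
}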

The result is fed back to the agent as a tool result. If ok=false the issues become input for the next turn and the agent is required to fix them. If ok=true the system prompt unlocks emitting "done" to the user.
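One gap between the checklist and the core function: `redirect: 'follow'` defers to fetch's own redirect cap (20 in the Fetch standard), not the three hops the checklist calls for. A sketch of following redirects by hand, assuming a server-side fetch such as Node 18+; `fetchWithHopLimit` and its options are hypothetical names:

```javascript
// Hypothetical helper: follow at most `maxHops` redirects by hand.
// Assumes a server-side fetch (Node 18+), where `redirect: 'manual'` exposes
// the 3xx status and Location header; browsers return an opaque response instead.
async function fetchWithHopLimit(url, { maxHops = 3, fetchImpl = fetch } = {}) {
  let current = url;
  for (let hops = 0; hops <= maxHops; hops++) {
    const res = await fetchImpl(current, { redirect: 'manual' });
    const location = res.headers.get('location');
    const isRedirect = res.status >= 300 && res.status < 400 && location;
    if (!isRedirect) return { res, finalUrl: current, hops };
    // Location may be relative, so resolve it against the current URL.
    current = new URL(location, current).toString();
  }
  throw new Error(`more than ${maxHops} redirects starting at ${url}`);
}
```

Returning `finalUrl` alongside the response also lets the caller record where the check actually landed, which helps when a redirect points somewhere unexpected.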

Loop shape: the agent gating itself

The Critic is not a separate service. It is a tool inside the agent loop. The distinction matters: the agent cannot skip it. The system prompt forbids "done" until at least one critique_passed exists in the current turn history.

// runs after every model turn
function canEmitDone(toolHistory) {
  // Index of the most recent deploy; findLastIndex avoids an ambiguous indexOf lookup.
  const lastDeployIdx = toolHistory.findLastIndex(t => t.name === 'deploy');
  if (lastDeployIdx === -1) return false;

  // Only a critique that passed after that deploy counts.
  return toolHistory
    .slice(lastDeployIdx + 1)
    .some(t => t.name === 'critique' && t.result?.ok === true);
}

// Inside the agent loop: refuse end_turn until the gate opens.
if (model.stop_reason === 'end_turn' && !canEmitDone(history)) {
  return injectMessage("You called end_turn without a passing critique. " +
    "Call the critique tool against the deployed URL before declaring done.");
}

The combination — a tool the agent is required to call, a check fired against the live URL, enforcement in the loop — removes the classic LLM failure where the model hallucinates success. If it tries to fake "done", the outer layer catches it.
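To make the gate's behavior concrete, here is a compact restatement of it run against three sample histories. The history entries are made up for illustration and carry only the `name` and `result.ok` fields the gate reads.

```javascript
// Compact restatement of the gate so the example runs on its own
// (needs Node 18+ for findLastIndex).
function canEmitDone(toolHistory) {
  const lastDeployIdx = toolHistory.findLastIndex(t => t.name === 'deploy');
  if (lastDeployIdx === -1) return false;
  return toolHistory
    .slice(lastDeployIdx + 1)
    .some(t => t.name === 'critique' && t.result?.ok === true);
}

const deploy = () => ({ name: 'deploy', result: {} });
const critique = (ok) => ({ name: 'critique', result: { ok } });

// A pass that predates the latest deploy does not count.
const stale = [deploy(), critique(true), deploy()];
// Fail, fix, redeploy, pass: the gate opens.
const fixed = [deploy(), critique(false), deploy(), critique(true)];

console.log(canEmitDone([]));     // false: nothing was deployed
console.log(canEmitDone(stale));  // false: the pass is older than the last deploy
console.log(canEmitDone(fixed));  // true
```

The stale case is the one that matters: every redeploy invalidates earlier passes, so the agent cannot reuse a green critique from a previous version of the site.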

Win: In our internal dogfooding, since the Critic entered the loop, the rate of green DB rows pointing at broken URLs has dropped consistently from around 7% to under 0.5%. False positives in deploy are the worst kind of failure mode you can ship.

False pass: when the Critic itself lies

The next failure after "DB lied" is "Critic lied". It happens when a fetch returns 200 on a page that is actually broken. Scenarios that have caught us:

SPA with a loaded shell but no data: the server returns the wrapper HTML, React mounts, the post-mount fetch fails — but the Critic already saw 200 and moved on. Fix: a headless Chrome pass with networkidle.
Cached error page: the CDN serves a stale error at 200. Cache-Control: no-cache on the Critic's fetch, plus an X-Cache header check.
Placeholder content: the page reads "Coming soon". Status 200, title present, body over 500 bytes — the Critic goes green. Fix: a deny-list of suspicious n-grams ("lorem ipsum", "coming soon", "under construction") that hard-fail.
Wrong URL: the deploy returned a URL pointing at a different tenant's project. The page loads cleanly, but it is not the customer's project. Fix: the Critic verifies a unique project marker embedded in a meta tag.

<meta name="hive-project" content="prj_8f2a1c">

The Critic asserts that the content matches the expected project_id. Without that marker, a single routing bug can produce a deploy that reports "OK" against somebody else's site.
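The placeholder deny-list and the project marker check both reduce to small string tests over the fetched HTML. A minimal sketch, assuming the marker is written with double-quoted attributes and `name` before `content`, as in the tag above; `findPlaceholders` and `hasProjectMarker` are hypothetical names:

```javascript
// Deny-list from the scenarios above; extend as new placeholders show up.
const PLACEHOLDER_PHRASES = ['lorem ipsum', 'coming soon', 'under construction'];

// Returns the deny-listed phrases found in the page, case-insensitively.
function findPlaceholders(html) {
  const text = html.toLowerCase();
  return PLACEHOLDER_PHRASES.filter(phrase => text.includes(phrase));
}

// Reads <meta name="hive-project" content="..."> and compares it to the expected id.
// Assumes double-quoted attributes with name before content.
function hasProjectMarker(html, expectedId) {
  const match = html.match(/<meta\s+name="hive-project"\s+content="([^"]*)"/i);
  return match !== null && match[1] === expectedId;
}
```

Both checks are cheap enough to run on every critique. The deny-list will produce false alarms on a page that legitimately says "coming soon", so a per-project override may be worth having.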

Cost, and when not to use it

The Critic is not free. Each run is ~1.5s of network plus ~800ms of headless Chrome when it fires. At GCP pricing that is roughly 0.1 cents per check. If you ran ~3,000 a day, that totals about $90 a month — cheaper than a single support conversation.
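The monthly figure is simple multiplication, shown here so the inputs are easy to swap. This assumes a flat per-check cost and a 30-day month; `monthlyCostUsd` is a hypothetical helper.

```javascript
// Assumptions: a flat per-check cost and a 30-day month.
function monthlyCostUsd(checksPerDay, costPerCheckUsd, daysPerMonth = 30) {
  return checksPerDay * daysPerMonth * costPerCheckUsd;
}

// 3,000 checks a day at 0.1 cents ($0.001) each.
console.log(monthlyCostUsd(3000, 0.001).toFixed(2)); // "90.00"
```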

But there are spots where it is overkill:

Edits without deploy. When a tool only saves a draft, there is no live URL to check. Skip.
Internal-only tools. If the output is consumed by the next agent in the chain rather than the user, the Critic is redundant — the next step will verify on its own.
Streaming endpoints. SSE and WebSocket cannot be validated by a single fetch. They need a real client that connects and listens, which is expensive. We test those in batched evals, not inline.

The guiding rule: the Critic runs right before telling the user "done". Earlier, it is waste. At that boundary, it is mandatory.

Note: If you cannot verify the result over HTTP, you cannot promise the user it happened. "The DB says it shipped" is a polite guess. "I got a 200" is evidence.