Sub-agents: when to split and how to return results

תקצירTL;DR

פיצול ל-sub-agents מצדיק את עצמו רק כשאחד משלושה תנאים מתקיים: התפיחה של context בעיה אמיתית, צריך tool set שונה, או צריך מודל אחר במחיר/ביצועים. לא כשנדמה שזה "מסודר יותר". sub-agent טוב מחזיר סיכום מובנה, לא transcript. ב-Hive המעבר מ-builder יחיד ל-builder + critic + image-curator קיצר משמעותית את זמן ה-build הממוצע.

Sub-agents pay for themselves only when one of three conditions holds: real context bloat, a different tool set, or a different model on cost/latency grounds. Not because it "feels cleaner". A good sub-agent returns a structured summary, not a transcript. In Hive, splitting one builder into builder + critic + image-curator meaningfully cut average build time.

ברירת המחדל היא לא לפצלThe default answer is no

פיצול ל-sub-agents הוא מבנה מפתה כי הוא נראה ארכיטקטוני. בפועל הוא יקר. כל sub-agent הוא: עוד system prompt, עוד tool registry, עוד טעות אפשרית בהעברת context, ובדרך כלל גם עוד latency. אם תוכלו לפתור את הבעיה בסוכן יחיד עם prompt טוב, עשו זאת.

ב-Hive תחילה הייתה לנו ארכיטקטורת רב-סוכן "נכונה ארכיטקטונית": planner, builder, deployer, critic — ארבעה סוכנים נפרדים. כל אחד החזיק 30% מהבעיה ואיש מהם לא הבין אותה במלואה. ה-builder היה מקבל "build a landing page" מה-planner ולא ידע מה הוא לקוח דורש בעצם, כי הניואנסים אבדו ב-handoff.

איחדנו ל-builder יחיד עם כל הכלים. האיכות והמהירות עלו בצורה ניכרת. שמרנו פיצול רק היכן שהיו סיבות ספציפיות, לא ב-design תיאורטי. הסיבות הן הסעיפים הבאים.

Splitting into sub-agents is tempting because it looks architectural. In practice it is expensive. Each sub-agent is: another system prompt, another tool registry, another potential error in handoff, usually more latency. If you can solve the problem with a single agent and a good prompt, do that.

Hive originally had an "architecturally correct" multi-agent topology: planner, builder, deployer, critic. Each owned 30% of the problem and none owned the whole. The builder would receive "build a landing page" from the planner without knowing the customer's actual nuance, because nuance was lost in handoff.

We collapsed to a single builder with all tools. Quality and speed both rose noticeably. We kept splits only where there was a specific reason, not as theoretical design. Those reasons are the next sections.

טריגר 1: התפיחה של context אמיתיתTrigger 1: real context bloat

אם הסוכן הראשי צריך לקרוא קבצים גדולים, לחפש ב-DB, או לסרוק עמוד אינטרנט — וכל אלה יוצרים אלפי טוקנים שלא רלוונטיים לשאר ה-loop — זה רגע טוב לפצל. ה-sub-agent קורא, מסכם, מחזיר סיכום של 200 טוקנים. הסוכן הראשי לא ראה את 30,000 הטוקנים של הקלט.

ב-Hive, ה-image-curator הוא דוגמה. הוא מקבל "find 5 images for a falafel restaurant", מבצע 12 קריאות ל-API של Unsplash, מסנן לפי תאורה, ומחזיר 5 URLs. ה-builder הראשי לא צריך לראות 87 candidate URLs ו-12 metadata blobs — רק את ה-5 שנבחרו.

async function curateImages(brief) {
  const candidates = await searchUnsplash(brief, { limit: 60 });
  const filtered = candidates.filter(c => c.brightness > 0.3);
  const ranked = await rankByRelevance(filtered, brief);
  return {
    chosen: ranked.slice(0, 5).map(({ url, alt, attribution }) => ({ url, alt, attribution })),
    considered: ranked.length,
    rejected_dark: candidates.length - filtered.length
  };
}

שימו לב לפלט: chosen מובנה ומצומצם, ו-considered/rejected_dark נותנים לסוכן הראשי תחושה כמותית בלי לשפוך את כל הנתונים. זה מה שאומרים "summary, not transcript".

If your main agent has to read large files, query the DB, or crawl a page — and the result fills the loop with thousands of tokens irrelevant to the rest — split. The sub-agent reads, summarizes, returns ~200 tokens. The main agent never sees the 30,000 input tokens.

In Hive, the image curator is the example. It takes "find 5 images for a falafel restaurant", makes 12 Unsplash calls, filters by lighting, returns 5 URLs. The main builder does not need to see 87 candidate URLs and 12 metadata blobs — just the 5 picked.

async function curateImages(brief) {
  const candidates = await searchUnsplash(brief, { limit: 60 });
  const filtered = candidates.filter(c => c.brightness > 0.3);
  const ranked = await rankByRelevance(filtered, brief);
  return {
    chosen: ranked.slice(0, 5).map(({ url, alt, attribution }) => ({ url, alt, attribution })),
    considered: ranked.length,
    rejected_dark: candidates.length - filtered.length
  };
}

Note the output: chosen is structured and small; considered and rejected_dark give the parent quantitative grounding without dumping the data. That is the meaning of "summary, not transcript".

טריגר 2: tool set שונהTrigger 2: a different tool set

זוכרים את "מקסימום 12 כלים"? זו אחת הסיבות העיקריות לפיצול. אם הסוכן הראשי מתמודד עם 9 כלי build, ויש סוג מסוים של עבודה (למשל debugging) שדורש 8 כלים נוספים — אל תוסיפו אותם ל-tool list הראשי. צרו sub-agent שיש לו רק את ה-debugging tools.

זה מקנה שני יתרונות. ראשית, ה-tool registry של הסוכן הראשי נשאר נמוך וה-fallback ל-JSON-as-text שראינו ב-Opus 4.6 לא מתרחש. שנית, ה-system prompt של ה-sub-agent יכול להיות מתמחה — אין צורך לכתוב הוראות על מתי להשתמש בכלי build כשהסוכן בכלל לא רואה אותם.

// Main builder: 9 build-time tools
const builderTools = ['write_file', 'set_colors', 'fetch_image', /* ... */ ];

// Critic sub-agent: different 6 tools
const criticTools = ['fetch_url', 'screenshot', 'check_status', 'check_image', 'check_seo', 'report'];

async function runCritic(url) {
  return invokeSubAgent({
    system: CRITIC_PROMPT,
    tools: criticTools,
    model: 'claude-haiku-4-6',
    input: { url }
  });
}

הכלל: אם sub-agent חולק יותר מ-50% מהכלים עם הסוכן הראשי, כנראה לא היה צריך לפצל אותו. צרו variant של ה-prompt במקום.

Remember the "max 12 tools" rule? It is one of the main reasons to split. If the main agent already handles 9 build tools and a specific task (say, debugging) needs 8 more — do not add them to the main registry. Spawn a sub-agent that owns just the debugging tools.

Two benefits. First, the main agent's tool registry stays low and the JSON-as-text fallback we saw on Opus 4.6 does not trigger. Second, the sub-agent's system prompt can specialize — no need to write rules about when to use build tools when the sub-agent never sees them.

// Main builder: 9 build-time tools
const builderTools = ['write_file', 'set_colors', 'fetch_image', /* ... */ ];

// Critic sub-agent: different 6 tools
const criticTools = ['fetch_url', 'screenshot', 'check_status', 'check_image', 'check_seo', 'report'];

async function runCritic(url) {
  return invokeSubAgent({
    system: CRITIC_PROMPT,
    tools: criticTools,
    model: 'claude-haiku-4-6',
    input: { url }
  });
}

Rule of thumb: if a sub-agent shares more than 50% of its tools with the parent, it probably should not have been split. Make a variant prompt instead.

טריגר 3: מודל אחרTrigger 3: a different model

במידה והעבודה מסוימת זולה ומכנית — תיוג, סיווג, סינון — Haiku יעשה אותה במחיר ובלאטנסי קטנים בעשרות אחוזים מ-Opus. אם אתם משאירים את כל הסוכן על Opus, אתם משלמים פרמיה על דברים שלא צריכים אותה.

ב-Hive ה-builder רץ על Opus 4.6 כי הוא דורש שיקול דעת ארכיטקטוני. ה-router שמסווג הודעות נכנסות לכוונה ("זו הודעת build?", "זו הודעת ops?") רץ על Haiku 4.6. ה-image-curator ש-rank-iה תמונות לפי רלוונטיות רץ על Sonnet. כל אחד במחיר/ביצועים שמתאימים לסיכון של ההחלטה.

async function routeMessage(text) {
  // cheap classifier sub-agent
  const intent = await invokeSubAgent({
    system: ROUTER_PROMPT,
    model: 'claude-haiku-4-6',
    tools: ['classify_intent'],
    input: { text }
  });
  if (intent.kind === 'build') return runBuilder(text);
  if (intent.kind === 'ops')   return runOpsAgent(text);
  return chat(text);
}

שימו לב: כל סוכן שירוץ על מודל זול חייב tools עם schema הדוקה — מודלים זולים שוגים יותר ב-tool_use אם לא מנחים אותם. ושום החלטה שעלות הטעות שלה גבוהה לא צריכה לעבור דרכם. ב-Hive ה-Haiku router שלנו לא מחליט אם פתחים פרויקט חדש; הוא רק מסווג. ההחלטה היא של Opus.

When a task is cheap and mechanical — labeling, classifying, filtering — Haiku does it at a fraction of Opus's cost and latency. Leaving everything on Opus means you pay a premium for work that does not need it.

In Hive the builder runs on Opus 4.6 because it needs architectural judgment. The router that classifies incoming messages ("is this a build request?", "is this an ops question?") runs on Haiku 4.6. The image curator that ranks images by relevance runs on Sonnet. Each is priced for the risk of its decisions.

async function routeMessage(text) {
  // cheap classifier sub-agent
  const intent = await invokeSubAgent({
    system: ROUTER_PROMPT,
    model: 'claude-haiku-4-6',
    tools: ['classify_intent'],
    input: { text }
  });
  if (intent.kind === 'build') return runBuilder(text);
  if (intent.kind === 'ops')   return runOpsAgent(text);
  return chat(text);
}

Caveat: any agent on a cheaper model needs tighter schemas — cheap models trip up on tool_use more often without strong guidance. And no high-stakes decision should pass through them. Our Haiku router does not decide whether to open a new project; it only classifies. The decision is Opus's.

מחזיר סיכום, לא transcriptReturns a summary, not a transcript

טעות חוזרת בארכיטקטורת sub-agent: ה-sub-agent מחזיר את כל ה-transcript שלו לסוכן הראשי. כל ה-tool calls, כל ה-thinking blocks. הסוכן הראשי טובע. הוא לא צריך לדעת איך ה-critic הגיע למסקנה — הוא צריך את המסקנה.

ה-contract של sub-agent טוב הוא: קבל input, החזר אובייקט מובנה. כל מה שביניהן הוא black box.

// Bad: full transcript
{
  "thinking": "Let me check the URL...",
  "tool_calls": [
    { "name": "fetch_url", "args": { "url": "..." }, "result": { /* 8KB */ } },
    { "name": "screenshot", "args": { /*...*/ }, "result": { /* 32KB */ } }
  ],
  "final": "All good"
}

// Good: structured summary
{
  "ok": true,
  "url": "https://prj-8a7c.hiveagent.co",
  "status": 200,
  "checks": {
    "images_loaded": 4,
    "images_broken": 0,
    "meta_title_present": true,
    "hebrew_rtl_correct": true
  },
  "warnings": []
}

זה מאפשר לסוכן הראשי לקרוא את התוצאה כאילו היא tool result רגיל. אין הבדל בין "קראתי לכלי" לבין "שלחתי sub-agent". זה גם מאפשר לכם לזהות בעתיד בעיות — אם ה-sub-agent מתחיל להחזיר אובייקטים לא תקינים, ה-validator תופס את זה לפני שהסוכן הראשי מתבלבל.

A recurring mistake in sub-agent architecture: the sub-agent hands its entire transcript back to the parent. Every tool call, every thinking block. The parent drowns. The parent does not need to know how the critic reached its conclusion — it needs the conclusion.

A good sub-agent's contract is: take an input, return a structured object. Everything between is a black box.

// Bad: full transcript
{
  "thinking": "Let me check the URL...",
  "tool_calls": [
    { "name": "fetch_url", "args": { "url": "..." }, "result": { /* 8KB */ } },
    { "name": "screenshot", "args": { /*...*/ }, "result": { /* 32KB */ } }
  ],
  "final": "All good"
}

// Good: structured summary
{
  "ok": true,
  "url": "https://prj-8a7c.hiveagent.co",
  "status": 200,
  "checks": {
    "images_loaded": 4,
    "images_broken": 0,
    "meta_title_present": true,
    "hebrew_rtl_correct": true
  },
  "warnings": []
}

This lets the parent read the output as if it were a normal tool result. There is no difference between "I called a tool" and "I dispatched a sub-agent". It also lets you catch issues later — if the sub-agent starts returning malformed objects, the validator catches it before the parent gets confused.

מתי לא לפצלWhen not to split

שלוש סיבות שלרוב מציגות עצמן כסיבות טובות לפצל, אבל אינן.

"זה יותר מודולרי". מודולריות בקוד היא טובה. ב-LLM agents היא יקרה. אם הפיצול לא מבוסס על אחד משלושת הטריגרים — context, tools, model — זה רק handoff נוסף שיעלה לכם בלאטנסי ובטעויות.
"כל סוכן יהיה ספציפי יותר". לפעמים נכון. לרוב, system prompt טוב יכול להגדיר "persona" בסעיף אחד בלי שתצטרכו עוד runtime. אם הסוכן הראשי מבולבל, הבעיה לרוב ב-prompt או ב-tool design, לא בכמות הסוכנים.
"זה דומה לארכיטקטורות שאני רואה ב-papers". הרבה papers מציגים multi-agent כי זה יותר גלמור לכתוב עליהם. ב-prod, hub-and-spoke פשוט עם sub-agents נדודיים מנצח graph מורכב.

Warnהסיכון הגדול ביותר ב-multi-agent הוא responsibility diffusion. כל אחד חושב שהשני אחראי לוודא X. ה-DoD נופל בין הכיסאות. ב-Hive הקפדנו: אחד אחראי על verification — ה-builder, אחרי קריאה ל-critic. ה-critic לא יכול להחליט שהבדיקה ירוקה; הוא מחזיר נתונים, ה-builder מסיק. בלעדי זה ראינו פרויקטים שעלו broken כי כל אחד חשב שהשני בדק.

אם נתקעתם בארכיטקטורת multi-agent ולא בטוחים אם להפוך אותה — נסו את הניסוי הבא: שלבו את כל הסוכנים לאחד עם כל הכלים לבחינה אחת, והריצו על 50 בקשות. אם איכות לא יורדת, ה-multi-agent היה מותרות.

Three reasons that often masquerade as good reasons to split, but are not.

"It's more modular". Code modularity is good. Agent modularity is expensive. If the split is not driven by context, tools, or model, it is just a handoff that costs you latency and adds error surface.
"Each agent will be more specialized". Sometimes true. Usually a good system prompt defines a "persona" in one section without needing another runtime. If your main agent is confused, the cause is usually prompt or tool design, not agent count.
"It looks like the architectures in the papers". Many papers showcase multi-agent because it is more glamorous to write up. In prod, simple hub-and-spoke with disposable sub-agents beats elaborate graphs.

WarnThe biggest risk in multi-agent is responsibility diffusion. Each agent assumes the other verified X. The DoD falls between chairs. In Hive we are strict: one party owns verification — the builder, after calling the critic. The critic cannot declare a check green; it returns data, the builder draws the conclusion. Without that we saw broken projects ship because each agent assumed the other had checked.

If you are stuck with a multi-agent setup and unsure whether to collapse it, try this experiment: merge everything into one agent with all tools and run on 50 requests. If quality does not drop, the multi-agent was a luxury.