Anti-pattern: the do-anything tool

TL;DR

A "do_anything" tool with a free-form command string pushes the LLM back to emitting text instead of speaking a protocol. It bypasses validation, hides your security surface, and forces you to write a parser instead of a schema. In Hive, an early version of such a tool (in the classic routes/client.js) produced 9% non-executable calls; we replaced it with 12 narrow tools.

What the bad tool looks like

This tool shows up in the first version of almost every agent codebase. It looks attractive because it saves work. Instead of defining 12 tools, you define one:

{
  "name": "do",
  "description": "Run any operation. Pass a command string.",
  "input_schema": {
    "type": "object",
    "properties": {
      "command": { "type": "string" },
      "args":    { "type": "object" }
    },
    "required": ["command"]
  }
}

It looks flexible. In reality, command is a language. You have just asked the model to write in that language and signed yourselves up to write an interpreter for it. The LLM already knows how to call tools through the tool_use protocol; you forced it back into producing text you have to parse.

Worse, you give up everything JSON Schema offers. No enum, no pattern, no named fields. The validator approves any string. The first bug arrives on a Friday night.
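
For contrast, here is a minimal sketch of what a narrow replacement could look like. The tool name create_user and its fields (name, role) are hypothetical, not Hive's actual schema, and the small validator only stands in for a full JSON Schema library so that the example stays self-contained.

```javascript
// Hypothetical narrow tool: every field is named, typed, and constrained.
const createUserTool = {
  name: 'create_user',
  description: 'Create one user account.',
  input_schema: {
    type: 'object',
    properties: {
      name: { type: 'string', pattern: '^[A-Za-z][A-Za-z0-9_-]{0,31}$' },
      role: { type: 'string', enum: ['viewer', 'editor', 'admin'] }
    },
    required: ['name', 'role'],
    additionalProperties: false
  }
};

// Toy validator covering only the keywords used above (required, string type,
// enum, pattern, additionalProperties). Assumes input is a plain object.
function validateInput(schema, input) {
  const errors = [];
  for (const field of schema.required || []) {
    if (!(field in input)) errors.push({ code: 'missing_field', field });
  }
  for (const [field, value] of Object.entries(input)) {
    const rule = schema.properties[field];
    if (!rule) {
      if (schema.additionalProperties === false) {
        errors.push({ code: 'unknown_field', field });
      }
      continue;
    }
    if (rule.type === 'string' && typeof value !== 'string') {
      errors.push({ code: 'wrong_type', field });
      continue;
    }
    if (rule.enum && !rule.enum.includes(value)) {
      errors.push({ code: 'not_in_enum', field });
    }
    if (rule.pattern && !new RegExp(rule.pattern).test(value)) {
      errors.push({ code: 'pattern_mismatch', field });
    }
  }
  return errors;
}
```

With this shape, a call that omits role or invents an extra field such as also_email is rejected with a named error before any handler runs, which a free-form command string cannot offer.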

You lose the protocol

Claude's tool_use protocol gives you something valuable: the model emits structure, and the runtime can validate it and accept or reject the call. When do_anything receives command: "create user, name=Yair", the runtime cannot say "missing field X"; it can only attempt to parse. Every parse failure reaches the model as a generic error message instead of a structured one it can act on.

An early Hive builder used a single execute tool with an action_dsl parameter. We caught the model emitting "build_site(theme=darkmode, deploy=true, also_email=customer)" — three real tools fused into one DSL string. We split it into build_site, deploy, and send_email, and the rogue also_email disappeared.

// Bad: parser tries to extract structure from a string
function parseCommand(cmd) {
  const m = /^(\w+)\((.*)\)$/.exec(cmd);
  if (!m) return { error: 'unparseable' };
  // ... 200 lines of regex hell to handle nested args
}

// Good: schema does this for free
const tool = { name: 'build_site', input_schema: { /* ... */ } };

When you let the model invent a language, it invents features in that DSL you did not anticipate. The schema is your fence.
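
To make the "structured error" point concrete, here is a hedged sketch of how a runtime might report a schema failure back to the model as a tool_result block marked with is_error. The error payload shape (code, field, hint), the id value, and the assumption that build_site requires a theme field are illustrative only.

```javascript
// Turn validation problems into a tool_result block the model can act on.
// toolUseId is the id taken from the model's tool_use block. The payload
// inside content (code, field, hint) is an assumed shape, not a standard.
function toErrorResult(toolUseId, problems) {
  return {
    type: 'tool_result',
    tool_use_id: toolUseId,
    is_error: true,
    content: JSON.stringify({ errors: problems })
  };
}

// Example: the model called build_site without a (hypothetically required) theme.
const result = toErrorResult('toolu_123', [
  { code: 'missing_field', field: 'theme', hint: 'build_site requires theme' }
]);
```

The model now sees which field to fix, rather than a bare "unparseable" coming out of a regex.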

Security becomes opaque

A do_anything tool with a free command is an attacker's dream. You write if (cmd.startsWith('rm ')) to block UNIX-level mistakes and forget that command could be delete_user. You don't know the option set, so you can't write an allowlist.

The schema is the natural allowlist. If action is an enum of five values, you audit five code paths. If command is a string, the set of paths to audit is unbounded. We caught the model emitting "read /etc/passwd" as an inner action. It was not malicious; it just saw an opening and tried.

Warning: the 2023 edition of the OWASP Top 10 for LLM Applications lists LLM07: Insecure Plugin Design. Its core concern, paraphrased: plugins that accept free-form input and interpret it are equivalent to eval over user input. That is not an exaggeration; it is accurate.

Even without an attacker, the LLM itself will try things you did not consider. A narrow tool blocks two attack vectors by design — malicious humans and curious models. A wide schema blocks neither.
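
One way to picture "the schema is the allowlist" is a dispatcher keyed by tool name. The sketch below reuses the tool names from the Hive split (build_site, deploy, send_email); the handler bodies and their input fields (theme, to) are placeholders invented for illustration.

```javascript
// The registry is the allowlist: only names listed here can ever run,
// so these entries are the complete set of code paths to audit.
const handlers = new Map([
  ['build_site', (input) => ({ ok: true, did: 'build_site', theme: input.theme })],
  ['deploy', () => ({ ok: true, did: 'deploy' })],
  ['send_email', (input) => ({ ok: true, did: 'send_email', to: input.to })]
]);

function dispatch(call) {
  const handler = handlers.get(call.name);
  if (!handler) {
    // Nothing to parse and nothing to guess: unknown names are rejected outright.
    return {
      ok: false,
      error: { code: 'tool_not_found', allowed: [...handlers.keys()] }
    };
  }
  return handler(call.input || {});
}
```

A request like "read /etc/passwd" has no entry in the map, so it never reaches code that could interpret it.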

The schema-overload trap

The first reaction to do_anything is usually to split into dozens of tools. That can also hurt. In Hive, with Opus 4.6 over OAuth, more than 12 tools at once made the model fall back to JSON-as-text instead of native tool_use. It collapses under a long choice list.

The solution is not to go back to a single tool but to tier the tools: groups of ≤12 tools, each scoped to a mode. In Hive: core (always available), build (when a project is active), ops (admin requests). The agent only ever sees the core tools plus the relevant tier.

function toolsForContext(ctx) {
  const core = ['ask_user', 'todo_write', 'log'];
  if (ctx.mode === 'build') return [...core, ...BUILD_TOOLS];
  if (ctx.mode === 'ops')   return [...core, ...OPS_TOOLS];
  return core;
}

Each tool keeps its narrow schema. The number of tools per turn stays low. The JSON-as-text fallback we saw dropped from 18% of sessions to 2% after tiering shipped. The trap to avoid: the model needs a tool not in the current tier and invents a fake tool_use. Watch for that in logs and refine your tier rules.
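
Watching the logs for out-of-tier calls could look roughly like the sketch below. The contents of BUILD_TOOLS and OPS_TOOLS and the log record shape ({ mode, toolName }) are assumptions made for the example; only the three core tool names come from the text above.

```javascript
// Hypothetical tier contents; only the core names appear in the article.
const CORE_TOOLS = ['ask_user', 'todo_write', 'log'];
const BUILD_TOOLS = ['build_site', 'deploy'];
const OPS_TOOLS = ['send_email'];

function toolsForContext(ctx) {
  if (ctx.mode === 'build') return [...CORE_TOOLS, ...BUILD_TOOLS];
  if (ctx.mode === 'ops') return [...CORE_TOOLS, ...OPS_TOOLS];
  return CORE_TOOLS;
}

// Count logged calls whose tool name was not offered in that turn's tier.
// Keys look like "build:send_email"; values are occurrence counts.
function findOutOfTierCalls(loggedCalls) {
  const misses = new Map();
  for (const { mode, toolName } of loggedCalls) {
    if (toolsForContext({ mode }).includes(toolName)) continue;
    const key = `${mode}:${toolName}`;
    misses.set(key, (misses.get(key) || 0) + 1);
  }
  return misses;
}
```

A recurring key such as build:send_email would suggest that the build tier is missing a tool the model genuinely needs there, which is a signal to adjust the tier rules rather than the prompt.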

The honest exception: bash on purpose

There is one legitimate case for a wide tool: a bash or shell tool in a coding agent like Claude Code. The input there is genuinely unconstrained by design — you expect the model to run UNIX commands you cannot enumerate in advance. But that is not do_anything; it is run_bash: a single tool with clear semantics, a sandbox, and structured output.

What sets it apart from the anti-pattern: (1) the name describes exactly what it does: it runs a shell; (2) it is not a replacement for other tools, since read_file, edit_file, and git_* live alongside it; (3) the effect is well-defined: stdout, stderr, exit code.

{
  "name": "run_bash",
  "description": "Execute a shell command in the sandboxed workspace. Returns stdout, stderr, exit_code. Read-only file operations should prefer read_file. Use this for build/test/git that genuinely need a shell.",
  "input_schema": {
    "type": "object",
    "properties": {
      "cmd": { "type": "string", "maxLength": 8000 },
      "timeout_s": { "type": "integer", "minimum": 1, "maximum": 600 }
    },
    "required": ["cmd"]
  }
}

The field is a string, but the tool is narrow. The line between the anti-pattern and an honest wide tool is not the input shape; it is intent. Are the name, description, and effect coherent? If yes, fine. If do in practice covers site builds, email sends, and user creation, you are back in the anti-pattern you wanted to avoid.
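
A handler behind such a schema might look like the following sketch for Node.js. It enforces the schema limits and returns the structured result the description promises. It does not sandbox anything by itself; isolation is assumed to be provided around it. The 60-second default timeout, the timed_out flag, and the use of /bin/sh rather than bash are assumptions of this example.

```javascript
const { spawnSync } = require('child_process');

// Sketch of a run_bash handler. NOT a sandbox: a container, restricted user,
// or similar isolation is assumed to exist around this process.
function runBash(input) {
  // Default timeout is an assumption; the schema leaves timeout_s optional.
  const timeoutS = input.timeout_s === undefined ? 60 : input.timeout_s;
  if (typeof input.cmd !== 'string' || input.cmd.length > 8000) {
    return { error: { code: 'invalid_input', field: 'cmd' } };
  }
  if (!Number.isInteger(timeoutS) || timeoutS < 1 || timeoutS > 600) {
    return { error: { code: 'invalid_input', field: 'timeout_s' } };
  }
  // /bin/sh is used here for portability; swap in bash if the workspace has it.
  const res = spawnSync('/bin/sh', ['-c', input.cmd], {
    encoding: 'utf8',
    timeout: timeoutS * 1000 // spawnSync expects milliseconds
  });
  return {
    stdout: res.stdout || '',
    stderr: res.stderr || '',
    exit_code: res.status, // null when the process was killed by a signal
    timed_out: Boolean(res.error && res.error.code === 'ETIMEDOUT')
  };
}
```

The input is still a free string, but the result has one fixed shape, which is what keeps the tool's name, description, and effect coherent.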

How to dismantle a do_anything you already have

If you already have one of these in prod, here is the order we used at Hive. We replaced our execute in 11 days with no downtime.

1. Full logging for 7 days. Capture every call to execute, log command and args. Count which subcommands actually appear; usually 5-10 cover 95% of traffic.
2. Identify the split. Group calls by effect: "writes a file", "reads a file", "starts a server". Each group becomes a future tool.
3. Add the new tools alongside the old one. Update the system prompt: "prefer create_project, build_site, deploy over the legacy execute tool". That is enough for the model to switch on 80% of calls within one test run.
4. Make execute return a deprecation warning. Every call to execute also returns warnings: [{ code: 'deprecated', hint: 'use create_project instead' }]. Within a week, execute traffic drops below 2%.
5. Remove. Drop execute from the registry. If a stray call remains, the registry returns "tool not found, use one of: ..." and the model substitutes.

SELECT command, count(*)
FROM tool_calls
WHERE tool_name = 'execute' AND created_at > now() - interval '7 days'
GROUP BY command ORDER BY 2 DESC LIMIT 20;

This is how to retire a wide tool without breaking production. The trick is that LLMs prefer narrow tools once exposed to them: they make fewer mistakes with them, and the model itself does less work.
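
The deprecation-warning and removal steps described above can be sketched as follows. The registry shape and handler bodies are hypothetical; only the tool names, the warning payload, and the "tool not found" message follow the text.

```javascript
// Hypothetical registry; handler bodies are placeholders.
const registry = new Map([
  ['create_project', (input) => ({ ok: true, project: input.name })],
  ['execute', (input) => ({ ok: true, ran: input.command })] // legacy tool
]);

// Deprecation step: wrap the legacy handler so every result carries a warning.
function deprecate(name, replacementHint) {
  const original = registry.get(name);
  registry.set(name, (input) => ({
    ...original(input),
    warnings: [{ code: 'deprecated', hint: replacementHint }]
  }));
}

// Removal step: unknown names get an error that lists the remaining tools.
function callTool(name, input) {
  const handler = registry.get(name);
  if (!handler) {
    const names = [...registry.keys()].join(', ');
    return { ok: false, error: `tool not found, use one of: ${names}` };
  }
  return handler(input);
}

deprecate('execute', 'use create_project instead');
const warned = callTool('execute', { command: 'create_project' });

registry.delete('execute');
const removed = callTool('execute', { command: 'create_project' });
```

Here warned still succeeds but carries the deprecation hint, while removed is a plain error naming the tools that remain, which gives the model enough to pick a substitute.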