Retry strategy: a matrix by error code

TL;DR

Not every tool error deserves a retry. 5xx wants exponential backoff with jitter; 429 wants the server's Retry-After; 4xx (other than 429) is a request bug, so do not retry and return the reason to the agent; network errors and timeouts are retried only if the tool is idempotent. In Hive, switching from a uniform retry to a per-code policy cut retry storms by 86%.

Why one retry strategy fits none

Our first tool execution wrapper was simple: try { run() } catch (e) { retry(3) }. Every error got the same response. It broke in three different ways in the same week.

A 401 retried three times at 2-second intervals inside one prompt — six wasted seconds before the agent even saw the problem. A 429 produced a follow-up burst that extended the rate-limit window. A short 503 turned into a full task failure because three retries were not enough and the build was blocked.

One retry strategy cannot cover those three cases. Each status code says something different about who is at fault and who knows when the call will succeed again, so each needs different handling. Classification has to happen before the decision to retry, and before the choice of how aggressive that retry should be.

Note: Many SDKs (including the official Anthropic SDK) already retry 5xx and 429 internally. If you wrap them, make sure you do not double up; otherwise you get 3×3 = 9 attempts per call.

The error taxonomy

Errors split on a single question: who knows when the call will succeed again?

5xx (server) — The server is degraded. It does not know when it will recover. Probabilistic, needs backoff.
429 (rate limit) — The server knows. It told you so in Retry-After. Honor it, do not guess.
Other 4xx (400, 401, 403, 404, 422) — The request is wrong. The same args will return the same error.
Network / timeout — No server touched the request (or none replied). It may have been processed, it may not. Unknown state.

function classifyError(err) {
  if (err.code === 'ECONNRESET' || err.code === 'ETIMEDOUT' ||
      err.code === 'EAI_AGAIN' || err.name === 'AbortError') {
    return 'network';
  }
  if (!err.status) return 'unknown';
  if (err.status === 429) return 'rate_limit';
  if (err.status >= 500) return 'server';
  if (err.status >= 400) return 'client';
  return 'unknown';
}

This classifier is the fork. Everything past it is a lookup table — one error class, one retry policy.
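As a rough illustration of that lookup table, the per-class decisions described in this article could be written down as data. The `RETRY_POLICY` object and `policyFor` helper are hypothetical names, not part of the Hive code; the attempt counts are the ones quoted in the sections that follow.

```javascript
// Hypothetical policy table: one entry per class returned by classifyError().
// Attempt counts mirror the values quoted in this article.
const RETRY_POLICY = {
  server:     { retry: true,            maxAttempts: 5, wait: 'exponential-jitter' },
  rate_limit: { retry: true,            maxAttempts: 3, wait: 'retry-after' },
  network:    { retry: 'if-idempotent', maxAttempts: 3, wait: 'linear' },
  client:     { retry: false,           maxAttempts: 1, wait: 'none' },
  unknown:    { retry: false,           maxAttempts: 1, wait: 'none' },
};

function policyFor(kind) {
  // Only own keys count, so a stray value like 'constructor' cannot
  // pick up something from Object.prototype.
  const known = Object.prototype.hasOwnProperty.call(RETRY_POLICY, kind);
  return known ? RETRY_POLICY[kind] : RETRY_POLICY.unknown;
}
```

Keeping the policy as data makes it easy to review in one place and to log alongside each retry decision.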

5xx: exponential backoff with jitter

5xx is exponential backoff with jitter. Formula: delay = base * 2^attempt + random(0, jitter). Our base is 200ms, jitter 100ms, up to 5 attempts.

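// Promise-based delay helper used by all retry wrappers in this article.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryServer(fn, max = 5) {
  const base = 200, jitter = 100; // milliseconds
  for (let attempt = 0; attempt < max; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (classifyError(err) !== 'server') throw err; // other classes are handled elsewhere
      if (attempt === max - 1) throw err;             // out of attempts
      const delay = base * (2 ** attempt) + Math.random() * jitter;
      await sleep(delay);
    }
  }
}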

Why jitter? Because without it, if 100 agents fail at the same instant, all 100 retry at the same instant. Jitter breaks up the herd: with 100 ms of randomness, those 100 retries land across a 100 ms window instead of on the same millisecond.

Why not retry forever? Because the five attempts above already span about 3 seconds of waiting (200, 400, 800 and 1600 ms, plus jitter), and a server that has not recovered by then is unlikely to recover within the next few seconds either. Surface the error to the agent and let it decide: pick a different tool, or stop with an explanation for the user.
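To sanity-check the worst-case wait for a given attempt count, the deterministic part of the formula can be tabulated. This is a small sketch with hypothetical helper names (`backoffSchedule`, `totalWait`); it ignores jitter and counts one wait between consecutive calls.

```javascript
// Deterministic part of the delay schedule (jitter excluded).
// `attempts` is the total number of calls; there is one wait between
// consecutive calls, so attempts - 1 waits in total.
function backoffSchedule(attempts, base = 200) {
  const waits = [];
  for (let attempt = 0; attempt < attempts - 1; attempt++) {
    waits.push(base * 2 ** attempt);
  }
  return waits;
}

function totalWait(attempts, base = 200) {
  return backoffSchedule(attempts, base).reduce((sum, ms) => sum + ms, 0);
}

console.log(backoffSchedule(5)); // [ 200, 400, 800, 1600 ]
console.log(totalWait(5));       // 3000
console.log(totalWait(6));       // 6200
```

Five attempts wait about 3 seconds in total; a sixth attempt adds the 3200 ms step and brings the total to roughly 6 seconds.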

Win: After moving to jitter we saw a 73% drop in thundering-herd events against the Anthropic API. Same agents, same load, better spread.

429: honor Retry-After

429 is different. The server told you when to come back — honor that. Retry-After may be seconds (a number) or an HTTP date. Both are valid; both must be parsed.

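function parseRetryAfter(header) {
  const DEFAULT_MS = 1000; // fallback when the header is missing or unparseable
  if (!header) return DEFAULT_MS;
  // Form 1: delay in seconds, e.g. "30".
  const seconds = Number(header);
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const date = Date.parse(header);
  if (!Number.isNaN(date)) return Math.max(0, date - Date.now());
  return DEFAULT_MS;
}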

async function retryRateLimit(fn, max = 3) {
  for (let attempt = 0; attempt < max; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (classifyError(err) !== 'rate_limit') throw err;
      if (attempt === max - 1) throw err;
      const wait = parseRetryAfter(err.headers?.['retry-after']);
      await sleep(wait);
    }
  }
}

Warning: If Retry-After says 30 seconds, do not add exponential backoff on top. That value is the backoff. Stacking makes the agent wait 60s for nothing, and makes the user think the system is stuck.

If the header is missing (some providers omit it), default to one second — not exponential. Up to 3 attempts; after that surface the error to the agent. If your provider consistently returns 429 without Retry-After, file a ticket — that is an API-contract violation.
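One practical detail when reading the header: the lookup `err.headers?.['retry-after']` assumes a plain object with lowercase keys. Some HTTP clients expose a Fetch-style `Headers` object instead, or preserve the server's casing. A minimal sketch of a more tolerant lookup, using the hypothetical helper name `getRetryAfter`:

```javascript
// Returns the raw Retry-After value, or undefined if absent.
// Accepts a Fetch-style Headers object (anything with a .get method)
// or a plain object whose keys may use any casing.
function getRetryAfter(headers) {
  if (!headers) return undefined;
  if (typeof headers.get === 'function') {
    // Headers.get() is case-insensitive and returns null when missing.
    return headers.get('retry-after') ?? undefined;
  }
  for (const [key, value] of Object.entries(headers)) {
    if (key.toLowerCase() === 'retry-after') return String(value);
  }
  return undefined;
}
```

The result can be passed straight to `parseRetryAfter`, which already falls back to one second when it receives `undefined`.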

4xx: do not retry

4xx (other than 429) is not transient. It is a bug. Same args, same error. Retrying here wastes time and — more importantly — context window.

Correct handling: do not retry, but do not throw at the user either. Return the error to the agent with a specific reason. The agent will switch tools or fix the args.

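async function executeTool(name, args, ctx) {
  const call = () => callTool(name, args, ctx);
  try {
    return await call();
  } catch (err) {
    const kind = classifyError(err);
    if (kind === 'client') {
      // Not transient: hand the reason back to the agent instead of retrying.
      return {
        type: 'tool_result',
        is_error: true,
        content:
          `${name} rejected: ${err.status} ${err.message}. ` +
          `Args were: ${JSON.stringify(args).slice(0, 300)}. ` +
          `Fix the args or call a different tool.`,
      };
    }
    // The first attempt has already failed, and each helper calls fn()
    // immediately on its first iteration, so wait once before handing off.
    // The helper's attempt count is in addition to this initial call.
    if (kind === 'server') {
      await sleep(200 + Math.random() * 100);
      return retryServer(call);
    }
    if (kind === 'rate_limit') {
      await sleep(parseRetryAfter(err.headers?.['retry-after']));
      return retryRateLimit(call);
    }
    // retryNetwork checks idempotency itself before calling again.
    if (kind === 'network') return retryNetwork(name, args, call);
    throw err;
  }
}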

Especially 401 and 403 — agents like to assume that one more attempt will magically return an authorization. It will not. Return the status and the original message, and stop. For 422 (validation), include the failing fields in content so the agent knows which arg to fix.
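For the 422 case, a sketch of what "include the failing fields" might look like. The error shape is an assumption: it supposes the tool's error carries `err.body.errors` as an array of `{ field, message }` objects, which will differ between providers. `formatValidationError` is a hypothetical helper name.

```javascript
// Builds a tool_result for a 422, listing each failing field on its own line.
// ASSUMPTION: err.body.errors is an array of { field, message } objects.
function formatValidationError(name, err) {
  const issues = Array.isArray(err.body?.errors) ? err.body.errors : [];
  const lines = issues.map((issue) => `- ${issue.field}: ${issue.message}`);
  const detail = lines.length > 0
    ? `Invalid fields:\n${lines.join('\n')}`
    : `No field details provided: ${err.message}`;
  return {
    type: 'tool_result',
    is_error: true,
    content: `${name} rejected: 422.\n${detail}\nFix these args and call again.`,
  };
}
```

One field per line keeps the message short and gives the agent a concrete arg to change, rather than a raw JSON dump of the response body.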

Network and timeout: retry only when idempotent

Network errors are the hardest. ECONNRESET, ETIMEDOUT — you do not know whether the request reached the server. If it did and was processed, a retry creates a double-write.

Solution: retry only if the tool is idempotent. In Hive every dangerous tool (deploy, write_file, db_write) carries an idempotency key derived from a hash of the args. A retry with the same key is a no-op on side effects — our TTL is 60 seconds on (user_id, project_id_or_name_hash).

const NATURALLY_IDEMPOTENT = new Set([
  'fetch_image', 'read_file', 'list_files', 'critique', 'screenshot',
]);

async function retryNetwork(name, args, fn, max = 3) {
  const idempotent = NATURALLY_IDEMPOTENT.has(name) || !!args?.idempotency_key;
  if (!idempotent) {
    return {
      type: 'tool_result',
      is_error: true,
      content:
        `Network error on non-idempotent ${name}. ` +
        `Cannot retry safely — request may or may not have applied. ` +
        `Verify state with a read tool before continuing.`,
    };
  }
  for (let attempt = 0; attempt < max; attempt++) {
    try { return await fn(); }
    catch (err) {
      if (classifyError(err) !== 'network') throw err;
      if (attempt === max - 1) throw err;
      await sleep(500 * (attempt + 1));
    }
  }
}

For non-idempotent + network error, the agent receives a clear message: "I do not know whether the call succeeded. Verify before continuing". It runs a read_file or list_projects, checks, and continues — without double-writing.
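The article does not show how the idempotency key itself is built. One plausible way to derive a key from a hash of the args, sketched here with Node's built-in `crypto` module, is to serialize the args with sorted keys and hash the result. The names `stableStringify` and `idempotencyKey` are hypothetical, and the actual Hive scheme may differ.

```javascript
const { createHash } = require('node:crypto');

// Serialise with sorted object keys so that {a:1,b:2} and {b:2,a:1}
// produce the same string. Arrays keep their order.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.keys(value).sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
    return `{${entries.join(',')}}`;
  }
  // JSON.stringify(undefined) is undefined, so fall back to 'null'.
  return JSON.stringify(value) ?? 'null';
}

// SHA-256 over the tool name plus its canonicalised args, as a hex string.
function idempotencyKey(toolName, args) {
  return createHash('sha256')
    .update(`${toolName}:${stableStringify(args)}`)
    .digest('hex');
}
```

Sorting keys matters because two calls with the same arguments in a different property order should count as the same request.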

Common pitfalls

Three traps we have seen in our own code and in customer code:

No global retry budget

Without a global cap, a sequence of 5xx + 429 + network can stretch a 5-second task into a 5-minute task. We use retry_budget_ms = 30000 per agent per task. Past that, end the task with a real error — even if the last retry was theoretically reasonable.
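A minimal sketch of such a budget, assuming one tracker is created per task and consulted before every sleep. `createRetryBudget` is a hypothetical helper; the clock is injectable only so the behaviour can be tested without real waiting.

```javascript
// Tracks how much of a task's retry budget is left.
// `now` defaults to Date.now and can be replaced with a fake clock in tests.
function createRetryBudget(budgetMs = 30000, now = Date.now) {
  const startedAt = now();
  const remaining = () => Math.max(0, budgetMs - (now() - startedAt));
  return {
    remaining,
    // True if sleeping `waitMs` would still fit inside the budget.
    canWait: (waitMs) => waitMs <= remaining(),
  };
}
```

Each retry wrapper would call `budget.canWait(delay)` before sleeping and rethrow the last error when it returns false, so a long Retry-After cannot silently push the task past the cap.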

Retry inside a retry

An SDK wraps calls in its own retry, and we wrap that SDK in our retry. Result: 3×3 = 9 attempts per call, sometimes 27 seconds of waiting before the agent sees the error. Always inspect what the SDK does and disable one layer — usually the SDK's, because it does not know what is idempotent in your domain.

Logging without sampling

Logging every retry attempt produces log spam. We hit 2.4M log lines a day from transient 503s. The fix: log only the first attempt and the last attempt, or only when the chain ultimately fails.

if (attempt === 0 || attempt === max - 1 || finalSuccess === false) {
  logger.warn('tool_retry', {
    name, attempt, kind, status: err.status, ms_elapsed: Date.now() - startedAt,
  });
}

After fixing those three traps, our retry storms dropped 86% and the fix-loop shortened by 40%. The biggest saving was in the agent's context window, not the server's CPU.