Idempotency Keys: how to stop three ghost builds from one message

How a ghost project is born

The story repeats itself. A user types "build me a flower-shop site", the client posts the request, the server starts processing, and then the WiFi blips for three seconds. The client never sees a response and retries. Now two identical requests sit in the queue. The second one has no idea the first is already running.

In a system without idempotency, each request opens a new build_jobs row, allocates a fresh project_id, and provisions a new container. The first finishes, the second finishes after it, and if the user clicks again, the third does too. The user sees one site. The database holds three. Two of them are ghost projects: rows without a real owner, eating storage, poisoning future routing.

We caught this when a customer wrote "every time I log in there's another project". The logs showed three different project_id values created within four seconds, all carrying the exact same prompt. His connection had simply died mid-handshake.

Shape of the key: what to hash on

The first temptation is to hash the whole request body. That is a bad idea. Two different users typing "pizzeria site" collide on the same hash, and one inherits the other's project. In the other direction, a single stray space in the body changes the hash, so a near-identical request slips past the dedup entirely.

The correct key has three components: who (user_id), what (existing project_id, or a hash of the project name for new projects), and intent (build, deploy, fix). Our composer:

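const { createHash } = require('node:crypto');

// Hex-encoded SHA-256 of a string.
const sha256 = (text) => createHash('sha256').update(text, 'utf8').digest('hex');

// Build the dedup key from who (userId), what (projectId or name), and intent.
function dedupKey({ userId, projectId, projectName, intent }) {
  // Existing projects key on their id; new ones on a hash of the normalized name.
  const target = projectId
    ? `p:${projectId}`
    : `n:${sha256(normalize(projectName)).slice(0, 16)}`;
  return `${intent}:${userId}:${target}`;
}

// Trim, lowercase, and collapse whitespace so cosmetic differences keep the same key.
function normalize(name) {
  return (name || '').trim().toLowerCase().replace(/\s+/g, ' ');
}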

Normalization matters: "flower shop" and " Flower Shop " must produce the same key. We deliberately do not include the prompt itself — it is long, noisy, and shifts between retries. The name or id is enough to say "this is the same intent".
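To make the key's behavior concrete, here is a small self-contained sketch that repeats the composer and prints a few comparisons. The user ids (`u1`, `u2`) and the project id (`p_123`) are made-up values for illustration only.

```javascript
const { createHash } = require('node:crypto');

// Hex-encoded SHA-256 of a string.
const sha256 = (text) => createHash('sha256').update(text, 'utf8').digest('hex');

// Trim, lowercase, and collapse runs of whitespace.
const normalize = (name) => (name || '').trim().toLowerCase().replace(/\s+/g, ' ');

function dedupKey({ userId, projectId, projectName, intent }) {
  const target = projectId
    ? `p:${projectId}`
    : `n:${sha256(normalize(projectName)).slice(0, 16)}`;
  return `${intent}:${userId}:${target}`;
}

// Illustrative inputs only.
const sameUserClean = dedupKey({ userId: 'u1', projectName: 'Flower Shop', intent: 'build' });
const sameUserMessy = dedupKey({ userId: 'u1', projectName: '  flower   shop ', intent: 'build' });
const otherUser = dedupKey({ userId: 'u2', projectName: 'Flower Shop', intent: 'build' });
const existingProject = dedupKey({ userId: 'u1', projectId: 'p_123', intent: 'deploy' });

console.log(sameUserClean === sameUserMessy); // true: cosmetic differences collapse
console.log(sameUserClean === otherUser);     // false: the user id is part of the key
console.log(existingProject);                 // deploy:u1:p:p_123
```

Because the user id and intent sit in the key as plain text, a key found in the logs can be read directly without reversing a hash.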

The dedup check: pseudocode and SQL

Our window is 60 seconds: long enough to absorb a slow-network retry, short enough not to block a user who deliberately comes back later. The check runs against the database, not process memory, because two workers on two boxes must see the same picture.

// `db` is assumed to be a node-postgres style client: query(text, params) -> { rows }.
async function enqueueBuild(req) {
  const key = dedupKey(req);
  // Look for a job with the same key created inside the 60-second window.
  const existing = await db.query(`
    SELECT id, status FROM build_jobs
    WHERE dedup_key = $1
      AND created_at > NOW() - INTERVAL '60 seconds'
    ORDER BY created_at DESC
    LIMIT 1
  `, [key]);
  if (existing.rows[0]) {
    // Retry of a live request: hand back the job that already exists.
    return { jobId: existing.rows[0].id, deduped: true };
  }
  // No live job for this key: create one.
  const job = await db.query(`
    INSERT INTO build_jobs (dedup_key, user_id, project_id, intent, payload)
    VALUES ($1, $2, $3, $4, $5)
    RETURNING id
  `, [key, req.userId, req.projectId, req.intent, req.payload]);
  return { jobId: job.rows[0].id, deduped: false };
}

And the index that keeps it cheap:

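CREATE INDEX idx_build_jobs_dedup
  ON build_jobs (dedup_key, created_at DESC);

The index is deliberately not partial. A predicate such as created_at > NOW() - INTERVAL '1 hour' is rejected by Postgres, because index predicates may only call immutable functions. The composite index still turns the lookup into a short range scan over one key, and deleting old rows on a schedule is what keeps the index from growing forever.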

Races: two workers, same second

SELECT-then-INSERT is not enough under real concurrency. Two requests landing in the same millisecond both see zero rows and both INSERT. The database itself has to enforce uniqueness.

The clean answer is a unique index on the key, scoped to the live window. The obvious first attempt:

CREATE UNIQUE INDEX idx_build_jobs_dedup_unique
  ON build_jobs (dedup_key)
  WHERE created_at > NOW() - INTERVAL '60 seconds';

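Warning: Postgres rejects that index. An index predicate may only call immutable functions, and NOW() is not one. Swapping created_at for a stored expires_at column does not help while the predicate still compares against NOW(), so the predicate has to test a plain column. One workaround, sketched here with a hypothetical dedup_active flag: set expires_at at INSERT time, index only the rows whose flag is still set, and clear the flag on an expired row just before inserting.

ALTER TABLE build_jobs
  ADD COLUMN expires_at TIMESTAMPTZ NOT NULL
    DEFAULT NOW() + INTERVAL '60 seconds',
  ADD COLUMN dedup_active BOOLEAN NOT NULL DEFAULT FALSE;  -- old rows stay out of the index
ALTER TABLE build_jobs ALTER COLUMN dedup_active SET DEFAULT TRUE;

CREATE UNIQUE INDEX idx_build_jobs_dedup_unique
  ON build_jobs (dedup_key)
  WHERE dedup_active;

-- Before each INSERT: release the key if its window has passed.
UPDATE build_jobs SET dedup_active = FALSE
  WHERE dedup_key = $1 AND dedup_active AND expires_at <= NOW();

Now a second INSERT inside the window gets a 23505 unique_violation. The handler catches that error and returns the existing row. The uniqueness check and the write happen in one statement, so no serializable transaction is needed.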
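The catch-and-return step can be sketched as follows. This is a minimal illustration, not the production handler: `enqueueBuildAtomic` and `makeFakeDb` are hypothetical names, the `db` object is assumed to follow the node-postgres shape (`query(text, params)` resolving to `{ rows }`, errors carrying the SQLSTATE in `err.code`), and the fake database exists only so the sketch runs without a server.

```javascript
const UNIQUE_VIOLATION = '23505'; // SQLSTATE for unique_violation

// Insert first; if the unique index rejects the row, return the job that won.
async function enqueueBuildAtomic(db, req, key) {
  try {
    const inserted = await db.query(
      `INSERT INTO build_jobs (dedup_key, user_id, project_id, intent, payload)
       VALUES ($1, $2, $3, $4, $5)
       RETURNING id`,
      [key, req.userId, req.projectId, req.intent, req.payload]
    );
    return { jobId: inserted.rows[0].id, deduped: false };
  } catch (err) {
    if (err.code !== UNIQUE_VIOLATION) throw err; // unrelated failure: do not swallow it
    const existing = await db.query(
      `SELECT id FROM build_jobs
       WHERE dedup_key = $1
       ORDER BY created_at DESC
       LIMIT 1`,
      [key]
    );
    if (!existing.rows[0]) throw err; // the winning row vanished; surface the original error
    return { jobId: existing.rows[0].id, deduped: true };
  }
}

// In-memory stand-in that mimics only the unique-index behavior (assumption, for demo).
function makeFakeDb() {
  const rows = [];
  return {
    async query(text, params) {
      const key = params[0];
      if (text.trimStart().startsWith('INSERT')) {
        if (rows.some((row) => row.dedup_key === key)) {
          throw Object.assign(new Error('duplicate key value'), { code: UNIQUE_VIOLATION });
        }
        const row = { id: rows.length + 1, dedup_key: key };
        rows.push(row);
        return { rows: [{ id: row.id }] };
      }
      return { rows: rows.filter((row) => row.dedup_key === key).slice(-1) };
    },
  };
}

// Two "simultaneous" requests with the same key resolve to one job.
async function demo() {
  const db = makeFakeDb();
  const req = { userId: 'u1', projectId: null, intent: 'build', payload: '{}' };
  return Promise.all([
    enqueueBuildAtomic(db, req, 'build:u1:n:abc'),
    enqueueBuildAtomic(db, req, 'build:u1:n:abc'),
  ]);
}

demo().then((results) => console.log(results));
```

The design point is that the loser never decides anything on its own: it only learns from the database that a row already exists, then reads that row back.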

Tuning the window: why 60 seconds

The window is a trade-off between two kinds of complaint. Too short (5 seconds): real retries slip through and ghosts come back. Too long (10 minutes): a user who typed "pizzeria site", glanced at the result, and tried again because they wanted a different take on the same name is served the same project instead of a fresh one.

Sixty seconds came out of measurement: 99% of observed retries were within 12 seconds, 99.9% within 45. A deliberate "let me try a different version" takes at least 90 seconds of human thought. Sixty sits cleanly in that gap.
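The same reasoning can be written down as a small check: take the tail of the observed retry gaps, take the shortest deliberate re-submit you expect, and confirm the window falls between them. The sketch below uses a nearest-rank percentile; the gap values are invented for illustration and are not the measurements quoted above, and `percentile` and `windowFits` are hypothetical helper names.

```javascript
// Nearest-rank percentile over an ascending array of numbers.
function percentile(sortedAscending, p) {
  if (sortedAscending.length === 0) throw new Error('no samples');
  const rank = Math.ceil((p / 100) * sortedAscending.length);
  return sortedAscending[Math.max(rank, 1) - 1];
}

// A window fits when it covers the retry tail but ends before deliberate re-submits begin.
function windowFits(windowSeconds, retryTailSeconds, deliberateFloorSeconds) {
  return windowSeconds >= retryTailSeconds && windowSeconds < deliberateFloorSeconds;
}

// Made-up gaps, in seconds, between an original request and its retry.
const retryGaps = [1, 2, 2, 3, 4, 5, 8, 12, 30, 45].sort((a, b) => a - b);
const retryTail = percentile(retryGaps, 100); // slowest retry in this toy sample

console.log(windowFits(60, retryTail, 90));  // true: covers retries, ends before deliberate re-submits
console.log(windowFits(5, retryTail, 90));   // false: slow retries slip through
console.log(windowFits(600, retryTail, 90)); // false: swallows deliberate re-submits
```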

For more dangerous intents we stretch the window: 300 seconds for a production deploy, 600 for a delete. The logic: a deploy fired twice within a minute is almost always an accident.

// Dedup window per intent, in seconds.
const TTL_BY_INTENT = {
  build: 60,
  fix: 30,
  deploy: 300,
  delete: 600,
};

// Unknown intents fall back to the 60-second default.
function ttlFor(intent) {
  return TTL_BY_INTENT[intent] ?? 60;
}

The value lives in config, not source. The first time we bumped it was at midnight after a duplicate-deploy report, and we could push 300 without waiting for a server release.
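One way to keep the table adjustable without a release is to layer overrides on top of the defaults at startup. The sketch below is an assumption about how that could look, not a description of the actual config system: `loadTtlTable`, `expiresAt`, and the JSON override string (for example read from an environment variable) are all hypothetical.

```javascript
// Defaults, in seconds, matching the table in the article.
const DEFAULT_TTL_SECONDS = { build: 60, fix: 30, deploy: 300, delete: 600 };

// Merge a JSON override string such as '{"deploy": 120}' over the defaults.
// Entries that are not positive integers are ignored rather than trusted.
function loadTtlTable(rawOverrides) {
  const table = { ...DEFAULT_TTL_SECONDS };
  if (!rawOverrides) return table;
  const parsed = JSON.parse(rawOverrides);
  for (const [intent, seconds] of Object.entries(parsed)) {
    if (Number.isInteger(seconds) && seconds > 0) table[intent] = seconds;
  }
  return table;
}

// Compute the expiry timestamp to store with a new job.
function expiresAt(table, intent, now = new Date()) {
  const ttlSeconds = table[intent] ?? 60; // unknown intents fall back to 60 seconds
  return new Date(now.getTime() + ttlSeconds * 1000);
}

const table = loadTtlTable('{"deploy": 120, "fix": -5}');
console.log(table.deploy); // 120: override applied
console.log(table.fix);    // 30: invalid override ignored
console.log(expiresAt(table, 'build', new Date(0)).toISOString()); // 1970-01-01T00:01:00.000Z
```

Passing the computed timestamp explicitly in the INSERT, instead of relying on a fixed column default, is what lets each intent carry its own window.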

Pitfalls: when idempotency hurts you

Idempotency is not free, and there are spots where it turns into the bug. Four pitfalls we have hit in production:

Returning a failure as a success. The first request failed (timeout to a provider), the retry sees the existing row and hands back its jobId, which is still in the failed state. Our dedup lookup filters on status NOT IN ('failed', 'cancelled').
Stale payload. If the user edits the prompt within the 60-second window, we return the prior job and they never see the edit. Fix: hash the payload into the key, or break dedup when payloads diverge meaningfully.
The UNIQUE index does not clean up. Rows are not deleted when the TTL expires; they just drop out of the partial index. VACUUM does not remove them either, since they are still live rows, so without an explicit purge the table grows. We run DELETE FROM build_jobs WHERE expires_at < NOW() - INTERVAL '7 days' hourly.
Side effects before the check. If a credit-card charge runs in the handler before the dedup lookup, you double-charge despite job-level idempotency. The dedup check must be the first thing after authentication.
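The first two pitfalls reduce to one question: should an existing row absorb this request or not? A minimal sketch of that decision follows. The `payload_hash` field and the `shouldReuse` helper are hypothetical; the status values are the ones named above. Note that `JSON.stringify` is sensitive to key order, so a real implementation would need a canonical serialization.

```javascript
const { createHash } = require('node:crypto');

// Jobs in these states must not be handed back to a retrying client.
const DEAD_STATUSES = new Set(['failed', 'cancelled']);

// Hash of the payload as serialized; key order matters with JSON.stringify.
function payloadHash(payload) {
  return createHash('sha256').update(JSON.stringify(payload), 'utf8').digest('hex');
}

// Decide whether an existing job row can stand in for the incoming request.
function shouldReuse(existingRow, incomingPayload) {
  if (!existingRow) return false;                            // nothing to reuse
  if (DEAD_STATUSES.has(existingRow.status)) return false;   // pitfall 1: failure as success
  return existingRow.payload_hash === payloadHash(incomingPayload); // pitfall 2: stale payload
}

// Illustrative rows only.
const original = { prompt: 'build me a flower-shop site' };
const running = { id: 1, status: 'running', payload_hash: payloadHash(original) };
const failed = { id: 2, status: 'failed', payload_hash: payloadHash(original) };

console.log(shouldReuse(running, original));                         // true: plain retry
console.log(shouldReuse(failed, original));                          // false: let the retry run
console.log(shouldReuse(running, { prompt: 'make it a bakery' }));   // false: the user changed the prompt
```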

Note: Idempotency fits writes with a deterministic outcome. Streaming events, WebSocket frames, or anything append-only need a different primitive (sequence numbers, last-event-id), not a dedup window.