Prompt Caching: the 90% discount most teams miss

Why caching matters at all

A background agent burns most of its tokens on the same bytes over and over: the system prompt, the tool definitions, and the stable memory context. Without caching, we pay full input price every turn even when 95% of the prompt is byte-identical to the previous turn.

At Sonnet rates, an 8K-token system prompt plus 30 tool definitions (~5K tokens) plus 4K tokens of stable memory works out to roughly 17K stable input tokens per call. Run that 200 times a day per user and it is 3.4M input tokens daily just for the unchanged scaffolding. With caching enabled, the second-and-onward turns pay about 10% of that.
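The arithmetic above can be checked with a few lines. This is a rough sketch, not a billing tool: it counts cost in "base-price input tokens" rather than money, uses the write premium and read discount quoted later in this article, and assumes the best case where the cache is written once and stays warm for all 200 calls. The names `dailyStableCost`, `STABLE_TOKENS`, and `CALLS_PER_DAY` are illustrative.

```typescript
// Relative cost in "base-price input tokens"; multiply by your per-token rate for money.
const STABLE_TOKENS = 8_000 + 5_000 + 4_000; // system prompt + tools + memory
const CALLS_PER_DAY = 200;

// Billing multipliers from this article, in percent to keep the arithmetic in integers.
const CACHE_WRITE_PCT = 125; // first write: ~25% premium
const CACHE_READ_PCT = 10; // later reads: ~10% of base

// Cost of the stable prefix for one day, given how many calls had to (re)write the cache.
function dailyStableCost(calls: number, cacheWrites: number): number {
  const reads = calls - cacheWrites;
  return (
    (cacheWrites * STABLE_TOKENS * CACHE_WRITE_PCT) / 100 +
    (reads * STABLE_TOKENS * CACHE_READ_PCT) / 100
  );
}

const uncached = STABLE_TOKENS * CALLS_PER_DAY; // 3,400,000
const bestCase = dailyStableCost(CALLS_PER_DAY, 1); // one write, 199 reads
console.log({ uncached, bestCase, ratio: bestCase / uncached });
```

Under these assumptions the stable prefix costs the equivalent of about 360K tokens instead of 3.4M. Every extra cache expiry during the day adds one more write to the total.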

The point is not only billing. Cached tokens still occupy the context window, but they stop competing for the token budget: if our server cannot economize on the system prompt, it ends up economizing on the conversation itself by trimming history. Caching lets the agent carry more live context for the same money, and that changes how it behaves.

How the cache breakpoint actually works

The cache breakpoint is opt-in. You mark a content block with cache_control: {type: "ephemeral"} and everything from the start of the prompt up to and including that block is cached. The TTL is 5 minutes, but it resets on every cache hit, so an agent that fires once every 30 seconds keeps the cache warm for hours.

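import Anthropic from "@anthropic-ai/sdk";

// The client reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  system: [
    {
      type: "text",
      text: STABLE_SYSTEM_PROMPT,
      // Breakpoint: everything up to and including this block is cached.
      cache_control: { type: "ephemeral" }
    }
  ],
  tools: TOOL_DEFINITIONS, // ~30 tools, ~5K tokens
  messages: conversation
});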
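The TTL behaviour described before the request example (5 minutes, refreshed on every hit) can be sketched as a small simulation. This is a toy model under stated assumptions, not the API's actual eviction logic: it treats a call that arrives exactly at expiry as a miss, and it assumes every miss rewrites the full prefix. `simulateCache` is a hypothetical helper.

```typescript
const TTL_MS = 5 * 60 * 1000; // 5-minute TTL, as described above

// Given call timestamps (ms, ascending), count cache writes and reads,
// assuming every hit refreshes the TTL and every miss rewrites the prefix.
function simulateCache(callTimes: number[]): { writes: number; reads: number } {
  let expiresAt = -Infinity;
  let writes = 0;
  let reads = 0;
  for (const t of callTimes) {
    if (t < expiresAt) reads++;
    else writes++;
    expiresAt = t + TTL_MS; // both a write and a hit restart the clock
  }
  return { writes, reads };
}

// One call every 30 seconds for an hour: a single write, then reads only.
const chatty = Array.from({ length: 120 }, (_, i) => i * 30_000);
// One call every 6 minutes for an hour: every call misses.
const sparse = Array.from({ length: 10 }, (_, i) => i * 6 * 60_000);
console.log(simulateCache(chatty), simulateCache(sparse));
```

The chatty agent pays for one write and 119 reads; the sparse one pays the write premium on all 10 calls, which is worse than not caching at all.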

The API builds the prompt prefix in the order tools, then system, then messages, so tool definitions sit ahead of the marked system block and are cached along with it. The corollary is the painful one: changing one byte in any tool description invalidates the entire prefix, not just that tool.

Note: Billing distinguishes cache_creation_input_tokens (first write, ~25% premium) from cache_read_input_tokens (subsequent reads, ~10% of base). Both appear separately in the usage object, so log them.
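One way to log those fields is to flatten each response's usage into a row and label it as a cache read, a cache write, or neither. The sketch below assumes only the four counters named in this article; the `Usage` interface, the `summarizeUsage` helper, and the row shape are hypothetical and not part of the SDK.

```typescript
// Shape of the four usage counters named in this article.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

type CacheState = "read" | "write" | "none";

// Hypothetical helper: classify a response and flatten it into a log row.
// A response that both reads and extends the cache is labelled "read".
function summarizeUsage(usage: Usage) {
  const written = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  const state: CacheState = read > 0 ? "read" : written > 0 ? "write" : "none";
  return {
    state,
    uncachedInput: usage.input_tokens,
    cacheWritten: written,
    cacheRead: read,
    output: usage.output_tokens,
  };
}

console.log(
  summarizeUsage({
    input_tokens: 120,
    output_tokens: 300,
    cache_creation_input_tokens: 0,
    cache_read_input_tokens: 13_000,
  })
);
```

In practice the argument would be `response.usage` from the call above, and the returned row would go to whatever table or log sink you already use.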

Ordering is everything: stable before, volatile after

The only rule you need to internalize: stable content before the breakpoint, volatile content after. It sounds trivial until you look at a real prompt.

A common mistake

We injected today's date into the system prompt so the model would know "today". The cache silently invalidated every midnight, and we saw a clean cache-miss spike at 00:00 IL until we traced it.

The right shape

// BEFORE breakpoint — never changes
system: [{
  type: "text",
  text: IDENTITY_AND_RULES, // "You are Hive Builder. Iron rules: ..."
  cache_control: { type: "ephemeral" }
}],
tools: TOOLS_CORE, // 12 tools, stable for weeks; processed ahead of system, so also cached

// AFTER breakpoint — changes per turn
messages: [
  { role: "user", content: `Date: ${today}\nProject: ${projectId}\n${userMsg}` }
]

We pulled date, project id, and username out of the system prompt and into the first user message. Cache hit rate jumped from 31% to 94%, and the API bill for that workload dropped 72% the same week.
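The split can be enforced in code by giving the request builder no way to interpolate per-turn values into the prefix. A minimal sketch follows, assuming a `buildRequest` helper of our own invention; the `read_file` tool and the contents of `IDENTITY_AND_RULES` and `TOOLS_CORE` are placeholders, not the real production values.

```typescript
interface TurnInput {
  today: string;
  projectId: string;
  userMsg: string;
}

// Placeholder constants standing in for the real prompt and tool list.
const IDENTITY_AND_RULES = "You are Hive Builder. Iron rules: ...";
const TOOLS_CORE = [
  {
    name: "read_file", // hypothetical tool, for illustration only
    description: "Read a file from the project.",
    input_schema: { type: "object", properties: { path: { type: "string" } } },
  },
];

function buildRequest({ today, projectId, userMsg }: TurnInput) {
  return {
    // Stable prefix: no per-turn values are interpolated here.
    tools: TOOLS_CORE,
    system: [
      { type: "text", text: IDENTITY_AND_RULES, cache_control: { type: "ephemeral" } },
    ],
    // Volatile suffix: everything that changes per turn lives in the message.
    messages: [
      { role: "user", content: `Date: ${today}\nProject: ${projectId}\n${userMsg}` },
    ],
  };
}

// The cacheable prefix is byte-identical across days and projects.
const a = buildRequest({ today: "2025-01-01", projectId: "p1", userMsg: "hi" });
const b = buildRequest({ today: "2025-01-02", projectId: "p2", userMsg: "yo" });
const prefix = (r: typeof a) => JSON.stringify([r.tools, r.system]);
console.log(prefix(a) === prefix(b)); // true
```

A check like the last three lines also works as a unit test that fails the build when someone slips a volatile value into the prefix.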

Multiple breakpoints: up to four

You are not limited to one breakpoint. The API accepts up to four simultaneously, which is useful when distinct layers age at different rates. The shape we run in production:

Layer 1 (stable for months): identity + iron rules.
Layer 2 (stable for days): tool definitions + memory/core.md.
Layer 3 (stable for minutes): conversation history up to but not including the latest turn.
Volatile: the latest user message.

system: [
  { type: "text", text: IDENTITY, cache_control: { type: "ephemeral" } },
  { type: "text", text: TOOLS_AND_MEMORY, cache_control: { type: "ephemeral" } }
],
messages: [
  ...history.map((m, i) => ({
    ...m,
    ...(i === history.length - 1
      ? { content: [{ type: "text", text: m.content,
            cache_control: { type: "ephemeral" } }] }
      : {})
  })),
  { role: "user", content: latestUserMessage }
]

The reasoning: even when an outer layer is invalidated, the layers before it stay warm. In long conversations this layered pattern (the example above uses three of the four available breakpoints) lets each new turn extend the cached prefix rather than rebuild it.
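The inline spread in the example above assumes every history message has plain string content. A slightly more defensive version is sketched below as a standalone helper. `markLastTurn` and the simplified `Message` and `TextBlock` types are hypothetical; they cover only string content and text-only block arrays, not images or tool results.

```typescript
type TextBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};
type Message = { role: "user" | "assistant"; content: string | TextBlock[] };

// Returns a copy of `history` whose final message carries the breakpoint
// on its last content block. The input array is not mutated.
function markLastTurn(history: Message[]): Message[] {
  return history.map((m, i) => {
    if (i !== history.length - 1) return m;
    const blocks: TextBlock[] =
      typeof m.content === "string"
        ? [{ type: "text", text: m.content }]
        : m.content;
    if (blocks.length === 0) return m; // nothing to mark
    const marked = blocks.map((b, j) =>
      j === blocks.length - 1
        ? { ...b, cache_control: { type: "ephemeral" as const } }
        : b
    );
    return { ...m, content: marked };
  });
}

const history: Message[] = [
  { role: "user", content: "first question" },
  { role: "assistant", content: "first answer" },
];
const markedHistory = markLastTurn(history);
console.log(JSON.stringify(markedHistory[1]));
```

The result can be spread into `messages` ahead of the latest user message, in place of the inline `history.map(...)` expression.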

Measuring: cache hit rate is a first-class metric

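Every response returns a usage object that includes four token counters: input_tokens, cache_creation_input_tokens, cache_read_input_tokens, and output_tokens. We log all four to a table and compute an hourly hit rate. The query below aggregates across all endpoints, so add your own endpoint column to the grouping for a per-endpoint view.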

SELECT
  date_trunc('hour', created_at) AS hour,
  -- Share of all input tokens that were served from cache.
  SUM(cache_read_input_tokens)::float
    / NULLIF(SUM(cache_read_input_tokens + cache_creation_input_tokens + input_tokens), 0)
    AS hit_rate,
  -- Token counts, not money: cache writes in this sum carry the ~25% premium.
  SUM(input_tokens + cache_creation_input_tokens) AS billed_full_price,
  SUM(cache_read_input_tokens) AS billed_discounted
FROM api_usage
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY 1 ORDER BY 1;

If hit rate suddenly drops below 80%, it almost always means something in the system prompt or tool list changed: usually a deploy, occasionally a well-intentioned developer who interpolated a user id into a tool description (do not do that). We page via Grafana when hit rate stays below 70% for 15 minutes.
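The same ratio is easy to compute in application code, for example in a test or a lightweight in-process monitor. The sketch below mirrors the SQL formula. The `UsageRow` shape, `hitRate`, and `shouldPage` are illustrative names, and the alert rule is only an approximation of "below 70% for 15 minutes", assuming one rate per time window.

```typescript
interface UsageRow {
  input_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
}

// Same ratio as the SQL above: cached reads over all input tokens.
function hitRate(rows: UsageRow[]): number | null {
  let read = 0;
  let total = 0;
  for (const r of rows) {
    read += r.cache_read_input_tokens;
    total +=
      r.cache_read_input_tokens + r.cache_creation_input_tokens + r.input_tokens;
  }
  return total === 0 ? null : read / total; // null mirrors NULLIF(..., 0)
}

// Hypothetical alert rule: page only if every known window is below the threshold.
function shouldPage(windowRates: (number | null)[], threshold = 0.7): boolean {
  const known = windowRates.filter((r): r is number => r !== null);
  return known.length > 0 && known.every((r) => r < threshold);
}

const rate = hitRate([
  { input_tokens: 500, cache_creation_input_tokens: 0, cache_read_input_tokens: 9_500 },
  { input_tokens: 500, cache_creation_input_tokens: 9_500, cache_read_input_tokens: 0 },
]);
console.log(rate, shouldPage([0.65, 0.6, 0.69]));
```

Note that this is a token-weighted hit rate, not a per-request one: a single large cache write drags it down more than several small misses.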

Pitfall: If you A/B test the system prompt, each variant creates a separate cache entry. Each entry pays its own write premium and sees only its share of the traffic that keeps it warm, so the aggregate hit rate dips even when nothing is wrong. Budget for it.

Pitfalls worth knowing before you ship

This is one of the simpler features Anthropic ships, and it still bites. A list from our own production:

One byte = full invalidation. A comma-instead-of-period in a tool description nukes the whole prefix. Version your system prompt and tool definitions and know when they change.
Time-of-day in the system block. Already mentioned. The same goes for process.env.HOSTNAME interpolated into the prompt when pods rotate.
Caches do not cross accounts. If you serve multiple customers, each under its own account and API keys, each one builds its own cache. Plan cache-creation budget against active account count.
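First call is more expensive. Cache writes carry a ~25% premium. If a prefix is never read back before it expires, caching loses money; at the rates above, a single read within the TTL already pays the premium back (125% + 10% versus 200% uncached). For short-lived single-shot agents, leave it off.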
Five minutes is short. If the agent goes silent for six minutes the cache dies. For rare jobs, consider a lightweight heartbeat to keep the prefix warm.
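The cost trade-off behind the write-premium and TTL items in the list above can be made concrete with the article's own multipliers. This is back-of-the-envelope arithmetic, not a pricing calculator: costs are in percent of one uncached send of the prefix, and the cached case assumes one write followed only by reads, so the cache never expires between calls. The function names are illustrative.

```typescript
const WRITE_PCT = 125; // cache write: ~25% premium over base
const READ_PCT = 10; // cache read: ~10% of base

// Cost of sending the same prefix `calls` times, in percent of one uncached send.
// Assumes a single write followed by reads, i.e. the cache stays warm throughout.
function cachedCostPct(calls: number): number {
  return calls === 0 ? 0 : WRITE_PCT + (calls - 1) * READ_PCT;
}

function uncachedCostPct(calls: number): number {
  return calls * 100;
}

// Worst case for caching: the entry expires before every call, so each call is a write.
function alwaysExpiredCostPct(calls: number): number {
  return calls * WRITE_PCT;
}

for (const calls of [1, 2, 3, 10]) {
  console.log(calls, cachedCostPct(calls), uncachedCostPct(calls), alwaysExpiredCostPct(calls));
}
```

With one call, caching costs 125 against 100. With two calls it is 135 against 200, and the gap widens from there. The last column is the case a heartbeat is meant to avoid: a job that always arrives after expiry pays 125 per call indefinitely.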

Simple habit: diff the system prompt and tool definitions on every deploy. If they changed, a cache-miss spike is expected — note it in the runbook so on-call does not panic.
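One lightweight way to make that diff mechanical is to fingerprint the cached prefix at build time and compare it with the fingerprint of the currently deployed version. The sketch below uses a small FNV-1a style hash over UTF-16 code units, which is enough to detect a change but is not cryptographic. `prefixFingerprint` and the sample tool are hypothetical, and the serialization is a stand-in for however you actually assemble the prompt.

```typescript
// 32-bit FNV-1a style hash over UTF-16 code units.
// Non-cryptographic: good enough to notice that the prefix bytes changed.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

// Hypothetical helper: fingerprint everything that sits in the cached prefix.
function prefixFingerprint(systemPrompt: string, tools: unknown[]): string {
  return fnv1a(JSON.stringify({ tools, systemPrompt }));
}

const toolsBefore = [{ name: "search", description: "Search the docs." }];
const toolsAfter = [{ name: "search", description: "Search the docs," }]; // period became a comma

const before = prefixFingerprint("You are a helpful agent.", toolsBefore);
const after = prefixFingerprint("You are a helpful agent.", toolsAfter);
console.log(
  before === after ? "prefix unchanged" : "prefix changed: expect a cache-miss spike"
);
```

Logging the fingerprint alongside each usage row also makes it straightforward to line up a hit-rate drop with the deploy that caused it.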