Memory in three tiers: HOT / WARM / COLD

TL;DR

Infinite context is the most expensive myth in agent design. Our answer is three tiers: HOT (always loaded, ≤100 lines), WARM (recent days and decisions, auto-loaded), COLD (archive, retrieved on demand). Hybrid retrieval — FTS plus trigram plus vector on PostgreSQL — surfaces the right facts without blowing the window.

Why tiers, not one big context

The classic temptation is to dump everything we know about the user into the prompt: chat history, past decisions, project facts, todo list. With a 200K-token window it is tempting to assume there is room. There is not. What fills the window instead is noise that pulls the model's attention away from the current task.

In our internal dogfooding, the same agent fed wide historical context failed on tasks it solved cleanly with a focused context of the same budget. The difference is not capability — it is attention. Anthropic's "needle in a haystack" work documents the same effect: as the window fills, single-fact recall degrades.

The fix is three physical tiers with explicit promotion and demotion rules. Every fact knows which tier it lives in, and there is machinery to move it between tiers based on actual usage.

HOT: identity, iron rules, current state

The HOT tier is one file — memory/core.md — capped at 100 lines, loaded on every call. What earns a place there is what the agent must know before any decision: identity, iron rules, current project state, active constraints.

A real excerpt from one of our agents:

# memory/core.md
## Identity
Name: Hive Builder. Hebrew-first. Never apologize for using Hebrew.

## Iron rules
1. Never call build_and_deploy twice on the same intent in 60s.
2. Never claim "done" before /healthz returns 200.
3. Tool name = one verb. Description for the model, not the user.

## Active project
project_id: hv-3812 (Pizza Roma)
stack: Express + PG
last_deploy: <iso-timestamp> IL — passed critic loop

## Open blockers
none

The critical rule: anything that could live in WARM does not get into HOT. The agent itself edits this file through an update_core_memory tool that runs a short LLM review before writing — proposals that push the file past 100 lines are rejected with a suggestion to archive instead.
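
The line-cap half of that gate can be sketched in a few lines. This is an illustration only, not the actual update_core_memory implementation: the function name, the result type, and the max_lines parameter are assumptions, and the LLM review step is left out entirely.

```python
from dataclasses import dataclass

MAX_CORE_LINES = 100  # the HOT cap stated in the text


@dataclass
class CoreUpdateResult:
    accepted: bool
    reason: str


def check_core_update(proposed_text: str, max_lines: int = MAX_CORE_LINES) -> CoreUpdateResult:
    """Hypothetical pre-write check: reject proposals that exceed the line cap."""
    line_count = len(proposed_text.splitlines())
    if line_count > max_lines:
        overflow = line_count - max_lines
        return CoreUpdateResult(
            accepted=False,
            reason=(
                f"{line_count} lines exceeds cap of {max_lines}; "
                f"archive at least {overflow} lines to WARM or COLD"
            ),
        )
    return CoreUpdateResult(accepted=True, reason=f"{line_count}/{max_lines} lines")
```

A rejected proposal carries the overflow count, so the agent can decide what to move out of HOT before retrying.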

WARM: recent days and decisions

WARM is two file types: memory/daily/YYYY-MM-DD.md (yesterday and the day before) and memory/decisions.md (decisions whose effect outlives the day they were made). Both auto-load, but narrowly: only the last 2 days, and only the most recent 20 decisions by date.
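
The selection logic behind that auto-load might look roughly like the sketch below. It assumes daily logs are addressed by the path pattern given above and that decisions are available as (date, text) pairs. The text does not specify the internal format of decisions.md, so that representation and both function names are hypothetical.

```python
from datetime import date, timedelta


def select_daily_files(available: list[str], today: date, days: int = 2) -> list[str]:
    """Return the daily-log paths for the `days` days before `today` that exist."""
    wanted = [
        f"memory/daily/{(today - timedelta(days=offset)).isoformat()}.md"
        for offset in range(1, days + 1)
    ]
    present = set(available)
    return [path for path in wanted if path in present]


def select_recent_decisions(decisions: list[tuple[date, str]], limit: int = 20) -> list[str]:
    """Return the text of the `limit` most recent decisions, newest first."""
    newest_first = sorted(decisions, key=lambda item: item[0], reverse=True)
    return [text for _, text in newest_first[:limit]]
```

Both limits are parameters, so the loaded WARM budget stays explicit and easy to audit.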

What belongs in WARM:

"Today we built the pizza store and added Stripe checkout."
"Decided not to add auth for MVP — revisit at 100 users."
"Known bug: TLS occasionally expires in the sandbox; restart fixes it."

The promotion rule from WARM to HOT: a fact the agent referenced more than 3 times in 7 days becomes a candidate for HOT. The demotion rule to COLD: untouched for 30 days. Both checks run in a nightly cron at 03:00 IL.

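-- promote to HOT: WARM facts referenced more than 3 times in the last 7 days
-- (assumes long_term_memory holds one row per reference event)
SELECT memory_id
FROM long_term_memory
WHERE tier = 'WARM'               -- filter rows here; tier is not grouped, so it cannot go in HAVING
  AND last_referenced_at > NOW() - INTERVAL '7 days'
GROUP BY memory_id
HAVING COUNT(*) > 3;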
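
The two rules can also be written as one small decision function, which makes the thresholds easy to unit-test outside the database. This is a sketch under assumptions: the function name is hypothetical, demotion is applied to WARM facts only, and a fact with no recorded references is treated as untouched.

```python
from datetime import datetime, timedelta


def next_tier(tier: str, reference_times: list[datetime], now: datetime) -> str:
    """Apply the stated promotion and demotion rules to one fact."""
    week_ago = now - timedelta(days=7)
    month_ago = now - timedelta(days=30)
    recent_refs = sum(1 for t in reference_times if t > week_ago)
    last_ref = max(reference_times, default=None)

    if tier == "WARM" and recent_refs > 3:
        return "HOT"   # more than 3 references in 7 days
    if tier == "WARM" and (last_ref is None or last_ref < month_ago):
        return "COLD"  # untouched for 30 days
    return tier
```

Passing `now` in, rather than reading the clock inside the function, keeps the rule deterministic under test.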

COLD: archive retrieved on demand

COLD is everything else: old sessions, superseded decisions, finished projects. It is never auto-loaded. The agent reaches it only through an explicit search tool (search_memory(query, k=5)) that returns the top 5 most relevant chunks.

We back COLD with a memory_chunks table configured for hybrid retrieval:

-- pg_trgm provides gin_trgm_ops; pgvector provides VECTOR and ivfflat
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE memory_chunks (
  id UUID PRIMARY KEY,
  user_id TEXT NOT NULL,
  text TEXT NOT NULL,
  text_tsv TSVECTOR,            -- FTS
  embedding VECTOR(1024),       -- semantic
  archived_at TIMESTAMPTZ,
  source TEXT                   -- daily | decision | session
);

CREATE INDEX ON memory_chunks USING GIN (text_tsv);                          -- keyword search
CREATE INDEX ON memory_chunks USING GIN (text gin_trgm_ops);                 -- fuzzy match
CREATE INDEX ON memory_chunks USING ivfflat (embedding vector_cosine_ops);   -- cosine similarity

Three indexes, three query modes. Together they give high recall across misspellings (trigram), exact keywords (FTS), and semantic intent (vector). On internal dogfooding volumes the schema serves results under 40ms p95; once the table grows you will need to tune the ivfflat lists parameter.

Hybrid retrieval: FTS + trigram + vector

Vector-only search sounds modern but misses on specific keywords — code identifiers, system names, dates. FTS-only misses on semantics. The fix is to combine all three with a weighted score.

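-- $1 = query text, $2 = query embedding, $3 = user_id
WITH fts AS (
  -- keyword candidates, scoped to the user before the LIMIT
  SELECT id, ts_rank(text_tsv, plainto_tsquery($1)) AS s
  FROM memory_chunks
  WHERE user_id = $3 AND text_tsv @@ plainto_tsquery($1)
  ORDER BY s DESC LIMIT 30
),
trig AS (
  -- fuzzy candidates (pg_trgm), tolerant of misspellings
  SELECT id, similarity(text, $1) AS s
  FROM memory_chunks
  WHERE user_id = $3 AND text % $1
  ORDER BY s DESC LIMIT 30
),
vec AS (
  -- semantic candidates; <=> is pgvector cosine distance
  SELECT id, 1 - (embedding <=> $2::vector) AS s
  FROM memory_chunks
  WHERE user_id = $3
  ORDER BY embedding <=> $2::vector LIMIT 30
)
SELECT m.id, m.text,
  COALESCE(fts.s, 0) * 0.4
  + COALESCE(trig.s, 0) * 0.2
  + COALESCE(vec.s, 0) * 0.4 AS score
FROM memory_chunks m
LEFT JOIN fts USING (id)
LEFT JOIN trig USING (id)
LEFT JOIN vec USING (id)
WHERE m.user_id = $3
  AND COALESCE(fts.s, trig.s, vec.s) IS NOT NULL  -- skip chunks no retriever returned
ORDER BY score DESC LIMIT 5;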

The 0.4/0.2/0.4 weights came from an eval over 200 labeled queries. If the category mix shifts — for example, more Hebrew queries where trigram is unusually strong — we re-tune. Do not freeze weights without an eval.
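
A re-tuning pass of that kind can be small. The sketch below is not the eval harness described above; it is a minimal grid search under stated assumptions: each labeled query comes with per-retriever scores already computed, there is exactly one expected chunk per query, and the metric is top-k hit rate. All names (fuse, hit_rate, grid_search) are hypothetical.

```python
from itertools import product

Scores = dict[str, dict[str, float]]  # retriever name -> {chunk_id: score}


def fuse(scores: Scores, weights: dict[str, float], k: int = 5) -> list[str]:
    """Weighted sum of per-retriever scores; a missing score counts as 0."""
    ids = {cid for per_retriever in scores.values() for cid in per_retriever}
    total = {
        cid: sum(w * scores.get(name, {}).get(cid, 0.0) for name, w in weights.items())
        for cid in ids
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(total, key=lambda cid: (-total[cid], cid))[:k]


def hit_rate(dataset: list[tuple[Scores, str]], weights: dict[str, float], k: int = 5) -> float:
    """Fraction of labeled queries whose expected chunk appears in the top k."""
    hits = sum(1 for scores, expected in dataset if expected in fuse(scores, weights, k))
    return hits / len(dataset)


def grid_search(dataset: list[tuple[Scores, str]], step: float = 0.2, k: int = 5):
    """Try every (fts, trig, vec) weight triple on a grid that sums to 1."""
    n = round(1 / step)
    best = None
    for a, b in product(range(n + 1), repeat=2):
        if a + b > n:
            continue
        weights = {"fts": a / n, "trig": b / n, "vec": (n - a - b) / n}
        score = hit_rate(dataset, weights, k)
        if best is None or score > best[0]:
            best = (score, weights)
    return best
```

With a coarse step the grid has only a few dozen points, so the search is cheap enough to rerun whenever the query mix changes.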

The big pitfall: forgetting to promote and demote

The mistake we have seen in our system and in others: people build the three tiers and then forget the cron that moves items between them. Within a month, HOT bloats, WARM ossifies, and COLD grows without bound.

What helps:

Per-tier metrics: HOT line count, WARM file count, COLD chunk count. A simple dashboard with all three.
Alert on HOT > 100 lines: it breaks the contract, and it always means something in the promotion cron broke.
Monthly review of COLD reads: if a chunk has not been read in 6 months, delete it. This is not a forever archive — it is a DB with a bill.
Do not auto-promote without a human or LLM review. Pure usage-based promotion fills HOT with junk: stale dates, project names that no longer matter. Our cron proposes; a human or an LLM approves.
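
The first three checks in that list reduce to a handful of comparisons. The following is a rough sketch, not the dashboard code: the TierMetrics type and both function names are hypothetical, and the per-chunk last-read timestamps are an assumption, since the memory_chunks schema shown earlier has no such column.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TierMetrics:
    hot_lines: int
    warm_files: int
    cold_chunks: int


def tier_alerts(metrics: TierMetrics, hot_cap: int = 100) -> list[str]:
    """Return alert messages; the HOT cap is the only hard limit in the text."""
    alerts = []
    if metrics.hot_lines > hot_cap:
        alerts.append(f"HOT has {metrics.hot_lines} lines (cap {hot_cap})")
    return alerts


def stale_chunk_ids(
    last_read: dict[str, datetime], now: datetime, max_idle_days: int = 180
) -> list[str]:
    """IDs of COLD chunks not read within max_idle_days (roughly 6 months)."""
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(cid for cid, read_at in last_read.items() if read_at < cutoff)
```

The stale list is a deletion proposal rather than an automatic delete, in line with the review step the last item calls for.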

Pitfall: An agent that "remembers everything" is not smarter. It is slower, more expensive, and less focused. Good memory is memory that forgets on purpose.