
Context Engineering Is Working Memory Design for AI Agents

AI Agents · Context Engineering · Memory · Reliability
2026-04-22 · Homer Quan

Most people still explain LLM performance through the model:

  • bigger model
  • longer context window
  • better benchmark score
  • faster inference

But in real agent systems, failures increasingly come from something else:

working memory.

The model may be capable, but the system gives it the wrong context. It sees too much noise, misses the key fact, trusts stale information, forgets what already happened, or mixes private policy with raw user text.

This is why we are moving from prompt engineering to context engineering.

Prompt engineering asks:

How should we phrase the request?

Context engineering asks:

What should exist in the model's working memory right now?

That second question is harder — and much closer to system design.

Why agent failures are often memory failures

A single chatbot can survive with a long conversation history.

A multi-agent workflow cannot.

Once agents call tools, share evidence, delegate tasks, retry failed steps, and operate over long periods of time, the system needs a real memory discipline. Otherwise, the workflow turns into a pile of text that everyone reads differently.

Common agent failures are really context failures:

| Failure mode | What it looks like | Memory-design problem |
| --- | --- | --- |
| Context overload | The model sees too much and misses the important part | No prioritization or pruning |
| Missing facts | The agent ignores an earlier decision or requirement | Important state was never promoted into working memory |
| Stale memory | Old facts override newer updates | No freshness, expiry, or conflict policy |
| Summary drift | A compressed summary slowly changes the meaning | Compression is not validated against source state |
| Tool amnesia | The agent repeats an API call or sends duplicate work | Operational state lives only in chat history |
| Policy leakage | Private rules or unsafe data enter the wrong prompt | No boundary between policy, evidence, and task context |
| Cross-agent noise | Every agent sees everything and loses focus | No role-specific context isolation |

The key point:

Most agent systems do not need more memory. They need better memory management.

Prompt engineering vs. context engineering

Prompt engineering is still useful. But it is not enough for durable agent systems.

| Layer | Main question | Main artifact | Typical failure |
| --- | --- | --- | --- |
| Prompt engineering | How should the model behave? | Instructions, examples, style rules | The prompt is good, but the evidence is wrong |
| RAG | What external knowledge should we retrieve? | Retrieved chunks | The chunks are stale, duplicated, or too broad |
| Context compression | How do we reduce token cost? | Summaries, compressed text, selected tokens | The wrong details are removed |
| Long-term memory | What should the system remember across sessions? | User facts, project memory, vector/graph records | Memory grows forever or remembers noise |
| Runtime state | What actually happened? | Events, checkpoints, tool calls, approvals | The model invents state from text |
| Context engineering | What belongs in working memory for this step? | Scoped context packet | No policy decides what the model should see |

A useful distinction:

Prompting shapes behavior. Context engineering shapes operating conditions. Runtime memory preserves truth.

The working memory model

A good agent should not send the model "everything we have."

It should send a scoped working memory image for the current step.

That working memory image should answer five questions:

| Question | Why it matters |
| --- | --- |
| What is the current goal? | Prevents the model from optimizing for the wrong task |
| What facts are active? | Keeps the reasoning grounded in relevant evidence |
| What has already happened? | Prevents duplicate work and false assumptions |
| What is allowed? | Enforces tool, privacy, and approval boundaries |
| What must be produced? | Makes the output checkable by the runtime |

In other words, context is not just text. It is a runtime-controlled view of the world.
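
As a minimal sketch, that view can be an explicit structure rather than a prompt string. The field names below are illustrative and map one-to-one to the five questions above.

```python
from dataclasses import dataclass

@dataclass
class WorkingMemoryImage:
    """One step's working memory image; field names are illustrative."""
    goal: str                 # What is the current goal?
    active_facts: list[str]   # What facts are active?
    prior_events: list[str]   # What has already happened?
    allowed_tools: list[str]  # What is allowed?
    output_contract: dict     # What must be produced?

    def render(self) -> str:
        # Serialize the image into the text the model actually sees this step.
        return "\n".join([
            f"GOAL: {self.goal}",
            "ACTIVE FACTS:\n" + "\n".join(f"- {f}" for f in self.active_facts),
            "ALREADY HAPPENED:\n" + "\n".join(f"- {e}" for e in self.prior_events),
            f"ALLOWED TOOLS: {', '.join(self.allowed_tools)}",
            f"OUTPUT CONTRACT: {self.output_contract}",
        ])
```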

[Flowchart (Mermaid): the per-step working-memory loop, from assembling scoped context to validating and committing at the commit boundary]

The most important part of this loop is the commit boundary.

A model response should not automatically become truth. The runtime should interpret it, validate it, and commit only the parts that are safe, useful, and consistent with the workflow state.
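
A sketch of that boundary, assuming the model is asked for JSON with the fields used in the context packet later in this post; the state dictionary and field names are hypothetical.

```python
import json

REQUIRED_FIELDS = {"subject", "body", "evidence_used", "needs_human_review"}

def commit_model_response(raw_response: str, state: dict) -> dict:
    """Interpret and validate a model response before any of it becomes state."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        state["pending_review"] = {"reason": "unparseable output", "raw": raw_response}
        return state

    missing = REQUIRED_FIELDS - parsed.keys()
    if missing:
        state["pending_review"] = {"reason": f"missing fields: {sorted(missing)}"}
        return state

    # Only commit claims backed by evidence the runtime actually supplied.
    known_sources = {e["source"] for e in state.get("retrieved_evidence", [])}
    if not set(parsed["evidence_used"]) <= known_sources:
        state["pending_review"] = {"reason": "cites evidence the runtime never provided"}
        return state

    state["draft"] = {"subject": parsed["subject"], "body": parsed["body"]}
    state["needs_human_review"] = bool(parsed["needs_human_review"])
    return state
```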

The three kinds of memory every serious agent needs

Most teams talk about "memory" as if it were one thing.

It is not.

A reliable agent system needs at least three different memory types.

1. Semantic memory

Semantic memory is what the agent knows for reasoning:

  • documents
  • prior conversations
  • user preferences
  • examples
  • search results
  • retrieved facts
  • domain knowledge

This is where RAG, vector search, graph memory, and contextual compression are useful.

But semantic memory is not automatically authoritative. Retrieved text may be stale, duplicated, adversarial, incomplete, or irrelevant. It needs provenance, trust scoring, and conflict handling.

2. Operational memory

Operational memory is what the system knows happened:

  • tool calls
  • retries
  • failures
  • approvals
  • checkpoints
  • generated artifacts
  • committed side effects
  • active branches of a workflow

This memory should not depend on the model "remembering" it.

The runtime should know whether an email was already sent, whether an API call succeeded, whether a file was generated, and whether a human approval is still pending.

3. Policy memory

Policy memory defines what the system is allowed to do:

  • tool permissions
  • data-access rules
  • approval thresholds
  • privacy boundaries
  • escalation paths
  • sandbox limits
  • forbidden actions

This is where context engineering becomes security engineering.

Memory separation is the foundation

| Memory type | Owned by | Should the model directly control it? | Example |
| --- | --- | --- | --- |
| Semantic memory | Knowledge layer | Partially | Retrieved product docs for the current question |
| Operational memory | Runtime | No | Step 4 succeeded, retry count is 2, approval is pending |
| Policy memory | System/security layer | No | This agent may draft emails but may not send them |

This separation matters because each memory type has a different failure mode.

Semantic memory can be noisy. Operational memory must be exact. Policy memory must be enforced.
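
One way to keep those failure modes apart is to give each memory type its own interface, so the model can read from all three but only the runtime can write operational and policy state. A minimal sketch with hypothetical names:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticMemory:
    """Knowledge for reasoning: retrieved and scored, never authoritative by itself."""
    facts: list[dict] = field(default_factory=list)  # {"text", "source", "trust_score"}

    def retrieve(self, query: str, k: int = 5) -> list[dict]:
        # Placeholder ranking; a real system would use vector, keyword, or graph search.
        return sorted(self.facts, key=lambda f: f["trust_score"], reverse=True)[:k]

@dataclass
class OperationalMemory:
    """What actually happened: written only by the runtime, never by the model."""
    events: list[dict] = field(default_factory=list)

    def record(self, event: dict) -> None:
        self.events.append(event)

    def already_done(self, action_id: str) -> bool:
        # Lets the runtime refuse duplicate side effects, e.g. sending an email twice.
        return any(e.get("action_id") == action_id and e.get("ok") for e in self.events)

@dataclass(frozen=True)
class PolicyMemory:
    """What is allowed: enforced by the system, not negotiable in the prompt."""
    allowed_tools: frozenset
    forbidden_actions: frozenset

    def permits(self, tool: str) -> bool:
        return tool in self.allowed_tools
```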

Multi-agent memory: share less, not more

A common mistake in multi-agent design is to make all agents share the same full context.

That feels collaborative, but it often creates noise.

Human organizations do not work that way. A lawyer, engineer, salesperson, and auditor may work on the same project, but they do not all need the same information at the same time. Each role gets a filtered view shaped by responsibility.

Multi-agent systems need the same pattern.

| Agent role | Should see | Should not see | Why |
| --- | --- | --- | --- |
| Research agent | Source documents, search results, retrieval notes | Private policy rules, unrelated tool logs | Keeps research focused on evidence |
| Planning agent | Goal, constraints, available tools, summarized evidence | Raw noisy transcripts unless needed | Prevents overfitting to low-level details |
| Execution agent | Current step, allowed tools, exact inputs | Full long-term memory or unrelated history | Reduces accidental side effects |
| Critic agent | Proposed output, evidence used, rules to check | Hidden chain of operational retries unless relevant | Focuses review on correctness |
| Memory manager | Events, summaries, provenance, retention policy | Unnecessary private task content | Controls what gets promoted, compressed, or forgotten |

The design principle is simple:

Shared state should be structured. Agent context should be scoped.

A shared memory layer does not mean every agent sees everything. It means every agent receives the smallest useful view for its role.
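
A sketch of that scoping over a shared, structured state; the role names and keys below are illustrative, not a fixed schema.

```python
# Illustrative mapping from agent role to the slice of shared state it may see.
ROLE_VIEWS = {
    "research":  ["goal", "source_documents", "search_results", "retrieval_notes"],
    "planning":  ["goal", "constraints", "available_tools", "evidence_summary"],
    "execution": ["current_step", "allowed_tools", "step_inputs"],
    "critic":    ["proposed_output", "evidence_used", "review_rules"],
}

def scoped_view(shared_state: dict, role: str) -> dict:
    """Return the smallest useful view of the shared state for one agent role."""
    keys = ROLE_VIEWS.get(role, [])
    return {k: shared_state[k] for k in keys if k in shared_state}
```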

[Flowchart (Mermaid): a shared structured memory layer serving role-scoped views to each agent]

The shared memory is not a transcript. It is a structured coordination layer.

Context as RAM

A useful way to reason about agent context is to map it to computer memory.

| Memory tier | AI-agent equivalent | Design goal | Example |
| --- | --- | --- | --- |
| Register | Current instruction | Tiny, precise, immediately active | "Classify this evidence" |
| L1 cache | Current task facts | Very small, high precision | Active constraint, current user goal |
| L2 cache | Recent tool results and compressed summaries | Fast reuse without flooding the prompt | Last API response, latest plan summary |
| RAM | Queryable working memory | Flexible task-level recall | Project state, selected user preferences |
| Disk | Durable logs and source systems | Full history, not always loaded | Transcripts, files, database rows |
| Kernel state | Runtime checkpoints and permissions | Truth the model should not invent | Approval status, retry count, allowed tools |

This is why "just increase the context window" misses the point.

More RAM does not remove the need for memory management. It makes memory management more important.

The techniques that matter now

Context engineering has moved beyond naive summaries. The current landscape looks more like a memory hierarchy.

| Technique | What it does | Best used for | Risk |
| --- | --- | --- | --- |
| Importance-based pruning | Keeps high-signal content and drops low-value text | Hot context before model calls | May remove rare but decisive facts |
| Contextual RAG compression | Compresses retrieved documents relative to the query | Retrieval-heavy workflows | May compress away source nuance |
| Adaptive memory | Extracts, updates, and retrieves salient memories | Personal agents and long-running assistants | May remember wrong or sensitive facts |
| Graph memory | Represents relationships among facts | Multi-step reasoning and entity-heavy domains | Graph quality can drift without curation |
| Hierarchical summaries | Compresses chunks into layered summaries | Long documents and repeated loops | Summary drift over time |
| Event logs | Records what happened exactly | Runtime recovery and auditability | Too verbose for direct model context |
| KV-cache compression | Optimizes internal inference memory | Serving infrastructure | Not a replacement for product memory |

The important part is that these layers solve different problems.

Compressing a retrieved document is not the same as remembering a user preference. Remembering a user preference is not the same as knowing whether a tool call already happened. Knowing whether a tool call already happened is not the same as deciding which evidence should enter the next model call.

A serious agent system needs these distinctions.
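
As a concrete instance of the first row in the table above, here is a minimal sketch of importance-based pruning against a token budget. The scoring is assumed to exist already; real systems tune or learn it.

```python
def prune_context(items: list[dict], token_budget: int) -> list[dict]:
    """Keep the highest-signal items that fit the budget.
    Each item looks like {"text", "tokens", "importance", "pinned"}."""
    # Pinned items (approvals, decisions, the current goal) are never dropped.
    kept = [i for i in items if i.get("pinned")]
    used = sum(i["tokens"] for i in kept)
    candidates = sorted(
        (i for i in items if not i.get("pinned")),
        key=lambda i: i["importance"],
        reverse=True,
    )
    for item in candidates:
        if used + item["tokens"] <= token_budget:
            kept.append(item)
            used += item["tokens"]
    return kept
```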

Compression is becoming learned

Projects like LLMLingua show why context compression is becoming more sophisticated.

The key idea is not simply "summarize the prompt in English." Learned compression can identify lower-value tokens and preserve the parts of context that are more likely to matter for the target model.

That changes the mental model.

Compression does not have to produce beautiful prose for humans. It has to preserve the signal the model needs.
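
For illustration, this is roughly how LLMLingua's published interface is used. Exact parameters vary across versions, so treat the call below as an assumption and check the project's documentation.

```python
# Sketch based on LLMLingua's documented usage; verify parameter names
# against the version you install.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a small compression model

result = compressor.compress_prompt(
    retrieved_chunks,        # list of retrieved passages (assumed to exist)
    question=user_question,  # compression is conditioned on the query
    target_token=500,        # rough budget for the compressed context
)
compressed_context = result["compressed_prompt"]
```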

Memory is becoming adaptive

Systems like Mem0 point at another shift: memory should not be a passive transcript.

A personal or long-running agent should not keep every message forever. It should learn:

  • what matters
  • what changed
  • what expired
  • what conflicts
  • what should be retrieved only for a specific task
  • what should never be used without permission

This is basically LLM-native garbage collection.
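
A sketch of one garbage-collection pass over such a memory store; the fields and rules here are placeholders, not any particular library's implementation.

```python
from datetime import datetime, timedelta

def gc_pass(memories: list[dict], now: datetime) -> list[dict]:
    """Drop memories that changed, expired, or must not be used without permission.
    Each memory looks like
    {"text", "created", "ttl_days", "superseded_by", "restricted", "consent"}."""
    kept = []
    for m in memories:
        if m.get("superseded_by"):                        # what changed
            continue
        ttl = m.get("ttl_days")
        if ttl is not None and now - m["created"] > timedelta(days=ttl):
            continue                                      # what expired
        if m.get("restricted") and not m.get("consent"):  # never use without permission
            continue
        kept.append(m)                                    # what still matters
    return kept
```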

But adaptive memory alone is not enough. The agent also needs runtime memory: what happened, what was approved, what failed, what can be retried, and what must not be repeated.

That is the missing bridge between memory as personalization and memory as infrastructure.

The missing layer in personal AI agents

Many personal AI agent frameworks are powerful because they make local workflows tangible. They can operate tools, browse, write files, call APIs, and execute useful actions.

But the memory layer is often still naive.

Context grows uncontrollably.

Old state competes with new state.

Summaries drift.

Tool results are mixed with user intent.

There is no real garbage collector.

There is no adaptive prioritization.

There is no clear boundary between semantic memory, operational state, and policy.

That is the gap.

The next step for personal AI is not only a better model or a larger context window. It is a working memory layer between the runtime and the model.

Not a chatbot memory feature.

Not a vector database alone.

A runtime-level subsystem that decides what to keep, compress, forget, retrieve, isolate, and surface at the right moment.

What a working memory layer should do

A real working memory layer should behave less like a note-taking app and more like an operating-system subsystem.

| Capability | What it means | Why it matters |
| --- | --- | --- |
| Importance scoring | Decide which facts deserve to stay active | Prevents context overload |
| Recency handling | Prefer fresh state without blindly deleting old commitments | Prevents stale-memory bugs |
| Compression | Reduce low-risk context while preserving decisions | Saves tokens without losing truth |
| Forgetting | Remove stale, duplicated, expired, or unsafe context | Keeps memory clean and safe |
| Retrieval routing | Pull from the right memory source for the task | Avoids mixing unrelated knowledge |
| Provenance | Preserve where facts came from and how trusted they are | Enables audit and conflict resolution |
| Role scoping | Give each agent the right view | Reduces multi-agent noise |
| Policy checks | Keep unsafe or unauthorized context out of model calls | Treats context as a capability |
| Runtime integration | Tie memory to checkpoints, retries, approvals, and side effects | Makes agents reliable over time |
| Observability | Record what entered context and why | Makes failures debuggable |

The memory manager should not merely store information. It should make decisions about the working set.
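
A sketch of one such decision, mirroring the keep/compress/forget policy in the packet below; the thresholds and field names are illustrative.

```python
def classify_for_working_set(item: dict) -> str:
    """Decide the fate of one memory item for the current step.
    Items look like {"kind", "expired", "duplicate", "restricted", "tokens", "importance"}."""
    if item["kind"] in {"approval", "decision", "current_goal"}:
        return "keep"      # decisions and approvals stay verbatim
    if item["expired"] or item["duplicate"] or item["restricted"]:
        return "forget"    # stale, duplicated, or unsafe context is dropped
    if item["tokens"] > 400 and item["importance"] < 0.5:
        return "compress"  # bulky, low-risk context gets summarized
    return "keep"
```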

A useful context packet has structure

One practical shift is to stop thinking in terms of prompts and start thinking in terms of context packets.

A context packet is the working memory image sent to the model for one step. It should be explicit about purpose, scope, sources, constraints, permissions, and output expectations.

```yaml
context_packet:
  task:
    goal: "Draft the next email follow-up for approved leads."
    step_id: "campaign.followup.write_variant"
    run_id: "email-campaign-2026-04-21"
  durable_state:
    previous_steps:
      - "audience research completed"
      - "lead list approved by human reviewer"
    pending_approval: false
    retry_count: 1
  memory_policy:
    token_budget: 6000
    keep:
      - "current task"
      - "human approvals"
      - "source-of-record facts"
    compress:
      - "old conversation"
      - "low-risk research notes"
    forget:
      - "expired tool results"
      - "duplicate retrieved chunks"
  working_memory:
    instructions:
      - "Write in a concise, specific, non-spammy tone."
      - "Do not claim a meeting happened unless it appears in source data."
    retrieved_evidence:
      - source: "crm://lead/42"
        trust: "system-of-record"
      - source: "docs://campaign-brief"
        trust: "team-authored"
  boundaries:
    allowed_tools:
      - "draft_email"
    forbidden_actions:
      - "send_email_without_approval"
      - "export_contact_list"
  output_contract:
    format: "json"
    required_fields:
      - "subject"
      - "body"
      - "evidence_used"
      - "needs_human_review"
```

This looks more like an operating-system structure than a prompt.

That is the point.

The model sees enough to reason. The runtime owns the truth.

Context management needs a lifecycle

A durable AI workflow should manage context like a living resource.

[Flowchart (Mermaid): the context lifecycle, from capture and scoring to compression, retrieval, and expiry]

The lifecycle matters because memory changes over time. A fact can be useful now, stale tomorrow, dangerous in another workflow, and irrelevant next week.

Good context engineering is not a one-time prompt trick. It is a continuous memory-management loop.
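
One way to make that loop explicit is to name the states a memory item can be in and the transitions the runtime allows. The states below are an assumption, not a standard.

```python
# Illustrative lifecycle states for a single memory item and the
# transitions a runtime might allow between them.
LIFECYCLE = {
    "captured":   ["scored"],                 # raw event or retrieval enters the system
    "scored":     ["active", "archived"],     # importance and trust are assigned
    "active":     ["compressed", "expired"],  # part of the current working set
    "compressed": ["active", "expired"],      # summarized, but recoverable from source
    "archived":   ["active"],                 # on disk, loadable on demand
    "expired":    [],                         # deliberately forgotten
}

def can_transition(current: str, target: str) -> bool:
    return target in LIFECYCLE.get(current, [])
```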

Design principles for working memory

The most useful principles are simple:

| Principle | Meaning |
| --- | --- |
| Separate truth from text | Do not let chat history become the database |
| Scope by role | Each agent gets the context needed for its job, not everything |
| Preserve provenance | Every important fact should know where it came from |
| Treat context as permission | If the model can see it, it can influence behavior |
| Compress only with policy | Some information can be compressed; decisions and approvals often should not be |
| Forget deliberately | Stale context is not harmless; it competes with current truth |
| Validate before commit | Model output becomes state only after runtime checks |
| Make context observable | You should be able to inspect why a fact entered the prompt |

The takeaway

The future is not just better models.

It is not just longer context windows.

It is agents that manage working memory like an operating system:

  • what to keep
  • what to compress
  • what to forget
  • what to retrieve
  • what to isolate
  • what to hide
  • what to surface at the right moment

Once context is engineered well, small models can feel bigger, simple agents can behave more reliably, and multi-agent systems can coordinate without drowning each other in noise.

That is where context engineering becomes the foundation — not just a technique.
