
Harness Engineering: The Secret to Stable AI Agents

Aaron

Introduction

Harness Engineering has been making waves across the AI community lately, showing up in tech blogs, podcasts, and conference talks everywhere. I spent a good amount of time digging into the concept, including public case studies from Anthropic and OpenAI, and the more I read, the more convinced I became that this is something no one building Agents can afford to ignore. A common frustration when building Agents: the model is solid, the prompts have been iterated multiple times, yet the system still doesn’t run reliably. The root cause often lies not in the model itself, but in the runtime system wrapped around it. That outer layer now has a unified name: Harness. This article distills what I’ve learned, covering what Harness Engineering is and what core components it involves.

Three Shifts in Focus

Over the past two years, AI engineering has gone through three distinct shifts: from Prompt Engineering to Context Engineering, and now to Harness Engineering. It might look like a string of buzzwords, but each corresponds to a core challenge in AI system development:

  1. Did the model understand what you’re asking?
  2. Did the model receive enough correct information?
  3. Can the model consistently do the right thing during real execution?

These questions expand outward, layer by layer.

Prompt Engineering: Say It Right

When large language models first took off, the most striking observation was that the same model could produce vastly different results depending on how you phrased things. “Summarize this article” gets you a flat summary; a more structured phrasing yields something much better.

So people went all-in on prompts: role-playing, style constraints, few-shot examples, step-by-step guidance, output formatting. Why do these work? Because a large language model is fundamentally a probability-based generation system that’s extremely sensitive to context. Give it a role, and it responds in character. Give it examples, and it follows the pattern. Emphasize constraints, and it treats them as priorities[1].

The essence of Prompt Engineering isn’t commanding the model. It’s shaping a local probability space. The key skill at this stage is language design.

But Prompt Engineering soon hit a ceiling. Many tasks aren’t about saying things clearly; they require actual information. Analyzing internal company documents, answering questions about the latest product specs, writing code to a detailed specification, orchestrating across multiple tools. No matter how well-crafted the prompt, it can’t substitute for the facts themselves.

Prompts are good at constraining output, activating the model’s existing capabilities, and short-chain tasks. They’re not so good at filling knowledge gaps, managing dynamic information, or handling state across long task chains.

Context Engineering: Get the Information Right

When Agents gained traction, models were no longer just answering questions. They had to enter real environments and do things. Multi-turn conversations, browser automation, reading and writing code, operating databases, passing intermediate results between steps, revising plans based on feedback.

The system was no longer facing “did it answer this one question correctly?” but “can the entire task pipeline run end to end?” Take a real-world task: “Analyze this requirements document, identify potential risks, combine with historical review comments to produce improvement suggestions, then generate a feedback draft for the product manager.” This is far beyond what any single prompt can handle. It needs the current requirements document, historical review records, relevant specifications, current goals, intermediate analysis conclusions, the output recipient, tone requirements, and more.

The core of Context Engineering becomes one sentence: The model doesn’t necessarily know. The system must deliver the right information at the right time.

Context here isn’t just background material. In engineering terms, it represents the sum of all information influencing the model’s current decision: user input, conversation history, retrieval results, tool outputs, task state, intermediate artifacts, system rules, safety constraints, and structured results from other Agents. Prompts are actually just one part of the context[2].

RAG is a classic Context Engineering practice. But mature context engineering goes far beyond retrieval: how to chunk documents, how to rank results, how to compress long texts, when to keep vs. summarize conversation history, whether to pass raw tool output to the model or process it first, whether to pass raw text or structured fields between Agents. All of this requires careful design.
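The curation steps above can be sketched in a few lines. This is a minimal illustration of chunk → rank → pack under a context budget; the function names, the keyword-overlap scoring, and the fixed-size chunking are all simplifying assumptions (real systems split on document structure and rank with embeddings):

```python
# Minimal sketch of context curation: chunk a document, rank chunks by
# relevance to the query, then pack the best ones under a budget.
# Heuristics here are illustrative, not from any specific library.

def chunk(text: str, size: int = 200) -> list[str]:
    """Split into fixed-size chunks (real systems split on structure)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def rank(chunks: list[str], query: str) -> list[str]:
    """Order by naive keyword overlap; production uses embeddings."""
    terms = set(query.lower().split())
    return sorted(chunks, key=lambda c: -len(terms & set(c.lower().split())))

def pack(chunks: list[str], budget: int) -> str:
    """Keep the best-ranked chunks that fit the budget; drop the rest."""
    picked, used = [], 0
    for c in chunks:
        if used + len(c) > budget:
            break
        picked.append(c)
        used += len(c)
    return "\n---\n".join(picked)
```

The point isn’t the heuristics but the shape: every stage makes an explicit decision about what the model will and won’t see.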

Agent Skills’ progressive disclosure is also an advanced Context Engineering practice. It addresses a real problem: if you stuff a dozen tool descriptions and parameter definitions into the model, it theoretically knows more, but practically performs worse. Context window space is scarce, and information overload scatters attention. The Skill approach shows only minimal metadata upfront and dynamically loads detailed references and scripts only when needed.

The key insight: context optimization isn’t about giving more; it’s about giving on demand, in layers, at the right moment.
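This on-demand pattern can be sketched with a lazy-loading wrapper. The `Skill` class and the index format below are hypothetical, not a real SDK; the point is that the model sees only one summary line per skill until it actually selects one:

```python
# Sketch of progressive disclosure: expose only skill metadata up front,
# load the full reference lazily when the skill is actually needed.
# This Skill class is illustrative, not an actual Agent Skills API.

class Skill:
    def __init__(self, name: str, summary: str, load_details):
        self.name = name
        self.summary = summary              # one line, always visible
        self._load_details = load_details   # called only on demand
        self._details = None

    def details(self) -> str:
        if self._details is None:           # lazy load keeps context lean
            self._details = self._load_details()
        return self._details

def skill_index(skills: list[Skill]) -> str:
    """What the model sees by default: names and summaries only."""
    return "\n".join(f"- {s.name}: {s.summary}" for s in skills)
```

In a real system `load_details` would read a reference document or script from disk; here the mechanism is the same either way.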

Harness Engineering: Keep the Model on Track

Context Engineering isn’t the endpoint either. Even with correct information, the model might not execute reliably. It might plan well but drift during execution, call a tool but misinterpret the result, gradually veer off course across a long chain while the system fails to notice.

Prompts optimize intent expression. Context optimizes information supply. But in complex tasks, there’s a harder question: When the model starts taking actions continuously, who supervises it, constrains it, and corrects its course?

That’s what Harness Engineering addresses.

What Is a Harness

The word “harness” originally refers to the gear and straps used to hitch and control a draft animal. In AI systems, it’s a reminder of something elementary: when a model transitions from answering questions to executing tasks, the system doesn’t just feed it information. It must also manage the entire process.

There’s a concise formula: Agent = Model + Harness. Everything in an Agent system, aside from the model itself, that determines whether it can deliver reliably, falls under Harness.

Here’s an analogy. Imagine sending a new employee on an important client visit.

  • Prompt is about telling them the plan clearly: greet first, present the proposal, ask about needs, confirm next steps. The key is saying the right things
  • Context is about preparing all the materials: client background, previous communication records, product pricing, competitive landscape, meeting objectives. The key is getting the information right
  • Harness is about building a complete operational safety net: have them bring a checklist, report at key milestones, verify meeting minutes and recordings afterward, correct deviations immediately, and accept deliverables against clear criteria. The key is having a mechanism for continuous observation, correction, and final acceptance

These three aren’t replacements for each other. They’re nested: Prompt engineering formalizes instructions, Context engineering formalizes the input environment, Harness engineering formalizes the entire runtime system. Each layer’s boundary is larger than the last. The first two generations of engineering focused on “making the model think better.” Harness focuses on “keeping the model from drifting, running stably, and being recoverable when things go wrong.”

The Six Layers of a Harness

A mature Harness can be decomposed into six layers.

Layer 1: Context Management

Whether a model performs stably often depends less on its intelligence and more on what it sees. The Harness’s first responsibility is ensuring the model thinks within the right information boundaries. This typically involves three things:

  • Role and goal definition: The model needs to know who it is, what the task is, and what success looks like
  • Information curation and selection: More context isn’t always better. Relevant context is better
  • Structural organization: Fixed rules, current task, runtime state, and external evidence should be separated into clear sections. Once information gets jumbled, the model starts missing key points, forgetting constraints, or even self-contaminating
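The structural-organization point can be made concrete with a tiny context assembler. The section names and the `##` block format are my own illustration, not a prescribed layout; what matters is that rules, task, state, and evidence are rendered as separate labeled blocks in a fixed order rather than interleaved:

```python
# Sketch of structured context assembly: each category of information
# lives in its own labeled block, in a fixed order, so nothing jumbles.
# Section names and format are illustrative.

SECTION_ORDER = ["rules", "task", "state", "evidence"]

def assemble_context(sections: dict[str, str]) -> str:
    """Render fixed rules, current task, runtime state, and external
    evidence as separate blocks, skipping any that are empty."""
    blocks = []
    for name in SECTION_ORDER:
        body = sections.get(name, "").strip()
        if body:
            blocks.append(f"## {name.upper()}\n{body}")
    return "\n\n".join(blocks)
```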

On information curation, OpenAI fell into a classic trap early on: they wrote a massive agent instruction document that crammed in every standard, framework, and convention, and the agent only got more confused. Context window space is scarce, and overstuffing it is equivalent to saying nothing at all. Their fix was to turn the document into a directory page with only core indexes, splitting detailed content into sub-documents for architecture, design, execution plans, quality scoring, and security rules. The agent reads the table of contents first and drills down on demand. This is the same progressive disclosure philosophy as Agent Skills: don’t give everything at once; expose on demand.

Anthropic ran into a related issue with long-running autonomous tasks. The context grows fuller over time, and the model starts dropping details and key points. There’s even an interesting phenomenon where the model seems to sense it’s running out of space and starts rushing to wrap up. The common remedy is context compression, but Anthropic found that for some scenarios, compression alone isn’t enough. It makes things shorter but doesn’t truly relieve the burden. So they did something more radical: Context Reset. Instead of compressing within the existing context, they spin up a clean new agent and hand off the work. This is analogous to how engineers handle memory leaks: not by clearing caches, but by restarting the process and restoring state[3].
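A minimal sketch of what such a handoff might look like, assuming a simple JSON snapshot and a chat-message format. The field names and the two-message bootstrap are my own illustration, not Anthropic’s implementation; the key idea is that the new agent inherits only durable state, not the full transcript:

```python
# Sketch of Context Reset: serialize only the durable state, then boot
# a fresh agent whose entire history is the restored snapshot.
# Snapshot fields and message format are illustrative.

import json

def snapshot_state(goal: str, confirmed: list[str], next_steps: list[str]) -> str:
    """Capture what the next agent truly needs -- not the transcript."""
    return json.dumps({"goal": goal, "facts": confirmed, "next": next_steps})

def reset_agent(snapshot: str) -> list[dict]:
    """Start a clean context bootstrapped from the snapshot alone."""
    state = json.loads(snapshot)
    return [
        {"role": "system", "content": f"Goal: {state['goal']}"},
        {"role": "user", "content": "Confirmed: " + "; ".join(state["facts"])
                                    + " | Next: " + "; ".join(state["next"])},
    ]
```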

Layer 2: Tool System

Without tools, a large language model is essentially a text predictor. With tools, it can actually do things: search the web, read documents, write code, call APIs.

But the Harness does more than just attach tools. It needs to solve three problems:

  • Which tools to provide: Too few and capabilities are insufficient; too many and the model gets confused
  • When to invoke tools: Don’t search when unnecessary, but don’t bluff when you should search
  • How to feed tool results back: Dozens of search results shouldn’t be dumped raw. They need to be distilled and filtered for task relevance
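The third problem, distilling tool output before it reaches the model, can be sketched as a filter step. The keyword-overlap scoring and the character budget below are simplifying assumptions (real systems score relevance with a model or embeddings), but the shape, rank then truncate then forward, is the point:

```python
# Sketch of tool-result distillation: never feed raw search output to
# the model; keep the most task-relevant results, each trimmed to a
# budget. The relevance heuristic is illustrative.

def distill_results(results: list[dict], query: str,
                    limit: int = 3, max_chars: int = 200) -> list[dict]:
    """Rank raw results by naive relevance, keep the top few, trim each."""
    terms = set(query.lower().split())
    def score(r: dict) -> int:
        return len(terms & set(r.get("text", "").lower().split()))
    top = sorted(results, key=score, reverse=True)[:limit]
    return [{"title": r["title"], "text": r["text"][:max_chars]} for r in top]
```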

OpenAI’s practice here is fairly extreme: they don’t just give the Agent a code editor. They connect a browser so the Agent can take screenshots and simulate user interactions, hook into logging and metrics systems so it can check logs and monitors, and run each task in an isolated environment. The Agent doesn’t just say “done writing code.” It can actually run the code, see the results, find bugs, fix them, and verify the fix. Tool system design directly determines how “real” the Agent’s capabilities can be.

Layer 3: Execution Orchestration

The core question this layer answers: What should the model do next?

Many Agents fail not because they can’t do individual steps, but because they can’t string steps together. They can search, summarize, and write code, but the whole process is ad hoc, delivering a pile of half-finished work. A complete task typically needs a track: understand the goal → assess information sufficiency (supplement if needed) → analyze based on results → generate output → verify output → revise or retry if requirements aren’t met.
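The track above can be written as an explicit orchestration loop in which the harness, not the model, owns the sequence and the retry decision. The injected step functions (`gather`, `analyze`, `generate`, `verify`) are hypothetical callables standing in for model calls:

```python
# Sketch of execution orchestration: the harness owns the track
# understand -> gather -> analyze -> generate -> verify -> retry.
# Step functions are injected placeholders for model/tool calls.

def run_task(goal, gather, analyze, generate, verify, max_retries: int = 2):
    info = gather(goal)                 # supplement information if needed
    plan = analyze(goal, info)
    for _ in range(max_retries + 1):
        output = generate(plan)
        ok, feedback = verify(goal, output)
        if ok:
            return output               # accepted against explicit criteria
        plan = analyze(goal, info + [feedback])  # revise using the failure
    raise RuntimeError("task failed after retries")
```

Because verification and retry live in the loop rather than in the prompt, the process stops being ad hoc: every output either passes acceptance or produces feedback for the next attempt.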

OpenAI offered an insightful perspective here: the engineer’s job in the Agent era is no longer writing code, but designing the environment. Specifically, three things: decompose product goals into tasks within the Agent’s capability range; when the Agent fails, don’t tell it to try harder, ask what capability is missing from the environment; and build feedback loops so the Agent can see its own work results. “When an Agent has problems, the fix is almost never to make it try harder. It’s to identify what capability it’s missing.” This is itself classic Harness thinking.

Layer 4: Memory and State

An Agent without state is amnesiac every turn. It doesn’t know what it just did, which conclusions are confirmed, or which problems remain unresolved. The Harness must manage state, separating at least three categories:

  • Current task state
  • Intermediate results within the session
  • Long-term memory and user preferences

If these three types get mixed together, the system degrades over time. Once properly separated, the Agent starts behaving like a reliable collaborator.
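One way to enforce that separation is simply to give each category its own store with its own lifetime. The field names below are illustrative; what matters is that ending a task resets only task state while session results and long-term memory persist:

```python
# Sketch of state separation: three stores with three lifetimes.
# Field names are illustrative.

from dataclasses import dataclass, field

@dataclass
class TaskState:              # reset when the task ends
    goal: str = ""
    open_problems: list = field(default_factory=list)

@dataclass
class SessionState:           # lives for one conversation
    intermediate: dict = field(default_factory=dict)

@dataclass
class LongTermMemory:         # survives across sessions
    preferences: dict = field(default_factory=dict)

@dataclass
class AgentState:
    task: TaskState = field(default_factory=TaskState)
    session: SessionState = field(default_factory=SessionState)
    memory: LongTermMemory = field(default_factory=LongTermMemory)

    def end_task(self):
        """Drop task state only; session and memory persist."""
        self.task = TaskState()
```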

Layer 5: Evaluation and Observability

This is the layer most teams overlook. Many systems can generate output but have no idea whether that output is actually good. Without independent evaluation, the Agent stays in a perpetual state of self-satisfaction.

This layer typically includes: output acceptance criteria, environment validation, automated testing, logging and metrics, and error attribution. The system must not only be able to do things, but also know whether it did them right.

There’s a key engineering principle here: production and evaluation must be separated. When a model both does the work and grades itself, it tends to be overly optimistic, especially for subjective questions like design quality or product completeness. Anthropic’s approach is to split roles: a Planner expands vague requirements into complete specifications, a Generator implements step by step, and an Evaluator tests like a QA engineer. Critically, the Evaluator doesn’t just read code. It actually operates the page, checks interactions, and verifies real results. As long as the evaluator is sufficiently independent, the system forms an effective loop of generate → check → fix → re-check.
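The generate → check → fix → re-check loop with separated roles can be sketched as follows. The generator and evaluator are plain callables here (in practice, separately prompted model instances); the essential property is that the generator never grades its own work, and the evaluator’s findings feed the next revision:

```python
# Sketch of separating production from evaluation: an independent
# evaluator returns pass/fail plus findings, which drive the next
# revision. Roles are plain callables standing in for model instances.

def build_with_review(spec, generate, evaluate, max_rounds: int = 3):
    """generate -> check -> fix -> re-check, with roles kept separate."""
    artifact = generate(spec, findings=[])
    for _ in range(max_rounds):
        passed, findings = evaluate(spec, artifact)   # independent QA step
        if passed:
            return artifact
        artifact = generate(spec, findings=findings)  # fix using findings
    raise RuntimeError("evaluator never accepted the artifact")
```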

Layer 6: Constraints, Validation, and Failure Recovery

The final layer is what truly determines whether a system can go to production. In real environments, failure isn’t the exception; it’s the norm. Inaccurate search results, API timeouts, messy document formats, model misunderstandings. These are everyday occurrences.

Without recovery mechanisms, every error means starting over. A mature Harness must include:

  • Constraints: What’s allowed and what’s not
  • Validation: How to check before and after output
  • Recovery: How to retry, switch paths, or roll back to a stable state after failure
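The recovery bullet can be sketched as a small control-flow wrapper: retry the primary path, switch to a fallback path, and roll back to a checkpointed stable state if both fail. All names and the two-path structure are illustrative:

```python
# Sketch of failure recovery: retry, switch paths, then roll back.
# step/fallback are the primary and alternate paths; checkpoint/restore
# save and reinstate the last stable state. All names illustrative.

def run_with_recovery(step, fallback, checkpoint, restore, retries: int = 2):
    saved = checkpoint()                # stable state to return to
    for path in (step, fallback):
        for _ in range(retries):
            try:
                return path()
            except Exception:
                continue                # transient failure: retry
    restore(saved)                      # both paths exhausted: roll back
    raise RuntimeError("rolled back to stable state; task failed")
```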

OpenAI did something noteworthy here: they encoded senior engineers’ experience directly as system rules. How modules should be layered, which layers must not depend on which, when to block, and how to fix issues when found. These rules don’t just flag errors; they feed the fix back to the Agent for the next round of context. This is no longer traditional code style enforcement. It’s a continuously running automated governance system[4]. Agents submit code far faster than humans can review it, so system rules must serve as the safety net rather than relying on manual code review.

Summary

Putting it all together:

  • Prompt Engineering solves how to say the task clearly
  • Context Engineering solves how to deliver the right information
  • Harness Engineering solves how to keep the model consistently doing the right thing in real execution

Harness doesn’t replace the previous two. It encompasses them at a larger system boundary. When the task is simple single-turn generation, prompts matter most. When tasks depend on external knowledge and dynamic information, context becomes critical. When the model enters real-world scenarios with long chains, executable actions, and low tolerance for error, Harness becomes virtually unavoidable.

This explains why the same model can perform so differently across products. The model determines what’s possible to ship. But Harness determines whether it actually lands and delivers reliably. The core challenge of AI deployment is shifting from “making the model look smart” to “making the model work reliably in the real world.”


  1. This context sensitivity is both a strength and a weakness of large language models. The upside is that carefully designed inputs can steer outputs. The downside is that subtle wording changes can produce drastically different results. ↩︎

  2. Broadly speaking, a prompt is also a form of context. In engineering practice, however, we typically distinguish user-authored instructions as “prompts” from system-organized injected information as “context,” to differentiate their management approaches. ↩︎

  3. Context Reset is essentially a more aggressive context management strategy. Rather than compressing within the existing context, it abandons the current context entirely, serializing state and loading it into a fresh Agent to restart the reasoning process. ↩︎

  4. This approach of “encoding engineer experience as system rules” shares the same philosophy as traditional software engineering lint rules and CI checks, except the executor has shifted from humans to Agents. ↩︎