
Beyond Prompt Engineering: Context Engineering and Harness Engineering

Going Beyond Prompts: How Context and Harness Shape LLM Systems


Date: March 11, 2026
Purpose: Breaking down context engineering and harness engineering for anyone who's confused by the buzzwords


Everyone's talking about prompt engineering like it's the ultimate skill for working with AI. Write better prompts, get better results. And that's true -- to a point. But if you've spent any real time building with LLMs, you've probably noticed that the prompt is only a small piece of the puzzle.

Two concepts have been floating around that actually explain what's going on when you go beyond writing clever prompts: context engineering and harness engineering. They sound fancy but they're not. Let me break them down the way I wish someone explained them to me.


Prompt Engineering: Where Everyone Starts

Before we get into the new stuff, let's be clear about what prompt engineering actually is.

Prompt engineering is crafting the text you send to the model. System prompts, few-shot examples, chain of thought, structured output instructions -- all of that lives in the prompt.

System: You are a helpful coding assistant.
User: Write a Python function that reverses a string.

That's prompt engineering. You're tweaking the words to get better output.

It works. But it has limits. Even a really well-crafted prompt sent to a model with no tools, no memory, and no external data is like talking to a very smart person locked in a room with no internet, no books, and no way to check their work.

That's where context engineering comes in.


Context Engineering: The Full Picture

Andrej Karpathy put it well -- the LLM is a CPU, the context window is RAM, and you are the operating system. Your job is loading exactly the right information for each task.

Context engineering is about designing the entire information environment the model operates in. Not just the prompt, but everything that goes into and around it.

What Actually Goes Into Context

When you send a message to Claude or GPT, a lot more is happening behind the scenes than your message and a system prompt:

[System prompt]              <- who the model is, rules, format
[Tool definitions]           <- what the model can do (functions, APIs)
[Retrieved documents]        <- RAG results, search hits
[Conversation history]       <- what was said before
[Working memory]             <- scratchpad, intermediate results
[User message]               <- the actual question
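A minimal sketch of how those pieces might get assembled into a single request. The function and field names here are hypothetical; real provider APIs differ in the details:

```python
def build_context(system_prompt, tools, docs, history, scratchpad, user_msg,
                  max_history=5):
    """Assemble the context layers above into one message list (sketch)."""
    messages = [{"role": "system", "content": system_prompt}]
    for doc in docs:  # retrieved documents (RAG results, search hits)
        messages.append({"role": "system", "content": f"Reference:\n{doc}"})
    messages.extend(history[-max_history:])  # recent conversation only
    if scratchpad:  # working memory / intermediate results
        messages.append({"role": "system", "content": f"Notes:\n{scratchpad}"})
    messages.append({"role": "user", "content": user_msg})
    return messages, tools  # tool definitions ride alongside the messages
```

Every parameter maps to one layer in the diagram, which makes it obvious that the user message is just one input among six.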

Every single one of these affects output quality. Context engineering is about optimizing all of them together.

Prompt Engineering vs Context Engineering

| Aspect | Prompt Engineering | Context Engineering |
| --- | --- | --- |
| Scope | The text you write | The entire information environment |
| Focus | How you phrase things | What information is available and when |
| Tools | System prompts, few-shot | RAG, tool definitions, memory, history management |
| Analogy | Writing a good email | Designing the entire briefing package for a decision maker |
| When it matters | Single-turn, simple tasks | Multi-turn, agentic, complex workflows |

Why Context Engineering Matters More Now

When models were simple chatbots, prompt engineering was enough. You type, it responds, done.

But now we have agents. Multi-turn conversations. Tool use. RAG pipelines. Long-running tasks. The model isn't just answering one question -- it's making decisions, calling tools, reading results, and deciding what to do next.

In that world, what information is in the context window at each step matters way more than how you phrased the system prompt.

Practical Context Engineering

Here's what it actually looks like in practice:

Prune irrelevant history. Don't send 50 turns of conversation if only the last 5 matter. I've seen agents fail because their context was full of old, irrelevant messages and they got confused about what they were supposed to be doing now.

Summarize, don't truncate. When context gets long, summarize older messages instead of cutting them off mid-conversation. Cutting mid-sentence creates confusion. A good summary preserves intent.
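Pruning and summarizing can be combined in one step. Here `summarize` stands in for whatever produces the summary (in practice, an LLM call); everything else is plain bookkeeping:

```python
def trim_history(history, summarize, keep_recent=5):
    """Summarize older turns, keep recent ones verbatim (sketch).
    `summarize` is a stand-in callable -- usually an LLM call in practice."""
    if len(history) <= keep_recent:
        return history  # short enough, nothing to do
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = summarize(old)  # one compact summary replaces N old turns
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```

The key property: the model always sees a coherent summary plus complete recent turns, never a conversation chopped off mid-sentence.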

Order matters. Models pay more attention to the beginning and end of their context window. This is the "lost in the middle" problem. Put critical instructions at the top, the immediate task at the bottom, reference material in the middle.

Dynamic system prompts. The system prompt doesn't have to be static. Change it based on what the user is doing. If they're writing code, load coding-specific instructions. If they're doing research, load research-specific context. Same model, different behavior.
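A toy version of dynamic system prompts might route on keywords. The prompts and keywords here are made up for illustration; a real system would likely use a classifier or explicit mode switch:

```python
# Hypothetical task-specific system prompts.
PROMPTS = {
    "code": "You are a coding assistant. Follow the project style guide.",
    "research": "You are a research assistant. Cite sources for every claim.",
}

def system_prompt_for(user_msg):
    """Pick a system prompt based on what the user is doing (naive routing)."""
    if any(kw in user_msg.lower() for kw in ("function", "bug", "refactor")):
        return PROMPTS["code"]
    return PROMPTS["research"]
```

Same model, different behavior, depending on which instructions get loaded.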

Be specific about tools. Tool descriptions are part of the context. Vague descriptions mean the model picks the wrong tool. Clear descriptions with examples mean it picks the right one.
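To make the difference concrete, compare a vague and a specific tool definition. The schema shape loosely follows common function-calling formats; field names vary by provider, and `search_docs` is a made-up tool:

```python
# Vague -- the model can't tell when (or how) to use this:
vague = {"name": "search", "description": "Searches things."}

# Specific -- states what it searches, when to use it, and shows an example:
clear = {
    "name": "search_docs",
    "description": (
        "Full-text search over the project's internal documentation. "
        "Use when the user asks about internal APIs or conventions. "
        "Example: search_docs(query='auth middleware setup')"
    ),
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "Search terms"}},
        "required": ["query"],
    },
}
```

The description is context the model reads every turn; a one-sentence example in it is cheap and pays off in tool selection accuracy.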

Memory management. For long-running agents, you need to decide what to remember and what to forget. Store key decisions in external memory (files, databases), load them back when relevant. Don't rely on the context window to be your permanent storage -- it's not.
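External memory can be as simple as a JSON file. This is a minimal sketch (the file path is arbitrary); real agents might use a database or a knowledge graph, but the principle is identical:

```python
import json
import pathlib

MEMORY_FILE = pathlib.Path("agent_memory.json")  # hypothetical location

def remember(key, value):
    """Persist a key decision outside the context window."""
    data = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    data[key] = value
    MEMORY_FILE.write_text(json.dumps(data))

def recall(key, default=None):
    """Load a stored decision back when it becomes relevant again."""
    if not MEMORY_FILE.exists():
        return default
    return json.loads(MEMORY_FILE.read_text()).get(key, default)
```

The point is that `recall` is called only when the information is relevant, so the context window stays small while nothing important is lost.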


Harness Engineering: The Infrastructure Around the Model

If context engineering is about what the model sees, harness engineering is about everything around the model -- the scaffolding, the guardrails, the tool integrations, the feedback loops.

An agent harness is the software infrastructure that wraps around an LLM and handles everything the model can't do on its own.

The Model Alone Can't Do Much

Think about what a raw LLM actually does: it takes text in and produces text out. That's it. It can't:

  • Read files

  • Call APIs

  • Remember things across sessions

  • Verify its own output

  • Recover from errors

  • Run code

All of that comes from the harness. The harness is what turns a text generator into something that can actually get work done.

What a Harness Does

User Request
     |
     v
[Harness] ---> Parse intent, select tools, manage context
     |
     v
[LLM] ------> Think, plan, generate tool calls
     |
     v
[Harness] ---> Execute tools, capture results, feed back
     |
     v
[LLM] ------> Review results, decide next step
     |
     v
[Harness] ---> Verify output, apply guardrails, respond
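The loop in the diagram can be sketched in a few lines. Everything here is a stand-in: `llm` is any callable that returns either a tool request or a final answer, and the reply format is invented for illustration:

```python
def run_agent(user_request, llm, tools, max_steps=10):
    """Minimal harness loop: the model proposes, the harness executes.
    `llm` and the tool callables are hypothetical stand-ins."""
    context = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        reply = llm(context)                       # think, plan, pick a tool
        if reply.get("tool") is None:              # no tool requested -> done
            return reply["content"]
        result = tools[reply["tool"]](**reply["args"])   # harness executes
        context.append({"role": "tool", "content": str(result)})  # feed back
    return "step limit reached"                    # guardrail: bounded loop
```

Notice how little of this is the model: parsing, execution, feedback, and the step limit are all harness code.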

The model is just one part of the loop. The harness handles:

| Component | What It Does |
| --- | --- |
| Tool integration | Connecting the model to APIs, databases, file systems, browsers |
| Memory | Storing information across sessions -- files, databases, knowledge graphs |
| Context management | Deciding what information to load into the context window and when |
| Planning | Breaking complex goals into steps the model can handle |
| Verification | Checking that the model's output is actually correct |
| Guardrails | Preventing the model from doing things it shouldn't |
| Error recovery | Handling failures and retrying with different approaches |
| Orchestration | Managing the loop between model calls, tool execution, and user interaction |

Why the Harness Matters More Than the Model

Here's the part that surprises people: improving the harness often has a bigger impact than improving the model.

LangChain's coding agent went from 52.8% to 66.5% on a benchmark by changing nothing about the model. Same LLM, better harness, jumped from top 30 to top 5. That's not a small improvement -- that's a different league.

This makes sense when you think about it. The model is already pretty good at reasoning and generating text. What usually goes wrong is:

  • The model didn't have the right information (context problem)

  • The model couldn't verify its work (harness problem)

  • The model picked the wrong tool (tool description problem)

  • The model lost track of what it was doing (memory problem)

  • The model made an error and nobody caught it (guardrails problem)

All of these are harness problems, not model problems.

Real Example: Claude Code

Claude Code is a good example of what a well-designed harness looks like. The model behind it (Claude) is the same model you can use through the API. But the harness adds:

  • File system access -- read, write, edit, search files

  • Shell execution -- run commands, tests, builds

  • Git integration -- commit, diff, status, branch management

  • Context management -- CLAUDE.md files, project-level instructions, memory

  • Tool orchestration -- the agent loop that chains actions together

  • Sub-agents -- spawn specialized agents for specific tasks

  • Plugins -- extend capabilities with custom tools and workflows

Strip all that away and you just have Claude answering questions. The harness is what makes it useful for actual development work.

Another Example: My Spec-Driven Plugin

When I built the spec-driven development plugin for Claude Code, I was doing harness engineering without calling it that.

The plugin adds structure to how Claude works:

  1. Phase 0 - Brainstorm: Explore the problem space before committing

  2. Phase 1 - Requirements: Define what the system should do using EARS notation

  3. Phase 2 - Design: Architecture, data models, component design

  4. Phase 3 - Tasks: Break it into discrete, trackable implementation steps

Then it provides execution tools -- /spec-exec runs one task, /spec-loop runs until done, /spec-team coordinates four specialized agents (Implementer, Tester, Reviewer, Debugger).

Same Claude model underneath. But the harness (the plugin) constrains and guides the model's behavior so it produces better, more structured output. That's harness engineering.


How They Work Together

Context engineering and harness engineering aren't competing concepts -- they're layers of the same system.

┌─────────────────────────────────────────┐
│           Harness Engineering           │
│  (infrastructure, tools, guardrails,    │
│   orchestration, memory, verification)  │
│                                         │
│   ┌─────────────────────────────────┐   │
│   │      Context Engineering        │   │
│   │  (what goes into the context    │   │
│   │   window at each step)          │   │
│   │                                 │   │
│   │   ┌─────────────────────────┐   │   │
│   │   │   Prompt Engineering    │   │   │
│   │   │  (the specific text     │   │   │
│   │   │   and instructions)     │   │   │
│   │   └─────────────────────────┘   │   │
│   └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
  • Prompt engineering is about the words

  • Context engineering is about the information

  • Harness engineering is about the system

You need all three. A great prompt in bad context produces garbage. Great context with no harness means the model can think but can't act. A great harness with bad context means the model can act but makes wrong decisions.


The Evolution

Here's how I think about the progression:

2023: Prompt Engineering Era

Everyone was learning to write better prompts. "You are an expert Python developer. Think step by step." That was the cutting edge.

2024: RAG and Tool Use

People realized the model needs information and capabilities beyond what's in the prompt. RAG pipelines, function calling, tool use. This was the beginning of context engineering.

2025: Agents

Full agent loops -- models that plan, act, observe, repeat. MCP standardized tool integration. This forced people to think about harness engineering whether they called it that or not.

2026: Harness Engineering

The realization that the system around the model matters more than the model itself. Companies compete not on which model they use but on how good their harness is. A weaker model with a great harness can outperform a stronger model with a poor one.


Practical Takeaways

If you're building with LLMs right now, here's what this means:

Stop obsessing over the perfect prompt. A good prompt matters, but it's maybe 20% of the outcome. The other 80% is context and harness.

Design your context pipeline. Think about what information the model needs at each step. What should it see? What should it not see? When should information be loaded vs summarized vs dropped?

Build feedback loops. The model should be able to check its own work. Run the tests, read the output, try again if it failed. That's harness engineering.
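A feedback loop can be as simple as: generate, run the tests, feed failures back, retry. In this sketch `generate` is a stand-in for an LLM call that accepts the previous error output:

```python
import subprocess

def generate_with_checks(generate, test_cmd, max_attempts=3):
    """Let the model verify its own work: run tests, retry on failure.
    `generate` is a hypothetical stand-in for an LLM call."""
    error = None
    for _ in range(max_attempts):
        code = generate(error)            # model writes (or fixes) the code
        result = subprocess.run(test_cmd, capture_output=True, text=True)
        if result.returncode == 0:        # tests pass -> accept the output
            return code
        error = result.stderr or result.stdout  # feed the failure back
    raise RuntimeError("tests still failing after retries")
```

The model never has to guess whether its code works; the harness tells it.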

Use tools for facts, models for reasoning. Don't ask the model to remember your API schema. Give it a tool to look it up. Don't ask it to guess if code works. Give it a tool to run it.

Invest in guardrails. Especially for production systems. The model will occasionally do something unexpected. Your harness should catch it before it reaches the user.
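Even a crude harness-side check catches a lot. The blocked patterns here are purely illustrative; production guardrails would use allowlists, sandboxing, or policy engines rather than substring matching:

```python
BLOCKED = ("drop table", "rm -rf", "delete from")  # illustrative patterns only

def guard(action):
    """Check a proposed action before it reaches execution or the user."""
    lowered = action.lower()
    for pattern in BLOCKED:
        if pattern in lowered:
            raise PermissionError(f"blocked action: {pattern!r}")
    return action
```

The important design choice is that this runs in the harness, outside the model, so an unexpected model output can't skip it.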

Think in systems, not prompts. The prompt is one component. The system is what delivers value.


Quick Reference

| Concept | One-Line Explanation |
| --- | --- |
| Prompt Engineering | Crafting the text instructions sent to the model |
| Context Engineering | Designing the full information environment the model operates in |
| Harness Engineering | Building the infrastructure around the model (tools, memory, guardrails, orchestration) |
| Agent Loop | The cycle of think -> act -> observe -> repeat |
| KV Cache | Stored attention computations that grow with context length |
| RAG | Retrieving relevant documents and stuffing them into context |
| MCP | Universal protocol for connecting models to tools |
| Guardrails | Systems that prevent the model from doing things it shouldn't |
| Tool Orchestration | Managing which tools are available and when they're called |
| Dynamic Context | Changing what the model sees based on what it's doing |

This is how I think about it from actually building this stuff -- running local models, building plugins, wiring up agent teams. The concepts click when you see them in action. If you're just getting started, build something small with tools and a loop. You'll learn more from that than from reading 50 articles about prompt engineering.