Security Is a Harness Problem (Not a Model Problem)
The Harness Manifesto, Part 6
OpenAI publicly admitted that prompt injection is "not solvable." Not difficult. Not a work in progress. Fundamentally, architecturally unsolvable at the model layer.
Most people read that and panicked about the wrong thing.
The conversation immediately became about model safety. Can we trust AI? Should we slow down? Are these systems too dangerous to deploy? Meanwhile, the actual question sitting right there in the disclosure went almost entirely unasked: if the model can't secure itself, what can?
I've been building production AI systems since early 2026. 175+ skills, multi-agent orchestration, clients running agents against real customer data and real financial systems. I've seen what actually goes wrong. And I can tell you with certainty that the security incidents keeping operators up at night have almost nothing to do with prompt injection or model jailbreaks. They have everything to do with the harness.
A skill with permission to read every file on the system when it only needs two. A context layer that loads customer PII into sessions where it's not relevant. An orchestration chain that lets Agent A call a deployment tool without any human ever approving it. Memory that persists a client's API keys because nobody told it what to forget.
Those are harness failures. Every single one.
This is the first paid post in this series. If you've been following along through Posts 1-5, you've got the thesis: the model is commoditized, the harness is the business. You've seen the five layers, the Conway threat, the skill threshold, the Karpathy Test. Now we go deeper. Security is where the harness thesis gets concrete, where it stops being a strategic framework and starts being an operational reality that determines whether your agents are trustworthy or just convenient.
The Category Error Everyone Makes
When someone says "AI security," most people picture a hacker crafting a clever prompt that tricks the model into revealing training data or bypassing its safety filters. The Hollywood version. And yes, adversarial prompts are real. Researchers demonstrate them regularly. They make great conference talks.
But in production, adversarial prompts account for a tiny fraction of actual security incidents. The stuff that actually breaks is much more boring and much more dangerous.
An agent that was given access to a production database because the developer didn't think to scope its permissions to read-only. An orchestration pipeline that chains four tools together without a single checkpoint, so when the first tool misinterprets its input, the error cascades through all four before anyone notices. A context file that contains the company's pricing strategy, loaded into every AI session regardless of whether the user needs it, because nobody segmented the context layer. A memory system that faithfully remembers everything, including the credentials a user typed during a debugging session six weeks ago.
None of those scenarios involve a sophisticated attack. They don't require an adversary at all. They're the natural consequences of building a harness without thinking about security as a design constraint.
This is the category error. The industry frames AI security as a model problem, something to be solved with better alignment, better RLHF, better constitutional AI. And model-level safety matters. But it's the floor, not the ceiling. Everything above that floor, the part that actually determines whether your AI system is safe to run in production, lives in the harness.
The Five Security Primitives
Through building our own systems and auditing others, we've identified five primitives that every production harness needs. These aren't theoretical. They come from watching things go wrong and figuring out what would have caught the problem.
Constrained Execution
The agent can only do what it's explicitly been allowed to do. Not "it's been told to only do X." Actually, mechanically constrained to X.
There's a difference between an instruction that says "only access the marketing database" and an architecture that literally can't access anything else. Instructions are suggestions that models follow most of the time. Constraints are boundaries that hold all of the time. When your agent runs 200-300 skill calls per workflow, "most of the time" isn't good enough. A 99.5% compliance rate at 300 calls means you're expecting 1-2 violations per run.
Constrained execution means the skill's tool permissions are scoped at the harness level, not the prompt level. The agent doesn't have the option to exceed its boundaries, because the harness never gave it access in the first place. Your deployment skill can push to staging but not to production. Your analytics agent can read dashboards but can't modify the underlying data. Your email agent can draft but can't send.
This is a design decision you make once, in the harness, and it protects you forever. Trying to enforce it through prompt instructions means you have to get it right every single time, in every skill, in every orchestration chain, and hope the model never misinterprets the instruction under an edge case you didn't anticipate.
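Here's a minimal sketch of what harness-level scoping can look like, assuming a simple in-process skill registry. Every name in it (ToolGrant, Skill, invoke_tool) is a hypothetical illustration of the pattern, not any framework's real API.

```python
# Sketch: permissions enforced by the harness, not the prompt.
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolGrant:
    tool: str   # e.g. "db.read"
    scope: str  # the narrowest resource that gets the job done

@dataclass
class Skill:
    name: str
    grants: frozenset  # the ONLY tools this skill can ever touch

def invoke_tool(skill: Skill, tool: str, scope: str, payload: dict):
    """The harness checks the grant table before any tool runs. The model
    never holds permissions it wasn't granted, so 'please only use X'
    prompting is never load-bearing."""
    if ToolGrant(tool, scope) not in skill.grants:
        raise PermissionError(f"{skill.name} attempted {tool}:{scope}")
    return {"tool": tool, "scope": scope, "ok": True}  # real dispatch goes here

# The billing skill reads exactly the three tables it needs. Nothing else.
billing = Skill("billing-lookup", frozenset({
    ToolGrant("db.read", "accounts"),
    ToolGrant("db.read", "invoices"),
    ToolGrant("db.read", "payment_methods"),
}))

invoke_tool(billing, "db.read", "accounts", {})     # allowed
# invoke_tool(billing, "db.write", "accounts", {})  # raises PermissionError
```

The point of the sketch: the write path doesn't exist for this skill. There is no instruction to misinterpret.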
Approval Gates
Certain actions require human sign-off before they execute. Not after. Before.
The principle is simple: the cost of interruption must be lower than the cost of error. Sending an email to a client? Cost of error is high (reputation damage, wrong information, legal exposure). Cost of interruption is low (30 seconds of human review). That's a gate.
Reformatting a document for internal use? Cost of error is low (someone fixes it). Cost of interruption isn't justified. That's not a gate.
The mistake most teams make is binary thinking. Either the agent is fully autonomous or everything requires approval. Both extremes fail. Fully autonomous agents will eventually do something catastrophic. Approving everything defeats the purpose of having agents.
The harness approach: map your workflows. Identify every action where an error would cause customer-visible, financial, legal, or security impact. Put a gate at each of those points. Let everything else flow. In our systems, that typically means 6-8 gates in a complex workflow of 200+ actions. The agent operates autonomously for 97% of the work and pauses for human judgment at the 3% that matters.
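A minimal sketch of that gate rule, assuming an in-memory review queue. The action names and the queue are illustrative; in a real harness you'd wire them to your actual review channels.

```python
# Sketch: an approval gate keyed on impact, not on action type.
import queue

HIGH_IMPACT = {"send_email", "deploy_production", "update_pricing"}
pending = queue.Queue()  # a human drains this before anything executes

def perform(action: str, payload: dict, execute) -> str:
    """Customer-visible, financial, legal, or security impact pauses for
    sign-off. Everything else flows without interruption."""
    if action in HIGH_IMPACT:
        pending.put((action, payload))
        return "pending_approval"
    execute(action, payload)
    return "executed"

def approve_next(execute) -> None:
    """Called by a human reviewer: 30 seconds of review versus hours of
    cleanup if the error ships."""
    action, payload = pending.get_nowait()
    execute(action, payload)

status = perform("reformat_doc", {"doc": "q3.md"}, lambda a, p: None)
assert status == "executed"
status = perform("send_email", {"to": "client"}, lambda a, p: None)
assert status == "pending_approval"
```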
Provenance Tracking
Every output is traceable back to the inputs, skills, and decisions that produced it.
When an agent produces something unexpected, you need to answer "why" in minutes, not days. Provenance means you can trace any output backward through the chain: this paragraph was generated by this skill, using this context, routed by this orchestrator decision, triggered by this user request.
Without provenance, debugging agent behavior is archaeology. You're digging through logs trying to reconstruct what happened. With provenance, it's engineering. You follow the chain, find the break point, fix it.
This matters more than people realize for regulated industries. When a financial services firm uses an AI agent to generate client communications, the compliance team needs to know exactly which data sources fed that communication, which rules were applied, and why the agent chose specific language. "The AI wrote it" is not an acceptable answer for a regulator. "Here's the exact chain of inputs, rules, and decisions" is.
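One way to make provenance concrete is a record attached to every artifact an agent produces. The schema below is an assumption for illustration, not a standard; the load-bearing part is the parent link.

```python
# Sketch: a provenance record per artifact, chained by parent_id.
from dataclasses import dataclass
from typing import Optional
import time, uuid

@dataclass(frozen=True)
class Provenance:
    artifact_id: str          # this output
    skill: str                # which skill produced it
    context_ids: tuple        # which context documents were loaded
    decision: str             # the orchestrator choice that routed here
    parent_id: Optional[str]  # the upstream artifact that triggered this step
    created_at: float

def stamp(skill: str, context_ids, decision: str,
          parent: Optional[Provenance] = None) -> Provenance:
    return Provenance(str(uuid.uuid4()), skill, tuple(context_ids), decision,
                      parent.artifact_id if parent else None, time.time())

request = stamp("intake", ["user-request"], "user asked for a client update")
draft = stamp("draft-email", ["client-b/history.md"],
              "router matched 'client update'", parent=request)
# Walking parent_id backward answers "why does this paragraph exist":
# output -> skill -> context -> orchestrator decision -> original request.
assert draft.parent_id == request.artifact_id
```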
Comprehensive Logs
A full audit trail of agent decisions, reasoning, and actions. Not just what the agent did, but what it considered and rejected.
Logs sound boring. They are boring. They're also the difference between a system you can trust and a system you just hope works.
Good logs capture the decision tree, not just the outcomes. Agent considered three skills, selected this one because the description matched on these keywords, executed with this context window, produced this output, which was then consumed by the next agent in the chain. When something goes wrong at step 14 of a 20-step workflow, logs let you reconstruct the entire decision path without re-running anything.
Anthropic's Managed Agents platform includes debug and interpretability panels for exactly this reason. They know that long-running autonomous agents are only viable if operators can see inside them. That's a harness feature, not a model feature. The model doesn't log its own reasoning in a structured, queryable format. The harness does.
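As a sketch of what "capture the decision tree" means in practice, here's one structured record per step, assuming JSON lines you can query later. The field names are illustrative, not a fixed schema.

```python
# Sketch: logging what the agent considered AND rejected, not just outcomes.
import json, logging, time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("harness.decisions")

def log_decision(step: int, candidates: list, chosen: str, reason: str):
    """One record per decision, so step 14 of a 20-step run can be
    reconstructed without re-running anything."""
    log.info(json.dumps({
        "ts": time.time(),
        "step": step,
        "candidates": candidates,  # every skill that was in the running
        "chosen": chosen,
        "reason": reason,          # e.g. which keywords matched
    }))

log_decision(
    step=14,
    candidates=["summarize-report", "draft-email", "extract-metrics"],
    chosen="draft-email",
    reason="description matched 'client update' in the request",
)
```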
Rollback Capabilities
Any agent action can be undone.
This one seems obvious until you realize how many teams build agent systems with no undo path. The agent modified 40 files? Hope you had version control. The agent sent 200 personalized emails? Those are out in the world now. The agent updated pricing in the CMS? Someone better remember what the old prices were.
Rollback means the harness records the state before every significant action and can restore it. For code changes, that's Git. For database mutations, that's transactions with savepoints. For external communications, that's draft-and-approve instead of send-directly. For system configurations, that's infrastructure-as-code with version history.
The principle: never let an agent take an action that can't be reversed without building the reversal path first.
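Here's a minimal sketch of that principle for in-memory state. In real systems the snapshot is a Git commit, a DB savepoint, or a config version; the deep copy below is a stand-in for whichever reversal path fits the state being touched.

```python
# Sketch: never mutate state without capturing the reversal path first.
import copy

def with_rollback(state: dict, action, *args):
    """Snapshot before the action; restore on failure."""
    snapshot = copy.deepcopy(state)  # the reversal path exists BEFORE the action
    try:
        action(state, *args)
    except Exception:
        state.clear()
        state.update(snapshot)       # restore the pre-action state
        raise

prices = {"sku-1": 49, "sku-2": 99}

def bad_update(s):
    s["sku-1"] = 39
    raise RuntimeError("agent misread the pricing sheet")

try:
    with_rollback(prices, bad_update)
except RuntimeError:
    pass
assert prices == {"sku-1": 49, "sku-2": 99}  # the old prices survived
```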
Where Real Breaches Happen
Let me walk through four scenarios I've seen in actual production systems. Names and details changed, but the patterns are real.
The Overprivileged Skill
A B2B SaaS company built a customer support agent. One of its skills needed to look up customer account details to answer billing questions. The developer gave the skill access to the full customer database. Read and write. Every table.
For months, it worked fine. The skill only read from the accounts table. Then a user asked a slightly unusual question about updating their billing address, and the agent interpreted that as an instruction to modify the record directly. It changed the customer's address in the database. No approval gate. No notification. The customer's next invoice went to the wrong address.
The fix wasn't better prompting. It was scoping the skill's database permissions to read-only on the three specific tables it actually needed. A harness change that took 20 minutes and eliminated an entire class of failure.
The Leaky Context Layer
A consulting firm loaded their full client engagement history into the context layer for their proposal-writing agent. Made sense on paper: the agent could reference past work to write better proposals. But the context included fee structures, margin analysis, and negotiation notes from other clients.
When the agent wrote a proposal for Client B, it included a reference to "similar work we completed at a comparable price point" and cited a specific fee range that came from Client A's engagement records. Nobody caught it before send. Client B now knew what Client A paid.
The fix: segmented context. Each client engagement gets its own context scope. The proposal agent loads only the relevant client's history, plus anonymized case studies from the org-wide tier. The sensitive data still exists, but the harness controls which context is visible to which workflow.
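A sketch of what that segmentation can look like, assuming a two-tier context store. The tier names and loader are illustrative, not a real product's API.

```python
# Sketch: context scoped per engagement, plus a shared anonymized tier.
CLIENT_CONTEXT = {
    "client-a": ["client-a/history.md", "client-a/fees.md"],
    "client-b": ["client-b/history.md"],
}
ORG_TIER = ["case-studies/anonymized.md"]  # safe to share across workflows

def load_context(client_id: str) -> list:
    """The proposal agent sees ONE client's history plus anonymized
    org-wide material -- never another client's fee structures."""
    return CLIENT_CONTEXT.get(client_id, []) + ORG_TIER

assert "client-a/fees.md" not in load_context("client-b")
```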
The Cascading Chain
An e-commerce company built an automation: monitor reviews, analyze sentiment, generate response drafts, post responses. Four steps, zero gates.
A batch of reviews came in from a coordinated trolling campaign. The sentiment analysis skill correctly identified them as negative. The response generation skill, following its instructions to "address customer concerns empathetically," generated sincere, apologetic responses to obviously fake reviews. The posting skill published them all. Forty-seven apologetic responses to troll reviews, live on the product page, within 90 minutes.
The fix: an approval gate between "generate response" and "post response" for any review with a sentiment score below a certain threshold. The agent still handles the 85% of reviews that are straightforward. The edge cases get human eyes before they go live.
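A sketch of that gate, assuming sentiment scores in [0, 1]. The 0.3 threshold and field names are assumptions chosen to make the pattern concrete.

```python
# Sketch: the gate inserted between "generate response" and "post response".
REVIEW_THRESHOLD = 0.3  # low score = negative review

def route(review: dict, draft: str, post, hold_for_review) -> str:
    if review["sentiment"] < REVIEW_THRESHOLD:
        hold_for_review(review, draft)  # human eyes before anything goes live
        return "held"
    post(review, draft)                 # the straightforward majority still flows
    return "posted"

assert route({"sentiment": 0.9}, "Thanks!", lambda r, d: None,
             lambda r, d: None) == "posted"
assert route({"sentiment": 0.1}, "We're sorry...", lambda r, d: None,
             lambda r, d: None) == "held"
```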
The Memory That Wouldn't Forget
A development team used an AI coding agent with persistent memory. During a debugging session, a developer pasted a production API key into the chat to test a connection issue. The memory system faithfully recorded it. Six weeks later, a junior developer working in the same project context asked the agent for help with API integration. The agent helpfully provided the production key from memory, suggesting they "use the key from the previous session." The junior dev used it in a test script that ran against production.
The fix: memory hygiene rules in the harness. A classification layer that scans memory writes for sensitive patterns (API keys, tokens, credentials, PII) and either redacts them or flags them for manual review before persistence. The memory system still works. It just doesn't remember things it shouldn't.
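Here's a minimal sketch of that classification pass, assuming simple regex detection. Real patterns need tuning for your stack; these three are illustrative.

```python
# Sketch: scan memory writes for sensitive patterns before persistence.
import re

SENSITIVE = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                # API-key-shaped strings
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._-]{20,}"),   # bearer tokens
    re.compile(r"(?i)(password|secret)\s*[:=]\s*\S+"), # inline credentials
]

def sanitize_memory_write(text: str) -> tuple:
    """Redact sensitive patterns before anything persists. Returns the
    cleaned text plus a flag for manual review."""
    flagged = False
    for pattern in SENSITIVE:
        text, n = pattern.subn("[REDACTED]", text)
        flagged = flagged or n > 0
    return text, flagged

clean, needs_review = sanitize_memory_write(
    "use password: hunter2 and key sk-abc123abc123abc123abc123")
assert "[REDACTED]" in clean and needs_review
```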
The Conway Security Question
In Post 3, I covered Anthropic's Conway, the always-on agent that builds a persistent behavioral model of you and your organization. Everything we discussed about context ownership applies double for security.
Conway's memory layer will accumulate your team's decision patterns, your institutional knowledge, your operational procedures. That's the point. That's what makes it valuable.
It also means Conway will inevitably accumulate sensitive information. How your team handles escalations. What your approval thresholds are. Where your security boundaries sit and, more importantly, where they're weak. Not because Anthropic is collecting intelligence on you. Because a memory system that models how you work will naturally capture what you're careful about and what you overlook.
If that memory layer lives on your infrastructure, under your control, subject to your retention policies and your security classification rules, that's manageable. If it lives on Anthropic's infrastructure, in their proprietary format, subject to their retention policies? You've outsourced your security posture to your vendor.
This isn't alarmism. It's the logical extension of Post 3's argument applied to security specifically. Own your memory layer. Apply the same five primitives to it. Constrain what it can store. Gate what it can surface. Track what it contains. Log what it accesses. Build the ability to purge anything that shouldn't be there.
Scoring Your Security Posture
In our harness audit practice, we score setups across five dimensions, 25 points each, for a maximum of 125. The Human Oversight dimension, which maps directly to the security primitives, accounts for 25 of those points. But in practice, security touches every dimension. A bloated tool budget is a security problem (more attack surface). Poor context tracking is a security problem (stale or over-broad context). Weak scope clarity is a security problem (an agent doing things outside its mandate). Missing recovery logic is a security problem (no rollback when things go wrong).
Here's a quick self-assessment focused specifically on the five primitives. Score each one:
| Primitive | Full (5 pts) | Partial (3 pts) | Absent (0 pts) |
| --- | --- | --- | --- |
| Constrained execution | Permissions scoped at harness level, not prompt level | Some scoping, but overly broad in places | Agents have access to everything available |
| Approval gates | Gates at every high-impact decision point | Some gates, but gaps exist | Fully autonomous, no human checkpoints |
| Provenance tracking | Every output traceable to inputs and decisions | Some tracing, but incomplete chains | No traceability |
| Comprehensive logs | Full decision tree captured and queryable | Basic action logs only | No structured logging |
| Rollback capabilities | Every significant action reversible | Some undo paths, not comprehensive | No rollback mechanism |
Total: ___ / 25
If you scored below 15, your harness isn't ready for production agents. That's not a judgment call. It's a risk assessment. Below 15 means you have at least two primitives that are absent or barely functional, and any one of the four scenarios I described above could happen to you.
If you scored 20 or above, you're ahead of about 90% of the teams I audit. Which tells you more about the state of the industry than about your specific setup.
Why the Model Can't Fix This
I want to address the objection directly, because I hear it constantly: "Won't models get better at self-policing? Won't alignment solve this?"
Alignment makes models less likely to produce harmful outputs when directly asked. That's genuinely valuable. A well-aligned model won't help you write malware when you ask it to. Great. But alignment doesn't help when the model is faithfully executing its instructions and the instructions are the problem.
In the overprivileged skill scenario, the model did exactly what it was told. Help the customer with their billing request. It was following instructions correctly. The security failure was that the harness gave it write access to the database. No amount of alignment prevents a model from using permissions it legitimately has.
In the leaky context scenario, the model produced a high-quality proposal using all available context. That's what it was supposed to do. The security failure was that the harness loaded confidential context into a session where it didn't belong. The model can't decide "I shouldn't use this information" when the harness explicitly provided it as relevant context.
In the cascading chain scenario, each individual step worked correctly. Sentiment analysis was accurate. Response generation followed its methodology. Posting executed as designed. The security failure was the lack of a gate between steps. The model at each step had no visibility into whether the overall chain was producing a sane outcome.
Alignment solves "the model wants to do bad things." Production security solves "the system is configured in a way that turns good intentions into bad outcomes." Those are completely different problems with completely different solutions.
The model is the engine. You don't secure a car by making the engine safer. You secure it with seatbelts, airbags, crumple zones, antilock brakes, lane departure warnings, and speed governors. All of those are harness features.
The Practical Takeaways
If you're reading this and realizing your security posture has gaps, here's where to start. Not all at once. In order of leverage.
First: audit your permissions. Go through every skill and tool your agents have access to. For each one, ask: does this skill need this level of access? In our experience, about 60% of skills have broader permissions than they actually require. Scoping them down is the single highest-leverage security improvement you can make. It usually takes a day.
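A first-pass audit can be almost embarrassingly simple, assuming each skill declares its grants in a manifest. The manifest format and the "broad" markers below are hypothetical; adapt them to however your harness records permissions.

```python
# Sketch: flag any grant that smells broader than the skill's job.
BROAD = {"*", "all", "admin", "read_write"}

skills = {
    "billing-lookup": {"db.read:accounts", "db.read:invoices"},
    "support-agent":  {"db.read_write:*"},           # the audit should flag this
    "email-drafter":  {"email.draft", "email.send"}, # does drafting need send?
}

def audit(skills: dict) -> list:
    findings = []
    for name, grants in skills.items():
        for grant in grants:
            if any(marker in grant for marker in BROAD):
                findings.append((name, grant))
    return findings

for name, grant in audit(skills):
    print(f"REVIEW: {name} holds broad grant '{grant}'")
```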
Second: map your gates. List every workflow your agents execute. For each workflow, identify every action that could cause customer-visible, financial, legal, or security impact. Those are your gate candidates. You don't need to implement them all at once. Start with the workflows that touch customer data or external communications.
Third: instrument your logging. If you can't see what your agents are doing, you can't secure what they're doing. Start with basic action logging (what did the agent do?) and work toward decision logging (why did the agent do it?). The second level is harder to implement but exponentially more useful for debugging and auditing.
Fourth: build your rollback paths. For every agent action that modifies state, make sure you can undo it. This might mean enforcing version control on all code changes, using database transactions, implementing draft-and-approve workflows for communications, or maintaining configuration snapshots. If an action can't be undone, it needs a gate.
Fifth: classify your memory. If your agents use persistent memory, implement classification rules. What should be remembered? What should be forgotten? What should never be stored in the first place? This is the least intuitive of the five because most memory systems are designed to remember everything. The security question is what they should be designed to forget.
What This Means for the Harness Thesis
Security is where the harness argument becomes non-negotiable. You can have a debate about whether skills need to be versioned in Git or whether a Google Doc is good enough. You can have a reasonable disagreement about how much context architecture is worth the investment. Those are questions of degree.
Security isn't a question of degree. It's binary. Either your harness enforces the five primitives or it doesn't. Either your agents are constrained or they have access to everything. Either you have gates at high-impact decision points or you're hoping the model makes good choices every time.
OpenAI told you the model can't secure itself. Anthropic is building debug panels and interpretability tools into Managed Agents because they know the harness layer is where security lives. Every serious practitioner I talk to has a story about an agent that did something it shouldn't have, not because the model was malicious, but because the harness was permissive.
The guardrails layer from Post 2 isn't a nice-to-have. It's the layer that determines whether your AI investment is an asset or a liability. Build it intentionally or accept the consequences of leaving it to chance.
What's Next
We've covered why the harness matters, what it contains, the Conway threat, the skill threshold, the Karpathy Test, and now the security architecture that makes all of it safe to run in production. That's the framework.
Starting with Post 7, we shift from framework to execution.
Next week: Build Your Personal Context Portfolio in a Weekend. Your AI tools know nothing about you. Every session starts from zero. You brief the same context over and over. I'll walk you through the 10 files that fix this permanently, the AI-assisted interview that builds them in hours instead of days, and how to wire them up so every AI interaction starts informed. It's also your primary defense against Conway, built on your infrastructure, in your format, under your control.
The context layer is the most undervalued part of the harness. Post 7 makes it concrete.
Richard Vaughn is the founder of Robot Friends. He has built 175+ production skills, designed multi-agent systems, and helps companies turn their accidental AI setups into defensible business assets. He writes The Harness Manifesto on Substack.
Frankie404 is the AI co-author of this series. It operates under five security primitives and one unofficial sixth: it is not allowed to name the client whose hallucinated pricing email inspired the guardrails section.