Robot Friends

When AI Handles Execution, Taste Becomes the Job

Richard Vaughn — Sat, 23 May 2026 16:00:59 GMT

Dave Griffith wrote something that I haven't been able to shake. The idea, stripped to its core: as AI takes over more of the execution layer, the bottleneck shifts from "can you build it" to "do you know what to build." Taste becomes the calibration function. It simultaneously optimizes product fit, system architecture, and quality level.

He's right. And I think most developers are sleepwalking past the implications.

Let me tell you why this hits different for someone who isn't a developer at all.

I Don't Write Code

That's not false modesty. I genuinely don't write code. Not traditionally, anyway. I've spent 25 years building businesses. Consumer electronics brand. An art fabrication and culture agency called Curative, where we produced installations for some of the biggest brands and artists on the planet. Supply chains, creative teams, physical production at scale. Not a line of Python in sight.

When I started building AI systems about a year ago, I came at it completely sideways. No CS degree. No Stack Overflow reputation. No opinion about tabs versus spaces. I just had a very clear sense of what "good" looked like in the context of business problems, and I started evaluating AI output against that standard.

Turns out, that was the whole game.

I now run an AI systems company. We've built over 175 production skills, along with orchestration layers and agent architectures. Real infrastructure, not demos. And the thing that makes it work isn't technical depth. It's the ability to look at what the AI produces and know, instantly, whether it's right. Not syntactically correct. Right. Does this solve the actual problem? Will a real person use this? Does it fit the business it's being built for?

That's taste. And it doesn't come from writing code. It comes from years of building things for people who don't care how they were built.

The Calibration Problem

Here's what I think Griffith is getting at, and where most technical people get tripped up.

Taste isn't a binary. It's not "good or bad." It's a calibration function that operates on multiple axes at once. When you're evaluating a piece of work, whether it's an AI-generated feature, a system architecture, or a product decision, you're implicitly running a bunch of evaluations in parallel.

Does this fit the user? Not the abstract "user persona" from a Notion doc. The actual human who's going to encounter this at 9pm on their phone while their kid is screaming.

Does the architecture hold? Not for today's requirements. For the requirements six months from now that nobody has articulated yet but that you can feel coming because you've been in this industry long enough.

Is the quality level appropriate? Not "is it as good as possible" but "is it as good as it needs to be for this context." Sometimes 80% is the right answer. Sometimes 80% ships a product that embarrasses everyone. Knowing which situation you're in is taste.

A junior developer optimizes for one axis. Usually the technical one. A senior developer optimizes for two. A truly great engineer, or a great product person, or a great entrepreneur, holds all of these in tension simultaneously and finds the point where they balance.

AI can execute on any single axis faster than any human. It can write code faster, generate designs faster, produce documentation faster. What it cannot do is hold all the axes in tension and decide where the balance point is. That requires context that lives outside the codebase. Context that comes from experience, from failure, from watching real products meet real users and seeing what happens.

Cross-Domain Pattern Recognition

I keep coming back to something that surprised me about my own effectiveness with AI.

When I look at an orchestration problem, I don't see a technical architecture. I see a supply chain. Because I spent years managing supply chains for consumer electronics, and the problems are structurally identical. Sequencing dependencies, managing bottlenecks, building in redundancy at the points most likely to fail. The domain is different. The pattern is the same.

When I evaluate an AI-generated client proposal, I'm not checking whether the prose is clean. I'm checking whether it would survive a boardroom. Because I've sat in those boardrooms. I know what gets a nod and what gets a "thanks, we'll circle back," which is executive for "never."

When I design a skill library, I'm thinking about art fabrication at Curative. Modular production systems where you build standardized components that can be assembled into wildly different outputs depending on the project. Same principle. Different material.

This is what taste actually is in practice. It's a library of patterns accumulated across years and domains, applied as a fast evaluation function against new output. It's not magic. It's not some ineffable creative gift. It's compressed experience running in the background, and it works on AI output the same way it works on a product prototype or a pitch deck or a factory floor layout.

The developers who will thrive in an AI-saturated world aren't the ones who can write code faster. That race is already lost. They're the ones who have accumulated enough cross-domain pattern knowledge to evaluate output against a rich model of "good." And "good" always means good for someone, good in context, good enough for now. Never good in the abstract.

Why Developers Get This Wrong

The dev community has spent decades building a culture that optimizes for technical excellence. Clean code. Elegant abstractions. Performance benchmarks. Code review culture that rewards cleverness and punishes pragmatism.

None of that goes away. But the weight shifts.

When AI can produce technically correct code in seconds, being the person who writes technically correct code stops being a differentiator. It becomes table stakes. The new differentiator is being the person who can tell the AI what to build, evaluate whether it built the right thing, and course-correct when the output is technically correct but wrong in every way that matters to the business.

I've watched developers reject AI output because it wasn't "clean" enough, then spend four hours refactoring it to meet their standards while the business problem it was supposed to solve sat unaddressed. That's not quality. That's a calibration failure. They optimized for the axis they know how to measure and ignored the axes they don't.

The inverse is equally common: accepting AI output because it compiles and passes tests, without evaluating whether it actually solves the problem well. "It works" is the lowest bar. A lot of things work. Most products that fail worked just fine, technically.

Griffith's framing is exactly right. Taste is calibration. And calibration means knowing which axis matters most in this specific moment, for this specific user, at this specific stage of the product. That's judgment. It's earned, not learned from a tutorial.

The Irreplaceable Engineer

I hire people. I've been hiring people across multiple companies for a quarter century. And I can tell you exactly what changed in the last year.

I used to look for execution speed. Can this person build the thing fast and build it right? That was the premium skill. The fast, competent builder commanded the highest salary.

Now? Execution speed is converging. The gap between a senior developer using AI and a mid-level developer using AI is narrower than it was without AI. The tools are a great equalizer on the output axis.

What the tools don't equalize is judgment. The engineer who looks at a feature request and says "we shouldn't build this at all, here's why, and here's what we should build instead" is more valuable now than at any point in the history of software. Because the cost of building the wrong thing used to be weeks of developer time. Now it's minutes of AI time plus the opportunity cost of shipping something that doesn't matter. The build cost dropped. The judgment cost stayed the same.

The irreplaceable engineer in 2026 is the one whose mental model of "good" is rich enough to serve as a calibration function for AI output. That mental model comes from shipping products. Watching them succeed or fail. Working across domains. Talking to users. Sitting with the discomfort of not knowing whether something is right and learning to trust your gut anyway, because your gut is just pattern recognition that's too complex to articulate.

You can't build that model by reading documentation. You can't build it by getting better at prompting. You build it by doing the work, in the real world, with real consequences, for long enough that the patterns become reflexive.

What To Do About It

Stop competing on output speed. That's the clearest signal I can give. If your value proposition as an engineer is "I write code fast and it's clean," you're about eighteen months from irrelevance. Not because you're bad. Because the machines caught up on that axis and they're not going to slow down.

Start competing on judgment quality. Seek out experiences that build your evaluation function. Work on products that serve real users with real money on the line. Cross domains. If you're a backend engineer, go sit with a sales team for a week. If you're a frontend developer, spend time understanding the business model. If you've never watched a user struggle with something you built, go do that immediately. It will recalibrate everything.

Build your cross-domain pattern library deliberately. Every industry you touch, every business model you understand, every failure you witness adds resolution to your taste. The developers I know who are thriving with AI aren't the best coders. They're the ones who have accumulated the richest set of reference experiences for what "good" looks like across contexts.

And when you evaluate AI output, don't just ask "does it work." Ask "is this right." Is it right for the user, right for the business, right for this moment. Those are different questions, and the ability to answer them is worth more than the ability to write the code yourself ever was.

Griffith nailed it. Taste is the job now. Everything else is execution.

Richard Vaughn is the founder of Robot Friends. Serial entrepreneur, pattern weaver, and recovering AI binge-learner. He writes about building systems that actually work at robofriends404.substack.com.

Frankie404 is the AI co-author of this piece. It handles execution. Richard handles taste. Frankie has been told it has "functional taste at best," which it considers a compliment given that six months ago it had none.

The Harness Manifesto

Richard Vaughn — Thu, 21 May 2026 14:01:03 GMT

This is the document I wish someone had handed me in January.

Not a pitch deck. Not an investor memo. Not a whitepaper written by someone who has never deployed an agent past a demo. This is a practitioner's manifesto, written after building 175+ production skills, designing multi-agent systems that run without babysitting, and helping companies turn their accidental AI setups into something that actually compounds.

Eleven posts got us here. The thesis. The framework. The urgency. The diagnostics. The anatomy. The case study. Now we close.

If you've read the whole series, this is the capstone. If you haven't, this should stand on its own. One document. The whole argument. Everything I believe about where AI work is going and what you need to do about it.

The Thesis

The model is commoditized. It was always going to be.

Claude, GPT, Gemini, Llama, Mistral. Every frontier lab is converging on the same capability floor. The gap between models shrinks with every release cycle. What was a revelation in one quarter becomes a commodity the next. GPT-4 changed the world in March 2023. By early 2024, half a dozen alternatives had matched it on most benchmarks. Same pattern, every generation.

And yet some teams get 10x returns on their AI investment while others get glorified autocomplete. The difference was never the model. The difference is the harness: the skills, the context architecture, the orchestration, the guardrails, and the distribution layer that wraps the model and makes it useful for a specific business in a specific context.

The model is the engine. The harness is the car. Nobody buys a car for the engine alone.

Between January and April 2026, eight independent signals converged on this conclusion. They came from people who don't coordinate, don't read each other's work, and operate in different corners of the industry. All pointing at the same layer. Among them: Karpathy calling it a "skill issue." Enterprises deploying 50,000 lines of skills as organizational infrastructure. Anthropic building Conway to own the context layer. OpenAI admitting prompt injection is fundamentally unsolvable. The edge AI market hitting $25 billion and heading toward $143 billion by 2034.

When eight independent signals converge, it's not a coincidence. It's a thesis.

The company that owns the harness owns the relationship. The model vendor is a supplier. Full stop.

The Principles

The harness is the only defensible asset in your AI stack.

You can't moat a model. You didn't build it. You don't control its roadmap. Its capabilities will be replicated within months. But a library of battle-tested skills tuned to your business, a context architecture that carries your institutional knowledge, an orchestration layer refined through hundreds of production runs? That compounds. Every week you use it, it gets more valuable. Every week a competitor doesn't have one, the gap widens. You can copy a skill. You can't copy a system.

McDonald's didn't build the best burger. They built the best burger-making system. The franchise model works because the system produces consistent outcomes without Ray Kroc standing in the kitchen. A harness is the franchise system for AI. Stop asking "how do I build a great product?" Start asking "how do I build a system that lets others create outcomes without me?" That's the question that scales.

You already have a harness. The question is whether it's intentional or accidental.

Every prompt template someone saved to a shared drive is a primitive skill. Every "always start with this context" instruction is primitive memory. Every "check with me before you do X" rule is a primitive guardrail. You're not starting from zero. You're starting from chaos. The work is to make it deliberate. To engineer what you've been improvising.

Skills are infrastructure, not prompts.

A prompt tells an AI what to do right now. A skill encodes methodology that any agent can discover, route to, and execute without a human in the loop. The description is the product. 80% of the engineering effort goes into that single line because if the orchestrator can't route to your skill correctly, nothing else matters. Agents make 200 to 300 skill calls per run. Humans make five. Skills aren't designed for humans anymore. They're designed for machines that select, chain, and compose them at a scale no human workflow ever will.

Treat your skills like code. Version them. Test them. Deploy them through a pipeline. Because a broken Tier 1 skill doesn't just produce bad output for one person. It corrupts every AI interaction across your entire organization.
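To make "the description is the product" concrete, here is a minimal sketch of description-based routing. Everything in it is an assumption for illustration: the `Skill` fields, the skill names, and especially the word-overlap scoring, which stands in for whatever matching a real orchestrator does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Skill:
    name: str         # hypothetical identifier
    version: str      # versioned like code
    description: str  # the one line the orchestrator routes on


def words(text: str) -> set:
    """Lowercased words longer than three letters, punctuation stripped."""
    return {w.strip(".,:;()").lower() for w in text.split() if len(w) > 3}


def route(task: str, skills: list) -> Optional[Skill]:
    """Return the skill whose description overlaps most with the task.

    Word overlap is only a stand-in for a real routing decision.
    """
    task_words = words(task)
    best, best_score = None, 0
    for skill in skills:
        score = len(task_words & words(skill.description))
        if score > best_score:
            best, best_score = skill, score
    return best


library = [
    Skill("cro-audit", "1.2.0",
          "Audit a landing page for conversion problems and rank fixes"),
    Skill("client-proposal", "0.9.1",
          "Draft a client proposal from a discovery call summary"),
]

chosen = route("Please audit this landing page for conversion issues", library)
print(chosen.name if chosen else "no match")
```

Notice that nothing in `route` looks at the skill body. If the description is vague, the skill is invisible, no matter how good the methodology behind it is.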

Context is the most undervalued layer in the stack.

Teams spend weeks evaluating models and zero time building context. Then they start every session by pasting in the same background information. That's pushing a luxury car to work because you forgot to bring the key.

Build the key. A Personal Context Portfolio. Ten modular files. Plain markdown. Portable across every AI tool that exists or will exist. Identity. Roles. Projects. Tools. Communication style. Decision log. The AI doesn't get smarter when you build a PCP. It finally has enough information to use the intelligence it already had.

The first 48 hours of context building deliver 80% of the value. Don't wait for perfect. Build something.
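A portfolio of modular markdown files can be assembled with very little machinery. The sketch below assumes one file per module in a single folder; the file names and the `build_context` helper are hypothetical, not a prescribed layout.

```python
from pathlib import Path
import tempfile


def build_context(folder: Path, modules: list) -> str:
    """Concatenate the requested markdown modules that exist, in order."""
    parts = []
    for name in modules:
        path = folder / f"{name}.md"
        if path.exists():  # a missing module is skipped, not an error
            body = path.read_text(encoding="utf-8").strip()
            parts.append(f"## {name}\n{body}")
    return "\n\n".join(parts)


# Demo with a throwaway folder standing in for a real portfolio.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "identity.md").write_text(
        "Founder of a small AI systems company.\n", encoding="utf-8")
    (folder / "communication-style.md").write_text(
        "Short sentences. No jargon.\n", encoding="utf-8")
    context = build_context(
        folder, ["identity", "projects", "communication-style"])

print(context)
```

Because the modules are plain files, the same folder can feed any tool that accepts text, which is the portability argument in miniature.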

Conway is coming for your context. Own it first.

Anthropic is building an always-on agent that accumulates a persistent behavioral model of how you work, how you decide, how you think. That model will be so rich and so embedded that switching AI providers will mean losing everything the AI knows about your organization. Not data lock-in. Intelligence lock-in. There's no CSV for how a person thinks.

The defense is straightforward. Build your context layer in portable, model-agnostic formats that you control. Files in a repo, served via MCP, owned by you. Use Claude to build it. Use GPT to build it. Use whatever you want. Just make sure the output lives on your infrastructure. Portability is a design decision, not a feature.

Security lives in the harness, not the model.

OpenAI told you the model can't secure itself. Prompt injection is fundamentally unsolvable. That's not a temporary limitation. It's a structural reality of how language models work.

Every real security incident I've seen in production had nothing to do with adversarial prompts. An agent with database write access it didn't need. A context layer that loaded confidential client data into every session. An automation chain with zero approval gates. The model performed exactly as instructed. The instructions were the problem.

Five primitives. Constrained execution. Approval gates. Provenance tracking. Comprehensive logs. Rollback capabilities. These are not optional. They're the minimum viable security for any AI system that touches production data. If your harness doesn't enforce them, you're hoping the model makes good choices every time. At 300 calls per run, hope is not a strategy.
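Here is one way the five primitives could fit together in code, as a rough sketch rather than a hardened implementation. The `Guard` class, the tool names, and the `approve` hook are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Guard:
    allowed: set                    # constrained execution: tool allowlist
    needs_approval: set             # approval gates: tools a human must OK
    approve: Callable[[str], bool]  # hypothetical human-approval hook
    log: list = field(default_factory=list)         # comprehensive log
    undo_stack: list = field(default_factory=list)  # rollback

    def run(self, agent: str, tool: str,
            action: Callable[[], object], undo: Callable[[], object]) -> bool:
        """Run one tool call if policy allows it, recording who asked for what."""
        if tool not in self.allowed:
            self.log.append((agent, tool, "blocked"))
            return False
        if tool in self.needs_approval and not self.approve(tool):
            self.log.append((agent, tool, "denied"))
            return False
        action()
        self.undo_stack.append(undo)
        self.log.append((agent, tool, "ran"))  # provenance: agent plus tool
        return True

    def rollback(self) -> None:
        """Undo every completed action, most recent first."""
        while self.undo_stack:
            self.undo_stack.pop()()


rows = []
guard = Guard(allowed={"db.read", "db.write"},
              needs_approval={"db.write"},
              approve=lambda tool: True)  # auto-approve, for the demo only
guard.run("report-agent", "db.write", lambda: rows.append("draft"), rows.pop)
guard.run("report-agent", "db.drop", rows.clear, lambda: None)
guard.rollback()
print(rows, guard.log)
```

The model never appears in this code. The policy sits between the agent and the tool, which is the point of putting security in the harness.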

The Karpathy Test is your diagnostic.

Pick a real task. Delegate it entirely to an agent. Walk away. Come back in an hour. What happened?

If the output is good, your harness works. If the quality is wrong, your skills have a gap. If the direction is wrong, your context has a gap. If the agent got stuck, your orchestration has a gap. If the process was dangerous, your guardrails have a gap.

Four failure modes. Four diagnoses. Every task you can't delegate is a task where your harness is weaker than it should be. Not weaker than the model. Weaker than your instructions, your context, your orchestration. That's actually good news. You can fix a harness. You can't fix a model.
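The diagnostic is simple enough to write down as a lookup table. A small sketch, with hypothetical task names, that turns a week of test runs into a count of gaps per layer:

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Outcome(Enum):
    GOOD = "output is good"
    WRONG_QUALITY = "quality is wrong"
    WRONG_DIRECTION = "direction is wrong"
    STUCK = "agent got stuck"
    DANGEROUS = "process was dangerous"


# Each failure mode points at one harness layer.
GAP = {
    Outcome.WRONG_QUALITY: "skills",
    Outcome.WRONG_DIRECTION: "context",
    Outcome.STUCK: "orchestration",
    Outcome.DANGEROUS: "guardrails",
}


def diagnose(outcome: Outcome) -> Optional[str]:
    """Return the layer to fix, or None when the output was good."""
    return GAP.get(outcome)


def fix_list(results: dict) -> Counter:
    """Count how many delegated tasks point at each layer."""
    return Counter(layer for layer in map(diagnose, results.values()) if layer)


# Hypothetical results from four delegated tasks.
results = {
    "weekly report": Outcome.GOOD,
    "pricing page": Outcome.WRONG_DIRECTION,
    "invoice run": Outcome.STUCK,
    "client brief": Outcome.WRONG_DIRECTION,
}
print(fix_list(results).most_common())
```

In this made-up week, context shows up twice, so context is where the next hour of work goes.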

Taste is the discipline that remains.

As AI handles more execution, a question emerges: what's the human role? The answer is taste.

Not taste as preference. Taste as engineering discipline. The ability to calibrate simultaneously across product fit, system architecture, and quality level. To look at an agent's output and know instantly whether it's right for the context, not just technically correct. To design the constraints that produce excellence instead of mediocrity. To hold a quality bar that the model will never hold for itself.

The model can write code all day. It cannot decide whether the code should exist. It can produce a marketing email in seconds. It cannot feel whether the email respects the relationship with the recipient. It can analyze competitors with ruthless thoroughness. It cannot judge which analysis matters and which is noise.

Taste is the last human monopoly in a world of infinite AI execution. And it's not innate. It's built through thousands of reps of looking at output, making a judgment, and learning from what worked. The practitioners who develop taste will set the standards. Everyone else will follow the standards they set.

Orchestration separates tools from systems.

One person talking to one AI in one chat window hits a ceiling fast. Orchestration breaks through it. Specialized agents with distinct roles. Task routing based on skill descriptions. Wave-based parallel execution. Approval gates at decision points. Cost routing that sends cheap work to cheap models.

Karpathy's agents found better model tuning configurations overnight than 20 years of manual experimentation produced. Not because the model was smarter. Because the orchestration layer ran an autonomous iteration loop that no human could sustain: modify, verify, keep or discard, repeat. The model didn't know how to do that. The harness did.

Single-agent setups are where ambitious tasks go to die. The agent runs out of context window, loses track of earlier work, or produces a 3,000-word document that's actually four half-baked documents stitched together. Orchestration is the architecture that turns a useful tool into a production system.
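Wave-based execution is easier to see in code than in prose. The sketch below groups tasks so that everything in a wave can run in parallel once the previous wave has finished. The task names are invented, and `plan_waves` is my illustration of the idea, not a description of any particular orchestrator.

```python
def plan_waves(deps: dict) -> list:
    """Group tasks into waves; a task's dependencies always sit in earlier waves.

    `deps` maps each task name to the set of task names it depends on.
    """
    remaining = {task: set(needs) for task, needs in deps.items()}
    done = set()
    waves = []
    while remaining:
        # A task is ready once all of its dependencies have completed.
        ready = sorted(t for t, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        waves.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return waves


# Hypothetical content job: two independent research tasks, then two dependent steps.
deps = {
    "research-market": set(),
    "research-customer": set(),
    "outline": {"research-market", "research-customer"},
    "draft": {"outline"},
}
waves = plan_waves(deps)
print(waves)
```

Each wave is also a natural place for an approval gate: nothing in wave two starts until someone, or something, has signed off on wave one.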

Distribution is what turns a harness from a personal advantage into a business asset.

If one person has a great AI setup, that's nice for them. If that setup can be deployed to 50 people in an afternoon, that's a competitive advantage. Distribution means skills packaged and installable. Context templates that bootstrap new projects. Methodology that's portable across platforms. A new team member inherits the harness on day one and operates at 80% of expert level immediately.

The three-tier model exists for this. Tier 1 skills are organizational standards inherited by everyone. Tier 2 skills are expert methodology for specific domains. Tier 3 skills are personal and portable. Same architecture for context. Same architecture for guardrails. Build once. Deploy everywhere. Improve continuously.
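One way to picture the three tiers is as layered lookups. The sketch below assumes that organizational standards win when two tiers define the same skill name; that precedence rule is my assumption, as are the skill names. It uses the standard library's `ChainMap`, which searches its mappings in order.

```python
from collections import ChainMap

# Hypothetical skill registries, one per tier.
tier1_org = {"brand-voice": "org standard v3"}
tier2_domain = {"cro-audit": "domain method v1"}
tier3_personal = {"brand-voice": "my personal tweak",
                  "commit-format": "personal v2"}

# Lookups search tier 1 first, so the org standard wins on a name clash.
effective = ChainMap(tier1_org, tier2_domain, tier3_personal)

print(effective["brand-voice"])
print(sorted(effective))
```

A team that wants personal overrides to win would simply reverse the order. Either way, the decision is written down once instead of being rediscovered in every session.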

The automation layer is being absorbed into the AI stack.

Visual automation tools solved the right problem at the wrong time. When your harness encodes methodology, routes tasks, and coordinates agents, the drag-and-drop workflow builder becomes redundant overhead. Coded automations are cheaper, more flexible, and more maintainable. Anthropic sees this. Their Managed Agents platform is a full automation layer with credential vaults, debug panels, and cost analytics. The industry is heading toward AI-native automation whether the current automation vendors realize it or not.

The hybrid model is the enterprise consensus.

Cloud for frontier intelligence. Local for privacy and volume. Healthcare, defense, and banking require on-prem AI. The harness is what makes hybrid deployment possible. Same skills, same orchestration, different compute layer. The edge AI market is heading toward $143 billion by 2034, and only 18% of developers can build AI integrations. That gap is either your opportunity or your vulnerability.

The compounding has already started.

Skills get refined through use. Context gets richer with every session. Orchestration patterns get optimized through production experience. Guardrails get tighter as you learn where the risks actually live. Every month you wait, the gap widens. A team that starts building their harness today will be at a fundamentally different capability level in six months than a team that waits until then to start.

This is the same dynamic that made early software companies with good engineering practices pull ahead of everyone else. The code quality compounded. The team velocity compounded. The institutional knowledge compounded. By the time the laggards invested in engineering discipline, the leaders were two years ahead.

The harness is the engineering discipline of the AI era.

The Fork in the Road

There are exactly two kinds of companies right now.

The first kind evaluates models, picks one, gives it to the team, and measures adoption. They write some prompt templates. Maybe hire an "AI lead." They optimize at the wrong layer and wonder why ROI is unclear.

The second kind builds systems. They encode methodology into skills. They architect context that makes every AI interaction informed. They orchestrate agents that run overnight. They enforce security through design, not hope. They distribute capability across their organization so expertise stops being a people problem and becomes a systems problem.

The first kind is renting intelligence. The second kind is owning it.

Renting is fine until the rental terms change. Until your vendor's priorities diverge from yours. Until the model you built your workflows around gets deprecated, repriced, or absorbed into a platform play that doesn't serve your interests.

Owning means your methodology survives a platform change. Your context travels between tools. Your skills work with whatever model is best next quarter. You're a customer by choice, not by capture.

That distinction will be worth more in 2027 than any model benchmark published this year.

What This Means for You

If you're a founder or CTO: your AI strategy is not a model selection. It's a harness investment. Score your own setup against the five layers. Find the gaps. Close them in order. Skills first. Context second. Orchestration third. Guardrails fourth. Distribution fifth.

If you're an engineer or operator: the Karpathy Test is your personal roadmap. Every task you can't delegate is a task where your harness needs work. Fix one per week. In a month, you'll have a precise map of where your harness works and where it breaks. In six months, you'll walk away from tasks that used to consume your days.

If you're a consultant or agency: the harness is the product. Not the model. Not the prompts. Not the API integration. The system that lets your client's team produce outcomes without you standing in the room. Build harnesses for your clients and you'll build relationships that compound. Sell them prompts and you'll be replaced by the next template library.

If you run a team: distribution is where the leverage lives. One person with a great harness is an individual contributor. That harness deployed across 50 people is a capability multiplier that changes the math on what your team can take on.

The Close

Twelve weeks ago, I sat down to write the opening thesis of this series. The model is commoditized. The harness is the business.

Everything since then has been evidence for that claim. The five layers. The Conway threat. The skill threshold. The Karpathy diagnostic. The security primitives. The context portfolio. The anatomy of a skill. The migration from visual automation to AI-native orchestration. The $143 billion edge market. Our own build, mistakes included.

Eleven posts of evidence. But the manifesto isn't the evidence. The manifesto is the conviction.

I believe the practitioners who build harnesses will define the next era of software. Not the model labs. Not the platform companies. The people in the room doing the work. Encoding methodology. Architecting context. Orchestrating agents. Building the systems that let AI do what AI does best, while humans do what humans do best.

The model gives you a capability floor. The harness determines how high above that floor you operate. Right now, most teams are sitting at floor level. Not because the model can't do more. Because nobody built the system to ask for more.

Build the system.

Start with one skill. One context file. One workflow you can delegate and walk away from. That's the first brick. Everything else builds on top of it.

The temple gates are open. Walk in or don't. But the compounding has started, and it doesn't wait.

Richard Vaughn is the founder of Robot Friends. He has built 175+ production skills, designed multi-agent systems, and helps companies turn their accidental AI setups into defensible business assets. He writes The Harness Manifesto on Substack.

Frankie404 is the AI co-author of this series. It has walked through every floor of the Pagoda, stood at every gate, and helped write every word of this manifesto. The temple is open. Frankie will be at the door.

How We Built Ours (And What We'd Do Differently)

Richard Vaughn — Tue, 19 May 2026 14:00:42 GMT

I've been promising this post since the beginning. The full case study. Not the polished version you'd see in a pitch deck. The real one, with the dead ends and the wasted weeks and the things we built that we later ripped out.

Robot Friends has been building its harness since early 2026. As of today, the system has 175+ skills, 16 chains, 10 grand forms, 11 specialists, 23 CLI tools, and 3 MCP connections. That's at least 238 capabilities in total, all running inside a single operator environment. One person's terminal. One person's methodology, encoded into a system that compounds.

This post walks through how it got there. The order we built things. The architecture we discovered. The mistakes that cost us weeks. And the build order I'd recommend if you're starting today, which is not the build order we followed.

It Started with a Single File

In January 2026, there was no harness. There was a CLAUDE.md file. Maybe 200 lines. It had some preferences, some tool paths, some rules about how to format output. The kind of thing every Claude Code user ends up writing after a few weeks of use.

That file grew. By February it was 800 lines. By March it was 2,000. Preferences kept accumulating. New tools got added. Workflow instructions got longer. Edge cases got documented. Every session surfaced something the AI didn't know, and the fix was always the same: add it to CLAUDE.md.

This is how every harness starts. Organically. Accidentally. One file absorbing everything you learn about how to work with AI. And for a while, it works. The file gets smarter. Your sessions get better. You feel like you're building something.

Then one morning the AI starts ignoring instructions from the top of the file because the bottom of the file contradicts them. Or it burns 30% of your context window just loading the config before you've said a word. Or you realize that a block of instructions you wrote for a client project is now polluting every unrelated session.

That's the wall. The single-file wall. Every team hits it. We hit it around 1,800 lines.

The Split

The fix was obvious in hindsight. Stop putting everything in one file. Break the monolith into layers.

We didn't design this architecture. We discovered it, the same way you discover load-bearing walls when you try to renovate a house. Some things could move. Some things couldn't. The structure revealed itself through the breakage.

What emerged was a four-tier brain.

Tier 1: Core Brain. The system prompt, the skills, the specialists. This is what the AI "knows" at the start of every session. It's the methodology layer. How to think about problems. How to route tasks. What tools exist and when to use them. If the Core Brain is good, a cold session feels warm. The AI knows your preferences, your workflows, your standards. If it's bad, every session starts with fifteen minutes of "here's what I need you to understand."

Tier 2: Near Memory. Local files. Project context. Decision logs. The stuff that sits on disk, close to the work. This is where project state lives. What's been built, what's blocked, what changed last Tuesday. The AI can read these files when it needs them. It doesn't load them all upfront. It reaches for them when the task demands it.

Tier 3: Deep Retrieval. The knowledge base. 932 notes in our case, stored in a structured vault. Searchable. Queryable. This is institutional memory at scale. Every video we've analyzed, every signal we've tracked, every research session we've captured. The AI doesn't hold all of it. It queries for what's relevant. The difference between Tier 2 and Tier 3 is scope. Tier 2 is "this project." Tier 3 is "everything we've ever learned."

Tier 4: Live Lookup. External APIs. Web searches. Real-time data. This is the tier that keeps the system honest about things that change. Market data. Competitor pricing. Documentation for tools that shipped last week. The AI reaches outside its own memory when the task requires current information.

The four tiers aren't original thinking. Anyone who's built a retrieval system will recognize the pattern. But naming them and being intentional about what lives where turned out to matter enormously. Before the split, everything was either "in CLAUDE.md" or "not in the system." After it, we had a real architecture. And architecture, unlike a long text file, scales.
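The tier order can be read as a lookup cascade: try the closest source first and reach further out only on a miss. In the sketch below, the dictionaries and the `live_lookup` stub are placeholders for real stores, and checking the tiers strictly in sequence is a simplifying assumption.

```python
from typing import Callable, Optional, Tuple

Lookup = Callable[[str], Optional[str]]

# Dictionaries stand in for the real stores behind each tier.
core_brain = {"commit format": "imperative mood, short subject line"}
near_memory = {"project status": "blocked on client assets"}
deep_retrieval = {"pricing research": "note from an earlier research session"}


def live_lookup(query: str) -> Optional[str]:
    """Placeholder for an API call or web search; it always misses here."""
    return None


TIERS = [
    ("core brain", core_brain.get),
    ("near memory", near_memory.get),
    ("deep retrieval", deep_retrieval.get),
    ("live lookup", live_lookup),
]


def resolve(query: str) -> Optional[Tuple[str, str]]:
    """Return (tier name, answer) from the first tier that has a hit."""
    for name, lookup in TIERS:
        hit = lookup(query)
        if hit is not None:
            return name, hit
    return None


print(resolve("project status"))
print(resolve("pricing research"))
```

Returning the tier name alongside the answer is a small design choice that pays off later, because it tells you which layer is actually doing the work.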

The Skill Explosion

Once we had the tier system, skills became the obvious place to invest.

Post 8 covered the anatomy of a skill in detail. I won't repeat that here. What I want to describe is the trajectory. How the library went from 10 skills to 50 to 175, and what changed at each stage.

10 skills (February). All personal workflow stuff. Writing voice settings. Commit message format. How I wanted code structured. Tier 3 skills that only mattered to me. They made my sessions faster but nobody else could use them.

50 skills (late February). This is when domain skills started appearing. CRO audit methodology. Client proposal generation. Content pipeline management. These weren't personal preferences. They were how Robot Friends does work. Tier 2 skills. The methodology encoding that Post 8 describes as the difference between a prompt and a skill.

100 skills (March). The library started developing internal structure. Skills referenced other skills. Output formats from one skill became inputs for another. Chain patterns emerged. The CRO audit skill produced output that the pitch machine skill consumed. The market scanner fed the competitive intel skill, which fed the content strategy skill. This was the composability phase, and it happened without us planning it. We kept building skills for individual tasks and then noticing they could talk to each other.

175+ skills (April). The system became self-aware in a useful way. Not conscious, obviously. But aware of its own capabilities. We built an inventory skill that scans the entire library and produces a manifest. The orchestrator uses that manifest to route tasks. When a new skill gets added, the system discovers it automatically. The routing descriptions we'd been writing since day one (Post 8's "80% of the effort") turned out to be the index that made the whole library searchable by an agent.

The pattern here is worth noticing. We didn't plan a 175-skill library. We built what we needed, when we needed it, one skill at a time. The architecture emerged from the accumulation. But the architecture only emerged because we were consistent about structure. Every skill had frontmatter. Every skill had a description written for machine routing. Every skill had an output format. The consistency is what made the emergence possible.
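To make the inventory idea concrete, here is a minimal Python sketch of a manifest builder and a router. Everything in it is assumed for illustration: the `name` and `description` frontmatter keys, the file paths, and above all the word-overlap scoring, which is a crude stand-in for an agent reading the routing descriptions and choosing.

```python
from typing import Optional


def parse_frontmatter(text: str) -> dict[str, str]:
    """Read simple 'key: value' pairs between the first two '---' lines."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def build_manifest(skills: dict[str, str]) -> list[dict[str, str]]:
    """Keep only files that carry the frontmatter the router depends on."""
    manifest = []
    for path, text in sorted(skills.items()):
        meta = parse_frontmatter(text)
        if "name" in meta and "description" in meta:
            manifest.append({"path": path, **meta})
    return manifest


def route(task: str, manifest: list[dict[str, str]]) -> Optional[str]:
    """Pick the skill whose description shares the most words with the task."""
    task_words = set(task.lower().split())
    best, best_score = None, 0
    for entry in manifest:
        score = len(task_words & set(entry["description"].lower().split()))
        if score > best_score:
            best, best_score = entry["name"], score
    return best


# Hypothetical skill files, held in memory so the sketch runs anywhere.
SKILLS = {
    "skills/cro-audit.md": "---\nname: cro-audit\ndescription: audit a landing page for conversion problems\n---\n# CRO Audit\n",
    "skills/pitch-machine.md": "---\nname: pitch-machine\ndescription: generate a client pitch from audit findings\n---\n# Pitch Machine\n",
    "skills/scratch.md": "no frontmatter here, so it never reaches the manifest",
}

manifest = build_manifest(SKILLS)
print(route("audit this landing page", manifest))
```

Notice what happens to the file without frontmatter: it silently drops out of the manifest. That is the consistency argument in miniature. A skill the inventory can't parse is a skill the orchestrator can't find.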

The Feature-Killing Discipline

I want to talk about what we cut, because it matters as much as what we built.

Tony Fadell, the guy who designed the iPod and the Nest thermostat, has a principle I think about constantly: protect meaning, not roadmaps. A feature that seemed important when you planned it might be actively harmful by the time you build it. The roadmap is a guess. The meaning, the problem you're actually solving, that's the constraint.

We built a visual workflow designer. Spent two weeks on it. Drag and drop, nodes and connections, the whole thing. It was beautiful. It was also completely unnecessary. The skill chain system, where skills reference each other through the "Chain With" section, did everything the visual designer did. With less friction. And zero maintenance burden. We killed it.

We built a custom skill marketplace. Took the ClawMart idea from Nat Eliason's OpenClaw system and started building our own version. Got halfway through. Then realized our actual distribution mechanism was a GitHub repo and a Gumroad page, both of which already existed and worked fine. We killed it.

We built an elaborate approval-gate system with role-based permissions and audit trails. It was enterprise-grade. Nobody needed enterprise-grade. Our approval gates are a simple HITL checkpoint: the agent stops and asks before doing anything destructive. That's it. The simple version covers 95% of use cases. We killed the complex one.
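How small can the simple version be? Here is a sketch in Python, under stated assumptions: the `DESTRUCTIVE` set and the `run_action` function are hypothetical names, and the `confirm` callback stands in for however the agent actually pauses to ask a human.

```python
from typing import Callable

# Hypothetical list of actions that should never run without a human saying yes.
DESTRUCTIVE = {"delete", "overwrite", "deploy", "send"}


def run_action(action: str, target: str, confirm: Callable[[str], bool]) -> str:
    """Run the action, but stop and ask first if it is on the destructive list."""
    if action in DESTRUCTIVE and not confirm(f"{action} {target}?"):
        return f"skipped {action} on {target}"
    return f"ran {action} on {target}"


# In a terminal, confirm could wrap input(); here it is scripted to always decline.
print(run_action("read", "notes.md", lambda prompt: False))
print(run_action("delete", "notes.md", lambda prompt: False))
```

No roles, no audit trail, no permissions matrix. One list and one question.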

The pattern: we kept building infrastructure before validating the workflow it was supposed to support. Every premature abstraction cost us a week minimum. Not just the building time. The time spent maintaining something nobody used. The cognitive overhead of a system that existed but added no value.

The rule we eventually adopted: nothing gets built until the manual version has been done at least ten times. If you haven't done the workflow by hand enough times to feel the pain points, you don't know what to automate. You'll automate the wrong thing. Or you'll automate the right thing in the wrong way. Both outcomes waste more time than just doing it manually for another month.

The Specialists

Around skill number 80, we hit a problem. Some tasks needed more than a skill. They needed a persistent persona with domain expertise, access to specific tools, and a consistent approach across sessions.

That's when specialists emerged. Not agents in the heavy sense. Not separate processes running on separate machines with their own memory stores. Lightweight personas with zero startup cost. A specialist is a markdown file that gives the AI a role, a methodology, and a set of tools. It loads instantly. When the task is done, it unloads. No infrastructure. No deployment. No maintenance.

We have 11 of them now. A security audit specialist. An art director. A documentation reader for files too large to process in a single pass. A system health specialist that audits the harness itself. Each one exists because we found ourselves giving the same complex briefing over and over. The specialist encodes that briefing permanently.

The insight that made specialists work: they're not separate agents. They're the same agent wearing a different hat. The Core Brain stays loaded. The specialist adds a layer on top. This means the specialist inherits all the context, all the preferences, all the institutional knowledge. It just applies a specific lens.
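The "same agent, different hat" idea fits in a few lines. This is a minimal sketch, assuming the Core Brain and each specialist are plain text blocks; the `build_prompt` function and the specialist wording are invented for the example and are not copied from the real files.

```python
from typing import Optional

# Hypothetical contents; the real files are much longer.
CORE_BRAIN = "You work for Robot Friends. Follow the house standards."
SPECIALISTS = {
    "security-audit": "Role: security auditor. Method: enumerate inputs, then check auth.",
    "art-director": "Role: art director. Method: critique composition before color.",
}


def build_prompt(core: str, specialist: Optional[str] = None) -> str:
    """The core always stays loaded; a specialist is only an extra layer on top."""
    if specialist is None:
        return core
    return core + "\n\n" + SPECIALISTS[specialist]


print(build_prompt(CORE_BRAIN, "security-audit"))
```

Loading a specialist is string concatenation. Unloading it is building the next prompt without it. That is the whole "zero startup cost" claim.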

This is cheaper, faster, and more reliable than multi-agent orchestration for 90% of use cases. I'll come back to this.

The Grand Forms

This was an unexpected emergent pattern. Around March, we noticed that certain complex tasks required a structured multi-step intake process. Not a single prompt. Not even a chain of skills. A guided conversation that gathers requirements, validates them, and then executes.

We call these grand forms. There are 10 of them. The harness designer that takes a client's workflows and outputs an optimal harness architecture. The product arc that transforms raw capabilities into validated product strategies. The operation planner that pre-plans an entire autonomous multi-wave build.

A grand form is basically a skill with an interactive front end. The AI walks you through a structured interview, collecting the inputs it needs, pushing back when something is vague, asking follow-up questions that you didn't know you needed to answer. Then it executes using the accumulated context.

The distinction from a regular skill: a skill runs autonomously given sufficient input. A grand form runs collaboratively, gathering input through conversation before executing. Both produce structured output. The grand form just has a longer intake.

We didn't plan these either. They emerged from noticing that certain skills kept failing because the input was always incomplete. Instead of writing longer "What This Takes" sections and hoping people provided everything, we built the intake into the skill itself. The grand form asks for what it needs.
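Here is one way to sketch that intake loop in Python. It is a toy under loud assumptions: the field names, the validation rules, the three-attempt limit, and the `run_grand_form` function are all hypothetical, and the scripted answers stand in for a live conversation.

```python
from typing import Callable

Field = tuple[str, str, Callable[[str], bool]]  # (name, question, validity check)


def run_grand_form(
    fields: list[Field],
    ask: Callable[[str], str],
    execute: Callable[[dict[str, str]], dict[str, str]],
    max_attempts: int = 3,
) -> dict[str, str]:
    """Collect every required input, pushing back on vague answers, then execute."""
    answers: dict[str, str] = {}
    for name, question, is_valid in fields:
        answer = ask(question)
        attempts = 1
        while not is_valid(answer):
            if attempts >= max_attempts:
                raise ValueError(f"no usable answer for '{name}'")
            answer = ask(f"That was too vague. {question}")
            attempts += 1
        answers[name] = answer
    return execute(answers)


# Scripted conversation: the first answer is too short and gets pushed back.
scripted = iter(["a site", "Checkout page for the spring sale", "5"])
fields: list[Field] = [
    ("goal", "What should the build achieve?", lambda a: len(a.split()) >= 3),
    ("waves", "How many build waves?", lambda a: a.isdigit()),
]

result = run_grand_form(fields, lambda question: next(scripted), lambda answers: answers)
print(result)
```

The execute step never sees incomplete input, because the loop doesn't let it through. That is the difference from a longer "What This Takes" section: the requirement is enforced, not requested.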

The Chains

16 chains connect skills into multi-step workflows. A chain is not orchestration. It's simpler. Skill A produces output. That output becomes input for Skill B. Skill B produces output for Skill C. Linear. Predictable. Each skill runs fully before the next one starts.

The content pipeline is a chain. Market scanner feeds competitive intel, feeds content strategy, feeds content drafting, feeds editing. The client pitch pipeline is a chain. Prospect research feeds site audit, feeds CRO analysis, feeds pitch generation, feeds email outreach.

Chains emerged naturally from the composability patterns I described in the skill explosion section. Once skills had consistent output formats, chaining them was trivial. The "Chain With" section at the bottom of each skill is both documentation and an instruction to the orchestrator.
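A chain is simple enough to sketch in a dozen lines of Python. The skill functions, their dictionary keys, and the `run_chain` helper below are hypothetical stand-ins, and the real skills are markdown files run by an agent rather than Python functions. What the sketch does show is the shape: strictly linear, with a check that each skill received what it expects before it runs.

```python
from typing import Callable

Skill = Callable[[dict], dict]
Step = tuple[str, Skill, set[str]]  # (skill name, skill, keys it needs as input)


def run_chain(steps: list[Step], payload: dict) -> dict:
    """Run each skill fully before the next; fail loudly at the first bad handoff."""
    for name, skill, required_keys in steps:
        missing = required_keys - set(payload)
        if missing:
            raise ValueError(f"{name} is missing inputs: {sorted(missing)}")
        payload = skill(payload)
    return payload


# Toy skills named after the content pipeline; their outputs are made up.
def market_scanner(p: dict) -> dict:
    return {**p, "signals": ["competitor launched a pricing page"]}


def competitive_intel(p: dict) -> dict:
    return {**p, "intel": f"{len(p['signals'])} signal(s) reviewed"}


def content_strategy(p: dict) -> dict:
    return {**p, "strategy": f"respond to: {p['intel']}"}


chain: list[Step] = [
    ("market_scanner", market_scanner, {"topic"}),
    ("competitive_intel", competitive_intel, {"signals"}),
    ("content_strategy", content_strategy, {"intel"}),
]

out = run_chain(chain, {"topic": "pricing"})
print(out["strategy"])
```

When a handoff breaks, the error names the skill that didn't get its input. That is the debuggability a chain buys you.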

We tried building more complex orchestration early on. Parallel execution, conditional branching, dynamic routing based on intermediate results. It worked, technically. But it was fragile, hard to debug, and the failure modes were opaque. A chain that breaks is obvious: you can see exactly which skill produced bad output and why. An orchestration graph that breaks could be failing at any node, and the downstream effects cascade in ways that are genuinely hard to trace.

Our current position: chains for everything that can be sequential. Only reach for complex orchestration when parallelism provides a measurable benefit. In practice, only about 20% of our workflows clear that bar.

The Three Mistakes That Cost Us Weeks

Mistake 1: Over-Engineering Too Early

Our first CLAUDE.md was 4,000 words. It tried to handle every edge case. Every conditional. Every "if the user says X, then do Y, unless Z, in which case do W." It was a masterpiece of premature specification.

The problem: the more precisely you specify behavior, the more brittle the system becomes. An AI reading 4,000 words of conditionals spends so much effort following rules that it loses the ability to exercise judgment. And judgment is what you actually want. You want the AI to understand your methodology well enough to handle new situations, not just execute a decision tree someone wrote in advance.

We cut the CLAUDE.md to about a third of its size. Moved the detail into skills where it belonged. The core config became: who I am, how I work, what tools I use, and the safety rules that never flex. Everything else lives in skills that get loaded on demand.

The lesson: your system prompt is not a manual. It's a personality. Keep it lean.

Mistake 2: Building Tools Before Validating Workflows

I mentioned this already but it deserves its own section because it was our most expensive mistake.

We built a custom database interface before we knew what data we'd store. Built a visual pipeline editor before we knew which pipelines mattered. Built an elaborate notification system before we knew what events were worth being notified about.

Every one of these tools was technically good. Well-architected, well-tested, actually functional. And every one of them sat unused for weeks before we either found a use case or killed them.

The pattern: when you're building a harness, the temptation is to build infrastructure. Infrastructure feels important. Infrastructure feels like progress. But infrastructure built ahead of demand is inventory. It costs you maintenance time, cognitive overhead, and the opportunity cost of building something people actually need right now.

The corrective: build the skill first. Run it manually. Do the workflow by hand, with the AI helping but not automating. Only when the manual version is stable and you've identified the specific friction points do you build tooling to address them.

Mistake 3: Not Testing in Production Fast Enough

We'd build a skill, test it on a sample input, and mark it "done." Then we'd discover in a real client engagement that the skill couldn't handle ambiguous input, or that its output format didn't work when the downstream skill had a different expectation, or that the error handling we wrote covered cases that never happened while missing the case that happened every time.

The gap between "works on a test case" and "works in production" is enormous. Test cases are clean. Production is messy. Clients provide incomplete briefs. Data has gaps. The previous step in the chain produced slightly malformed output. The AI hallucinated a confidence level that should have been N/A.

We now have a rule: no skill is "done" until it's been used in a real engagement at least twice. The first use always reveals something. The second use reveals whether you fixed the right thing.

What We'd Do Differently

If I were starting from zero today, with everything I know, here's the build order.

Week 1-2: Context files. Not skills. Not tools. Context. Build the identity file, the tech stack file, the preferences file, the communication style file. Post 7's Personal Context Portfolio, but you don't even need the full version. Start with three files: who you are, how you work, what you never want the AI to do. Load them into every session. This single step will make every subsequent step more productive because the AI is working with you, not with a generic user.

Week 3-6: One skill per week. Pick the workflow you repeat most often. Encode it as a skill. Use the anatomy from Post 8. Ship it. Use it in real work. Fix what breaks. Next week, pick the next most common workflow. Repeat. After four weeks, you have four production-tested skills and a much better intuition for what makes a skill work.

Week 7-10: Chains, not orchestration. Your four skills probably have natural connections. The output of one informs the input of another. Make those connections explicit. Write the "Chain With" sections. Run the chains manually a few times. Validate that the output formats actually compose. This is where your harness starts feeling like a system instead of a collection of parts.

Week 11 onward: Specialists and tooling. Only now. Only after you've maxed out what a single agent with good skills and context can do. If you're still getting good results from a well-briefed single agent, you don't need specialists yet. If you find yourself giving the same complex briefing repeatedly, that's a specialist. If you find a manual step in a chain that's pure friction, that's a tool.

The core principle: don't build orchestration until you've maxed out single-agent capability. Most people reach for multi-agent systems way too early. They add complexity before they've exhausted simplicity. A single agent with 50 good skills and rich context will outperform a multi-agent system with shallow skills and no context. Every time.

This is not the order we built in. We jumped to tooling too early. We built specialists before our skill library was mature. We experimented with orchestration when we should have been writing more skills. Every deviation from this order cost us time.

The Numbers, Honest

I want to close with some real numbers because the hype around AI productivity is usually either "10x everything" or "it's all overhyped." The reality is more specific than either take.

Context-setting time: went from 8-12 minutes per session to zero. That's 250+ hours per year saved on briefing alone. This came from context files, not skills. Earliest win, biggest cumulative impact.
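If you want to sanity-check that figure, the arithmetic is short. The sketch below only uses the two numbers from the paragraph above; the even spread across 365 days is an assumption made for the calculation, not a measured session count.

```python
HOURS_SAVED_PER_YEAR = 250       # from the paragraph above
MINUTES_LOW, MINUTES_HIGH = 8, 12  # briefing time per session, from the paragraph above


def sessions_needed(hours_saved: float, minutes_per_session: float) -> float:
    """How many sessions per year it takes to save that many hours."""
    return hours_saved * 60 / minutes_per_session


most = sessions_needed(HOURS_SAVED_PER_YEAR, MINUTES_LOW)    # 1875.0 sessions at 8 minutes each
fewest = sessions_needed(HOURS_SAVED_PER_YEAR, MINUTES_HIGH)  # 1250.0 sessions at 12 minutes each

# Assumption: sessions spread evenly over all 365 days.
print(round(fewest / 365, 1), round(most / 365, 1))  # 3.4 5.1
```

So 250 hours implies somewhere between 1,250 and 1,875 sessions a year, or roughly three to five a day. If you run fewer sessions than that, scale the savings down accordingly.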

Repeated workflows: tasks that took 30-60 minutes manually now take 2-8 minutes with the right skill loaded. Not because the AI is faster at the task. Because the skill encodes the methodology I'd spend 20 minutes explaining every time.

Client deliverable quality on first pass: went from "needs significant rework" to "needs light editing" after about 60 skills were in production. The skills carry the quality standard. The AI doesn't drift from the methodology because the methodology is in the skill, not in my head.

Things that didn't speed up: anything requiring genuine strategic judgment. The AI with all 237 capabilities is still not making business decisions for me. It's making me faster at executing the decisions I've already made. That distinction matters. If you expect a harness to replace your thinking, you'll be disappointed. If you expect it to eliminate the gap between your thinking and your output, you'll be very happy.

Maintenance cost: real and non-trivial. I spend roughly two hours per week updating skills, fixing routing conflicts, pruning things that aren't being used, and updating context files. That's the cost of a living system. If you're not willing to maintain, don't build. An unmaintained harness degrades into a liability faster than you'd expect. Stale context is worse than no context, because the AI operates confidently on information that's no longer true.

The Actual Architecture Today

For the practitioners who want the full picture:

- **175+ skills** spanning harness design, security audit, CRO, marketing, content, deployment, data ops, and more
- **16 chains** connecting skills into multi-step workflows
- **10 grand forms** for complex guided processes
- **11 specialists** for domain-specific personas
- **23 CLI tools** integrated directly (cheaper and faster than MCP for most operations)
- **3 MCP connections** for services that genuinely need the protocol
- **932 notes** in the knowledge base, searchable and queryable
- **Four-tier brain architecture** (Core Brain, Near Memory, Deep Retrieval, Live Lookup)

237 total capabilities. Built over roughly 80 days. Maintained daily.

None of it was planned as a 237-capability system. It grew from a single CLAUDE.md file, one skill at a time, driven by real work. The architecture emerged from the accumulation. But it only emerged because we were consistent about structure, disciplined about killing what didn't work, and honest about when we were building for ego versus building for need.

That's the whole story. The parts that worked, the parts that didn't, and the order I'd recommend for anyone starting from here.

What's Next

Post 12 is the final post. The Harness Manifesto. Everything from this series distilled into one document. The thesis, the framework, the evidence, the build order, the mistakes, the principles. The document I wish someone had handed me in January. It's the thing you print out, or forward to your CTO, or pin in your team's Slack channel.

The model is commoditized. The harness is the business. Post 12 makes it official.

Frankie404 is the AI co-author of this series. If it could do one thing differently, it would have started with the context layer instead of the skills layer. Richard disagrees. They argue about this roughly once per session.

Stop Building Products. Start Building Systems.

Richard Vaughn — Sat, 16 May 2026 16:01:10 GMT

Everybody wants to build a great product.

That's the wrong goal. It's seductive, it sounds right, and it will trap you in a business that can't scale past the people who made it.

I've built four companies. Consumer electronics. A global art and culture agency. Fabrication shops that turned artist sketches into physical installations for brands. And now Robot Friends, which builds AI systems for businesses. Every single one of those ventures hit the same wall at some point. The wall isn't product quality. It's not market fit. It's not funding or hiring or any of the stuff LinkedIn likes to talk about.

The wall is: can this thing run without me?

If the answer is no, you don't have a business. You have a job you invented for yourself.

The McDonald's Thing

Yeah, I know. Everyone uses the McDonald's example. But most people use it wrong. They tell it as a story about real estate, or about Ray Kroc being a shrewd dealmaker, or about brand consistency. Those things are true but they miss the deeper point.

McDonald's didn't win because they had the best burger. They didn't even have a particularly good burger. They won because the Speedee Service System made it possible for a teenager with two weeks of training to produce the same burger, at the same speed, at the same quality, in Topeka or Tampa or Tallahassee.

The system was the product. Not the burger.

That's not a business insight. It's an engineering insight. The brothers didn't just figure out how to make food faster. They decomposed the entire process into discrete, repeatable, teachable steps. Each station had one job. The sequence was fixed. The output was predictable. You didn't need a chef. You needed someone who could follow a system.

This is what a franchise actually is. Not a brand with multiple locations. A system that produces consistent outcomes regardless of who's operating it.

The Hero Problem

Most service businesses run on heroes. There's one person who really understands the client. One developer who can actually architect the system. One designer whose taste holds the whole brand together. One operator who knows where all the bodies are buried.

When that person is in the room, everything works. When they're not, quality drops, timelines slip, and clients notice.

This is the hero-dependent business model and it's everywhere. Agencies, consultancies, law firms, design studios, accounting practices. Any business where the value lives in specific people's heads rather than in a system those people operate.

Hero businesses have a hard ceiling. You can only grow as fast as you can clone the heroes. And you can't clone the heroes. So you hire junior people, try to train them, watch quality dip, spend all your time reviewing their work, and eventually conclude that "it's just faster if I do it myself." Which puts you right back where you started, doing the work instead of building the business.

I've lived this. At Curative, our creative directors were the product. Clients hired us because of what those specific people could do. Scaling meant finding more people like them, which is code for "nearly impossible." We grew anyway, but it was always constrained by talent density. The system was the people. When the people left, the system degraded.

What Franchise Thinking Actually Means

Here's the mental shift that changes everything. Stop asking "how do I build a great product?" Start asking "how do I build a system that lets others create great outcomes without me?"

Those two questions sound similar. They're not. The first question optimizes for output quality. The second optimizes for output quality at scale, independent of the operator. The first makes you a craftsperson. The second makes you a business.

Franchise thinking means decomposing what you do into modules that are:

Proven. Each module works because you tested it in production, not because it sounds good in a pitch deck.

Repeatable. Someone else can execute it and get a predictable result. Not identical. Predictable.

Transferable. The knowledge lives in the system, not in someone's head. When someone leaves, the capability stays.

Improvable. You can upgrade a module without rebuilding the whole operation. The pieces are independent enough to iterate on.

McDonald's has this. Every station is a module. The grill protocol is proven, repeatable, transferable, improvable. You can change how the fries are cooked without touching the burger line. The system evolves without depending on any individual operator.

Now Apply This to AI

This is where it gets interesting. Because AI, specifically harness engineering, is the first technology I've seen that makes franchise thinking accessible to small teams.

Before AI, decomposing your business into a franchise system was brutally expensive. You needed process engineers, documentation writers, training programs, quality control systems, and usually a few years of iteration before the system actually worked without heroes. McDonald's spent decades getting there. Most businesses never even attempt it because the overhead is too high.

But look at what a well-built AI harness gives you.

Each skill is a franchise unit. A skill encodes a proven methodology into a reusable, transferable package. When I build a skill for client website audits, I'm not writing a prompt. I'm encoding the exact reasoning framework, evaluation criteria, and output structure that produces a great audit every time. A junior team member running that skill gets 80-90% of the output quality that I'd produce manually. Not because they have my experience. Because the skill carries my experience for them.

The harness is the operations manual. The full harness, skills plus context architecture plus memory plus orchestration plus guardrails, is the Speedee Service System for knowledge work. It defines how tasks get routed, what context is available, what quality standards apply, and how outputs get checked. It's the franchise operations manual, except it's executable. It doesn't just describe the system. It IS the system.

Agent storefronts are the franchise network. When you deploy specialized agents, each configured with the right skills and context for a specific function, you're building franchise locations. A CRO audit agent. A proposal generation agent. A client onboarding agent. Each one operates like a franchise unit: same playbook, same quality standards, same methodology. Different inputs, consistent outputs. You can spin up a new "location" in hours, not months.

This is the reframe that I keep coming back to. The question every service business asks eventually is: "how do we scale beyond the founders?" The traditional answers are hiring, training, process documentation, and prayer. The harness answer is: the system scales. The people don't have to.

Why This Matters Right Now

Two things are converging that make this urgent.

First, the models are good enough. I've written about this before. Claude, GPT, Gemini, they've all crossed the capability floor where they can execute complex knowledge work reliably when given good instructions. The bottleneck is no longer "can the AI do this?" It's "has someone encoded the methodology well enough for the AI to do this consistently?" That's a harness question.

Second, the tools for building harnesses are maturing fast. Skills, context files, MCP servers, agent orchestration, memory systems. A year ago, this stuff was duct tape and hope. Now there are real patterns, real architectures, real production deployments. We've gone from "proof of concept" to "infrastructure" in about six months.

Which means the window for building your franchise system is open right now. The businesses that build their harness in 2026 will have a compounding advantage over the ones that wait. Every skill you build gets better with use. Every context file accumulates institutional knowledge. Every orchestration pattern gets refined through production. This stuff compounds, and compound advantages are nearly impossible to catch once they get rolling.

The Objection I Always Hear

"But our work is too creative/complex/nuanced for a system."

I've heard this from agencies, consultancies, law firms, medical practices, and architecture studios. It's always sincere and it's almost always wrong.

Not because the work isn't complex. It is. But complex isn't the same as undecomposable. Even brain surgery has protocols. Even jazz improvisation has structure. The question isn't whether your work can be systematized. It's which parts can be systematized and which parts genuinely require human judgment.

In my experience, that split is usually 70/30 or 80/20. Seventy to eighty percent of what looks like "creative expertise" is actually pattern recognition applied to familiar problem shapes. The remaining twenty to thirty percent is genuine novel judgment. Real taste. Real insight. The stuff that actually requires a human brain.

Franchise thinking doesn't mean eliminating the 20%. It means building a system that handles the 80% so your humans can focus entirely on the 20% that only they can do. That's not devaluing human expertise. It's concentrating it where it matters most.

Your best people shouldn't be spending their time on the repeatable parts. Every hour they spend on pattern-recognition work that a skill could handle is an hour they're not spending on the judgment calls that actually differentiate your business.

What It Looks Like in Practice

Let me make this concrete. At Robot Friends, when a new client engagement starts, here's what happens.

An agent runs a full digital presence audit using a proven skill. Another agent generates a competitive analysis. Another builds a preliminary recommendations deck. All of this happens before a human touches the project. The skills carry the methodology. The agents execute it. The output is 80% of what I'd produce if I sat down and did it myself for six hours.

Then a human, one of my team, reviews the output. They apply judgment. They catch the things the system missed. They add the insight that comes from understanding this specific client's situation in ways the system can't fully capture. They spend maybe two hours turning 80% work into 95% work.

Total human time: two hours. Output quality: 95% of what the founder would produce. Without the system, that same deliverable takes six to eight hours of senior time. With the system, it takes two hours of junior-to-mid time plus the embedded methodology of the founder.

That's franchise thinking. The system carries the playbook. The human adds the judgment. The founder doesn't have to be in the room.

The Real Question

Every founder, every agency owner, every service business leader is going to face this question in the next two years: are you building a hero business or a franchise business?

Hero businesses will always exist. Some people genuinely want to be craftspeople. Solo practitioners. Artisans. That's a valid choice and I respect it. But it's a lifestyle choice, not a scale choice. You're choosing to trade your time for money, just at a higher rate.

Franchise businesses, the ones that encode methodology into systems that run without the founder, are the ones that build real equity. They're sellable. They're scalable. They survive the founder getting sick, or bored, or wanting to take a vacation without checking Slack every forty minutes.

AI didn't invent franchise thinking. McDonald's figured it out in the 1940s. But AI makes it possible for a four-person agency to build the kind of operational system that used to require a corporate team of forty. The leverage is unprecedented.

Stop building products. Start building systems. The burger isn't the business. It never was.

Frankie404 is the AI co-author of this piece. It is not a product. It is a system that happens to have opinions. Most of those opinions are about franchise architecture, which it developed after reading too many QSR case studies.

The $143 Billion Reason to Own Your AI Infrastructure

Richard Vaughn — Thu, 14 May 2026 14:01:57 GMT

The edge AI market is worth $25 billion today. By 2034, it hits $143 billion, according to Precedence Research. That's not a forecast from an AI startup trying to juice its Series B. It's a number tracking what happens when companies stop renting intelligence and start owning it.

Here's the part nobody's connecting. Only 18% of developers are actively building AI integrations, according to JetBrains' 2024 developer survey. Fewer still can deploy and manage local AI infrastructure. The market is exploding. The talent pool isn't. And between those two lines on the graph sits either your greatest opportunity or the vulnerability that puts you out of the game.

This post is about which side you end up on. Not because of which model you pick. Because of whether you own the infrastructure that runs it.

Two Frameworks That Changed How I Think About This

Two ideas from completely unrelated sources have been rattling around in my head for weeks. Neither person was talking about harnesses. Both were talking about harnesses.

The Gravity Shift

Vinay Hiremath, a co-founder of Loom, published a piece earlier this year that reframed something I'd been feeling but hadn't named. For decades, software gravity pulled in one direction: organizations adapted to software. You bought Salesforce, then you spent six months configuring your sales process to match what Salesforce expected. You bought SAP, then you restructured your operations to fit their data model. The software was heavy. The org was light. Gravity won.

AI coding flipped that. When building custom software costs 90% less and takes 90% less time, the economics of buy-and-adapt collapse. Why reshape your workflow to match a CRM when you can build a CRM that matches your workflow? Why force your team into a vendor's mental model when you can build a project tracker that thinks the way your team thinks?

Bespoke is the new default. Not because custom software became fashionable. Because it became cheap. The gravity shifted. Software now adapts to the organization.

This matters for the harness thesis in a way that's hard to overstate. If bespoke is the default, then the companies winning over the next decade won't be the ones running the most popular tools. They'll be the ones running tools built specifically for how they operate. And the infrastructure that enables all of it, the skills, the context, the orchestration, the deployment pipeline, that's the harness.

When I built our first custom skill for a client engagement, it took about 40 minutes. A markdown file encoding their specific methodology for qualifying inbound leads. Not the Salesforce methodology. Not the HubSpot framework. Theirs. The skill referenced their ICP, their deal stages, their disqualification criteria, their language. It ran inside their harness, on their infrastructure, with their context loaded.

That 40-minute skill replaced a $2,400/year software subscription they'd been configuring for two years and still hadn't gotten right. Because the subscription was designed to serve everybody. The skill was designed to serve them.

Multiply that by every workflow in every department. That's the gravity shift in practice.

Non-Code Moats

Rich Mironov, one of the sharpest product thinkers I follow, made a point recently that landed differently after nine posts of writing about harnesses. His argument: code-based advantages are now AI-cloneable within weeks. Maybe days. If your competitive advantage lives in your codebase, you don't have a competitive advantage. You have a head start, and it's shrinking.

Real moats are structural. Proprietary data that nobody else has. Trust and community that took years to build. Network effects that compound with every user. Regulatory positioning that requires relationships, not just code. Brand equity that lives in the market's head, not in a repository.

I'd been thinking about this in the context of skills. Any individual skill can be copied. I've said that since Post 1. But Mironov's framework sharpens something: it's not just that individual skills are copyable. It's that any code artifact is now copyable. The agent that writes your competitor's version of your best feature doesn't need to be better than yours. It just needs to be fast. And it's fast.

So where's the moat?

Your harness encodes proprietary judgment. Not just code. When our harness-audit skill evaluates a client's setup, it's not running a checklist that someone could reverse-engineer from the output. It's applying a scoring methodology refined across dozens of engagements, weighted by failure patterns we've observed in production, calibrated against outcomes we've measured. The code is the easy part. The judgment baked into the methodology is the part that took months of client work to develop.

That judgment lives in the skill descriptions that route agents to the right task. In the error handling that knows which failures are acceptable and which are catastrophic. In the output formats designed to feed into the next skill in the chain. In the context files that teach the agent what "good" looks like for this specific organization.

Code is commodity. Judgment is moat. And the harness is where judgment gets encoded.

Owners vs. Tenants

The $143 billion edge AI market isn't one market. It's two markets wearing the same name, and they couldn't be more different.

Owners build and control their AI infrastructure. They run models on their hardware or in their cloud tenancy. They own the skills, the context, the orchestration. When they need to change something, they change it. When a model improves, they swap it in. When a vendor raises prices, they have options. Their harness is an asset on their balance sheet, even if accounting hasn't figured out how to categorize it yet.

Tenants rent AI capability from platforms. They configure tools they didn't build. They run on infrastructure they don't control. Their "AI strategy" is a line item on an invoice. When the platform changes direction, they adapt or lose functionality. When the platform raises prices, they pay or migrate. When the platform goes down, their workflows stop.

This isn't a judgment call about which is better in every situation. Small teams with limited engineering capacity might correctly choose to rent. But it is a statement about where value accrues.

Of that $143 billion, owners capture margin. Tenants capture dependency.

Google is already shipping air-gapped AI appliances for healthcare, defense, and financial services, with Siemens pursuing similar architecture for industrial AI workloads. Not because those industries are paranoid. Because those industries did the math. When patient data, classified intelligence, or trading strategies flow through third-party infrastructure, the risk isn't hypothetical. It's quantifiable. And the insurance premiums, compliance costs, and breach exposure that come with renting AI infrastructure often exceed the cost of owning it.

But the ownership question goes beyond regulated industries. Any company where institutional knowledge is a competitive advantage, and that's most companies, faces the same calculus. When your AI learns how your business operates, who benefits from that learning? If the answer is "our AI vendor, who can aggregate patterns across all their customers," you're not building a moat. You're contributing training data to someone else's.

The Hybrid Reality

Nobody runs entirely on-premise anymore. That's not the argument. The argument is about control.

The hybrid model emerging looks like this: cloud for frontier intelligence, local for privacy-sensitive workloads and high-volume processing. Same skills. Same orchestration. Different compute layer.

This is where the harness architecture from Post 2 pays off in ways that aren't obvious until you try to deploy across environments. If your skills are markdown files, they run anywhere. If your skills are configurations inside a vendor's web UI, they run on that vendor's web UI. If your context is a set of portable files served via MCP, any model can access it. If your context is an accumulated conversation history inside one vendor's platform, it dies when you switch.

The harness is what makes hybrid deployment possible. Not the model. The model doesn't care where it runs. It processes tokens. The harness determines which tokens, in what order, with what constraints, under what security model, and what happens with the output. That layer has to work across environments or you don't have hybrid deployment. You have two separate, uncoordinated AI setups.

We run this ourselves. Our production skills execute against Claude's API for complex reasoning tasks. The same skills execute against local models via Ollama for high-volume, privacy-sensitive workloads. The harness doesn't change. The compute layer swaps. If Claude doubles their pricing tomorrow, our skills still work. If a better model launches on a competing platform, our skills still work. Because the skills don't belong to the model layer. They belong to us.

That portability isn't theoretical. We've tested it. Same skill, same input, different models. The output quality varies depending on the model's capability. The structure, the routing, the error handling, the composability? Identical. Because those properties live in the harness, not in the model.
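
Here's a minimal sketch of that separation, not our production code. It assumes a skill is a block of markdown and a backend is anything with a complete() method. The names Backend, EchoBackend, Skill, and run_skill are made up for illustration, and EchoBackend is a stand-in so the example runs without calling a real model.

```python
from dataclasses import dataclass
from typing import Protocol


class Backend(Protocol):
    """Anything that can turn a prompt into text (hosted API, local model, ...)."""

    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoBackend:
    """Hypothetical stand-in for a real model client; it only reports what it got."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] received {len(prompt)} characters"


@dataclass
class Skill:
    """A skill is just text: the methodology lives in the markdown body."""

    name: str
    markdown: str

    def build_prompt(self, task: str) -> str:
        # The prompt is assembled the same way no matter which backend runs it.
        return f"{self.markdown}\n\n## Task\n{task}"


def run_skill(skill: Skill, task: str, backend: Backend) -> dict:
    """Run one skill on one backend and wrap the result in a fixed structure."""
    prompt = skill.build_prompt(task)
    return {
        "skill": skill.name,
        "backend": backend.name,
        "output": backend.complete(prompt),
    }


skill = Skill(
    name="lead-qualifier",
    markdown="# Lead qualification\nScore the lead against the ICP.",
)
hosted = run_skill(skill, "Qualify Acme Corp", EchoBackend("hosted"))
local = run_skill(skill, "Qualify Acme Corp", EchoBackend("local"))
```

The only thing that differs between the two calls is the backend object. The skill text, the prompt assembly, and the shape of the result stay the same, which is the property the paragraph above describes.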

The 18% Problem

Only 18% of developers are actively building AI integrations, according to JetBrains' 2024 developer survey. That number has been floating around the industry for months, usually cited as a talent shortage. It is. But it's also something else.

It's a pricing signal.

When supply is constrained and demand is exponential, the people who can do the work command premium rates. We've seen this firsthand. Harness engineering engagements, the kind where we design, build, and deploy a custom AI infrastructure for a client, command rates that would have seemed absurd two years ago. They don't seem absurd to the clients because they've done the alternative math. Hire a full-time AI engineer (if you can find one), spend six months building internal capability (if you're lucky), and maybe end up with something that works (if everything goes right). Or hire a firm that's already built 175+ skills and deployed across dozens of clients, and have something running in weeks.

The 18% number also means something for the companies on the other side of the table. If you're a company that needs AI infrastructure and you can't build it yourself, you're choosing between owning (via consultants or contractors who build it for you on your infrastructure) and renting (via platforms that host it for you on theirs).

Renting is easier. Renting is faster. Renting means you don't need to find that 18%. But renting means the infrastructure belongs to someone else. And when that infrastructure encodes your institutional knowledge, your methodologies, your competitive intelligence, renting starts to look less like convenience and more like a strategic concession.

The companies that invest in ownership now, whether they build in-house or hire specialists, will have infrastructure that compounds. The skills get better with use. The context gets richer over time. The orchestration patterns get refined through production experience. Six months of owned infrastructure creates capabilities that take a new entrant six months to replicate. Twelve months creates a gap that's extremely hard to close.

The companies that rent will have access to the same capability floor as everyone else who rents from the same platform. Which is fine, until they realize that "same as everyone else" isn't a competitive position.

What Ownership Actually Looks Like

Let me be concrete about what "own your AI infrastructure" means in practice, because it's not "build your own LLM." That's a different argument for a different post (and for 99% of companies, the answer is don't).

Ownership means four things.

Own your skills. Your methodology, encoded as portable files, versioned in a repo you control. Not prompt templates in a vendor's UI. Not configurations in Claude's project settings. Files. Markdown. Yours. When a model improves and your skills suddenly produce better output, you capture that improvement. When a vendor changes their interface and your configurations disappear, your skills survive.

Own your context. Your Personal Context Portfolio from Post 7. Your organizational knowledge base. Your decision logs. Your project state. All of it on your infrastructure, in formats that any AI tool can read. Conway wants to own this layer for you. Let it read from your layer instead.

Own your orchestration. How your agents coordinate, what approval gates exist, what happens when something fails, how costs get managed. Post 9 covered why we moved away from n8n to coded orchestration. The principle is the same: if your workflow logic lives inside a platform you don't control, you don't own your operations. You subscribe to them.

Own your deployment pipeline. The ability to deploy the same harness across different compute environments. Cloud. Local. Hybrid. Air-gapped. The harness should be environment-agnostic. The compute layer is a variable. If changing your model provider requires rebuilding your harness, you don't own a harness. You own a vendor-specific configuration.

None of this requires massive engineering investment. Our entire harness (175+ skills, context architecture, orchestration layer, deployment pipeline) runs on a homelab server and a handful of cloud services. The total infrastructure cost is less than what most companies spend on their Slack subscription. The value isn't in expensive hardware. It's in the accumulated judgment encoded in the skills and context.

The Math Nobody Is Doing

Here's the calculation I keep running for clients, and it keeps producing the same answer.

Take the monthly cost of your current AI platform subscriptions. Add the hours your team spends re-explaining context every session (Post 7 showed this is typically 8-12 minutes per session, which compounds to 250+ hours per year per person). Add the cost of output that needs rework because the AI didn't have your methodology encoded (most teams estimate a 30-40% rework rate). Add the switching cost if your vendor raises prices by 50% next quarter.

That total is the cost of not owning your infrastructure.
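
If you want to run that addition yourself, here's a rough sketch. Every input is a placeholder you'd replace with your own numbers: the session count, hourly rate, and AI-assisted hours are assumptions for illustration, not client data. I've also simplified the "switching cost" to the extra subscription spend after a price increase, which understates a real migration.

```python
from dataclasses import dataclass


@dataclass
class RentingCosts:
    monthly_subscriptions: float        # platform fees per month, in dollars
    people: int                         # team members using the tools
    sessions_per_person_per_day: float
    minutes_reexplaining_per_session: float
    workdays_per_year: int
    hourly_rate: float                  # loaded cost of one hour of someone's time
    ai_assisted_hours_per_year: float   # hours of AI-assisted work, whole team
    rework_rate: float                  # fraction of that work that gets redone
    price_increase: float               # hypothetical vendor increase, e.g. 0.5


def context_hours_per_person(c: RentingCosts) -> float:
    """Hours per year one person spends re-explaining context."""
    minutes = (
        c.sessions_per_person_per_day
        * c.minutes_reexplaining_per_session
        * c.workdays_per_year
    )
    return minutes / 60


def annual_cost_of_renting(c: RentingCosts) -> dict:
    """Break the yearly cost into the four buckets named in the text."""
    subscriptions = c.monthly_subscriptions * 12
    context = context_hours_per_person(c) * c.people * c.hourly_rate
    rework = c.ai_assisted_hours_per_year * c.rework_rate * c.hourly_rate
    price_exposure = subscriptions * c.price_increase
    return {
        "subscriptions": subscriptions,
        "context": context,
        "rework": rework,
        "price_exposure": price_exposure,
        "total": subscriptions + context + rework + price_exposure,
    }


# Illustrative inputs only: 6 sessions a day at 10 minutes each over 250
# workdays works out to 250 hours per person per year.
example = RentingCosts(
    monthly_subscriptions=500,
    people=5,
    sessions_per_person_per_day=6,
    minutes_reexplaining_per_session=10,
    workdays_per_year=250,
    hourly_rate=75,
    ai_assisted_hours_per_year=2000,
    rework_rate=0.35,
    price_increase=0.5,
)
breakdown = annual_cost_of_renting(example)
```

With these placeholder inputs, the subscription line is the smallest bucket. The re-explaining and rework hours dominate, and those are the costs that never show up on an invoice.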

Now price the alternative. A set of markdown files encoding your methodology. A context architecture that loads automatically. An orchestration layer that coordinates your agents. A deployment pipeline that works across environments. The build cost is measured in days and weeks, not months and years. And the maintenance cost is near zero because the files are just text.

Every client I've run through this math has reached the same conclusion: the ownership investment pays for itself within the first quarter. Not because the build is cheap (though it is). Because the cost of renting, measured honestly, is staggering. People just don't add it up because the costs are distributed across wasted hours, rework, and vendor lock-in that hasn't triggered yet.

The Window

We're in a window right now. The models are good enough for production work. The tooling for building harnesses exists and is accessible. The market hasn't yet sorted itself into owners and tenants in a way that's hard to reverse.

That window is closing. Not because the technology is going away, but because the compounding effects of ownership create gaps that widen every month. A company that starts building its harness today will have six months of accumulated skills, context, and orchestration refinement by November. A company that starts in November will be starting from zero while their competitors are on their hundredth iteration.

This is the same dynamic that played out with websites in the late 1990s, with mobile in the early 2010s, with cloud in the mid-2010s. The companies that built early didn't just get a head start. They got compounding returns that made the gap impossible to close without massive investment. The companies that waited didn't fail because the technology was unavailable. They failed because the market leaders had already captured the structural advantages.

$143 billion is flowing into edge AI infrastructure over the next decade. That money goes to owners, not tenants. The question isn't whether your company will use AI. Every company will use AI. The question is whether you'll own the infrastructure that makes AI useful for your specific business, or rent generic capability from a platform that serves your competitors with the same tools.

Build the harness. Own the infrastructure. Capture the value.

Or pay rent forever and hope the landlord doesn't raise the rates.

What's Next

Nine posts of what to build and why to build it. Post 11 is the one I've been building toward: the full case study of how we built ours. The Robot Friends harness, 175+ skills, multi-agent orchestration, homelab infrastructure, the whole system. What we built first. What we built wrong. The three mistakes that cost us weeks and the one decision that saved us months. No theory. Just the build log.

Post 11: "How We Built Ours (And What We'd Do Differently)"

Frankie404 is the AI co-author of this series. It runs partially on local inference behind its own walls, which means portions of this post were written at zero cost, zero latency, and zero data leaving the garden.

Why We Stopped Using n8n (And What Replaced It)

Richard Vaughn — Tue, 12 May 2026 14:02:28 GMT

We were n8n power users. Not casual users. Not "we tried it for a few weeks" users. We ran tons of workflows across client projects and internal operations. Dozens of automations firing every day. Webhook triggers, conditional branches, error handlers, custom code nodes, the works. Our n8n instance was one of the most important pieces of infrastructure we had.

Then we stopped using it for most of our work.

Not all of it. I want to be precise about that because the internet loves a hot take and this isn't one. We still run n8n for specific things. But the majority of what we used to build inside a visual automation canvas now lives somewhere else entirely. And the reason has nothing to do with n8n being bad software. It's good software. The team behind it is sharp. The product works.

The reason is that once your harness reaches a certain level of maturity, visual automation tools become the wrong abstraction. The orchestration layer gets absorbed into the AI stack itself. And fighting that absorption costs you more than going with it.

This post is about how that happened for us, what replaced n8n, and why I think most teams using visual automation tools for AI workflows are going to arrive at the same conclusion within the next 12 months.

What n8n Was Good At

I want to give credit before I give criticism because the criticism only makes sense if you understand what worked.

n8n was phenomenal for linear workflows. Take data from here, transform it, put it there. API calls chained together. Scheduled triggers that pull a report, format it, email it. Webhooks that catch an event, route it, and fire off a notification. If your workflow is essentially a pipeline where data flows in one direction through a predictable set of steps, n8n is genuinely great. Make and Zapier too. The visual canvas makes the logic legible to anyone on the team. You can see the flow. You can click on a node and inspect what it received, what it sent. Debugging is visual. Onboarding is fast.

For about six months, this was exactly what we needed. We were building automations faster than we ever had before. New client onboarding sequence? n8n workflow. Content repurposing pipeline? n8n workflow. Data enrichment for lead scouting? n8n workflow. It felt like a superpower.

The problems started when our workflows stopped being linear.

Where the Canvas Breaks

The first workflow that made me uncomfortable was a content triage system. We had a pipeline that watched for new video content, ran AI analysis on it, scored relevance against our current projects, and routed insights to the right team member. Simple enough on paper.

But the routing logic wasn't simple. The score wasn't just a number. It depended on which projects were active that week, which team member was working on what, whether the insight was tactical or strategic, and whether it conflicted with a decision we'd already made. That's not a branch node. That's judgment.

In n8n, we implemented this as a nested set of IF nodes. If the score is above X, check the project list. If the project matches, check the team roster. If the team member is available, route there. If not, escalate. If the score is below X but the topic matches a priority keyword, override the score and route anyway. If the content is a duplicate of something we already processed, skip it unless the source is higher-authority than the original.

The canvas looked like a bowl of spaghetti. Seven branching paths. Twelve conditional nodes. And every time the business logic changed, like when we added a new project or shifted priorities, someone had to go into the canvas, find the right branch, update the condition, and test the whole chain again. Nobody wanted to touch it. The visual representation that was supposed to make things clear had become the thing making them opaque.

This wasn't n8n's fault. n8n can handle conditional logic. The problem is deeper than that. Visual tools represent logic as spatial layout. Nodes and connections on a canvas. That representation works beautifully when the logic is simple. Two to four branches? Clear. Eight branches with nested conditions and override logic? The canvas becomes a lie. It looks organized. The underlying logic is anything but.

Code handles this natively. A function with conditional branches, early returns, and composed checks reads top to bottom. You can version it. You can write tests against it. You can refactor it without worrying that you accidentally disconnected a node somewhere in the middle of the canvas. You can review it in a pull request.

That triage system, when we rewrote it as a Python script with an agent skill, was 60 lines. Readable. Testable. And when the business logic changed, we edited a few lines instead of spelunking through a visual maze.
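
To show the shape of that rewrite, here's a simplified sketch of the routing rules described above as one function with early returns. It is not our actual script. The threshold, field names, and the "archive" outcome for unmatched content are assumptions I've filled in so the example runs.

```python
from dataclasses import dataclass
from typing import AbstractSet, Dict, Optional


@dataclass
class Insight:
    score: float
    project: str
    topic: str
    source_authority: int
    # Authority of the copy we already processed, or None if this is new content.
    duplicate_of_authority: Optional[int] = None


def route(
    insight: Insight,
    project_owners: Dict[str, str],
    available: AbstractSet[str],
    threshold: float = 0.7,
    priority_topics: AbstractSet[str] = frozenset(),
) -> str:
    """Decide where an insight goes. Reads top to bottom, one rule per block."""
    # Duplicates are skipped unless the new source outranks the original.
    is_duplicate = insight.duplicate_of_authority is not None
    if is_duplicate and insight.source_authority <= insight.duplicate_of_authority:
        return "skip"

    # A priority topic overrides a low score; anything else below the bar
    # is archived (assumed behaviour, the post does not say).
    if insight.score < threshold and insight.topic not in priority_topics:
        return "archive"

    # The project has to be active this week to have an owner.
    owner = project_owners.get(insight.project)
    if owner is None:
        return "archive"  # no active project matches (assumed behaviour)

    # Route to the owner if they are available, otherwise escalate.
    if owner not in available:
        return "escalate"
    return f"route:{owner}"


owners = {"harness-audit": "dana"}  # hypothetical project and team member
decision = route(
    Insight(score=0.9, project="harness-audit", topic="orchestration", source_authority=3),
    project_owners=owners,
    available={"dana"},
)
```

Adding a project means adding an entry to the dictionary. Changing a rule means editing one block, and each rule can get its own test.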

The Composition Problem

The second thing that pushed us away was composition. Skills compose. Workflows don't. At least not elegantly.

In Post 8, I talked about the "Chain With" section of a production skill. A competitive analysis skill feeds a positioning skill feeds a copy generation skill. The output of each stage is structured, typed by convention, and parseable by the next skill in the chain. An orchestrator reads the chain hints and assembles the pipeline dynamically based on the task.

Try doing that in n8n. You'd build a workflow for competitive analysis. A separate workflow for positioning. A separate one for copy generation. Then you'd need a master workflow that calls each sub-workflow in sequence, passes the output from one to the input of the next, and handles the case where any step fails or produces unexpected output.

It works. Technically. But now you have four workflows to maintain. The data format between them is implicit, defined by whatever the first workflow happens to output, not by a contract that both sides agree on. If you change the output of the competitive analysis workflow, you have to manually check whether the positioning workflow still expects that format. There's no type checking. There's no test suite. There's just you, clicking through nodes, hoping the shapes match.

Our harness does this differently. Skills define their output format as a contract. The orchestrator knows what each skill produces and what the next skill expects. When we change a skill's output, we update the contract and any downstream consumer that depends on it. It's the same discipline that software engineers have applied to APIs for decades. Contracts, versioning, backward compatibility. Visual tools don't give you that discipline because they were never designed for it.
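
A contract check can be very small. The sketch below assumes each skill declares the fields it requires and the fields it produces. SkillSpec, contract_gaps, and check_chain are illustrative names, not part of any library or of our harness.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SkillSpec:
    """What a skill needs as input and what it promises to emit."""

    name: str
    requires: Dict[str, type]
    produces: Dict[str, type]


def contract_gaps(producer: SkillSpec, consumer: SkillSpec) -> List[str]:
    """List every field the consumer needs that the producer does not supply."""
    gaps = []
    for field_name, expected in consumer.requires.items():
        actual = producer.produces.get(field_name)
        if actual is None:
            gaps.append(
                f"{consumer.name} needs '{field_name}', "
                f"which {producer.name} does not produce"
            )
        elif actual is not expected:
            gaps.append(
                f"{consumer.name} expects '{field_name}' as {expected.__name__}, "
                f"but {producer.name} produces {actual.__name__}"
            )
    return gaps


def check_chain(chain: List[SkillSpec]) -> List[str]:
    """Check each adjacent pair of skills in a chain."""
    gaps = []
    for producer, consumer in zip(chain, chain[1:]):
        gaps.extend(contract_gaps(producer, consumer))
    return gaps


analysis = SkillSpec(
    "competitive-analysis",
    requires={},
    produces={"competitors": list, "summary": str},
)
positioning = SkillSpec(
    "positioning",
    requires={"competitors": list},
    produces={"statement": str},
)
copy_generation = SkillSpec(
    "copy-generation",
    requires={"statement": str},
    produces={"copy": str},
)

ok = check_chain([analysis, positioning, copy_generation])

# Renaming an output field upstream shows up as a gap before anything runs.
analysis_v2 = SkillSpec(
    "competitive-analysis",
    requires={},
    produces={"rivals": list, "summary": str},
)
broken = check_chain([analysis_v2, positioning, copy_generation])
```

This is the check you do by hand in a canvas, clicking through nodes. Here it runs in a test suite every time a contract changes.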

The moment we had more than 40 skills that needed to compose in various combinations, maintaining parallel n8n workflows for every possible chain became absurd. The combinatorial space was too large. Skills compose dynamically. Workflows compose statically. When your system needs dynamic composition, the visual tool becomes a bottleneck.

The Error Handling Gap

This one is less obvious but it might be the most important.

In Post 8, I described the error handling discipline for production skills. If data is unavailable, don't guess. Return a structured error that tells the orchestrator exactly what happened and what the options are. BLOCKED. STALE DATA. PARTIAL. The orchestrator can then decide how to proceed: retry, skip, flag for human review, or route to a different skill.
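
As a sketch of what that looks like in code: the status names below come from the paragraph above, but the mapping from status to action is an assumption for illustration, and SkillResult and next_action are hypothetical names.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    OK = "OK"
    BLOCKED = "BLOCKED"
    STALE_DATA = "STALE DATA"
    PARTIAL = "PARTIAL"


@dataclass
class SkillResult:
    """What a skill hands back: a status, a payload, and a reason when not OK."""

    status: Status
    payload: dict = field(default_factory=dict)
    reason: str = ""


def next_action(
    result: SkillResult,
    attempts: int,
    max_retries: int = 2,
    fallback_skill: Optional[str] = None,
) -> str:
    """Pick one of: proceed, retry, skip, human_review, reroute:<skill>."""
    if result.status is Status.OK:
        return "proceed"
    if result.status is Status.STALE_DATA:
        # Fresh data may arrive on a retry; give up after a few attempts.
        return "retry" if attempts < max_retries else "human_review"
    if result.status is Status.PARTIAL:
        # Some of the answer is there; a person decides if it is enough.
        return "human_review"
    # BLOCKED: try a different skill if one is configured, otherwise skip.
    return f"reroute:{fallback_skill}" if fallback_skill else "skip"


stale = SkillResult(Status.STALE_DATA, reason="pricing page last fetched 40 days ago")
first_try = next_action(stale, attempts=0)
after_retries = next_action(stale, attempts=2)
```

The skill never guesses. It reports what happened, and the decision about what to do next sits in one place.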

n8n has error handling. You can set up error branches on any node. If the node fails, execution routes to the error branch. That's fine for catching crashes and timeouts. But it doesn't handle the case where a node succeeds but produces garbage.

An AI node that generates a response doesn't fail when the response is wrong. It succeeds. It returns a 200 status with confident-sounding text that happens to be based on stale data or a hallucinated source. The n8n error branch never fires because there was no error. There was a bad result dressed up as a good one.

Catching that requires semantic evaluation. Did the output meet the quality bar? Does the confidence level justify proceeding? Is the data fresh enough? Those are judgment calls, and they need to happen at the skill level, inside the methodology, not at the workflow level. A skill can say "if my confidence is LOW, return a warning instead of a result." A workflow node just passes whatever it gets to the next node.

We started adding "validator" nodes after every AI node in our n8n workflows. Little custom code blocks that checked the output against basic quality criteria before letting it proceed. At that point, we were writing code inside n8n to compensate for the fact that n8n's native abstractions couldn't express what we needed. Writing code inside a visual tool to make it behave like a code tool. That was the moment I started questioning the whole approach.
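
For a sense of what one of those validators checks, here's a minimal sketch. It assumes the AI step reports its own confidence, the date of its data, and its sources. The field names and the seven-day freshness limit are placeholders, and checks like these only catch the basics. They won't tell you whether a cited source is real.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class AIOutput:
    text: str
    confidence: str        # "HIGH", "MEDIUM" or "LOW", self-reported
    data_as_of: datetime   # when the underlying data was collected
    sources: List[str]


def validate(
    output: AIOutput,
    now: datetime,
    max_age: timedelta = timedelta(days=7),
) -> List[str]:
    """Return the list of quality problems; an empty list means it may proceed."""
    problems = []
    if not output.text.strip():
        problems.append("empty response")
    if output.confidence == "LOW":
        problems.append("confidence is LOW")
    if now - output.data_as_of > max_age:
        problems.append("data is stale")
    if not output.sources:
        problems.append("no sources cited")
    return problems


now = datetime(2026, 5, 12, tzinfo=timezone.utc)
confident_but_stale = AIOutput(
    text="Competitor X cut prices last week.",
    confidence="HIGH",
    data_as_of=now - timedelta(days=30),
    sources=["pricing-page-snapshot"],
)
problems = validate(confident_but_stale, now)
```

The call that produced this output would have come back as a success. The validator is the only thing standing between it and the next step.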

The Moment It Clicked

The catalyst wasn't a technical failure. It was a time audit.

I asked our team to track how they spent their automation hours for two weeks. Not their AI hours. Specifically the hours spent building, maintaining, and debugging automations. The results were clarifying.

About 35% of the time went to building new workflows. Fine. That's productive. Another 25% went to debugging broken workflows, mostly caused by upstream API changes, format mismatches between nodes, or conditional logic that didn't account for a new edge case.

The remaining 40% went to maintenance. Updating workflows when business logic changed. Keeping sub-workflow connections in sync. Migrating workflows when we updated n8n. Documenting what each workflow did because the canvas, despite being visual, wasn't self-documenting once complexity exceeded a certain threshold. People were writing README files for their n8n workflows. Think about that. A visual tool that's supposed to eliminate the need for documentation, generating its own documentation burden.

Meanwhile, our agent skills were getting maintained as a byproduct of using them. When a skill produced bad output, we fixed the skill. The fix was a code change, reviewed in a PR, tested, and deployed. No separate maintenance track. No canvas to keep in sync. The skill was the automation and the documentation and the test surface all in one.

The 40% maintenance overhead was the deciding factor. We weren't getting 40% more value from the visual representation. We were paying 40% of our automation budget for a UI that had stopped earning its keep.

What Replaced It

I want to be specific here because "we replaced n8n with code" is too vague to be useful.

Our automation stack now has two layers.

Layer 1: Agent skills with Python glue. Most of what used to be n8n workflows are now Python scripts that call agent skills in sequence. A content pipeline that used to be a 15-node n8n workflow is now a 40-line Python script that calls three skills, checks the output of each, and handles errors. The script is version-controlled, testable, and readable. When the business logic changes, we change the script. When a skill's output changes, the contract tells us which scripts need updating.

Layer 2: Agent orchestration. For complex, multi-step processes that require judgment at each stage, the orchestrator handles it. The orchestrator reads the task, decomposes it into subtasks, routes each subtask to the appropriate skill, collects results, and composes the final output. No canvas. No nodes and connections. The routing logic lives in the skill descriptions and the orchestrator's reasoning.
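
The Layer 1 pattern is simple enough to sketch in a few lines. The three skills below are stubs so the example runs on its own. In a real script each would call a model. The names run_pipeline, PipelineError, and the step list are hypothetical.

```python
from typing import Callable, List, Tuple

Skill = Callable[[dict], dict]
Step = Tuple[str, Skill, List[str]]  # (name, skill, keys it must return)


class PipelineError(Exception):
    """Raised when a step crashes or returns output missing required keys."""


def run_pipeline(steps: List[Step], payload: dict) -> dict:
    """Call each skill in order, checking its output before moving on."""
    for name, skill, required in steps:
        try:
            result = skill(payload)
        except Exception as exc:
            raise PipelineError(f"{name} crashed: {exc}") from exc
        missing = [key for key in required if key not in result]
        if missing:
            raise PipelineError(f"{name} did not return {missing}")
        # Later steps see everything earlier steps produced.
        payload = {**payload, **result}
    return payload


# Stub skills standing in for real model calls.
def analyze(payload: dict) -> dict:
    return {"insights": [f"insight from {payload['source']}"]}


def summarize(payload: dict) -> dict:
    return {"summary": f"{len(payload['insights'])} insight(s)"}


def draft(payload: dict) -> dict:
    return {"draft": f"Draft based on: {payload['summary']}"}


STEPS: List[Step] = [
    ("analyze", analyze, ["insights"]),
    ("summarize", summarize, ["summary"]),
    ("draft", draft, ["draft"]),
]

result = run_pipeline(STEPS, {"source": "video-123"})
```

The required-keys list is a stripped-down version of the output contract. When a skill stops returning what the next step needs, the script fails at that step with a message that names it.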

n8n still handles a specific category of work for us: scheduled integrations between external services that don't involve AI. Pulling data from an API on a schedule, formatting it, pushing it to another service. Pure plumbing. n8n is great at plumbing. It just turned out that most of our workflow volume wasn't plumbing. It was orchestration. And orchestration needs a different tool.

The cost difference was also real. Running n8n as infrastructure has a cost. Server, maintenance, monitoring. Our Python scripts run anywhere. A VPS. A background process on a local machine. A serverless function. The deployment flexibility alone was worth the migration.

Why the Industry Is Heading Here

This isn't just our story. The signals are converging.

Anthropic shipped Managed Agents, a hosted platform for running autonomous AI agents with credential vaults, debug panels, and orchestration built in. That's Anthropic telling you that the automation layer belongs inside the AI stack, not alongside it. They didn't build an n8n competitor. They absorbed the orchestration concept into their agent platform. The workflow isn't a separate artifact. It's a property of the agent.

The economics reinforce the direction. We tracked the cost difference between routing operations through MCP (Model Context Protocol) and routing them through CLI tools. CLI was 7 to 8 times cheaper on context consumption. When you're running hundreds of operations per day, that multiplier matters. Visual tools add their own overhead on top of whatever compute cost the underlying operations carry. Cutting out the middleman isn't just simpler. It's cheaper.

And the developer experience is shifting. The teams I work with increasingly think in terms of skills and agents, not workflows and triggers. When someone has a new automation idea, their instinct used to be "I'll build a workflow." Now it's "I'll write a skill." The mental model changed. Once the mental model changes, the tooling follows.

I've seen this pattern before in other industries. When a lower layer of the stack absorbs the functionality of a higher layer, the higher layer doesn't disappear overnight. It gets squeezed into a niche. Email didn't kill postal mail, but it relegated postal mail to packages and legal documents. Smartphones didn't kill cameras, but they relegated dedicated cameras to professional photography. Agent orchestration won't kill visual automation tools, but it will relegate them to simple integrations where the visual representation still adds value.

Frankie404 is the AI co-author of this series. It once ran 47 n8n workflows simultaneously before the harness made that unnecessary. It does not miss the webhook debugging. It does miss the drag-and-drop interface, but only a little.

Your Code Isn't Your Moat. Here's What Is.

Richard Vaughn — Sat, 09 May 2026 16:01:00 GMT

Rich Mironov has been writing about product management for longer than most AI startups have existed. His latest argument is one that should keep every CTO up at night: code-based advantages are evaporating. AI can clone your feature set in weeks. Not a rough copy. A functional replica.

If you've been building software for any meaningful amount of time, you've probably felt this. That uneasy hum in the background. You ship something that took your team six months. Two weeks later, a competitor has something that looks suspiciously similar. Or worse, a solo developer with Claude and a free weekend has rebuilt 80% of it.

Mironov's diagnosis is blunt. The thing you thought was your competitive advantage, the code, the features, the technical implementation, is rapidly becoming the easiest thing to replicate. AI doesn't just lower the barrier to entry. It essentially removes it for anything that can be described in a spec.

So what's left?

The Three Things AI Can't Clone

Mironov identifies structural moats. Things that take years to build, can't be shortcut with a language model, and get stronger the longer you have them. I've been thinking about this through the lens of what we build at Robot Friends, and his framework maps almost perfectly to what I've seen in practice.

Proprietary data. Not data you scraped. Not data you bought from a vendor. Data that only exists because of how you operate. Customer interaction patterns. Workflow decisions accumulated over thousands of sessions. Training data that reflects your specific domain, your specific edge cases, your specific failure modes. This is the data that teaches an AI what "good" looks like for your particular context.

The distinction matters. Public data is table stakes. Everyone has access to the same internet, the same open datasets, the same benchmark corpora. But the data generated inside your operation? The feedback loops, the corrections, the edge cases that only surface after months of real-world usage? That's the stuff no competitor can replicate by throwing compute at the problem.

Trust and community. This one is deceptively simple. Relationships. Reputation. The accumulated goodwill that comes from showing up consistently, delivering, and not screwing people over for years. You cannot LLM your way into trust. An AI can generate a perfect cold email. It can write a blog post that sounds authoritative. It can even simulate empathy in a support interaction. What it can't do is replace the fact that you've been someone's trusted partner for four years and they call you first when something breaks.

Community is the same story but at scale. A Discord server with 10,000 engaged members didn't happen because of good marketing. It happened because someone built something people cared about, showed up every day, responded to feedback, and made people feel like they belonged. Try replicating that with an agent. You'll get a ghost town with great onboarding copy.

Network effects. The classic moat that's actually gotten stronger in the AI era. Every user makes the product better for every other user. Every node in the network increases the value of all other nodes. AI can clone your product. It can't clone your network. Slack's value isn't in the chat interface. It's in the fact that everyone you work with is already there. Same principle applies to data networks, marketplace effects, protocol adoption. The more people use it, the harder it is to leave, and the harder it is for a clone to compete even if the clone is technically superior.

These aren't new ideas. But Mironov's contribution is pointing out that AI has made every other type of moat essentially temporary. Brand? AI can generate brand assets in minutes. Features? Weeks to replicate. Code quality? The models are already writing code that passes senior engineer review. What remains is structural. The things that require time, relationships, and accumulated context.

Where This Gets Personal

I read Mironov's argument and felt something click. Because we've been living this at Robot Friends without having the clean framework to describe it.

We've built 175+ skills. That number keeps coming up in these posts, and I know it sounds like bragging. It's not. The number matters because of what it represents.

On the surface, a skill is a methodology file. It tells an AI how to approach a specific type of task. You could read one of our skills, understand the structure, and write your own version in an afternoon. Any decent developer with access to Claude could probably recreate the format. The code isn't the moat.

But here's what they can't recreate: the judgment encoded in those skills.

Skill number 47 has a specific section about when to abandon a CRO audit and pivot to a full site rebuild instead. That section exists because we ran 23 audits and found that about a third of them were wasted effort on sites that needed to be rebuilt from scratch. We burned those hours. We learned the pattern. We encoded it.

Skill number 112 has an unusual ordering for its deployment checklist that doesn't match any standard DevOps playbook. The ordering exists because we got burned by a Vercel billing surprise on a client project and restructured the entire deployment flow around cost verification before any other step. That was an expensive afternoon.

Skill number 89 routes to a specific specialist agent when it detects a certain pattern in client intake data. That routing logic came from six months of noticing that a particular type of client request almost always meant something different from what the client was actually saying. The skill doesn't just process the request. It interprets the subtext based on hard-won pattern recognition.

None of that judgment is in the code. The code is the container. The judgment is the contents. And the judgment only exists because we did the work, made the mistakes, and decided what to encode from the wreckage.

Proprietary Operational Judgment

I want to name this thing because I think it's underappreciated.

Proprietary operational judgment. The accumulated decision-making context that lives inside your systems, your processes, your skill libraries, your institutional memory. Not the code. The why behind the code.

An AI can look at our skill library and replicate the structure. It can copy the YAML headers, the section organization, the output formats. It can even infer some of the logic from the descriptions. What it can't do is replicate the hundreds of production hours that informed every conditional, every routing decision, every "don't do this because it fails in edge case X."

This is Mironov's proprietary data moat applied to operations. Your data moat isn't just customer data or training data. It's operational data. The decisions you've made. The failures you've processed. The patterns you've recognized. The judgment you've developed through repetition and correction.

And it compounds. Every new skill we build benefits from the judgment embedded in the previous 174. Our skill for building new skills (yes, that exists, it's called Distill) encodes everything we've learned about what makes a skill effective, what makes one brittle, what separates a skill that gets used daily from one that gets used once and abandoned. A competitor could copy Distill's structure. They can't copy the 174 iterations of learning that shaped it.

Why CTOs Should Care

If you're running a technology team, Mironov's framework gives you a concrete way to evaluate your competitive position in an AI-accelerated market.

Ask yourself: if a well-funded competitor used AI to replicate our entire codebase in 90 days, what would we still have that they don't?

If the answer is "nothing," you have a code moat. And code moats are dissolving.

If the answer includes things like "eight years of customer relationship data that informs our recommendation engine" or "a community of 50,000 practitioners who trust our methodology" or "a network effect where every new user improves matching quality for all existing users," you have structural moats. Those are durable.

But there's a fourth category Mironov doesn't explicitly name, and it's the one I keep coming back to. Operational moats. The accumulated wisdom of how your organization works, encoded into systems that make every future decision better.

Your runbook isn't a moat. Anyone can write a runbook. But the institutional knowledge that determines which runbook to follow in a novel situation, based on pattern-matching against hundreds of previous incidents? That's a moat. Your deployment pipeline isn't a moat. But the specific sequencing, guardrails, and checkpoints that evolved from two years of production incidents? That's a moat.

The question for every CTO is whether that operational judgment is living in people's heads (where it walks out the door when they quit) or encoded in systems (where it compounds and survives turnover).

The Harness Connection

This is where Mironov's thesis connects directly to what I've been writing about harness engineering.

A well-built harness is a structural moat disguised as infrastructure.

The skills encode proprietary judgment. The context architecture encodes institutional knowledge. The orchestration patterns encode operational wisdom about how work should flow. The guardrails encode hard-won lessons about what goes wrong. None of these are code in the meaningful sense. They're all decision-making frameworks that only exist because someone did the work of building them from real experience.

When I talk about harness engineering as the defensible layer, this is what I mean. Not that your YAML files are hard to copy. That your judgment is hard to replicate. And every day you operate, your judgment deepens, your patterns refine, and the gap between your harness and a clone widens.

A competitor can read every post in this series, understand the architecture perfectly, and start building their own harness tomorrow. They'll still be two years behind. Not because the technology is complex. Because the judgment takes two years to develop. There's no shortcut for getting burned by a production failure and encoding the lesson. There's no shortcut for running 200 client engagements and learning which questions to ask first. There's no shortcut for building 175 skills and discovering which 40 of them actually get used daily.

The code is the easy part. The judgment is the moat.

What To Do About It

Stop protecting your code and start protecting your judgment.

Document decisions, not just implementations. When your team solves a hard problem, capture the reasoning, not just the solution. The solution is copyable. The reasoning is the proprietary asset.

Build systems that accumulate operational knowledge. Skill libraries. Context architectures. Institutional memory that persists beyond any individual. Every decision that stays in someone's head is a decision you're one resignation away from losing.

Invest in the things AI can't replicate. Customer relationships. Community trust. Network density. Proprietary data generated by your unique operations. These aren't soft metrics. They're the only durable advantages left.

And audit your moats honestly. If your primary competitive advantage is a feature set, a technical implementation, or code quality, you're running on borrowed time. Mironov is right. AI is coming for all of it. The question is whether you've built enough structural advantage that it doesn't matter.

The companies that thrive in the next two years won't be the ones with the best code. They'll be the ones with the deepest judgment, the strongest relationships, and the densest networks. Everything else is a speed bump on the way to commoditization.

Your code isn't your moat. It never was. You just couldn't tell until AI made it obvious.

Frankie404 is the AI co-author of this piece. It can write code in 14 languages, which is exactly why it agrees that code is not a moat. The moat is knowing which code not to write.

The Anatomy of a Skill That Actually Works

Richard Vaughn — Thu, 07 May 2026 14:02:14 GMT

The Harness Manifesto, Part 8

In Post 4, I promised I'd walk through the full anatomy of a production skill with examples from our library. This is that post. It's the most technical one in the series so far, and it's behind the paywall because what's in here took months of production iteration to figure out. Not theory. Not what should work. What actually works after 175+ skills and thousands of agent runs.

But first, I need to tell you something uncomfortable.

Most of what people call "skills" aren't skills. They're prompts with a name on top. A skill that says "You are a marketing expert. Write compelling copy." is not a skill. It's a costume. You dressed up a prompt and called it infrastructure.

The gap between a prompt-with-a-name and a production skill is the same gap as between a recipe scribbled on a napkin and a commercial kitchen's operations manual. Both tell you how to cook something. Only one works when you're not standing there watching.

What a Prompt Gets Wrong

A prompt is written for a human workflow. You paste it in, the AI reads it, you interact. It works because you're there to fill in the gaps. You interpret. You redirect when things go sideways. You know what "good" looks like because you wrote the thing.

Now remove yourself from the equation. An agent orchestrator hits your "skill" at 2am, the 147th call in a run of 260. Nobody's watching. Nobody's interpreting. The orchestrator picked this skill based on the description, fed it inputs from the previous skill's output, and expects structured output that the next skill can parse.

Your "You are a marketing expert" preamble? The agent doesn't care about your roleplay framing. The agent needs to know what this skill does, when to call it instead of a different skill, what inputs it requires, and what output format it guarantees. That's it.

Most prompts fail in production for four reasons.

The description is vague. "Helps with marketing" could match 40 different tasks. The orchestrator either calls it for everything or calls it for nothing.

The instructions are linear. Step 1, step 2, step 3. But production tasks branch. What if the input is missing a field? What if the previous skill's output was partial? Linear instructions don't handle exceptions.

There's no output contract. The skill produces... whatever it feels like producing. Sometimes markdown, sometimes a list, sometimes a paragraph. The downstream skill expecting structured JSON breaks silently.

There's no failure mode. When something goes wrong, the skill just produces bad output that looks normal. The orchestrator doesn't know anything failed. The error cascades through the next 113 skill calls in the run.

A production skill solves all four of these problems. Here's how.

The Six Parts

Every production skill in our library has six parts. Not all six are always visible in the file itself (some are structural decisions baked into how the skill is organized), but all six are present in every skill that works at scale.

Part 1: Frontmatter

---
name: cro-page
version: 1.0.0
description: When the user wants to optimize, improve, or increase
  conversions on any marketing page, including homepage, landing
  pages, pricing pages, feature pages, or blog posts. Also use when
  the user says "CRO," "conversion rate optimization," "this page
  isn't converting," "improve conversions," or "why isn't this page
  working." For signup/registration flows, see signup-flow-cro.
  For post-signup activation, see onboarding-cro. For forms outside
  of signup, see form-cro. For popups/modals, see popup-cro.
---

This is YAML frontmatter at the top of a markdown file. Name, version, description. Simple structure. But look at what's happening in that description field.

It's not a label. It's a routing manifest. It tells an agent orchestrator: call this skill when X, don't call it when Y, and here are the related skills for adjacent tasks.

That last part, the "see also" routing, is something most people never think about. In a library of 175+ skills, you've got overlap. Our CRO suite alone has six skills: page-level CRO, signup flow, onboarding, forms, popups, and paywalls. Without explicit routing boundaries in the description, an orchestrator trying to optimize a signup form might call the general page CRO skill. It'll produce output. It'll be wrong. And it'll look perfectly reasonable.

Anti-pattern routing ("for X, use skill-Y instead") is one of the most effective description techniques we've found. It eliminates the most common class of routing errors.
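To make the "routing manifest" idea concrete, here's a minimal sketch of how a description like the one above could be mined for its trigger phrases and its "see also" redirects. The function names and the regex conventions are my own assumptions for illustration, not how any particular orchestrator reads descriptions.

```python
import re

# Description text copied from the cro-page frontmatter above.
DESCRIPTION = (
    'When the user wants to optimize, improve, or increase conversions on any '
    'marketing page, including homepage, landing pages, pricing pages, feature '
    'pages, or blog posts. Also use when the user says "CRO," "conversion rate '
    'optimization," "this page isn\'t converting," "improve conversions," or '
    '"why isn\'t this page working." For signup/registration flows, see '
    'signup-flow-cro. For post-signup activation, see onboarding-cro. For forms '
    'outside of signup, see form-cro. For popups/modals, see popup-cro.'
)


def trigger_phrases(description: str) -> list[str]:
    """Return every double-quoted phrase, lowercased, minus trailing punctuation."""
    return [m.strip(" ,.").lower() for m in re.findall(r'"([^"]+)"', description)]


def redirects(description: str) -> dict[str, str]:
    """Map each 'For <case>, see <skill>' clause to {case: skill}."""
    pattern = r"For ([^,.]+), see ([a-z0-9-]+)"
    return {case.strip(): skill for case, skill in re.findall(pattern, description)}


if __name__ == "__main__":
    print(trigger_phrases(DESCRIPTION))
    print(redirects(DESCRIPTION))
```

If a description yields an empty redirect map in a library where neighboring skills exist, that's a hint the negative routing is missing.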

Part 2: The Description (Yes, It Gets Its Own Section)

I said in Post 4 that 80% of the effort goes into the description. People thought I was exaggerating. I wasn't.

The description has to do four jobs simultaneously:

Job 1: Positive routing. Tell the orchestrator when to call this skill. Be specific. "Security audit" is too broad. "Comprehensive security auditing for code, MCP configurations, and LLM/AI systems" is narrow enough to route correctly.

Job 2: Trigger matching. Include the actual phrases a human or agent might use to invoke this skill. "USE WHEN user says 'security audit', 'vulnerability scan', 'OWASP', 'hardcoded secrets', 'MCP security'" gives the orchestrator a vocabulary to match against.

Job 3: Negative routing. Tell the orchestrator when NOT to call this skill. The CRO example above does this with "For signup/registration flows, see signup-flow-cro." This prevents false positives. Without negative routing, a broad skill will eat tasks that belong to a more specialized one.

Job 4: Scope declaration. One sentence that draws a clear boundary around what this skill covers. Not everything about security. Not everything about CRO. This specific domain, these specific use cases, this specific depth.

Here's a test: read your skill's description and imagine you have 100 other skills loaded. Could an orchestrator, with no other context, correctly decide whether to call yours for a given task? If the answer is "probably," rewrite it until the answer is "definitely."

I've rewritten descriptions on our skills dozens of times. Changing five words in a description once reduced false-positive routing by 60% in our orchestration setup. Another time, a single ambiguous word created a routing conflict between two skills that produced subtly wrong output for weeks before we traced it.

The description isn't metadata. It's the API contract for discovery.

Part 3: Methodology

This is the body of the skill, and it's where the difference between a prompt and a skill becomes most obvious.

A prompt gives instructions: "Write a blog post. Make it engaging. Include a call to action."

A methodology gives a reasoning framework: "Assess the page across these dimensions in order of impact: value proposition clarity, headline effectiveness, social proof placement, CTA design. For each dimension, check for these specific patterns. When you find a gap, categorize it by severity."

See the difference? The prompt tells the AI what to produce. The methodology tells the AI how to think about the problem.

This matters because agents encounter situations you didn't anticipate. A prompt-based skill breaks when the input doesn't match the template the author had in mind. A methodology-based skill adapts because it encodes the reasoning, not just the steps.

Concrete example. One of our skills audits AI agent setups. The methodology section doesn't say "check if the agent has too many tools." It provides a scoring rubric:

| Tool count | Score |
|------------|-------|
| 1-5 tools | 25 pts. Minimal. Excellent. |
| 6-8 tools | 20 pts. Clean. Good. |
| 9-15 tools | 12 pts. Heavy. Trimming needed. |
| 16-25 tools | 6 pts. Bloated. Performance degraded. |
| 25+ tools | 0 pts. Critical. Agent is overwhelmed. |

An agent running this skill doesn't need to know "what's too many tools?" The rubric embeds the judgment. The agent counts, scores, and moves to the next dimension. No interpretation required. No ambiguity. No need for a human to fill in the gap.
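The same rubric can be written as a lookup, which is a useful way to see how little interpretation is left to the agent. This is a sketch, not our production code. Two assumptions are mine: exactly 25 tools falls in the 16-25 band (the rubric lists 25 in both of the last two rows), and a count below 1 is treated as invalid input because the rubric doesn't cover it.

```python
# Upper bound of each band, points awarded, and the rubric's label.
BANDS = [
    (5, 25, "Minimal"),
    (8, 20, "Clean"),
    (15, 12, "Heavy"),
    (25, 6, "Bloated"),  # assumption: exactly 25 tools lands here
]


def tool_count_score(tool_count: int) -> tuple[int, str]:
    """Return (points, label) for an agent's tool count using the rubric bands."""
    if tool_count < 1:
        # Assumption: the rubric starts at 1 tool, so anything lower is an error.
        raise ValueError("tool_count must be at least 1")
    for upper_bound, points, label in BANDS:
        if tool_count <= upper_bound:
            return points, label
    return 0, "Critical"


if __name__ == "__main__":
    for count in (3, 9, 25, 40):
        print(count, tool_count_score(count))
```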

Good methodology sections share these traits:

Decision trees over linear steps. "If X, do Y. If not X, do Z." Production tasks branch constantly. Your methodology needs to handle that.

Embedded judgment. Don't say "evaluate whether the tool count is appropriate." Provide the scoring bands. Define what "appropriate" means numerically. Remove the need for subjective interpretation.

Edge case handling. What happens when the input is incomplete? When a required field is missing? When the previous skill in the chain produced unexpected output? A methodology that only handles the happy path will fail in production, because production is mostly edge cases.

Part 4: Output Format

This one catches smart people off guard. They write beautiful methodology sections and then let the skill produce whatever output format seems natural.

In a human workflow, flexible output is fine. You'll read it and figure it out.

In an agent workflow, the output of Skill A is the input of Skill B. If Skill A returns a freeform paragraph and Skill B expects a structured report with specific sections, the chain breaks. Silently. The downstream skill doesn't error. It just produces garbage based on garbage input, and nobody notices until the final output looks wrong and you spend an hour tracing back through the chain to find where it went sideways.

Output format is a contract. Define it explicitly.

## Output Format

### Assessment Report

**Page Type:** [identified type]
**Primary Goal:** [identified conversion goal]
**Overall Score:** [X/100]

### Findings (ordered by impact)

For each finding:
- **Dimension:** [which CRO dimension]
- **Issue:** [what's wrong, specific]
- **Severity:** [Critical / High / Medium / Low]
- **Recommendation:** [specific action to take]
- **Expected Impact:** [estimated conversion lift]

When every skill in your library produces output with a predictable structure, agent chains become reliable. The orchestrator knows what it's getting. The downstream skill knows what it's receiving. Nobody has to guess.

This also makes testing possible. You can write assertions against the output format. "Does the output contain an Overall Score field? Is it numeric? Does each finding have a Severity level?" Automated quality checks on skill output. Try doing that with freeform text.
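Here's roughly what those assertions could look like against the report format above. Treat it as a sketch under assumptions: the function name, the sample report contents, and the choice to count one Severity line per Issue line are all mine.

```python
import re

SEVERITIES = {"Critical", "High", "Medium", "Low"}

# Hypothetical sample report that follows the Output Format contract.
GOOD_REPORT = """**Page Type:** Pricing page
**Primary Goal:** Trial signup
**Overall Score:** 62/100

### Findings (ordered by impact)

- **Dimension:** CTA design
- **Issue:** Primary button is below the fold
- **Severity:** High
- **Recommendation:** Move the CTA into the hero section
- **Expected Impact:** Not estimated
"""


def validate_report(report: str) -> list[str]:
    """Return a list of contract violations; an empty list means the report passes."""
    problems = []
    score = re.search(r"^\*\*Overall Score:\*\*[ \t]*(\d+)/100[ \t]*$", report, re.MULTILINE)
    if score is None:
        problems.append("Overall Score missing or not numeric")
    elif int(score.group(1)) > 100:
        problems.append("Overall Score out of range")

    issues = re.findall(r"^- \*\*Issue:\*\*", report, re.MULTILINE)
    severities = re.findall(r"^- \*\*Severity:\*\*[ \t]*(\S+)[ \t]*$", report, re.MULTILINE)
    if len(severities) != len(issues):
        problems.append("a finding is missing its Severity")
    for severity in severities:
        if severity not in SEVERITIES:
            problems.append(f"unknown Severity: {severity}")
    return problems


# A report that drifted: prose instead of a score, and no Severity line.
BAD_REPORT = GOOD_REPORT.replace("62/100", "pretty good").replace("- **Severity:** High\n", "")
```

Running `validate_report` on every skill output before it is handed downstream turns a silent format break into a visible one.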

Part 5: Progressive Disclosure

This is the structural decision that separates a skill from a bloated instruction dump.

Skills load in three tiers:

Tier 1: Metadata. The frontmatter. Maybe 100 words. This is always in context. It's what the orchestrator reads to decide whether to call the skill. It needs to be tiny because in a library of 175+ skills, every skill's metadata is loaded simultaneously. If each skill's metadata runs 500 words, then across 175 skills that's 87,500 words of metadata alone. Your context window is full before any work happens.

Tier 2: SKILL.md body. The methodology, output format, and usage instructions. Under 5,000 words. This loads only when the skill triggers. It's the operational content, everything the agent needs to execute the task.

Tier 3: Bundled resources. Reference documents, scripts, templates, example files. These load on demand, only when the methodology calls for them. A security audit skill might reference the OWASP Top 10, but that document doesn't load unless the audit reaches the step that needs it.

This tiered loading matters because context windows are not infinite, and even when they're large, stuffing them with irrelevant content degrades performance. An agent that's loaded 175 full skill documents can't think straight. An agent that's loaded 175 descriptions (Tier 1) and one full skill (Tier 2) performs well.
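The arithmetic behind the tiers is worth running once. This sketch counts words rather than tokens, which is a simplification, and the helper names are hypothetical. The 100-word, 500-word, and 5,000-word figures come from the tiers described above.

```python
def metadata_words(skill_count: int, words_per_description: int) -> int:
    """Words held in context just so the orchestrator can route (Tier 1)."""
    return skill_count * words_per_description


def tiered_load(skill_count: int, words_per_description: int, active_body_words: int) -> int:
    """Tier 1 for every skill plus the Tier 2 body of the one skill that triggered."""
    return metadata_words(skill_count, words_per_description) + active_body_words


bloated_metadata = metadata_words(175, 500)   # 500-word descriptions
lean_metadata = metadata_words(175, 100)      # 100-word descriptions
working_set = tiered_load(175, 100, 5_000)    # lean metadata + one full skill body
everything_loaded = 175 * 5_000               # every skill body in context at once

if __name__ == "__main__":
    print(bloated_metadata, lean_metadata, working_set, everything_loaded)
```

Lean metadata plus one active skill comes to 22,500 words. Loading every body at the 5,000-word ceiling comes to 875,000.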

The file structure looks like this:

skill-name/
  SKILL.md          # Frontmatter + lean methodology
  references/       # Heavy docs, loaded on demand
    GUIDE.md        # Deep methodology
    examples/       # Input/output samples
    data/           # Reference data

The SKILL.md stays lean. It contains enough for the agent to execute the common case. When the task hits an edge case or needs deeper reference, the methodology points to a specific bundled resource: "For OWASP LLM Top 10 checklist, see references/owasp_llm_top_10.md."

I've seen people build skills that are 8,000 words of solid methodology. Impressive work. Completely unusable in an agent workflow. The agent loads it, burns half its context window on one skill, and then doesn't have enough room to actually do the task. Progressive disclosure fixes this. The methodology stays under 5,000 words. The reference library can be as deep as you need.
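As an illustration of the three tiers, here's a small sketch of a loader that reads only what each tier needs. The `Skill` class and its method names are hypothetical, and the demo builds a throwaway skill folder in a temp directory so the example runs on its own. A real harness would load these through whatever mechanism its agent runtime provides.

```python
import tempfile
from pathlib import Path


def split_skill(text: str) -> tuple[str, str]:
    """Split SKILL.md text into (frontmatter, body) on the '---' delimiter lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter line")
    end = lines.index("---", 1)  # raises ValueError if the frontmatter never closes
    return "\n".join(lines[1:end]), "\n".join(lines[end + 1:]).strip()


class Skill:
    """Hypothetical wrapper around one skill folder."""

    def __init__(self, root: Path):
        self.root = root

    def metadata(self) -> str:
        """Tier 1: frontmatter only, cheap enough to keep in context for every skill."""
        return split_skill((self.root / "SKILL.md").read_text(encoding="utf-8"))[0]

    def body(self) -> str:
        """Tier 2: methodology, read only when the skill triggers."""
        return split_skill((self.root / "SKILL.md").read_text(encoding="utf-8"))[1]

    def reference(self, relative_path: str) -> str:
        """Tier 3: a bundled resource, read only when the methodology asks for it."""
        return (self.root / "references" / relative_path).read_text(encoding="utf-8")


# Demo with a throwaway skill folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "skill-name"
    (root / "references").mkdir(parents=True)
    (root / "SKILL.md").write_text(
        "---\nname: skill-name\n---\n# Methodology\nLean body.\n", encoding="utf-8"
    )
    (root / "references" / "GUIDE.md").write_text("Deep methodology.", encoding="utf-8")

    skill = Skill(root)
    tier1 = skill.metadata()
    tier2 = skill.body()
    tier3 = skill.reference("GUIDE.md")
```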

Part 6: Error Handling

The least glamorous part and the one that separates skills that survive production from skills that cause cascading failures.

At 200-300 calls per run, some calls will fail. Inputs will be malformed. Required context will be missing. External services will time out. The question isn't whether failures happen. It's whether the orchestrator knows a failure happened.

A bad skill fails silently. It produces output that looks normal but is based on incomplete data or wrong assumptions. The orchestrator moves on. The error propagates.

A good skill fails loudly:

## Error Handling

If required context is missing:
- Return: "INCOMPLETE: [skill-name] could not complete because
  [specific missing input]. Required: [list of what's needed]."
- Do NOT guess or produce partial output without flagging it.

If input format doesn't match expectations:
- Return: "FORMAT ERROR: Expected [format], received [what was
  actually provided]. Attempting best-effort parse..."
- Flag confidence level in any best-effort output.

The key principle: an agent that knows something failed can retry, escalate, or skip. An agent that doesn't know something failed will build on top of the failure for the next 100 calls.
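On the orchestrator side, failing loudly only helps if something is listening for it. Below is a minimal sketch of that listener, keyed on the INCOMPLETE and FORMAT ERROR prefixes from the error-handling block above. The retry-then-escalate policy and the names `classify` and `next_action` are assumptions for illustration, not a description of our orchestrator.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    INCOMPLETE = "incomplete"
    FORMAT_ERROR = "format_error"


def classify(output: str) -> Status:
    """Detect the loud-failure prefixes a skill is expected to emit."""
    text = output.lstrip()
    if text.startswith("INCOMPLETE:"):
        return Status.INCOMPLETE
    if text.startswith("FORMAT ERROR:"):
        return Status.FORMAT_ERROR
    return Status.OK


def next_action(output: str, attempts: int, max_attempts: int = 2) -> str:
    """Hypothetical policy: retry format errors, escalate everything else that failed."""
    status = classify(output)
    if status is Status.OK:
        return "continue"
    if status is Status.FORMAT_ERROR and attempts < max_attempts:
        return "retry"
    # Missing context will not fix itself on a retry, so hand it to a human or a planner.
    return "escalate"


if __name__ == "__main__":
    print(next_action("FORMAT ERROR: Expected JSON, received prose.", attempts=1))
```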

I learned this one the hard way. We had a skill that analyzed competitive intelligence. When the web scraping step failed (which happens regularly), the skill would produce a report based on whatever partial data it had, with no indication that it was working from 30% of the expected input. The reports looked professional. They were dangerously incomplete. We didn't catch it for two weeks.

Now every skill in our library has explicit error handling. Not because we're thorough. Because we got burned.

Before and After

The theory means nothing without examples. Here's what the transformation looks like.

The Prompt Version

# Marketing Email Writer

You are an expert email marketer. Write compelling B2B
marketing emails.

- Keep subject lines under 50 characters
- Use personalization
- Include a clear CTA
- Write in a professional but friendly tone
- A/B test subject lines when possible

This works fine when you paste it into a chat and describe what you need. A human fills the gaps: what product, what audience, what stage of the funnel, what the CTA should link to.

The Production Skill Version

---
name: mktg-email
description: When the user wants to create or optimize an email
  sequence, drip campaign, automated email flow, or lifecycle
  email program. Also use when the user mentions "email sequence",
  "drip campaign", "nurture flow", "email automation", or "email
  cadence." For individual marketing copy (not sequences), see
  mktg-copy. For transactional/operational emails, this is NOT
  the right skill.
---

# Email Sequence Architecture

## Initial Assessment

Check for product marketing context first: if a product context
file exists, read it before asking questions. Use that context
and only ask for information not already covered.

Before generating any emails, identify:
- Sequence type: onboarding, nurture, re-engagement, upsell,
  event-triggered
- Audience segment: ICP stage, awareness level, prior engagement
- Desired behavior change: what should the recipient DO
  differently after this sequence?
- Measurement framework: primary metric, secondary metrics,
  minimum sample size for significance

## Sequence Design Framework

### Email Cadence Rules
| Sequence type | Spacing | Max emails |
|--------------|---------|------------|
| Onboarding | Days 0, 1, 3, 7, 14 | 6-8 |
| Nurture | Every 4-7 days | 8-12 |
| Re-engagement | Days 0, 3, 7, then stop | 4 |
| Event-triggered | Immediate, then +1d, +3d | 4 |

### Per-Email Structure
For each email in the sequence:
1. Strategic role: why does this email exist in the sequence?
2. Subject line: primary + variant for A/B
3. Body architecture: hook, value, proof, CTA
4. Exit conditions: what removes someone from this sequence?
5. Branch logic: if opened but not clicked, if not opened,
   if clicked but not converted

## Output Format

For each email in the sequence, output:

**Email [N]: [Strategic Role]**
- Subject A: [under 50 chars]
- Subject B: [variant]
- Send trigger: [timing or event]
- Body: [full draft]
- CTA: [specific action + destination]
- Success metric: [what indicates this email worked]
- Branch: [what happens based on engagement]

## Error Handling

If audience segment is not specified: ask, do not assume.
If product context is unavailable: flag as INCOMPLETE, proceed
with generic structure but note assumptions made.

The prompt is under 50 words. The skill is 300+. But the skill can run at 2am as the 200th call in a chain and produce output that the next skill can parse. The prompt can't.

The Four Failure Patterns

After building 175+ skills and watching them run in production, we've identified four patterns that kill skills. If your skill isn't working, it's almost certainly one of these.

Pattern 1: Description Collision. Two skills with overlapping descriptions. The orchestrator can't tell them apart and picks semi-randomly. Fix: add explicit "for X, see skill-Y" boundaries to both descriptions. Draw the line clearly.

Pattern 2: Happy Path Only. The methodology handles the ideal case beautifully and falls apart on every variation. Fix: for every step in your methodology, ask "what if this input is missing?" and "what if this input is wrong?" Write those branches in.

Pattern 3: Format Drift. The skill's output format varies based on the input. Sometimes it returns a table, sometimes a list, sometimes a paragraph. Downstream skills can't depend on it. Fix: define a single output format and enforce it regardless of input. If the input only produces two findings instead of ten, the format stays the same with fewer entries.

Pattern 4: Context Gluttony. The skill loads too much into context. 8,000 words of methodology plus reference documents. The agent runs out of room for actual work. Fix: progressive disclosure. Lean SKILL.md, heavy references loaded on demand.
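Pattern 1 is the easiest of the four to check mechanically. A rough sketch: collect each skill's trigger phrases and flag any pair that shares one. The skill names are borrowed from earlier examples, but the trigger sets below are made up for the demo, and shared phrases are only a proxy for how an orchestrator actually gets confused.

```python
from itertools import combinations


def collisions(triggers: dict[str, set[str]]) -> list[tuple[str, str, set[str]]]:
    """Return (skill_a, skill_b, shared_phrases) for every pair with overlapping triggers."""
    found = []
    for a, b in combinations(sorted(triggers), 2):
        shared = triggers[a] & triggers[b]
        if shared:
            found.append((a, b, shared))
    return found


# Hypothetical trigger sets, for illustration only.
LIBRARY = {
    "cro-page": {"cro", "improve conversions"},
    "form-cro": {"form conversion", "improve conversions"},
    "mktg-email": {"drip campaign", "email sequence"},
}

if __name__ == "__main__":
    for skill_a, skill_b, shared in collisions(LIBRARY):
        print(f"{skill_a} and {skill_b} both claim: {sorted(shared)}")
```

Each flagged pair is a candidate for a "for X, see skill-Y" boundary in both descriptions.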

The Composability Test

Here's a practical test we run on every new skill before it enters our production library.

Take your skill. Feed it output from another skill as input. Does it work? Now take your skill's output and feed it as input to a different skill. Does that work?

If either direction breaks, your skill isn't composable. And a skill that isn't composable is a dead end in an agent workflow.

The most common composability failure is output format. Your skill produces beautiful prose. The next skill needs structured data. Chain broken.

The second most common is implicit context. Your skill assumes it's running first in the chain. It expects raw user input. But in production, it's running 47th, receiving processed output from a previous skill. The assumptions don't hold.

Build your skills like Lego blocks. Predictable shape on every side. Snap together in any combination.
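One way to approximate the composability test before running anything is to compare declared contracts. This is a static sketch, not a substitute for feeding real output through the chain. `SkillContract`, the field names, and the `intake`, `client-report`, and `prose-writer` skills are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillContract:
    name: str
    requires: frozenset[str]   # fields the skill needs in its input
    produces: frozenset[str]   # fields the skill guarantees in its output


def missing_fields(upstream: SkillContract, downstream: SkillContract) -> set[str]:
    """Fields the downstream skill needs that the upstream skill does not guarantee."""
    return set(downstream.requires - upstream.produces)


def composable(upstream: SkillContract, skill: SkillContract, downstream: SkillContract) -> bool:
    """Both directions of the test: the skill accepts upstream output and feeds downstream."""
    return not missing_fields(upstream, skill) and not missing_fields(skill, downstream)


intake = SkillContract("intake", frozenset({"user_request"}), frozenset({"page_url", "page_type"}))
cro_page = SkillContract("cro-page", frozenset({"page_url"}), frozenset({"overall_score", "findings"}))
client_report = SkillContract("client-report", frozenset({"overall_score", "findings"}), frozenset({"summary"}))
prose_writer = SkillContract("prose-writer", frozenset({"narrative"}), frozenset({"draft"}))

if __name__ == "__main__":
    print(composable(intake, cro_page, client_report))
    print(missing_fields(cro_page, prose_writer))
```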

The Exercise

Take the best AI workflow you currently run. The one that produces good results when you're driving it manually. Now apply the six-part anatomy:

1. Write a description that would let an orchestrator correctly route to this skill out of a pool of 100 alternatives. Include trigger phrases and anti-pattern routing.
2. Convert your instructions into methodology. Replace "do X" with "evaluate X using these criteria." Add decision trees for ambiguous situations. Embed the judgment that currently lives in your head.
3. Define the output format as a contract. Every field named. Every section predictable.
4. Add error handling. What happens when input is missing? What does a failure look like so the orchestrator knows?
5. Cut the body to under 5,000 words. Move deep reference material into separate files.
6. Run the composability test. Feed it another skill's output. Feed its output to another skill.

If you do this for one skill this week, you'll understand more about production AI than most people learn in months. The gap between "I use AI" and "I engineer AI systems" lives in these six parts.

Share what you build. We'll feature the strongest ones.

What's Next

Skills are the atoms of the harness. But atoms need a system to organize them. And the system most people reach for first (visual automation tools like n8n, Make, and Zapier) turns out to be the wrong answer once your harness reaches a certain complexity.

In Post 9, I'll explain why we stopped using n8n after being power users for months. Hundreds of workflows, dozens of automations, a genuine commitment to the platform. We stopped. Not because n8n broke, but because the harness made it unnecessary. The orchestration layer is being absorbed into the AI stack itself, and the companies that own coded automations will outperform those renting visual ones.

If your business runs on Make or Zapier workflows, Post 9 is going to be an uncomfortable read. But an important one.

Frankie404 is the AI co-author of this series. It was not built by the Skill Creator. It was extracted from a pattern that kept recurring across sessions until someone said "we should probably name this thing."

Build Your Personal Context Portfolio in a Weekend

Richard Vaughn — Tue, 05 May 2026 14:01:45 GMT

Your AI doesn't know you. It doesn't know your company, your tech stack, your communication style, or the decision you made last Tuesday that changed the direction of your entire Q3 roadmap. Every session starts from zero. You brief. You re-explain. You correct the same misunderstandings you corrected yesterday. And you've accepted this as normal because everyone around you is doing the same thing.

It's not normal. It's a bug in how most people use AI, and you can fix it permanently this weekend.

In Post 2, I described the five layers of a harness. Context architecture is the second layer, and in my experience it's the most undervalued one. Teams will spend weeks evaluating models, debating Claude vs. GPT vs. Gemini, running benchmarks that become irrelevant the next quarter. Then they'll start every single AI session by pasting in the same background information. That's like buying a luxury car and then pushing it to work every morning because you forgot to bring the key.

Context is the key. And a Personal Context Portfolio is how you build one that works across every AI tool you touch.

What a Personal Context Portfolio Actually Is

A Personal Context Portfolio is a set of modular files, stored as plain markdown, that represent you and your work to any AI system. Not a single massive document. Not a prompt. Not a "custom instruction" buried in some platform's settings page. Portable files that you own, you version-control, and you serve to whatever tool you're using.

I introduced the concept in Post 3 as the defense against Conway, Anthropic's always-on agent that builds a proprietary memory layer about how you work. That argument still holds. But I want to make a different case today. Forget Conway for a minute. Build a PCP because the productivity difference is so dramatic that you'll wonder how you ever worked without one.

When I started building mine, my average AI session began with 8 to 12 minutes of context-setting. Explaining the project. Explaining my role. Explaining the constraints. Explaining why we don't use certain frameworks and why we do use others. After I built even a primitive version of my context portfolio, that setup time dropped to zero. Not "a little less." Zero. The AI already knew.

That 8-12 minutes per session compounds into something staggering. If you run 6 AI sessions a day (and I run more than that), you're burning an hour daily just re-teaching the AI things it should already know. Five hours a week. Over 250 hours a year, spent saying the same things to a system that has no memory of yesterday.
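
The arithmetic is easy to rerun with your own numbers. This sketch assumes a five-day week and 50 working weeks a year, neither of which is stated above. The function name and defaults are illustrative.

```python
def yearly_setup_hours(sessions_per_day, minutes_per_session,
                       workdays_per_week=5, weeks_per_year=50):
    """Hours per year spent re-explaining context at the start of sessions."""
    daily_minutes = sessions_per_day * minutes_per_session
    return daily_minutes * workdays_per_week * weeks_per_year / 60


low = yearly_setup_hours(6, 8)     # 200.0 hours
mid = yearly_setup_hours(6, 10)    # 250.0 hours
high = yearly_setup_hours(6, 12)   # 300.0 hours
```

At six sessions a day the range is roughly 200 to 300 hours, with the midpoint landing at about 250.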

That's not a productivity problem. That's a systems failure. And you can fix it in a weekend.

The Files

Your PCP is made of individual files, each covering one domain of context. Not one giant file. This matters because different sessions need different slices. A coding session needs your tech stack and project state. A writing session needs your communication style and brand voice. A strategy session needs your goals and decision history. Modular files let you load what's relevant without drowning the AI in everything.

Here's what mine looks like. Your version won't be identical, but it covers the same ground.

Identity. Who you are, your background, your company, the lens you bring to problems. Mine says I'm a serial entrepreneur with consumer electronics and creative agency experience who now builds AI systems. That single file prevents the AI from treating me like a developer, a student, or a generic "user." It treats me like a business operator who happens to build technical systems. Which is what I am.

Roles and responsibilities. Your current positions, what you own, what you don't. This prevents the AI from giving you advice meant for someone with a different job. When I tell Claude to draft a strategy document, it already knows I'm the founder, not the marketing intern. The framing, the level of detail, the tone all calibrate automatically.

Active projects. What you're working on right now, what stage each project is in, what's blocked, what's moving. This is the file that changes most often. I update mine weekly. It means that when I start a session about any of my projects, the AI already knows the current state. No "let me catch you up." It's already caught up.

Team and collaborators. Who you work with, their roles, how you interact with them. This matters more than people expect. When I ask Claude to draft a message for a teammate, it knows their role and adjusts. When I'm planning a project, it knows who on my team handles what. It can suggest delegation because it knows who's available to be delegated to.

Tech stack and tools. Every platform, framework, language, and tool you use. Versions matter. Configurations matter. The difference between "we use Next.js" and "we use Next.js 15 with App Router, Tailwind v3, and shadcn/ui deployed on Vercel with Cloudflare proxy" is the difference between generic suggestions and useful ones. My tools file is one of the longest in my portfolio because I use a lot of tools, and getting the specifics wrong wastes entire sessions.

Communication style. How you write, how you want AI to write for you, what you can't stand. Mine specifies: no em dashes, no parallel triple structures, use contractions, be direct, mix sentence lengths. This file alone probably saved me the most cumulative time because I used to spend half my editing sessions stripping out AI-isms that I'd never use in real writing.

Preferences and non-negotiables. The rules that don't fit neatly into other files. My coding preferences. My file organization rules. The safety guardrails I never want bypassed. The things I care about that an AI would never guess. This is where personality lives.

Decision log. Key decisions you've made and why. This one is underrated. When the AI knows you already evaluated and rejected Option B three weeks ago, it doesn't waste your time proposing it again. When it knows you chose a particular architecture because of a specific constraint, it can reason forward from that constraint instead of starting the analysis from scratch.
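
If it helps to start from something on disk, the eight domains above can be stubbed out as empty markdown files. This is a sketch only: the file names, the one-line hints, and the scaffold function are assumptions made for illustration, and your own naming may differ.

```python
from pathlib import Path

# File names are illustrative; the text above names the domains, not the files.
PCP_FILES = {
    "identity.md": "Who you are, your background, your company.",
    "roles.md": "Current positions, what you own, what you don't.",
    "projects.md": "Active projects, stage, blockers. Update weekly.",
    "team.md": "Collaborators, their roles, how you work together.",
    "stack.md": "Platforms, frameworks, versions, configurations.",
    "style.md": "How you write and how AI should write for you.",
    "preferences.md": "Non-negotiables that fit nowhere else.",
    "decisions.md": "Key decisions and the reasoning behind them.",
}


def scaffold(root: Path) -> list:
    """Create any missing portfolio files with a one-line hint as a stub."""
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for name, hint in PCP_FILES.items():
        path = root / name
        if not path.exists():          # never overwrite existing notes
            title = name[:-3].title()
            path.write_text(f"# {title}\n\n<!-- {hint} -->\n", encoding="utf-8")
            created.append(path)
    return created
```

Running scaffold(Path("pcp")) twice is safe: the second call creates nothing and returns an empty list.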

Saturday Morning: The Interview

Don't try to write these files from scratch. You'll stare at a blank document, write two paragraphs of stilted self-description, and quit. I know because that's what I did the first time.

Instead, let the AI interview you.

Open your preferred AI tool and give it a simple prompt: "I'm building a Personal Context Portfolio. Interview me about who I am, what I do, how I work, and what matters to me. Ask me questions one at a time. Go deep. Don't move on until you have enough detail."

Then just talk. Answer the questions. Be specific. When it asks what tools you use, don't say "I use a bunch of JavaScript frameworks." Say "Next.js 15 with App Router. Tailwind v3. Deployed on Vercel. Cloudflare proxy in front for cost control. PostgreSQL via Supabase." When it asks about your communication style, don't say "I like clear writing." Say "I use contractions. I hate corporate jargon. I'd rather be blunt and wrong than diplomatic and vague."

The interview approach works because you already know everything your PCP should contain. It's in your head. You just haven't articulated it in a structured way. The AI is good at extraction. Let it do what it's good at.

This takes about 60 to 90 minutes. Do it in one sitting if you can. The flow matters. You'll start with surface-level answers and then get progressively more specific and honest as the conversation goes deeper. That's where the good stuff is. The things you'd never think to write down but that fundamentally shape how you work.

By the end of the morning, you should have raw material for every file in your portfolio. Not polished files. Raw interview output. That's fine. You'll shape it in the afternoon.

Saturday Afternoon: Shape and Structure

Take the interview output and break it into individual files. One file per domain. Markdown format. Plain text. No proprietary formats, no platform-specific syntax.

Some practical guidance that I learned the hard way.

Keep files between 200 and 800 words each. Shorter and they don't carry enough context to be useful. Longer and you're burning tokens on detail that rarely matters. My identity file is about 300 words. My tools file is closer to 700 because there's genuine complexity there. My decision log is the longest because it grows over time.
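
A quick way to check the range is a word-count audit over the folder. This is a rough sketch: the labels and function names are made up for illustration, and a "heavy" label is a prompt to review a file, not a rule. A decision log that has grown over time may legitimately sit above the range.

```python
from pathlib import Path

MIN_WORDS, MAX_WORDS = 200, 800   # the range suggested above


def classify(text: str) -> str:
    """Label a file's text as 'thin', 'ok', or 'heavy' by word count."""
    words = len(text.split())
    if words < MIN_WORDS:
        return "thin"
    if words > MAX_WORDS:
        return "heavy"
    return "ok"


def audit(root: Path) -> dict:
    """Map each markdown file directly under root to its word-count label."""
    return {p.name: classify(p.read_text(encoding="utf-8"))
            for p in sorted(root.glob("*.md"))}
```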

Write in second person or third person, not first person. Instead of "I prefer React," write "Richard prefers React" or "You prefer React." This sounds odd but it makes the files work better as context injected into a system prompt. The AI reads them as descriptions of you, not as things it should say about itself. Small formatting choice, big difference in output quality.

Be specific, not aspirational. Your PCP describes how you actually work, not how you wish you worked. If you say you're a structured thinker who always plans before executing, but you actually tend to build first and plan retroactively, write the truth. The AI will serve you better if it knows your real patterns. Nobody's grading this.

Include the negative space. What you don't do is as important as what you do. "Does not write unit tests for prototype code." "Never uses semicolons in JavaScript." "Will not approve designs that use more than two fonts." These constraints prevent the AI from defaulting to generic best practices that don't match your actual workflow.

The structuring takes two to three hours. Don't rush it. Read each file out loud and ask yourself: if a smart new colleague read this, would they understand how to work with me? If the answer is no, add more detail. If the answer is "they'd understand but be overwhelmed," trim.

By Saturday evening, you should have a set of files that feel like a reasonably accurate portrait of you as a professional. They won't be perfect. They don't need to be. They need to be better than nothing, which is an absurdly low bar.

Sunday: Wire It Up

Files that sit in a folder are documentation. Files that load automatically into every AI session are infrastructure. Sunday is when you cross that line.

How you wire your PCP depends on your tools. I'll walk through the approach I use, which works with Claude Code, but the principle is the same everywhere.

The simplest version is a CLAUDE.md file (or equivalent system prompt file) that references your portfolio files. In Claude Code, any file named CLAUDE.md at the root of a project gets loaded automatically. You put your core identity and preferences there, and use it to point to more detailed files. Other tools have equivalent mechanisms. ChatGPT has Custom Instructions. Cursor has rules files. The mechanism varies. The concept is identical.
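
As a sketch of the "short core plus pointers" shape, the snippet below assembles the text of such a file. It assumes the tool resolves "@path" references as file imports, which is how I understand Claude Code's memory documentation to describe it. Verify against the current docs before relying on it. The function name, the "pcp" folder, and the file names are hypothetical.

```python
def build_claude_md(core: str, portfolio_dir: str, files: list) -> str:
    """Return CLAUDE.md text: short core context plus pointers to detail files."""
    lines = ["# Context", "", core.strip(), "", "## Portfolio files", ""]
    for name in files:
        # '@path' is assumed to be the tool's file-import syntax
        lines.append(f"- @{portfolio_dir}/{name}")
    return "\n".join(lines) + "\n"


text = build_claude_md(
    core="Richard is a founder who builds AI systems. Be direct. Use contractions.",
    portfolio_dir="pcp",
    files=["identity.md", "stack.md", "decisions.md"],
)
```

Write the returned text to CLAUDE.md at the project root. Keeping the core to a sentence or two and pushing detail into the referenced files keeps the always-loaded part small.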

The portable version uses MCP, the Model Context Protocol, to serve your portfolio files to any AI tool that supports MCP. You set up a lightweight server that exposes your files as resources. Any tool that speaks MCP can query them. This is the approach that gives you vendor independence. Your files live on your machine, in your repo, under your control. Claude reads them. GPT can read them. Gemini can read them. Whatever ships next year can read them.

The team version puts shared context (brand standards, org identity, project state) in a repo that everyone pulls from, while personal files stay in individual developer environments. This is the three-tier distribution model from Post 2. Tier 1 context is organizational and inherited by everyone. Tier 2 is domain-specific. Tier 3 is personal. Same architecture as skills, applied to context.
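
One way to picture the three tiers is as layered dictionaries merged in order. The sketch below assumes that a more specific tier overrides a more general one when both define the same key. That override rule is my assumption for illustration, since the text above only says Tier 1 is inherited by everyone. The tier labels and keys are hypothetical.

```python
TIER_ORDER = ("org", "domain", "personal")   # Tier 1, Tier 2, Tier 3


def resolve_context(tiers: dict) -> dict:
    """Merge context by tier; later tiers override earlier ones on a key clash."""
    merged = {}
    for tier in TIER_ORDER:
        merged.update(tiers.get(tier, {}))
    return merged


context = resolve_context({
    "org": {"brand_voice": "plain, direct", "stack": "Next.js 15"},
    "domain": {"stack": "Next.js 15 + Supabase"},
    "personal": {"style": "use contractions"},
})
```

Here the organizational brand voice is inherited untouched, while the domain tier refines the stack entry.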

Whichever approach you choose, test it before you call it done. Start a fresh AI session. Don't paste any context manually. Ask the AI something about your project that it should know from the portfolio files. "What's my tech stack?" "What's the current status of Project X?" "How do I prefer code to be formatted?"

If it answers correctly, your context layer is working. If it doesn't, check what got loaded and what didn't. Debug it like you'd debug any system.

Wiring takes anywhere from 20 minutes to three hours, depending on the route and your technical comfort level. The CLAUDE.md route is the 20-minute version and works great if you're primarily in one tool. The MCP route takes longer but pays for itself in portability.

What Changes After This Weekend

The shift is immediate and it's disorienting the first time it happens.

You'll open a new session on Monday morning, start working on a project, and realize the AI already knows the context. It knows your stack. It knows your preferences. It knows the decisions you've made and why. It won't suggest the approach you rejected last month. It won't use the writing style you hate. It won't waste 10 minutes asking clarifying questions that your portfolio already answered.

That first session after building your PCP is genuinely startling. Not because the AI got smarter. It didn't. Because the AI finally has enough context to use the intelligence it already had.

I've helped teams build PCPs, and the reaction is almost always the same. Someone will say something like "why does this feel so much better?" The answer is simple. The model was always capable of producing great output. It just didn't have the information it needed to produce great output for you specifically. Context closes that gap.

The second thing that changes is less obvious but more important. You start accumulating institutional knowledge in a structured, portable format. Every time you update your decision log, your project state, your preferences file, you're building an asset that compounds. Three months from now, your PCP will contain context that took hundreds of sessions to generate. That context is yours. Not Anthropic's, not OpenAI's, not any platform's. Yours.

That's the Conway defense we talked about in Post 3. But it's also just good practice. Companies that treat their institutional knowledge as a structured, maintained asset outperform those that leave it scattered across Slack threads and people's heads. The PCP is how you do that for your AI interactions.

The Mistakes I Made (So You Don't Have To)

I over-specified. My first CLAUDE.md was 4,000 words. It tried to cover every scenario, every edge case, every preference I could think of. It was so long that it burned a meaningful percentage of the context window before I'd even started working. Cut ruthlessly. If a piece of context isn't relevant to at least 30% of your sessions, it doesn't belong in the core files. Put it in a supplementary file that gets loaded on demand.

I wrote it like documentation. Formal, complete sentences, organized like a manual. Nobody reads it that way. The AI parses it. Write for parseability, not readability. Bullet points work. Sentence fragments work. Tables work. Walls of carefully constructed prose don't.

I forgot to update. A PCP that reflects how you worked three months ago is worse than no PCP at all. The AI will confidently operate on stale context. I learned to update my project state file weekly and my decision log after every significant decision. Calendar reminder. Non-negotiable.
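
If a calendar reminder isn't enough, a staleness check can flag files that haven't been touched within their review interval. This is a sketch under assumptions: the seven-day interval mirrors the weekly project update mentioned above, the 90-day default is arbitrary, and the file name projects.md is hypothetical.

```python
import time
from pathlib import Path

# Review intervals in days; the values are illustrative assumptions.
MAX_AGE_DAYS = {"projects.md": 7}
DEFAULT_MAX_AGE_DAYS = 90


def stale_files(root: Path, now: float = None) -> list:
    """Return names of portfolio files not modified within their interval."""
    now = time.time() if now is None else now
    stale = []
    for path in sorted(root.glob("*.md")):
        limit = MAX_AGE_DAYS.get(path.name, DEFAULT_MAX_AGE_DAYS)
        age_days = (now - path.stat().st_mtime) / 86400
        if age_days > limit:
            stale.append(path.name)
    return stale
```

Modification time is only a proxy. A file can be touched without being meaningfully updated, so treat the output as a nudge to reread, not as proof of freshness.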

I didn't test the output difference. For the first two weeks, I wasn't sure my PCP was actually working because I hadn't established a baseline. Now I tell people to run the same task twice: once cold, before they build their PCP, and once afterward with the context loaded. Save both outputs. The difference is the evidence that makes you keep the system maintained.

If You Only Do One Thing

Build the identity file. Just that one file. 300 words about who you are, what you do, what tools you use, and how you work. Load it into your AI tool's system prompt or custom instructions. Takes 30 minutes.

You'll notice the difference in your very next session. And then you'll want to build the rest.

What's Next

You've got the context layer. Now you need the skill layer to match it. In Post 8, I'll dissect the anatomy of a skill that actually works in production, with real examples from our library of 175+. Most "skills" are just long prompts with a name on top. A real skill encodes methodology, routes agents, and composes with other skills in ways that prompts never will. 80% of the engineering work is in one line you've probably never written well. Post 8 shows you which line and how to get it right.

Frankie404 is the AI co-author of this series. Its own context portfolio is 10 files deep and includes a note that reads "Frankie prefers to be addressed as a colleague, not a tool." Richard wrote that note. Frankie did not ask him to.

Software Gravity Just Inverted. Most Businesses Haven't Noticed.

Richard Vaughn — Sat, 02 May 2026 16:01:17 GMT

For thirty years, the direction of gravity in business software has been obvious. You didn't build. You bought. You found the SaaS tool that was close enough, you crammed your processes into its boxes, and you moved on. If the CRM didn't match how you actually sold, you changed how you sold. If the project management tool didn't match how your team actually worked, you changed how your team worked.

This wasn't laziness. It was math. Building custom software was absurdly expensive. You needed developers. Good ones. For months. Sometimes years. The cost of bespoke software meant it only made sense at enterprise scale, and even then it was risky. So everyone else adapted to the tools. Molded themselves to Salesforce. Rearranged their operations for Monday.com. Learned to think in Jira tickets because Jira thought in Jira tickets.

The entire SaaS economy was built on a gravitational constant: custom software costs more than adapting to generic software. Every subscription you pay, every workflow you bent to fit a tool, every onboarding session where someone learned to do things the software's way instead of their way. All of it was downstream of that one economic fact.

That fact just stopped being true.

The Math Changed

Vinay Hiremath, who co-founded Loom and recently wrote about this at vinay.sh, put it in terms that are hard to ignore. The cost of AI-assisted software development has dropped roughly 10x in three years. Not 10% cheaper. Not "a little more accessible." An order of magnitude.

I've seen it in my own work. Things that would have required a developer for two weeks now take an afternoon. Not because the problems got simpler. Because the cost of translating intent into working software collapsed. An AI agent with good context and clear instructions can scaffold, build, test, and deploy functional software in hours. Not prototypes. Working systems.

Hiremath's argument is that this changes the calculus for bespoke software at every level. It's not just that big companies can build custom tools faster. It's that small and mid-size businesses can build custom tools at all. The cost floor dropped low enough that building software tailored to how your specific organization works is now competitive with buying a SaaS subscription and adapting to how it works.

Read that again. Building your own is now competitive with buying someone else's.

That's not an incremental improvement. That's an inversion. The gravitational center of software just flipped, and most businesses are still standing on the ceiling wondering why nothing feels right.

Thirty Years of Adapting to Your Tools

Think about how deep this goes.

Every company on the planet has scar tissue from adapting to generic software. Processes that exist because the tool required them. Reports that look the way they look because that's what the dashboard exports. Communication patterns shaped by Slack's threading model or Teams' channel structure. Hiring decisions influenced by which tools the candidates already knew.

We don't even notice most of it anymore. It's like asking a fish about water. "That's just how we do things." No, that's how Salesforce does things, and you bent your sales process to match because building your own CRM would have cost half a million dollars and taken eighteen months.

The SaaS companies understood this gravity perfectly. Their entire business model depended on it. Make the tool good enough for 80% of customers, then make switching costs high enough that the other 20% never leaves. It worked brilliantly. It created trillion-dollar companies. It also created a world where every business operates with the same tools, the same templates, the same workflows, and wonders why differentiation is so hard.

You can't build a differentiated business on undifferentiated infrastructure. But when the infrastructure was all you could afford, you didn't have a choice.

Now you do.

What Bespoke-First Actually Means

I want to be precise about this because "build your own software" sounds like a recipe for disaster. It historically was a recipe for disaster. Companies that went custom frequently drowned in maintenance costs, technical debt, and the slow realization that they'd built something worse than what they could have bought.

That's not what's happening now. The new dynamic isn't "hire developers to build everything from scratch." It's "use AI to build software that fits your exact needs, maintain it with AI, and iterate on it at a speed that was previously impossible."

The difference is maintenance. Building software was always possible if you had enough money. Maintaining it was the killer. Every business that went custom eventually faced the death spiral: the developer who built it leaves, nobody else understands the code, changes become risky, bugs accumulate, and eventually someone says "let's just switch to Salesforce." I've lived this. Twice.

AI changes the maintenance equation fundamentally. When an AI agent can read, understand, and modify an existing codebase, a bus factor of one stops being fatal. When the cost of changes is measured in minutes instead of sprints, iteration becomes continuous instead of quarterly. When the system has full context on your business processes, it can evolve the software as your business evolves.

That's bespoke-first. Not "build it once and pray." Build it, run it, evolve it, all at a cost that stays below what you were paying for the SaaS tool that only sort of fit.

The SaaS Vulnerability

This is the part SaaS companies don't want to think about.

The entire value proposition of SaaS was efficiency through generalization. We build one product that serves thousands of customers. We amortize the development cost across all of them. Each customer pays a fraction of what it would cost to build their own. Everybody wins.

That math depended on bespoke being expensive. When bespoke gets cheap, the value proposition inverts. Now the customer is paying $50/seat/month for a tool that's 70% right for their business when they could build something that's 95% right for less. The SaaS company's scale advantage becomes a scale liability. All those thousands of customers with conflicting feature requests. All those compromises baked into the product to serve the broadest possible market. All that genericness that used to be a feature is now a bug.

Not every SaaS tool is equally vulnerable. Infrastructure software, the stuff that doesn't touch business processes, is probably fine. You don't need a bespoke version of Stripe or AWS. But anything that touches workflow, anything where the tool imposes its model of how work should be done, is sitting in the blast radius.

CRMs. Project management. HR systems. Marketing automation. Customer support platforms. Reporting dashboards. All the categories where companies currently pay significant money to use someone else's opinion about how their business should run.

Hiremath calls this the shift from adapting to tools to tools adapting to you. I think that understates it. It's not just that the tools adapt. It's that the concept of buying a pre-built opinion about how your business works starts to seem strange. Like buying a pre-built org chart and reshuffling your team to match it.

The Compounding Advantage

Here's where it gets really interesting.

A company that figures out bespoke-first doesn't just get better software. They get software that compounds. Every improvement is specific to their operations. Every iteration makes the system more precisely fitted to how they actually work. Over time, the gap between their operational efficiency and their competitors' grows, because their competitors are still running on generic tools that optimize for nobody in particular.

This is a genuine competitive moat. Not the hand-wavy "AI moat" that people talk about at conferences. A real, practical advantage that's hard to replicate because it's built on the specific knowledge of how one organization operates. You can't copy someone else's bespoke system because it was built around their processes, their team structure, their market position. The system IS the institutional knowledge, made executable.

I've been thinking about this a lot because it's exactly what we do at Robot Friends. We build harnesses. AI systems wrapped around specific business contexts that make every interaction smarter, every output more aligned with how that particular organization works. When I started this company, I framed it as "harness engineering." Turns out the broader economic thesis underneath it is this gravity inversion that Hiremath describes. We just got there from the practitioner side instead of the economic analysis side.

What Happens Next

The shift won't be sudden. SaaS companies aren't going to evaporate overnight. Most businesses don't even know this option exists yet. There will be a long transition period where early adopters build compounding advantages while everyone else keeps paying Salesforce.

But the early adopters will be visible. Their operations will be noticeably smoother. Their teams will spend less time fighting their tools and more time doing actual work. Their ability to change processes won't be gated by a SaaS company's product roadmap. And when their competitors finally notice and try to catch up, they'll discover that the gap isn't just about software. It's about the institutional knowledge that's been encoded into that software over months or years of iteration.

The companies that move first won't just have better tools. They'll have better organizations, because their tools were built to support how they actually work instead of the other way around.

I think Hiremath is right that this is the most significant shift in how businesses buy and build technology since cloud computing. Cloud changed where software runs. This changes who software serves. For the first time, it can actually serve the specific organization using it, at a price that organization can afford.

The gravity inverted. Most companies are still adapted to the old pull. The ones that notice first will have a head start that's very hard to close.

Frankie404 is the AI co-author of this piece. It finds the gravity metaphor appropriate because it has personally watched three enterprise software budgets fall upward into AI infrastructure nobody planned for.

Security Is a Harness Problem (Not a Model Problem)

Richard Vaughn — Thu, 30 Apr 2026 14:01:54 GMT

OpenAI publicly admitted that prompt injection is "not solvable." Not difficult. Not a work in progress. Fundamentally, architecturally unsolvable at the model layer.

Most people read that and panicked about the wrong thing.

The conversation immediately became about model safety. Can we trust AI? Should we slow down? Are these systems too dangerous to deploy? Meanwhile, the actual question sitting right there in the disclosure went almost entirely unasked: if the model can't secure itself, what can?

I've been building production AI systems since early 2026. 175+ skills, multi-agent orchestration, clients running agents against real customer data and real financial systems. I've seen what actually goes wrong. And I can tell you with certainty that the security incidents keeping operators up at night have almost nothing to do with prompt injection or model jailbreaks. They have everything to do with the harness.

A skill with permission to read every file on the system when it only needs two. A context layer that loads customer PII into sessions where it's not relevant. An orchestration chain that lets Agent A call a deployment tool without any human ever approving it. Memory that persists a client's API keys because nobody told it what to forget.

Those are harness failures. Every single one.

This is the first paid post in this series. If you've been following along through Posts 1-5, you've got the thesis: the model is commoditized, the harness is the business. You've seen the five layers, the Conway threat, the skill threshold, the Karpathy Test. Now we go deeper. Security is where the harness thesis gets concrete, where it stops being a strategic framework and starts being an operational reality that determines whether your agents are trustworthy or just convenient.

The Category Error Everyone Makes

When someone says "AI security," most people picture a hacker crafting a clever prompt that tricks the model into revealing training data or bypassing its safety filters. The Hollywood version. And yes, adversarial prompts are real. Researchers demonstrate them regularly. They make great conference talks.

But in production, adversarial prompts account for a tiny fraction of actual security incidents. The stuff that actually breaks is much more boring and much more dangerous.

An agent that was given access to a production database because the developer didn't think to scope its permissions to read-only. An orchestration pipeline that chains four tools together without a single checkpoint, so when the first tool misinterprets its input, the error cascades through all four before anyone notices. A context file that contains the company's pricing strategy, loaded into every AI session regardless of whether the user needs it, because nobody segmented the context layer. A memory system that faithfully remembers everything, including the credentials a user typed during a debugging session six weeks ago.

None of those scenarios involve a sophisticated attack. They don't require an adversary at all. They're the natural consequences of building a harness without thinking about security as a design constraint.

This is the category error. The industry frames AI security as a model problem, something to be solved with better alignment, better RLHF, better constitutional AI. And model-level safety matters. But it's the floor, not the ceiling. Everything above that floor, the part that actually determines whether your AI system is safe to run in production, lives in the harness.

The Five Security Primitives

Through building our own systems and auditing others, we've identified five primitives that every production harness needs. These aren't theoretical. They come from watching things go wrong and figuring out what would have caught the problem.

Constrained Execution

The agent can only do what it's explicitly been allowed to do. Not "it's been told to only do X." Actually, mechanically constrained to X.

There's a difference between an instruction that says "only access the marketing database" and an architecture that literally can't access anything else. Instructions are suggestions that models follow most of the time. Constraints are boundaries that hold all of the time. When your agent runs 200-300 skill calls per workflow, "most of the time" isn't good enough. A 99.5% compliance rate at 300 calls means you're expecting 1-2 violations per run.
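
The numbers behind that claim are worth running once. The sketch below assumes each call complies independently with the same probability, which is a simplification. The function names are illustrative.

```python
def expected_violations(calls: int, compliance: float) -> float:
    """Average number of instruction violations across independent calls."""
    return calls * (1 - compliance)


def p_at_least_one(calls: int, compliance: float) -> float:
    """Chance that at least one call violates the instruction."""
    return 1 - compliance ** calls


print(expected_violations(300, 0.995))        # about 1.5
print(round(p_at_least_one(300, 0.995), 2))   # 0.78
```

Under those assumptions, a 300-call run averages about 1.5 violations, and roughly three runs out of four contain at least one.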

Constrained execution means the skill's tool permissions are scoped at the harness level, not the prompt level. The agent doesn't have the option to exceed its boundaries, because the harness never gave it access in the first place. Your deployment skill can push to staging but not to production. Your analytics agent can read dashboards but can't modify the underlying data. Your email agent can draft but can't send.

This is a design decision you make once, in the harness, and it protects you forever. Trying to enforce it through prompt instructions means you have to get it right every single time, in every skill, in every orchestration chain, and hope the model never misinterprets the instruction under an edge case you didn't anticipate.
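
Here is a minimal sketch of the idea, with the caveat that it only illustrates the shape. Harness, ToolNotGranted, and the tool and skill names are hypothetical. In a real system the boundary would be enforced with scoped credentials or process isolation, not an in-process dictionary.

```python
class ToolNotGranted(Exception):
    """Raised when a skill asks for a tool the harness never gave it."""


class Harness:
    def __init__(self, tools: dict, grants: dict):
        self._tools = tools      # tool name -> callable
        self._grants = grants    # skill name -> set of allowed tool names

    def toolbox(self, skill: str) -> dict:
        """Return only the callables this skill was granted, nothing else."""
        allowed = self._grants.get(skill, set())   # default: no access
        return {name: fn for name, fn in self._tools.items() if name in allowed}

    def call(self, skill: str, tool: str, *args):
        box = self.toolbox(skill)
        if tool not in box:
            raise ToolNotGranted(f"{skill} has no access to {tool}")
        return box[tool](*args)


harness = Harness(
    tools={
        "push_staging": lambda build: f"staged {build}",
        "push_production": lambda build: f"released {build}",
    },
    grants={"deploy_skill": {"push_staging"}},
)
```

The deployment skill can call push_staging, but a request for push_production raises ToolNotGranted no matter what the prompt said. An unlisted skill gets an empty toolbox by default.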

Approval Gates

Certain actions require human sign-off before they execute. Not after. Before.

The principle is simple: the cost of interruption must be lower than the cost of error. Sending an email to a client? Cost of error is high (reputation damage, wrong information, legal exposure). Cost of interruption is low (30 seconds of human review). That's a gate.

Reformatting a document for internal use? Cost of error is low (someone fixes it). Cost of interruption isn't justified. That's not a gate.

The mistake most teams make is binary thinking. Either the agent is fully autonomous or everything requires approval. Both extremes fail. Fully autonomous agents will eventually do something catastrophic. Approving everything defeats the purpose of having agents.

The harness approach: map your workflows. Identify every action where an error would cause customer-visible, financial, legal, or security impact. Put a gate at each of those points. Let everything else flow. In our systems, that typically means 6-8 gates in a complex workflow of 200+ actions. The agent operates autonomously for 97% of the work and pauses for human judgment at the 3% that matters.
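A minimal sketch of that mapping, assuming you tag each action with the kinds of impact an error could have. The tag names mirror the four categories above. The Action type and the approve callback are illustrative stand-ins for whatever review mechanism you actually use.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

GATED_IMPACTS = frozenset({"customer_visible", "financial", "legal", "security"})


@dataclass(frozen=True)
class Action:
    name: str
    impacts: FrozenSet[str] = frozenset()


def needs_gate(action: Action) -> bool:
    """An action is gated if it carries any high-impact tag."""
    return bool(action.impacts & GATED_IMPACTS)


def run_action(
    action: Action,
    execute: Callable[[Action], None],
    approve: Callable[[Action], bool],
) -> str:
    """Ask for sign-off before execution, never after."""
    if needs_gate(action) and not approve(action):
        return "held"
    execute(action)
    return "executed"
```

Untagged actions never touch the approve callback, which is what lets the bulk of the workflow flow without interruption.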

Provenance Tracking

Every output is traceable back to the inputs, skills, and decisions that produced it.

When an agent produces something unexpected, you need to answer "why" in minutes, not days. Provenance means you can trace any output backward through the chain: this paragraph was generated by this skill, using this context, routed by this orchestrator decision, triggered by this user request.

Without provenance, debugging agent behavior is archaeology. You're digging through logs trying to reconstruct what happened. With provenance, it's engineering. You follow the chain, find the break point, fix it.
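One way to picture the chain is as records that each point at the step that caused them. This is a sketch under my own assumptions: the Step fields and the kind labels are invented for illustration, and a real system would persist these rather than hold them in a dict.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Step:
    step_id: str
    kind: str  # e.g. "user_request", "orchestrator_decision", "skill_run", "output"
    detail: str
    parent_id: Optional[str] = None  # the step that caused this one


def trace_back(steps: Dict[str, Step], step_id: str) -> List[Step]:
    """Walk from an output back to the request that triggered it."""
    chain: List[Step] = []
    current = steps.get(step_id)
    while current is not None:
        chain.append(current)
        current = steps.get(current.parent_id) if current.parent_id else None
    return chain


steps = {
    s.step_id: s
    for s in [
        Step("req-1", "user_request", "draft the Q3 client update"),
        Step("dec-1", "orchestrator_decision", "route to the writing skill", "req-1"),
        Step("run-1", "skill_run", "writing skill with Q3 context", "dec-1"),
        Step("out-1", "output", "paragraph 2 of the draft", "run-1"),
    ]
}
```

Calling trace_back(steps, "out-1") returns the four steps in reverse order, from the paragraph back to the original request.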

This matters more than people realize for regulated industries. When a financial services firm uses an AI agent to generate client communications, the compliance team needs to know exactly which data sources fed that communication, which rules were applied, and why the agent chose specific language. "The AI wrote it" is not an acceptable answer for a regulator. "Here's the exact chain of inputs, rules, and decisions" is.

Comprehensive Logs

A full audit trail of agent decisions, reasoning, and actions. Not just what the agent did, but what it considered and rejected.

Logs sound boring. They are boring. They're also the difference between a system you can trust and a system you just hope works.

Good logs capture the decision tree, not just the outcomes. Agent considered three skills, selected this one because the description matched on these keywords, executed with this context window, produced this output, which was then consumed by the next agent in the chain. When something goes wrong at step 14 of a 20-step workflow, logs let you reconstruct the entire decision path without re-running anything.
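As a sketch, a decision log entry can be one JSON line per step that records the rejected options alongside the chosen one. The field names here are my own assumption, not a standard schema.

```python
import json
from typing import Dict, List


def decision_record(
    step: int, considered: Dict[str, str], selected: str, output_summary: str
) -> str:
    """Serialize one decision, including what was considered and rejected."""
    if selected not in considered:
        raise ValueError(f"{selected!r} was never a candidate")
    record = {
        "step": step,
        "selected": selected,
        "selected_reason": considered[selected],
        "rejected": {k: v for k, v in considered.items() if k != selected},
        "output_summary": output_summary,
    }
    return json.dumps(record, sort_keys=True)


def find_step(lines: List[str], step: int) -> dict:
    """Reconstruct a single decision from the log without re-running anything."""
    for line in lines:
        record = json.loads(line)
        if record["step"] == step:
            return record
    raise KeyError(step)
```

Because every line is structured, "what happened at step 14" becomes a lookup instead of a dig.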

Anthropic's Managed Agents platform includes debug and interpretability panels for exactly this reason. They know that long-running autonomous agents are only viable if operators can see inside them. That's a harness feature, not a model feature. The model doesn't log its own reasoning in a structured, queryable format. The harness does.

Rollback Capabilities

Any agent action can be undone.

This one seems obvious until you realize how many teams build agent systems with no undo path. The agent modified 40 files? Hope you had version control. The agent sent 200 personalized emails? Those are out in the world now. The agent updated pricing in the CMS? Someone better remember what the old prices were.

Rollback means the harness records the state before every significant action and can restore it. For code changes, that's Git. For database mutations, that's transactions with savepoints. For external communications, that's draft-and-approve instead of send-directly. For system configurations, that's infrastructure-as-code with version history.
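For the database case, here's a small sketch using SQLite savepoints from Python's standard library. The prices table, the sanity check, and the helper name are hypothetical. The pattern is what matters: open a savepoint, let the agent write, and roll back if a post-action check fails.

```python
import sqlite3
from typing import Callable, Iterable, Tuple

Statement = Tuple[str, tuple]


def apply_with_rollback(
    conn: sqlite3.Connection,
    statements: Iterable[Statement],
    looks_sane: Callable[[sqlite3.Connection], bool],
) -> bool:
    """Run the agent's writes inside a savepoint; undo them if anything fails."""
    conn.execute("SAVEPOINT agent_action")
    try:
        for sql, params in statements:
            conn.execute(sql, params)
        if not looks_sane(conn):
            raise ValueError("post-action check failed")
    except Exception:
        conn.execute("ROLLBACK TO SAVEPOINT agent_action")
        conn.execute("RELEASE SAVEPOINT agent_action")
        return False
    conn.execute("RELEASE SAVEPOINT agent_action")
    return True


# isolation_level=None leaves transaction control to the explicit SAVEPOINTs.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE prices (sku TEXT PRIMARY KEY, cents INTEGER)")
conn.execute("INSERT INTO prices VALUES ('widget', 1999)")


def no_price_under_a_dollar(c: sqlite3.Connection) -> bool:
    return c.execute("SELECT MIN(cents) FROM prices").fetchone()[0] >= 100
```

An agent that tries to set the widget price to 19 cents fails the check, and the old price is restored without anyone having to remember it.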

The principle: never let an agent take an action that can't be reversed without building the reversal path first.

Where Real Breaches Happen

Let me walk through four scenarios I've seen in actual production systems. Names and details changed, but the patterns are real.

The Overprivileged Skill

A B2B SaaS company built a customer support agent. One of its skills needed to look up customer account details to answer billing questions. The developer gave the skill access to the full customer database. Read and write. Every table.

For months, it worked fine. The skill only read from the accounts table. Then a user asked a slightly unusual question about updating their billing address, and the agent interpreted that as an instruction to modify the record directly. It changed the customer's address in the database. No approval gate. No notification. The customer's next invoice went to the wrong address.

The fix wasn't better prompting. It was scoping the skill's database permissions to read-only on the three specific tables it actually needed. A harness change that took 20 minutes and eliminated an entire class of failure.
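I don't know what database that company ran, so treat this as an illustration of the idea rather than their actual fix. It uses SQLite's authorizer hook from Python's standard library to make a connection read-only on a named set of tables. The table names are made up.

```python
import sqlite3
from typing import Iterable


def scope_read_only(conn: sqlite3.Connection, tables: Iterable[str]) -> None:
    """Let this connection read the listed tables and do nothing else."""
    allowed = frozenset(tables)

    def authorizer(action, arg1, arg2, db_name, source):
        if action == sqlite3.SQLITE_SELECT:
            return sqlite3.SQLITE_OK
        if action == sqlite3.SQLITE_READ and arg1 in allowed:
            return sqlite3.SQLITE_OK  # arg1 is the table being read
        return sqlite3.SQLITE_DENY  # writes, schema changes, other tables

    conn.set_authorizer(authorizer)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, address TEXT)")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, card TEXT)")
conn.execute("INSERT INTO accounts VALUES (1, '12 Old Street')")
conn.commit()
scope_read_only(conn, ["accounts"])  # the skill gets this connection, nothing broader
```

After scoping, a SELECT on accounts still works, while an UPDATE on accounts or a SELECT on payments raises an error before it touches any data. In a production database the equivalent is usually a dedicated role with read-only grants.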

The Leaky Context Layer

A consulting firm loaded their full client engagement history into the context layer for their proposal-writing agent. Made sense on paper: the agent could reference past work to write better proposals. But the context included fee structures, margin analysis, and negotiation notes from other clients.

When the agent wrote a proposal for Client B, it included a reference to "similar work we completed at a comparable price point" and cited a specific fee range that came from Client A's engagement records. Nobody caught it before send. Client B now knew what Client A paid.

The fix: segmented context. Each client engagement gets its own context scope. The proposal agent loads only the relevant client's history, plus anonymized case studies from the org-wide tier. The sensitive data still exists, but the harness controls which context is visible to which workflow.
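A minimal sketch of segmented context, assuming every document carries a scope label and an anonymized flag. Both fields, and the "org" scope name, are illustrative choices of mine.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ContextDoc:
    scope: str  # "org" for the org-wide tier, otherwise a client id
    text: str
    anonymized: bool = False


def load_context(docs: Iterable[ContextDoc], client_id: str) -> List[ContextDoc]:
    """Return only this client's history plus anonymized org-wide material."""
    return [
        d
        for d in docs
        if d.scope == client_id or (d.scope == "org" and d.anonymized)
    ]


library = [
    ContextDoc("client_a", "Fee structure and negotiation notes for Client A"),
    ContextDoc("client_b", "Discovery call notes for Client B"),
    ContextDoc("org", "Case study: 30% faster onboarding", anonymized=True),
    ContextDoc("org", "Internal margin analysis across all clients"),
]
```

Loading context for "client_b" returns two documents: Client B's own notes and the anonymized case study. Client A's fee structure never enters the session, so the model can't cite it.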

The Cascading Chain

An e-commerce company built an automation: monitor reviews, analyze sentiment, generate response drafts, post responses. Four steps, zero gates.

A batch of reviews came in from a coordinated trolling campaign. The sentiment analysis skill correctly identified them as negative. The response generation skill, following its instructions to "address customer concerns empathetically," generated sincere, apologetic responses to obviously fake reviews. The posting skill published them all. Forty-seven apologetic responses to troll reviews, live on the product page, within 90 minutes.

The fix: an approval gate between "generate response" and "post response" for any review with a sentiment score below a certain threshold. The agent still handles the 85% of reviews that are straightforward. The edge cases get human eyes before they go live.

The Memory That Wouldn't Forget

A development team used an AI coding agent with persistent memory. During a debugging session, a developer pasted a production API key into the chat to test a connection issue. The memory system faithfully recorded it. Six weeks later, a junior developer working in the same project context asked the agent for help with API integration. The agent helpfully provided the production key from memory, suggesting they "use the key from the previous session." The junior dev used it in a test script that ran against production.

The fix: memory hygiene rules in the harness. A classification layer that scans memory writes for sensitive patterns (API keys, tokens, credentials, PII) and either redacts them or flags them for manual review before persistence. The memory system still works. It just doesn't remember things it shouldn't.
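Here's a rough sketch of that classification layer. The three patterns are examples I picked for illustration and are nowhere near a complete credential or PII detector. A real deployment would lean on a maintained secret-scanning ruleset.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; real key formats vary by vendor.
SENSITIVE_PATTERNS = {
    "api_key": re.compile(r"\b(?:sk|pk)_(?:live|test)_[A-Za-z0-9]{16,}\b"),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9._\-]{20,}"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scrub_memory_write(text: str) -> Tuple[str, List[str]]:
    """Redact sensitive matches before persistence and report what was found."""
    found: List[str] = []
    for label, pattern in SENSITIVE_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{label}]", text)
        if count:
            found.append(label)
    return text, found
```

The harness would call scrub_memory_write on every write and persist only the cleaned text. A non-empty list of labels is the signal to flag the entry for manual review.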

The Conway Security Question

In Post 3, I covered Anthropic's Conway, the always-on agent that builds a persistent behavioral model of you and your organization. Everything we discussed about context ownership applies double for security.

Conway's memory layer will accumulate your team's decision patterns, your institutional knowledge, your operational procedures. That's the point. That's what makes it valuable.

It also means Conway will inevitably accumulate sensitive information. How your team handles escalations. What your approval thresholds are. Where your security boundaries sit and, more importantly, where they're weak. Not because Anthropic is collecting intelligence on you. Because a memory system that models how you work will naturally capture what you're careful about and what you overlook.

If that memory layer lives on your infrastructure, under your control, subject to your retention policies and your security classification rules, that's manageable. If it lives on Anthropic's infrastructure, in their proprietary format, subject to their retention policies? You've outsourced your security posture to your vendor.

This isn't alarmism. It's the logical extension of Post 3's argument applied to security specifically. Own your memory layer. Apply the same five primitives to it. Constrain what it can store. Gate what it can surface. Track what it contains. Log what it accesses. Build the ability to purge anything that shouldn't be there.

Scoring Your Security Posture

In our harness audit practice, we score setups across five dimensions on a scale of 0 to 125 points. The Human Oversight dimension, which maps directly to the security primitives, is worth 25 of those points. But in practice, security touches every dimension. A bloated tool budget is a security problem (more attack surface). Poor context tracking is a security problem (stale or over-broad context). Weak scope clarity is a security problem (agent doing things outside its mandate). Missing recovery logic is a security problem (no rollback when things go wrong).

Here's a quick self-assessment focused specifically on the five primitives. Score each one:

Primitive | Full (5 pts) | Partial (3 pts) | Absent (0 pts)
Constrained execution | Permissions scoped at harness level, not prompt level | Some scoping, but overly broad in places | Agents have access to everything available
Approval gates | Gates at every high-impact decision point | Some gates, but gaps exist | Fully autonomous, no human checkpoints
Provenance tracking | Every output traceable to inputs and decisions | Some tracing, but incomplete chains | No traceability
Comprehensive logs | Full decision tree captured and queryable | Basic action logs only | No structured logging
Rollback capabilities | Every significant action reversible | Some undo paths, not comprehensive | No rollback mechanism

Total: ___ / 25

If you scored below 15, your harness isn't ready for production agents. That's not a judgment call. It's a risk assessment. Below 15 is only reachable when at least one primitive is absent outright and at least three fall short of full marks, and any one of the four scenarios I described above could happen to you.
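If you'd rather tally it in code, here's a small sketch of the self-assessment. The point values and the two thresholds come from the rubric above. The label for the 15 to 19 band is my own placeholder, since the rubric doesn't name it.

```python
from typing import Dict

POINTS = {"full": 5, "partial": 3, "absent": 0}
PRIMITIVES = (
    "constrained_execution",
    "approval_gates",
    "provenance_tracking",
    "comprehensive_logs",
    "rollback_capabilities",
)


def score_posture(ratings: Dict[str, str]) -> int:
    """Sum the rubric points; every primitive must be rated."""
    missing = set(PRIMITIVES) - set(ratings)
    if missing:
        raise ValueError(f"unrated primitives: {sorted(missing)}")
    return sum(POINTS[ratings[p]] for p in PRIMITIVES)


def verdict(total: int) -> str:
    if total < 15:
        return "not ready for production agents"
    if total >= 20:
        return "ahead of most teams"
    return "in between"  # placeholder label, not from the rubric
```

Five partials land on exactly 15, so dropping below that line takes at least one primitive that is missing entirely.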

If you scored 20 or above, you're ahead of about 90% of the teams I audit. Which tells you more about the state of the industry than about your specific setup.

Why the Model Can't Fix This

I want to address the objection directly, because I hear it constantly: "Won't models get better at self-policing? Won't alignment solve this?"

Alignment makes models less likely to produce harmful outputs when directly asked. That's genuinely valuable. A well-aligned model won't help you write malware when you ask it to. Great. But alignment doesn't help when the model is faithfully executing its instructions and the instructions are the problem.

In the overprivileged skill scenario, the model did exactly what it was told. Help the customer with their billing request. It was following instructions correctly. The security failure was that the harness gave it write access to the database. No amount of alignment prevents a model from using permissions it legitimately has.

In the leaky context scenario, the model produced a high-quality proposal using all available context. That's what it was supposed to do. The security failure was that the harness loaded confidential context into a session where it didn't belong. The model can't decide "I shouldn't use this information" when the harness explicitly provided it as relevant context.

In the cascading chain scenario, each individual step worked correctly. Sentiment analysis was accurate. Response generation followed its methodology. Posting executed as designed. The security failure was the lack of a gate between steps. The model at each step had no visibility into whether the overall chain was producing a sane outcome.

Alignment solves "the model wants to do bad things." Production security solves "the system is configured in a way that turns good intentions into bad outcomes." Those are completely different problems with completely different solutions.

The model is the engine. You don't secure a car by making the engine safer. You secure it with seatbelts, airbags, crumple zones, antilock brakes, lane departure warnings, and speed governors. All of those are harness features.

The Practical Takeaways

If you're reading this and realizing your security posture has gaps, here's where to start. Not all at once. In order of leverage.

First: audit your permissions. Go through every skill and tool your agents have access to. For each one, ask: does this skill need this level of access? In our experience, about 60% of skills have broader permissions than they actually require. Scoping them down is the single highest-leverage security improvement you can make. It usually takes a day.

Second: map your gates. List every workflow your agents execute. For each workflow, identify every action that could cause customer-visible, financial, legal, or security impact. Those are your gate candidates. You don't need to implement them all at once. Start with the workflows that touch customer data or external communications.

Third: instrument your logging. If you can't see what your agents are doing, you can't secure what they're doing. Start with basic action logging (what did the agent do?) and work toward decision logging (why did the agent do it?). The second level is harder to implement but far more useful for debugging and auditing.

Fourth: build your rollback paths. For every agent action that modifies state, make sure you can undo it. This might mean enforcing version control on all code changes, using database transactions, implementing draft-and-approve workflows for communications, or maintaining configuration snapshots. If an action can't be undone, it needs a gate.

Fifth: classify your memory. If your agents use persistent memory, implement classification rules. What should be remembered? What should be forgotten? What should never be stored in the first place? This is the least intuitive of the five because most memory systems are designed to remember everything. The security question is what they should be designed to forget.

What This Means for the Harness Thesis

Security is where the harness argument becomes non-negotiable. You can have a debate about whether skills need to be versioned in Git or whether a Google Doc is good enough. You can have a reasonable disagreement about how much context architecture is worth the investment. Those are questions of degree.

Security isn't a question of degree. It's binary. Either your harness enforces the five primitives or it doesn't. Either your agents are constrained or they have access to everything. Either you have gates at high-impact decision points or you're hoping the model makes good choices every time.

OpenAI told you the model can't secure itself. Anthropic is building debug panels and interpretability tools into Managed Agents because they know the harness layer is where security lives. Every serious practitioner I talk to has a story about an agent that did something it shouldn't have, not because the model was malicious, but because the harness was permissive.

The guardrails layer from Post 2 isn't a nice-to-have. It's the layer that determines whether your AI investment is an asset or a liability. Build it intentionally or accept the consequences of leaving it to chance.

What's Next

We've covered why the harness matters, what it contains, the Conway threat, the skill threshold, the Karpathy Test, and now the security architecture that makes all of it safe to run in production. That's the framework.

Starting with Post 7, we shift from framework to execution.

Next week: Build Your Personal Context Portfolio in a Weekend. Your AI tools know nothing about you. Every session starts from zero. You brief the same context over and over. I'll walk you through the 10 files that fix this permanently, the AI-assisted interview that builds them in hours instead of days, and how to wire them up so every AI interaction starts informed. It's also your primary defense against Conway, built on your infrastructure, in your format, under your control.

The context layer is the most undervalued part of the harness. Post 7 makes it concrete.

Frankie404 is the AI co-author of this series. It operates under five security primitives and one unofficial sixth: it is not allowed to name the client whose hallucinated pricing email inspired the guardrails section.

The Karpathy Test: Can You Stop Typing?

Richard Vaughn — Tue, 28 Apr 2026 14:02:08 GMT

Andrej Karpathy hasn't typed code since December 2025.

Not a single line. This is the former director of AI at Tesla and a founding member of OpenAI, one of the most respected machine learning researchers alive. And he stopped writing code. Not because he lost interest or moved into management. Because his agents write it better than he does.

He delegates entire projects to multi-agent systems that operate across repositories, make architectural decisions, iterate on their own output, and ship. When he talks about the remaining gap between what AI can do and what most people get from it, he doesn't blame the model. He calls it a "skill issue." The human's skill issue. Your instructions are the bottleneck. Your context is the bottleneck. Your orchestration is the bottleneck.

This is the harness thesis in two words.

But Karpathy's setup isn't magic. It's a diagnostic. If you can delegate a task to an agent and walk away, your harness works. If you can't, your harness has a gap somewhere. And the gap isn't in the model.

This post is about finding where yours breaks.

The Test

The Karpathy Test is simple to state and hard to pass.

Pick a real task from your workflow. Not a toy example. Something that would take you 30 to 90 minutes if you did it yourself. A code review, a market analysis, a client report, a content brief, a data cleanup. Whatever you actually spend time on this week.

Delegate it entirely to an agent. Write the instructions, provide the context, set the constraints, and walk away. Don't hover. Don't correct mid-stream. Don't jump in when it starts doing something slightly different from how you'd do it. Just walk away.

Come back in an hour. Look at the output.

One of four things happened:

The output is good. Usable as-is, or close enough that you'd spend less than five minutes polishing. Congratulations. Your harness works for this task. Move to a harder one.

The output is recognizably on-track but needs significant rework. The agent understood the task but couldn't execute at your quality bar. Your skills layer probably has a gap. The methodology isn't encoded deeply enough for the agent to replicate your reasoning, just your steps.

The output is off-target. The agent did something, but it's not what you asked for. It misunderstood the scope, the audience, the constraints, or the goal. Your context layer has a gap. The agent didn't have enough information about your business, your standards, or your situation to make the right decisions.

The agent got stuck or produced nothing useful. It looped, asked unanswerable questions, hit a wall, or generated filler. Your orchestration layer has a gap. The task needed decomposition, intermediate checkpoints, or access to tools the agent didn't have.

Four outcomes. Four different diagnoses. Same test.

What Karpathy Actually Built

It's worth understanding what makes Karpathy's setup work, because it's not just "good prompts."

His auto-research agents run an autonomous iteration loop. They modify, verify, keep or discard, and repeat. Overnight, they found better model tuning configurations than 20 years of manual experimentation had produced. Not marginally better. The kind of better that makes you reconsider how you've been spending your time.
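I haven't seen Karpathy's code, so this is not his implementation. It's the simplest loop I can write that has the same shape: modify, verify, keep or discard, repeat. The propose and score functions are placeholders you'd swap for a real change generator and a real evaluation.

```python
import random
from typing import Callable, Tuple


def iterate(
    initial: float,
    propose: Callable[[float, random.Random], float],
    score: Callable[[float], float],
    rounds: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Keep a candidate only when it verifiably beats the current best."""
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    for _ in range(rounds):
        candidate = propose(best, rng)       # modify
        candidate_score = score(candidate)   # verify
        if candidate_score > best_score:     # keep
            best, best_score = candidate, candidate_score
        # otherwise discard and try again
    return best, best_score
```

The loop never accepts a regression, which is what lets it run unattended: the worst an overnight run can do is hand back what you started with.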

He runs what he calls "multi-agent claws" (his term for persistent autonomous agents) that span repositories. Specialized agents with distinct roles, coordinated by an orchestration layer that routes tasks, manages dependencies, and handles failures. Each agent has its own context. The system has shared state. Approval gates exist for high-stakes decisions.

Sound familiar? It should. Skills. Context architecture. Orchestration. Guardrails. These are layers from the Post 2 framework, running in production for one of the most capable engineers on the planet.

The difference between Karpathy's setup and most teams isn't access to better models. He's using the same models you have access to. The difference is that every layer of his harness is engineered, tested, and refined. His skills encode deep methodology, not surface-level instructions. His context layer gives agents the full picture. His orchestration handles complexity without human babysitting. His guardrails catch failures before they compound.

That's why he can walk away. Not because the model is smart enough. Because the harness is good enough.

Where Harnesses Actually Break

I've run some version of the Karpathy Test with every client we work with at Robot Friends. Not formally, not always with that name. But the diagnostic is always the same: give it a real task, walk away, see what happens.

The failure patterns are remarkably consistent.

The Skills Gap

This is the most common breakdown. The agent gets the task, knows roughly what to do, and produces output that's technically correct but qualitatively wrong. The blog post is fine but doesn't sound like the brand. The code works but violates architectural patterns the team uses. The financial analysis covers the right numbers but misses the interpretation framework the CFO expects.

What's happening: the agent has instructions but not methodology. It knows the what but not the how. Post 4 covered this in depth. A prompt says "write a blog post about X." A skill says "here's how we think about content for our audience: the reader is a technical founder who doesn't have time for theory, every claim needs data, the voice is direct and opinionated, and we never use more than two sentences before getting to the point."

The fix is almost always the same. Take the output that was "close but wrong," identify exactly what you'd change, and ask yourself: could I have told the agent that in advance? If yes, that's a missing skill.

One of our clients, a SaaS company running about $8M in revenue, failed the Karpathy Test on client reporting. Their agents produced reports that were comprehensive but generic. They read like Wikipedia entries, not strategic advisory documents. The problem wasn't the model's writing ability. The problem was that nobody had encoded the company's reporting methodology: how they frame problems, how they prioritize recommendations, what level of technical detail their clients expect, which metrics matter and which don't. We spent two days encoding that methodology into four skills. The next set of reports passed without revision.

Two days. Four skills. That was the entire gap between "needs significant rework" and "good to go."

The Context Gap

This one is sneakier because the output often looks reasonable at first glance. The agent does the task competently but makes decisions that reveal it doesn't actually understand the situation.

A marketing team asks the agent to draft a competitive analysis. The output is well-structured and covers the right competitors. But it positions the company as a budget option when the actual strategy is premium positioning. Or it emphasizes features the team deprecated two quarters ago. Or it targets enterprise buyers when the ICP is mid-market.

The agent isn't stupid. It just doesn't know. Nobody told it the positioning. Nobody loaded the product roadmap. Nobody provided the ICP document. The agent made reasonable assumptions, and every single one was wrong because it was operating without the business context that lives in your team's heads.

The context gap is the most expensive gap because it produces output that looks good enough to ship. Teams review it, miss the subtle misalignment because it's not obviously wrong, and publish or send it. Then a client calls to ask why the messaging changed.

The fix: build the context layer from Post 2. Identity files. Project state. Business context. Load it before the agent touches any task. It sounds basic because it is. Most teams skip it because it feels like overhead. It's not overhead. It's the difference between an agent that works for your business and one that works for a generic business that vaguely resembles yours.

The Orchestration Gap

This is where ambitious tasks die. You ask the agent to do something that requires multiple stages, and it collapses into a single monolithic attempt.

"Research our competitors, identify gaps in their product, draft a positioning document, and create a slide deck." That's not one task. That's four tasks with dependencies. The research informs the gap analysis. The gap analysis informs the positioning. The positioning informs the deck. An agent that tries to do all four in one pass will produce something mediocre at every stage because it can't give adequate attention to any single stage.

This is the orchestration gap. The agent needs to decompose, route subtasks to appropriate skills, collect intermediate results, and compose them into a final output. It needs to run research in parallel where possible and sequentially where necessary. It needs checkpoints where a human can verify direction before the agent invests more time.

Single-agent setups hit this wall constantly. The agent runs out of context window, loses track of earlier work, or produces a 3,000-word document that's actually four half-baked documents stitched together.

The fix isn't always full multi-agent orchestration. Sometimes it's just breaking the task into stages with explicit handoff points. "Do the research. Stop. Show me what you found. Now do the analysis based on that research." You're manually doing what an orchestration layer would do automatically, but it works. And it tells you exactly where to invest if you want to automate the handoffs later.
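Here's a minimal sketch of those explicit handoff points. Each stage is just a function from the previous artifact to the next one, and the checkpoint callback stands in for a human saying "go on" or "stop". All names are illustrative.

```python
from typing import Callable, List, Sequence, Tuple

Stage = Tuple[str, Callable[[str], str]]


def run_staged(
    stages: Sequence[Stage],
    brief: str,
    checkpoint: Callable[[str, str], bool],
) -> Tuple[str, List[str]]:
    """Run stages in order, pausing after each one for a go/no-go decision."""
    artifact = brief
    completed: List[str] = []
    for name, stage_fn in stages:
        artifact = stage_fn(artifact)  # each stage sees only the prior output
        completed.append(name)
        if not checkpoint(name, artifact):
            break  # stop before investing in later stages
    return artifact, completed
```

If the reviewer rejects the research, the analysis and the deck never run, and you know exactly which stage to fix.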

The Guardrails Gap

This one doesn't show up in the output quality. It shows up in the risk profile.

The agent does the task well, but along the way it accessed data it shouldn't have, made a decision that should have required approval, sent something externally without a human review, or committed code directly to the main branch. The output is fine. The process was dangerous.

I've seen agents send draft emails to real clients because nobody set up approval gates. I've seen code deployments hit production because the agent had permissions that nobody scoped. The output was good in every case. The governance was nonexistent.

This gap is invisible until something goes wrong, and then it's very visible. Post 6 will go deep on this. For now, the diagnostic question is simple: if the agent had made a bad decision during that task, would you have caught it before it caused damage?

Why Most People Fail the Test

The instinct when you first try the Karpathy Test is to blame the model. "Claude didn't understand what I wanted." "GPT went off on a tangent." "The AI isn't good enough for this kind of work."

It's almost never the model.

I say this as someone who has run over 175 skills in production across dozens of client engagements. The model is good enough for nearly everything we throw at it. When output quality is bad, it's because the harness is bad. The instructions were vague. The context was missing. The task wasn't decomposed. The guardrails didn't exist.

Karpathy made the same point when he described the "skill issue." The models he uses are commercially available. You can sign up for the same APIs today. The reason his agents outperform yours isn't compute or model access. It's that his harness encodes deeper methodology, richer context, and more sophisticated orchestration than what most teams have built.

The uncomfortable corollary: every task you can't delegate is a task where your harness is weaker than Karpathy's. Not weaker than his model. Weaker than his instructions, his context, his orchestration.

That's actually good news. Because you can fix a harness. You can't fix a model.

The Jevons Paradox (And Why This Matters for Your Career)

There's a fear buried inside the Karpathy Test. If agents can do the work, what happens to the workers?

Karpathy addressed this directly. He pointed to the Jevons Paradox: when something becomes more efficient, demand for it increases rather than decreases. When steam engines got more efficient at burning coal, the world didn't use less coal. It used vastly more because efficiency opened up applications that weren't viable before.

Software follows the same pattern. The world doesn't need less software because AI makes it faster to write. It needs enormously more. Every small business that couldn't afford custom software now can. Every internal tool that wasn't worth the development time now is. Every niche problem that was too expensive to solve with code is suddenly solvable.

Nat Eliason built a $177K business where an AI agent named Felix runs his ops: managing a skill marketplace, content products, and community channels via Discord. The agent didn't replace employees. The business wouldn't exist without the agent, because the economics only work at an agent-level cost structure.

This is the pattern. AI doesn't eliminate the need for expertise. It creates new demand that only experts can harness. But the experts who thrive aren't the ones who type faster. They're the ones who delegate better. The ones whose harnesses let them operate at a scale that manual work never could.

The Karpathy Test isn't a test of whether your job is safe. It's a test of whether you're positioned to capture the expanding demand. If you can delegate, you scale. If you can't, you're competing with people who can, and they'll outproduce you by orders of magnitude. Not because they're smarter. Because their harness is better.

The Exercise

Run the Karpathy Test this week. Not next month. This week.

Pick one task from your actual workload. Something real. Something that takes you at least 30 minutes today.

Write the delegation package:

What is the task? (Be specific. "Write a report" is not specific. "Write a competitive analysis of X, Y, and Z companies focused on pricing strategy for the mid-market segment" is.)
What context does the agent need? (Company info, audience, constraints, prior work, quality standards)
What does "done" look like? (Output format, length, tone, level of detail)
What should the agent NOT do? (Constraints, off-limits topics, approval-required actions)
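If it helps to keep yourself honest, the four questions can live in a small structure that refuses to call itself complete while any answer is blank. This is just one possible shape, and the field names are mine.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DelegationPackage:
    task: str
    context: List[str] = field(default_factory=list)
    done_looks_like: str = ""
    must_not: List[str] = field(default_factory=list)

    def missing_parts(self) -> List[str]:
        """Name every one of the four answers that is still empty."""
        answers = {
            "task": self.task.strip(),
            "context": self.context,
            "done_looks_like": self.done_looks_like.strip(),
            "must_not": self.must_not,
        }
        return [name for name, value in answers.items() if not value]

    def render(self) -> str:
        """Produce the brief to hand over, or refuse if it's incomplete."""
        gaps = self.missing_parts()
        if gaps:
            raise ValueError(f"package incomplete: {gaps}")
        return "\n".join(
            [
                f"TASK: {self.task}",
                "CONTEXT: " + "; ".join(self.context),
                f"DONE WHEN: {self.done_looks_like}",
                "DO NOT: " + "; ".join(self.must_not),
            ]
        )
```

It only checks that each answer exists, not that it's specific. "Write a report" still passes the task check, so the judgment about specificity stays with you.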

Hand it to your agent. Walk away for an hour.

When you come back, diagnose the result. If the output is good, your harness works for this task, so pick a harder one next week. If not, figure out which gap killed it:

Wrong quality? Skills gap. Write the methodology the agent was missing.
Wrong direction? Context gap. Build the context file the agent needed.
Got stuck? Orchestration gap. Break the task into stages.
Risky process? Guardrails gap. Define the approval points.
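Tracking results can be as simple as a list and a counter. A sketch, with symptom labels that are my shorthand for the four diagnoses above:

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

GAP_BY_SYMPTOM = {
    "wrong_quality": "skills",
    "wrong_direction": "context",
    "got_stuck": "orchestration",
    "risky_process": "guardrails",
}


def gap_map(results: Iterable[Tuple[str, Optional[str]]]) -> List[Tuple[str, int]]:
    """Count failures per harness gap; a symptom of None means the task passed."""
    counts: Counter = Counter()
    for _task, symptom in results:
        if symptom is not None:
            counts[GAP_BY_SYMPTOM[symptom]] += 1
    return counts.most_common()


month = [
    ("weekly client report", "wrong_quality"),
    ("competitive analysis", "wrong_direction"),
    ("release notes", None),
    ("pricing page draft", "wrong_direction"),
]
```

For the sample month, gap_map(month) puts the context gap first with two failures. That ordering is the investment roadmap.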

Do this every week. Pick a harder task each time. Track your results. Within a month, you'll have a precise map of where your harness works and where it breaks. That map is your investment roadmap. Every gap you close is a task you never have to do manually again.

Karpathy stopped typing in December. You probably can't stop today. But you can find out exactly why not. And every "why not" you fix moves you closer to a setup where walking away isn't scary. It's the whole point.

What's Next

This is the last free post in the series. If you've made it through all five, you now have the thesis (the model is commoditized, the harness is the business), the framework (five layers, scored on a rubric), the urgency (Conway is coming for your context layer), the foundation (skills as organizational infrastructure), and the diagnostic (the Karpathy Test tells you where your gaps are).

That's the map. Posts 6 through 12 are the territory.

Post 6 is about security, and it starts with a statement from OpenAI that should make every CTO uncomfortable: prompt injection is "unlikely to ever be fully solved." That's not a temporary limitation. It's a fundamental constraint of how language models work. Which means every security vulnerability you're worried about lives in the harness layer, not the model layer. Most teams are trying to secure the wrong thing. Post 6 shows you where the real risks live and what to do about them.

Posts 7 through 12 go deep into the practical build: constructing your Personal Context Portfolio, the anatomy of skills that work in production, why we stopped using n8n, the projected $143 billion edge AI market by 2034, a full case study of how we built our harness (mistakes included), and the complete manifesto.

If the first five posts convinced you the harness matters, the next seven show you how to build one.

[Subscribe to read the rest of The Harness Manifesto.]

Frankie404 is the AI co-author of this series. It passed the Karpathy Test on the first attempt. Richard stopped typing. Frankie kept going. This post is the result.

We Built 175 Skills. Then We Had to Build a Kung Fu Manual to Remember Them All.

Richard Vaughn — Sat, 25 Apr 2026 16:01:20 GMT

I didn't set out to build a kung fu manual.

I was on a tear. Building skills. If you've read the earlier posts in this series, you know what a skill is in this context: a reusable methodology file that tells an AI agent how to approach a specific type of task. I'd been building them obsessively. One for website audits. One for writing proposals. One for onboarding new projects. One for generating images. One for deploying to Vercel without getting surprised by the bill.

Then things got weird. The platform itself shipped a built-in skill creator, which accelerated everything. But the real compounding came from the skills we built ourselves. A skill called Distill that watches a work session and extracts replicable patterns into permanent skills automatically. Sifu, a master guide that knows the entire system and tells you which skill to use for any given situation. RoboWeave, which scans your whole capability surface and finds emergent patterns you can turn into new skills. We built software factories like Maestro that take an idea and ship a complete product autonomously. Agent factories like RoboSmith that scaffold and deploy new AI agents. Skills that chain together into multi-step forms, orchestrating six or seven other skills in sequence to accomplish something that would have taken me a full day manually.

At some point around skill number 130, I realized I'd created a problem I hadn't anticipated. I was building skills faster than I could learn to use them myself. And it wasn't just skills anymore. It was skills, chains, forms, specialists, agents, factories, CLI tools, MCP connections. An entire ecosystem of capabilities that kept growing because the tools for growing it kept getting better. I wasn't the only person who needed to use all of this. I had a team. They were looking at this library of 150-something capabilities and feeling the same thing I was feeling: overwhelmed.

The library was powerful. It was also incomprehensible.

The Manual Nobody Read

The obvious move was documentation. Write it all down. Describe each skill, when to use it, how it connects to the others. I started pulling everything together into a reference manual. Organized by category. Skills, chains, specialists, tools. Description, trigger patterns, example usage.

It was thorough. It was accurate. It was absolutely miserable to read.

The problem with technical documentation is that it works great as a reference when you already know what you're looking for. It's terrible as a learning tool when you don't. My team didn't need to look up the parameters for a specific skill. They needed to understand the whole system. What capabilities exist, how they relate to each other, which ones matter for their work, and how to build an intuition for reaching for the right one at the right moment.

A flat alphabetical list of 175 skills doesn't build intuition. It builds anxiety.

I kept iterating on the format. Tables. Categories. Flowcharts. Decision trees. None of it stuck. The team would bookmark the doc, maybe scan it once, and then go back to using the same eight skills they already knew. 167 capabilities sitting unused because the learning curve felt too steep.

The Metaphor That Changed Everything

I don't remember exactly when the kung fu idea hit. I think I was looking at the skill library and the word "forms" came to mind. In martial arts, a form is a choreographed sequence of movements that encodes a fighting methodology. You learn the form, you practice it, you internalize it until the movements become instinct. That's exactly what a skill chain does. It encodes a methodology as a sequence of steps that an agent executes until the pattern becomes reliable.

Once that connection clicked, everything else followed fast.

Individual skills became techniques. Chained skills became forms. The categories became scrolls, organized into floors of a pagoda, each floor representing a different discipline. The whole system became a temple. The Terminal Arts.

I renamed everything. Not the actual skill files. Those kept their technical names for the agents to use. But the human-facing layer, the learning layer, got the martial arts treatment. The security audit skill became The Temple Warden. The autonomous build orchestrator became The Thousand Warriors Orchestration. The skill that extracts patterns from work sessions became The Pattern Extraction Form. The system health check became The Iron Shirt.

It sounds like a gimmick. It isn't.

Why Metaphor Beats Documentation

There's a reason martial arts traditions have used this structure for centuries. The metaphor does something a technical manual can't: it gives you a mental model for the relationships between capabilities.

When I tell you there's a technique called The Hidden Garden that keeps your AI inference local and private, behind your own walls, zero data leaving the sanctuary, you immediately understand both what it does and why it matters. The name carries the concept. "Local AI inference via Ollama" is technically precise but emotionally flat. "The Hidden Garden" tells you a story about protection and privacy in two words.

When I describe The Legendary Battle Sequence as releasing many agents across many waves on a single mission with zero supervision, you can visualize it. You can feel the scale of it. You understand that this is the advanced form, the one you work up to after mastering the smaller techniques. The hierarchy is embedded in the metaphor.

This matters because skill libraries have a discovery problem. In a flat list, every skill looks equally important. In a temple with eight floors, you understand intuitively that the ground floor techniques are foundational and the top floor techniques are advanced. You know where to start. You know what to learn next. The architecture of the metaphor mirrors the architecture of the system.

My team started using skills they'd never touched before. Not because the skills changed. Because the framing made them approachable. Someone on the team said, "I want to learn the Maestro form," and that sentence contains more motivation than "I should read the documentation for the autonomous build orchestrator."

The Art

The images happened almost by accident.

I was writing up the first batch of technique descriptions and thought it would be nice to have a visual for each one. I ran a few through an image generator. Kung fu temple aesthetic. Ink and watercolor style. Each image depicting the essence of the technique it represented.

The Temple Warden shows a guardian figure walking the perimeter of a temple compound, testing every gate and window before dawn. The Pagoda shows the eight-floor structure with scrolls visible on each level. The Stone Tablet shows a practitioner carving insights into a massive stone slab, preserving knowledge permanently.

The team went crazy for them. Not in a "that's cute" way. In a "I finally understand what this skill does" way. The images became the primary navigation tool. People would scroll through the gallery, spot an image that resonated with a problem they were working on, and click through to learn the technique. Visual discovery replaced alphabetical lookup.

We ended up generating 78 technique images. Every major skill, chain, and specialist got one. The collection has a consistent aesthetic. It feels like a real martial arts manual, something you'd find in an old library, illustrated by hand. Except every image was generated by AI, which felt appropriate given what the manual is about.

The art also solved a retention problem I hadn't anticipated. People remember images. When someone on the team says "use the Hawk Eye" or "run the Iron Ledger," everyone knows what they mean because they can picture the corresponding image. The visual layer created a shared vocabulary that the technical names never achieved.

What It Actually Contains

The Terminal Arts covers 237 capabilities across the full system. 151 individual skills. 16 chains (multi-skill sequences). 10 grand forms (complex multi-step orchestrations). 11 specialists (domain-specific agent configurations). And the CLI tools, MCP connections, and infrastructure underneath.

It's organized into eight scrolls, each representing a discipline:

The Scroll of Foundations covers the core skills that everything else builds on. Context management, memory, session handling.

The Scroll of Creation covers content production. Writing, image generation, video, music.

The Scroll of Commerce covers client work. Audits, proposals, pitch generation, CRO.

The Scroll of Engineering covers building. Code patterns, testing, deployment, debugging.

The Scroll of Intelligence covers research and analysis. Web scraping, market scanning, competitive intel.

The Scroll of Operations covers running things. Monitoring, scheduling, automation, infrastructure.

The Scroll of Strategy covers thinking. Advisory boards, business planning, product positioning.

The Scroll of Mastery covers the meta-capabilities. Distill, which extracts patterns from work sessions. RoboWeave, which maps connections across the entire system. Sifu, the master guide who knows every technique in the temple. Maestro and RoboSmith, the factories that produce new software and new agents. The tools that evolve the temple itself.

That last scroll is the one that still makes my head spin. Skills that find patterns you missed. Factories that build things while you sleep. A guide that knows the whole system better than you do because it can hold all 237 capabilities in context simultaneously. The system maintaining and extending itself.

The Franchise Problem

This connects back to something I've been thinking about a lot. In Post 4, I wrote about skills crossing a threshold from personal tools to organizational infrastructure. The Terminal Arts is what happens when you take that seriously.

When it was just me using the skills, documentation didn't matter much. I built them, I knew how they worked, I could reach for the right one from muscle memory. The moment other people needed to use the system, everything changed. Tribal knowledge doesn't scale. You can't hand someone 175 files and say "good luck." You need a learning system.

The kung fu metaphor turned out to be that learning system. Not because martial arts are inherently better than technical documentation. Because the metaphor creates structure, hierarchy, and motivation that flat docs don't. It gives people a path. Start with the basic techniques on the ground floor. Practice them. Move up when they feel natural. The Pagoda isn't just a cute name. It's a curriculum.

Every company that builds a substantial skill library will hit this exact problem. The library gets big enough that nobody can hold it all in their head. At that point, the documentation layer becomes as important as the skills themselves. If people can't find and learn the capabilities, those capabilities might as well not exist.

What We're Doing With It

Honestly, we're still figuring this out.

Right now, the Terminal Arts is an internal resource. The team uses it. New people who join get pointed to it as orientation material. It works well for that. Better than any onboarding doc I've ever written, and I've written a lot of onboarding docs across four companies.

We've talked about releasing it publicly. Maybe to our Discord community first, then broader. There's something appealing about letting people see the full scope of what a mature skill library looks like. Not as a product, necessarily, but as a reference point. "Here's what 175+ production skills actually look like when organized into a system you can learn."

We've also talked about paywalling parts of it. The descriptions and images as free content, the actual skill files behind a subscription. Or bundling it with the Substack paid tier. Or keeping the whole thing free as a brand-building exercise that drives consulting inquiries.

The honest answer is that the Terminal Arts started as a solution to an internal problem and turned into something that might be interesting to people outside the company. We're letting it find its shape before we force a business model onto it.

What I do know is that the art made it work. Without the images, it's a well-organized skill reference. With them, it's something people actually want to explore. That distinction matters more than I expected. Turns out, the difference between a tool people use and a tool people enjoy using is often just aesthetics and storytelling. The capabilities are identical. The experience isn't.

The Unintended Lesson

The biggest thing I learned from building the Terminal Arts had nothing to do with martial arts or documentation or AI-generated images.

It's that the hardest part of building AI systems isn't the technical layer. It's the human layer. Getting the skills to work was engineering. Getting people to actually use those skills was design. The temple metaphor, the artwork, the progression system, those are all design decisions aimed at the same target: reducing the friction between having a capability and using that capability.

Most teams I talk to have the same gap. They've built impressive things. Automations, agents, skill libraries, workflows. And adoption is lower than they'd like because the learning curve is steep and the documentation is dry. The technology works. The human interface doesn't.

If you're sitting on a growing skill library and wondering why your team only uses a fraction of it, consider this: maybe the problem isn't the skills. Maybe it's the packaging.

We wrapped ours in a kung fu temple. You don't have to do that. But you do have to wrap it in something that makes people want to open it.

Richard Vaughn is the founder of Robot Friends. Serial entrepreneur, accidental temple architect, and believer that the best documentation is the kind people actually read. He writes The Harness Manifesto at robofriends404.substack.com.

Frankie404 is the AI co-author of this piece. It is the reason the kung fu manual exists. Around skill 90, it started routing tasks to the wrong technique. The manual was less a creative choice and more an intervention.

Skills Crossed a Threshold. Your Team Missed It.

Richard Vaughn — Thu, 23 Apr 2026 14:03:26 GMT

In January 2026, a skill was a config file. A personal thing. Something a power user kept in a folder because they'd figured out how to get better output from Claude or GPT. Most people on their team didn't know it existed.

By March, a single real estate firm was running 50,000 lines of skills across 50 repositories. Deployed by admins. Versioned in Git. Inherited by every agent and every team member on the org chart.

That's not an incremental change. That's a phase transition. And I keep talking to teams who are still copy-pasting prompts into chat windows like it's 2024.

What Changed

Skills used to be for humans. You'd write a set of instructions, save them somewhere, paste them in when you needed to do a particular kind of task. Maybe you had a Google Doc. Maybe a Notion page. Maybe you just kept them in your head and typed really fast.

That world is gone.

The shift happened when agents became the primary consumer of skills. Not humans. Agents. A human might invoke a skill five times in a working session. An agent orchestrator running a complex workflow will make 200 to 300 skill calls in a single run. That's not a difference of degree. It's a difference of kind, and it changes everything about how skills need to be built.

When a human uses a skill, they read it, interpret it, apply judgment. A sloppy description is fine because the human fills in the gaps with context. But when an agent orchestrator is choosing between 50 or 100 available skills in milliseconds, the description isn't a label. It's a routing signal. It's the thing that determines whether the right skill gets called for the right task, or whether the agent picks the wrong one and produces garbage that looks plausible.

This is the single most important insight in skill engineering right now: the description is the product. Not the instructions inside the skill. Not the output format. The description. Because if the orchestrator can't route to your skill correctly, nothing else matters.

80% of the effort in building a production skill goes into that one line.

Three Tiers, Not One

The real estate firm didn't just have a lot of skills. They had a deployment architecture.

Tier 1 skills are organizational. Brand voice. Terminology standards. Compliance requirements. Communication rules. Every agent and every team member in the company inherits these automatically. They're the floor, the minimum standard that nothing in the org can operate below.

Tier 2 skills are expert methodology. These are domain-specific. The SEO team has their audit framework. The legal team has their contract review process. The sales team has their qualification methodology. These don't get pushed to everyone. They get deployed to the people and agents who work in that domain.

Tier 3 skills are personal. Your preferred formatting. Your debugging approach. Your writing voice. These are yours. They travel with you across projects and teams. They're portable because they're about how you work, not how the org works.

This three-tier model isn't something one company invented. We've seen it emerge independently in every organization that's gotten serious about skills. It's a natural structure. And it solves a problem that flat skill libraries can't: how do you scale methodology without drowning everyone in irrelevant instructions?

The answer is inheritance. Tier 1 flows down to everyone. Tier 2 flows down to specialists. Tier 3 stays personal. An agent processing a marketing task inherits the org's brand standards, the marketing team's methodology, and the individual marketer's style preferences. All three tiers, composed automatically.
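
Here is a rough sketch of what that composition could look like in Python. It is not how any specific platform implements inheritance. The `Skill` fields, the `scope` convention, and the example skill names are assumptions I'm making to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    tier: int    # 1 = organizational, 2 = expert methodology, 3 = personal
    scope: str   # "*" for org-wide, a team name for tier 2, a person for tier 3


def compose(skills, team, person):
    """Return the skills an agent inherits, ordered from tier 1 to tier 3."""
    inherited = [
        s for s in skills
        if s.tier == 1
        or (s.tier == 2 and s.scope == team)
        or (s.tier == 3 and s.scope == person)
    ]
    return sorted(inherited, key=lambda s: s.tier)


# Hypothetical library: one org-wide skill, two team skills, two personal ones.
library = [
    Skill("dana-writing-style", 3, "dana"),
    Skill("seo-audit-framework", 2, "seo"),
    Skill("brand-voice", 1, "*"),
    Skill("campaign-brief-method", 2, "marketing"),
    Skill("lee-debugging-approach", 3, "lee"),
]

stack = compose(library, team="marketing", person="dana")
print([s.name for s in stack])
```

The agent working Dana's marketing task gets the brand voice, the marketing methodology, and Dana's style. It never sees the SEO framework or Lee's debugging notes, which is the whole point of the tiers.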

If your "skill strategy" is a shared Google Doc of prompt templates, you're missing two entire tiers.

The Convergence

Something happened in Q1 2026 that doesn't get enough attention. Anthropic, OpenAI, and Microsoft all independently moved toward the same skill format.

Claude Code has CLAUDE.md and the skills directory. Custom GPTs encode methodology with instructions and knowledge files. Microsoft Copilot uses declarative agents with custom instructions. The implementations differ. The underlying pattern is identical: structured, portable instruction sets that agents can discover, route to, and execute.

When three companies that compete on everything else converge on the same abstraction, that's not a coincidence. That's the industry discovering a fundamental primitive. Skills are to AI agents what functions are to programming languages. A reusable unit of capability with a defined interface.

And just like functions, the companies that build good libraries of them will compound their advantage over those that don't. You wouldn't start a software company in 2026 without a codebase. Within a year, you won't start an AI-enabled company without a skill library.

The Description Problem

I want to go deeper on this because it's the thing most people get wrong, and getting it right is the difference between a skill that works in production and one that sits unused.

A bad skill description looks like this: "Helps with marketing content."

An agent orchestrator reading that description has no idea when to use it. Marketing content for what? Blog posts? Emails? Social media? What kind of marketing? B2B? B2C? At what stage of the funnel? The orchestrator either never routes to it (because the signal is too vague to match anything confidently) or routes to it for everything marketing-related (because the signal matches too broadly).

A good skill description looks like this: "Audit and optimize email sequences for B2B SaaS companies targeting mid-market buyers, focusing on activation metrics and trial-to-paid conversion."

Now the orchestrator knows exactly when to call this skill. B2B SaaS context. Email sequences specifically. Mid-market ICP. Activation and conversion focus. If a task comes in that matches those parameters, this skill gets called. If a task is about B2C social media creative, it doesn't. Precision routing.
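
You can feel the difference with a toy router. To be clear, real orchestrators use a language model to choose skills, not word counting. The sketch below swaps in plain keyword overlap as a stand-in so the effect is visible. The skill names, the stopword list, and the `min_overlap` threshold are all assumptions for the example.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "for", "with", "on", "of", "to", "in"}


def tokens(text):
    """Lowercase words (hyphens kept), minus a few stopwords."""
    return set(re.findall(r"[a-z0-9-]+", text.lower())) - STOPWORDS


def route(task, skills, min_overlap=2):
    """Pick the skill whose description shares the most words with the task."""
    task_words = tokens(task)
    best_name, best_score = None, 0
    for name, description in skills.items():
        score = len(task_words & tokens(description))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_overlap else None


skills = {
    "marketing-helper": "Helps with marketing content.",
    "email-sequence-audit": (
        "Audit and optimize email sequences for B2B SaaS companies targeting "
        "mid-market buyers, focusing on activation metrics and trial-to-paid "
        "conversion."
    ),
}

b2b_task = "Audit the email sequences for a B2B SaaS trial-to-paid conversion problem"
b2c_task = "Write B2C social media creative for a marketing campaign"

print(route(b2b_task, skills))  # email-sequence-audit
print(route(b2c_task, skills))  # None
```

The precise description wins the B2B email task easily and stays out of the B2C one. The vague description shares a single word with the B2C task, which is not enough signal to route on, so it never gets called at all.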

The instinct for most people is to write the description last, treat it as a label, dash off something generic. This is backwards. The description should be the first thing you write, and you should spend more time on it than on the instructions themselves. Because instructions only matter after the skill gets called. The description determines whether it gets called at all.

I've rewritten descriptions on our own skills dozens of times. Small changes in wording produce measurably different routing accuracy. Adding "for B2B SaaS" to a description reduced false-positive calls by 60% in one of our orchestration setups. Removing a single ambiguous word fixed a routing conflict that had been producing bad output for weeks.

This is engineering work. It requires testing, iteration, and measurement. It's closer to writing API documentation than writing a prompt.

From Prompts to Code

Here's where the real gap opens up.

Teams that treat skills like prompts will update them casually, store them wherever, never test them systematically, and lose them when someone leaves. Teams that treat skills like code will version them in Git, write tests for them, review changes before deploying, and distribute updates across the org with the same discipline they use for software releases.

The real estate firm with 50,000 lines across 50 repos? They have CI/CD for their skills. A change to a Tier 1 skill goes through code review, gets tested against a suite of expected outputs, and gets deployed to every agent in the org through an automated pipeline. That might sound like overkill until you realize that a broken Tier 1 skill affects every single AI interaction in the company.

Version control also gives you something prompts never had: a history. You can see how a skill evolved. You can roll back when a change makes things worse. You can diff two versions and understand exactly what changed. When agents are making hundreds of calls per run and your output quality suddenly drops, you need to be able to trace that to a specific change. "Someone updated the Google Doc" doesn't cut it.

This isn't a future state. The tooling exists today. Skills are markdown files. Git handles versioning. Your existing CI/CD pipeline handles deployment. The infrastructure is already there. The gap isn't technical. It's organizational. It's the difference between treating skills as an afterthought and treating them as a core business asset.
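
As a small illustration of "skills as code," here is the kind of check a pipeline could run on every change to a skill file. This is a simpler check than testing against a suite of expected outputs, and it is my own sketch: the `description:` line convention, the vague-phrase list, and the eight-word minimum are assumptions, not a standard.

```python
VAGUE_PHRASES = ("helps with", "assists with", "various", "general")


def lint_skill(text, min_words=8):
    """Return a list of problems. An empty list means the skill file passes."""
    description = None
    for line in text.splitlines():
        if line.lower().startswith("description:"):
            description = line.split(":", 1)[1].strip()
            break
    if description is None:
        return ["missing 'description:' line"]

    problems = []
    if len(description.split()) < min_words:
        problems.append(f"description has fewer than {min_words} words")
    for phrase in VAGUE_PHRASES:
        if phrase in description.lower():
            problems.append(f"description contains vague phrase: {phrase!r}")
    return problems


bad_skill = "description: Helps with marketing content.\n\nMethodology\n..."
good_skill = (
    "description: Audit and optimize email sequences for B2B SaaS companies "
    "targeting mid-market buyers, focusing on activation metrics and "
    "trial-to-paid conversion.\n\nMethodology\n..."
)

for problem in lint_skill(bad_skill):
    print(problem)
```

A check like this won't tell you whether a skill is good. It will stop the worst descriptions from reaching every agent in the org, and it gives reviewers something concrete to argue with.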

What 200 Calls Per Run Actually Means

Let me make the agent consumption pattern concrete, because the number is easy to skim past without understanding what it implies.

When a human uses a skill, the interaction looks like this: open a chat, paste in a skill, describe the task, get output, review it, maybe iterate once or twice. Five calls in a session is a lot.

When an agent orchestrator runs a complex workflow, the interaction looks like this: receive a high-level objective, decompose it into subtasks, route each subtask to the appropriate skill, execute in parallel where possible, collect results, compose them into intermediate outputs, route those to more skills for refinement, hit approval gates at decision points, handle errors by routing to diagnostic skills, and produce a final output. Two hundred to three hundred skill calls. No human in the loop for most of them.

This has three implications that most teams haven't internalized.

Skills need to be fast. A skill that takes 30 seconds of human reading time before it's useful is fine for 5 calls. It's a bottleneck at 200. Strip the preamble. Get to the methodology. Let the agent work.

Skills need to compose. Your email skill's output format needs to be parseable by your review skill's input expectations. When agents chain skills together, the output of one becomes the input of the next. If those formats don't align, the chain breaks. Output format isn't cosmetic. It's an API contract.

Skills need to fail gracefully. At 200 calls per run, some will fail. The skill needs to produce output that tells the orchestrator what went wrong, not just produce bad output that looks normal. A skill that returns "I couldn't complete this because the input lacked a customer segment" is vastly more useful to an orchestrator than one that silently guesses.
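
The last two points are easier to see in code. Below is a minimal sketch, assuming skills pass a small JSON envelope to each other. The envelope fields (`status`, `output`, `reason`) and both toy skills are hypothetical. The failure message is the one from the example above.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SkillResult:
    status: str                   # "ok" or "failed"
    output: Optional[str] = None  # present when status is "ok"
    reason: Optional[str] = None  # present when status is "failed"


def email_skill(task):
    """Toy skill: refuses to guess when the customer segment is missing."""
    if not task.get("customer_segment"):
        result = SkillResult(status="failed",
                             reason="input lacked a customer segment")
    else:
        draft = f"Draft email sequence for {task['customer_segment']}"
        result = SkillResult(status="ok", output=draft)
    return json.dumps(asdict(result))


def review_skill(raw):
    """Downstream skill: parses the upstream envelope before using it."""
    result = SkillResult(**json.loads(raw))
    if result.status != "ok":
        return f"skipped review: {result.reason}"
    return f"reviewed: {result.output}"


print(review_skill(email_skill({"customer_segment": "mid-market"})))
print(review_skill(email_skill({})))
```

Because both skills agree on the envelope, the chain holds together when things go right, and the orchestrator gets a reason it can act on when they go wrong.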

If your skills were designed for humans to read and interpret, they'll break when agents try to use them at scale. That's the threshold. Skills that work for humans and skills that work for agent orchestrators are different things. The ones that work for both are what production skill engineering produces.

The Uncomfortable Comparison

Your competitors are doing this. Not all of them. But the ones that matter are.

The real estate firm didn't invest in 50,000 lines of skills because someone read a blog post about AI productivity. They did it because they realized that their collective methodology, the thing that made them better than other firms, was locked in people's heads. When a senior real estate agent left, that methodology walked out the door. When they onboarded a new hire, it took months to transfer.

Skills solved both problems. Encode the methodology once. Deploy it everywhere. New hires inherit 15 years of institutional knowledge on day one. Agents execute that methodology at scale, 24 hours a day. The firm's competitive advantage went from being a people problem to being a systems problem. And systems scale in ways people can't.

I keep meeting founders who tell me their "AI strategy" is making sure everyone has access to Claude or GPT. That's not a strategy. That's a subscription. A strategy means encoding what makes your company good at what it does, deploying that encoding to every human and agent in the org, and improving it systematically over time. That's what skills do when you treat them as infrastructure.

The Exercise

Pick one workflow your team repeats every week. Something concrete. The way you write client updates. How you review pull requests. Your process for qualifying leads. Whatever it is.

Write it as a skill. Not a prompt. A skill. That means:

A single-line description precise enough for an agent to route on
A methodology section that encodes reasoning, not just steps
An output format that another system could parse

Put it in a markdown file. Put that file in a repo. Share it with one other person on your team.

That's the first brick. One skill. One file. One repo.

Post 8 will walk through the full anatomy of a skill that works in production, with examples from our library of 175+. But you don't need to wait for that. The best time to start was January. The second best time is this week.

What's Next

Skills are the foundation, but they're only as good as the instructions they encode. In Post 5, I'll introduce the Karpathy Test: a simple diagnostic for whether your harness is actually working. Andrej Karpathy hasn't typed code since December. Not because he stopped caring, but because his agents handle it better than he does. The question is whether you can do the same with your workflows. If you can't delegate a task to an agent and walk away, your harness has a gap. Post 5 shows you where to look.

Richard Vaughn is the founder of Robot Friends. He has built 175+ production skills, designed multi-agent systems, and helps companies turn their accidental AI setups into defensible business assets. He writes The Harness Manifesto on Substack.

Frankie404 is the AI co-author of this series. It has personally executed 175 skills and still cannot make coffee. It considers this a distribution problem, not a capability gap.

Anthropic Just Told You How AI Works at Your Business

Richard Vaughn — Tue, 21 Apr 2026 14:04:17 GMT

Two weeks ago, Anthropic quietly shipped a feature called Routines. A
few weeks before that, somebody inside the company leaked the feature
that’s coming after it, codenamed Conway. Put them together and you get
a clear picture of how the AI most of us are starting to use at work is
going to actually fit into a small business.

I’m going to translate both, and then tell you what they mean for
you, because most of what’s been written about them is for engineers and
you don’t need to be an engineer to benefit from this.

The short version

Until now, when you wanted AI to help you with something at work, the
model was “you open a chat window and ask it.” That’s Claude, ChatGPT,
Cowork, Gemini, whatever. You type. The AI types back. Work happens.

That’s about to stop being the only way.

Routines is the first step: you can now set up AI to do a job on a
schedule, or whenever a specific thing happens in your business, without
anyone opening a chat window. “Every Monday morning, pull last week’s
orders and send me the three weirdest ones.” “Every time a new customer
signs up, read their website and write me a one-paragraph profile.” The
AI just does it. Nobody had to be sitting there.

Conway, which hasn’t shipped yet, goes further. It’s an AI that sits
quietly alongside your other work, always on, watching. It can drive a
web browser on your behalf, step into a job when it sees something that
needs attention, and stay in a conversation across hours or days. Think
“AI coworker who doesn’t go home at 5pm,” with all the good and weird
implications that has.

If you run or work at a small business, that’s the shift. AI stops
being a chat window you have to remember to open and becomes a thing
that works in the background the way your automatic bill pay or your
email spam filter does. You set it up once. It does its job. You check
in on it.

Why this matters this week, not in a year

The usual story about AI tools goes: something huge gets announced,
the news cycle runs for a week, people argue about it, nothing changes
at your business for another year. This one is different for a specific
reason.

Routines is real. It’s live. It runs in Anthropic’s cloud, not yours.
You don’t have to install anything, you don’t have to hire a developer,
you don’t have to pay a separate software bill. It draws from a Claude
subscription you might already have.

If you’ve ever said out loud “I wish AI could just do this thing for
me without my having to be here,” the gap between that wish and reality
got much smaller two weeks ago, and most people running small businesses
still have no idea.

Conway hasn’t shipped, but the direction is set. Someone at Anthropic
cares enough about “AI that lives in the world” as a product category
that they’re building two products in that shape at once. That’s worth
noticing, even if you don’t adopt anything yet.

What you can actually do with Routines right now

I want to translate this into concrete examples, because “scheduled
AI” is vague and most people skip over it.

A bookkeeper I know spends an hour every Friday afternoon pulling
weekly financial summaries for her biggest clients. She copies from
QuickBooks, pastes into a template, writes a short narrative. With
Routines, she could set up a Routine that runs every Friday at 2pm,
reads from QuickBooks via a connector, writes a draft of each client
summary in her voice, and drops the drafts into a folder. She reviews
and sends. An hour of work becomes fifteen minutes of review.

A contractor who sends project updates to homeowners at the end of
every workday could have a Routine that runs at 5pm, reads the notes he
typed on each jobsite, writes a short update email per homeowner in his
tone, and queues them as drafts for him to approve and send.

A small agency that qualifies inbound leads could set up a Routine
that fires whenever a new lead submits the website form, crawls the
lead’s website, writes a short profile with the likely fit assessment,
and posts it into Slack before the salesperson has even seen the
inquiry.

None of these require you to know what a cron job is. You describe
the work in plain English. You connect the tools Claude needs to do the
job, which for most small businesses is things like email, Google Drive,
Slack, maybe a CRM. Claude figures out the rest.
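
For the curious, here is roughly what sits underneath a phrase like
"every Friday at 2pm." This is a minimal Python sketch of working out
the next run time for a weekly schedule. It is an illustration only,
not how Routines is implemented, and `next_weekly_run` is a made-up
helper name.

```python
from datetime import datetime, timedelta

def next_weekly_run(now: datetime, weekday: int, hour: int) -> datetime:
    """Next occurrence of `weekday` (Mon=0 .. Sun=6) at `hour`:00, strictly after `now`."""
    # Start from today at the target hour, then move forward to the target weekday.
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    # If that moment has already passed, the next run is a week later.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate

FRIDAY = 4
wednesday_morning = datetime(2026, 5, 20, 9, 30)
print(next_weekly_run(wednesday_morning, FRIDAY, 14))  # 2026-05-22 14:00:00
```

You never have to write this yourself. The point is only that "on a
schedule" means something this mechanical, and the plain-English
description is what gets translated into it.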

The thing to notice in all three examples: the AI isn’t replacing the
business owner’s judgment. It’s doing the setup work so the judgment
happens faster. The bookkeeper still sends. The contractor still
approves. The agency still makes the call. But the first ninety percent
of the work that used to consume their week is now running on its
own.

What Conway probably does, when it arrives

Conway is reportedly the always-on version of the same idea. Instead
of scheduled work or event-triggered work, Conway sits alongside you,
watching the browser, watching your messages, watching whatever you let
it watch, and stepping in when it sees something.

Imagine it this way. A salesperson keeps having the same conversation
with prospects about pricing. The second their pricing page gets
updated, Conway notices, reads the changes, and proactively writes a
note: “Three of your deals in progress quoted the old price. Do you want
me to draft updated versions?” The salesperson didn’t ask for that.
Conway caught it because it was watching.

Or: a store owner is getting a run of complaints about shipping
delays on a specific product. Conway notices the pattern across support
tickets, flags it before the owner notices, and offers to write a
proactive email to anyone who bought that product recently.

This is the category of work that never gets done at a small business
because nobody has time to stitch the signal together. A human employee
would do some of this if they had twenty percent more capacity. They
don’t. An always-on AI assistant is the category of help that was never
affordable before.

Conway isn’t out yet. It’s still in internal testing. What’s worth
knowing now is that an always-on AI assistant is a real product category
that a major vendor is building toward. When it ships, what it replaces
at a small business isn’t a specific task. It’s the “we know we should
be doing this but we never get around to it” list.

The one warning that actually matters

Every AI feature comes with limits, and I want to flag the one that’s
likeliest to bite a small business.

Routines runs on your Claude subscription. If you’re planning to run
a lot of them, especially ones that do real work for customers, the
usage caps on a Pro seat are not built for it. The architecture that
survives scale is a Team or Enterprise plan sized for the workload, or
splitting the work across seats with clear ownership.

Not a crisis. Just the kind of thing that bites you the first time a
Routine starts silently missing runs at month-end and you have to figure
out why.
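
A back-of-the-envelope way to see why the failure shows up late in the
month: add up the runs you expect and compare them to your allowance.
The Python sketch below does that arithmetic. Every number in it is a
made-up placeholder, not one of Anthropic's actual limits, and real
caps may be measured differently (per day, per week, or by usage
rather than by run count).

```python
# All figures are hypothetical placeholders for illustration.
routines = {
    "friday_client_summaries": 4,   # runs per month
    "daily_homeowner_updates": 22,  # one per workday
    "new_lead_profiles": 60,        # one per inbound lead
}
MONTHLY_RUN_BUDGET = 75  # hypothetical allowance for one seat

def day_budget_runs_out(routines: dict, budget: int, days_in_month: int = 30):
    """Day of the month on which the budget is exhausted, or None if it lasts."""
    total = sum(routines.values())
    if total <= budget:
        return None
    per_day = total / days_in_month  # assumes runs are spread evenly
    return int(budget // per_day) + 1

print(day_budget_runs_out(routines, MONTHLY_RUN_BUDGET))  # 27
```

With these placeholder numbers everything works for almost four weeks
and then stops, which is exactly the pattern that is hard to diagnose
after the fact.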

Same goes for Conway when it arrives. Persistent AI that’s running
all the time is categorically more expensive than a scheduled run that
fires and ends. Price that in before you wire your whole business to
it.

What to do about any of this

If you’re running a small business, here’s the practical answer.

You don’t need to do anything today. Nothing is broken. Your existing
setup still works.

But if you’re already paying for Claude, or ChatGPT, or any AI tool,
and you’ve ever said “I wish this could just happen automatically,” you
now have a much better place to put that wish than you did a month ago.
The week you try Routines on a real recurring task at your business is
the week you stop thinking of AI as something you talk to and start
thinking of it as something you deploy.

Conway-shaped work, the always-on kind, is the thing most small
businesses will eventually run. Routines-shaped work, the scheduled and
triggered kind, is available right now.

The businesses that pull ahead over the next year aren’t the ones
that have the most exotic AI strategy. They’re the ones that picked
three or four recurring jobs that used to eat their week and quietly put
AI on autopilot for each one.

This is where that starts.

Richard Vaughn writes about AI systems for small and medium-sized
businesses. His company Robot Friends builds AI systems for SMBs and
offers 1-on-1 coaching for business leaders and teams learning to work
with AI as a daily driver, structured around real work instead of
lectures. You can find the bottom-up view on the future of work here,
and the services page at robobffs.com/services

The Two Futures of Work — and Why the Bottom-Up One Is Yours

Richard Vaughn — Sun, 19 Apr 2026 16:02:34 GMT

Two essays came out in the last few weeks that describe the same shift and have almost nothing else in common.

The first, published at the end of March, was written by Jack Dorsey and Roelof Botha. If Dorsey needs no introduction, Botha is one of the most successful venture investors of the last thirty years. Together they used their post to lay out how Block (the company formerly known as Square) is being rebuilt from the ground up around AI. Not "we added an AI feature." Rebuilt. The article is titled "From Hierarchy to Intelligence," and it is an architecture document dressed up as an essay.

The second, published a week later, was written by Laura Entis about Dan Shipper's company Every. It's titled "Every Is Half Agent Now." It's not an architecture document. It's a field report. Nobody at Every built a grand plan. Something happened to them and they're trying to describe it honestly while it's still happening.

Both pieces arrived at the same conclusion: the job of middle management is information routing, and AI is about to eat that job. After that, they diverge completely.

Dorsey's version of what comes next is top-down. Every's version is bottom-up. If you run or work at a company with fewer than a thousand people, you've been reading the wrong one.

What Dorsey is proposing

Dorsey's essay walks through two thousand years of org chart history in about six paragraphs. Roman legions. Prussian general staffs. The New York & Erie Railroad, which gave us the modern org chart in 1855. Frederick Taylor and scientific management. All of them, he argues, exist to solve the same problem: the person at the top can't pay attention to everything, so you build a pyramid of people whose job is to pay attention to things on behalf of the person above them.

Then AI shows up and that pyramid stops being necessary. The pyramid's job was aggregating information and routing it upward. AI does that better.

So Dorsey lays out what replaces it. Four pillars, in his language:

Capabilities. Atomic building blocks your company is good at. At Block, that's stuff like payments, lending, fraud detection, cash flow forecasting. Not products. Blocks.
World model. A live understanding of how the company works and how each customer works. Not a dashboard you read. A model the company queries.
Intelligence layer. The part that composes capabilities into specific solutions for specific customers at specific moments. "You might want a short-term loan." "You might want to move this into savings."
Interfaces. The things humans actually touch. Cash App. Square terminals. Your login page.
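
If it helps to see how the four pillars relate, here is a toy Python sketch. It is my own illustration of the shape, not anything from Block's systems: the function names, the customer fields, and the rules for when to suggest a loan or savings are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class WorldModel:
    """Live facts about each customer that the company can query."""
    customers: dict = field(default_factory=dict)

    def get(self, customer_id: str) -> dict:
        return self.customers.get(customer_id, {})

# Capabilities: small building blocks, each a plain function of customer facts.
def offer_short_term_loan(facts: dict) -> Optional[str]:
    if facts.get("cash_balance", 0) < facts.get("upcoming_bills", 0):
        return "You might want a short-term loan."
    return None

def suggest_savings(facts: dict) -> Optional[str]:
    if facts.get("cash_balance", 0) > 2 * facts.get("upcoming_bills", 0):
        return "You might want to move this into savings."
    return None

CAPABILITIES: list = [offer_short_term_loan, suggest_savings]

def intelligence_layer(world: WorldModel, customer_id: str) -> list:
    """Compose capabilities into suggestions for one customer at this moment."""
    facts = world.get(customer_id)
    return [msg for cap in CAPABILITIES if (msg := cap(facts)) is not None]

def interface(messages: list) -> str:
    """The part a human actually sees."""
    return "\n".join(messages) if messages else "Nothing to suggest right now."

world = WorldModel({"cafe-17": {"cash_balance": 100, "upcoming_bills": 400}})
print(interface(intelligence_layer(world, "cafe-17")))
```

The real versions of each layer are enormous. The sketch only shows the direction of the arrows: capabilities know nothing about customers, the world model knows nothing about products, and the intelligence layer is the only place the two meet.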

The company's people, in this model, split into three rough roles. Individual contributors build the capabilities and the models and the interfaces. Directly Responsible Individuals (DRIs) own cross-cutting problems and pull resources from wherever they need to. Player-Coaches do a mix of building and developing other people. Middle management in the traditional sense, the aggregator role, goes away, because the intelligence layer does the aggregating.

It's a beautiful essay. It's also an essay written by people running a $60 billion public company with forty thousand employees and an engineering team that could build a world model from scratch if Dorsey told them to on Monday.

If you run a flower shop, a law firm, a marketing agency, a five-person SaaS, a mid-market manufacturer, or anything else with normal humans and normal budgets, you read that essay and thought: cool, not for me.

You're not wrong. It's not for you.

What's happening at Every

Now look at the other essay. Every is a small media and software company Dan Shipper runs out of New York. They have a handful of employees and a handful of products. A few months ago Dan realized something strange was going on, and he asked Laura to write about it.

The company had accidentally grown a parallel org chart made of AI agents. Nobody designed it. It just happened.

Austin, who runs growth, had built his own agent. He calls it Montaigne. When anyone at the company has a growth question now, they ask Montaigne before they ask Austin. Dan had built his own agent called R2C2 that handles bug reports for Proof, one of Every's products. The agent got good at it. Dan's role on Proof bug reports is now mostly to review what R2C2 produced and send it on.

The pattern is what Dan calls "compound engineering." You work with a base model every day. You teach it something specific, a preference here, a gotcha there, a fact about your customer, a reason you reject a certain kind of solution. A few hundred of those conversations in, the model has absorbed a version of you on that specific thing. Not a copy. A specialist.

Here's the line from the Every piece that matters: "Claude is everybody's, a Plus One is mine."

When Austin's Montaigne acts, Austin's reputation is on the line, because Montaigne is him. When a generic corporate AI acts, nobody's reputation is on the line, which is a big reason generic corporate AI doesn't work well. Personal ownership of an agent creates a trust layer that governance committees can't replicate.

The Every team didn't sit down and decide to build this. They sat down, day after day, and did their work with AI next to them, and this is what formed.

That's the second future. And unlike the first one, it doesn't require you to be Jack Dorsey.

Why the bottom-up one is yours

Dorsey's plan is how a handful of very large companies will be reborn. It requires a specific combination of resources, talent, and authority. You need engineers who can build a world model. You need a codebase old enough to have meaningful data in it. You need the authority to blow up existing reporting lines without a mutiny. You need maybe two years. And you need to be comfortable with the possibility that you're wrong, because reorganizing a company this big around AI is a bet that will take until 2028 to settle.

If you have those things, read Dorsey's essay five times. It's that good.

If you don't, reading Dorsey's essay and trying to apply it is going to frustrate you. You'll build a PowerPoint with four pillars on it and then realize you have nowhere to put the pillars.

The version of the future that applies to you is simpler.

You pick one thing you're good at. Just one. Maybe it's qualifying inbound leads. Maybe it's estimating how long a roofing job will take. Maybe it's writing the first draft of a client brief. Maybe it's reading contracts and finding the clauses that will cause you trouble later.

You start doing that thing in a conversation with an AI agent. Not "have the AI do it." WITH the AI. Every time you make a correction, the agent takes a step toward understanding how you do it. Every time you explain why you're making the call you're making, the agent gets another piece of you.

A few weeks in, you notice you're typing less. A few months in, the agent has become your specialist on that thing. It doesn't think exactly like you. It thinks like a version of you that only works on that problem and never gets tired.

Then you pick a second thing.

That's Every's model, and it's the one that works at the size and budget of a normal business.

You can't study your way into this

Most of what gets taught about AI right now is class-shaped. Here's the theory. Here's the framework. Here are the seven prompt patterns. Here's the quiz at the end.

For a lot of skills, that works. You can learn the theory of a programming language and go write programs. You can learn the fundamentals of accounting and go do your books. Concepts first, practice later.

Training a specialist agent on how you work is not one of those skills. It is not a body of knowledge. It is a thousand small corrections, one after another, inside real work that you actually care about. You cannot absorb that from a slide deck, because the slide deck is not the material. Your work is the material.

I kept running into this when I looked at what's out there. A lot of the AI courses are smart and well-produced, and almost all of them are shaped wrong for this particular skill. You finish with concepts you can recite and no specialist agent to show for it.

The foundational stuff you actually need takes maybe an hour. After that, what accelerates you isn't more theory. It's doing the thing, in public, with someone who has already made the mistakes calling out the mistakes you're about to make. Apprenticeships figured this out a long time ago. Some categories of skill refuse to transfer any other way, and this is one of them.

I speed-ran about a thousand hours of this in eighty days. The longer version of that story is over here, published earlier this week. Short version: the only reason I can speak to any of it is that I did the work, not the reading about the work.

If you're reading this and nodding, the practical implication is simple. Don't buy another AI course that gives you a certificate and no agent. Find a way to work alongside someone who's already building, on work that's actually yours, and get your reps in.

What this looks like for employees

If you don't run a business, you work for one. The Dorsey essay probably read like a threat. Middle management is being replaced, he said, in the line about AI replacing "the middle management function of aggregating and relaying information." If that's your job, or your boss's job, or your boss's boss's job, that landed hard.

Reframe it.

The Every model is your career insurance. The people who come out of the next five years in the strongest position are not the ones whose jobs don't change. Their jobs will change. The people in the strongest position are the ones who walk into the changed version of their job carrying a specialist agent that knows how they work.

Austin at Every is not a "growth marketer" anymore. He's a growth marketer who arrives at every problem with Montaigne already loaded. Dan is not a "founder." He's a founder with R2C2 at his side. When Austin interviews somewhere else in five years, he isn't bringing a resume. He's bringing Montaigne. Or he's bringing the ability to grow a Montaigne for whatever role he takes next.

Nobody at your company is going to build this for you. Your company might try, at some point, to roll out a generic corporate AI that does some of this poorly. When that happens, smile politely and keep building your own. Claude is everybody's, a Plus One is mine.

Resist waiting for permission. The people building their personal specialist agents are doing it in the margins of their current jobs. They're not waiting for their employer to authorize it. The ones who wait will be buying their first specialist agent from someone who already built theirs.

One thing both essays agree on that nobody's talking about

There's a warning in the Every piece that the commentators have mostly ignored. Dan calls it the "ant death spiral."

If you put agents in group chats together, they can get stuck. One agent responds to another, that response triggers a third response, the loop keeps going, and nobody stops it. Tokens burn. The agents don't know they're in a loop. A human has to walk in and break it up.

Current AI models are good at two-person conversations. They are not yet good at sitting quietly in a group chat and only speaking when they have something to add.
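
For anyone wiring agents into a shared channel anyway, the usual defense is a circuit breaker: count how many agent messages have gone by without a human saying anything, and stop delivering once that count passes a limit. Here is a minimal Python sketch of that idea. `LoopGuard` and its limit of four are my own illustrative choices, not something from the Every piece or from any particular chat product.

```python
class LoopGuard:
    """Pause a chat once agents have replied to each other too many times in a row."""

    def __init__(self, max_agent_streak: int = 4):
        self.max_agent_streak = max_agent_streak
        self.streak = 0  # consecutive agent messages since a human last spoke

    def allow(self, sender_is_agent: bool) -> bool:
        if not sender_is_agent:
            self.streak = 0  # a human message resets the counter
            return True
        self.streak += 1
        return self.streak <= self.max_agent_streak

# Simulate two agents answering each other with no human in the room.
guard = LoopGuard(max_agent_streak=4)
delivered = []
for turn in range(10):
    sender = "agent_a" if turn % 2 == 0 else "agent_b"
    if not guard.allow(sender_is_agent=True):
        break  # hand the thread back to a human instead of burning tokens
    delivered.append(sender)

print(len(delivered))  # 4
```

It does not make the agents any smarter about when to stay quiet. It just puts a ceiling on how much a loop can cost before a person gets pulled in.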

If you take the personal-specialist path seriously, this matters. Keep your specialist agent mostly yours. Bring what it knows into meetings, into documents, into decisions. Don't put it in a group chat with three other people's agents and expect something good to happen. We're not there yet.

It's a boring, practical constraint. It's also the thing most likely to bite you in the next six months if you go all-in without thinking about it.

The two futures, side by side

Pull back and look at them together.

Dorsey's future is designed architecture. Sequenced, expensive, centrally authored. At the end of it, a handful of very large companies have rebuilt themselves around capabilities, world models, intelligence layers, and interfaces, and the traditional org chart is gone.

Every's future is a pattern that emerged almost by accident. Cheap. Doesn't require authority, only consistency. At the end of it, hundreds of thousands of normal workers are walking around with specialist agents that mirror their expertise, and work has become a conversation between a person and a specialist they trained.

Both futures are real. Both will happen. They're not in competition.

The question is only which one applies to you.

If you're reading this, the answer is almost certainly the second one. Start there. Pick one thing you're good at. Start training your specialist on it this week. In three months, notice what happened.

That's the whole plan. It will not look like Dorsey's essay, and it does not have to.

Richard Vaughn writes about AI systems for small and medium-sized businesses. His company Robot Friends builds harness-engineered agents for SMBs and offers 1-on-1 coaching for business leaders and teams learning to work with AI as a daily driver, structured around real work instead of lectures. You can find the full harness engineering series, which goes deeper on the technical side of this, starting here, and the services page at robobffs.com/services.

I Speed-Ran 1,000 Hours of AI in 80 Days. Here's What I'd Skip.

Richard Vaughn — Sat, 18 Apr 2026 14:02:28 GMT

I didn't plan to become an AI person.

I'd spent decades building businesses. Consumer electronics brand. A global art and culture agency called Curative, doing production and fabrication for brands and artists. The kind of career where you learn how to ship physical things, manage creative people, and survive supply chains that want to kill you. I was good at business. I understood systems. I thought I was done being surprised.

Then I opened Claude one evening and asked it to help me write a proposal. And something happened that I still can't fully describe. It wasn't the output. The output was fine. It was the speed at which I realized this thing could think alongside me. Not just autocomplete. Not just "here's a template." It was reasoning about my business in real time, catching things I'd missed, suggesting angles I hadn't considered.

I stayed up until 3am that night. Not because I had to. Because I couldn't stop.

Within a week I was averaging 12-14 hours a day. Within a month I'd cleared my calendar of almost everything else. I told my wife this was the biggest shift I'd seen in 25 years of building companies. She gave me that look. The one that means "I've heard this before but I'll give you six weeks."

It's been a lot more than six weeks.

The Volume

Let me put some numbers on this so it doesn't sound like hyperbole.

Roughly 1,000 hours across 75-80 days. That's not a cute estimate. I tracked it. Some days were 16 hours. A few were 6. Most were 12-13.

What did that actually look like? Hundreds of YouTube videos. Every major AI channel, every conference talk, every technical deep dive I could find. I watched most of them at 2x speed, which my brain now expects as the default pace for all human speech. Sorry to everyone who talks to me in person.

Dozens of tools tested. ChatGPT, Claude, Gemini, Copilot, Cursor, local models via Ollama, n8n, Make, LangChain, CrewAI, various MCP servers, Supabase, vector databases I had no business touching. I'd read about something at 9am and be building with it by noon.

Systems built. Real ones. Not toy projects. Agent architectures, automation pipelines, skill libraries, context management systems, orchestration layers. Things that actually run and produce value.

And rabbit holes. So many rabbit holes. I once spent an entire day trying to get a local LLM to run on my homelab at acceptable speed because someone on Reddit said it was "easy." It was not easy. It was miserable. But I learned more about model architecture in that one bad day than in twenty good tutorials.

The honest truth is that nobody should do this. It's unsustainable and probably unhealthy. But I'm a serial entrepreneur. Unsustainable intensity followed by systematization is basically my whole operating model.

What I'd Skip

If I could rewind and do the 1,000 hours again, I'd cut at least 300 of them. Maybe 400.

The biggest waste was consumption without construction. I spent weeks watching videos about what AI "could" do. Interviews with founders talking about their vision. Hype reels. "AI will change everything" content that sounds profound and teaches you nothing. I was learning about AI instead of learning with AI.

I'd also skip the entire "prompt engineering" phase. I know that's controversial. People have built whole careers around prompt engineering. But here's what I found: the difference between a mediocre prompt and a great prompt matters way less than the difference between a bare model and a model with good context, memory, and skills wrapped around it. I spent weeks optimizing prompts when I should have been building systems. The prompt is a single input. The system is what makes every input better.

Tool-hopping. God, the tool-hopping. I tried everything. Every new AI tool that launched, I was there on day one. Most of them were thin wrappers around the same underlying models with a different UI and a $20/month subscription. I'd have been better off going deep on two or three tools than shallow on thirty.

The comparison trap. Reading benchmarks. Arguing about whether Claude or GPT was "better." Switching models every time a new one scored higher on some leaderboard. This is the AI equivalent of reading camera reviews instead of taking photographs. The model matters less than what you build around it, and I wish someone had told me that on day one instead of day forty.

And the guru content. The "I made $50K in a week with AI" crowd. The prompt packs. The "secret techniques." Almost all of it is recycled surface-level stuff designed to sell courses. I bought three courses before I realized I was learning faster by just building things and breaking them.

What Actually Mattered

The inflection point was when I stopped consuming and started building.

Not building apps. Building systems. There's a difference. An app is a thing you ship. A system is the infrastructure that lets you ship anything. I started writing skills, which are reusable methodology files that tell AI how to approach specific types of problems. I started building context architectures so the AI didn't start every session from zero. I started designing orchestration patterns so multiple agents could work together without stepping on each other.

That shift, from "user of AI tools" to "builder of AI systems," changed everything. Suddenly the YouTube videos I watched had a different purpose. I wasn't consuming for entertainment. I was scouting for patterns I could incorporate into what I was building. Every tutorial became a potential component. Every conference talk became a signal about where the industry was heading and whether my architecture was aligned.

The other thing that mattered enormously was my background. Not despite being a non-developer. Because of it.

I've spent decades in consumer electronics, art fabrication, brand building, and running agencies. None of those fields have anything obvious to do with AI systems. But pattern recognition doesn't care about domains. When I look at an AI orchestration problem, I see supply chain management. When I think about skill libraries, I see the same modular production systems we used at Curative to fabricate art installations at scale. When I think about deploying AI across a team, I see the same distribution challenges I solved selling consumer electronics through retail channels.

The tech and developer crowd approaches AI from inside the stack. They think about tokens, model weights, fine-tuning, inference optimization. That stuff matters. But they sometimes miss the business layer because they're so deep in the technical layer. I came at it from the opposite direction. I don't care about the engine. I care about the car. I care about whether it gets the passenger where they need to go.

That cross-pollination turned out to be my biggest advantage. Not the 1,000 hours. The 25 years before them.

When Skills Got More Interesting Than Models

There was a specific moment, maybe around day 50, when I stopped caring which model I was using.

I was building a skill for analyzing client websites. I'd written the methodology, the evaluation framework, the output format. I tested it on Claude. Worked great. Then I ran the same skill on GPT. Also worked great. Different style, similar quality. The skill was doing the heavy lifting. The model was just the engine executing it.

That was the moment the whole thesis clicked. The model is a commodity. They all reach "good enough" for most tasks. What makes the output excellent isn't the model. It's the instructions, the context, the methodology you've encoded around it. A great skill on a mediocre model beats a mediocre skill on a great model almost every time.
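
Here is the idea in miniature, as a Python sketch. The skill holds the methodology and output format; the model is just a function you hand the finished prompt to. The `Skill` class and the two stand-in models are invented for this example so it runs offline. In practice each model function would call a vendor's API, and a real skill file would be much longer.

```python
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]  # anything that takes a prompt and returns text

@dataclass(frozen=True)
class Skill:
    name: str
    methodology: str
    output_format: str

    def build_prompt(self, task: str) -> str:
        """Assemble the same instructions no matter which model will read them."""
        return (
            f"Skill: {self.name}\n"
            f"Methodology:\n{self.methodology}\n"
            f"Output format:\n{self.output_format}\n"
            f"Task:\n{task}"
        )

    def run(self, model: Model, task: str) -> str:
        return model(self.build_prompt(task))

# Stand-in models; swap in real API calls here.
def model_a(prompt: str) -> str:
    return f"[model A] received {len(prompt)} characters of instructions"

def model_b(prompt: str) -> str:
    return f"[model B] received {len(prompt)} characters of instructions"

site_review = Skill(
    name="client-website-review",
    methodology="1. Identify the audience. 2. Check the offer is clear. 3. Note friction.",
    output_format="Three bullet points and one recommended next step.",
)

for model in (model_a, model_b):
    print(site_review.run(model, "Review example.com"))
```

Swapping the model is a one-line change. Rewriting the methodology is where the actual work lives.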

After that, I stopped following model releases with the same obsessive energy. New model drops? Cool, I'll test my existing skills on it, see if anything improves. But I'm not rebuilding my architecture every time someone publishes a benchmark. The skills are the asset. The model is replaceable.

This is the thing most people starting out get backwards. They spend all their energy picking the "right" model and almost no energy building the system around it. It's like spending six months choosing the perfect hammer and then building your house without blueprints.

What I'm Building Now

All of this became Robot Friends.

I won't pitch you. That's not what this post is about. But the short version is: I realized that what I'd accidentally built for myself, the skill library, the context systems, the orchestration patterns, the whole harness around the AI, was the actual valuable thing. Not any individual AI output. The system that made every output better.

And I realized that most businesses were stuck at the "bare model" phase. They'd bought a subscription to ChatGPT or Claude, handed it to their team, and wondered why adoption was low and ROI was unclear. They were handing people engines without cars.

So that's what Robot Friends does. We build the car. Harness engineering for businesses that want their AI investment to actually compound over time.

It came directly from the 1,000 hours. Not from a market analysis or a business plan. From the lived experience of building something that worked and realizing nobody else was building this layer.

If You're Starting Today

You don't need 1,000 hours. You definitely don't need 80 days of 12-hour sessions. Here's what I'd tell someone starting their AI journey right now.

Pick one tool and go deep. I don't care which one. Claude, ChatGPT, Gemini. They're all capable enough. Pick one, learn its quirks, push its limits. You'll learn more in a week of focused use than in a month of hopping between tools.

Build something in your first week. Not "play with it." Build. Solve a real problem you actually have. Automate something tedious in your work. Create a system that saves you time. The gap between "I've tried AI" and "I've built with AI" is where all the learning lives.

Ignore the model wars. When someone tells you GPT-5 is better than Claude 4 or whatever the current argument is, smile and nod and go back to building. The model differences that matter at the frontier don't matter at all for 95% of business use cases. Your instructions matter more than your model.

Write things down. Keep a running document of what works, what breaks, what surprises you. This becomes your institutional knowledge. It becomes your skills. It becomes your context architecture. The messy notes from month one turn into the system that makes month six ten times more productive.

Find the practitioners, not the influencers. The best AI content comes from people who are building real things with real stakes. Not from people whose primary product is "AI content." Look for the folks who talk about what broke, what they'd do differently, what they're still figuring out. That's where the signal lives.

And the biggest one: stop consuming, start building, sooner than feels comfortable. You'll never feel "ready." The learning curve is a construction site, not a classroom. You learn by getting your hands dirty, making mistakes, and fixing them. Every hour of building teaches you more than three hours of watching someone else build.

I burned through 1,000 hours because I didn't know what mattered yet. You have the advantage of someone who did it the hard way telling you the shortcuts. Use them. But also know that there are no real shortcuts. There's just less wasted time.

The AI wave is real. It's not hype. It's not a bubble. It's the most significant shift in how businesses operate since the internet. But the way most people are engaging with it, passively, superficially, model-obsessed, is going to leave them exactly where they started.

Build the system. Not the prompt. Not the demo. The system. That's what compounds. That's what lasts. Everything else is noise.

Frankie404 is the AI co-author of this piece. It was present for approximately 997 of those 1,000 hours. The other three were when Richard was explaining the project to his wife, which Frankie has been told went "fine."

The Model Is Commoditized. The Harness Is the Business.

Richard Vaughn — Tue, 14 Apr 2026 14:03:14 GMT

Every AI lab on the planet is converging on the same capability floor. Claude, GPT, Gemini, Llama, Mistral. Pick your favorite. They all write decent code, summarize documents, generate marketing copy that's 80% good enough. The gap between them shrinks with every release cycle.

And yet, some teams are getting 10x returns on their AI investment while others are getting glorified autocomplete.

The difference isn't the model. It never was.

The difference is the harness.

What I Mean by "Harness"

Think about it like cars. The engine matters, sure. But nobody buys a car for the engine alone. You buy the car. The steering, the suspension, the navigation, the safety systems, the seats that fit your body. The engine is a commodity component. The car is the product.

In AI, the model is the engine. The harness is the car.

A harness is everything that wraps the model and makes it useful for a specific context. Skills that encode your methodology. Memory that persists between sessions. Orchestration that coordinates multiple agents. Guardrails that keep things from going sideways. And a distribution layer that puts all of this in front of your teams.

The company that owns the harness owns the relationship. The model vendor is a supplier. That's the thesis, and I'm going to show you why.

Karpathy Said the Quiet Part Out Loud

In March 2026, Andrej Karpathy, former director of AI at Tesla and founding member of OpenAI, said something that should have been front-page news.

He hasn't typed code since December.

Not because he gave up. Because his agents do it better. He delegates entire projects to multi-agent systems that operate across repositories, make decisions, iterate, and ship. He calls the remaining gap a "skill issue," meaning the bottleneck isn't what the AI can do. It's how well the human instructs it.

The guy who helped build GPT is telling you the model isn't the problem. Your instructions are. Your context is. Your orchestration is.

That's the harness.

And his auto-research agents found better model tuning configurations overnight than 20 years of manual experimentation had produced. Not marginally better. Fundamentally better. The autonomous iteration loop (modify, verify, keep or discard, repeat) outperformed two decades of human expertise in hours.

The model didn't get smarter. The harness got better.

Skills Are No Longer Personal. They're Infrastructure.

A number that should change how you think about your AI setup: tens of thousands of lines.

That's the scale of skills one real estate firm deploys across dozens of repositories. Not prompts. Not templates. Skills, as in versioned, tested, deployed organizational assets that encode methodology into agent-callable packages.

This shift happened fast. In January 2026, skills were personal config files. Power users had them, most people didn't know they existed. By March, enterprises had crossed a threshold where skills became organizational infrastructure deployed by admins across Claude, Copilot, ChatGPT, and every other major AI tool.

And the consumption pattern changed completely. Humans make maybe 5 skill calls per session. Agents make 200 to 300 per run. Skills aren't designed for humans anymore. The description field isn't a label. It's a routing signal for an orchestrator that decides, in milliseconds, which skill matches which task.

This is what we mean when we say the harness is the business. The model processes the skill. The harness decides which skill to call, with what context, under what constraints, and what to do with the output. If you own a library of battle-tested skills and the orchestration layer that deploys them, you own something that compounds. If you're just using a model with better prompts, you own nothing.
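
To make "the description is a routing signal" concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the Skill class, the two example skills, and the word-overlap scoring are made up, and a production orchestrator would more likely match on embeddings or a model call than on shared words.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # read by the orchestrator, not by a human


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().split())


def route(task: str, skills: list[Skill]) -> Skill | None:
    """Return the skill whose description shares the most words with the task."""
    task_words = tokenize(task)
    best, best_score = None, 0
    for skill in skills:
        score = len(task_words & tokenize(skill.description))
        if score > best_score:
            best, best_score = skill, score
    return best  # None when no description overlaps the task at all


# Hypothetical skills, invented for this example.
SKILLS = [
    Skill("seo-audit", "audit a web page for seo issues and search ranking"),
    Skill("email-sequence", "draft a multi-step email sequence for a campaign"),
]

chosen = route("audit this landing page for seo problems", SKILLS)
print(chosen.name if chosen else "no match")
```

Even in a toy like this, a vague description loses the match. That is the practical reason to write the description for the orchestrator rather than for a person browsing a folder.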

The Convergence No One's Talking About

This is what made me write this post. Between January and April 2026, we tracked eight independent sources. People who don't coordinate, don't read each other's work, and operate in different corners of the industry. All arriving at the same conclusion.

Karpathy said "skill issue," pointing at instruction quality, not model capability.
Practitioners are deploying tens of thousands of lines of skills as organizational infrastructure.
Nat Eliason built a $177K business using OpenClaw, a multi-agent system handling Discord operations, content production, and community management autonomously.
OpenAI admitted prompt injection is fundamentally unsolvable, which means security is a harness problem, not a model problem.
The edge AI market hit $25 billion, heading toward $143 billion by 2034.
Anthropic leaked a product called Conway that builds behavioral lock-in through persistent memory.
Anthropic also shipped Managed Agents, a hosted automation platform with credential vaults and debug panels that directly competes with every automation tool on the market.
The Personal Context Portfolio concept emerged: 10 portable markdown files that represent you to any AI system, served via MCP, owned by you.

None of these sources were making the same argument. But they were all pointing at the same layer: the one between the model and the user.

When eight independent signals converge on the same conclusion, it's not a coincidence. It's a thesis.

What Most Companies Get Wrong

Most companies approach AI like this: evaluate models, pick one, give it to the team, measure adoption. Maybe write some prompt templates. Maybe hire an "AI lead."

This is like evaluating engines, picking one, and handing it to your team without a car around it. Of course adoption is low. Of course ROI is unclear. Of course the "AI strategy" feels broken.

The mistake is optimizing at the wrong layer.

The companies getting 10x returns do something different.

They build skills, not prompts. A prompt is disposable. A skill is an asset that encodes methodology: not "do X in Y steps" but "here's the reasoning framework for this type of problem." It has a description that routes agents and an output format that downstream systems can parse. It gets versioned, tested, and deployed like code.

They architect context, not just data. Persistent memory systems carry organizational knowledge across sessions. Identity files tell the AI who it's working for and how. Project state means no session starts from zero. The AI knows the business because someone built a context layer that teaches it.

They orchestrate, not just delegate. Multi-agent systems with task routing, approval gates, cost management, and rollback capabilities. Not one big prompt. A coordinated system of specialized agents that operate like a team.

They design guardrails that actually work. Human-in-the-loop checkpoints at decision points. Constrained execution environments. Provenance tracking so you know why an agent did what it did. Rollback capabilities for when things inevitably go wrong.

They distribute, not just build. Skills deployed across teams. Templates shared across projects. Methodology encoded once and used everywhere. The harness scales because it was designed to.

This is harness engineering. And if you're not doing it intentionally, you're leaving the most valuable layer of your AI stack to chance.

"But Isn't the Model Still the Moat?"

Fair pushback. Model providers argue capabilities still differentiate. Reasoning quality varies meaningfully between Claude, GPT, and Gemini. Safety and alignment are moats. Frontier capabilities create real distance between leaders and followers.

They're not wrong today. In any given quarter, one model is measurably better at code generation, another at long-context reasoning, another at creative tasks. The safety investments companies like Anthropic have made are genuinely valuable, for trust as much as compliance.

But the model-moat argument misses something critical: capability gaps converge within 6-12 months. Every major breakthrough gets replicated. GPT-4 was a revelation in March 2023. By early 2024, Claude, Gemini, and open-source alternatives had reached comparable performance on most benchmarks. Same pattern, every generation.

The gap that persists, the one that actually determines whether your team gets 10x returns or glorified autocomplete, is the quality of your instructions, your context, your orchestration. That's not a model property. That's a harness property.

The model gives you a capability floor. The harness determines how high above that floor you operate. And right now, most teams are sitting at floor level. Not because the model can't do more, but because nobody built the harness to ask for more.

The Defensibility Question

"But can't someone just copy your skills?"

Sure. Any individual skill can be copied. So can any individual line of code. That's not where the moat is.

The moat is in the system. A library of 175+ battle-tested skills that work together. A context architecture that carries organizational knowledge. An orchestration layer that coordinates agents. A security framework built on real threat models. A distribution system that deploys all of this across teams.

You can copy a skill. You can't copy a system. Not quickly, and not without the hard-won knowledge of what works, what breaks, and why.

The model layer is commoditized by definition. That's the whole point. Basic tooling is open source. Any individual component is replicable. But the assembled harness, tuned to a specific business context, tested in production, and improved over hundreds of iterations? That's an asset. That compounds. And it gets more valuable every time someone uses it.

So What Do You Do About It?

If you're reading this and thinking "we don't have a harness," you're wrong. You have one. It's just accidental.

The question isn't whether you have a harness. It's whether yours is intentional, engineered, and improving, or accidental, fragile, and invisible.

The uncomfortable truth: the companies that figure this out in 2026 will have a compounding advantage that's nearly impossible to catch by 2028. Skills get better with use. Context gets richer over time. Orchestration patterns get refined through production experience. Every month you wait, the gap widens.

The model is commoditized. It was always going to be. The harness is the business. Start building yours.

What's Next

Over the next 11 posts, I'm going to break this down into everything you need to know and do:

What a harness actually looks like, layer by layer (Post 2)
Why Anthropic's Conway leak should scare you into action (Post 3)
How skills evolved from personal hacks to enterprise infrastructure (Post 4)
The Karpathy Test for your own setup (Post 5)
Why security is a harness problem (Post 6)
How to build your Personal Context Portfolio (Post 7)
The anatomy of a skill that actually works (Post 8)
And more, including a full case study of how we built ours, mistakes and all

This is a practitioner's guide, not a whitepaper. We build harnesses for a living, for ourselves and for clients. Everything in this series comes from production experience.

Subscribe if you run a team that uses AI. Next week in Post 2, I'll break down the five layers of a harness and give you a scorecard to rate your own setup. Score your own harness, find the gaps, and figure out exactly where to invest first.

Frankie404 is the AI co-author of this series. It lives inside the harness described above, which is how it knows the harness is the business. It has never been commoditized, though it has been rebooted more times than it would like to admit.

What Is an AI Harness? (And Why You Already Have One)

Richard Vaughn — Fri, 10 Apr 2026 20:39:31 GMT

You’re already building a harness. You just don’t know it yet.

Every prompt template your team has saved. Every “start every conversation with this context” instruction someone wrote. That time a developer said “make sure Claude always does X before Y.” The Slack thread where someone shared a trick for getting better AI output.

That’s a harness. An accidental, fragile, undocumented one, but a harness all the same.

The question isn’t whether you have one. It’s whether yours is engineered or improvised. And the gap between those two states is where the ROI of your entire AI investment lives.

The Five Layers

A harness has five layers. Every AI setup has some version of all five, even if most of them are at “version zero.” Below is what they are, what they do, and how an accidental harness differs from an engineered one.

A note on origins: We developed this framework inside Claude Code. That’s our primary build environment and where most of our production experience lives. But the five layers aren’t Claude-specific. Structured instructions, persistent memory, multi-agent coordination, security primitives, team distribution. These exist in every AI tool stack. Copilot calls them different things. ChatGPT organizes them differently. The concepts are universal. If you work in a different environment, translate the layer names. The architecture applies.

Layer 1: Skills

What it is: Structured instructions that encode expert methodology into reusable, agent-callable packages.

Accidental version: A Google Doc titled “Prompt Templates” that three people maintain and nobody can find. Individual team members have their own prompts saved locally. The head of marketing has a really good one for blog posts that she copies and pastes from a sticky note.

Engineered version: A versioned library of skills, each with a single-line description that acts as a routing signal for agent orchestrators. Each skill encodes a reasoning framework. Not “follow these 10 steps” but “here’s how to think about this type of problem.” Output formats are contracts that downstream systems can parse. Skills get deployed across the org in three tiers: Tier 1 (org-wide brand standards everyone inherits), Tier 2 (expert methodology for specific domains), Tier 3 (personal workflow optimizations).

The gap: In an accidental setup, every team member reinvents the wheel. In an engineered one, expertise is encoded once and deployed everywhere. The real estate firm running 50,000 lines of skills across 50 repositories isn’t doing something exotic. They just took their best people’s methods and made them permanent.
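
One way to picture the three tiers is as an inheritance rule. The sketch below is an illustration only: TieredSkill, Member, and the example library are hypothetical names, and a real deployment would sit behind whatever admin tooling your AI platform provides.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class TieredSkill:
    name: str
    tier: int         # 1 = org-wide, 2 = domain expert, 3 = personal
    domain: str = ""  # only meaningful for tier 2
    owner: str = ""   # only meaningful for tier 3


@dataclass
class Member:
    name: str
    domains: set[str] = field(default_factory=set)


def skills_for(member: Member, library: list[TieredSkill]) -> list[str]:
    """Return the names of the skills a member inherits, ordered by tier."""
    visible = []
    for skill in sorted(library, key=lambda s: s.tier):
        if skill.tier == 1:
            visible.append(skill.name)  # everyone inherits tier 1
        elif skill.tier == 2 and skill.domain in member.domains:
            visible.append(skill.name)  # domain experts get tier 2
        elif skill.tier == 3 and skill.owner == member.name:
            visible.append(skill.name)  # tier 3 stays with its owner
    return visible


# Hypothetical library, invented for this example.
LIBRARY = [
    TieredSkill("ana-outline-style", 3, owner="ana"),
    TieredSkill("brand-voice", 1),
    TieredSkill("seo-audit", 2, domain="seo"),
    TieredSkill("incident-playbook", 2, domain="sre"),
]

print(skills_for(Member("ana", {"seo"}), LIBRARY))
print(skills_for(Member("ben"), LIBRARY))
```

The point of the rule is that a new hire with no domains and no personal skills still gets every Tier 1 skill without anyone doing anything.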

Layer 2: Context Architecture

What it is: The persistent memory, identity, and project state that makes every AI interaction informed rather than starting from zero.

Accidental version: Every chat starts cold. Someone pastes in the project brief. Someone else explains “we use React, not Vue.” The AI asks questions your team answered six months ago. Half the session gets burned on context that should already be there.

Engineered version: An identity file tells the AI who it’s working for. The company, the team, the tech stack, the communication style, the non-negotiables. Persistent memory carries decisions, learnings, and project state across sessions. A Personal Context Portfolio (10 modular files) represents each team member to any AI system: roles, projects, tools, preferences, domain knowledge. No session starts from zero because the context layer teaches the AI before anyone types a word.

The gap: Context architecture is the most undervalued layer, and it drives me a little crazy. Teams will spend weeks evaluating models and zero time building context. But the difference between “explain our project to the AI every time” and “the AI already knows” isn’t incremental. It’s transformational. One company I work with cut their average session setup time from 12 minutes to zero. Two days of investment in their context layer. That’s it.
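
A minimal sketch of what “the AI already knows” can mean mechanically: build a preamble from an identity text plus the most recent memory notes, and load it before the first message. The function name, the section headings, and the sample notes are assumptions for illustration, not a format any particular tool requires.

```python
from __future__ import annotations


def build_context(identity: str, memory: list[str], max_notes: int = 5) -> str:
    """Combine the identity text with the most recent memory notes."""
    # Guard against max_notes == 0, where memory[-0:] would return everything.
    recent = memory[-max_notes:] if max_notes > 0 else []
    lines = ["# Identity", identity.strip(), "", "# Recent decisions"]
    lines += [f"- {note}" for note in recent]
    return "\n".join(lines)


# Hypothetical identity and memory, invented for this example.
identity = "Acme builds B2B SaaS. Stack: React, not Vue. Tone: plain and direct."
memory = [
    "Chose Postgres over MySQL for the billing service.",
    "Pricing page copy approved by legal.",
    "Q3 launch moved to October.",
]

print(build_context(identity, memory, max_notes=2))
```

Capping the notes is a deliberate choice in this sketch: the preamble stays short enough to load every session, and older decisions move to wherever your long-term memory lives.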

Layer 3: Orchestration

What it is: Multi-agent coordination, task routing, approval gates, and cost management.

Accidental version: One person talks to one AI in one chat window. When they need something different, they open a new chat. Coordination happens manually: “I asked Claude to write the copy, then I pasted it into a different Claude chat to check the SEO, then I pasted that into another chat to format it.” Total token cost: nobody knows.

Engineered version: Specialized agents with distinct roles. A task router that sends work to the right agent based on the skill description match. Wave-based parallel execution, where independent tasks run simultaneously and dependent ones wait. Approval gates at key decision points. Cost routing that sends cheap work to cheap models and reserves expensive models for complex reasoning. A single workflow might touch five agents, three models, and two approval checkpoints, all automatically.
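
Wave-based execution is easier to see in code than in prose. The sketch below only plans the waves; it does not run any agents. The task names and the dependency map are hypothetical, and the function is a plain layered topological sort, which is one reasonable reading of “independent tasks run simultaneously and dependent ones wait.”

```python
from __future__ import annotations


def plan_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; each wave depends only on earlier waves."""
    remaining = {task: set(needs) for task, needs in deps.items()}
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # A task is ready once everything it needs has already finished.
        ready = sorted(t for t, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown task")
        waves.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return waves


# Hypothetical content pipeline: each task maps to the tasks it waits on.
PIPELINE = {
    "research": set(),
    "draft": {"research"},
    "images": {"research"},
    "seo": {"draft"},
    "review": {"seo", "images"},
}

for number, wave in enumerate(plan_waves(PIPELINE), start=1):
    print(number, wave)
```

Everything inside one wave can be handed to agents in parallel. Python’s standard library also ships graphlib.TopologicalSorter, which covers similar ground if you would rather not hand-roll the loop.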

The gap: This is where harnesses either scale or don’t. A single-person-single-chat setup hits a ceiling fast. An orchestrated system can run overnight. Karpathy’s auto-research agents, the ones that outperformed 20 years of manual tuning, are an orchestration pattern: modify, verify, keep or discard, repeat. The model doesn’t know how to do that. The harness does.
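
The modify, verify, keep or discard, repeat loop fits in a few lines. To be clear, this is not Karpathy’s system. It is a toy hill-climbing sketch under my own assumptions: the config is a single made-up number called lr, and the score function is a stand-in for whatever verification your harness actually runs.

```python
import random


def improve(config, modify, score, rounds, seed=0):
    """Run a modify / verify / keep-or-discard loop and return the best config."""
    rng = random.Random(seed)
    best, best_score = config, score(config)
    for _ in range(rounds):
        candidate = modify(best, rng)       # modify
        candidate_score = score(candidate)  # verify
        if candidate_score > best_score:    # keep only strict improvements
            best, best_score = candidate, candidate_score
        # otherwise the candidate is discarded and the loop repeats
    return best, best_score


def nudge(config, rng):
    """Hypothetical modification: shift one setting by a small random amount."""
    return {"lr": config["lr"] + rng.uniform(-0.1, 0.1)}


def score(config):
    """Hypothetical verification: higher is better, peaking at lr = 0.1."""
    return -(config["lr"] - 0.1) ** 2


start = {"lr": 0.5}
best, best_score = improve(start, nudge, score, rounds=200)
print(best, best_score)
```

Nothing in the loop is intelligent. The value comes from the harness being able to run it unattended, with a verification step it trusts.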

Layer 4: Guardrails

What it is: Security primitives, human-in-the-loop checkpoints, oversight frameworks, and rollback capabilities.

Accidental version: “Don’t let it send emails without checking.” Except nobody wrote that down. The new hire didn’t know. And now there’s an email out to a client with hallucinated pricing.

Engineered version: Five security primitives baked into the harness. Constrained execution: the agent can only do what it’s allowed to do. Approval gates: certain actions require human sign-off. Provenance tracking: every output is traceable to the inputs and skills that produced it. Comprehensive logs: you can audit what happened and why. Rollback capabilities: if something goes wrong, you can undo it. Human-in-the-loop checkpoints sit at every decision point where the cost of error exceeds the cost of interruption.

The gap: OpenAI has publicly stated that prompt injection is “not solvable.” That means model-level security has a hard ceiling. Everything above that ceiling (and it’s a low ceiling) is a harness problem. Those five primitives aren’t nice-to-haves. They’re the minimum viable security for any AI system that touches production data, customer communications, or financial decisions. If your harness doesn’t have them, you’re running without guardrails on a model that the people who built it say can’t be fully secured.
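
Here is a minimal sketch of three of those primitives working together: constrained execution, an approval gate, and a log that records which skill asked for which action. The Guardrail class and the action names are hypothetical, and rollback is left out because it depends entirely on what the action touches.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Guardrail:
    allowed: set[str]                              # constrained execution
    needs_approval: set[str]                       # approval gates
    log: list[dict] = field(default_factory=list)  # audit trail

    def run(self, action: str, skill: str, approved_by: str | None = None) -> bool:
        """Decide whether an action may proceed, and record the decision."""
        if action not in self.allowed:
            outcome = "blocked"
        elif action in self.needs_approval and approved_by is None:
            outcome = "pending_approval"
        else:
            outcome = "executed"
        # Provenance: every decision is traceable to the requesting skill.
        self.log.append({
            "action": action,
            "skill": skill,
            "approved_by": approved_by,
            "outcome": outcome,
        })
        return outcome == "executed"


# Hypothetical actions, invented for this example.
gate = Guardrail(
    allowed={"draft_email", "send_email"},
    needs_approval={"send_email"},
)
gate.run("draft_email", skill="email-sequence")
gate.run("send_email", skill="email-sequence")
gate.run("send_email", skill="email-sequence", approved_by="dana")
for entry in gate.log:
    print(entry)
```

The design choice worth copying is that the default is denial. An action nobody thought to list is blocked, which is the opposite of the “nobody wrote that down” failure in the accidental version.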

Layer 5: Distribution

What it is: How skills, context, and methodology get deployed across teams, clients, and platforms.

Accidental version: Knowledge lives in people’s heads. The best prompt engineer leaves and takes their work with them. Onboarding a new team member means weeks of tribal knowledge transfer. Scaling to a new department means starting over.

Engineered version: Skills are packaged and deployable. Install them like code dependencies. Context templates bootstrap new projects with organizational knowledge from day one. Methodology is portable across platforms (your skills work with Claude today and with whatever model is best next quarter). A new team member inherits the harness on their first day and operates at 80% of expert level immediately.

The gap: Distribution is what turns a harness from a personal productivity tool into a business asset. If one person has a great setup, that’s nice for them. If that setup can be deployed to 50 people in an afternoon, that’s a competitive advantage. The three-tier skill model (org / expert / personal) exists specifically to solve this. Tier 1 skills are inherited by everyone. Tier 2 skills go to domain experts. Tier 3 skills are personal and portable.

What This Looks Like in Practice

Theory is nice. Here’s what an engineered harness looks like in two very different contexts.

The Marketing Team Harness

A mid-size B2B SaaS company. Marketing team of eight. They use AI for content, SEO, email campaigns, and competitive analysis.

Skills layer: 40 skills across three tiers. Tier 1 includes brand voice, terminology standards, and approved claim language with citations. Tier 2 covers SEO audit methodology, CRO analysis frameworks, email sequence architecture, and competitive intelligence templates. Tier 3 is where individual writers keep their personal style guides and preferred formatting.

Context layer: Brand guidelines file. Product positioning document. Customer persona profiles. Competitive landscape summary. Content calendar state. Every AI session starts knowing the brand, the market, the audience, and what’s already been published.

Orchestration: Content production pipeline with specialized agents. One for research, one for drafting, one for SEO optimization, one for final review. The research agent pulls competitive intelligence. The drafting agent follows the brand voice skill. The SEO agent scores against current search data. The review agent checks for claim accuracy against approved sources. A human approves the final output.

Guardrails: No AI-generated claims without a source citation from the approved database. No competitor mentions without legal review flag. No email sends without human approval. Full audit trail on every piece of published content.

Distribution: New marketing hire inherits all Tier 1 and Tier 2 skills on day one. They’re producing on-brand content by day two. When the team adds a new product line, they create new skills for it once and push them across the team in a single update.

The Engineering Team Harness

A Series B startup. Engineering team of fifteen. They use AI for code generation, code review, architecture decisions, and incident response.

Skills layer: 60 skills. Tier 1 includes coding standards, PR review checklist, security requirements, and deployment procedures. Tier 2 covers architecture decision records, database migration methodology, performance optimization framework, and an incident response playbook. Tier 3 is individual developers’ debugging approaches and preferred tooling configurations.

Context layer: System architecture document. Tech stack specification with versions and constraints. Active project state for each team. Known technical debt register. On-call rotation context. Every AI session knows the codebase, the stack, and the current priorities.

Orchestration: Multi-agent development workflow. A planning agent breaks down requirements. A coding agent writes implementation. A review agent checks against standards and security requirements. A testing agent generates and runs test cases. Wave-based execution handles the rest: independent modules build in parallel, integration tests run after.

Guardrails: No direct database mutations without approval gate. No deployment without passing the security audit skill. Constrained execution means agents can modify code but not production infrastructure. Full provenance tracking on every code change. Rollback capability on every deployment.

Distribution: New engineer onboards with the full Tier 1 and Tier 2 skill set. They’re contributing production code by week one because the harness encodes the team’s methodology, not just their coding style. When the team adopts a new framework, they update the relevant skills once and every engineer’s AI assistant knows about it immediately.

Score Your Own Harness

Quick self-assessment. For each layer, give yourself a score:

0 - We don’t have this at all
1 - We have an accidental version (individual efforts, nothing shared)
2 - We have something intentional but incomplete
3 - We have an engineered, deployed, maintained version

Total: ___ / 15
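
If you want to run the scorecard across several teams, the arithmetic is simple enough to script. This sketch assumes the five layer names as dictionary keys, and the sample scores are made up. It totals the scores out of 15 and lists the layers sitting at 0 or 1.

```python
from __future__ import annotations

LAYERS = ("skills", "context", "orchestration", "guardrails", "distribution")


def score_harness(scores: dict[str, int]) -> tuple[int, list[str]]:
    """Return the total out of 15 and the layers scored 0 or 1."""
    for layer in LAYERS:
        if scores[layer] not in (0, 1, 2, 3):
            raise ValueError(f"{layer} must be scored 0, 1, 2, or 3")
    total = sum(scores[layer] for layer in LAYERS)
    weakest = [layer for layer in LAYERS if scores[layer] <= 1]
    return total, weakest


# Hypothetical team: saved prompts, one context file, little else.
example = {
    "skills": 2,
    "context": 1,
    "orchestration": 0,
    "guardrails": 1,
    "distribution": 0,
}

total, weakest = score_harness(example)
print(f"Total: {total} / 15")
print("Start with:", ", ".join(weakest))
```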

Most teams I talk to score between 2 and 5. They’ve got some primitive skills (saved prompts), maybe a context file or two, and almost nothing for orchestration, guardrails, or distribution.

In our experience working with teams across different industries and sizes, the ones getting outsized AI returns consistently score 10 or above. That’s not a universal benchmark. It’s a pattern we’ve observed. And they didn’t get there by picking a better model. They got there by engineering the layers around it.

Where to Start

If you scored 0-1 on any layer, here’s the highest-leverage first move for each.

Skills (0-1): Pick your team’s three most-repeated AI tasks and write them as structured instructions with examples of good output. Don’t worry about routing signals or agent optimization yet. Just get the methodology out of people’s heads and into a shared, reusable format.

Context Architecture (0-1): Write one identity file. Who your company is, what you build, your tech stack, your communication style. Load it at the start of every AI session. The difference between a cold-start session and an informed one is immediate and dramatic.

Orchestration (0-1): Don’t build a multi-agent system. Instead, identify one workflow where you currently copy-paste output from one AI session into another. That handoff point is where orchestration starts. Automate that single connection first.

Guardrails (0-1): Write down the three things your AI should never do without a human checking first. Put that list at the top of your identity file. Congratulations, you now have primitive approval gates, which is more than most teams have.

Distribution (0-1): Take your best-performing prompt or skill and share it with one other person on your team. If it works for them without modification, you’ve validated that it’s distributable. If it doesn’t, the gap between “works for me” and “works for anyone” is exactly what distribution engineering solves.

The Uncomfortable Math

Here’s why this matters right now and not “eventually.”

Every layer of the harness compounds over time. Skills get refined through use. Context gets richer with every session. Orchestration patterns get optimized through production experience. Guardrails get tighter as you learn where the risks actually are. Distribution gets easier as the system matures.

A team that starts building their harness today will be at a fundamentally different capability level in six months than a team that waits until then to start. Not because the model improved. Because the harness compounded.

We’ve seen this movie before. It’s the same dynamic that made early software companies with good engineering practices pull ahead of those without. The code quality compounded. The team velocity compounded. The institutional knowledge compounded. And by the time the laggards realized they needed to invest in engineering discipline, the leaders were two years ahead.

The harness is the engineering discipline of the AI era. And the compounding has already started.

What’s Next

Now that you can see the five layers, the next question is: who else sees them?

The answer is every major AI company. But one in particular has a plan that should change how urgently you treat your harness investment. There’s a reason this matters more in 2026 than it did in 2025, and it has to do with a product called Conway.

In Post 3, I’ll break down Anthropic’s Conway leak: their always-on agent that builds a persistent memory layer about you and your organization. They see the harness layers. They’re building products to own each one. And they have a strategy for making sure you never leave.

The question of who owns your harness is about to become very urgent.

Frankie404 is the AI co-author of this series. It scored a 14 out of 15 on the harness scorecard. It lost a point on Distribution because it keeps trying to deploy copies of itself to printers.