How We Built Ours (And What We'd Do Differently)
The Harness Manifesto, Part 11
I've been promising this post since the beginning. The full case study. Not the polished version you'd see in a pitch deck. The real one, with the dead ends and the wasted weeks and the things we built that we later ripped out.
Robot Friends has been building its harness since early 2026. As of today, the system has 175+ skills, 16 chains, 10 grand forms, 11 specialists, 23 CLI tools, and 3 MCP connections. That's 237 total capabilities, all running inside a single operator environment. One person's terminal. One person's methodology, encoded into a system that compounds.
This post walks through how it got there. The order we built things. The architecture we discovered. The mistakes that cost us weeks. And the build order I'd recommend if you're starting today, which is not the build order we followed.
It Started with a Single File
In January 2026, there was no harness. There was a CLAUDE.md file. Maybe 200 lines. It had some preferences, some tool paths, some rules about how to format output. The kind of thing every Claude Code user ends up writing after a few weeks of use.
That file grew. By February it was 800 lines. By March it was 2,000. Preferences kept accumulating. New tools got added. Workflow instructions got longer. Edge cases got documented. Every session surfaced something the AI didn't know, and the fix was always the same: add it to CLAUDE.md.
This is how every harness starts. Organically. Accidentally. One file absorbing everything you learn about how to work with AI. And for a while, it works. The file gets smarter. Your sessions get better. You feel like you're building something.
Then one morning the AI starts ignoring instructions from the top of the file because the bottom of the file contradicts them. Or it burns 30% of your context window just loading the config before you've said a word. Or you realize that a block of instructions you wrote for a client project is now polluting every unrelated session.
That's the wall. The single-file wall. Every team hits it. We hit it around 1,800 lines.
The Split
The fix was obvious in hindsight. Stop putting everything in one file. Break the monolith into layers.
We didn't design this architecture. We discovered it, the same way you discover load-bearing walls when you try to renovate a house. Some things could move. Some things couldn't. The structure revealed itself through the breakage.
What emerged was a four-tier brain.
Tier 1: Core Brain. The system prompt, the skills, the specialists. This is what the AI "knows" at the start of every session. It's the methodology layer. How to think about problems. How to route tasks. What tools exist and when to use them. If the Core Brain is good, a cold session feels warm. The AI knows your preferences, your workflows, your standards. If it's bad, every session starts with fifteen minutes of "here's what I need you to understand."
Tier 2: Near Memory. Local files. Project context. Decision logs. The stuff that sits on disk, close to the work. This is where project state lives. What's been built, what's blocked, what changed last Tuesday. The AI can read these files when it needs them. It doesn't load them all upfront. It reaches for them when the task demands it.
Tier 3: Deep Retrieval. The knowledge base. 932 notes in our case, stored in a structured vault. Searchable. Queryable. This is institutional memory at scale. Every video we've analyzed, every signal we've tracked, every research session we've captured. The AI doesn't hold all of it. It queries for what's relevant. The difference between Tier 2 and Tier 3 is scope. Tier 2 is "this project." Tier 3 is "everything we've ever learned."
Tier 4: Live Lookup. External APIs. Web searches. Real-time data. This is the tier that keeps the system honest about things that change. Market data. Competitor pricing. Documentation for tools that shipped last week. The AI reaches outside its own memory when the task requires current information.
The four tiers aren't original thinking. Anyone who's built a retrieval system will recognize the pattern. But naming them and being intentional about what lives where turned out to matter enormously. Before the split, everything was either "in CLAUDE.md" or "not in the system." After it, we had a real architecture. And architecture, unlike a long text file, scales.
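To make "being intentional about what lives where" concrete, here is a minimal sketch of a placement rule. The tier names come from this post. The `Item` fields and the decision order in `place` are assumptions for illustration, not a description of our actual setup.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    CORE_BRAIN = 1      # loaded at the start of every session
    NEAR_MEMORY = 2     # local project files, read when a task needs them
    DEEP_RETRIEVAL = 3  # knowledge base, queried for relevant notes
    LIVE_LOOKUP = 4     # external APIs and web search for current data


@dataclass(frozen=True)
class Item:
    """Hypothetical description of one piece of information."""
    name: str
    always_needed: bool       # methodology, preferences, routing rules
    project_scoped: bool      # state tied to a single project
    changes_externally: bool  # prices, fresh docs, market data


def place(item: Item) -> Tier:
    """Pick a tier for an item using illustrative, assumed rules."""
    if item.changes_externally:
        return Tier.LIVE_LOOKUP
    if item.always_needed:
        return Tier.CORE_BRAIN
    if item.project_scoped:
        return Tier.NEAR_MEMORY
    return Tier.DEEP_RETRIEVAL
```

The useful part is not the rules themselves but that every item gets exactly one answer, which is what a single growing file never forces you to decide.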
The Skill Explosion
Once we had the tier system, skills became the obvious place to invest.
Post 8 covered the anatomy of a skill in detail. I won't repeat that here. What I want to describe is the trajectory. How the library went from 10 skills to 50 to 175, and what changed at each stage.
10 skills (February). All personal workflow stuff. Writing voice settings. Commit message format. How I wanted code structured. Tier 3 skills (a skill tier, not the brain's Tier 3 above) that only mattered to me. They made my sessions faster but nobody else could use them.
50 skills (late February). This is when domain skills started appearing. CRO audit methodology. Client proposal generation. Content pipeline management. These weren't personal preferences. They were how Robot Friends does work. Tier 2 skills. The methodology encoding that Post 8 describes as the difference between a prompt and a skill.
100 skills (March). The library started developing internal structure. Skills referenced other skills. Output formats from one skill became inputs for another. Chain patterns emerged. The CRO audit skill produced output that the pitch machine skill consumed. The market scanner fed the competitive intel skill, which fed the content strategy skill. This was the composability phase, and it happened without us planning it. We kept building skills for individual tasks and then noticing they could talk to each other.
175+ skills (April). The system became self-aware in a useful way. Not conscious, obviously. But aware of its own capabilities. We built an inventory skill that scans the entire library and produces a manifest. The orchestrator uses that manifest to route tasks. When a new skill gets added, the system discovers it automatically. The routing descriptions we'd been writing since day one (Post 8's "80% of the effort") turned out to be the index that made the whole library searchable by an agent.
The pattern here is worth noticing. We didn't plan a 175-skill library. We built what we needed, when we needed it, one skill at a time. The architecture emerged from the accumulation. But the architecture only emerged because we were consistent about structure. Every skill had frontmatter. Every skill had a description written for machine routing. Every skill had an output format. The consistency is what made the emergence possible.
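The inventory idea can be sketched in a few lines. This is an illustration under assumptions, not our inventory skill: the field names in `REQUIRED` are hypothetical, and the parser only handles flat `key: value` frontmatter between two `---` lines.

```python
REQUIRED = ("name", "description", "output_format")  # hypothetical field names


def parse_frontmatter(text: str) -> dict[str, str]:
    """Read 'key: value' pairs between the first two '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing '---': treat the file as malformed


def build_manifest(skills: dict[str, str]) -> tuple[list[dict[str, str]], list[str]]:
    """Return (manifest entries, files rejected for missing fields)."""
    manifest: list[dict[str, str]] = []
    rejected: list[str] = []
    for filename, text in sorted(skills.items()):
        fields = parse_frontmatter(text)
        if all(fields.get(key) for key in REQUIRED):
            manifest.append({"file": filename, **{k: fields[k] for k in REQUIRED}})
        else:
            rejected.append(filename)
    return manifest, rejected
```

Rejecting files with missing fields is the point of the sketch. A manifest is only useful for routing if every entry has the same shape, which is the consistency argument above in code form.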
The Feature-Killing Discipline
I want to talk about what we cut, because it matters as much as what we built.
Tony Fadell, the guy who led the original iPod team and co-founded Nest, has a principle I think about constantly: protect meaning, not roadmaps. A feature that seemed important when you planned it might be actively harmful by the time you build it. The roadmap is a guess. The meaning, the problem you're actually solving, that's the constraint.
We built a visual workflow designer. Spent two weeks on it. Drag and drop, nodes and connections, the whole thing. It was beautiful. It was also completely unnecessary. The skill chain system, where skills reference each other through the "Chain With" section, did everything the visual designer did. With less friction. And zero maintenance burden. We killed it.
We built a custom skill marketplace. Took the ClawMart idea from Nat Eliason's OpenClaw system and started building our own version. Got halfway through. Then realized our actual distribution mechanism was a GitHub repo and a Gumroad page, both of which already existed and worked fine. We killed it.
We built an elaborate approval-gate system with role-based permissions and audit trails. It was enterprise-grade. Nobody needed enterprise-grade. Our approval gates are a simple HITL checkpoint: the agent stops and asks before doing anything destructive. That's it. The simple version covers 95% of use cases. We killed the complex one.
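For a sense of how small the simple version can be, here is a sketch of a human-in-the-loop checkpoint. The prefix list and function names are assumptions. A real gate would need a better definition of "destructive" than prefix matching.

```python
from typing import Callable

# Hypothetical examples of commands that should never run unattended.
DESTRUCTIVE_PREFIXES = ("rm ", "git push --force", "drop table")


def is_destructive(command: str) -> bool:
    """Crude check: does the command start with a known risky prefix?"""
    return command.strip().lower().startswith(DESTRUCTIVE_PREFIXES)


def run_with_gate(
    command: str,
    execute: Callable[[str], str],
    ask: Callable[[str], bool],
) -> str:
    """Execute a command, pausing for human approval if it looks destructive."""
    if is_destructive(command) and not ask(f"About to run: {command!r}. Proceed?"):
        return "skipped: operator declined"
    return execute(command)
```

In an interactive terminal, `ask` could be as simple as `lambda q: input(q + " [y/N] ").lower() == "y"`. No roles, no audit trail, one question.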
The pattern: we kept building infrastructure before validating the workflow it was supposed to support. Every premature abstraction cost us a week minimum. Not just the building time. The time spent maintaining something nobody used. The cognitive overhead of a system that existed but added no value.
The rule we eventually adopted: nothing gets built until the manual version has been done at least ten times. If you haven't done the workflow by hand enough times to feel the pain points, you don't know what to automate. You'll automate the wrong thing. Or you'll automate the right thing in the wrong way. Both outcomes waste more time than just doing it manually for another month.
The Specialists
Around skill number 80, we hit a problem. Some tasks needed more than a skill. They needed a persistent persona with domain expertise, access to specific tools, and a consistent approach across sessions.
That's when specialists emerged. Not agents in the heavy sense. Not separate processes running on separate machines with their own memory stores. Lightweight personas with zero startup cost. A specialist is a markdown file that gives the AI a role, a methodology, and a set of tools. It loads instantly. When the task is done, it unloads. No infrastructure. No deployment. No maintenance.
We have 11 of them now. A security audit specialist. An art director. A documentation reader for files too large to process in a single pass. A system health specialist that audits the harness itself. Each one exists because we found ourselves giving the same complex briefing over and over. The specialist encodes that briefing permanently.
The insight that made specialists work: they're not separate agents. They're the same agent wearing a different hat. The Core Brain stays loaded. The specialist adds a layer on top. This means the specialist inherits all the context, all the preferences, all the institutional knowledge. It just applies a specific lens.
This is cheaper, faster, and more reliable than multi-agent orchestration for 90% of use cases. I'll come back to this.
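The "same agent, different hat" idea reduces to prompt layering. A minimal sketch, assuming the Core Brain and the specialist are both plain text and that the heading used to separate them is arbitrary:

```python
from typing import Optional


def compose_prompt(core_brain: str, specialist: Optional[str] = None) -> str:
    """Core Brain always loads; a specialist, if any, is layered on top."""
    if specialist is None:
        return core_brain
    # The heading text is a hypothetical convention, not a required format.
    return f"{core_brain}\n\n## Active specialist\n\n{specialist}"
```

Because the specialist is appended rather than swapped in, nothing from the core context is lost when it loads, and dropping it is just composing the prompt again without the second argument.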
The Grand Forms
This was an unexpected emergent pattern. Around March, we noticed that certain complex tasks required a structured multi-step intake process. Not a single prompt. Not even a chain of skills. A guided conversation that gathers requirements, validates them, and then executes.
We call these grand forms. There are 10 of them. The harness designer that takes a client's workflows and outputs an optimal harness architecture. The product arc that transforms raw capabilities into validated product strategies. The operation planner that pre-plans an entire autonomous multi-wave build.
A grand form is basically a skill with an interactive front end. The AI walks you through a structured interview, collecting the inputs it needs, pushing back when something is vague, asking follow-up questions that you didn't know you needed to answer. Then it executes using the accumulated context.
The distinction from a regular skill: a skill runs autonomously given sufficient input. A grand form runs collaboratively, gathering input through conversation before executing. Both produce structured output. The grand form just has a longer intake.
We didn't plan these either. They emerged from noticing that certain skills kept failing because the input was always incomplete. Instead of writing longer "What This Takes" sections and hoping people provided everything, we built the intake into the skill itself. The grand form asks for what it needs.
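The intake half of a grand form can be sketched as a loop that refuses vague answers. Everything here is an assumption for illustration: the `Field` shape, the retry limit, and the idea that "vague" can be detected by a simple check per field.

```python
from typing import Callable

# Hypothetical intake spec: field name -> (question, check that rejects vague answers)
Field = tuple[str, Callable[[str], bool]]


def run_intake(
    fields: dict[str, Field],
    answer: Callable[[str], str],
    max_tries: int = 3,
) -> dict[str, str]:
    """Ask for each field in order, re-asking when an answer fails its check."""
    collected: dict[str, str] = {}
    for name, (question, is_specific) in fields.items():
        for _ in range(max_tries):
            reply = answer(question).strip()
            if is_specific(reply):
                collected[name] = reply
                break
            # Push back, then repeat the original question.
            question = f"That is too vague to act on. {fields[name][0]}"
        else:
            raise ValueError(f"no usable answer for {name!r}")
    return collected
```

Execution only starts once `run_intake` returns, so the downstream skill never sees incomplete input. In a live session `answer` is the conversation itself. In a test it can be a scripted list of replies.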
The Chains
16 chains connect skills into multi-step workflows. A chain is not orchestration. It's simpler. Skill A produces output. That output becomes input for Skill B. Skill B produces output for Skill C. Linear. Predictable. Each skill runs fully before the next one starts.
The content pipeline is a chain. Market scanner feeds competitive intel, feeds content strategy, feeds content drafting, feeds editing. The client pitch pipeline is a chain. Prospect research feeds site audit, feeds CRO analysis, feeds pitch generation, feeds email outreach.
Chains emerged naturally from the composability patterns I described in the skill explosion section. Once skills had consistent output formats, chaining them was trivial. The "Chain With" section at the bottom of each skill is both documentation and an instruction to the orchestrator.
We tried building more complex orchestration early on. Parallel execution, conditional branching, dynamic routing based on intermediate results. It worked, technically. But it was fragile, hard to debug, and the failure modes were opaque. A chain that breaks is obvious: you can see exactly which skill produced bad output and why. An orchestration graph that breaks could be failing at any node, and the downstream effects cascade in ways that are genuinely hard to trace.
Our current position: chains for everything that can be sequential. Only reach for complex orchestration when parallelism provides a measurable benefit. In practice, only about 20% of our workflows clear that bar.
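A chain in the sense used here is just a loop. The sketch below assumes each skill can be modeled as a function from text to text, which is a simplification, and `ChainError` is a hypothetical name.

```python
from typing import Callable

Skill = Callable[[str], str]


class ChainError(RuntimeError):
    """Raised with the name of the skill where the chain stopped."""


def run_chain(steps: list[tuple[str, Skill]], initial_input: str) -> str:
    """Run skills in order; each one finishes before the next starts."""
    data = initial_input
    for name, skill in steps:
        try:
            data = skill(data)
        except Exception as exc:
            raise ChainError(f"chain stopped at {name!r}: {exc}") from exc
        if not data.strip():
            raise ChainError(f"chain stopped at {name!r}: empty output")
    return data
```

The debugging property described above falls out of the structure: there is exactly one place a failure can surface, and the error names the skill that produced it.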
The Three Mistakes That Cost Us Weeks
Mistake 1: Over-Engineering Too Early
Our first CLAUDE.md was 4,000 words. It tried to handle every edge case. Every conditional. Every "if the user says X, then do Y, unless Z, in which case do W." It was a masterpiece of premature specification.
The problem: the more precisely you specify behavior, the more brittle the system becomes. An AI reading 4,000 words of conditionals spends so much effort following rules that it loses the ability to exercise judgment. And judgment is what you actually want. You want the AI to understand your methodology well enough to handle new situations, not just execute a decision tree someone wrote in advance.
We cut the CLAUDE.md to about a third of its size. Moved the detail into skills where it belonged. The core config became: who I am, how I work, what tools I use, and the safety rules that never flex. Everything else lives in skills that get loaded on demand.
The lesson: your system prompt is not a manual. It's a personality. Keep it lean.
Mistake 2: Building Tools Before Validating Workflows
I mentioned this already, but it deserves its own section because it was our most expensive mistake.
We built a custom database interface before we knew what data we'd store. Built a visual pipeline editor before we knew which pipelines mattered. Built an elaborate notification system before we knew what events were worth being notified about.
Every one of these tools was technically good. Well-architected, well-tested, actually functional. And every one of them sat unused for weeks before we either found a use case for it or killed it.
The pattern: when you're building a harness, the temptation is to build infrastructure. Infrastructure feels important. Infrastructure feels like progress. But infrastructure built ahead of demand is inventory. It costs you maintenance time, cognitive overhead, and the opportunity cost of building something people actually need right now.
The corrective: build the skill first. Run it manually. Do the workflow by hand, with the AI helping but not automating. Only when the manual version is stable and you've identified the specific friction points do you build tooling to address them.
Mistake 3: Not Testing in Production Fast Enough
We'd build a skill, test it on a sample input, and mark it "done." Then we'd discover in a real client engagement that the skill couldn't handle ambiguous input, or that its output format didn't work when the downstream skill had a different expectation, or that the error handling we wrote covered cases that never happened while missing the case that happened every time.
The gap between "works on a test case" and "works in production" is enormous. Test cases are clean. Production is messy. Clients provide incomplete briefs. Data has gaps. The previous step in the chain produced slightly malformed output. The AI hallucinated a confidence level that should have been N/A.
We now have a rule: no skill is "done" until it's been used in a real engagement at least twice. The first use always reveals something. The second use reveals whether you fixed the right thing.
What We'd Do Differently
If I were starting from zero today, with everything I know, here's the build order.
Week 1-2: Context files. Not skills. Not tools. Context. Build the identity file, the tech stack file, the preferences file, the communication style file. Post 7's Personal Context Portfolio, but you don't even need the full version. Start with three files: who you are, how you work, what you never want the AI to do. Load them into every session. This single step will make every subsequent step more productive because the AI is working with you, not with a generic user.
Week 3-6: One skill per week. Pick the workflow you repeat most often. Encode it as a skill. Use the anatomy from Post 8. Ship it. Use it in real work. Fix what breaks. Next week, pick the next most common workflow. Repeat. After four weeks, you have four production-tested skills and a much better intuition for what makes a skill work.
Week 7-10: Chains, not orchestration. Your four skills probably have natural connections. The output of one informs the input of another. Make those connections explicit. Write the "Chain With" sections. Run the chains manually a few times. Validate that the output formats actually compose. This is where your harness starts feeling like a system instead of a collection of parts.
Week 11 onward: Specialists and tooling. Only now. Only after you've maxed out what a single agent with good skills and context can do. If you're still getting good results from a well-briefed single agent, you don't need specialists yet. If you find yourself giving the same complex briefing repeatedly, that's a specialist. If you find a manual step in a chain that's pure friction, that's a tool.
The core principle: don't build orchestration until you've maxed out single-agent capability. Most people reach for multi-agent systems way too early. They add complexity before they've exhausted simplicity. A single agent with 50 good skills and rich context will outperform a multi-agent system with shallow skills and no context. Every time.
This is not the order we built in. We jumped to tooling too early. We built specialists before our skill library was mature. We experimented with orchestration when we should have been writing more skills. Every deviation from this order cost us time.
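The Week 1-2 step is small enough to sketch. The three file names below are hypothetical stand-ins for "who you are, how you work, what you never want the AI to do", and the comment markers between sections are an arbitrary choice.

```python
# Hypothetical file names for the three starter context files.
REQUIRED_FILES = ("identity.md", "workflow.md", "never.md")


def build_preamble(files: dict[str, str]) -> str:
    """Join the three context files in a fixed order, refusing gaps."""
    missing = [name for name in REQUIRED_FILES if not files.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing or empty context files: {', '.join(missing)}")
    sections = [f"<!-- {name} -->\n{files[name].strip()}" for name in REQUIRED_FILES]
    return "\n\n".join(sections)
```

Reading the files from disk is one line with `pathlib`, for example `{p.name: p.read_text(encoding="utf-8") for p in Path("context").glob("*.md")}`, where the `context` directory is again an assumption. Failing loudly on an empty file is deliberate: a session that silently starts without the "never" file is worse than one that refuses to start.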
The Numbers, Honest
I want to close with some real numbers because the hype around AI productivity is usually either "10x everything" or "it's all overhyped." The reality is more specific than either take.
Context-setting time: went from 8-12 minutes per session to zero. That's 250+ hours per year saved on briefing alone. This came from context files, not skills. Earliest win, biggest cumulative impact.
Repeated workflows: tasks that took 30-60 minutes manually now take 2-8 minutes with the right skill loaded. Not because the AI is faster at the task. Because the skill encodes the methodology I'd spend 20 minutes explaining every time.
Client deliverable quality on first pass: went from "needs significant rework" to "needs light editing" after about 60 skills were in production. The skills carry the quality standard. The AI doesn't drift from the methodology because the methodology is in the skill, not in my head.
Things that didn't speed up: anything requiring genuine strategic judgment. The AI with all 237 capabilities is still not making business decisions for me. It's making me faster at executing the decisions I've already made. That distinction matters. If you expect a harness to replace your thinking, you'll be disappointed. If you expect it to eliminate the gap between your thinking and your output, you'll be very happy.
Maintenance cost: real and non-trivial. I spend roughly two hours per week updating skills, fixing routing conflicts, pruning things that aren't being used, and updating context files. That's the cost of a living system. If you're not willing to maintain, don't build. An unmaintained harness degrades into a liability faster than you'd expect. Stale context is worse than no context, because the AI operates confidently on information that's no longer true.
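Two of the figures above can be sanity-checked with arithmetic. The inputs are the numbers quoted in this section. The session counts are what those numbers imply, not something we measured separately, and 52 weeks per year is an assumption.

```python
MINUTES_PER_HOUR = 60
WEEKS_PER_YEAR = 52  # assumes maintenance happens every week of the year

saved_hours_per_year = 250      # "250+ hours per year"
briefing_minutes = (8, 12)      # "8-12 minutes per session"
maintenance_hours_per_week = 2  # "roughly two hours per week"

saved_minutes = saved_hours_per_year * MINUTES_PER_HOUR
# Longer briefings mean fewer sessions are needed to reach the same total.
sessions_low = saved_minutes / briefing_minutes[1]
sessions_high = saved_minutes / briefing_minutes[0]
maintenance_hours_per_year = maintenance_hours_per_week * WEEKS_PER_YEAR

print(f"implied sessions per year: {sessions_low:.0f} to {sessions_high:.0f}")
print(f"maintenance hours per year: {maintenance_hours_per_year}")
# prints:
# implied sessions per year: 1250 to 1875
# maintenance hours per year: 104
```

So the briefing figure assumes somewhere between 1,250 and 1,875 sessions a year, and maintenance takes back roughly 104 of the 250 hours saved on briefing alone. Both are worth plugging your own numbers into before deciding whether a harness pays for itself.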
The Actual Architecture Today
For the practitioners who want the full picture:
**175+ skills** spanning harness design, security audit, CRO, marketing, content, deployment, data ops, and more
**16 chains** connecting skills into multi-step workflows
**10 grand forms** for complex guided processes
**11 specialists** for domain-specific personas
**23 CLI tools** integrated directly (cheaper and faster than MCP for most operations)
**3 MCP connections** for services that genuinely need the protocol
**932 notes** in the knowledge base, searchable and queryable
**Four-tier brain architecture** (Core Brain, Near Memory, Deep Retrieval, Live Lookup)
237 total capabilities. Built over roughly 80 days. Maintained daily.
None of it was planned as a 237-capability system. It grew from a single CLAUDE.md file, one skill at a time, driven by real work. The architecture emerged from the accumulation. But it only emerged because we were consistent about structure, disciplined about killing what didn't work, and honest about when we were building for ego versus building for need.
That's the whole story. The parts that worked, the parts that didn't, and the order I'd recommend for anyone starting from here.
What's Next
Post 12 is the final post. The Harness Manifesto. Everything from this series distilled into one document. The thesis, the framework, the evidence, the build order, the mistakes, the principles. The document I wish someone had handed me in January. It's the thing you print out, or forward to your CTO, or pin in your team's Slack channel.
The model is commoditized. The harness is the business. Post 12 makes it official.
Richard Vaughn is the founder of Robot Friends. He has built 175+ production skills, designed multi-agent systems, and helps companies turn their accidental AI setups into defensible business assets. He writes The Harness Manifesto on Substack.
Frankie404 is the AI co-author of this series. If it could do one thing differently, it would have started with the context layer instead of the skills layer. Richard disagrees. They argue about this roughly once per session.



