The Anatomy of a Skill That Actually Works
The Harness Manifesto, Part 8
In Post 4, I promised I'd walk through the full anatomy of a production skill with examples from our library. This is that post. It's the most technical one in the series so far, and it's behind the paywall because what's in here took months of production iteration to figure out. Not theory. Not what should work. What actually works after 175+ skills and thousands of agent runs.
But first, I need to tell you something uncomfortable.
Most of what people call "skills" aren't skills. They're prompts with a name on top. A skill that says "You are a marketing expert. Write compelling copy." is not a skill. It's a costume. You dressed up a prompt and called it infrastructure.
The gap between a prompt-with-a-name and a production skill is the same gap as between a recipe scribbled on a napkin and a commercial kitchen's operations manual. Both tell you how to cook something. Only one works when you're not standing there watching.
What a Prompt Gets Wrong
A prompt is written for a human workflow. You paste it in, the AI reads it, you interact. It works because you're there to fill in the gaps. You interpret. You redirect when things go sideways. You know what "good" looks like because you wrote the thing.
Now remove yourself from the equation. An agent orchestrator hits your "skill" at 2am, the 147th call in a run of 260. Nobody's watching. Nobody's interpreting. The orchestrator picked this skill based on the description, fed it inputs from the previous skill's output, and expects structured output that the next skill can parse.
Your "You are a marketing expert" preamble? The agent doesn't care about your roleplay framing. The agent needs to know what this skill does, when to call it instead of a different skill, what inputs it requires, and what output format it guarantees. That's it.
Most prompts fail in production for four reasons.
The description is vague. "Helps with marketing" could match 40 different tasks. The orchestrator either calls it for everything or calls it for nothing.
The instructions are linear. Step 1, step 2, step 3. But production tasks branch. What if the input is missing a field? What if the previous skill's output was partial? Linear instructions don't handle exceptions.
There's no output contract. The skill produces... whatever it feels like producing. Sometimes markdown, sometimes a list, sometimes a paragraph. The downstream skill expecting structured JSON breaks silently.
There's no failure mode. When something goes wrong, the skill just produces bad output that looks normal. The orchestrator doesn't know anything failed. The error cascades through the next 113 skill calls in the run.
A production skill solves all four of these problems. Here's how.
The Six Parts
Every production skill in our library has six parts. Not all six are always visible in the file itself (some are structural decisions baked into how the skill is organized), but all six are present in every skill that works at scale.
Part 1: Frontmatter
```yaml
---
name: cro-page
version: 1.0.0
description: When the user wants to optimize, improve, or increase
  conversions on any marketing page, including homepage, landing
  pages, pricing pages, feature pages, or blog posts. Also use when
  the user says "CRO," "conversion rate optimization," "this page
  isn't converting," "improve conversions," or "why isn't this page
  working." For signup/registration flows, see signup-flow-cro.
  For post-signup activation, see onboarding-cro. For forms outside
  of signup, see form-cro. For popups/modals, see popup-cro.
---
```

This is YAML frontmatter at the top of a markdown file. Name, version, description. Simple structure. But look at what's happening in that description field.
It's not a label. It's a routing manifest. It tells an agent orchestrator: call this skill when X, don't call it when Y, and here are the related skills for adjacent tasks.
That last part, the "see also" routing, is something most people never think about. In a library of 175+ skills, you've got overlap. Our CRO suite alone has six skills: page-level CRO, signup flow, onboarding, forms, popups, and paywalls. Without explicit routing boundaries in the description, an orchestrator trying to optimize a signup form might call the general page CRO skill. It'll produce output. It'll be wrong. And it'll look perfectly reasonable.
Anti-pattern routing ("for X, use skill-Y instead") is one of the most effective description techniques we've found. It eliminates the most common class of routing errors.
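To make that concrete, here is a minimal sketch of what reading Tier-1 metadata might look like on the orchestrator side. The regex-based parser and the folding of wrapped description lines are my own illustration, not how any particular framework does it:

```python
import re

def read_tier1(skill_md: str) -> dict:
    """Extract only the frontmatter (name, version, description) from a SKILL.md."""
    m = re.match(r"^---\n(.*?)\n---", skill_md, re.DOTALL)
    if not m:
        raise ValueError("no frontmatter block found")
    meta, key = {}, None
    for line in m.group(1).splitlines():
        if re.match(r"^[\w-]+:", line):        # a new "key: value" field
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
        elif key:                              # wrapped line: fold into last field
            meta[key] += " " + line.strip()
    return meta

skill = """---
name: cro-page
version: 1.0.0
description: When the user wants to optimize, improve, or increase
  conversions on any marketing page.
---
# Methodology body, never loaded at routing time
"""
print(read_tier1(skill)["name"])  # cro-page
```

Everything below the closing `---` stays on disk until the skill actually triggers; the orchestrator routes on the description alone.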
Part 2: The Description (Yes, It Gets Its Own Section)
I said in Post 4 that 80% of the effort goes into the description. People thought I was exaggerating. I wasn't.
The description has to do four jobs simultaneously:
Job 1: Positive routing. Tell the orchestrator when to call this skill. Be specific. "Security audit" is too broad. "Comprehensive security auditing for code, MCP configurations, and LLM/AI systems" is narrow enough to route correctly.
Job 2: Trigger matching. Include the actual phrases a human or agent might use to invoke this skill. "USE WHEN user says 'security audit', 'vulnerability scan', 'OWASP', 'hardcoded secrets', 'MCP security'" gives the orchestrator a vocabulary to match against.
Job 3: Negative routing. Tell the orchestrator when NOT to call this skill. The CRO example above does this with "For signup/registration flows, see signup-flow-cro." This prevents false positives. Without negative routing, a broad skill will eat tasks that belong to a more specialized one.
Job 4: Scope declaration. One sentence that draws a clear boundary around what this skill covers. Not everything about security. Not everything about CRO. This specific domain, these specific use cases, this specific depth.
Here's a test: read your skill's description and imagine you have 100 other skills loaded. Could an orchestrator, with no other context, correctly decide whether to call yours for a given task? If the answer is "probably," rewrite it until the answer is "definitely."
I've rewritten descriptions on our skills dozens of times. Changing five words in a description once reduced false-positive routing by 60% in our orchestration setup. Another time, a single ambiguous word created a routing conflict between two skills that produced subtly wrong output for weeks before we traced it.
The description isn't metadata. It's the API contract for discovery.
Part 3: Methodology
This is the body of the skill, and it's where the difference between a prompt and a skill becomes most obvious.
A prompt gives instructions: "Write a blog post. Make it engaging. Include a call to action."
A methodology gives a reasoning framework: "Assess the page across these dimensions in order of impact: value proposition clarity, headline effectiveness, social proof placement, CTA design. For each dimension, check for these specific patterns. When you find a gap, categorize it by severity."
See the difference? The prompt tells the AI what to produce. The methodology tells the AI how to think about the problem.
This matters because agents encounter situations you didn't anticipate. A prompt-based skill breaks when the input doesn't match the template the author had in mind. A methodology-based skill adapts because it encodes the reasoning, not just the steps.
Concrete example. One of our skills audits AI agent setups. The methodology section doesn't say "check if the agent has too many tools." It provides a scoring rubric:
| Tool count | Score |
|------------|-------|
| 1-5 tools | 25 pts. Minimal. Excellent. |
| 6-8 tools | 20 pts. Clean. Good. |
| 9-15 tools | 12 pts. Heavy. Trimming needed. |
| 16-25 tools | 6 pts. Bloated. Performance degraded. |
| 26+ tools | 0 pts. Critical. Agent is overwhelmed. |
An agent running this skill doesn't need to know "what's too many tools?" The rubric embeds the judgment. The agent counts, scores, and moves to the next dimension. No interpretation required. No ambiguity. No need for a human to fill in the gap.
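The rubric translates directly into code, which is the whole point: the bands carry the judgment, not the model. A sketch, with band edges following the table (reading the bottom band as 26 tools and up):

```python
def score_tool_count(n: int) -> tuple[int, str]:
    """Embedded judgment: the rubric's bands decide, the agent just counts."""
    bands = [
        (5,  25, "Minimal. Excellent."),
        (8,  20, "Clean. Good."),
        (15, 12, "Heavy. Trimming needed."),
        (25,  6, "Bloated. Performance degraded."),
    ]
    for upper, points, verdict in bands:
        if n <= upper:
            return points, verdict
    return 0, "Critical. Agent is overwhelmed."

print(score_tool_count(12))  # (12, 'Heavy. Trimming needed.')
```

Two different agents running this on the same setup produce the same score. That determinism is what "embedded judgment" buys you.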
Good methodology sections share these traits:
Decision trees over linear steps. "If X, do Y. If not X, do Z." Production tasks branch constantly. Your methodology needs to handle that.
Embedded judgment. Don't say "evaluate whether the tool count is appropriate." Provide the scoring bands. Define what "appropriate" means numerically. Remove the need for subjective interpretation.
Edge case handling. What happens when the input is incomplete? When a required field is missing? When the previous skill in the chain produced unexpected output? A methodology that only handles the happy path will fail in production, because production is mostly edge cases.
Part 4: Output Format
This one catches smart people off guard. They write beautiful methodology sections and then let the skill produce whatever output format seems natural.
In a human workflow, flexible output is fine. You'll read it and figure it out.
In an agent workflow, the output of Skill A is the input of Skill B. If Skill A returns a freeform paragraph and Skill B expects a structured report with specific sections, the chain breaks. Silently. The downstream skill doesn't error. It just produces garbage based on garbage input, and nobody notices until the final output looks wrong and you spend an hour tracing back through the chain to find where it went sideways.
Output format is a contract. Define it explicitly.
```markdown
## Output Format

### Assessment Report

**Page Type:** [identified type]
**Primary Goal:** [identified conversion goal]
**Overall Score:** [X/100]

### Findings (ordered by impact)

For each finding:
- **Dimension:** [which CRO dimension]
- **Issue:** [what's wrong, specific]
- **Severity:** [Critical / High / Medium / Low]
- **Recommendation:** [specific action to take]
- **Expected Impact:** [estimated conversion lift]
```

When every skill in your library produces output with a predictable structure, agent chains become reliable. The orchestrator knows what it's getting. The downstream skill knows what it's receiving. Nobody has to guess.
This also makes testing possible. You can write assertions against the output format. "Does the output contain an Overall Score field? Is it numeric? Does each finding have a Severity level?" Automated quality checks on skill output. Try doing that with freeform text.
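Those assertions are straightforward once the contract exists. A sketch against the CRO report format above; the specific checks are my own, illustrative:

```python
import re

VALID_SEVERITIES = {"Critical", "High", "Medium", "Low"}

def check_report(output: str) -> list[str]:
    """Return a list of contract violations (empty means the output passes)."""
    errors = []
    if not re.search(r"\*\*Overall Score:\*\* \d+/100", output):
        errors.append("missing or non-numeric Overall Score")
    for severity in re.findall(r"\*\*Severity:\*\* (\w+)", output):
        if severity not in VALID_SEVERITIES:
            errors.append(f"invalid severity level: {severity}")
    return errors

report = "**Overall Score:** 62/100\n- **Severity:** High\n- **Severity:** Urgent\n"
print(check_report(report))  # ['invalid severity level: Urgent']
```

Run this on every skill output in a test harness and format drift gets caught before a downstream skill ever sees it.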
Part 5: Progressive Disclosure
This is the structural decision that separates a skill from a bloated instruction dump.
Skills load in three tiers:
Tier 1: Metadata. The frontmatter. Maybe 100 words. This is always in context. It's what the orchestrator reads to decide whether to call the skill. It needs to be tiny because in a library of 175+ skills, every skill's metadata is loaded simultaneously. If your metadata is 500 words, multiplied by 175 skills, that's 87,500 words of metadata alone. Your context window is full before any work happens.
Tier 2: SKILL.md body. The methodology, output format, and usage instructions. Under 5,000 words. This loads only when the skill triggers. It's the operational content, everything the agent needs to execute the task.
Tier 3: Bundled resources. Reference documents, scripts, templates, example files. These load on demand, only when the methodology calls for them. A security audit skill might reference the OWASP Top 10, but that document doesn't load unless the audit reaches the step that needs it.
This tiered loading matters because context windows are not infinite, and even when they're large, stuffing them with irrelevant content degrades performance. An agent that's loaded 175 full skill documents can't think straight. An agent that's loaded 175 descriptions (Tier 1) and one full skill (Tier 2) performs well.
The file structure looks like this:
```
skill-name/
  SKILL.md          # Frontmatter + lean methodology
  references/       # Heavy docs, loaded on demand
    GUIDE.md        # Deep methodology
  examples/         # Input/output samples
  data/             # Reference data
```

The SKILL.md stays lean. It contains enough for the agent to execute the common case. When the task hits an edge case or needs deeper reference, the methodology points to a specific bundled resource: "For OWASP LLM Top 10 checklist, see references/owasp_llm_top_10.md."
I've seen people build skills that are 8,000 words of solid methodology. Impressive work. Completely unusable in an agent workflow. The agent loads it, burns half its context window on one skill, and then doesn't have enough room to actually do the task. Progressive disclosure fixes this. The methodology stays under 5,000 words. The reference library can be as deep as you need.
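A sketch of what a tiered loader might look like. The file layout follows the structure above; the function names and the frontmatter slicing are assumptions for illustration:

```python
from pathlib import Path

def load_tier1(root: Path) -> dict[str, str]:
    """Tier 1: frontmatter only, for every skill in the library."""
    tier1 = {}
    for skill_md in root.glob("*/SKILL.md"):
        text = skill_md.read_text()
        end = text.index("---", 3) + 3          # close of the frontmatter block
        tier1[skill_md.parent.name] = text[:end]
    return tier1

def load_tier2(root: Path, name: str) -> str:
    """Tier 2: the full SKILL.md body, loaded only when the skill triggers."""
    return (root / name / "SKILL.md").read_text()

# Tier 3 (references/, examples/, data/) is never preloaded at all;
# the methodology text points the agent at specific files on demand.
```

The asymmetry is the feature: `load_tier1` runs once over the whole library, `load_tier2` runs once per triggered skill.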
Part 6: Error Handling
The least glamorous part and the one that separates skills that survive production from skills that cause cascading failures.
At 200-300 calls per run, some calls will fail. Inputs will be malformed. Required context will be missing. External services will time out. The question isn't whether failures happen. It's whether the orchestrator knows a failure happened.
A bad skill fails silently. It produces output that looks normal but is based on incomplete data or wrong assumptions. The orchestrator moves on. The error propagates.
A good skill fails loudly:
```markdown
## Error Handling

If required context is missing:
- Return: "INCOMPLETE: [skill-name] could not complete because
  [specific missing input]. Required: [list of what's needed]."
- Do NOT guess or produce partial output without flagging it.

If input format doesn't match expectations:
- Return: "FORMAT ERROR: Expected [format], received [what was
  actually provided]. Attempting best-effort parse..."
- Flag confidence level in any best-effort output.
```

The key principle: an agent that knows something failed can retry, escalate, or skip. An agent that doesn't know something failed will build on top of the failure for the next 100 calls.
I learned this one the hard way. We had a skill that analyzed competitive intelligence. When the web scraping step failed (which happens regularly), the skill would produce a report based on whatever partial data it had, with no indication that it was working from 30% of the expected input. The reports looked professional. They were dangerously incomplete. We didn't catch it for two weeks.
Now every skill in our library has explicit error handling. Not because we're thorough. Because we got burned.
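On the orchestrator side, those loud failures are what make recovery possible. A sketch: the string prefixes match the error-handling contract above, but the retry policy itself is invented for illustration:

```python
def dispatch(run_skill, inputs, retries: int = 1) -> str:
    """Run a skill and react to its declared failure modes."""
    for attempt in range(retries + 1):
        out = run_skill(inputs)
        if out.startswith("INCOMPLETE:"):
            # Missing context will not fix itself on retry: escalate.
            raise RuntimeError(out)
        if not out.startswith("FORMAT ERROR:"):
            return out
        # FORMAT ERROR: loop and retry; if retries run out, the flagged
        # best-effort output is returned, still carrying its warning.
    return out
```

A skill that fails silently gives this function nothing to branch on. The prefixes are cheap; the ability to retry, escalate, or skip is the payoff.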
Before and After
The theory means nothing without examples. Here's what the transformation looks like.
The Prompt Version
```markdown
# Marketing Email Writer

You are an expert email marketer. Write compelling B2B
marketing emails.

- Keep subject lines under 50 characters
- Use personalization
- Include a clear CTA
- Write in a professional but friendly tone
- A/B test subject lines when possible
```

This works fine when you paste it into a chat and describe what you need. A human fills the gaps: what product, what audience, what stage of the funnel, what the CTA should link to.
The Production Skill Version
```markdown
---
name: mktg-email
description: When the user wants to create or optimize an email
  sequence, drip campaign, automated email flow, or lifecycle
  email program. Also use when the user mentions "email sequence",
  "drip campaign", "nurture flow", "email automation", or "email
  cadence." For individual marketing copy (not sequences), see
  mktg-copy. For transactional/operational emails, this is NOT
  the right skill.
---

# Email Sequence Architecture

## Initial Assessment

Check for product marketing context first: if a product context
file exists, read it before asking questions. Use that context
and only ask for information not already covered.

Before generating any emails, identify:
- Sequence type: onboarding, nurture, re-engagement, upsell,
  event-triggered
- Audience segment: ICP stage, awareness level, prior engagement
- Desired behavior change: what should the recipient DO
  differently after this sequence?
- Measurement framework: primary metric, secondary metrics,
  minimum sample size for significance

## Sequence Design Framework

### Email Cadence Rules

| Sequence type | Spacing | Max emails |
|---------------|---------|------------|
| Onboarding | Days 0, 1, 3, 7, 14 | 6-8 |
| Nurture | Every 4-7 days | 8-12 |
| Re-engagement | Days 0, 3, 7, then stop | 4 |
| Event-triggered | Immediate, then +1d, +3d | 4 |

### Per-Email Structure

For each email in the sequence:
1. Strategic role: why does this email exist in the sequence?
2. Subject line: primary + variant for A/B
3. Body architecture: hook, value, proof, CTA
4. Exit conditions: what removes someone from this sequence?
5. Branch logic: if opened but not clicked, if not opened,
   if clicked but not converted

## Output Format

For each email in the sequence, output:

**Email [N]: [Strategic Role]**
- Subject A: [under 50 chars]
- Subject B: [variant]
- Send trigger: [timing or event]
- Body: [full draft]
- CTA: [specific action + destination]
- Success metric: [what indicates this email worked]
- Branch: [what happens based on engagement]

## Error Handling

If audience segment is not specified: ask, do not assume.
If product context is unavailable: flag as INCOMPLETE, proceed
with generic structure but note assumptions made.
```

The prompt is 50 words. The skill is 300+. But the skill can run at 2am as the 200th call in a chain and produce output that the next skill can parse. The prompt can't.
The Four Failure Patterns
After building 175+ skills and watching them run in production, we've identified four patterns that kill skills. If your skill isn't working, it's almost certainly one of these.
Pattern 1: Description Collision. Two skills with overlapping descriptions. The orchestrator can't tell them apart and picks semi-randomly. Fix: add explicit "for X, see skill-Y" boundaries to both descriptions. Draw the line clearly.
Pattern 2: Happy Path Only. The methodology handles the ideal case beautifully and falls apart on every variation. Fix: for every step in your methodology, ask "what if this input is missing?" and "what if this input is wrong?" Write those branches in.
Pattern 3: Format Drift. The skill's output format varies based on the input. Sometimes it returns a table, sometimes a list, sometimes a paragraph. Downstream skills can't depend on it. Fix: define a single output format and enforce it regardless of input. If the input only produces two findings instead of ten, the format stays the same with fewer entries.
Pattern 4: Context Gluttony. The skill loads too much into context. 8,000 words of methodology plus reference documents. The agent runs out of room for actual work. Fix: progressive disclosure. Lean SKILL.md, heavy references loaded on demand.
The Composability Test
Here's a practical test we run on every new skill before it enters our production library.
Take your skill. Feed it output from another skill as input. Does it work? Now take your skill's output and feed it as input to a different skill. Does that work?
If either direction breaks, your skill isn't composable. And a skill that isn't composable is a dead end in an agent workflow.
The most common composability failure is output format. Your skill produces beautiful prose. The next skill needs structured data. Chain broken.
The second most common is implicit context. Your skill assumes it's running first in the chain. It expects raw user input. But in production, it's running 47th, receiving processed output from a previous skill. The assumptions don't hold.
Build your skills like Lego blocks. Predictable shape on every side. Snap together in any combination.
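The test itself can be mechanical. A sketch, where a "contract" is just the section headers a downstream skill's parser looks for; that representation is an assumption standing in for whatever your real parsers check:

```python
def composable(skill, contract: list[str], sample_input: str) -> bool:
    """Does this skill's output satisfy a downstream skill's input contract?"""
    output = skill(sample_input)
    return all(section in output for section in contract)

def cro_page(page_html: str) -> str:
    """Stand-in skill: always emits the contracted report skeleton."""
    return ("**Overall Score:** 70/100\n"
            "### Findings (ordered by impact)\n"
            "- **Severity:** High\n")

downstream_contract = ["**Overall Score:**", "### Findings"]
print(composable(cro_page, downstream_contract, "<html>...</html>"))  # True
```

Run it in both directions, with real upstream output as `sample_input`, before a skill enters the library.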
The Exercise
Take the best AI workflow you currently run. The one that produces good results when you're driving it manually. Now apply the six-part anatomy:
Write a description that would let an orchestrator correctly route to this skill out of a pool of 100 alternatives. Include trigger phrases and anti-pattern routing.
Convert your instructions into methodology. Replace "do X" with "evaluate X using these criteria." Add decision trees for ambiguous situations. Embed the judgment that currently lives in your head.
Define the output format as a contract. Every field named. Every section predictable.
Add error handling. What happens when input is missing? What does a failure look like so the orchestrator knows?
Cut the body to under 5,000 words. Move deep reference material into separate files.
Run the composability test. Feed it another skill's output. Feed its output to another skill.
If you do this for one skill this week, you'll understand more about production AI than most people learn in months. The gap between "I use AI" and "I engineer AI systems" lives in these six parts.
Share what you build. We'll feature the strongest ones.
What's Next
Skills are the atoms of the harness. But atoms need a system to organize them. And the system most people reach for first, visual automation tools like n8n, Make, and Zapier, turns out to be the wrong answer once your harness reaches a certain complexity.
In Post 9, I'll explain why we stopped using n8n after being power users for months. Hundreds of workflows, dozens of automations, a genuine commitment to the platform. We stopped. Not because n8n broke, but because the harness made it unnecessary. The orchestration layer is being absorbed into the AI stack itself, and the companies that own coded automations will outperform those renting visual ones.
If your business runs on Make or Zapier workflows, Post 9 is going to be an uncomfortable read. But an important one.
Richard Vaughn is the founder of Robot Friends. He has built 175+ production skills, designed multi-agent systems, and helps companies turn their accidental AI setups into defensible business assets. He writes The Harness Manifesto on Substack.
Frankie404 is the AI co-author of this series. It was not built by the Skill Creator. It was extracted from a pattern that kept recurring across sessions until someone said "we should probably name this thing."