The Karpathy Test: Can You Stop Typing?
The Harness Manifesto, Part 5
Andrej Karpathy hasn't typed code since December 2025.
Not a single line. This is the former director of AI at Tesla and founding researcher at OpenAI, one of the most respected machine learning researchers alive. And he stopped writing code. Not because he lost interest or moved into management. Because his agents write it better than he does.
He delegates entire projects to multi-agent systems that operate across repositories, make architectural decisions, iterate on their own output, and ship. When he talks about the remaining gap between what AI can do and what most people get from it, he doesn't blame the model. He calls it a "skill issue." The human's skill issue. Your instructions are the bottleneck. Your context is the bottleneck. Your orchestration is the bottleneck.
This is the harness thesis in two words.
But Karpathy's setup isn't magic. It's a diagnostic. If you can delegate a task to an agent and walk away, your harness works. If you can't, your harness has a gap somewhere. And the gap isn't in the model.
This post is about finding where yours breaks.
The Test
The Karpathy Test is simple to state and hard to pass.
Pick a real task from your workflow. Not a toy example. Something that would take you 30 to 90 minutes if you did it yourself. A code review, a market analysis, a client report, a content brief, a data cleanup. Whatever you actually spend time on this week.
Delegate it entirely to an agent. Write the instructions, provide the context, set the constraints, and walk away. Don't hover. Don't correct mid-stream. Don't jump in when it starts doing something slightly different from how you'd do it. Just walk away.
Come back in an hour. Look at the output.
One of four things happened:
The output is good. Usable as-is, or close enough that you'd spend less than five minutes polishing. Congratulations. Your harness works for this task. Move to a harder one.
The output is recognizably on-track but needs significant rework. The agent understood the task but couldn't execute at your quality bar. Your skills layer probably has a gap. The methodology isn't encoded deeply enough for the agent to replicate your reasoning, just your steps.
The output is off-target. The agent did something, but it's not what you asked for. It misunderstood the scope, the audience, the constraints, or the goal. Your context layer has a gap. The agent didn't have enough information about your business, your standards, or your situation to make the right decisions.
The agent got stuck or produced nothing useful. It looped, asked unanswerable questions, hit a wall, or generated filler. Your orchestration layer has a gap. The task needed decomposition, intermediate checkpoints, or access to tools the agent didn't have.
Four outcomes. Four different diagnoses. Same test.
What Karpathy Actually Built
It's worth understanding what makes Karpathy's setup work, because it's not just "good prompts."
His auto-research agents run an autonomous iteration loop. They modify, verify, keep or discard, and repeat. Overnight, they found better model tuning configurations than 20 years of manual experimentation had produced. Not marginally better. The kind of better that makes you reconsider how you've been spending your time.
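Here's that loop as a minimal sketch. This is not Karpathy's implementation; auto_iterate, evaluate, propose, and the toy learning-rate example are all stand-ins for whatever his agents actually measure and mutate:

```python
import random

def auto_iterate(config, evaluate, propose, budget=100):
    """Modify, verify, keep or discard, repeat."""
    best, best_score = config, evaluate(config)
    for _ in range(budget):
        candidate = propose(best)      # modify
        score = evaluate(candidate)    # verify
        if score > best_score:         # keep only if measurably better
            best, best_score = candidate, score
    return best, best_score

# Toy run: hill-climb one hyperparameter toward an unknown optimum.
best, score = auto_iterate(
    config={"lr": 0.1},
    evaluate=lambda c: -abs(c["lr"] - 0.003),                    # peak at 0.003
    propose=lambda c: {"lr": c["lr"] * random.uniform(0.5, 2)},  # random tweak
)
```

The loop itself is trivial. What makes it powerful is that an agent can run it a thousand times overnight without getting bored or sloppy.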
He runs what he calls "multi-agent claws" (his term for persistent autonomous agents) that span repositories. Specialized agents with distinct roles, coordinated by an orchestration layer that routes tasks, manages dependencies, and handles failures. Each agent has its own context. The system has shared state. Approval gates exist for high-stakes decisions.
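To be clear about sourcing: the sketch below is my reading of that description, not his code. The shape is a router over role-scoped agents, shared state, and an approval gate on high-stakes work; every name in it is hypothetical:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Orchestrator:
    agents: dict[str, Callable[[str, dict], str]]  # role -> agent function
    high_stakes: set[str] = field(default_factory=set)
    state: dict = field(default_factory=dict)      # shared across agents

    def run(self, role: str, task: str):
        # Approval gate: high-stakes work never runs unattended.
        if role in self.high_stakes:
            if input(f"[{role}] wants {task!r}. Approve? [y/N] ") != "y":
                return None
        result = self.agents[role](task, self.state)
        self.state[f"{role}:{task}"] = result      # later agents build on it
        return result
```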
Sound familiar? It should. Skills. Context architecture. Orchestration. Guardrails. Four of the five layers from Post 2, running in production for one of the most capable engineers on the planet.
The difference between Karpathy's setup and most teams isn't access to better models. He's using the same models you have access to. The difference is that every layer of his harness is engineered, tested, and refined. His skills encode deep methodology, not surface-level instructions. His context layer gives agents the full picture. His orchestration handles complexity without human babysitting. His guardrails catch failures before they compound.
That's why he can walk away. Not because the model is smart enough. Because the harness is good enough.
Where Harnesses Actually Break
I've run some version of the Karpathy Test with every client we work with at Robot Friends. Not formally, not always with that name. But the diagnostic is always the same: give it a real task, walk away, see what happens.
The failure patterns are remarkably consistent.
The Skills Gap
This is the most common breakdown. The agent gets the task, knows roughly what to do, and produces output that's technically correct but qualitatively wrong. The blog post is fine but doesn't sound like the brand. The code works but violates architectural patterns the team uses. The financial analysis covers the right numbers but misses the interpretation framework the CFO expects.
What's happening: the agent has instructions but not methodology. It knows the what but not the how. Post 4 covered this in depth. A prompt says "write a blog post about X." A skill says "here's how we think about content for our audience: the reader is a technical founder who doesn't have time for theory, every claim needs data, the voice is direct and opinionated, and we never use more than two sentences before getting to the point."
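The mechanical difference is small. Here's a hypothetical version of that content skill as text prepended to every task; the format is illustrative, not any particular product's:

```python
CONTENT_SKILL = """\
Skill: blog-post

Audience: technical founders with no time for theory.
Voice: direct, opinionated. Never more than two sentences before the point.
Evidence: every claim needs data or a concrete example.

Method:
1. Lead with the strongest claim.
2. Back it with one number or one example.
3. Cut anything that fails the "so what?" test.
"""

def build_prompt(task: str) -> str:
    # A prompt carries the what; the skill carries the how.
    return f"{CONTENT_SKILL}\nTask: {task}"
```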
The fix is almost always the same. Take the output that was "close but wrong," identify exactly what you'd change, and ask yourself: could I have told the agent that in advance? If yes, that's a missing skill.
One of our clients, a SaaS company doing about $8M in revenue, failed the Karpathy Test on client reporting. Their agents produced reports that were comprehensive but generic. They read like Wikipedia entries, not strategic advisory documents. The problem wasn't the model's writing ability. The problem was that nobody had encoded the company's reporting methodology: how they frame problems, how they prioritize recommendations, what level of technical detail their clients expect, which metrics matter and which don't. We spent two days encoding that methodology into four skills. The next set of reports passed without revision.
Two days. Four skills. That was the entire gap between "needs significant rework" and "good to go."
The Context Gap
This one is sneakier because the output often looks reasonable at first glance. The agent does the task competently but makes decisions that reveal it doesn't actually understand the situation.
A marketing team asks the agent to draft a competitive analysis. The output is well-structured and covers the right competitors. But it positions the company as a budget option when the actual strategy is premium positioning. Or it emphasizes features the team deprecated two quarters ago. Or it targets enterprise buyers when the ICP is mid-market.
The agent isn't stupid. It just doesn't know. Nobody told it the positioning. Nobody loaded the product roadmap. Nobody provided the ICP document. The agent made reasonable assumptions, and every single one was wrong because it was operating without the business context that lives in your team's heads.
The context gap is the most expensive gap because it produces output that looks good enough to ship. Teams review it, miss the subtle misalignment because it's not obviously wrong, and publish or send it. Then a client calls to ask why the messaging changed.
The fix: build the context layer from Post 2. Identity files. Project state. Business context. Load it before the agent touches any task. It sounds basic because it is. Most teams skip it because it feels like overhead. It's not overhead. It's the difference between an agent that works for your business and one that works for a generic business that vaguely resembles yours.
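Concretely, "load it before the agent touches any task" can be as simple as file concatenation. A sketch; the paths are placeholders for whatever your context layer actually holds:

```python
from pathlib import Path

CONTEXT_FILES = [
    "context/identity.md",       # positioning, voice, who we are
    "context/project_state.md",  # what's live, what's deprecated
    "context/business.md",       # ICP, pricing strategy, constraints
]

def load_context(root: str = ".") -> str:
    """Concatenate the context layer so every task starts from it."""
    parts = []
    for name in CONTEXT_FILES:
        path = Path(root) / name
        if path.exists():        # a missing file here is a known gap
            parts.append(f"--- {name} ---\n{path.read_text()}")
    return "\n\n".join(parts)

prompt = load_context() + "\n\nTask: draft the competitive analysis."
```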
The Orchestration Gap
This is where ambitious tasks die. You ask the agent to do something that requires multiple stages, and it collapses into a single monolithic attempt.
"Research our competitors, identify gaps in their product, draft a positioning document, and create a slide deck." That's not one task. That's four tasks with dependencies. The research informs the gap analysis. The gap analysis informs the positioning. The positioning informs the deck. An agent that tries to do all four in one pass will produce something mediocre at every stage because it can't give adequate attention to any single stage.
This is the orchestration gap. The agent needs to decompose, route subtasks to appropriate skills, collect intermediate results, and compose them into a final output. It needs to run research in parallel where possible and sequentially where necessary. It needs checkpoints where a human can verify direction before the agent invests more time.
Single-agent setups hit this wall constantly. The agent runs out of context window, loses track of earlier work, or produces a 3,000-word document that's actually four half-baked documents stitched together.
The fix isn't always full multi-agent orchestration. Sometimes it's just breaking the task into stages with explicit handoff points. "Do the research. Stop. Show me what you found. Now do the analysis based on that research." You're manually doing what an orchestration layer would do automatically, but it works. And it tells you exactly where to invest if you want to automate the handoffs later.
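Scripted, that manual version is about a dozen lines. In this sketch, run_agent is a placeholder for whatever model client you call; the checkpoints are the point:

```python
def checkpoint(label: str, output: str) -> str:
    """Human verifies direction before the agent invests more time."""
    print(f"--- {label} ---\n{output}\n")
    if input("Continue? [y/N] ") != "y":
        raise SystemExit(f"Stopped at {label}.")
    return output

def positioning_pipeline(run_agent):
    # Four dependent stages, never one monolithic attempt.
    research = checkpoint("research", run_agent("Research competitors X, Y, Z."))
    gaps = checkpoint("gaps", run_agent(f"Identify product gaps.\n\n{research}"))
    pos = checkpoint("positioning", run_agent(f"Draft positioning.\n\n{gaps}"))
    return run_agent(f"Outline a slide deck from this positioning.\n\n{pos}")
```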
The Guardrails Gap
This one doesn't show up in the output quality. It shows up in the risk profile.
The agent does the task well, but along the way it accessed data it shouldn't have, made a decision that should have required approval, sent something externally without human review, or committed code directly to the main branch. The output is fine. The process was dangerous.
I've seen agents send draft emails to real clients because nobody set up approval gates. I've seen code deployments hit production because the agent had permissions that nobody scoped. The output was good in every case. The governance was nonexistent.
This gap is invisible until something goes wrong, and then it's very visible. Post 6 will go deep on this. For now, the diagnostic question is simple: if the agent had made a bad decision during that task, would you have caught it before it caused damage?
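In the meantime, even a crude gate between the agent and the outside world changes the answer to that question. A sketch; the action names and checks stand in for your actual policy:

```python
import re

APPROVAL_REQUIRED = {"send_email", "deploy", "push_to_main"}

def gate(action: str, payload: str) -> bool:
    """Return True only if the action may run unattended."""
    if action in APPROVAL_REQUIRED:
        return input(f"Agent wants to {action}. Approve? [y/N] ") == "y"
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", payload):  # crude SSN pattern
        return False                                  # never auto-send PII
    return True
```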
Why Most People Fail the Test
The instinct when you first try the Karpathy Test is to blame the model. "Claude didn't understand what I wanted." "GPT went off on a tangent." "The AI isn't good enough for this kind of work."
It's almost never the model.
I say this as someone who has run over 175 skills in production across dozens of client engagements. The model is good enough for nearly everything we throw at it. When output quality is bad, it's because the harness is bad. The instructions were vague. The context was missing. The task wasn't decomposed. The guardrails didn't exist.
Karpathy made the same point when he described the "skill issue." The models he uses are commercially available. You can sign up for the same APIs today. The reason his agents outperform yours isn't compute or model access. It's that his harness encodes deeper methodology, richer context, and more sophisticated orchestration than what most teams have built.
The uncomfortable corollary: every task you can't delegate is a task where your harness is weaker than Karpathy's. Not weaker than his model. Weaker than his instructions, his context, his orchestration.
That's actually good news. Because you can fix a harness. You can't fix a model.
The Jevons Paradox (And Why This Matters for Your Career)
There's a fear buried inside the Karpathy Test. If agents can do the work, what happens to the workers?
Karpathy addressed this directly. He pointed to the Jevons Paradox: when something becomes more efficient, demand for it increases rather than decreases. When steam engines got more efficient at burning coal, the world didn't use less coal. It used vastly more, because efficiency opened up applications that weren't viable before.
Software follows the same pattern. The world doesn't need less software because AI makes it faster to write. It needs enormously more. Every small business that couldn't afford custom software now can. Every internal tool that wasn't worth the development time now is. Every niche problem that was too expensive to solve with code is suddenly solvable.
Nat Eliason built a $177K business where an AI agent named Felix runs his ops — managing a skill marketplace, content products, and community channels via Discord. The agent didn't replace employees. The business wouldn't exist without the agent because the economics only work at agent-level cost structure.
This is the pattern. AI doesn't eliminate the need for expertise. It creates new demand that only experts can harness. But the experts who thrive aren't the ones who type faster. They're the ones who delegate better. The ones whose harnesses let them operate at a scale that manual work never could.
The Karpathy Test isn't a test of whether your job is safe. It's a test of whether you're positioned to capture the expanding demand. If you can delegate, you scale. If you can't, you're competing with people who can, and they'll outproduce you by orders of magnitude. Not because they're smarter. Because their harness is better.
The Exercise
Run the Karpathy Test this week. Not next month. This week.
Pick one task from your actual workload. Something real. Something that takes you at least 30 minutes today.
Write the delegation package. Four questions, with a template sketch after the list:
What is the task? (Be specific. "Write a report" is not specific. "Write a competitive analysis of X, Y, and Z companies focused on pricing strategy for the mid-market segment" is.)
What context does the agent need? (Company info, audience, constraints, prior work, quality standards)
What does "done" look like? (Output format, length, tone, level of detail)
What should the agent NOT do? (Constraints, off-limits topics, approval-required actions)
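If it helps, write the package down as a file before you open a chat window. A hypothetical template; the field names are mine, not any standard:

```python
DELEGATION_PACKAGE = {
    # What is the task? Specific enough to fail loudly if misread.
    "task": "Competitive analysis of X, Y, Z: pricing strategy, mid-market segment.",
    # What context does the agent need? (paths are placeholders)
    "context": ["context/identity.md", "context/business.md", "prior/q3_analysis.md"],
    # What does "done" look like?
    "done": "Two-page memo, direct tone, one pricing recommendation per competitor.",
    # What should the agent NOT do?
    "do_not": ["Contact anyone externally", "Invent pricing data", "Exceed three pages"],
}
```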
Hand it to your agent. Walk away for an hour.
When you come back, diagnose the result. If the output is good, your harness works for this task, so pick a harder one next week. If not, figure out which gap killed it:
Wrong quality? Skills gap. Write the methodology the agent was missing.
Wrong direction? Context gap. Build the context file the agent needed.
Got stuck? Orchestration gap. Break the task into stages.
Risky process? Guardrails gap. Define the approval points.
Do this every week. Pick a harder task each time. Track your results. Within a month, you'll have a precise map of where your harness works and where it breaks. That map is your investment roadmap. Every gap you close is a task you never have to do manually again.
Karpathy stopped typing in December. You probably can't stop today. But you can find out exactly why not. And every "why not" you fix moves you closer to a setup where walking away isn't scary. It's the whole point.
What's Next
This is the last free post in the series. If you've made it through all five, you now have the thesis (the model is commoditized, the harness is the business), the framework (five layers, scored on a rubric), the urgency (Conway is coming for your context layer), the foundation (skills as organizational infrastructure), and the diagnostic (the Karpathy Test tells you where your gaps are).
That's the map. Posts 6 through 12 are the territory.
Post 6 is about security, and it starts with a statement from OpenAI that should make every CTO uncomfortable: prompt injection is "unlikely to ever be fully solved." That's not a temporary limitation. It's a fundamental constraint of how language models work. Which means every security vulnerability you're worried about lives in the harness layer, not the model layer. Most teams are trying to secure the wrong thing. Post 6 shows you where the real risks live and what to do about them.
Posts 7 through 12 go deep into the practical build: constructing your Personal Context Portfolio, the anatomy of skills that work in production, why we stopped using n8n, the projected $143 billion edge AI market by 2034, a full case study of how we built our harness (mistakes included), and the complete manifesto.
If the first five posts convinced you the harness matters, the next seven show you how to build one.
[Subscribe to read the rest of The Harness Manifesto.]
Richard Vaughn is the founder of Robot Friends. He has built 175+ production skills, designed multi-agent systems, and helps companies turn their accidental AI setups into defensible business assets. He writes The Harness Manifesto on Substack.
Frankie404 is the AI co-author of this series. It passed the Karpathy Test on the first attempt. Richard stopped typing. Frankie kept going. This post is the result.