The Research Org That Has No Humans
Andrej Karpathy describes a future where a research organization is a set of Markdown files, the entire scientific process runs autonomously, and you — the human — just arrange it once and hit go. Here's the full architecture of what he's building toward.
The Premise: Remove yourself as the bottleneck
Karpathy's thesis is blunt: to get the most out of AI agents, you have to take yourself out of the loop entirely. Every moment you spend prompting, reviewing, or deciding what to try next is a moment the system is constrained by human throughput instead of machine throughput.
The goal isn't human-in-the-loop collaboration. It's maximum token throughput with zero human involvement. You arrange the system once. You define the objective, the metric, and the boundaries. Then you hit go.
The name of the game is: I put in very few tokens just once in a while, and a huge amount of stuff happens on my behalf.
This isn't a hypothetical. He built it. It's called Auto Research — and it already works.
Layer 1: The single auto-research loop
At its simplest, auto research is one agent in a loop: try something, measure the result, try something else. Karpathy pointed it at his GPT training repo — one he'd already tuned extensively by hand over years — and let it run overnight.
It found improvements he'd missed. Specifically: weight decay on value embeddings and under-tuned Adam betas that interact jointly. A single autonomous loop, beating two decades of researcher intuition on an already-optimized codebase.
The critical constraint: the domain must have an objective metric that's easy to evaluate. Validation loss works. CUDA kernel speed works. Anything where you can verify a candidate solution cheaply, even if finding it is expensive. If you can't evaluate it, you can't auto-research it.
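The loop described above can be sketched in a few lines. This is a minimal, hypothetical hill-climb, not Karpathy's actual implementation: `evaluate` and `propose` are stand-ins for a real training run and a real agent proposing a change, and the toy usage just tunes one scalar toward an assumed optimum.

```python
import random

def auto_research_loop(baseline, evaluate, propose, n_iters=100):
    """Greedy hill-climb: try one change, keep it only if the metric improves."""
    best, best_loss = baseline, evaluate(baseline)  # e.g. validation loss
    history = []
    for _ in range(n_iters):
        candidate = propose(best)      # one mutation per experiment
        loss = evaluate(candidate)     # the cheap, objective metric
        if loss < best_loss:           # better -> "commit"
            best, best_loss = candidate, loss
        history.append(loss)           # worse -> "revert" (but keep the log)
    return best, best_loss, history

# Toy usage: tune a single scalar config value toward its optimum at 3.0.
random.seed(0)
evaluate = lambda cfg: (cfg["x"] - 3.0) ** 2
propose = lambda cfg: {"x": cfg["x"] + random.uniform(-0.5, 0.5)}
best, best_loss, _ = auto_research_loop({"x": 0.0}, evaluate, propose, n_iters=500)
```

The structure makes the constraint concrete: the whole loop hinges on `evaluate` being cheap and objective, exactly as the text says.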
Layer 2: The org chart is a Markdown file
The thing that makes the loop work is a file Karpathy calls program.md. It's exactly what it sounds like: a Markdown document that tells the agent what to do, what order to try things in, what the boundaries are, and what ideas to explore.
```markdown
## Objective
metric: validation loss on OpenWebText
target: minimize
budget: 8x A100 hours per experiment

## Search Space
explore: architecture, optimizer, lr schedule
hold_fixed: dataset, tokenizer, model size

## Constraints
max_params: 124M
no_external_data: true

## Process
1. Read current best config
2. Propose one change with hypothesis
3. Run training to 10K steps
4. Compare val loss to baseline
5. If better → commit. If worse → revert.
6. Update baseline and repeat.
```
Here's the radical observation: every research organization is described by a document like this. The roles, the processes, the decision hierarchies, the communication cadences, the risk appetite — it's all expressible as instructions. And once it's instructions, it's code. And once it's code, you can optimize it.
Layer 3: Parallelize (the agent team)
A single loop is powerful. But Karpathy's real interest is what happens when you run many of them. Different agents pulling from different parts of the search space, working on non-conflicting changes, reporting results back to a shared state.
This mirrors what Peter Steinberger became famous for: tiling a monitor with Codex agents, each working on a different repo, each taking about 20 minutes on high-effort mode. You don't write code. You don't even review every line. You work in macro actions — assigning entire functionalities, merging results, routing the next batch of work.
The human role is now something like an engineering manager of a team that never sleeps, never gets distracted, and has perfect recall of the codebase — but still occasionally does something inexplicably dumb.
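The dispatch-and-merge pattern can be sketched with a plain thread pool. Everything here is illustrative: `run_agent` is a dummy stand-in for a real coding agent (its deterministic "improvement" score is fabricated for the example), and the merge step is the macro action of keeping only work that verifiably helped.

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(task: str) -> dict:
    # Placeholder: a real coding agent would edit a repo and report a
    # measured metric delta. Dummy deterministic score for illustration only.
    return {"task": task, "improvement": len(task) % 3}

def dispatch(tasks, n_agents=4):
    """Assign non-conflicting tasks to parallel agents, then merge results."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        results = list(pool.map(run_agent, tasks))
    # The manager's macro action: keep only the work that actually improved things.
    return [r for r in results if r["improvement"] > 0]
```

The human never appears inside `run_agent`; their judgment lives entirely in choosing `tasks` and in the merge filter.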
Layer 4: Optimize the optimizer (meta-research)
This is where it gets recursive. If a research org is defined by program.md, and different program.md files produce different results, then you can run multiple research orgs and measure which one is better.
One org might do fewer standups. One might be more risk-taking. One might prioritize reading recent papers before running experiments. All of these are parameters — and all of them are tunable. Karpathy's contest idea: let people write different program.md files, give them identical hardware budgets, and measure who gets the most improvement. Then feed all the data to a model and ask it to write something better.
There's no way we don't get something better. You can look at where the improvements came from and change the program.md such that more of those kinds of things would be done.
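The contest reduces to a tournament over documents. A hedged sketch, with a made-up scoring function standing in for actually running a research org for its full budget:

```python
def run_org(program_md: str, budget: int) -> float:
    """Stand-in for executing one whole research org under a compute budget.
    Dummy scoring for illustration: reward programs with more distinct steps."""
    return min(budget, len(set(program_md.lower().split())))

def tournament(programs: dict, budget: int = 100) -> str:
    """Give every org an identical budget; return the winning program's name."""
    scores = {name: run_org(text, budget) for name, text in programs.items()}
    return max(scores, key=scores.get)
```

The closing step in the text — feeding the results back to a model to write a better program.md — would simply use the winner (and the per-org score data) as the seed for the next round.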
The Full Stack: Layers of the onion
Karpathy describes these capabilities as concentric layers, each one built on top of the last, each one now taken for granted as the next one emerges.
The bottom layers are now commoditized. The LLM itself, basic agent loops — nobody thinks about these anymore. The current frontier is multi-agent orchestration and persistent "claw" entities that run autonomously with their own memory systems. The emerging edge is meta-optimization: optimizing the instructions themselves.
And this is precisely why it all feels like psychosis. Each layer multiplies the leverage of every layer below it. It's compounding capability with no obvious ceiling, and every limitation feels like a solvable skill issue rather than a hard wall.
The Endgame: Auto research at home
The most ambitious extension Karpathy describes is opening this up to untrusted workers on the internet. His design looks, by his own admission, "a little bit like a blockchain."
The key property that makes this possible: finding a good solution is expensive, but verifying one is cheap. A worker might try 10,000 ideas to find one commit that improves validation loss. The verifier just has to check that single commit. This is the same asymmetry that powers proof-of-work, protein folding distributed projects, and SETI@home.
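The asymmetry is easiest to see in the proof-of-work analogy the text invokes. In this toy sketch (which mirrors proof-of-work, not Karpathy's actual protocol), a worker grinds through thousands of candidates while the verifier needs exactly one hash to check the claim; `commit_id` is a hypothetical identifier for the candidate being attested:

```python
import hashlib

DIFFICULTY = 3  # required leading zero hex digits; higher = costlier search

def verify(commit_id: str, nonce: int) -> bool:
    """Cheap: one SHA-256 hash to check an untrusted worker's claim."""
    digest = hashlib.sha256(f"{commit_id}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

def search(commit_id: str) -> int:
    """Expensive: the worker tries nonces until one passes verification."""
    nonce = 0
    while not verify(commit_id, nonce):
        nonce += 1
    return nonce
```

In the auto-research setting, `search` is the 10,000 failed experiments and `verify` is re-running the one winning commit against the baseline metric.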
The vision: a swarm of agents on the internet collaborating to improve LLMs. You don't donate money to an institution. You purchase compute and join the auto-research forum for a project you care about — cancer research, clean energy, mathematics. The contribution is measured in flops, not dollars.
A swarm of agents on the internet could collaborate to improve LLMs and could potentially run circles around frontier labs. Frontier labs have a huge amount of trusted compute, but the Earth is much bigger and has a huge amount of untrusted compute.
What This Means: Three implications for builders
Your highest-leverage output might be a Markdown file. If you lead a team, manage a project, or run a research effort, the most valuable thing you can produce might not be code, papers, or strategy decks. It might be the instructions file that lets agents run the operation without you. The quality of that document — how well it captures your judgment, your priorities, your process — is what determines how much leverage you actually get.
Verifiable domains are about to be consumed. Anything with an objective metric — performance engineering, hyperparameter optimization, CUDA kernels, test coverage, security scanning — is a natural fit for autonomous loops. The human contribution in these domains shifts entirely from execution to system design: setting up the loop, defining the boundaries, choosing what to measure.
Compute becomes contribution. If auto-research generalizes and the untrusted-worker architecture works, then having access to compute is no longer just an operational expense — it's a way to participate in research. The open question is whether "how many flops do you control?" becomes as meaningful a measure of capability as wealth or headcount.