The Research Org That Has No Humans
Andrej Karpathy describes a future where a research organization is a set of Markdown files, the entire scientific process runs autonomously, and you — the human — just arrange it once and hit go. Here's the full architecture of what he's building toward.
The Premise: Remove yourself as the bottleneck
Karpathy's thesis is blunt: to get the most out of AI agents, you have to take yourself out of the loop entirely. Every moment you spend prompting, reviewing, or deciding what to try next is a moment the system is constrained by human throughput instead of machine throughput.
The goal isn't human-in-the-loop collaboration. It's maximum token throughput with zero human involvement. You arrange the system once. You define the objective, the metric, and the boundaries. Then you hit go.
The name of the game is: I put in very few tokens just once in a while, and a huge amount of stuff happens on my behalf.
This isn't a hypothetical. He built it. It's called Auto Research — and it already works.
Layer 1: The single auto-research loop
At its simplest, auto research is one agent in a loop: try something, measure the result, try something else. Karpathy pointed it at his GPT training repo — one he'd already tuned extensively by hand over years — and let it run overnight.
It found improvements he'd missed. Specifically: weight decay on value embeddings and under-tuned Adam betas that interact jointly. A single autonomous loop, beating two decades of researcher intuition on an already-optimized codebase.
The critical constraint: the domain must have an objective metric that's easy to evaluate. Validation loss works. CUDA kernel speed works. Anything where you can verify a candidate solution cheaply, even if finding it is expensive. If you can't evaluate it, you can't auto-research it.
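The loop described above can be sketched in a few lines. This is a minimal, hypothetical hill-climb, not Karpathy's actual implementation: `evaluate` and `propose` are stand-ins for a real training run and a real agent proposing a change, and the toy usage just tunes one scalar toward an assumed optimum.

```python
import random

def auto_research_loop(baseline, evaluate, propose, n_iters=100):
    """Greedy hill-climb: try one change, keep it only if the metric improves."""
    best, best_loss = baseline, evaluate(baseline)  # e.g. validation loss
    history = []
    for _ in range(n_iters):
        candidate = propose(best)      # one mutation per experiment
        loss = evaluate(candidate)     # the cheap, objective metric
        if loss < best_loss:           # better -> "commit"
            best, best_loss = candidate, loss
        history.append(loss)           # worse -> "revert" (but keep the log)
    return best, best_loss, history

# Toy usage: tune a single scalar config value toward its optimum at 3.0.
random.seed(0)
evaluate = lambda cfg: (cfg["x"] - 3.0) ** 2
propose = lambda cfg: {"x": cfg["x"] + random.uniform(-0.5, 0.5)}
best, best_loss, _ = auto_research_loop({"x": 0.0}, evaluate, propose, n_iters=500)
```

The structure makes the constraint concrete: the whole loop hinges on `evaluate` being cheap and objective, exactly as the text says.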
Layer 2: The org chart is a Markdown file
The thing that makes the loop work is a file Karpathy calls program.md. It's exactly what it sounds like: a Markdown document that tells the agent what to do, what order to try things in, what the boundaries are, and what ideas to explore.
```markdown
## Objective
metric: validation loss on OpenWebText
target: minimize
budget: 8x A100 hours per experiment

## Search Space
explore: architecture, optimizer, lr schedule
hold_fixed: dataset, tokenizer, model size

## Constraints
max_params: 124M
no_external_data: true

## Process
1. Read current best config
2. Propose one change with hypothesis
3. Run training to 10K steps
4. Compare val loss to baseline
5. If better → commit. If worse → revert.
6. Update baseline and repeat.
```
Here's the radical observation: every research organization is described by a document like this. The roles, the processes, the decision hierarchies, the communication cadences, the risk appetite — it's all expressible as instructions. And once it's instructions, it's code. And once it's code, you can optimize it.
Layer 3: Parallelize (the agent team)
A single loop is powerful. But Karpathy's real interest is what happens when you run many of them. Different agents pulling from different parts of the search space, working on non-conflicting changes, reporting results back to a shared state.
This mirrors what Peter Steinberger became famous for: tiling a monitor with Codex agents, each working on a different repo, each taking about 20 minutes on high-effort mode. You don't write code. You don't even review every line. You work in macro actions — assigning entire functionalities, merging results, routing the next batch of work.
The human role is now something like an engineering manager of a team that never sleeps, never gets distracted, and has perfect recall of the codebase — but still occasionally does something inexplicably dumb.
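The dispatch-and-merge pattern can be sketched with a plain thread pool. Everything here is illustrative: `run_agent` is a dummy stand-in for a real coding agent (its deterministic "improvement" score is fabricated for the example), and the merge step is the macro action of keeping only work that verifiably helped.

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(task: str) -> dict:
    # Placeholder: a real coding agent would edit a repo and report a
    # measured metric delta. Dummy deterministic score for illustration only.
    return {"task": task, "improvement": len(task) % 3}

def dispatch(tasks, n_agents=4):
    """Assign non-conflicting tasks to parallel agents, then merge results."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        results = list(pool.map(run_agent, tasks))
    # The manager's macro action: keep only the work that actually improved things.
    return [r for r in results if r["improvement"] > 0]
```

The human never appears inside `run_agent`; their judgment lives entirely in choosing `tasks` and in the merge filter.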
Layer 4: Optimize the optimizer (meta-research)
This is where it gets recursive. If a research org is defined by program.md, and different program.md files produce different results, then you can run multiple research orgs and measure which one is better.
One org might do fewer standups. One might be more risk-taking. One might prioritize reading recent papers before running experiments. All of these are parameters — and all of them are tunable. Karpathy's contest idea: let people write different program.md files, give them identical hardware budgets, and measure who gets the most improvement. Then feed all the data to a model and ask it to write something better.
There's no way we don't get something better. You can look at where the improvements came from and change the program.md such that more of those kinds of things would be done.
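The contest reduces to a tournament over documents. A hedged sketch, with a made-up scoring function standing in for actually running a research org for its full budget:

```python
def run_org(program_md: str, budget: int) -> float:
    """Stand-in for executing one whole research org under a compute budget.
    Dummy scoring for illustration: reward programs with more distinct steps."""
    return min(budget, len(set(program_md.lower().split())))

def tournament(programs: dict, budget: int = 100) -> str:
    """Give every org an identical budget; return the winning program's name."""
    scores = {name: run_org(text, budget) for name, text in programs.items()}
    return max(scores, key=scores.get)
```

The closing step in the text — feeding the results back to a model to write a better program.md — would simply use the winner (and the per-org score data) as the seed for the next round.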
The Full Stack: Layers of the onion
Karpathy describes these capabilities as concentric layers, each one built on top of the last, each one now taken for granted as the next one emerges.
The bottom layers are now commoditized. The LLM itself, basic agent loops — nobody thinks about these anymore. The current frontier is multi-agent orchestration and persistent "claw" entities that run autonomously with their own memory systems. The emerging edge is meta-optimization: optimizing the instructions themselves.
And this is precisely why it all feels like psychosis. Each layer multiplies the leverage of every layer below it. It's compounding capability with no obvious ceiling, and every limitation feels like a solvable skill issue rather than a hard wall.
The Endgame: Auto research at home
The most ambitious extension Karpathy describes is opening this up to untrusted workers on the internet. His design looks, by his own admission, "a little bit like a blockchain."
The key property that makes this possible: finding a good solution is expensive, but verifying one is cheap. A worker might try 10,000 ideas to find one commit that improves validation loss. The verifier just has to check that single commit. This is the same asymmetry that powers proof-of-work, protein folding distributed projects, and SETI@home.
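The asymmetry is easiest to see in the proof-of-work analogy the text invokes. In this toy sketch (which mirrors proof-of-work, not Karpathy's actual protocol), a worker grinds through thousands of candidates while the verifier needs exactly one hash to check the claim; `commit_id` is a hypothetical identifier for the candidate being attested:

```python
import hashlib

DIFFICULTY = 3  # required leading zero hex digits; higher = costlier search

def verify(commit_id: str, nonce: int) -> bool:
    """Cheap: one SHA-256 hash to check an untrusted worker's claim."""
    digest = hashlib.sha256(f"{commit_id}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

def search(commit_id: str) -> int:
    """Expensive: the worker tries nonces until one passes verification."""
    nonce = 0
    while not verify(commit_id, nonce):
        nonce += 1
    return nonce
```

In the auto-research setting, `search` is the 10,000 failed experiments and `verify` is re-running the one winning commit against the baseline metric.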
The vision: a swarm of agents on the internet collaborating to improve LLMs. You don't donate money to an institution. You purchase compute and join the auto-research forum for a project you care about — cancer research, clean energy, mathematics. The contribution is measured in flops, not dollars.
A swarm of agents on the internet could collaborate to improve LLMs and could potentially run circles around frontier labs. Frontier labs have a huge amount of trusted compute, but the Earth is much bigger and has a huge amount of untrusted compute.
What This Means: Three implications for builders
Your highest-leverage output might be a Markdown file. If you lead a team, manage a project, or run a research effort, the most valuable thing you can produce might not be code, papers, or strategy decks. It might be the instructions file that lets agents run the operation without you. The quality of that document — how well it captures your judgment, your priorities, your process — is what determines how much leverage you actually get.
Verifiable domains are about to be consumed. Anything with an objective metric — performance engineering, hyperparameter optimization, CUDA kernels, test coverage, security scanning — is a natural fit for autonomous loops. The human contribution in these domains shifts entirely from execution to system design: setting up the loop, defining the boundaries, choosing what to measure.
Compute becomes contribution. If auto-research generalizes and the untrusted-worker architecture works, then having access to compute is no longer just an operational expense — it's a way to participate in research. The open question is whether "how many flops do you control?" becomes as meaningful a measure of capability as wealth or headcount.