Stop Writing Prompts. Start Building Loops.

A loop is a recursive goal. You define a purpose, an agent iterates against it, and it keeps running until a real stopping condition is met. The agent forgets everything between runs. The loop does not. Hold onto that, because everything below is just plumbing around it.

Addy Osmani breaks a loop into six parts: automations, worktrees, skills, connectors, sub-agents, and memory. Every working loop is some combination of these. You do not need all six on day one. You need one.

Automations

The thing that starts the run when you are not typing.

Worktrees

Separate sandboxes for parallel agents and experiments.

Skills

Reusable local instructions the loop does not need re-explained.

Connectors

Inputs and outputs from CI, issues, docs, Slack, or deploy systems.

Sub-agents

Specialized workers for planning, building, checking, and reporting.

Memory

The durable file that survives the model context reset.

Below, the examples use Claude Code, OpenAI Codex, or both, because the concepts are identical even though the commands differ.

01 Start with one trigger

A loop becomes a loop the moment something fires without you typing. A cron job. A git hook. A webhook. The agent finds and triages work before you ask.

Pick one recurring task you do by hand and automate the trigger. Example: every morning at 8am, read yesterday's CI failures, recent commits, and open issues, then write findings to a markdown file. That one automation is already a complete, working loop.

Claude Code

0 8 * * * cd ~/repo && claude -p "Read yesterday's CI failures, recent commits, and open issues. Append findings to STATE.md." --allowedTools "Bash(git log:*),Bash(gh:*),Edit"

Codex

0 8 * * * cd ~/repo && codex exec --sandbox workspace-write "Read yesterday's CI failures, recent commits, and open issues. Append findings to STATE.md."

One automation that writes one file is more leverage than a hundred hand-crafted prompts, because it runs without you.
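A scheduled trigger can fire while the previous run is still going. The sketch below guards against that with a lock directory. The name run_once, the lock path, and the exit code 75 are illustrative assumptions, not part of Claude Code or Codex.

```shell
# run_once LOCK_DIR CMD...: run CMD unless another run holds the lock.
# Hypothetical helper; the lock path and exit code 75 are arbitrary choices.
run_once() {
  local lock_dir="$1" rc
  shift
  # mkdir is atomic: it fails if the directory (the lock) already exists.
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "previous run still active, skipping" >&2
    return 75
  fi
  "$@"
  rc=$?
  # Release the lock. If the process is killed before this line,
  # the stale lock directory has to be removed by hand.
  rmdir "$lock_dir"
  return "$rc"
}

# Possible crontab usage, assuming this file is saved as ~/bin/loop.sh:
# 0 8 * * * cd ~/repo && . ~/bin/loop.sh && run_once /tmp/triage.lock claude -p "..." >> loop.log 2>&1
```

A skipped run returns a non-zero status, so the scheduler's log shows that a slot was missed.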

02 Give the loop a memory file

Create one markdown file, STATE.md, where every iteration can read and write. This is the loop's only memory.

The agent reads it first at the start of each run and writes back last: what was done, what is in progress, what is blocked, what to try next. This is often called the PROGRESS.md pattern (the file is named STATE.md here), and it is the single most important file in any loop. Without it every run starts from zero.

Keep it short. A 2,000-line memory file that the agent has to wade through is worse than no memory file at all.

Claude Code can inherit the habit through CLAUDE.md: always read and update STATE.md.
Codex can inherit the habit through AGENTS.md: read STATE.md before acting, update it before exiting.

## Last run (2026-06-17 08:00)
Done: triaged 3 CI failures, opened issue #412 for the flaky auth test.
In progress: none.
Blocked: #408 needs a staging DB credential.
Next: investigate the timeout in checkout.spec.ts.
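One way to keep the file short is to trim old entries before each run. This is a minimal sketch, assuming every run appends a new section that starts with a "## " heading like the one above. The name trim_state is hypothetical.

```shell
# trim_state FILE KEEP: keep only the last KEEP "## " sections of FILE.
# Assumes newest sections are at the bottom. Any text before the first
# "## " heading is dropped when trimming happens.
trim_state() {
  local file="$1" keep="$2" total
  total=$(grep -c '^## ' "$file")
  # Nothing to do if the file already has KEEP sections or fewer.
  if [ "$total" -le "$keep" ]; then
    return 0
  fi
  awk -v skip="$((total - keep))" '
    /^## / { seen++ }      # count section headings as they appear
    seen > skip { print }  # print only once the old sections are passed
  ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Possible usage before each run: trim_state STATE.md 5
```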

03 Split the writer from the checker

The model that wrote the code is too lenient when grading its own homework. A single agent that writes and reviews will mark itself done far more often than it should.

The fix is the evaluator-optimizer pattern: one agent generates, a second critiques against an objective standard, and the loop repeats until the check passes.

The critical word is objective. A second agent told to "review this" with no hard signal is just a second optimist.

The verifier needs a gate that fails on something real:

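# Gate output is saved to gate.log so the agent can read the real failure.
until { npm test && npm run typecheck && npm run lint; } > gate.log 2>&1; do
  claude -p "Tests, typecheck, or lint are failing. Read gate.log, fix the cause, update STATE.md." \
    --allowedTools "Edit"
done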

Same gate with Codex:

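# Same three checks; output is saved to gate.log for the agent to read.
until { npm test && npm run typecheck && npm run lint; } > gate.log 2>&1; do
  codex exec --sandbox workspace-write "The build is failing. Read gate.log, fix the root cause, update STATE.md."
done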

The test suite is the checker. Not the agent's opinion of itself.
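A gate is easier to act on when it records which check failed. The wrapper below is a sketch, assuming each check is a plain shell command passed as a string. The name run_gate and the log file are hypothetical.

```shell
# run_gate LOG CHECK...: run each check in order, write output to LOG,
# stop at the first failure and return 1. Returns 0 if every check passes.
run_gate() {
  local log="$1" check
  shift
  : > "$log"  # start each gate run with an empty log
  for check in "$@"; do
    echo "== $check" >> "$log"
    if ! sh -c "$check" >> "$log" 2>&1; then
      echo "FAILED: $check" >> "$log"
      return 1
    fi
  done
  return 0
}

# Possible usage:
# until run_gate gate.log "npm test" "npm run typecheck" "npm run lint"; do
#   codex exec --sandbox workspace-write "Gate failed. Read gate.log, fix the root cause, update STATE.md."
# done
```

The agent only ever sees the log. Whether the work is done is decided by the exit status of the checks.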

04 Isolate parallel work with worktrees

Once two agents touch the same files at once, you get collisions. Git worktrees give each agent its own working directory on its own branch:

# -b creates the branch; drop -b and pass the branch name last if it already exists.
git worktree add -b plan ../agent-plan
git worktree add -b build ../agent-build
git worktree add -b verify ../agent-verify

A typical pipeline: one sub-agent explores and writes a plan, a second implements it in its own worktree, a third verifies against tests in a third. Each agent only ever sees its own copy.

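Claude Code runs one agent per worktree; sub-agents fan out work and report back to a shared STATE.md. Because each worktree has its own copy of every tracked file, keep that shared file at one agreed path outside the worktrees, or merge it between branches. Codex can do the same with one CLI run per worktree, or by handing parallel tasks to Codex Cloud, where each task runs in an isolated OpenAI-managed container.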
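Setting up and tearing down one worktree per role can be scripted. The sketch below assumes the role names double as new branch names and that the worktrees live next to the repository as ../agent-ROLE. Both function names are hypothetical.

```shell
# add_agent_worktrees ROLE...: create ../agent-ROLE on a new branch ROLE.
# Run from inside the main repository, which needs at least one commit.
add_agent_worktrees() {
  local role
  for role in "$@"; do
    git worktree add -b "$role" "../agent-$role" || return 1
  done
}

# remove_agent_worktrees ROLE...: remove the worktree directories again.
# The branches themselves are kept so their commits stay reachable.
remove_agent_worktrees() {
  local role
  for role in "$@"; do
    git worktree remove "../agent-$role" || return 1
  done
}

# Possible usage: add_agent_worktrees plan build verify
```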

This is where a loop scales from one background task to a pipeline of tasks running in parallel.

05 Set a hard stop condition

A loop without a real exit fails quietly. Geoffrey Huntley documented the "Ralph Wiggum loop": an agent that is supposed to signal completion only when it has finished signals early instead, and the loop exits believing a half-done job is done.

Your stop condition must be checkable by something other than the agent's claim:

Good: the test suite passes.
Good: the build succeeds.
Good: the linked ticket moves to Done with passing CI.
Bad: the agent says it is finished.

And always set a maximum iteration count as a backstop. Ten or twenty is reasonable for most loops.

for i in $(seq 1 15); do
  npm test && break
  codex exec --sandbox workspace-write "Tests failing on iteration $i. Fix and update STATE.md."
done

If it hits the cap without passing, halt and flag for a human. Do not keep burning tokens.
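A capped loop should also say so when it gives up. The sketch below assumes the check and the fix are each a shell command passed as a string. The name run_capped, the ITERATION variable, and the exit code 2 are illustrative choices.

```shell
# run_capped MAX CHECK FIX: run FIX until CHECK passes, at most MAX times.
# Returns 0 when CHECK passes, 2 when the cap is reached without passing.
run_capped() {
  local max="$1" check="$2" fix="$3" i
  for i in $(seq 1 "$max"); do
    if sh -c "$check"; then
      return 0
    fi
    # ITERATION is exported to the fix command so it can be used in a prompt.
    ITERATION="$i" sh -c "$fix"
  done
  # Re-check once more so the last fix attempt is not wasted.
  if sh -c "$check"; then
    return 0
  fi
  echo "hit cap of $max iterations without passing, needs a human" >&2
  return 2
}

# Possible usage:
# run_capped 15 "npm test" \
#   'codex exec --sandbox workspace-write "Tests failing on iteration $ITERATION. Fix and update STATE.md."' \
#   || echo "loop gave up" >> STATE.md
```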

06 Wire in a human checkpoint

Not every loop should run unattended on day one. Boris Cherny's autonomy ladder has four rungs:

1. Suggests only.
2. Drafts changes for a human to apply.
3. Applies low-risk changes, human approves before publish or merge.
4. Applies and completes automatically, with audit logs.

Start every new loop at level 1 or 2. Run it for a week, read the output, correct what it gets wrong. Promote to level 3 only once it consistently produces work you would approve unchanged. Level 4 is earned, not assumed.

This maps cleanly onto each tool's permission model. Codex has sandbox modes such as --sandbox read-only and --sandbox workspace-write, plus approval policies. Claude Code mirrors the pattern with tool permissions and an --allowedTools allowlist.

Runs that find something go to a triage inbox. Runs that find nothing should archive themselves silently.
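The routing rule can be as small as a check for an empty report. This sketch assumes each run writes its findings to a report file and that an empty or whitespace-only file means nothing was found. The name route_report and the inbox and archive directories are hypothetical.

```shell
# route_report REPORT INBOX ARCHIVE: move a run's report to the inbox if it
# has content, otherwise to the archive.
route_report() {
  local report="$1" inbox="$2" archive="$3"
  mkdir -p "$inbox" "$archive" || return 1
  # The pattern matches any non-whitespace character, so blank-only
  # reports count as "found nothing".
  if grep -q '[^[:space:]]' "$report"; then
    mv "$report" "$inbox/"
  else
    mv "$report" "$archive/"
  fi
}

# Possible usage: route_report report-today.md triage/inbox triage/archive
```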

07 Watch the token cost

A bad iteration is a wasted prompt. A bad loop running overnight is a bill, because every iteration is a full model call dragging context and tool output with it.

Before any unsupervised run, execute it manually 3 to 5 times, check tokens per iteration, then do the arithmetic.

tokens/iteration x price/token x max iterations = worst-case cost/run
worst-case cost/run x runs/day                  = worst-case daily spend
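Turning tokens into money needs a token price, so the worked example below takes one as an input. All four numbers in the usage line are made up for illustration; substitute your own measured tokens per iteration and your provider's current price. The name worst_case_cost is hypothetical.

```shell
# worst_case_cost TOKENS_PER_ITER PRICE_PER_MILLION MAX_ITER RUNS_PER_DAY
# Prints two numbers: worst-case cost per run and worst-case daily spend.
worst_case_cost() {
  awk -v t="$1" -v p="$2" -v n="$3" -v r="$4" 'BEGIN {
    per_run = t * n * p / 1000000   # tokens x iterations x price per token
    printf "%.2f %.2f\n", per_run, per_run * r
  }'
}

# 40,000 tokens per iteration, 3 per million tokens, 15 iterations, 4 runs a day:
# worst_case_cost 40000 3 15 4   # prints "1.80 7.20"
```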

And restrict the shell. Build a command allowlist for anything that can execute commands: npm, git, ls, cat. Codex enforces this with sandbox and approval policy; Claude Code with --allowedTools.

Unrestricted shell access inside an unattended loop turns a cost problem into a security problem.
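For Claude Code, the allowlist can be generated from a list of command names instead of typed by hand. This sketch reuses the Bash(command:*) pattern from the cron example earlier. The helper name allowed_tools_arg is hypothetical.

```shell
# allowed_tools_arg CMD...: print a comma-separated allowlist string,
# one Bash(CMD:*) entry per command name.
allowed_tools_arg() {
  local out="" cmd
  for cmd in "$@"; do
    # Add a comma only when out already has an entry.
    out="${out:+$out,}Bash($cmd:*)"
  done
  printf '%s\n' "$out"
}

# allowed_tools_arg npm git ls cat
# prints: Bash(npm:*),Bash(git:*),Bash(ls:*),Bash(cat:*)
#
# Possible usage:
# claude -p "..." --allowedTools "$(allowed_tools_arg npm git ls cat)"
```

An entry like Bash(git:*) still permits every git subcommand, so narrower entries such as "git log" are safer where they are enough.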

08 Build the second loop differently than the first

Your first loop is small, single-purpose, and heavily supervised. Your second loop connects to the first.

A daily triage loop writes findings to STATE.md. A second scheduled loop reads that file, picks the highest-priority item, and acts on it. Neither needs the other to function, but together they move work from discovered to in progress without you touching either.
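The hand-off between the two loops can be a single line of text. This sketch assumes the first loop writes a line starting with "Next:" as in the STATE.md example above, and treats the most recent such line as the highest-priority item. The name next_item is hypothetical.

```shell
# next_item FILE: print the text of the last "Next:" line in FILE,
# or nothing if the file has no such line.
next_item() {
  grep '^Next:' "$1" | tail -n 1 | sed 's/^Next:[[:space:]]*//'
}

# Possible usage in the second loop:
# task=$(next_item STATE.md)
# [ -n "$task" ] && claude -p "Work on this item: $task. Update STATE.md when done."
```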

This is where skills start paying off. Write a skill once for how to triage CI failures, as a SKILL.md in Claude Code, or as a skill or an AGENTS.md rule in Codex, and every future loop reads it instead of you re-explaining.
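Creating a skill file can itself be scripted. The sketch below assumes the Claude Code layout of one directory per skill holding a SKILL.md with name and description frontmatter; check the current documentation before relying on it. The function name make_skill and the three numbered steps are placeholders, not a recommended triage procedure.

```shell
# make_skill ROOT NAME DESCRIPTION: write ROOT/NAME/SKILL.md with
# frontmatter and a placeholder body to edit by hand afterwards.
make_skill() {
  local root="$1" name="$2" description="$3"
  mkdir -p "$root/$name" || return 1
  cat > "$root/$name/SKILL.md" <<EOF
---
name: $name
description: $description
---

1. List failed CI runs from the last day.
2. Group failures by test file.
3. Record each group in STATE.md.
EOF
}

# Possible usage: make_skill .claude/skills triage-ci "How to triage CI failures"
```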

The loops stop merely running in parallel and start sharing what they have learned.

09 What your job becomes

Once a few loops are running, the shape of your day changes. You stop opening a chat to ask a question and start opening a triage inbox to review what the loops found overnight. The to-do list stops being a static pile and becomes a set of agents converting ideas into drafts, fixes, and reviews.

You do not stop deciding what matters. The deciding moves up a level: from per-task prompting to loop design. You write fewer prompts not because you do less, but because the loops write them for you.

Your attention moves to the three things that still need a human: the review checkpoint, the stop condition, and the next loop worth building.