One technique I keep returning to is improving a small, well-tested piece of code over a metric, in many small autonomous passes, with the worst attempts thrown away. The unit of work is one cycle: an isolated git worktree, an open-ended prompt, automatic tests, an automatic merge or discard. No single cycle is dramatic. The whole point is that they compound.

It works when two things are true: the code under improvement is well tested, and the metric you care about is honestly definable. It does not work otherwise — the gate becomes either meaningless (no tests) or gameable (the wrong metric).

Where the idea came from

The seed was Don Knuth’s paper Claude’s Cycles (Stanford CS, February 2026). Knuth had been stuck on an open graph-decomposition conjecture from a future volume of TAOCP: partitioning the directed edges of an m × m × m cube into exactly three Hamiltonian cycles. He fed it to Claude Opus 4.6, which ran 31 explorations in about an hour: brute force, serpentine patterns, fiber decompositions, simulated annealing. Most failed. The 15th attempt, a fiber decomposition, landed on the right shape, and Knuth turned it into a proof.

What I took from the paper is not the math. It’s the loop: many small, isolated attempts; the failures are free; one of them eventually finds something real. Knuth used it to crack a conjecture. I wanted it for refactoring.

The technique

Pick a target module. Write a metrics script that emits JSON — file count, lines of code, total cyclomatic complexity, max method CC, max method LOC, a perf benchmark, test pass count. That script is the thermostat. Then in a loop:

  1. Read current metrics and the history of previous cycles.
  2. Spin up a git worktree on a cycle/NN branch.
  3. Invoke claude -p with an open-ended prompt: here are the metrics, here is what previous cycles did, here are the constraints (tests must pass, perf must not regress >15%, public API stays stable). Find one thing to improve, do it, commit.
  4. Run the tests in the worktree. Re-collect metrics.
  5. If the gates pass, merge the branch. If not, throw the worktree away and log why.
  6. Repeat.
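
A sketch of one pass through that loop, in Python. The claude -p invocation, the constraints in the prompt, and the gate thresholds come from the setup above; everything else (the cycles/metrics.py and cycles/plan.md paths, the worktree location, pytest as the test runner, the perf_ms metric key) is an illustrative assumption, not the real cycle-next.py.

```python
#!/usr/bin/env python3
"""One pass of the cycle loop. Illustrative sketch, not the actual orchestrator."""
import json
import subprocess
import sys
from pathlib import Path

REPO = Path(".")                      # main repository
WORKTREES = Path("../cycles-wt")      # where throwaway worktrees live (assumption)
PLAN = REPO / "cycles" / "plan.md"    # cycle history (assumed location)

def run(*cmd, cwd=REPO, check=True):
    return subprocess.run(cmd, cwd=cwd, check=check, capture_output=True, text=True)

def metrics(path):
    """Call the metrics script with --json and parse its output (assumed interface)."""
    out = run(sys.executable, "cycles/metrics.py", "--json", "--path", str(path))
    return json.loads(out.stdout)

def one_cycle(n: int) -> bool:
    branch, wt = f"cycle/{n:02d}", WORKTREES / f"cycle-{n:02d}"
    before = metrics(REPO)
    run("git", "worktree", "add", str(wt), "-b", branch)

    prompt = (
        f"Current metrics: {json.dumps(before)}\n"
        f"History of previous cycles:\n{PLAN.read_text()}\n"
        "Constraints: tests must pass, perf must not regress >15%, "
        "public API stays stable. Find one thing to improve, do it, commit."
    )
    run("claude", "-p", prompt, "--dangerously-skip-permissions", cwd=wt)

    tests_green = run("pytest", "-q", cwd=wt, check=False).returncode == 0
    after = metrics(wt)
    perf_ok = after["perf_ms"] <= before["perf_ms"] * 1.15   # hypothetical metric key

    if tests_green and perf_ok:
        run("git", "merge", "--no-ff", branch)               # gates passed: merge
        run("git", "worktree", "remove", str(wt))
        return True
    run("git", "worktree", "remove", "--force", str(wt))     # gates failed: discard
    run("git", "branch", "-D", branch)
    return False                                             # the failure reason goes into plan.md

if __name__ == "__main__":
    one_cycle(int(sys.argv[1]))
```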

The first cycle is the only one I write a specific prompt for — usually a known fix to bootstrap the pipeline. After that the prompts are emergent. Claude reads the metrics, decides what stinks, and goes.

The skill

I packaged the loop as a Claude Code skill at .claude/skills/claude-cycles/SKILL.md, with a sister /cycles slash command that scaffolds the infrastructure for a new domain. The skill’s preamble:

---
name: claude-cycles
description: Iterative autonomous refactoring with metrics-driven quality gates
triggers:
  - claude cycles
  - iterative refactoring
  - cycle-based improvement
  - autonomous refactoring
  - metrics-driven refactoring
---

# Claude Cycles

Iterative, autonomous refactoring methodology where each cycle runs an
independent Claude instance in a git worktree with metrics-driven feedback,
quality gates, and automatic merge/discard.

## Overview

Claude Cycles decomposes large refactoring efforts into small, independently
verifiable steps. Each cycle:
- Runs in an isolated git worktree (no risk to mainline)
- Gets current metrics + history as context
- Has full creative freedom to explore and improve
- Must pass quality gates (tests, perf) to be merged
- Is discarded at zero cost if it fails the gates

The infrastructure is three files per domain:

  • metrics.py — measures the target. Supports --json for gate checks, --save to append to a CSV trend log, --path to point at a worktree instead of the main repo.
  • cycle-next.py — the orchestrator. Reads the plan, collects before-metrics, creates the worktree, builds the prompt, invokes claude -p --dangerously-skip-permissions, runs tests, collects after-metrics, checks gates, merges or discards.
  • plan.md — the live progress table, the cycle log, the baseline metrics.
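
For concreteness, a minimal sketch of what a metrics.py along the lines of the first bullet might look like. The metric keys, the benchmark module, and the trend-log location are assumptions; a real script would also compute the complexity figures (for example with radon) rather than leaving them as a comment.

```python
#!/usr/bin/env python3
"""Per-domain metrics: --json for gate checks, --save for the trend log, --path for worktrees."""
import argparse
import csv
import json
import subprocess
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

def perf_benchmark(root: Path) -> float:
    """Time the domain's benchmark entry point (module name is hypothetical)."""
    t0 = time.perf_counter()
    subprocess.run([sys.executable, "-m", "benchmarks.market"], cwd=root, check=False)
    return (time.perf_counter() - t0) * 1000

def collect(root: Path) -> dict:
    files = list(root.rglob("*.py"))
    loc = sum(len(f.read_text(errors="ignore").splitlines()) for f in files)
    return {
        "files": len(files),
        "loc": loc,
        "perf_ms": round(perf_benchmark(root), 1),
        # max_method_cc / max_method_loc / total_cc would come from a complexity analyzer
    }

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--json", action="store_true", help="emit metrics as JSON for gate checks")
    ap.add_argument("--save", action="store_true", help="append a row to the CSV trend log")
    ap.add_argument("--path", default=".", help="measure a worktree instead of the main repo")
    args = ap.parse_args()

    m = collect(Path(args.path))
    if args.save:
        log = Path("cycles/trend.csv")             # assumed location of the trend log
        is_new = not log.exists()
        with log.open("a", newline="") as fh:
            writer = csv.writer(fh)
            if is_new:
                writer.writerow(["timestamp", *m.keys()])
            writer.writerow([datetime.now(timezone.utc).isoformat(), *m.values()])
    print(json.dumps(m, indent=2) if args.json else m)

if __name__ == "__main__":
    main()
```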

The lifecycle, condensed from the skill:

  1. Read plan.md and the history of previous cycles; collect before-metrics.
  2. Create an isolated worktree on a cycle/NN branch.
  3. Build the prompt from current metrics, history, and constraints.
  4. Invoke claude -p --dangerously-skip-permissions in the worktree.
  5. Run the tests; re-collect metrics.
  6. Check the gates (tests green, perf within budget).
  7. Merge the branch, or discard the worktree and log why; update plan.md.

What’s measured, what’s gated

| Metric | Purpose | Gate? |
| --- | --- | --- |
| Files | Track consolidation | No |
| LOC | Track simplification | No |
| Max method CC | Complexity hotspots | Target (< 10) |
| Max method LOC | Method size | Target (< 50) |
| Total CC | Overall complexity | No |
| Services / deps | Coupling | No |
| Perf benchmark | Regression detection | Yes (< 115% baseline) |
| Test pass rate | Correctness | Yes (100%) |

Most metrics are targets, not gates. They show up in the prompt so Claude knows what to focus on, but a cycle can still merge without moving them. Only tests and perf are hard gates, and tests are never modified, so a green suite means the public behavior is preserved.
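
In code, the distinction is just which checks can veto a merge. A sketch, with the 15% perf budget and the 100% test requirement taken from the constraints above and the metric key names assumed:

```python
def check_gates(before: dict, after: dict) -> tuple[bool, list[str]]:
    """Hard gates veto the merge; targets only feed the next prompt."""
    failures = []
    if after["tests_passed"] < after["tests_total"]:       # gate: every test green
        failures.append("tests not green")
    if after["perf_ms"] > before["perf_ms"] * 1.15:        # gate: within 115% of baseline
        failures.append("perf regressed more than 15%")
    # Targets (max_method_cc < 10, max_method_loc < 50) are reported to the
    # next cycle's prompt but never block a merge.
    return (not failures, failures)
```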

Where I ran it

Five domains in this repo have cycles*/ directories, each with a plan.md. Headline numbers as of today:

| Domain | Cycles done | LOC (start → now) | Max CC (start → now) | Tests | Notes |
| --- | --- | --- | --- | --- | --- |
| Market (original) | 13 | 1,878 → 807 | 36 → 10 | 28/28 | The first run, the cleanest result |
| Transport | 8 | 1,535 → 960 | 19 → 86 | 80/80 | LOC down 37%, but Max CC went up |
| Merchant | 11 | 662 → 1,182 | 18 → 12 | 72/272 | LOC up because half the cycles added missing features |
| Migration | 7 | 1,041 → 1,216 | 7 → 7 | 103/156 → 176/176 | Mostly stub completion, not refactoring |
| Market (second pass) | 1 | 886 baseline | 12 baseline | 49/49 | Just started |

Two of those numbers need an asterisk. Transport’s Max CC went up: the cycles aggressively inlined services into the facade and ended up with one big function that does the orchestration. That’s the gate working as designed — perf and tests are fine, complexity isn’t gated, so the cycles found a local optimum that traded structure for size. I’ll either pay it down with a future cycle that explicitly targets max CC, or accept it. Merchant and Migration LOC went up because most of those cycles weren’t refactoring — they were finishing stub implementations to get tests green. Same loop, different mode.

About 40 cycles total across the five domains. Git records 43 commits with the cycle: prefix, 7.7% of the project's 555 commits, which lines up roughly with the cycle count (some cycles produce multiple commits, some fail and leave nothing).

The Market domain, in detail

The original Market run is the cleanest example because it was pure refactoring on already-passing tests. Thirteen cycles took it from 38 files / 1,878 LOC / max method CC 36 to 29 files / 807 LOC / max CC 9, with the perf benchmark 40% faster and all 28 tests still green.

Five things stood out across those 13 cycles.

Algorithmic improvements beat structural ones. The early cycles inlined services and removed layers — typical “delete the indirection” wins. The biggest single jump came from a different direction: an algorithmic change from O(P·N) to O(N+P) for equilibrium-price finding, plus mutable accumulators replacing intermediate ItemBundle allocations. That cycle alone moved perf from 2,624 ms to ~1,750 ms. I didn’t prescribe it. Claude found it from looking at the metrics and the code.
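
I won't reproduce the actual Market code here, but the shape of that change is easy to illustrate. A generic sketch, assuming limit prices sit on the same discrete grid as the candidate prices; the Order type and both functions are hypothetical stand-ins, and the running demand/supply totals play the role of the mutable accumulators:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Order:
    is_buy: bool
    limit: int   # assumed to lie on the same price grid as the candidates
    qty: int

def equilibrium_naive(prices, orders):
    """O(P*N) shape: rescan every order for every candidate price."""
    best = (None, -1)
    for p in prices:
        demand = sum(o.qty for o in orders if o.is_buy and o.limit >= p)
        supply = sum(o.qty for o in orders if not o.is_buy and o.limit <= p)
        if min(demand, supply) > best[1]:
            best = (p, min(demand, supply))
    return best

def equilibrium_swept(prices, orders):
    """O(N+P) shape: bucket orders by price once, then sweep with running totals."""
    buy_at, sell_at = defaultdict(int), defaultdict(int)
    for o in orders:                              # one pass over the orders
        (buy_at if o.is_buy else sell_at)[o.limit] += o.qty
    demand = sum(buy_at.values())                 # at the lowest price, every buy qualifies
    supply = 0
    best = (None, -1)
    for p in sorted(prices):                      # ascending sweep over the price grid
        supply += sell_at[p]                      # sells with limit <= p now qualify
        if min(demand, supply) > best[1]:
            best = (p, min(demand, supply))
        demand -= buy_at[p]                       # buys limited at exactly p drop out above p
    return best
```

Both return the candidate price that maximizes traded volume; the second touches each order once instead of once per price.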

Failed cycles are free and informative. Cycle 5's first attempt regressed perf by 41% and was discarded. The failure went into plan.md; the next attempt at cycle 5 read it as context, tried a different shape, and that one merged. Discarded cycles cost nothing but model tokens, and they sharpen the next attempt.

Emergent beats prescribed. After cycle 0 (a known test-fix to unblock the pipeline), every subsequent prompt was open-ended. Claude consistently found things I wouldn’t have thought to prescribe — collapsing six mutable parameters into a result struct, spotting redundant double-lookups, noticing that static on private methods inflated the CC counter. The wins came from giving the model both what the metrics said and freedom to pick the angle of attack.

Small cycles compound. No single cycle was a hero. The headline result — 57% LOC reduction, 75% complexity reduction, 40% faster — is 13 modest improvements stacked. Each cycle left the codebase a little easier to read, which made the next cycle’s improvements easier to find.

Tests are the safety net. Tests were never modified after cycle 0. That made every subsequent cycle safe by construction: if the test suite is green, the public behavior is preserved. The test investment has to come before you start cycles, not during.

What it feels like to use

It is time-consuming. Each cycle takes a coffee-break’s worth of token spend — minutes, not seconds — because the model is doing real exploration in the worktree, not a one-shot rewrite. You watch the orchestrator log scroll, the test runner spin up, the metric numbers tick. Some cycles land cleanly and the diff is satisfying. Some get discarded and you see the failure reason in the log and shrug.

The thing that surprised me is how much of the value is in the honest accounting. The progress table doesn’t lie — if Max CC is going up, it’s right there. If LOC is creeping in the wrong direction because cycles are implementing features rather than removing them, you see it. The loop forces you to write down what “better” actually means before you start, and then it shows you whether you’re getting there.

Limitations

  • Tests must be real. If the test suite is thin, the gate is theatre. Cycles will happily delete code that has no test coverage, and you’ll find out later.
  • The metric must be honest. Pick the wrong proxy and you’ll optimize the proxy. Max method CC is gameable — Claude can split a complex method into two simpler ones with shared mutable state and the metric goes down while the underlying code is worse. The only real defense is reading the diffs, at least sometimes.
  • Not for greenfield code. There has to be something to simplify. Empty files don’t refactor.
  • Public API has to be stable. The constraint is in the prompt, but if the API is still in flux, every cycle will fight you. Cycles work best on a settled seam.

Closing

The loop is not magic. It’s a structured way to spend tokens on the kind of refactoring that’s individually too small to schedule but cumulatively worth doing. Most of the work happens while you do something else; the result is a codebase that drifts toward simpler over time instead of away from it. I’ll keep running them on whichever domain feels heaviest each week and see where it stops paying.