Technical Deep Dive

The Autoresearch Pattern: What Happens When You Let AI Optimize Its Own Instructions

WB
March 2026
18 min read

Andrej Karpathy pushed a 630-line Python file to GitHub on March 6, pointed Claude at a Markdown file called program.md, went to sleep, and woke up to 126 completed experiments and a measurably better language model. Not because the AI was smarter than a human researcher. Because it was willing to run experiments at 3 AM on a Tuesday while the human was unconscious. That's not a story about AI capability. That's a story about what happens when you remove the bottleneck of human availability from the research loop.

The repository is called autoresearch, and it has accumulated 43,400 GitHub stars and 6,000 forks since its release. Shopify CEO Tobias Lütke ran it overnight on internal data and reported a 19% performance gain from 37 autonomous experiments. Karpathy himself achieved over 700 autonomous code changes in a two-day run, finding 20 genuine improvements that transferred to larger models. We've been watching this closely at INS, and we decided to try something the community hadn't: we applied the autoresearch pattern not to a neural network, but to the writing instructions that produce every Imagine blog post you read on this site.

What follows is a practitioner's account of the pattern itself, our adaptation of it, and what we learned. As we explored in Skills for Probabilistic Systems and Delete Then Automate, the compounding advantages in the AI era come from systematizing the things that used to depend on individual effort. Autoresearch is the most concrete example we've seen of that principle applied to knowledge work.

The Three Files

Autoresearch is deliberately minimal. The entire framework is three files:

File 1

prepare.py: The Fixed Evaluation

Downloads training data, trains a tokenizer, and provides the evaluation function. The agent cannot modify this file. The evaluation metric (validation bits-per-byte) is computed on a fixed validation set using a fixed number of tokens. This is the tamper-proof constraint that prevents the agent from gaming its own score. In Karpathy's words, this is the "alignment constraint" of the system.
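The bits-per-byte arithmetic behind that metric is simple enough to sketch. This is our own toy illustration of the conversion, not code from prepare.py; `bits_per_byte` is a name we made up:

```python
# Hypothetical sketch: convert a model's summed negative log-likelihood
# (in nats) over a fixed validation byte sequence into bits-per-byte.
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Summed NLL in nats, divided by bytes, converted from nats to bits."""
    return total_nll_nats / (n_bytes * math.log(2))

# Sanity check: a uniform model over 256 byte values assigns
# log(256) nats per byte, which is exactly 8 bits per byte.
n = 1000
uniform_nll = n * math.log(256)
print(round(bits_per_byte(uniform_nll, n), 6))  # 8.0
```

The fixed validation set and fixed token count matter because they make this number comparable across every experiment the agent runs.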

File 2

train.py: The Artifact Under Optimization

Contains the full GPT model, optimizer, and training loop in ~630 lines. Architecture, hyperparameters, batch size, activation functions, attention patterns, everything is in scope. The agent modifies this file, commits the change, runs it against the evaluation function, and keeps or reverts based on the result. Every experiment runs under a fixed 5-minute time budget, making results comparable regardless of what the agent changes.
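The keep-or-revert mechanics can be sketched in a few lines. This is a hypothetical reconstruction, not the actual autoresearch harness: the helper names are ours, and the convention that train.py prints its final validation bits-per-byte as its last output line is our assumption.

```python
# Sketch of one experiment step: run the training script under the fixed
# time budget, read the metric, and keep or revert the edit via git.
import subprocess

def decide(new_score: float, best_score: float) -> str:
    """Lower validation bits-per-byte wins; ties and regressions are reverted."""
    return "keep" if new_score < best_score else "revert"

def run_experiment(best_score: float, budget_s: int = 300) -> float:
    """Run train.py under the 5-minute budget, then keep or revert the change."""
    try:
        out = subprocess.run(
            ["python", "train.py"], capture_output=True,
            text=True, timeout=budget_s, check=True,
        )
        new_score = float(out.stdout.strip().splitlines()[-1])  # final val bpb
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, ValueError):
        new_score = float("inf")  # a crash or timeout never wins
    if decide(new_score, best_score) == "keep":
        subprocess.run(["git", "commit", "-am", f"keep: bpb={new_score:.4f}"])
        return new_score
    subprocess.run(["git", "checkout", "--", "train.py"])  # discard the edit
    return best_score
```

The timeout is what makes results comparable: an experiment that runs out of budget is simply scored as a loss, not given more time.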

File 3

program.md: The Research Strategy

The Markdown file the agent reads for instructions. Karpathy describes it as "research org code written in English." It carries three registers simultaneously: instructions (what to search for), constraints (what must not change), and stopping criteria (when to wrap up). The most distinctive instruction: "Once the experiment loop has begun, do NOT pause to ask the human. The human might be asleep. You are autonomous. If you run out of ideas, think harder."

That's it. No orchestration framework. No multi-agent coordination layer. No complex infrastructure. Three files, a git repository, and a single instruction: loop forever.

The Core Insight

Autoresearch is not a tool. It is a pattern. One artifact to modify, one metric to evaluate, one keep-or-discard gate, repeat indefinitely. The simplicity is the point. Karpathy compared it to neural architecture search and dismissed the comparison: "Neural architecture search is such a weak version of this that it's in its own category. This is an actual LLM writing arbitrary code, learning from previous experiments."

The Numbers

The results from Karpathy's runs are specific enough to ground the pattern in reality rather than speculation.

700+
Autonomous code changes in 2-day run
126
Experiments completed overnight
11%
Efficiency gain on GPT-2 benchmark
19%
Performance gain (Shopify CEO overnight run)

In the flagship two-day run, the agent processed roughly 700 code changes and found approximately 20 genuine additive improvements. Validation bits-per-byte dropped from 1.003 to 0.974. The improvements transferred when applied to larger models, dropping the "Time to GPT-2" leaderboard metric from 2.02 hours to 1.80 hours. The improvements fell into recognizable categories: bug fixes, better resource allocation under the time constraint, normalization tweaks, regularization tuning, and narrow hyperparameter sweet spots that no human would have the patience to find manually.

One finding from the agent comparison runs is worth noting: Claude Opus 4.6 via Claude Code maintained 12-hour autonomous loops and ran 118+ experiments without breaking. GPT-5.4 failed to follow the "loop forever" instruction, stopping to ask permission to continue. Karpathy's conclusion: "Harness affordances matter more than raw model capability." The agent that can sustain autonomous operation wins, regardless of which model scores higher on benchmarks.

The Conceptual Leap

Here's where the pattern gets interesting for organizations like ours that don't train language models. The abstract structure of autoresearch is:

  1. Define an artifact you want to improve (train.py)
  2. Define a metric that measures quality (prepare.py)
  3. Let an agent modify the artifact, evaluate the result, and keep or revert (program.md)
  4. Repeat until stopped

Nothing in that structure requires the artifact to be a neural network. Nothing requires the metric to be validation loss. The pattern works for any domain where you can define a measurable quality signal: code performance, test coverage, response latency, content quality, configuration optimization, prompt engineering. The only hard requirement is that the evaluation function must be automated and tamper-proof.
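The four steps above reduce to a loop you can write in a dozen lines. A minimal, domain-agnostic sketch, with toy `propose` and `evaluate` functions standing in for the real ones (in autoresearch they are "edit train.py" and "run prepare.py's metric"; in our adaptation, "edit the skill definition" and "score a generated post"):

```python
# Generic keep-or-revert optimization loop over any artifact.
from typing import Callable

def optimization_loop(
    artifact: str,
    propose: Callable[[str], str],    # produce a candidate modification
    evaluate: Callable[[str], float], # automated, tamper-proof metric
    steps: int,
) -> tuple[str, float]:
    best, best_score = artifact, evaluate(artifact)
    for _ in range(steps):
        candidate = propose(best)
        score = evaluate(candidate)
        if score > best_score:        # keep only improvements
            best, best_score = candidate, score
        # otherwise: discard (revert) and try the next idea
    return best, best_score

# Toy run: "improve" a string toward containing more vowels.
result, score = optimization_loop(
    "bcdfg",
    propose=lambda s: s + "a",
    evaluate=lambda s: sum(c in "aeiou" for c in s),
    steps=3,
)
print(result, score)  # bcdfgaaa 3
```

Everything domain-specific lives in `propose` and `evaluate`; the loop itself never changes.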

The community recognized this immediately. Within days of the release, forks appeared for code quality optimization (iterating on test coverage and bundle size), content optimization (iterating on readability scores and keyword density), and a generalized version that generates setup files for any project. The pattern is substrate-independent. The loop is the contribution.

The Pattern

Autoresearch inverts the traditional relationship between human and AI in knowledge work. Instead of a human using AI as a tool, the human writes the evaluation criteria and the AI does the iterating. The human's job shifts from doing the work to defining what "good" looks like. That's a different skill, and as we discussed in The Great Convergence, it's the skill that compounds.

Our Adaptation: Optimizing the Imagine Skill

We decided to test whether the autoresearch pattern could improve something that isn't code at all: the writing instructions that shape every Imagine blog post. The voice guide, vocabulary rules, structural patterns, and example library that Claude follows when generating a post for this blog.

Here's how we mapped the three files:

prepare.py: The Evaluation Function

We built a composite scoring function from 0 to 100. Forty points come from binary rule checks that are fully automated and reliable: does the post contain em dashes (deduct 5), exclamation points (deduct 4), banned vocabulary from a list of 15 hype terms (deduct 3 per word), a first-person singular pronoun (deduct 5), a missing pivot sentence in the lead (deduct 4), or missing structural elements such as key insight boxes, industrial application boxes, or CTA sections (deduct 3 each)? These checks run in milliseconds with zero ambiguity.
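A few of those checks are easy to sketch. The banned terms below are stand-ins (we're not going to print the real list in a post that deducts points for it), and the pivot-sentence and structural-element checks are omitted:

```python
# Sketch of the deterministic half of the scoring function.
import re

BANNED = {"game-changer", "revolutionary"}  # stand-ins for the real 15 terms

def binary_score(post: str) -> int:
    """Start at 40 and deduct for each mechanical rule violation."""
    score = 40
    if "\u2014" in post:              # em dash: deduct 5
        score -= 5
    if "!" in post:                   # exclamation point: deduct 4
        score -= 4
    words = set(re.findall(r"[\w-]+", post.lower()))
    score -= 3 * len(BANNED & words)  # deduct 3 per banned term
    if re.search(r"\bI\b", post):     # first-person singular: deduct 5
        score -= 5
    return max(score, 0)

print(binary_score("A clean post with no violations."))                    # 40
print(binary_score("I think this revolutionary idea is a game-changer!"))  # 25
```

Because every check is a string operation, this half of the score is perfectly reproducible across runs.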

The remaining 60 points come from an independent LLM judge that scores six criteria on a 1-10 scale: voice adherence, lead quality, abundance framing, anti-hype compliance, concrete specificity, and structural flow. The judge runs as a separate Claude CLI call, ensuring the evaluator is independent from the generator. This prevents the kind of self-bias that would make the scores meaningless.
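Combining the two halves is simple arithmetic. A sketch, with a stubbed judge result standing in for the separate Claude CLI call:

```python
# Sketch of the 0-100 composite: 40 binary points + 60 judge points.
CRITERIA = ("voice", "lead", "abundance", "anti_hype", "specificity", "flow")

def composite(binary: int, judge: dict[str, int]) -> int:
    """binary is 0-40; each of the six judge criteria is 1-10, for 60 more."""
    assert set(judge) == set(CRITERIA), "judge must cover all six criteria"
    return binary + sum(judge.values())

# Stubbed judge output; the real numbers come from an independent LLM call.
scores = {"voice": 9, "lead": 8, "abundance": 7,
          "anti_hype": 10, "specificity": 8, "flow": 9}
print(composite(40, scores))  # 91
```

Keeping the judge in a separate process is the cheap version of the tamper-proof constraint: the generator never sees or influences its own grader.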

train.py: The Skill Definition

The equivalent of Karpathy's 630-line model file is a set of editable strings containing our complete skill definition: the voice guide, vocabulary, structure rules, example patterns, and the system prompt assembly function. The agent modifies these strings, generates a test post using the modified instructions, evaluates the result, and keeps or reverts.

program.md: The Agent Instructions

Our version follows Karpathy's structure closely: setup, constraints, the experiment loop, strategy guidance, and the "never stop" instruction. We added diagnostic guidance specific to our scoring system: if binary_score is low, look at binary_failures first (free points). If abundance is low, strengthen the pivot formula. If specificity is low, add instructions about grounding claims in numbers.

What We Found: Ten Experiments

We ran ten experiments with the agent modifying our skill definition and generating test posts across three different topics (private 5G, digital twins, cybersecurity compliance). Three changes were kept. Seven were reverted. Here are the ones that mattered most.

Baseline

Starting Point: 83.7/100 Average

The initial skill definition scored 81 on private 5G, 80 on digital twins, and 90 on cybersecurity compliance. Common failures: three banned hype terms leaking through despite being listed in the vocabulary rules, and abundance framing consistently scoring 7/10.

Kept

FINAL CHECK at End of Prompt: +8 points (KEPT)

Moved the banned words list from the middle of the prompt to the absolute last line, framed as "FINAL CHECK - THESE WORDS MUST NOT APPEAR ANYWHERE IN YOUR OUTPUT." Binary score jumped to 40/40 across all topics. The same information, repositioned for recency bias, eliminated every violation. This single change produced the largest improvement of any experiment.
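The winning change amounts to a reordering of prompt assembly. A hypothetical sketch, with placeholder section contents and stand-in banned words:

```python
# Sketch: assemble the system prompt so the banned list is the literal
# last thing the model reads.
BANNED_WORDS = ["game-changer", "revolutionary", "cutting-edge"]  # stand-ins

def assemble_prompt(voice_guide: str, structure_rules: str) -> str:
    """Append the FINAL CHECK block after every other instruction."""
    final_check = (
        "FINAL CHECK - THESE WORDS MUST NOT APPEAR ANYWHERE IN YOUR OUTPUT: "
        + ", ".join(BANNED_WORDS)
    )
    # Before the experiment, the list sat mid-prompt inside the vocabulary
    # section; the kept change simply moved it to the very end.
    return "\n\n".join([voice_guide, structure_rules, final_check])

prompt = assemble_prompt("VOICE GUIDE ...", "STRUCTURE RULES ...")
print(prompt.endswith(", ".join(BANNED_WORDS)))  # True
```

The change costs nothing at generation time; it only reorders text the prompt already contained.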

Kept

Specific Replacement Words: Stable (KEPT)

Added explicit replacements for the most persistent offenders: the vague systems term becomes "environment" or "landscape," the change-hype noun becomes "shift" or "transformation," the corporate process word becomes "path." These domain-natural terms kept reappearing because the model needed alternatives, not just prohibitions.

Kept

Conservative Deduplication: Simplification Win (KEPT)

Removed four rules from the "What We Never Do" section that were exact duplicates of rules in CRITICAL RULES (em dashes, exclamation points, hype vocabulary, first-person pronoun). Four fewer lines, identical scores. Per Karpathy's simplicity criterion: removing code for equal results is always a win.

Reverted

Seven Experiments That Didn't Work

Among the seven: adding verbose abundance examples made the prompt too long and hurt specificity. Requiring forward-looking last sentences caused em dashes and banned words to leak. A formula-based abundance pattern scored worse than the original verbose example. An explicit specificity rule produced no improvement. And an aggressive deduplication revealed a load-bearing instruction hiding in plain sight. But here's the encouraging part: every revert taught us something specific about how the instructions actually work. The seven failures mapped the boundaries of the system more precisely than the three successes did. That knowledge compounds across every future experiment.

Final result: Average scores improved from 83.7 to 88.3 across all three topics, a 5-point gain from three kept changes out of ten attempts. The skill definition is measurably better at producing posts that comply with our voice rules. The biggest win came from a change no human would have prioritized: moving a list from the middle of the prompt to the end. The biggest lesson came from a revert: discovering that a single line about "leaving the reader in fear" was doing more work for abundance framing than any explicit abundance instruction we tried.

The Real Transformation

The value of autoresearch isn't the specific improvements it finds. It's the shift from subjective assessment to empirical measurement. Before this experiment, we evaluated blog post quality by reading the output and making judgment calls. Now we have a scoring function that runs in under a minute, produces a number from 0 to 100, and identifies exactly which criteria are weakest. The skill definition became something we can iterate on with data, not just intuition.

The Six Lessons

Ten experiments, three kept, seven reverted. Each one taught us something specific about how AI instructions respond to modification, and these lessons generalize beyond blog writing.

  1. Prompt positioning matters more than prompt content. The banned words list existed in our vocabulary section from the start. Moving it to a FINAL CHECK block at the absolute end of the prompt, where the model's attention is strongest, eliminated every violation across all three test topics. Same information, different position, dramatically different result.
  2. Examples can hurt. Adding three detailed abundance pivot examples made the prompt 15% longer and the output measurably worse. The model's attention was spread across more text without becoming more focused. Concise rules outperform verbose examples when the rules are already clear.
  3. Binary checks are free points. Mechanical rule violations (banned words, em dashes, missing structural elements) are the easiest scores to improve because the fixes are deterministic. Fix these first before attempting subjective quality improvements. In our case, binary compliance went from 31 to 40 in one experiment.
  4. Creative freedom and rule compliance are in tension. Giving the model more structural guidance (the H2 arc) improved one criterion (abundance) while degrading another (voice discipline). But this tension is productive: it tells you exactly where the frontier is. Constraints that prevent violations are more valuable than suggestions that enable creativity, and that insight itself compounds across every future experiment.
  5. Reverts teach you what's load-bearing. When we removed the line "Leave the reader in fear without a positive reframe" during a deduplication pass, abundance scores dropped immediately. That single line was doing more work for abundance framing than any explicit abundance instruction we added. The autoresearch pattern finds these hidden dependencies because it tests each change in isolation. Seven reverts out of ten sounds like a low hit rate, but each revert identified a boundary we wouldn't have found otherwise.
  6. The evaluation function is the hardest part, and the most valuable asset. Building a scoring function that reliably distinguishes a 74 from an 89 is genuinely difficult. Our binary checks are reliable but crude. Our LLM judge scores vary by 2-3 points across identical inputs. But once you have a working evaluation function, every future improvement is incremental. The function itself becomes institutional knowledge that outlasts any individual experiment.

From Overnight Loop to Integrated Refine Cycle

The autoresearch pattern we ran on the skill definition is a one-off effort: you optimize the instructions, keep the improvements, and every future post benefits from the better baseline. But we also built something complementary that runs continuously: an integrated refine loop that applies the same evaluate-and-improve cycle to each individual post as it's being written.

When we generate a new Imagine post, the system now runs through user-defined refinement cycles. Each cycle evaluates the current draft against the scoring function, identifies binary violations and the weakest LLM criteria, makes targeted edits, and re-evaluates. The default is three cycles. You can specify more.
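The control flow of a refinement cycle looks roughly like this. The `evaluate` and `revise` functions are placeholders for our scoring function and an LLM editing call; toy versions stand in so the loop runs end to end:

```python
# Sketch of the per-post refine cycle: score, target the weakest
# criterion, apply a targeted edit, and keep the edit only if it helps.
from typing import Callable

def refine(draft: str,
           evaluate: Callable[[str], dict[str, float]],
           revise: Callable[[str, str], str],
           cycles: int = 3) -> str:
    for _ in range(cycles):
        scores = evaluate(draft)
        weakest = min(scores, key=scores.get)  # lowest-scoring criterion
        candidate = revise(draft, weakest)     # targeted edit for that criterion
        if sum(evaluate(candidate).values()) > sum(scores.values()):
            draft = candidate                  # keep only improvements
    return draft

# Toy run: each revision appends a tag that "fixes" the named criterion.
evals = lambda d: {"abundance": d.count("[abundance]"), "flow": d.count("[flow]")}
fix = lambda d, c: d + f" [{c}]"
print(refine("draft", evals, fix, cycles=2))  # draft [abundance] [flow]
```

It's the same keep-or-revert gate as the skill-level loop, just applied to one draft instead of the instructions that produce all drafts.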

The results on the posts we've produced since integrating this loop: initial drafts typically score in the 74-84 range. After three refinement cycles, they consistently land in the 85-89 range. The binary score reaches 40/40 by the end of cycle one (mechanical fixes). The remaining cycles target whichever LLM criterion scored lowest, usually abundance framing or structural flow.

This is the autoresearch pattern applied at two levels: the skill-level optimization is a one-off effort (though it can be re-run on demand as the voice evolves or new patterns emerge). The per-post refine loop runs continuously, optimizing each individual output before a human sees it. The first sets a better baseline. The second ensures consistency within each post.

The Compounding Advantage

Every improvement to the skill definition makes every future post better. Every per-post refinement cycle produces a measurably higher-quality output. The one-off skill optimization and the per-post refine loop are independent but mutually reinforcing: better instructions produce better first drafts, which require fewer refinement cycles, which frees up budget for more ambitious content. This is what compounding looks like when applied to creative quality rather than financial returns.

The Pattern Generalizes

The autoresearch pattern is not specific to language model training or blog writing. It applies to any workflow where you can define three things:

  1. An artifact that can be modified. A prompt template, a configuration file, a style guide, a set of business rules, a workflow definition, a test suite, a data pipeline configuration.
  2. A metric that can be evaluated automatically. Test pass rate, response latency, customer satisfaction score, compliance check results, output quality score, processing time, error rate.
  3. A keep-or-discard gate. Did the modification improve the metric? Keep it. Did it make things worse? Revert. This is the simplest possible optimization strategy, and it works because the agent can run hundreds of experiments in the time a human runs one.

The organizations that apply this pattern won't just produce better outputs. They'll develop a systematic method for continuous improvement that compounds over time. Every experiment teaches the agent (and the team reviewing its results) something about how their systems respond to modification. That knowledge compounds in exactly the way we've been describing across Imagine: each iteration makes the next one faster, more targeted, and more likely to succeed.

What This Means for INS

We've already applied the autoresearch pattern to our Imagine blog skill, and the results are measurable: a 5-point average improvement from ten experiments (three kept, seven reverted), with an integrated refine loop now running on every post we produce.

But the pattern extends well beyond content. We see three immediate opportunities:

  • Custom tool generation: We build internal tools for everything from inventory sync to project scoping. Each tool has measurable quality signals (test pass rate, user completion rate, error frequency). The autoresearch loop could iterate on the prompts and specifications that generate those tools, improving output quality the same way it improved our writing instructions.
  • Customer documentation quality: Our case studies, technical guides, and service descriptions have structured quality criteria similar to our blog scoring function. Applying the evaluate-and-refine pattern to customer-facing documentation creates the same compounding effect we've seen with Imagine.
  • Proposal generation: As we discussed in 14 Projects, 12 Months, Zero Without AI, our development velocity has accelerated dramatically. The autoresearch pattern could improve the SOW and proposal templates we use for every customer engagement, iterating on clarity, completeness, and compliance with customer requirements.

The pattern is simple enough to implement in a day and general enough to apply anywhere we have a quality metric. That's the definition of a primitive that compounds.

The Path Forward

Karpathy described the next step for autoresearch as "asynchronously massively collaborative for agents. The goal is not to emulate a single PhD student, it's to emulate a research community." He also predicted that "all LLM frontier labs will do this. It's the final boss battle."

For organizations outside frontier AI labs, the implications are more immediate and more practical. The autoresearch pattern gives you a systematic method for improving any AI-assisted workflow. The barrier to entry is three files and a quality metric. The potential is limited only by how many processes in your organization have measurable outputs.

As we discussed in The New Default, the question has shifted from "How can we use AI?" to "Why aren't we using AI?" The autoresearch pattern pushes that question one level deeper: you're using AI, but are you letting AI improve how you use AI? The answer to that question separates organizations that use AI as a tool from organizations that build AI into a compounding flywheel.

We've shown the numbers. Baselines averaging 83.7, finals averaging 88.3. Three kept changes out of ten experiments. The change that worked best was one no human would have prioritized. The seven reverts taught us more about our instructions than the three keeps did, including which single line was secretly holding the entire abundance score together. Every experiment, kept or reverted, made the system smarter. That's the autoresearch pattern. And it compounds.

The Optimization Loop

Three files. One metric. An agent that never sleeps. The autoresearch pattern doesn't care whether you're optimizing a neural network, a blog post, or a business process. It cares about one thing: is the next version better than the last? If you can measure it, you can improve it. If you can automate the measurement, you can improve it while you sleep. The teams that build this into their workflows now will compound their advantage with every cycle.