Autoresearch
Overview
Autoresearch is Andrej Karpathy’s small-scale framework for autonomous ML experimentation. It gives an agent a real LLM training setup, lets it modify only train.py, and has it repeatedly run fixed-budget experiments, measure val_bpb, and decide whether each change should advance or be discarded.
Why it matters
- Makes autonomous research concrete instead of hypothetical: the loop is small enough to understand but real enough to produce measurable model improvements.
- Reframes the human role from hand-editing training code to writing the agent's research brief in program.md.
- Demonstrates a useful pattern for agentic experimentation: immutable evaluation + narrow mutation surface + automatic keep/discard decisions.
How the system is structured
- prepare.py is intentionally fixed and contains data prep, tokenizer training, constants, data loading, and evaluation.
- train.py is the only file the agent is supposed to edit, making experiments easy to review and revert.
- program.md contains the operating instructions for the agent, including how to log results and when to keep or discard a commit.
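Logging one experiment's outcome to results.tsv might look like the sketch below. The column layout here (timestamp, description, val_bpb, decision) is an illustrative assumption; program.md defines the actual format the agent is told to use.

```python
import csv
import datetime

def log_result(path: str, description: str, val_bpb: float, kept: bool) -> None:
    """Append one experiment's outcome as a tab-separated row.
    Column order is a hypothetical layout, not the repo's real schema."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow([
            datetime.datetime.now().isoformat(timespec="seconds"),
            description,
            f"{val_bpb:.4f}",
            "keep" if kept else "discard",
        ])

# Example: record a run that improved the metric.
log_result("results.tsv", "fuse layernorm into attention block", 1.0312, True)
```

Appending rather than rewriting keeps the file a simple, ever-growing audit trail that both the agent and a human reviewer can scan.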
Core operating loop
- Read the research instructions in program.md.
- Modify train.py with a new experimental idea.
- Run training for a fixed 5-minute wall-clock budget.
- Extract val_bpb from the run log.
- Log the result in results.tsv.
- Keep the commit only if the metric improves; otherwise reset and try again.
Key design choices
- Fixed time budget: experiments are compared at equal wall-clock cost rather than equal step count.
- Single optimization target: val_bpb is the main metric, which keeps comparisons fair even if tokenizer decisions change.
- Simplicity bias: the instructions explicitly prefer simpler code when performance is similar.
- Single-GPU scope: the repo is deliberately minimal and avoids distributed-training complexity.
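The fixed time budget above can be enforced with a wall-clock check inside the training loop. This is a minimal sketch under stated assumptions: `train_step` and `run_with_budget` are illustrative names, not the repo's code.

```python
import time

BUDGET_SECONDS = 5 * 60  # fixed 5-minute wall-clock budget

def run_with_budget(train_step, budget: float = BUDGET_SECONDS) -> int:
    """Run training steps until the wall-clock budget expires.
    Experiments are compared at equal time, not equal step count."""
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget:
        train_step()
        steps += 1
    return steps  # step count varies: faster code fits more steps

# Tiny demonstration with a no-op step and a shrunken budget.
steps_done = run_with_budget(lambda: None, budget=0.05)
```

The consequence of this design is that a speedup and a quality improvement compete under the same metric: making each step cheaper buys more steps inside the budget, which can lower val_bpb just as a modeling change can.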
Practical implications
- The default setup aims for roughly 12 experiments per hour and around 100 experiments in an overnight run.
- Results are highly platform-specific: autoresearch is optimized for the machine it is run on, not for universal benchmark comparability.
- The framework is small enough to fork and adapt for other hardware, datasets, or agent workflows.
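The throughput figures above follow directly from the 5-minute budget; the 8-hour overnight window used below is an assumption, and per-experiment overhead (the agent's edit and evaluation time) is ignored.

```python
BUDGET_MINUTES = 5
per_hour = 60 // BUDGET_MINUTES   # 12 experiments per hour
overnight = per_hour * 8          # 96 experiments in an 8-hour run, i.e. ~100
print(per_hour, overnight)
```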