Autoresearch

Overview

Autoresearch is Andrej Karpathy’s small-scale framework for autonomous ML experimentation. It gives an agent a real LLM training setup, restricts its edits to train.py, and has it repeatedly run fixed-budget experiments, measure val_bpb, and decide whether each change is kept or discarded.

Why it matters

  • Makes autonomous research concrete instead of hypothetical: the loop is small enough to understand but real enough to produce measurable model improvements.
  • Reframes the human role from hand-editing training code to writing the agent’s research brief in program.md.
  • Demonstrates a useful pattern for agentic experimentation: immutable evaluation + narrow mutation surface + automatic keep/discard decisions.

How the system is structured

  • prepare.py is intentionally fixed and contains data prep, tokenizer training, constants, data loading, and evaluation.
  • train.py is the only file the agent is supposed to edit, making experiments easy to review and revert.
  • program.md contains the operating instructions for the agent, including how to log results and when to keep or discard a commit.

Core operating loop

  1. Read the research instructions in program.md.
  2. Modify train.py with a new experimental idea.
  3. Run training for a fixed 5-minute wall-clock budget.
  4. Extract val_bpb from the run log.
  5. Log the result in results.tsv.
  6. Keep the commit only if the metric improves; otherwise reset and try again.
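Steps 4–6 can be sketched as a few small helpers. This is a minimal illustration, not the repo's actual code: the helper names, the `val_bpb: 1.2345` log-line format, and the results.tsv column layout are all assumptions.

```python
import re

def parse_val_bpb(log_text: str) -> float:
    """Extract the last reported val_bpb from a run log (assumed format)."""
    matches = re.findall(r"val_bpb[:=]\s*([0-9.]+)", log_text)
    if not matches:
        raise ValueError("no val_bpb found in log")
    return float(matches[-1])

def should_keep(new_bpb: float, best_bpb: float) -> bool:
    """Keep the commit only if the metric strictly improves (lower is better)."""
    return new_bpb < best_bpb

def log_result(path: str, idea: str, val_bpb: float, kept: bool) -> None:
    """Append one experiment row to a tab-separated results file."""
    with open(path, "a") as f:
        f.write(f"{idea}\t{val_bpb:.4f}\t{'keep' if kept else 'discard'}\n")
```

In a real run, the agent would wrap this with a git commit on improvement and a `git reset --hard` otherwise, so the working tree always reflects the best configuration found so far.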

Key design choices

  • Fixed time budget: experiments are compared at equal wall-clock cost rather than equal step count.
  • Single optimization target: val_bpb (validation bits per byte) is the sole metric; because it is measured per byte rather than per token, comparisons stay fair even if tokenizer decisions change.
  • Simplicity bias: the instructions explicitly prefer simpler code when performance is similar.
  • Single-GPU scope: the repo is deliberately minimal and avoids distributed-training complexity.
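The fixed time budget amounts to a wall-clock check inside the training loop, sketched below. The 300-second constant matches the 5-minute budget above; the loop shape and the `step_fn` placeholder are illustrative assumptions.

```python
import time

def train_with_budget(step_fn, budget_s: float = 300.0) -> int:
    """Run training steps until the wall-clock budget is exhausted.

    Comparing runs at equal wall-clock cost means a change that slows
    each step must pay for itself in per-step quality, not in raw step
    count: a slower-but-better step simply gets fewer iterations.
    """
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()  # one optimizer step (placeholder)
        steps += 1
    return steps
```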

Practical implications

  • The default setup aims for roughly 12 experiments per hour and around 100 experiments in an overnight run.
  • Results are highly platform-specific: autoresearch is optimized for the machine it is run on, not for universal benchmark comparability.
  • The framework is small enough to fork and adapt for other hardware, datasets, or agent workflows.
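The throughput figures above follow directly from the 5-minute budget, ignoring per-experiment overhead such as agent reasoning time; the 8-hour overnight window here is an assumption for illustration.

```python
BUDGET_MIN = 5                   # fixed wall-clock budget per experiment
per_hour = 60 // BUDGET_MIN      # experiments per hour
overnight = per_hour * 8         # experiments in an 8-hour overnight run
print(per_hour, overnight)       # → 12 96, i.e. roughly 100 overnight
```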