Autoresearch

Overview

Autoresearch is Andrej Karpathy’s small-scale framework for autonomous ML experimentation. It gives an agent a real LLM training setup, restricts its edits to train.py, and has it repeatedly run fixed-budget experiments, measure val_bpb, and decide whether each change is kept or discarded.

Why it matters

  • Makes autonomous research concrete instead of hypothetical: the loop is small enough to understand but real enough to produce measurable model improvements.
  • Reframes the human role from hand-editing training code to writing the agent’s research brief in program.md.
  • Demonstrates a useful pattern for agentic experimentation: immutable evaluation + narrow mutation surface + automatic keep/discard decisions.

How the system is structured

  • prepare.py is intentionally fixed and contains data prep, tokenizer training, constants, data loading, and evaluation.
  • train.py is the only file the agent is supposed to edit, making experiments easy to review and revert.
  • program.md contains the operating instructions for the agent, including how to log results and when to keep or discard a commit.

Core operating loop

  1. Read the research instructions in program.md.
  2. Modify train.py with a new experimental idea.
  3. Run training for a fixed 5-minute wall-clock budget.
  4. Extract val_bpb from the run log.
  5. Log the result in results.tsv.
  6. Keep the commit only if the metric improves; otherwise reset and try again.
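Steps 4–6 can be sketched as a few small helpers. This is a minimal illustration, not the repo's actual code: the helper names, the `val_bpb: 1.2345` log-line format, and the results.tsv column layout are all assumptions.

```python
import re

def parse_val_bpb(log_text: str) -> float:
    """Extract the last reported val_bpb from a run log (assumed format)."""
    matches = re.findall(r"val_bpb[:=]\s*([0-9.]+)", log_text)
    if not matches:
        raise ValueError("no val_bpb found in log")
    return float(matches[-1])

def should_keep(new_bpb: float, best_bpb: float) -> bool:
    """Keep the commit only if the metric strictly improves (lower is better)."""
    return new_bpb < best_bpb

def log_result(path: str, idea: str, val_bpb: float, kept: bool) -> None:
    """Append one experiment row to a tab-separated results file."""
    with open(path, "a") as f:
        f.write(f"{idea}\t{val_bpb:.4f}\t{'keep' if kept else 'discard'}\n")
```

In a real run, the agent would wrap this with a git commit on improvement and a `git reset --hard` otherwise, so the working tree always reflects the best configuration found so far.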

Key design choices

  • Fixed time budget: experiments are compared at equal wall-clock cost rather than equal step count.
  • Single optimization target: val_bpb (validation bits per byte) is the sole metric; because it is measured per byte rather than per token, comparisons stay fair even if tokenizer decisions change.
  • Simplicity bias: the instructions explicitly prefer simpler code when performance is similar.
  • Single-GPU scope: the repo is deliberately minimal and avoids distributed-training complexity.
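The fixed time budget amounts to a wall-clock check inside the training loop, sketched below. The 300-second constant matches the 5-minute budget above; the loop shape and the `step_fn` placeholder are illustrative assumptions.

```python
import time

def train_with_budget(step_fn, budget_s: float = 300.0) -> int:
    """Run training steps until the wall-clock budget is exhausted.

    Comparing runs at equal wall-clock cost means a change that slows
    each step must pay for itself in per-step quality, not in raw step
    count: a slower-but-better step simply gets fewer iterations.
    """
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()  # one optimizer step (placeholder)
        steps += 1
    return steps
```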

Practical implications

  • The default setup aims for roughly 12 experiments per hour and around 100 experiments in an overnight run.
  • Results are highly platform-specific: autoresearch is optimized for the machine it is run on, not for universal benchmark comparability.
  • The framework is small enough to fork and adapt for other hardware, datasets, or agent workflows.
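The throughput figures above follow directly from the 5-minute budget, ignoring per-experiment overhead such as agent reasoning time; the 8-hour overnight window here is an assumption for illustration.

```python
BUDGET_MIN = 5                   # fixed wall-clock budget per experiment
per_hour = 60 // BUDGET_MIN      # experiments per hour
overnight = per_hour * 8         # experiments in an 8-hour overnight run
print(per_hour, overnight)       # → 12 96, i.e. roughly 100 overnight
```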