Experiment · 2026.05

Rollout: from observability to continuous post-training

Rollout is our infrastructure for running agent tasks from your own code and reviewing exactly what happened. It is built around a single primitive, one attempt at one task, that is equally legible to a human reviewer, a continuous evaluator, and a reinforcement learning loop.

This post explains why we started it, what we see across agent observability and RL training today, where we think the seam is, and what we are building next.


Introduction

The agents we care about are not chatbots. They are programs that decide, act, and produce evidence we can review: a draft, a code change, a search result, a labeled outcome. To improve them we need to look at what they did, compare attempts, score outcomes, and feed the good and bad ones back into the next version of the system.

Most of the tooling we tried was built for one half of that loop. Observability tools captured traces beautifully but stopped at dashboards. RL frameworks captured rollouts as training data but left no surface for a human to inspect them. We kept needing both at once, and we kept gluing them together with brittle scripts.

Rollout is the substrate we wished we had: a workspace where datasets, runs, rollouts, and traces are first-class, where the SDK call your production agent makes is the same call a training loop or eval harness makes, and where every attempt is a citable object you can replay, score, and export.

Why we are doing this

MV37 is an independent lab working on self-improving systems, starting with AI research itself. The throughline across our projects (Paperbase for paper access, physical AI, and automating reinforcement learning) is that an agent that improves over time needs a good feedback substrate: tasks worth running, attempts worth comparing, and traces worth keeping.

When we started instrumenting our own agents, the friction was not the model. It was everything around the model. Where do tasks live? How do we re-run only the failing ones? How do we attach a 40-page PDF to a task without inlining it into a prompt? How do we tell a teammate “look at run 47, attempt 3” and have them actually see the same thing we are seeing? How do we turn yesterday’s production traces into tomorrow’s RL data without reshaping them three times?

We wrote Rollout because we needed those questions to have boring answers. The goal is not a new abstraction. The goal is for the same object — the rollout — to be the thing your agent emits, the thing your reviewer reads, and the thing your trainer consumes.

What we see in the industry

Two ecosystems have grown up around agent feedback, and they barely talk to each other.

On one side is the observability and evals cluster: LangSmith, Langfuse, Braintrust, W&B Weave, Arize, Laminar, Helicone, and a long tail of newer entrants. These tools are excellent at capturing traces from a production agent, showing them in a timeline, and letting you score outputs. They treat the trace as the primary unit. They are designed for humans who need to debug a deployed system or run an eval batch before shipping.

On the other side is the agentic RL cluster: OpenPipe ART, ARES, NVIDIA NeMo Gym and ProRL Agent, OpenClaw-RL, and a wave of open trainers that emerged once GRPO and related algorithms made multi-turn agent training tractable. These tools treat the rollout as the primary unit. They care about parallel execution, sandboxed environments, reward shaping, and policy-gradient bookkeeping. They are designed for machines that need millions of attempts.

Both sides are right about their primitive. A trace is what a human needs. A rollout is what a trainer needs. The interesting observation is that they are the same object, viewed from two angles. A trace is a rollout with the prose turned up. A rollout is a trace with the reward turned up.

Almost no tool treats them as the same object. Observability platforms make it painful to round-trip a trace into a training dataset. RL frameworks make it painful to open a single rollout and read it like a story. Teams that want both end up writing the connector themselves, every time.

Two ecosystems grew up around agent feedback. Their primitives are the same artifact — one attempt at one task — with the prose or the reward turned up.

The gap

The gap we keep tripping over is structural, not cosmetic. It shows up in three places.

The unit of work is inconsistent. One tool calls it a span, another calls it an episode, another a session, another a run. Each tool has its own idea of where one attempt ends and the next begins, and almost none of them agree on what “the same task, tried again” means. Without a shared unit, you cannot ask the obvious questions: how did attempt 2 differ from attempt 1, how did this week’s rollouts on dataset X compare to last week’s, which 50 attempts should go into the next training batch.

Tasks and files are second-class. Most agent tooling assumes the input is a prompt string. Real tasks have instructions, structured inputs, file attachments (PDFs, code, screenshots, datasets), and expected output schemas. When the input is a string, every team rebuilds task management in a Notion table or a YAML directory. The work then never makes it back into the tool that watches the runs.

The observability surface and the training surface are different products. A trace UI is built to be read by one human at a time. A training pipeline is built to be consumed by a GPU cluster. The same rollout has to live in both. Today, it usually lives in neither: it lives in a JSONL dump on S3, in a Slack screenshot, and in someone’s memory.

We think the right response is not another dashboard or another trainer. It is a small, opinionated concept model with SDKs that honor it, a UI that reads it well, and a storage layout that both humans and trainers can address.

How Rollout closes it

Rollout has a deliberately small vocabulary. We picked the names that were already in our heads and made each one mean exactly one thing.

A small concept model

A workspace is a shared place for a team or project. A dataset is a named collection of tasks. A task is one unit of work, with an instruction, optional structured input, optional files, and an optional output schema. A run is one execution of a dataset. A rollout is one attempt at one task inside that run. A trace is the timeline of what happened during a rollout.

The model is intentionally flat. A run has many rollouts. A rollout has exactly one trace. If you re-run the same task five times, you get five rollouts you can compare side-by-side. Every rollout has a stable ID you can quote in a pull request, paste into a training script, or share with a teammate.
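To make the shape concrete, here is a minimal sketch of the concept model as plain Python dataclasses. It is an illustration only, not the SDK's actual types: the class layout, the field names, and the ID format (`run-47/attempt-3`) are assumptions chosen to mirror the vocabulary above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Task:
    """One unit of work: an instruction plus optional input, files, and schema."""
    id: str
    instruction: str
    input: Optional[dict[str, Any]] = None
    files: list[str] = field(default_factory=list)
    output_schema: Optional[dict[str, Any]] = None


@dataclass
class Trace:
    """The timeline of what happened during one rollout."""
    events: list[dict[str, Any]] = field(default_factory=list)


@dataclass
class Rollout:
    """One attempt at one task inside a run; owns exactly one trace."""
    id: str
    task_id: str
    trace: Trace = field(default_factory=Trace)
    status: str = "running"  # hypothetical status value
    output: Any = None


@dataclass
class Run:
    """One execution of a dataset; holds many rollouts."""
    id: str
    dataset: str
    rollouts: list[Rollout] = field(default_factory=list)

    def attempts_for(self, task_id: str) -> list[Rollout]:
        """All rollouts for one task, in order, for side-by-side comparison."""
        return [r for r in self.rollouts if r.task_id == task_id]


# Re-running the same task five times yields five comparable rollouts.
run = Run(id="run-47", dataset="paper-summary-v1")
for n in range(1, 6):
    run.rollouts.append(Rollout(id=f"run-47/attempt-{n}", task_id="task-1"))

attempts = run.attempts_for("task-1")
print([r.id for r in attempts])
```

The flatness is the point of the sketch: comparing attempts is a filter over one list, and each rollout carries its own trace and its own quotable ID.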

A deliberately small vocabulary. Workspace holds datasets, datasets hold tasks, runs execute them, and each attempt is a rollout with a single trace.

SDKs that fit your loop

The Python and TypeScript SDKs are thin. Your code stays in charge. You start a run, take the tasks Rollout hands you, and do whatever you want with them — call a model, drive a browser, run a sandboxed shell, dispatch to a multi-agent crew. When you are done with a task, you finish the rollout with an output or an error.

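# Start a run; workflow_name and group_id are optional grouping fields.
run = rollout.start_run(
    "paper-summary-v1",
    workflow_name="Paper QA",
    group_id="thread-42",
)

for task in run.tasks:
    # One rollout is one attempt at one task inside this run.
    attempt = run.start_rollout(task)
    files = attempt.materialize_files()
    try:
        attempt.message("Starting task", role="system")
        # answer_task is your own code: call a model, drive a browser, etc.
        result = answer_task(task.task.instruction, task.task.input, files)
        attempt.finish(output=result)
    except Exception as error:
        # Record the error, mark the rollout as failed, then re-raise.
        attempt.error(str(error))
        attempt.finish(status="failed", error=str(error))
        raise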

The same shape works for a production agent, a CI eval, and a training loop. The training loop just runs many rollouts in parallel and reads the traces back as data. The eval batch just scores the outputs. The production agent just runs once. None of them needs a different SDK.
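As a rough illustration of "many rollouts in parallel", the sketch below fans one per-task function out over a thread pool using only the standard library. It does not call the Rollout SDK: `run_attempt`, the dict-shaped tasks, and the `"completed"` status value are hypothetical stand-ins for the start/finish pattern shown above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_attempt(task: dict) -> dict:
    """Hypothetical stand-in for one rollout: do the work, then finish."""
    try:
        output = task["instruction"].upper()  # placeholder for the real agent
        return {"task_id": task["id"], "status": "completed", "output": output}
    except Exception as error:
        # Mirror the failure path: record the error and mark the attempt failed.
        return {"task_id": task["id"], "status": "failed", "error": str(error)}


tasks = [
    {"id": "t1", "instruction": "summarize section 2"},
    {"id": "t2", "instruction": "list the datasets used"},
    {"id": "t3", "instruction": None},  # malformed on purpose, so it fails
]

# A production agent would call run_attempt once; a training loop fans out.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_attempt, tasks))  # map preserves task order

for record in results:
    print(record["task_id"], record["status"])
```

The per-task function is identical in both settings; only the caller changes, which is the property the paragraph above describes.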

Two optional fields, workflow_name and group_id, do the work of stitching related rollouts together. A group_id can be a conversation ID, a job ID, or an eval batch ID. The same field carries production sessions and training batches, because they are the same kind of grouping.
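A small sketch of what stitching by `group_id` can look like on the reading side. The record shape here is an assumption for illustration (plain dicts, not an SDK response), and rollouts without a `group_id` are collected under `None` because the field is optional.

```python
from collections import defaultdict

# Hypothetical exported rollout records; only the grouping fields matter here.
rollouts = [
    {"id": "r1", "workflow_name": "Paper QA", "group_id": "thread-42"},
    {"id": "r2", "workflow_name": "Paper QA", "group_id": "thread-42"},
    {"id": "r3", "workflow_name": "Paper QA", "group_id": "eval-batch-7"},
    {"id": "r4", "workflow_name": "Paper QA"},  # no group_id set
]


def group_rollouts(records):
    """Map each group_id to the rollout IDs that share it."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get("group_id")].append(record["id"])
    return dict(groups)


grouped = group_rollouts(rollouts)
print(grouped)
```

Whether `thread-42` is a production conversation or `eval-batch-7` is an eval batch makes no difference to the grouping code.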

A trace UI that humans actually use

The web UI is built around reading. You sign in, pick a workspace, and land in Traces. Group by workflow, dataset, agent, or status. Open a run, pick a rollout, read the timeline. Messages, tool calls, tool results, errors, latency, token counts, and cost are all rendered in one place. There is a Gantt view for timing and a flat view for prose.

Datasets and files are first-class in the UI too. Create a dataset with structured inputs and attached files. Connect an S3-compatible bucket once and reuse files across tasks. Export a dataset to GitHub if you want changes reviewed in a pull request alongside code.

Datasets as a first-class artifact

Datasets in Rollout are not spreadsheets. They are versioned collections of tasks with rich inputs — instructions, structured fields, files, and optional output schemas. We use Harbor, a zipped on-disk format, so a dataset is something you can check into Git, upload to S3, or hand to a teammate without losing structure.

This matters because once a dataset is durable, the loop closes. A failed production rollout can be promoted into a regression task. A scored eval batch can become the next training set. A training rollout that looks suspicious can be opened in the UI and read like any other trace.
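Promoting a failed rollout into a regression task could look roughly like the sketch below. It is not an SDK function: `promote_to_regression_task`, the `source_rollout_id` and `tags` fields, and the dict shapes are assumptions made for illustration.

```python
def promote_to_regression_task(rollout: dict, task: dict) -> dict:
    """Build a new task from a failed rollout, keeping a pointer to its source."""
    if rollout["status"] != "failed":
        raise ValueError("only failed rollouts are promoted in this sketch")
    return {
        "instruction": task["instruction"],
        "input": task.get("input"),
        "files": list(task.get("files", [])),  # copy, so the new task owns its list
        "source_rollout_id": rollout["id"],    # hypothetical provenance field
        "tags": ["regression"],                # hypothetical label
    }


original_task = {
    "instruction": "Summarize the attached paper.",
    "input": {"max_words": 200},
    "files": ["paper.pdf"],
}
failed = {"id": "rollout-123", "status": "failed", "error": "timeout"}

regression_task = promote_to_regression_task(failed, original_task)
print(regression_task["source_rollout_id"])
```

Keeping the source rollout's ID on the new task means a reviewer can always get back from the regression case to the trace that motivated it.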

One stored rollout serves three readers. A failed production attempt can become a regression task; a scored eval can become a training batch.

What is next

We are working on a few things at once. The themes are consistent: more environments to run rollouts in, more ways to score them, and tighter hooks into training.

Environments. First-class sandboxed environments (terminal, browser, code) so a rollout can drive a real tool, not just a model. We expect this to use ephemeral microVMs.
Scorers. Pluggable evaluators that run after a rollout finishes. The same scorer can gate a deploy or produce a reward signal for training; the rollout does not need to know which.
Training hooks. A documented path from a set of scored rollouts to a training batch, including for GRPO and related multi-turn algorithms. The goal is that you never have to reshape your traces to train on them.
Comparison views. Side-by-side reading of multiple rollouts on the same task: the obvious UI for “why did attempt 2 succeed and attempt 1 fail.”
More SDKs and adapters. Adapters for popular agent frameworks so existing code can emit rollouts without a rewrite, plus SDKs in additional languages as demand shows up.
Self-hosting. The current setup already runs cleanly on local Postgres for development. We plan a clear self-hosted deployment path for teams that need to keep traces and datasets in their own infrastructure.
Agent for Rollout itself. A small agent that reads a workspace, proposes new tasks, flags regressions, and builds eval sets out of recent traces, an example of the kind of self-improving loop the rest of MV37 is aimed at.
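The Scorers and Training hooks items describe planned work, so there is no API to show yet. As a hedged sketch of the idea, the code below uses one scorer two ways: its mean gates a deploy, and its per-rollout values become rewards that are normalized within a group, the group-relative step commonly described for GRPO. All function names, the 0.8 threshold, and the exact-match scorer are assumptions for illustration.

```python
from statistics import mean, pstdev


def exact_match(output: str, expected: str) -> float:
    """Hypothetical scorer: 1.0 if the output matches after trimming, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def passes_gate(scores, threshold=0.8):
    """Deploy gate: the mean score must reach the threshold."""
    return mean(scores) >= threshold


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four attempts at the same task, scored by the same function.
outputs = ["42", "41", "42 ", "7"]
scores = [exact_match(o, "42") for o in outputs]

deploy_ok = passes_gate(scores)                  # use 1: gate a deploy
advantages = group_relative_advantages(scores)   # use 2: training signal

print(scores, deploy_ok)
print([round(a, 3) for a in advantages])
```

The scorer itself never learns which consumer it feeds; the gate and the advantage computation are separate functions layered on the same scores.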

The order will be shaped by what real users hit first. If you run agents and any of the above looks load-bearing for you, we want to hear about it before we build it.

Try Rollout

The repository is open at github.com/mv37-org/rollout. The docs cover the UI, the Python and TypeScript SDKs, and the CLI. A local stack runs against Postgres with a single make dev.

If you want to talk through a use case — production traces, RL training, evals, or something we have not thought of yet — email v@mv37.org.
