Experiment · 2026.05 · Ventali Tan

Paperbase: can better paper access make coding agents better?

Paperbase is our experiment in giving AI agents access to current research as searchable artifacts rather than opaque PDFs. The question is not whether an agent can mention papers. The question is whether better paper access changes what the agent can actually build, retrieve, and justify.

The current evidence is strongest on corpus completeness and fixed retrieval checks. The harder agent-lift study is now designed and ready to run, but it has not been run yet.


Introduction

We built Paperbase because modern coding agents are unusually strong at implementation and still weak at reaching for the right ideas. In fast-moving fields like AI research, model training, inference, and agent design, a model's knowledge cutoff is a real practical limitation.

You can see that limitation in the kinds of decisions models default to. Without strong access to recent papers, they fall back to familiar implementation patterns, generic heuristics, and whatever survives in base-model memory. They can talk about AI research, but that does not mean they can use current research well.

Our hypothesis is simple: if agents can search a current paper corpus, read the right sections, inspect the relevant figures and tables, and hold onto that evidence across a workflow, they should do better work.

The bottleneck

Most paper workflows are still shallow. An agent gets a title, an abstract, and a PDF link. Then it has to re-download the paper, parse it, find the relevant figure or section, and do the same work again on the next turn.

This breaks down badly once the answer is not written in one abstract sentence. Papers are made of sections, tables, captions, references, and raw assets. Multi-paper questions are even worse: the agent needs a stable way to narrow a corpus, inspect evidence, and compare across papers without rebuilding context from scratch every time.

Uploading more PDFs to a chat product does not solve that structure problem. The real bottleneck is that the agent does not have a good unit of access to the paper itself.

Paperbase is not just “better retrieval.” The shift is from raw paper blobs toward stable artifacts an agent can search and inspect directly.

What Paperbase changes

Paperbase changes the unit of access. Instead of treating a paper like a blob, it treats a paper like an artifact tree. Each paper is broken into stable pieces such as sections, figure captions, tables, previews, and other evidence that an agent can retrieve directly instead of reconstructing from scratch.
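
To make the artifact-tree idea concrete, here is a minimal sketch of what such a structure could look like. It is not the Paperbase schema: the class names, the `kind` values, and the ID format are all hypothetical, chosen only to illustrate stable, directly retrievable pieces.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Artifact:
    """One retrievable piece of a paper (hypothetical schema)."""
    artifact_id: str  # stable ID; the "paper/kind/n" format is an assumption
    kind: str         # e.g. "section", "figure_caption", "table", "preview"
    text: str


@dataclass
class Paper:
    """A paper as a collection of artifacts rather than one opaque blob."""
    paper_id: str
    title: str
    artifacts: List[Artifact] = field(default_factory=list)

    def by_kind(self, kind: str) -> List[Artifact]:
        """Return all artifacts of one kind, in document order."""
        return [a for a in self.artifacts if a.kind == kind]

    def get(self, artifact_id: str) -> Optional[Artifact]:
        """Look up one artifact by its stable ID, or None if absent."""
        for a in self.artifacts:
            if a.artifact_id == artifact_id:
                return a
        return None


# Illustrative data only; this is not a real paper in the corpus.
paper = Paper(
    paper_id="paper-0001",
    title="An example paper",
    artifacts=[
        Artifact("paper-0001/sec/1", "section", "Introduction ..."),
        Artifact("paper-0001/tab/1", "table", "Table 1: ablation results ..."),
        Artifact("paper-0001/fig/1", "figure_caption", "Figure 1: architecture overview"),
    ],
)

tables = paper.by_kind("table")
caption = paper.get("paper-0001/fig/1")
```

Because the IDs are stable, an agent in this sketch can return to `paper-0001/fig/1` on a later turn without re-downloading or re-parsing the paper.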

That matters because the same evidence layer can serve several workflows at once. An external coding agent can use it while a researcher works in a browser workspace, and both are ultimately reading from the same indexed research layer.

The product implication is simple: the system is not only a paper search box. It is an attempt to make current research usable enough, and structured enough, that it can stay inside an agent workflow from the first query to the final answer.

The important point is not the UI split. It is that the same research layer serves both external agent workflows and the hosted browser product.

Impact report

We evaluated Paperbase in four broad ways: how complete the indexed corpus is, how reliably retrieval works, whether agents appear to benefit when Paperbase is available, and how much the system costs in latency.

Snapshot

Field	Value
Papers in the evaluated snapshot	2925
Indexed papers	2905
Rich-source coverage	99.2%
Paper-level semantic coverage	99.9%
Preview coverage	96.5%
Deterministic retrieval checks	100% hit@1 on both fixed suites
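
As a quick check on how the first two rows relate, the sketch below computes the share of snapshot papers that are indexed. Only the two counts come from the table; the function name is illustrative. The percentage rows in the table use indexed papers as their denominator, so they are not recomputed here.

```python
def coverage_pct(covered: int, total: int) -> float:
    """Return covered/total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * covered / total


snapshot_total = 2925  # papers in the evaluated snapshot (from the table)
indexed = 2905         # indexed papers (from the table)

index_rate = coverage_pct(indexed, snapshot_total)
print(f"indexed: {indexed}/{snapshot_total} = {index_rate:.1f}%")  # indexed: 2905/2925 = 99.3%
print(f"not indexed: {snapshot_total - indexed}")                  # not indexed: 20
```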

The strongest claim the evidence supports today is that Paperbase looks very strong on corpus completeness and fixed retrieval checks, while representative real-world retrieval and end-to-end agent lift still need more study.

What we can say today

On the evaluated snapshot, 2905 of 2925 papers are indexed.
99.2% of indexed papers are rich-source papers rather than abstract-only placeholders.
99.9% of indexed papers have paper-level semantic coverage, and 96.5% have preview coverage.
The fixed retrieval checks both score 100% hit@1, hit@3, and hit@5.
Search performance is stable enough to be measured cleanly and interpreted without the earlier instrumentation noise.

The impact of that result is not merely that the system indexes a lot of papers. It is that the underlying evidence layer appears complete and stable enough to support more serious studies of agent behavior.
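
For readers unfamiliar with the hit@k numbers above, here is a minimal sketch of how such a metric is commonly computed. It assumes each check pairs one query with a single expected paper ID. The suite format and the IDs are hypothetical, and the toy cases are not Paperbase results.

```python
from typing import List, Sequence, Tuple


def hit_at_k(ranked_ids: Sequence[str], expected_id: str, k: int) -> bool:
    """True if the expected ID appears among the top-k ranked results."""
    return expected_id in ranked_ids[:k]


def hit_rate(cases: List[Tuple[Sequence[str], str]], k: int) -> float:
    """Fraction of cases whose expected ID is in the top k."""
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(hit_at_k(ranked, expected, k) for ranked, expected in cases)
    return hits / len(cases)


# Toy suite: (ranked result IDs, expected ID). Made-up IDs, not real data.
cases = [
    (["p1", "p2", "p3"], "p1"),                 # hit at rank 1
    (["p4", "p5", "p6"], "p5"),                 # hit at rank 2
    (["p7", "p8", "p9", "p10", "p11"], "p11"),  # hit at rank 5
    (["p12", "p13"], "p99"),                    # miss
]

for k in (1, 3, 5):
    print(f"hit@{k} = {hit_rate(cases, k):.2f}")
```

Hit@k can only stay the same or rise as k grows, so a suite that scores 100% at hit@1 necessarily scores 100% at hit@3 and hit@5 as well.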

What remains open

In this environment, the available query logs were synthetic, so we did not treat them as evidence for real-user retrieval quality.
We prepared blinded extraction-audit packets for human review, but those labels are not complete yet.
We also prepared stress-case audit packets for failure analysis, but those are likewise still awaiting human scoring.
Representative and stress-task A/B panels are ready, but the agent runs and scoring have not happened yet.
A repeated-run cost and latency study has been designed, but it is still a study scaffold rather than a measured result.

This page therefore does not claim representative real-user retrieval lift, end-to-end answer-quality lift, hallucination reduction, or grounded-citation lift. Those claims require completed studies, not only prepared ones.
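
Once the A/B panels are run and scored, a paired comparison is one straightforward way to summarize them. The sketch below assumes each task receives one numeric score with Paperbase on and one with it off. The function name and scoring scale are hypothetical, and the example scores are placeholders rather than measurements.

```python
from statistics import mean
from typing import Dict, Sequence


def paired_summary(scores_on: Sequence[float], scores_off: Sequence[float]) -> Dict[str, float]:
    """Summarize per-task differences between the two conditions."""
    if len(scores_on) != len(scores_off) or not scores_on:
        raise ValueError("need two non-empty score lists of equal length")
    # Positive difference means the task scored higher with Paperbase on.
    diffs = [on - off for on, off in zip(scores_on, scores_off)]
    return {
        "mean_diff": mean(diffs),
        "wins": sum(d > 0 for d in diffs),
        "losses": sum(d < 0 for d in diffs),
        "ties": sum(d == 0 for d in diffs),
    }


# Placeholder scores for four hypothetical tasks; NOT real results.
summary = paired_summary(scores_on=[3, 2, 5, 4], scores_off=[2, 2, 4, 5])
print(summary)
```

Pairing by task matters because task difficulty varies widely. Comparing per-task differences removes that variation from the comparison instead of averaging over it.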

How we measured it

The useful way to read the evaluation story is as a ladder with four layers. Some layers are scored now; others are serious study scaffolds waiting for labels, real user traffic, or comparative runs. That distinction is the difference between “we measured a win” and “we built a serious way to test the win.”

Study track	What we did	Current state
Corpus quality	Measured how complete, artifact-rich, and searchable the indexed paper set is.	Measured now
Search reliability	Ran fixed retrieval checks and set up representative retrieval analysis.	Fixed checks measured now; broader real-world analysis still blocked because the available query logs are synthetic
Agent-lift studies	Prepared representative and stress-task A/B panels to compare agents with Paperbase on versus off.	Study panels prepared; no executed or scored A/B runs yet
Performance	Profiled search hot paths and designed a repeated-run latency and cost study.	Hot paths measured now; end-to-end study designed but not yet run
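
A repeated-run latency study of the kind listed in the last row could be sketched as follows. This is an assumed design, not the actual study harness: `fake_search` is a stand-in for whatever call is being timed, and the warmup count, run count, and nearest-rank p95 are illustrative choices.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def time_runs(fn: Callable[[], object], runs: int = 20, warmup: int = 3) -> List[float]:
    """Call fn repeatedly and return per-run wall-clock latencies in ms."""
    for _ in range(warmup):  # warmup calls are executed but not recorded
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Report run count, median, and nearest-rank 95th percentile."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "runs": len(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
    }


def fake_search() -> int:
    """Stand-in workload; a real study would call the search path here."""
    return sum(range(10_000))


report = summarize(time_runs(fake_search, runs=20, warmup=3))
print(report)
```

Reporting the median alongside a tail percentile keeps a few slow runs from hiding behind an average, which is the main reason to repeat runs at all.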

Use Paperbase today

There are really three ways to think about Paperbase right now.

As a tool external coding agents can use when they need current research.
As a browser workspace that shows that the same indexed corpus can support research and coding in one place.
As an experiment: a concrete test of whether better paper access changes what agents can do.

If you want the shortest path, use the hosted product:

paperbase.mv37.org

The point of Paperbase is not that papers should be searchable in the abstract. The point is that agents should be able to use current research as working material rather than as decoration.

Today, the cleanest evidence is that the corpus layer is strong, the fixed retrieval checks are healthy, and the harder agent-lift question is finally set up as a concrete, runnable study. That is not the end of the experiment, but it is enough to make the experiment worth taking seriously.

We have open-sourced the Paperbase codebase at github.com/mv37-org/paperbase.
