Experiment · 2026.05
Paperbase: can better paper access make coding agents better?
Paperbase is our experiment in giving AI agents access to current research as searchable artifacts rather than opaque PDFs. The question is not whether an agent can mention papers. The question is whether better paper access changes what the agent can actually build, retrieve, and justify.
The current evidence is strongest on corpus completeness and fixed retrieval checks. The harder agent-lift study is now concrete enough to run rather than merely hypothesize about, but it has not been run yet.
Introduction
We built Paperbase because modern coding agents are unusually strong at implementation and still weak at reaching for the right ideas. In fast-moving fields like AI research, model training, inference, and agent design, knowledge cutoff is a real practical limitation.
You can see that limitation in the kinds of decisions models default to. Without strong access to recent papers, they fall back to familiar implementation patterns, generic heuristics, and whatever survives in base-model memory. They can talk about AI research, but that does not mean they can use current research well.
Our hypothesis is simple: if agents can search a current paper corpus, read the right sections, inspect the relevant figures and tables, and hold onto that evidence across a workflow, they should do better work.
The bottleneck
Most paper workflows are still shallow. An agent gets a title, an abstract, and a PDF link. Then it has to re-download the paper, parse it, find the relevant figure or section, and do the same work again on the next turn.
This breaks down badly once the answer is not written in one abstract sentence. Papers are made of sections, tables, captions, references, and raw assets. Multi-paper questions are even worse: the agent needs a stable way to narrow a corpus, inspect evidence, and compare across papers without rebuilding context from scratch every time.
Uploading more PDFs to a chat product does not solve that structure problem. The real bottleneck is that the agent does not have a good unit of access to the paper itself.
What Paperbase changes
Paperbase changes the unit of access. Instead of treating a paper like a blob, it treats a paper like an artifact tree. Each paper is broken into stable pieces such as sections, figure captions, tables, previews, and other evidence that an agent can retrieve directly instead of reconstructing from scratch.
That matters because the same evidence layer can serve several workflows at once. An external coding agent can use it while a researcher works in a browser workspace, and both are ultimately reading from the same indexed research layer.
The product implication is simple: the system is not only a paper search box. It is an attempt to make current research usable enough, and structured enough, that it can stay inside an agent workflow from the first query to the final answer.
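The artifact-tree idea described above can be sketched in a few lines. This is a minimal illustration, assuming a simple parent-child node type: the `Artifact` class, its field names, and the ID scheme (`paper-123/sec/2`) are hypothetical and are not Paperbase's actual schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Artifact:
    """One addressable piece of a paper (hypothetical schema)."""

    artifact_id: str  # stable ID an agent can hold onto across turns
    kind: str         # e.g. "paper", "section", "figure_caption", "table"
    text: str = ""
    children: list[Artifact] = field(default_factory=list)

    def walk(self) -> Iterator[Artifact]:
        """Yield this node and all of its descendants, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def find(self, artifact_id: str) -> Optional[Artifact]:
        """Return the node with the given ID, or None if it is absent."""
        for node in self.walk():
            if node.artifact_id == artifact_id:
                return node
        return None

    def by_kind(self, kind: str) -> list[Artifact]:
        """Return every node of one kind, e.g. all tables in a paper."""
        return [node for node in self.walk() if node.kind == kind]


# Placeholder paper, invented purely for illustration.
paper = Artifact("paper-123", "paper", "Example title", [
    Artifact("paper-123/sec/1", "section", "Introduction ..."),
    Artifact("paper-123/sec/2", "section", "Method ...", [
        Artifact("paper-123/fig/1/caption", "figure_caption", "Figure 1: ..."),
        Artifact("paper-123/tab/1", "table", "model | score"),
    ]),
])

table = paper.find("paper-123/tab/1")
sections = paper.by_kind("section")
```

The design point is the stable ID: an agent that found `paper-123/tab/1` on one turn can ask for the same piece again later without re-downloading or re-parsing the paper.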
Impact report
We evaluated Paperbase in four broad ways: how complete the indexed corpus is, how reliably retrieval works, whether agents appear to benefit when Paperbase is available, and how much the system costs in latency.
Snapshot
| Field | Value |
|---|---|
| Papers in the evaluated snapshot | 2925 |
| Indexed papers | 2905 |
| Rich-source coverage | 99.2% |
| Paper-level semantic coverage | 99.9% |
| Preview coverage | 96.5% |
| Deterministic retrieval checks | 100% hit@1 on both fixed suites |
The strongest claim we can accurately make today is that Paperbase looks very strong on corpus completeness and fixed retrieval checks, while representative real-world retrieval and end-to-end agent lift still need more study.
What we can say today
- On the evaluated snapshot, 2905 of 2925 papers are indexed. 99.2% of indexed papers are rich-source papers rather than abstract-only placeholders. 99.9% of indexed papers have paper-level semantic coverage, and 96.5% have preview coverage.
- The fixed retrieval checks both score 100% hit@1, hit@3, and hit@5.
- Search performance is stable enough to be measured cleanly and interpreted without the earlier instrumentation noise.
The impact of that result is not merely that the system indexes a lot of papers. It is that the underlying evidence layer appears complete and stable enough to support more serious studies of agent behavior.
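For readers unfamiliar with the metric, hit@k asks whether at least one relevant item appears in the top k results for a query, averaged over a suite. The sketch below shows one common way to compute it. The function names, the shape of the suite, and the toy data are assumptions for illustration and do not reflect the actual Paperbase check suites.

```python
from typing import Callable, Sequence


def hit_at_k(ranked_ids: Sequence[str], relevant_ids: set, k: int) -> bool:
    """True if any relevant ID appears among the first k ranked results."""
    return any(doc_id in relevant_ids for doc_id in ranked_ids[:k])


def mean_hit_at_k(
    suite: Sequence[tuple],
    search: Callable[[str], Sequence[str]],
    k: int,
) -> float:
    """Fraction of (query, relevant_ids) cases with a hit in the top k."""
    if not suite:
        raise ValueError("suite must contain at least one case")
    hits = sum(hit_at_k(search(query), relevant, k) for query, relevant in suite)
    return hits / len(suite)


# Toy data, invented for illustration only (not real queries or results).
toy_results = {
    "q1": ["a", "x", "y"],  # relevant item ranked first
    "q2": ["x", "b", "y"],  # relevant item ranked second
}
toy_suite = [("q1", {"a"}), ("q2", {"b"})]


def toy_search(query: str) -> Sequence[str]:
    """Stand-in for a real search call; returns a fixed ranking."""
    return toy_results[query]


hit1 = mean_hit_at_k(toy_suite, toy_search, 1)
hit3 = mean_hit_at_k(toy_suite, toy_search, 3)
```

Because the top 1 result is always contained in the top 3 and top 5, a suite that scores 100% on hit@1 necessarily scores 100% on hit@3 and hit@5 as well.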
What remains open
- In this environment, the available query logs were synthetic, so we did not treat them as evidence for real-user retrieval quality.
- We prepared blinded extraction-audit packets for human review, but those labels are not complete yet.
- We also prepared stress-case audit packets for failure analysis, but those are likewise still awaiting human scoring.
- Representative and stress-task A/B panels are ready, but the agent runs and scoring have not happened yet.
- A repeated-run cost and latency study has been designed, but it is still a study scaffold rather than a measured result.
We therefore do not yet claim representative real-user retrieval lift, end-to-end answer-quality lift, hallucination reduction, or grounded-citation lift. Those claims require completed studies, not only prepared ones.
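To make the pending A/B comparison concrete, here is a rough sketch of how paired scores could be summarized once the runs exist. It assumes each task receives one numeric score with Paperbase off and one with it on. The function name, the scoring scale, and the placeholder numbers are all hypothetical, and none of the values below are measured results.

```python
from statistics import mean
from typing import Sequence


def paired_lift(scores_off: Sequence[float], scores_on: Sequence[float]) -> dict:
    """Summarize per-task differences (on minus off) for a paired A/B panel."""
    if len(scores_off) != len(scores_on) or not scores_off:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [on - off for off, on in zip(scores_off, scores_on)]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    return {
        "mean_lift": mean(diffs),
        "wins": wins,
        "losses": losses,
        "ties": len(diffs) - wins - losses,
    }


# Placeholder scores on an assumed 0-5 rubric; NOT measured data.
placeholder_off = [2, 3, 1, 4]
placeholder_on = [3, 3, 2, 3]

summary = paired_lift(placeholder_off, placeholder_on)
```

Pairing by task matters because task difficulty varies widely. Comparing each task against itself keeps that variation out of the lift estimate.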
How we measured it
The useful way to read the evaluation story is as a ladder. Some layers are scored now. Some layers are only ready to be run. That distinction is the difference between “we measured a win” and “we built a serious way to test the win.”
| Study track | What we did | Current state |
|---|---|---|
| Corpus quality | Measured how complete, artifact-rich, and searchable the indexed paper set is. | Measured now |
| Search reliability | Ran fixed retrieval checks and set up representative retrieval analysis. | Fixed checks measured now; broader real-world analysis still blocked because the available query logs are synthetic |
| Agent-lift studies | Prepared representative and stress-task A/B panels to compare agents with Paperbase on versus off. | Study panels prepared; no executed or scored A/B runs yet |
| Performance | Profiled search hot paths and designed a repeated-run latency and cost study. | Hot paths measured now; end-to-end study designed but not yet run |
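A repeated-run latency study of the kind listed in the last row could be scaffolded roughly as follows. This is a sketch under stated assumptions: `time_repeated` and `summarize` are hypothetical helper names, the warmup and run counts are arbitrary defaults, and the nearest-rank p95 is one of several reasonable percentile definitions.

```python
import time
from statistics import median
from typing import Callable, Sequence


def time_repeated(fn: Callable[[], object], runs: int = 20, warmup: int = 2) -> list:
    """Call fn repeatedly and return per-call wall-clock latencies in ms."""
    for _ in range(warmup):
        fn()  # warmup calls are executed but not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms: Sequence[float]) -> dict:
    """Report run count, median, and nearest-rank 95th percentile."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: rank = ceil(0.95 * n), in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "runs": len(ordered),
        "p50_ms": median(ordered),
        "p95_ms": ordered[rank - 1],
    }


# Stand-in workload; a real study would call the search path under test.
samples = time_repeated(lambda: sum(range(1000)), runs=5, warmup=1)
report = summarize(samples)
```

Reporting a median alongside a tail percentile, rather than a single mean, keeps one slow outlier from hiding or dominating the typical case.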
Use Paperbase today
There are really three ways to think about Paperbase right now.
- As a tool external coding agents can use when they need current research.
- As a browser workspace that shows the same indexed corpus can support research and coding in one place.
- As an experiment: a concrete test of whether better paper access changes what agents can do.
If you want the shortest path, use the hosted product:
Sharing
The point of Paperbase is not that papers should be searchable in the abstract. The point is that agents should be able to use current research as working material rather than as decoration.
Today, the cleanest evidence is that the corpus layer is strong, the fixed retrieval checks are healthy, and the harder agent-lift question is finally set up as a study that can be run and reported. That is not the end of the experiment, but it is enough to make the experiment worth taking seriously.