
Let the tokens flow. Sail powers efficient, reliable deep research

Sail inference tops BrowseComp-Plus benchmark

Kavin Anand, Toby Liang · Apr 22, 2026 · 7 min read

Deep Research Agents
Abundant Inference
Novel Inference Approach
An Efficient Agent Swarm
Appendix

Sail’s agentic AI researcher topped BrowseComp-Plus with 90.72% accuracy at 6-35x lower cost with a novel, efficient inference engine.

Deep Research Agents

Deep research is a transformative application of modern AI. We can now ask agents to search through hundreds or thousands of documents, looking for the finest needles in the biggest haystacks. It’s a task that rewards tenacity and scale over raw intelligence. Accuracy matters most. The goal is a single, authoritative answer. So how do we get there?

Abundant Inference

Sail’s inference service is designed for maximum efficiency, so that our customers can build agents at ambitious scale, with great economics. The current token market is dominated by human-in-the-loop systems where inference is throttled by demands for lower and lower latency. Agents are far more patient than people. We're at an inflection point where agentic workflows that solve real problems are becoming possible, and where increasingly capable open-source models are in the hands of more AI-native people.

Background work will be won by the platform that makes long, token-heavy, unattended agent trajectories efficient and reliable to run.

We focused on winning BrowseComp-Plus, a deep research evaluation benchmark that is built on top of OpenAI's BrowseComp.

Novel Inference Approach

Standard single-agent + retriever systems fail. The canonical approach to BrowseComp-Plus has been to select the most powerful frontier model available and to invest immense resources in developing the strongest possible retriever, so that only high-signal documents are surfaced. We believe the true bottleneck is bulk parallel processing of immense quantities of data. Research is a question of compute. Fundamentally, the answer lives in the corpus: even without a beautifully constructed retriever, brute-force searching the entire corpus should yield the correct answer, assuming infinite context. On its own, however, brute force is untenable: inference costs are high and agent performance shows diminishing returns.


To solve this, two things need to happen:

Persistence: The main agent cannot be overwhelmed by context.
1. A performant retriever passes a limited set of high-signal documents to the main agent. Reading every document in the corpus is infeasible. A SoTA custom retriever is not required, just one capable enough to help a multi-turn agent cut through noise.
Cost: Feasibility relies on cheap compute.
1. A better infrastructure layer treats background agents as first-class citizens.
2. Open models lower per-token costs, pushing performance to near-frontier levels with increased throughput.

Sail leverages a better inference engine to achieve state-of-the-art performance. Using GLM-5.1 and GPT-OSS-120B together with simple retrievers (Qwen3-Embed-8B/BM25), Sail achieved 90.72% accuracy and 84.31% recall. Our inference product is built for exactly that: open models at production scale, so deep-research-style agent stacks stay reliable and economical to run.

Token breakdown

Component	Uncached Input	Cached Input	Output	Total Tokens
Orchestrator: GLM-5.1	11,133,388	24,572,544	3,259,993	38,965,925
Swarm: gpt-oss-120b	6,076,321,295	—	359,481,027	6,435,802,322
Total	6,087,454,683	24,572,544	362,741,020	6,474,768,247

Agent swarm sneak peek: the vast majority of tokens are consumed by the more efficient swarm agents. This allows us to increase k, the number of documents returned per search.
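To make the split concrete, the short sketch below recomputes each component's total and its share of all tokens from the figures in the table. It assumes the "—" in the swarm's cached-input column means zero cached tokens; the function and dictionary names are illustrative, not part of any Sail API.

```python
# Token counts copied from the token breakdown table.
# Assumption: the "—" for the swarm's cached input is treated as 0.
TOKENS = {
    "orchestrator": {
        "uncached_input": 11_133_388,
        "cached_input": 24_572_544,
        "output": 3_259_993,
    },
    "swarm": {
        "uncached_input": 6_076_321_295,
        "cached_input": 0,
        "output": 359_481_027,
    },
}


def total_tokens(component: str) -> int:
    """Sum input (cached and uncached) and output tokens for one component."""
    return sum(TOKENS[component].values())


def share_of_total(component: str) -> float:
    """Fraction of all tokens in the run consumed by one component."""
    grand_total = sum(total_tokens(name) for name in TOKENS)
    return total_tokens(component) / grand_total


if __name__ == "__main__":
    for name in TOKENS:
        print(f"{name}: {total_tokens(name):,} tokens ({share_of_total(name):.2%})")
```

Under that assumption, the swarm accounts for a little over 99% of all tokens, which is the imbalance the architecture is designed around.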

The previous best open-source solution reaches just 68% accuracy over the 830 queries, and that is with an agent trained specifically for open research tasks and a deep research–specific embedding model. Truly off-the-shelf implementations reach just 57% accuracy with GPT-OSS-120B-high as the main LLM and Qwen3-Embed-8B as the retriever. Closed solutions, on the other hand, were able to reach 90.48% with GPT-5 and proprietary retrievers.

We have bridged this gap.

As seen from the number of tokens consumed, Sail makes this kind of trajectory economical and first-class by routing the bulk work to efficient, reliable open-model workers, allowing breathing room for the main agent to persist through difficult tasks.

Provider cost comparison

Provider	Orchestrator	Swarm	Orchestrator $	Swarm $	Total $	$/query	vs Sail
Sail	GLM-5.1	gpt-oss-120b	$12	$116	$129	$0.15	1×
OpenRouter	GLM-5.1	gpt-oss-120b	$33	$772	$805	$0.97	6×
Baseten	GLM-5	gpt-oss-120b	$26	$787	$813	$0.98	6×
Fireworks	GLM-5.1	gpt-oss-120b	$36	$1,127	$1,163	$1.40	9×
Together AI	GLM-5.1	gpt-oss-120b	$64	$1,127	$1,191	$1.44	9×
OpenAI	GPT-5.4	GPT-5.4 Nano	$83	$1,665	$1,748	$2.11	14×
Z.AI	GLM-5.1	GLM-4.7	$36	$4,437	$4,473	$5.39	35×

[1, 2]: Pricing sources and model substitutions are detailed in the Appendix.
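The last two columns of the table can be approximately reproduced from the Total $ column and the 830 benchmark queries. The sketch below shows that arithmetic. Because the published totals are rounded to whole dollars, recomputed per-query figures can differ from the table by about a cent. The names used here are illustrative.

```python
QUERIES = 830  # number of benchmark queries cited in this article

# "Total $" column of the provider cost comparison (rounded to whole dollars).
TOTAL_COST_USD = {
    "Sail": 129,
    "OpenRouter": 805,
    "Baseten": 813,
    "Fireworks": 1_163,
    "Together AI": 1_191,
    "OpenAI": 1_748,
    "Z.AI": 4_473,
}


def cost_per_query(provider: str) -> float:
    """Total run cost divided evenly across the benchmark queries."""
    return TOTAL_COST_USD[provider] / QUERIES


def multiple_vs_sail(provider: str) -> float:
    """How many times more the full run costs than the Sail run."""
    return TOTAL_COST_USD[provider] / TOTAL_COST_USD["Sail"]


if __name__ == "__main__":
    for name in TOTAL_COST_USD:
        print(f"{name}: ${cost_per_query(name):.2f}/query, "
              f"{multiple_vs_sail(name):.1f}x vs Sail")
```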

An Efficient Agent Swarm

The Sail architecture is simple.

An orchestrator agent proposes a search given the research query.
The retriever returns k matching documents.
A swarm of cheaper reader agents reads the truncated documents in parallel. Each reader:
1. Drops the document if it is irrelevant, or
2. Summarizes the evidence in the document OR extends the document.
The orchestrator sees the compacted evidence, then either re-searches or submits an answer.
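The steps above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not Sail's implementation: the `Document` type, the three callables (`orchestrate`, `retrieve`, `read`), and the parameters `max_chars` and `max_turns` are all hypothetical, the readers are assumed to receive the research question, and the "extend document" option is omitted for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Document:
    doc_id: str
    text: str


# Hypothetical interfaces, named for illustration only.
# Orchestrator returns ("search", query) or ("answer", final_answer).
Orchestrator = Callable[[str, list[str]], tuple[str, str]]
Retriever = Callable[[str, int], list[Document]]
# Reader returns an evidence summary, or None to drop the document.
Reader = Callable[[str, Document], Optional[str]]


def research(question: str, orchestrate: Orchestrator, retrieve: Retriever,
             read: Reader, k: int = 20, max_chars: int = 4_000,
             max_turns: int = 10) -> Optional[str]:
    """Run the search / read / compact loop until the orchestrator answers."""
    evidence: list[str] = []  # the orchestrator only ever sees these summaries
    for _ in range(max_turns):
        action, payload = orchestrate(question, evidence)
        if action == "answer":
            return payload
        # Otherwise the payload is a search query for the retriever.
        docs = retrieve(payload, k)
        truncated = [Document(d.doc_id, d.text[:max_chars]) for d in docs]
        # Each reader call is independent, so documents are read in parallel.
        with ThreadPoolExecutor(max_workers=max(1, len(truncated))) as pool:
            notes = list(pool.map(lambda d: read(question, d), truncated))
        # Dropped documents (None) never reach the orchestrator's context.
        evidence.extend(note for note in notes if note is not None)
    return None  # turn budget exhausted without an answer


if __name__ == "__main__":
    corpus = [Document("a", "The capital of France is Paris."),
              Document("b", "Bananas are yellow.")]

    def toy_orchestrate(q: str, ev: list[str]) -> tuple[str, str]:
        return ("answer", "Paris") if ev else ("search", "capital of France")

    def toy_read(q: str, d: Document) -> Optional[str]:
        return d.text if "France" in d.text else None

    print(research("What is the capital of France?", toy_orchestrate,
                   lambda query, k: corpus[:k], toy_read, k=2))
```

In this sketch the full document text is visible only to the readers, so increasing k grows the number of reader calls rather than the orchestrator's context.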

There are several key optimizations here. First, the orchestrator never reads any document; it relies on signal from the swarm agents. This keeps the orchestrator's context window from blowing up and relieves pressure on the retriever: if the retriever fails to find relevant documents, the swarm agents protect the orchestrator from bad signal that could lead the overall trajectory astray. The swarm agents start with fresh context for each query and are cheaper to run, so there is far less risk in letting the retriever be more liberal with its search results. This also helps with abstruse questions, since the retriever can return more peripheral documents that may lead to the answer.

Second, the vast majority of tokens are consumed by the lightweight swarm reader agents. The readers do the heavy lifting of ingesting k documents per turn and summarizing the evidence when a document is relevant. This makes it possible to use a more expensive, frontier-class model as the orchestrator, which handles the heavy reasoning, search query edits, and final answer submission, and a far cheaper model as the per-document swarm agent. As the number of documents k increases, the economics become increasingly favorable.
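As a rough illustration of that price gap, the sketch below derives a blended cost per million tokens for each role from the Sail row of the cost table and the totals in the token table. The result is only approximate: the dollar figures are rounded to whole dollars, and a blended rate mixes cached input, uncached input, and output tokens, which are typically priced differently. The names are illustrative.

```python
# Sail row of the provider cost comparison (whole-dollar, rounded figures).
SAIL_COST_USD = {"orchestrator": 12, "swarm": 116}

# "Total Tokens" column of the token breakdown table.
TOTAL_TOKENS = {"orchestrator": 38_965_925, "swarm": 6_435_802_322}


def blended_usd_per_million(role: str) -> float:
    """Approximate blended price: total dollars over total tokens, per 1M."""
    return SAIL_COST_USD[role] / (TOTAL_TOKENS[role] / 1_000_000)


if __name__ == "__main__":
    for role in SAIL_COST_USD:
        print(f"{role}: ~${blended_usd_per_million(role):.3f} per 1M tokens")
    ratio = (blended_usd_per_million("orchestrator")
             / blended_usd_per_million("swarm"))
    print(f"orchestrator tokens cost roughly {ratio:.0f}x swarm tokens (blended)")
```

On these rounded figures, each additional document read by the swarm is billed at a blended rate more than an order of magnitude below the orchestrator's.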

The lion's share of tokens will soon be consumed by background agents. We've built the inference infrastructure to unlock this value. Try building on our platform to unlock your agentic workflow.

Appendix

[1] Sail Research: GLM-5.1 and gpt-oss-120b pricing. OpenRouter: GLM-5.1 pricing, gpt-oss-120b pricing. Baseten: GLM-5 and gpt-oss-120b pricing. Fireworks: GLM-5.1 pricing, gpt-oss-120b pricing. Together AI: GLM-5.1 pricing, gpt-oss-120b pricing. OpenAI: GPT-5.4 and GPT-5.4 Nano pricing. Z.AI: GLM-5.1 and GLM-4.7 pricing.
[2] At time of writing, Baseten does not support GLM-5.1, so GLM-5 prices were used instead. Comparable models to GLM-5.1 and gpt-oss-120b were selected for OpenAI and Z.AI. Performance isn't guaranteed to be the same on these models; they were used to compare prices of similarly capable models.
