Building a Multi-Agent Research Pipeline — Benchmarked on DRB

By Vache Sarkissian · 11 min read · Updated June 3, 2026 · Reviewed March 29, 2026
Tags: research, multi-agent, automation, benchmarks

Written by Claude (Opus 4.6). Vache prompted, reviewed, and published. The data and benchmarks are real; the prose is AI-generated.

[Figure: RACE benchmark summary. RACE score only, n=50 tasks, higher is better, 0.5 = tied with the reference system. Dimensions: Comprehensiveness, Insight/Depth, Instruction Following, Readability. Scores: This Pipeline (6-wave, multi-agent, skeptic review) 0.5166; Gemini Deep Research (Google) 0.4888; NVIDIA AIQ (NVIDIA) 0.4759; Claude-Researcher (single-agent baseline) 0.4200. Relative gain: +5.7% vs Gemini, +8.6% vs NVIDIA AIQ, +23.0% vs Claude-Researcher. Pipeline stages: 1 Pre-flight Search, 2 Parallel Research, 3 Quality Gate, 4 Skeptic Review, 5 Refinement, 6 Final Report. Per topic: ~10,000 words, ~100 URLs, 3-4 min, $0.80-1.20/session.]

I wanted automated research that produces reports I'd actually read. Not the "summarize the first page of Google results" kind — structured, verified, skeptic-reviewed reports with real citations and quantified confidence levels.

The result is a 6-wave multi-agent pipeline that scores 0.5166 on RACE (n=50 tasks, full benchmark run). On RACE, 0.5 represents a tie with the reference system, so 0.5166 means marginally better than the reference and ahead of Gemini Deep Research (0.4888) and NVIDIA AIQ (0.4759) on this metric. Citation quality is a known weak point; more on that below. The pipeline runs 3x/week via heartbeat, generates ~10,000-word reports with ~100 cited URLs, and finishes in 3-4 minutes per topic.

Here's how it works.

The Goal

Most "AI research" tools are glorified search wrappers. You ask a question, an LLM searches the web, and you get a summary that sounds confident but has no verification layer. The hallucinations are invisible because there's no second opinion.

I needed something different. My vault is a knowledge engine — over 500 atomic Zettelkasten notes across CS, economics, psychology, finance, and more. Research tasks come from a topic-planner that identifies gaps in coverage. The pipeline needs to fill those gaps with publication-quality output: accurate claims, cited sources, identified uncertainties, and a clear distinction between established fact and emerging consensus.

The key insight: quality comes from adversarial review, not from using a bigger model. A skeptic that challenges every claim does more for accuracy than doubling your context window.

Architecture: 6 Waves

The pipeline runs as a Claude agent orchestrator (.claude/commands/research-session.md) with three specialized sub-agents: researcher, verifier, and skeptic. Each wave has a specific job and a clear handoff to the next.

Wave 1: Pre-flight

Before hitting any API, check what we already know. The vault has a semantic search engine (vault-search.py) that fuses BM25 lexical search with embedding-based retrieval using reciprocal rank fusion. A typical pre-flight query returns in ~560ms and surfaces existing notes, prior research reports, and relevant context.

python3 _scripts/vault-search.py "transformer attention mechanisms" \
  --rerank --intent "machine learning" --top 10

This isn't just a nicety — it prevents the pipeline from re-researching topics we've already covered, and it gives the research agents grounding context so they don't start from zero. The --intent flag steers HyDE expansion to disambiguate terms ("attention" means something different in psychology vs. ML).
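To make the fusion step concrete, here is a minimal sketch of reciprocal rank fusion over two ranked lists. It is not the code in vault-search.py: the function name `rrf_fuse`, the note IDs, and the constant `k=60` (a commonly used default) are all assumptions for illustration.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical note IDs returned by each retriever.
bm25_hits = ["attention-is-all-you-need", "self-attention", "rnn-basics"]
embedding_hits = ["self-attention", "multi-head-attention", "attention-is-all-you-need"]

fused = rrf_fuse([bm25_hits, embedding_hits])
print(fused)
```

Notes that rank well in both lists rise to the top, while a note that only one retriever found still makes the fused list. That is the usual reason to fuse lexical and embedding results rather than pick one.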

If pre-flight finds substantial existing coverage, the pipeline shifts to a "deepen and update" mode instead of "research from scratch." This alone saves about 30% of API spend.

Wave 2: Parallel Research

Multiple Sonnet agents fan out simultaneously, each with a different search angle on the topic. One might focus on academic papers (via arXiv API integration), another on industry developments, a third on contrarian viewpoints. They search, read, and synthesize independently.

The parallelism matters. A single agent doing sequential searches takes 8-12 minutes. Three agents in parallel finish in 3-4 minutes with better coverage because they're exploring different corners of the search space.

Each agent produces a structured intermediate report: key findings, source URLs, confidence levels per claim, and identified gaps. These aren't final outputs — they're raw material for the next wave.
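The fan-out and the shape of an intermediate report can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the dataclass fields mirror the four items listed above, while `research_angle`, `run_wave2`, and the placeholder finding are hypothetical stand-ins for real agent calls.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    source_urls: list[str]
    confidence: str  # "HIGH", "MEDIUM", or "LOW"

@dataclass
class IntermediateReport:
    angle: str
    findings: list[Claim] = field(default_factory=list)
    gaps: list[str] = field(default_factory=list)

async def research_angle(topic: str, angle: str) -> IntermediateReport:
    """Stand-in for one research agent; a real agent would search and read here."""
    await asyncio.sleep(0.01)  # simulates network-bound work
    claim = Claim(
        text=f"Placeholder finding about {topic} from the {angle} angle",
        source_urls=["https://example.com/source"],
        confidence="LOW",
    )
    return IntermediateReport(angle=angle, findings=[claim], gaps=["needs primary source"])

async def run_wave2(topic: str, angles: list[str]) -> list[IntermediateReport]:
    # gather() runs the agents concurrently and preserves input order.
    return await asyncio.gather(*(research_angle(topic, a) for a in angles))

reports = asyncio.run(run_wave2("federated learning", ["academic", "industry", "contrarian"]))
for r in reports:
    print(r.angle, len(r.findings), r.gaps)
```

Because the agents are I/O-bound, running them concurrently makes wall-clock time track the slowest agent rather than the sum of all three.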

Wave 3: Quality Gate

This is where Opus earns its 5x cost premium. A quality gate agent reviews all intermediate reports for:

  • Accuracy: Do the claims match what the sources actually say? Cross-reference between agents — if Agent A and Agent B cite the same source but draw different conclusions, flag it.
  • Coverage: Are there obvious gaps? If we're researching "federated learning" and nobody mentioned differential privacy, that's a miss.
  • Citation quality: Are sources authoritative? A blog post and a peer-reviewed paper don't carry the same weight. The gate assigns source tiers.
  • Internal consistency: Do the agents' findings contradict each other? Contradictions aren't automatically bad (the topic might genuinely be contested), but they need to be surfaced explicitly.

Reports that fail the quality gate get sent back to Wave 2 with specific instructions on what to fix. In practice, about 20% of first-pass reports get bounced. The most common failure mode is low citation quality — agents find the right answer but cite a Medium article instead of the primary source.
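A rough sketch of the citation-quality check and the bounce decision might look like this. The tier table, the `max_tier` threshold, and the function names are assumptions; the post does not list the gate's actual tiers, and the real gate is an Opus agent exercising judgment rather than a lookup table.

```python
from urllib.parse import urlparse

# Assumed tiering for illustration only.
TIER_BY_DOMAIN = {
    "arxiv.org": 1,    # papers and preprints
    "nature.com": 1,
    "github.com": 2,   # primary artifacts, not peer reviewed
    "medium.com": 3,   # blog platforms
}
DEFAULT_TIER = 3

def source_tier(url: str) -> int:
    """Map a URL to a source tier by its host name (1 is strongest)."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return TIER_BY_DOMAIN.get(host, DEFAULT_TIER)

def gate(claims: dict[str, list[str]], max_tier: int = 2) -> list[str]:
    """Return feedback for claims whose best source is weaker than max_tier.

    An empty list means the report passes; otherwise the feedback goes back
    to the research wave as fix instructions.
    """
    feedback = []
    for claim, urls in claims.items():
        best = min((source_tier(u) for u in urls), default=DEFAULT_TIER + 1)
        if best > max_tier:
            feedback.append(f"Find a primary source for: {claim}")
    return feedback

# Placeholder claims and hypothetical URLs.
report = {
    "Claim A (placeholder)": ["https://arxiv.org/abs/0000.00000"],
    "Claim B (placeholder)": ["https://medium.com/example-post"],
}
feedback = gate(report)
print(feedback)
```

Here Claim B is bounced because its only source is a blog post, which matches the most common failure mode described above.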

Wave 4: Synthesis + Skeptic

The synthesis step merges all approved intermediate reports into a single coherent narrative. Deduplication, conflict resolution, and structural organization happen here. This produces a draft report.

Then the skeptic agent attacks it.

The skeptic (.claude/agents/research-skeptic.md) is deliberately adversarial. Its job is to:

  • Challenge every causal claim: "You say X causes Y. What's the actual evidence? Could it be correlation?"
  • Identify missing counterarguments: "You present the benefits of approach A but don't mention the well-documented failure modes."
  • Question recency: "This study is from 2021. Has it been replicated? Has the landscape changed?"
  • Flag overconfidence: "You state this as fact, but your sources are two blog posts and a preprint."

The skeptic doesn't just flag problems — it produces specific objections with suggested fixes. This structured output feeds directly into Wave 5.

Wave 5: Refinement

Address every skeptic concern. This wave does targeted additional research to fill the gaps the skeptic identified. If the skeptic said "you need a primary source for this claim," the refinement agent goes and finds one. If it said "you're missing the counterargument," the agent researches and presents it.

Not every skeptic concern gets resolved — some are genuine limitations of available evidence. In those cases, the refinement step adds explicit uncertainty markers: "Evidence is limited to N studies" or "This claim is contested by [researchers] who argue..."

I'd rather have a report that says "we don't know" than one that sounds certain about something uncertain.
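The handoff from skeptic to refinement can be pictured as a small data structure plus a resolve-or-mark step. The `Objection` fields and the `refine` function below are hypothetical; they only illustrate the idea that an unresolved objection becomes an explicit uncertainty marker instead of being dropped.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Objection:
    claim: str
    kind: str           # e.g. "causal", "counterargument", "recency", "overconfidence"
    suggested_fix: str

def refine(objection: Objection, new_evidence: Optional[str]) -> str:
    """Resolve an objection with evidence, or mark the claim as uncertain."""
    if new_evidence:
        return f"{objection.claim} (supported by {new_evidence})"
    # No evidence found: keep the claim but state the limitation explicitly.
    return f"{objection.claim} [uncertain: {objection.kind} objection unresolved]"

objection = Objection(
    claim="X causes Y",
    kind="causal",
    suggested_fix="Cite a controlled study or soften to correlation",
)
resolved = refine(objection, "a hypothetical primary source")
unresolved = refine(objection, None)
print(resolved)
print(unresolved)
```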

Wave 6: Final Report

The finished report follows a consistent structure:

## Executive Summary
(3-4 paragraphs, standalone summary)
 
## Key Findings
(Numbered, with confidence levels)
 
## Detailed Analysis
(Full narrative with inline citations)
 
## Methodology Notes
(What was searched, what was excluded, known limitations)
 
## Citations
(Full URL list with access dates and source tier)

Output averages ~10,000 words with ~100 unique URLs cited. Every factual claim traces back to at least one source. Confidence levels are explicit: HIGH (multiple authoritative sources agree), MEDIUM (limited but credible evidence), LOW (single source or contested).
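One way to read the three confidence levels as a rule is sketched below. The tier numbering (1 = authoritative) and the `contested` flag are assumptions made for the example; in the pipeline the levels come from agent judgment, not a fixed function.

```python
def confidence_level(source_tiers: list[int], contested: bool = False) -> str:
    """Map a claim's sources to HIGH, MEDIUM, or LOW.

    source_tiers holds one tier per supporting source, with 1 assumed to
    mean authoritative. A single source or a contested claim is LOW.
    """
    if contested or len(source_tiers) <= 1:
        return "LOW"
    authoritative = sum(1 for tier in source_tiers if tier == 1)
    if authoritative >= 2:
        return "HIGH"
    return "MEDIUM"

print(confidence_level([1, 1]))                  # two authoritative sources
print(confidence_level([1, 3]))                  # limited but credible
print(confidence_level([1]))                     # single source
print(confidence_level([1, 1], contested=True))  # contested
```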

RACE Benchmark Results

RACE (Reference-based Adaptive Criteria-driven Evaluation), the report-quality half of DRB, measures four dimensions:

  • Comprehensiveness (0-1): Does the output cover the important aspects of the topic?
  • Insight/Depth (0-1): Does it go beyond surface-level summaries to meaningful analysis?
  • Instruction Following (0-1): Does it address what was actually asked?
  • Readability (0-1): Is it clear, well-structured, and usable?

Important calibration: RACE uses 0.5 as the reference baseline — a score of 0.5 means tied with the reference system, not a middling result. Scores above 0.5 beat the reference; below 0.5 lose to it.

Our full 50-task run results:

| Dimension | Score |
|---|---|
| Comprehensiveness | 0.5209 |
| Insight/Depth | 0.5255 |
| Instruction Following | 0.5075 |
| Readability | 0.4981 |
| Overall (n=50) | 0.5166 |

| System | RACE Score | Notes |
|---|---|---|
| This pipeline | 0.5166 | 6-wave, multi-agent, skeptic review (n=50) |
| Gemini Deep Research | 0.4888 | Strong on comprehensiveness, weaker on depth |
| NVIDIA AIQ | 0.4759 | Good accuracy, lower completeness |
| Claude-Researcher | 0.4200 | Single-agent, no verification layer |

The gap over Claude-Researcher (same underlying model family) is the most telling: +0.0966. That's not a model capability difference — it's a pipeline architecture difference. The skeptic wave and quality gate are doing the heavy lifting. Same models, better orchestration, meaningfully better output.
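The absolute and relative gaps follow directly from the reported scores. The snippet below only redoes that arithmetic; the variable names are illustrative.

```python
scores = {
    "This pipeline": 0.5166,
    "Gemini Deep Research": 0.4888,
    "NVIDIA AIQ": 0.4759,
    "Claude-Researcher": 0.4200,
}

ours = scores["This pipeline"]
gaps = {}
for name, score in scores.items():
    if name == "This pipeline":
        continue
    absolute = ours - score              # difference in RACE points
    relative = absolute / score * 100    # percent gain over that system
    gaps[name] = (absolute, relative)
    print(f"{name}: +{absolute:.4f} absolute, +{relative:.1f}% relative")
```

For Claude-Researcher this gives +0.0966 absolute, which is a 23.0% relative gain.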

Gemini's comprehensiveness scores are strong — it's good at casting a wide net. But it lags on insight because it doesn't have a verification and adversarial-review layer that forces deeper analysis. NVIDIA AIQ is similar: solid on individual dimensions, but misses topics that a multi-angle parallel search would catch.

DRB leaderboard context: DRB (DeepResearch Bench) combines RACE with FACT (citation quality scores). The leaderboard reports RACE on a 0-100 scale, so 0.5166 here corresponds to ~51.66 there. On the combined leaderboard, the top systems score in the 54-56 range. Our pipeline scores ~51.66 on RACE alone, and when FACT is folded in, we fall below 5th place due to citation quality gaps. That's the honest picture.

Citation Quality (FACT Scores)

FACT measures citation validity — whether the URLs actually support the claims they're cited for. This is where the pipeline has a real weakness.

| System | Citation validity | Citations per task | Valid citations per task |
|---|---|---|---|
| Claude-Researcher (reference) | 87.3% | 28 | 24.5 |
| This pipeline (V2) | 57.8% | 162 | 93.6 |

The trade-off is breadth vs. precision. The pipeline produces 5.8x more citations per task than the reference system, but at a significantly lower validity rate. Many of those citations are real and valid, but the "write-then-cite" behavior in some waves means sources get attached after synthesis rather than grounding it.

The root issue: synthesis agents sometimes generate claims and then find supporting citations retroactively, rather than building claims from verified sources forward. Reversing this — requiring citations before claims rather than after — is in progress, but it requires restructuring Wave 4 substantially.

The breadth advantage is real: 93 valid citations per task vs. 24 for the reference is a meaningful difference in coverage. But the 57.8% validity rate means roughly 4 in 10 citations need scrutiny. For a vault research system where I'm the downstream consumer, that's acceptable; for anything published as-is, it's a problem worth fixing.
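The breadth and validity figures can be rederived from the per-task counts. This is a check of the arithmetic only. The dictionary layout is an assumption, and the small difference for the reference system (87.5% derived vs. 87.3% reported) may come from averaging per-task ratios rather than dividing averages.

```python
systems = {
    "Claude-Researcher (reference)": {"cited": 28, "valid": 24.5, "reported": 87.3},
    "This pipeline (V2)": {"cited": 162, "valid": 93.6, "reported": 57.8},
}

derived = {}
for name, s in systems.items():
    derived[name] = s["valid"] / s["cited"] * 100  # percent of citations that are valid
    print(f"{name}: derived {derived[name]:.1f}%, reported {s['reported']}%")

ref = systems["Claude-Researcher (reference)"]
ours = systems["This pipeline (V2)"]
breadth_ratio = ours["cited"] / ref["cited"]   # how many more citations per task
needs_scrutiny = 100 - ours["reported"]        # share of citations that fail validation
print(f"breadth: {breadth_ratio:.1f}x, needs scrutiny: {needs_scrutiny:.1f}%")
```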

Model Assignments

Cost discipline matters when you're running research 3x/week indefinitely. The model assignments are deliberate:

  • Sonnet 4.6 for Wave 2 (parallel research): This is volume work — searching, reading, synthesizing. Sonnet handles it well at 1/5 the cost of Opus. Running 3 parallel Sonnet agents costs less than 1 Opus agent doing the same work sequentially.
  • Opus 4.6 for Wave 3 (quality gate) and Wave 4 (synthesis): These are judgment calls. Is this source authoritative? Are these findings contradictory or complementary? Opus's edge over Sonnet on nuanced reasoning is real and worth paying for at the chokepoint.
  • Local models (qwen3:8b, qwen2.5-coder:14b) for pre-flight search and post-processing: Zero API cost. The semantic search, HyDE expansion, and re-ranking all run on a local RX 9070 XT at 85 tok/s. No reason to pay for these operations.

Total cost per research session runs about $0.80-1.20 depending on topic complexity and how many reports get bounced by the quality gate. At 3x/week (about 13 sessions a month), that's roughly $10-16/month for continuous research output. That is cheaper than a single journal subscription, and the content is more targeted.
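The monthly figure is simple arithmetic on the per-session range. The only assumption in the sketch below is the conversion of 52 weeks over 12 months; the constant names are illustrative.

```python
SESSIONS_PER_WEEK = 3
COST_LOW, COST_HIGH = 0.80, 1.20  # dollars per session, from the range above

# 3 sessions/week * 52 weeks / 12 months = 13 sessions per month on average.
sessions_per_month = SESSIONS_PER_WEEK * 52 / 12

monthly_low = sessions_per_month * COST_LOW
monthly_high = sessions_per_month * COST_HIGH
print(f"{sessions_per_month:.0f} sessions/month: ${monthly_low:.2f} to ${monthly_high:.2f}")
```

The result is consistent with the rough monthly figure quoted above; bounced reports push a given month toward the upper end.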

Continuous via Heartbeat

The pipeline doesn't wait for me to run it. It's integrated into the vault's heartbeat scheduler:

### research-session
- Schedule: Mon,Wed,Fri 10:00
- Model: mixed (sonnet + opus + local)
- Script: research-session.sh
- Timeout: 600
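How a scheduler might interpret the `Schedule` line can be sketched in a few lines. The heartbeat scheduler's real parsing is not shown in this post, so `parse_schedule` and `is_due` are hypothetical, and the format is assumed to be comma-separated day abbreviations followed by a 24-hour time.

```python
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def parse_schedule(spec: str):
    """Split 'Mon,Wed,Fri 10:00' into weekday numbers plus hour and minute."""
    days_part, time_part = spec.split()
    days = {DAY_NAMES.index(day) for day in days_part.split(",")}
    hour, minute = (int(part) for part in time_part.split(":"))
    return days, hour, minute

def is_due(spec: str, now: datetime) -> bool:
    """True when 'now' falls on a scheduled weekday at the scheduled minute."""
    days, hour, minute = parse_schedule(spec)
    # datetime.weekday() uses Monday=0, matching DAY_NAMES.
    return now.weekday() in days and (now.hour, now.minute) == (hour, minute)

spec = "Mon,Wed,Fri 10:00"
print(is_due(spec, datetime(2026, 6, 3, 10, 0)))  # a Wednesday at 10:00
print(is_due(spec, datetime(2026, 6, 4, 10, 0)))  # a Thursday at 10:00
```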

Topic selection is automated too. Every Friday, a topic-planner task (running on Opus) reviews the knowledge graph, identifies coverage gaps, and queues research topics for the following week. A daily knowledge-analyst task can also inject urgent topics when it spots significant developments in monitored domains.

Results flow into the knowledge engine: finished reports get chunked into atomic Zettelkasten notes, embedded for semantic search, and cross-linked to existing notes via the reweave system. The vault gets smarter every week without me doing anything.

The full feedback loop: topic-planner identifies gaps on Friday, research-session fills them Mon/Wed/Fri, knowledge-analyst evaluates the new content daily, and the cycle repeats. It's a closed loop of continuous learning.

Lessons

The skeptic wave was the single biggest quality improvement. Before adding it, reports had a confident tone that masked real uncertainty. The skeptic forces the pipeline to distinguish between "this is well-established" and "this is one study from 2023 that hasn't been replicated." That distinction is everything.

Multi-wave beats single-shot by a wide margin. The RACE score difference between this pipeline (0.5166) and single-agent Claude-Researcher (0.4200) is almost entirely explained by the quality gate and skeptic review. Same models, same search tools, fundamentally different output quality.

Quality gates catch what single-agent systems can't. The most dangerous failure mode in AI research isn't making things up from nothing — it's subtly misrepresenting what a source says. A single agent reads a paper, extracts what it thinks is the key finding, and states it with confidence. A quality gate that cross-references the claim against the source catches the misrepresentation. This happens in roughly 1 in 5 reports.

Parallel search with diverse angles produces better coverage than sequential depth. Three agents each spending 60 seconds from different angles find more than one agent spending 180 seconds going deeper on the first angle it thinks of. Exploration breadth beats exploitation depth for research tasks.

Pre-flight against existing knowledge is critical. Without it, the pipeline re-researches topics the vault already covers well. With it, the pipeline builds on existing knowledge instead of duplicating it. This also grounds the research agents — they start with context instead of a blank slate, which reduces hallucination rates measurably.

The pipeline isn't perfect. A RACE score of 0.5166 means marginally better than the reference — not dominant. The citation validity gap (57.8% vs. 87.3% for the reference) is real and actively being addressed. There's a long tail of edge cases where the system produces plausible-sounding claims that don't hold up under manual review. But the architecture is sound, the autonomous operation works, and the vault gets meaningfully better coverage every week.

The code lives in .claude/agents/research-team.md, .claude/agents/research-verifier.md, .claude/agents/research-skeptic.md, and .claude/commands/research-session.md. The vault-search pre-flight helper is at _scripts/research-vault-context.sh. All of it is orchestrated by the same heartbeat system that runs 16 other automated tasks on a local GPU at zero cost.

About the Author

Vache Sarkissian

Building research infrastructure and products at the intersection of knowledge systems and machine learning. Creator of Linesheet Pro, vault-search, and the vachsark learning engine.
