Agentic Deep Research is a Search & Infra problem, not an AI one
Building a Deep Research Agent from Scratch: Lessons in Cost, Quality, and Architecture

Hi there! 👋 I'm Aditya Chaturvedi, a passionate software engineer who loves solving problems, building creative solutions, and sharing knowledge with others. I started this blog as a way to document my journey in technology, explore exciting ideas, and connect with like-minded individuals. Whether it's coding tips, mathematics, AI, or musings on the ever-evolving tech landscape, you'll find it all here.
As I write this, the world has been using Gemini Deep Research for a while now. I find it fascinating as a product, and in what it promises to accomplish, but I often find myself underwhelmed by the results it delivers for my queries. Being an AI Engineer, I am pretty familiar with the complexities of such a system, but I wanted to explore them first-hand. So, I set out to build MiniDeepResearch, the not-so-intelligent twin brother of Gemini Deep Research if you will, that can answer questions about life, the universe, and everything else, with more than just 42.
What I discovered is that building a genuine "Deep Research" report isn't just about LLM performance anymore; it's about the friction of the data pipeline. Shallow LLM wrappers break down because they treat the web as a reliable, structured source, failing to account for the inherent noise of search results, the massive bloat of modern web pages, and the looming threat of a combinatorial explosion in tokens.
How it works: The Core Components
Let's start by looking at the complete data flow for MiniDeepResearch and its various components.
The Planner (The "Brain"): When a query comes in, the Planner uses a high-reasoning model to decompose it into a set of focused, logical sub-questions. It uses a MECE (Mutually Exclusive, Collectively Exhaustive) strategy to ensure we cover the maximum ground without redundant work. While MECE sounds fancy, it is really just a bit of smart prompt engineering to steer the agent's behavior.
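To make this concrete, here is a minimal sketch of what the planning step could look like, assuming a Pydantic schema for the plan and a stand-in `call_llm` helper for whichever client wraps the high-reasoning model. The prompt wording is illustrative, not the exact prompt MiniDeepResearch uses.

```python
from pydantic import BaseModel, Field

class SubQuestion(BaseModel):
    question: str
    rationale: str = Field(description="Why this slice of the topic is needed")

class ResearchPlan(BaseModel):
    sub_questions: list[SubQuestion]

# Illustrative prompt; the real prompt is a bit more involved.
PLANNER_PROMPT = """You are a research planner. Decompose the user query into
3-6 sub-questions that are Mutually Exclusive and Collectively Exhaustive (MECE):
no two sub-questions should overlap, and together they should cover the query completely.

Query: {query}

Return JSON matching this schema: {schema}"""

def plan(query: str, call_llm) -> ResearchPlan:
    # call_llm is a stand-in for whatever client wraps the high-reasoning
    # model (e.g. gemini-2.5-pro) and returns the raw JSON string.
    raw = call_llm(PLANNER_PROMPT.format(
        query=query, schema=ResearchPlan.model_json_schema()))
    return ResearchPlan.model_validate_json(raw)
```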
The Searcher & Fetcher (The "Hands"): For each sub-question, the agent hits a search API like Tavily to find high-quality URLs. It then concurrently fetches the raw content of these pages using aiohttp and trafilatura, stripping away the HTML noise to leave behind clean, searchable text.
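A minimal sketch of that concurrent fetch-and-clean step, using aiohttp for the downloads and trafilatura to strip each page down to its main text; the timeout and error handling here are illustrative assumptions.

```python
import asyncio
import aiohttp
import trafilatura

async def fetch_clean(session: aiohttp.ClientSession, url: str) -> str | None:
    # Download raw HTML; tolerate slow or broken pages instead of crashing the run.
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
            if resp.status != 200:
                return None
            html = await resp.text()
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return None
    # trafilatura strips boilerplate (nav, ads, footers) and keeps the article body.
    return trafilatura.extract(html)

async def fetch_all(urls: list[str]) -> dict[str, str]:
    # Fetch every URL for a sub-question concurrently and drop the failures.
    async with aiohttp.ClientSession() as session:
        texts = await asyncio.gather(*(fetch_clean(session, u) for u in urls))
    return {u: t for u, t in zip(urls, texts) if t}
```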
The Extractor (The "Filter"): Instead of asking for a summary, the Extractor uses strict JSON schemas to pull out specific, factual findings. Each finding must include a claim, the evidence from the text, and a confidence score. Grounding the model in the specific search query keeps the research flow aligned with the higher-level goal and prevents distractions.
The Gap Analyzer (The "Critic"): After the first round of extraction, the agent doesn't just stop. The Gap Analyzer reviews the current findings against the original sub-question. It identifies exactly what is missing or conflicting and generates new, highly targeted follow-up queries to fill those specific voids.
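A rough sketch of how the gap-analysis step could be wired, with a hypothetical GapAnalysis schema and an illustrative prompt; the field names and the `call_llm` helper are assumptions, reused from the planner sketch above.

```python
from pydantic import BaseModel

class GapAnalysis(BaseModel):
    missing: list[str]             # aspects of the sub-question not yet covered
    conflicts: list[str]           # findings that contradict each other
    follow_up_queries: list[str]   # narrowly scoped searches to fill the gaps

GAP_PROMPT = """Sub-question: {sub_question}

Findings so far:
{findings}

Identify what is still missing or conflicting, and propose at most
{max_queries} narrowly scoped follow-up search queries."""

def analyze_gaps(sub_question: str, findings: list[str], call_llm,
                 max_queries: int = 3) -> GapAnalysis:
    # call_llm is the same stand-in LLM helper as in the planner sketch.
    raw = call_llm(GAP_PROMPT.format(
        sub_question=sub_question,
        findings="\n".join(f"- {f}" for f in findings),
        max_queries=max_queries))
    return GapAnalysis.model_validate_json(raw)
```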
The Synthesizer (The "Writer"): Finally, once the sub-questions are satisfied or the budget is exhausted, the Synthesizer compiles all deduplicated findings into a professional Markdown report, complete with inline citations and a full bibliography.
Key Challenges and Insights
Building an autonomous researcher revealed four systemic challenges that a standard LLM pipeline simply isn't equipped to handle.
1. The Search API Quality Gap
By far, I feel search quality is the biggest limiting factor for building a useful Deep Research Agent. Search APIs are optimized for human click-through rates, SEO, and general intent. They are not optimized for factual density or specific research contexts. A search for "Nvidia Q4 earnings" might return ten news articles that all summarize the same three bullet points and completely fail to provide explicit data on the Q4 earnings themselves. The results may range anywhere from "What analysts expect from Nvidia's Q4 earnings" to "Why you should buy Nvidia". If your agent treats these as ten unique sources, you're just paying to process the same data repeatedly. The "best" results for a human aren't always the "best" for an extraction agent.
2. The RAG Context Pollution Problem
Web pages are massive, and 90% of their content is "junk" in the context of answering your query. Ingesting full page content into an LLM to find insights is expensive and yields poor results, especially when using fast models like gemini-2.5-flash. After burning through 1M tokens in just a few hours of dev time, I implemented a brute-force approach that trims pages down to the top N words. This is clearly a sub-optimal approach and smarter solutions can be applied. Crucially, it is also important to have a cheap strategy for scoring page relevance before plumbing the content into the Extraction agent.
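Here is a rough sketch of both ideas: the brute-force word cap, plus one possible cheap relevance pre-filter based on simple keyword overlap. The word budget and threshold are illustrative assumptions, not the values MiniDeepResearch actually uses.

```python
def truncate_words(text: str, max_words: int = 1200) -> str:
    # Brute-force trim: keep only the first N words of the cleaned page.
    return " ".join(text.split()[:max_words])

def cheap_relevance(text: str, query: str) -> float:
    # Keyword overlap: what fraction of the (non-trivial) query terms
    # actually appear somewhere in the page?
    query_terms = {w.lower() for w in query.split() if len(w) > 3}
    if not query_terms:
        return 0.0
    page_terms = set(text.lower().split())
    return len(query_terms & page_terms) / len(query_terms)

# Only pages that clear the bar get an LLM extraction call.
RELEVANCE_THRESHOLD = 0.3
```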
3. The Search Explosion Problem
This is where costs spiral out of control. If you decompose a query into 5 sub-questions, each sub-question triggers 5 search results, and each of those triggers a gap analysis with 3 follow-up queries... you are suddenly looking at a combinatorial explosion of LLM calls. Without strict budget management, you'll exhaust your token quota or make every research task too expensive; with a strict budget, your agent may fail to collect enough data to answer the user's query. Even SOTA agents like Gemini Deep Research face the same dilemma.
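A minimal sketch of the kind of budget bookkeeping that keeps the explosion in check; the caps below are illustrative, not the real limits.

```python
from dataclasses import dataclass

@dataclass
class ResearchBudget:
    # Hard caps checked before every expensive step; numbers are illustrative.
    max_llm_calls: int = 60
    max_search_calls: int = 25
    max_follow_up_rounds: int = 2
    llm_calls: int = 0
    search_calls: int = 0

    def can_spend(self, llm: int = 0, search: int = 0) -> bool:
        # Would this step push us past either cap?
        return (self.llm_calls + llm <= self.max_llm_calls
                and self.search_calls + search <= self.max_search_calls)

    def spend(self, llm: int = 0, search: int = 0) -> None:
        self.llm_calls += llm
        self.search_calls += search
```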
4. Redundancy & Duplicate Processing
In an iterative loop, the same page is often fetched multiple times for different sub-queries. The same authoritative source (e.g., a specific SEC filing) might appear in three different searches. If your agent fetches, parses, and extracts findings from that same URL multiple times, you're leaking money and slowing down the entire process.
Things that worked well
Multi-Model Orchestration
To solve the cost/quality trade-off, I implemented a "routing" strategy. I use the cheaper gemini-2.5-flash for the bulk text extraction—where the task is simple: "find the facts." I reserve the smarter, more expensive gemini-2.5-pro for high-level planning, gap analysis, and final synthesis, where complex reasoning is mandatory.
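In code, this routing can be as simple as a lookup table keyed by stage; the stage names below are assumptions about how the tasks might be labeled, not the exact identifiers in MiniDeepResearch.

```python
# Cheap model for high-volume extraction, expensive model for the few
# calls that need real reasoning.
MODEL_ROUTES = {
    "plan": "gemini-2.5-pro",
    "gap_analysis": "gemini-2.5-pro",
    "synthesize": "gemini-2.5-pro",
    "extract": "gemini-2.5-flash",
}

def model_for(task: str) -> str:
    # Default to the cheap model for anything unclassified.
    return MODEL_ROUTES.get(task, "gemini-2.5-flash")
```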
Targeted Extraction over Summarization
I force the LLM into strict structured mode using Pydantic schemas. It cannot return a summary; it must return a list of Finding objects containing a claim and the exact evidence. This ensures the model stays focused on the specific "needle" it was sent to find and returns a condensed response.
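A sketch of what the Finding schema might look like; fields beyond the claim, evidence, and confidence described above (such as source_url) are assumptions added for illustration.

```python
from pydantic import BaseModel, Field

class Finding(BaseModel):
    claim: str = Field(description="A single factual statement")
    evidence: str = Field(description="Verbatim text from the page supporting the claim")
    confidence: float = Field(ge=0.0, le=1.0)
    source_url: str  # assumption: carried along so the synthesizer can cite it

class ExtractionResult(BaseModel):
    # The extractor must return this list, never free-form prose.
    findings: list[Finding]
```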
Visited URL Tracking & Jaccard Deduplication
To patch the efficiency leak, the orchestrator maintains a visited_urls set. No page is ever fetched twice. Furthermore, I implemented a Jaccard similarity check that compares new findings against the global state. If a new finding overlaps significantly with an existing one, it's discarded before it ever hits the final synthesizer. We could also use an LLM to do this, but I did not get a pay hike this year, so I wanted to keep costs low.
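A sketch of the visited-URL set and the token-level Jaccard check; the 0.7 similarity threshold is an illustrative assumption.

```python
def jaccard(a: str, b: str) -> float:
    # Token-level Jaccard similarity between two claims.
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

class GlobalState:
    def __init__(self, dedup_threshold: float = 0.7):
        self.visited_urls: set[str] = set()
        self.claims: list[str] = []
        self.dedup_threshold = dedup_threshold

    def should_fetch(self, url: str) -> bool:
        # Never fetch or parse the same page twice across sub-questions.
        if url in self.visited_urls:
            return False
        self.visited_urls.add(url)
        return True

    def add_claim(self, claim: str) -> bool:
        # Drop claims that overlap heavily with something already collected.
        if any(jaccard(claim, seen) >= self.dedup_threshold for seen in self.claims):
            return False
        self.claims.append(claim)
        return True
```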
The Path Forward: High-Level Ideas
While MiniDeepResearch is functional, true "Deep Research" needs more sophisticated strategies to improve quality while reducing costs.
Semantic Chunking & Pre-filtering: Instead of aggressive truncation, agents can use embeddings to identify the specific paragraphs relevant to a sub-query before the LLM ever sees them, or even discard irrelevant pages after a quick metadata check.
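A minimal sketch of what that pre-filtering could look like, using sentence-transformers as an example embedding model; this is a hypothetical extension, not something MiniDeepResearch does today.

```python
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # small, cheap embedding model

def relevant_paragraphs(page_text: str, sub_query: str, top_k: int = 5) -> list[str]:
    # Split the cleaned page into paragraphs and keep only the top_k most
    # similar to the sub-query, so the LLM never sees the rest.
    paragraphs = [p for p in page_text.split("\n\n") if len(p.split()) > 20]
    if not paragraphs:
        return []
    para_emb = _model.encode(paragraphs, convert_to_tensor=True)
    query_emb = _model.encode(sub_query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, para_emb)[0]
    ranked = sorted(zip(paragraphs, scores.tolist()), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]
```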
Multi-Agent Debate: If two sources provide conflicting data, a specialized "Auditor" agent should be triggered to find a tie-breaker or flag the discrepancy in the report.
Persistent Semantic Caching: Extracted facts and page relevance should be stored in a local vector DB so the agent "remembers" the foundational data across multiple research tasks.
Sufficient Context: Build a strategy to evaluate if the Agent already has sufficient information to answer the user's query at any point and work backwards from there. This is an active area of research for my colleague and friend Cyrus. Go check him out.
Check out the code on GitHub: [mini-deep-research-ai]



