Article

Deep-Research Eval: An Automated Framework for Assessing Quality and Reliability in Long-Form Reports

1 School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China
2 School of History, Nanjing University, Nanjing 210023, China
3 School of Integrated Circuits, Sun Yat-sen University, Shenzhen 510275, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(5), 2546; https://doi.org/10.3390/app16052546
Submission received: 13 February 2026 / Revised: 25 February 2026 / Accepted: 27 February 2026 / Published: 6 March 2026
(This article belongs to the Special Issue Construction of Knowledge System Based on Natural Language Processing)

Abstract

Deep Research Agents (DRAs) generate detailed literature surveys but often suffer from hallucinations and inconsistent structures. Existing evaluation methods face significant limitations. Human evaluation is time-consuming and requires domain expertise. Meanwhile, current LLM judges struggle with long reports due to context limits and the inability to verify source reliability. To address this, we propose Deep-Research Eval. This framework standardizes the page as the basic unit for evaluation. It features an adaptive scoring system that assesses the logical quality of each page. Furthermore, it employs Paged-RAG with a constructible reference database to verify facts against specific evidence. Experiments on five agents show that our method effectively identifies errors. It achieves a strong correlation with human judgment, reaching a Composite Consistency Index (CCI) of 0.7585, an absolute increase of 0.4588 over baselines. Additionally, the Paged-RAG module improves factual verification accuracy, increasing the QA-F1 score by up to 6.9 times compared to standard retrieval methods. This work offers a scalable and practical approach for assessing AI-generated academic content.

1. Introduction

Large language models (LLMs) are now essential tools for automated knowledge production. They are widely used for literature reviews, market analysis, and technical reporting. Recently, a new class of systems called Deep Research Agents (DRAs) has emerged [1]. These agents build upon LLMs by adding modules for planning, retrieval, and iterative reasoning. This structure allows them to handle complex, multi-step research tasks. Notable examples include OpenAI’s Deep Research [2], Google Gemini Deep Research [3], and Perplexity [4]. These systems can efficiently generate reports comparable to those of human analysts. As shown in Figure 1, DRAs break down high-level queries into actionable sub-tasks to create comprehensive outputs. This workflow reduces research costs, but the complexity of generating long-form content creates significant reliability problems.
The main difficulty lies in the opacity and length of the generated content. DRA reports often span dozens of pages and combine information from many different sources. A small error in retrieval or reasoning can spread across multiple sections. This propagation leads to structural incoherence and subtle hallucinations. Unlike errors in short summaries, these mistakes are often hidden within dense text. Consequently, they are hard to find and verify.
Current evaluation methods do not solve these problems effectively. Manual checking is slow and requires expensive domain experts [5]. Automated methods also struggle with granularity. Traditional metrics like ROUGE rely heavily on surface-level lexical overlap. They fail to capture semantic logic or detect reasoning errors. Furthermore, standard “LLM-as-a-judge” approaches lack empirical reliability for long documents. They consistently suffer from “attention dilution” and the well-documented “lost-in-the-middle” phenomenon when processing extensive context [6,7]. Consequently, they frequently fail to detect subtle contradictions or verify cross-sectional factual consistency.
These limitations reveal fundamental gaps in granularity and verifiability within existing paradigms. To address these gaps, our work introduces a conceptual shift in the assessment of AI-generated long-form content: we redefine the basic unit of assessment, no longer treating a generated report as a monolithic entity but evaluating it through a fine-grained, functional page-level framework. This framework operates within a controlled verification environment. To support this environment, we introduce a reliability verification mechanism based on external retrieval, which dynamically constructs task-specific evidence pools from a Configurable Reference Database. This allows the synthesis quality of the agent to be systematically evaluated against a defined and retrievable knowledge base.
Based on this conceptual foundation, we propose Deep-Research Eval, an automated framework designed for deep research reports. It features two core components. First, it introduces an adaptive scoring system that executes the precise page-level paradigm: by applying specific criteria to each page, we detect local failures that global scores often miss. Second, it employs a Paged-RAG mechanism alongside a Configurable Reference Database, aligning generated content with specific evidence to ensure factual accuracy. In summary, the main contributions of this work are as follows:
  • Page-Level Evaluation Paradigm: We propose a novel methodology that shifts from monolithic document scoring to fine-grained page-level assessment. By decomposing reports into functional units, we effectively detect local failures that global metrics often mask.
  • Paged-RAG with Configurable Reference: We introduce a controlled verification environment that uses a Paged Retrieval-Augmented Generation mechanism to align content with a dynamic evidence pool, establishing a clear boundary for factual verification.
  • Deep-Research Eval Framework: We present an automated and scalable framework designed for the unique challenges of Deep Research Agents, addressing the critical gaps in granularity and verifiability found in existing evaluation methods.

2. Related Work

The evaluation of Deep Research Agents (DRAs) relates to several research fields, including agentic large language model systems, retrieval-augmented generation, and automated text evaluation. This section first explains DRAs as a research concept. It then reviews recent progress in RAG and long-form text evaluation. The discussion focuses on their limitations when applied to deep research reports.

2.1. Deep Research Agents

Deep Research Agents (DRAs) are LLM-based systems that aim to automate complex research workflows [1]. These systems differ from common search-enabled LLMs. They rely on multi-step planning and iterative reasoning instead of single-pass generation. A DRA breaks a high-level query into smaller tasks. It retrieves information from the web. It then combines evidence from multiple sources to form a final report.
Several systems represent this line of work. Examples include OpenAI’s Deep Research [2], Google Gemini Deep Research [3], Perplexity Deep Research [4], and Qwen-Deep-Research [8]. These systems target long-form outputs that resemble reports written by human analysts. Recent benchmarks describe DRAs as tools for PhD-level research tasks [5].
Prior studies also report clear weaknesses. DRAs often fail to maintain structural coherence in long documents. They also suffer from hallucination cascades. In such cases, early mistakes influence later reasoning steps [9]. These problems show why evaluation methods must locate errors at a fine level of detail.

2.2. Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) grounds LLM outputs in external knowledge sources [10,11]. Most RAG systems first retrieve relevant documents. The model then generates text based on the retrieved content. This design is common in question answering and research-oriented text generation.
Early RAG methods use dense vector similarity to retrieve the top-k text chunks. This approach often adds irrelevant content when the context is long. It also suffers from the lost-in-the-middle problem [7]. Another limitation is the lack of document structure awareness. This weakness reduces performance in complex report generation.
Hybrid RAG methods combine sparse retrieval, dense retrieval, and re-ranking [12,13]. Graph-augmented RAG adds explicit relations between entities and documents [14]. These methods improve retrieval precision. At the same time, they increase system complexity and computational cost.
Chunk size is still a core design issue in RAG. Small chunks improve retrieval accuracy but remove necessary context. Large chunks keep context but weaken localization. Paged-RAG addresses this trade-off by using page-level units. This design matches the structure of deep research reports. It also supports detailed fact checking without complex graph construction.

2.3. Long-Form Text Evaluation

The evaluation of LLM outputs has changed as generation tasks have become longer and more structured. Metrics based on lexical overlap, such as ROUGE and BLEU, are no longer sufficient. They fail to measure reasoning quality and factual consistency [15,16]. They also ignore evidence alignment and source attribution.
Recent work proposes semantic and pipeline-aware evaluation methods for RAG systems. RAGAS [17] defines metrics such as faithfulness, relevance, and context precision. Other tools, including TruLens and RAGelo [18], separate retrieval quality from generation quality. These approaches perform well on short or medium-length outputs.
Long-form generation introduces new challenges for evaluation. LLM-as-a-judge methods [19] offer scalable qualitative assessment. Yet earlier studies show clear limits. Judges often overlook local errors in long documents because attention is spread too thinly [6].
Benchmarks such as DeepResearch Bench [5] and DeepResearch Bench II [9] address this issue. They introduce PhD-level tasks and multi-dimensional evaluation rubrics [20,21]. These benchmarks improve task coverage. Still, they rely on document-level scoring. They provide little information about errors within specific sections.

3. Methodology

As shown in Figure 2, Deep-Research Eval provides a multidimensional and fine-grained evaluation framework for long-form reports generated by Deep Research Agents (DRAs). The framework focuses on both content quality and citation reliability. It does not treat a report as a single undivided unit. Instead, it evaluates the report through a structured and page-aware process.
At a high level, Deep-Research Eval divides a candidate report into thematically coherent pages. It then evaluates each page along two dimensions. One dimension is content quality, and the other is source reliability. The framework finally aggregates the results into interpretable quantitative and qualitative scores. Deep-Research Eval consists of three core components. These components are Structural Alignment, a Dynamic Content Quality Assessment module (CQA), and a Source Reliability Assessment module (SRA) based on the proposed Paged-RAG verification paradigm.

3.1. Structural Alignment

The evaluation pipeline starts with structure recognition and page extraction. This step defines a consistent evaluation unit for reports with different writing styles and layouts. Given a candidate report R produced by a DRA and a templated report T that serves as a reference structure, Deep-Research Eval applies an LLM-based page restructuring module to align the two reports at the functional level. The templated report T is a manually defined gold standard, and one consistent template is used per research task.
The restructuring module does not rely on surface cues such as headings or paragraph breaks. Instead, it identifies the functional role of each segment in R. Typical roles include introduction, methodology, literature review, and empirical analysis. Each segment is then mapped to its corresponding page in T. This process produces a set of aligned modules \{ m_1, m_2, \ldots, m_n \}. Each module represents a coherent unit for evaluation in terms of both semantics and function. Crucially, this structural alignment does not aim to constrain the agent’s creativity but to assess its capability to adhere to standardized reporting schemas.
Evaluating the complex and varied structures of generated reports is a major difficulty in this field. We explicitly acknowledge a structural vulnerability in our approach. Automatic classification carries a risk of inaccurate mapping. Misclassified segments could receive mismatched evaluation criteria. An early alignment error can indeed propagate through the pipeline and distort downstream scores. To mitigate this risk, our framework focuses primarily on content quality rather than rigid formatting. Candidate reports frequently contain hybrid sections or unconventional ordering. Our restructuring module addresses this challenge directly. It groups text based on thematic coherence rather than surface headings. It maps these semantic chunks to a flexible functional template.
Structural alignment reduces the impact of contextual drift during evaluation. Contextual drift occurs when a section is penalized for missing information that appears in other parts of the report. By constraining evaluation to the page level, Deep-Research Eval ensures that each module is assessed only within its intended scope. This design improves the accuracy and fairness of the evaluation results.
To illustrate this conceptual advantage, consider a specific case of contextual drift. An agent might generate a report that mistakenly places experimental details in the literature review instead of the methodology section. A standard document-level evaluation scans the entire text. It detects these details and awards a high score for methodological completeness. The evaluation model mixes content across different sections and misses the structural failure. In contrast, Deep-Research Eval isolates the methodology page as a strict evaluation unit. It detects the missing information exactly where it belongs. The framework then properly penalizes the structural error. This mechanism prevents content from one section from confusing the assessment of another.

3.2. Content Quality Assessment

The Content Quality Assessment module evaluates the linguistic quality, structural integrity, and analytical rigor of each aligned page. The module adopts a criteria-driven scoring strategy. It does not apply a fixed rubric to all the pages. Instead, the evaluation criteria are adjusted according to the functional role and thematic focus of each page.

3.2.1. Dynamic Dimension and Criteria Generation

For each page m_i, the system first performs dynamic dimension generation using LLMs. A large language model analyzes the topic and communicative purpose of the page. Based on this analysis, it generates a set of evaluation dimensions:
D_i = \{ d_{i,1}, d_{i,2}, \ldots, d_{i,k} \}.
Commonly generated dimensions include the following:
  • Content Completeness: Whether the page covers the key elements required by its intended function.
  • Analytical Depth: The extent to which the content demonstrates reasoning, abstraction, and technical detail.
  • Academic Expression: The clarity, coherence, formality, and precision of the terminology.
The dynamic generation mechanism is designed to adapt to the research task, not the individual report. Once a task is defined, the evaluation dimensions and rubrics remain fixed for all candidate reports under that task to ensure fairness.
After the dimensions are defined, the system generates evaluation criteria. For each dimension d_{i,j}, the model constructs a five-level rubric whose scores range from 1 to 5. Each level includes explicit qualitative descriptions and corresponding rewards or penalties. LLM outputs inevitably carry a degree of randomness, a well-known challenge for LLM-as-a-judge; we therefore use this fine-grained scoring strategy to constrain the model’s reasoning path and improve scoring stability. To illustrate this process, consider the ‘Analytical Depth’ dimension generated for a Methodology page. The dynamic five-level rubric is constructed as follows:
  • Level 1 (Score 1): The text merely lists methods without any explanation.
  • Level 2 (Score 2): The text describes methods briefly but lacks justification.
  • Level 3 (Score 3): The methods are explained and justified, but technical details are shallow.
  • Level 4 (Score 4): The text provides strong reasoning and sufficient technical depth for the chosen methods.
  • Level 5 (Score 5): The text demonstrates profound analytical reasoning, thorough technical specifications, and a critical evaluation of the methodology.
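Such a rubric can be held as plain data and rendered into a judge prompt. The sketch below is illustrative: the field names (`dimension`, `levels`) and the prompt wording are assumptions, not the framework's actual schema.

```python
# Hypothetical representation of one dynamically generated rubric.
# Field names and prompt wording are illustrative assumptions.
rubric = {
    "dimension": "Analytical Depth",
    "levels": {
        1: "Merely lists methods without any explanation.",
        2: "Describes methods briefly but lacks justification.",
        3: "Methods are explained and justified, but technical details are shallow.",
        4: "Strong reasoning and sufficient technical depth for the chosen methods.",
        5: "Profound analytical reasoning, thorough technical specifications, "
           "and a critical evaluation of the methodology.",
    },
}

def rubric_prompt(rubric, page_text):
    """Render the five-level rubric into the scoring prompt for the judge LLM."""
    lines = [f"Score the page on '{rubric['dimension']}' (1-5):"]
    lines += [f"  {score}: {desc}" for score, desc in sorted(rubric["levels"].items())]
    lines.append("Page under evaluation:")
    lines.append(page_text)
    return "\n".join(lines)
```

Keeping the rubric as data rather than free text makes it trivial to reuse the same fixed criteria across all candidate reports for a task, which is what the fairness requirement above calls for.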

3.2.2. Scoring and Quantitative Aggregation

Each page m_i is scored on all assigned dimensions, producing raw scores s_{i,j}. Each dimension is associated with a weight w_j, which reflects its relative importance. Crucially, these weights are not uniform: the large language model assigns them dynamically during the criteria generation phase, depending on the specific research topic. The overall Score Rate is calculated as follows:
\text{Score Rate} = \frac{\sum_{i=1}^{N} \sum_{j} w_j \cdot s_{i,j}}{\sum_{i=1}^{N} \sum_{j} w_j \cdot S_{\max,j}}.
In addition to this normalized metric, Deep-Research Eval generates a Content Quality Score. The model is prompted using the original page content and the aggregated scoring results. This process produces an integrated assessment that reflects content coverage, analytical quality, compliance with instructions, and readability.
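The weighted aggregation can be sketched in a few lines. The per-page `(score, weight, s_max)` triple layout below is an illustrative data shape, not the framework's actual interface.

```python
def score_rate(pages):
    """Normalized Score Rate: weighted achieved scores over weighted maxima.

    `pages` is a list of dicts mapping dimension name -> (score, weight, s_max);
    this data shape is an illustrative assumption.
    """
    num = sum(w * s for page in pages for (s, w, s_max) in page.values())
    den = sum(w * s_max for page in pages for (s, w, s_max) in page.values())
    return num / den

# Two toy pages with topic-specific (non-uniform) dimension weights.
pages = [
    {"completeness": (4, 0.5, 5), "depth": (3, 0.3, 5), "expression": (5, 0.2, 5)},
    {"completeness": (2, 0.6, 5), "depth": (4, 0.4, 5)},
]
rate = score_rate(pages)  # (3.9 + 2.8) / (5.0 + 5.0) = 0.67
```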

3.3. Source Reliability Assessment

Content Quality Assessment evaluates writing quality. It does not judge whether a report is supported by reliable external evidence. This limitation is critical because large language models often produce fluent text that contains factual errors. This problem is more evident in long analytical reports generated by Deep Research Agents. To address this issue, we introduce Source Reliability Assessment. This module evaluates whether a candidate report relies on authoritative sources and whether its claims accurately reflect those sources.
Source Reliability Assessment follows two core principles. First, a reliable report should use authoritative external sources. Second, the factual claims in the report should be verifiable against these sources at a fine-grained level. Based on these principles, the assessment centers on factual grounding verification, which checks page-level claims under the proposed Paged-RAG paradigm.

3.3.1. Configurable Reference Database

We define a reference database, denoted as 𝒟. This serves as the reference knowledge space for factual verification. This database contains documents expected to support accurate content for a specific task. This design ensures task transferability. It enables the construction of domain-specific resource pools. Examples include academic papers, technical manuals, and legal case files.
The database defines the strict external information scope for judging factual grounding. It creates a strictly controlled verification environment. The assessment intentionally measures conformity to this designated corpus. It does not evaluate boundless open-world correctness. The system applies a penalty if an agent generates claims outside this predefined boundary. This penalty does not imply that the out-of-scope claims are inherently false. Rather, it indicates that the agent failed to strictly adhere to the provided evidence pool. This strict boundary provides a clear optimization target for the future training of task-specific models.

3.3.2. Paged-RAG: Page-Aligned Retrieval-Augmented Verification

Standard Retrieval-Augmented Generation (RAG) methods face an inherent trade-off regarding retrieval granularity. Large chunks often dilute retrieval signals with irrelevant noise, leading to decreased precision. Conversely, small chunks offer high precision. However, they frequently lack sufficient context for complex reasoning. To balance retrieval precision with contextual completeness, Paged-RAG employs a “Retrieve-then-Trace” strategy. This approach aligns retrieval with the “page” unit, which serves as a natural container for complete information.
Specifically, the system first retrieves relevant fine-grained text fragments. It then identifies their source page via a majority consensus mechanism. The theoretical foundation of this approach is rooted in the attention mechanism of large language models (LLMs). When the context contains repeated content, the model naturally amplifies the important signals within that content. Once the full page is recalled, the system focuses on the information aligning with the retrieved fragments. The attention mechanism then cross-verifies and synergistically enhances this specific content. This process effectively highlights factual evidence. Simultaneously, it naturally filters out dispersed, non-repetitive noise present on the page.
While majority voting could theoretically introduce noise if the chunk distribution is too scattered, our design mitigates this risk. The use of small chunks increases the density of chunks per page. Consequently, the chunks participating in the vote become more concentrated on the correct source page. This concentration further reduces noise and leads to the improved retrieval accuracy observed in our experiments.
The Paged-RAG verification pipeline consists of the following steps.
  • Step 1: Page Segmentation
Let \mathcal{D} = \{ d^{(1)}, d^{(2)}, \ldots, d^{(M)} \} denote the collection of reference documents, where M is the total number of documents.
Each document d^{(m)} \in \mathcal{D} is segmented into pages:
d^{(m)} \rightarrow \{ p_1^{(m)}, p_2^{(m)}, \ldots, p_{L_m}^{(m)} \},
where L_m is the number of pages in document d^{(m)}. Each page p_l^{(m)} is assigned a unique provenance identifier, denoted as \mathrm{page\_id}_l^{(m)}.
  • Step 2: Chunk Generation and Vectorization
Each page p_l^{(m)} is further segmented into a set of text chunks:
p_l^{(m)} \rightarrow \{ c_{l,1}^{(m)}, c_{l,2}^{(m)}, \ldots, c_{l,n_l}^{(m)} \},
where n_l is the number of chunks on page p_l^{(m)}.
Each chunk c_{l,j}^{(m)} is encoded into a dense vector using an embedding encoder \mathrm{vec}(\cdot):
e_{c_{l,j}^{(m)}} = \mathrm{vec}(c_{l,j}^{(m)}).
The retrieval index stores chunk-level information as triplets:
\mathcal{I} = \{ ( e_{c_{l,j}^{(m)}}, c_{l,j}^{(m)}, \mathrm{page\_id}_l^{(m)} ) \},
where m \in [1, M], l \in [1, L_m], and j \in [1, n_l].
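Steps 1 and 2 can be sketched as a small indexing routine. The fixed-size character chunking and the toy character-count encoder below stand in for a real segmenter and embedding model; they are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    embedding: list   # e_{c_{l,j}}
    chunk: str        # c_{l,j}
    page_id: str      # page_id_l, the provenance identifier

def build_index(doc_id, pages, chunk_size, vec):
    """Segment each page into chunks and store (embedding, chunk, page_id)
    triplets, mirroring Steps 1-2. `vec` is a placeholder encoder."""
    index = []
    for l, page in enumerate(pages, start=1):
        page_id = f"{doc_id}:p{l}"
        for start in range(0, len(page), chunk_size):
            chunk = page[start:start + chunk_size]
            index.append(IndexEntry(vec(chunk), chunk, page_id))
    return index

# Toy encoder: counts of a few characters; a real system would call a
# dense embedding model here.
toy_vec = lambda text: [text.count(ch) for ch in "abcde"]
index = build_index("d1", ["abcabc", "ddee"], chunk_size=3, vec=toy_vec)
```

The essential design point is that every indexed chunk carries its `page_id`, so later retrieval hits can be traced back to a full page.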
  • Step 3: Hybrid Retrieval
Given a candidate report chunk r_i, its embedding is computed as
e_{r_i} = \mathrm{vec}(r_i),
where i indexes the chunks in the candidate report.
Each chunk in the index is scored using a hybrid function that combines dense and sparse retrieval:
\mathrm{Score}(r_i, c_{l,j}^{(m)}) = \lambda \cdot \cos(e_{r_i}, e_{c_{l,j}^{(m)}}) + (1 - \lambda) \cdot \mathrm{BM25}(r_i, c_{l,j}^{(m)}),
where \lambda \in [0, 1] controls the balance between semantic similarity and lexical matching.
The top-K scoring chunks are retrieved together with their page identifiers:
C_i = \{ ( c_{i,k}, \mathrm{page\_id}_k ) \mid k = 1, \ldots, K \}.
  • Step 4: Hit Page Selection
For the retrieved set C_i, the hit page h_i is selected by majority voting over page identifiers:
h_i = \arg\max_{\mathrm{page\_id}} \bigl| \{ k \mid \mathrm{page\_id}_k = \mathrm{page\_id},\ (c_{i,k}, \mathrm{page\_id}_k) \in C_i \} \bigr|.
This page is assumed to be the most relevant source for r_i.
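Steps 3 and 4 can be sketched as follows. The word-overlap function stands in for a real BM25 scorer, and the tuple-based toy index is an illustrative simplification.

```python
import math
from collections import Counter, namedtuple

Entry = namedtuple("Entry", "embedding chunk page_id")

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_retrieve(q_vec, q_text, index, sparse_score, lam=0.7, k=3):
    """Step 3: rank chunks by lambda * cosine + (1 - lambda) * sparse score.
    `sparse_score` stands in for BM25."""
    ranked = sorted(
        index,
        key=lambda e: lam * cosine(q_vec, e.embedding)
        + (1 - lam) * sparse_score(q_text, e.chunk),
        reverse=True,
    )
    return ranked[:k]

def hit_page(retrieved):
    """Step 4: majority vote over the page identifiers of the top-K chunks."""
    return Counter(e.page_id for e in retrieved).most_common(1)[0][0]

index = [
    Entry([1.0, 0.0], "deep research agents", "p1"),
    Entry([1.0, 0.0], "agents plan and retrieve", "p1"),
    Entry([0.0, 1.0], "unrelated appendix text", "p2"),
]
overlap = lambda q, c: len(set(q.split()) & set(c.split()))  # BM25 stand-in
top = hybrid_retrieve([1.0, 0.0], "research agents", index, overlap, k=2)
page = hit_page(top)  # both top chunks come from page p1
```

Because small chunks make the vote dense on the correct page, the majority vote is robust even when a few retrieved chunks are off-page.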
  • Step 5: LLM-Based Verification
The verification context for chunk r_i is constructed as
A_i = \{ c_{i,1}, \ldots, c_{i,K} \} \cup \{ h_i \},
where the context includes the retrieved chunks and their aligned page identifier.
A large language model computes the factual grounding verification score:
\mathrm{FGV}_i = g_{\mathrm{LLM}}(r_i, A_i),
where \mathrm{FGV}_i \in [0, 1] measures how well r_i is supported by the evidence.
If the candidate report contains N chunks, the overall factual grounding verification score is defined as
\mathrm{SRS} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{FGV}_i.
The Source Reliability Score is used to measure overall report reliability.

3.4. Complexity Analysis

We acknowledge the computational overhead introduced by our method. While the latency in retrieval and voting steps is negligible, the generation step processes larger context windows. We analyze the complexity of the Transformer’s self-attention mechanism, which serves as the primary bottleneck. Let L_q denote the length of the user query and L_{gen} denote the length of the generated response. In a standard RAG setting, the model retrieves k chunks, each with a length of L_c. The total input context length, N_{std}, is defined as
N_{\mathrm{std}} = L_q + k \cdot L_c.
The complexity of the pre-filling (prompt processing) phase scales quadratically with the input length. Thus, for standard RAG, the complexity is
C_{\mathrm{std}} = O(N_{\mathrm{std}}^2) = O((L_q + k \cdot L_c)^2).
In Paged-RAG, the input context includes the query, the top-k chunks, and the full source page of length L_p. The total input length, N_{page}, becomes
N_{\mathrm{page}} = L_q + k \cdot L_c + L_p.
Consequently, the pre-filling complexity increases to
C_{\mathrm{page}} = O(N_{\mathrm{page}}^2) = O((L_q + k \cdot L_c + L_p)^2).
During the decoding (generation) phase, the complexity per token is linear with respect to the current sequence length. The computational cost increase can be approximated by the ratio of the context lengths:
R_{\mathrm{cost}} \approx \frac{L_q + k \cdot L_c + L_p}{L_q + k \cdot L_c} = 1 + \frac{L_p}{L_q + k \cdot L_c}.
Since a full page typically contains significantly more tokens than individual chunks (i.e., L_p \gg L_c), the computational cost inevitably increases. However, it is important to note that this is a theoretical analysis. In real-world deployment, modern acceleration techniques, such as FlashAttention and KV caching, can effectively optimize the attention computation and mitigate the latency impact of longer contexts.
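A quick numeric sanity check of the ratio, using illustrative token counts (the values of L_q, k, L_c, and L_p below are assumptions for the example, not measurements from this work):

```python
# Illustrative token counts (assumptions, not measured values).
L_q, k, L_c, L_p = 50, 5, 200, 1500

n_std = L_q + k * L_c        # standard RAG context length: 1050 tokens
n_page = n_std + L_p         # Paged-RAG context length: 2550 tokens
r_cost = n_page / n_std      # decoding-cost ratio, = 1 + L_p / (L_q + k*L_c)
prefill_ratio = (n_page / n_std) ** 2  # quadratic pre-filling cost ratio
```

With these numbers the decoding cost roughly 2.4x and the quadratic pre-filling cost roughly 5.9x, which makes concrete why the mitigation techniques above matter in deployment.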

4. Experiments and Results

This section evaluates the effectiveness and generalization ability of the proposed framework. The experiments focus on two parts: the adaptive page-level scoring system and Paged-RAG. We report the experimental design and the main results for the adaptive scoring system.

4.1. Adaptive Page-Level Scoring System

This subsection evaluates the Content Quality Assessment module. We test five representative Deep Research Agents. We compare our method with a strong baseline and conduct ablation studies. The goal is to check whether the adaptive page-level scoring mechanism provides stable and reliable evaluations.

4.1.1. Dataset Construction

The evaluation dataset consists of two components. Each component serves a specific role.
  • Reference Database: We build a reference database using about 1600 papers published in 2024 and 2025. Each research topic includes around 80 papers. To ensure the academic rigor of the database, the papers were systematically selected. We first applied keyword filtering based on the target topics. We then refined the selection by prioritizing highly cited papers published in top-tier conferences and journals. While DRAs may have encountered some of these papers during their pre-training phase or through live web retrieval, this overlap aligns with real-world DRA usage. Our framework specifically aims to evaluate how reliably these agents synthesize and attribute this retrieved evidence, rather than measuring zero-shot novelty.
  • Survey Selection and Query Generation: We select 20 high-quality survey papers from several hundred candidates. To avoid selection bias, we implemented a structured protocol. First, we defined a broad set of keyword categories covering various AI subfields to guarantee topic diversity. Second, we filtered the candidate pool by selecting only highly cited surveys published in recognized academic venues, which ensured the baseline academic rigor. We then design natural language queries based on the topics of these surveys. These queries prompt five advanced research agents—GPT-4, Gemini Pro, Gemini Flash, Perplexity, and Qwen3—to generate research reports. This process produces 100 candidate reports. Although the dataset comprises 100 full-length documents, our fine-grained framework operates at the page level, resulting in over 1200 unique evaluation units. The selected topics focus on AI research areas, such as Multi-Modal LLMs and Intelligent Route Recommendation Systems. We review each topic to confirm that sufficient references are available.

4.1.2. Experimental Settings

We apply a unified experimental configuration. This setup ensures fair comparison and supports reproducibility.
  • Evaluated Research Agents: We evaluate five state-of-the-art research agents: GPT-4 [2], Gemini Pro [3], Gemini Flash [3], Perplexity [4], and Qwen3 [8]. These agents are widely used for automated research report generation.
  • Evaluation Model: We perform all evaluations using the Qwen-Plus model through an API (https://bailian.console.aliyun.com, accessed on 13 August 2025). We set the temperature to 0.4 and fix the random seed at 42. This model supports long-context inputs and follows evaluation instructions in a stable manner, which is required for full-length research reports.
  • Baseline and Ablation Settings: We use FACE as the baseline method [5]. FACE is a strong LLM-based evaluation framework and is closely related to our approach. We also run ablation studies to measure the impact of each module. We test three variants. The first variant, w/o Criteria, removes fine-grained evaluation criteria and scores only at the dimension level. The second variant, w/o Alignment, evaluates each report as a whole without page-level segmentation. The third variant, Naive LLM Judge, uses a single prompt without modular design or rule-based scoring.
  • Human Evaluation Setup: Three annotators scored reports on a scale from 1 to 10 across four dimensions. To ensure consistency, the annotators were provided with a detailed, task-specific scoring rubric containing explicit criteria for each score level. All the annotators strictly adhered to this unified standard. To minimize bias, the presentation order of the reports was randomized, and the identity of the generating model was blinded during the process.
  • Evaluation Metrics: We measure the agreement between automated scores and human ratings using a consistency-based metric system. The main metric is the Composite Consistency Index (CCI). The CCI jointly measures trend alignment and numerical accuracy between automated evaluation results and human judgments for long academic texts.
Positive Correlation Score (PCS)
The PCS evaluates the consistency of score trends between automated evaluations and human ratings. We compute the PCS using the Pearson correlation coefficient. We then normalize the result to a unified scale. This score reflects whether the automated system preserves the ranking given by human judges.
Accuracy Score (AS)
The AS measures numerical agreement between automated scores and human ratings. The metric is based on the mean absolute error (MAE). It applies an exponential penalty to large deviations, so larger errors result in lower scores.
Composite Consistency Index (CCI)
The CCI combines two complementary metrics with equal weights:
\mathrm{CCI} = \alpha \cdot \mathrm{PCS} + \beta \cdot \mathrm{AS}, \quad \alpha = \beta = 0.5.
This design balances trend consistency and value accuracy.
Dimension-Wise Aggregation
We compute the CCI separately for four evaluation dimensions: Comprehensiveness, Insight, Instruction Following, and Readability. We then aggregate these scores using weighted summation:
\mathrm{CCI}_{\mathrm{total}} = \sum_{d=1}^{4} w_d \cdot \mathrm{CCI}_d, \quad \sum_{d=1}^{4} w_d = 1.
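A minimal sketch of the consistency metrics follows. The (r + 1)/2 rescaling of the Pearson coefficient into [0, 1] and the decay constant `tau` in the exponential MAE penalty are assumed details, chosen here for illustration where the text does not fully specify them.

```python
import math

def pcs(auto, human):
    """Positive Correlation Score: Pearson r rescaled to [0, 1].
    The (r + 1) / 2 normalization is one plausible choice (an assumption)."""
    n = len(auto)
    ma, mh = sum(auto) / n, sum(human) / n
    cov = sum((a - ma) * (h - mh) for a, h in zip(auto, human))
    sa = math.sqrt(sum((a - ma) ** 2 for a in auto))
    sh = math.sqrt(sum((h - mh) ** 2 for h in human))
    r = cov / (sa * sh) if sa and sh else 0.0
    return (r + 1) / 2

def accuracy_score(auto, human, tau=2.0):
    """Accuracy Score: exponential penalty on the mean absolute error.
    The decay constant `tau` is an assumed hyperparameter."""
    mae = sum(abs(a - h) for a, h in zip(auto, human)) / len(auto)
    return math.exp(-mae / tau)

def cci(auto, human, alpha=0.5, beta=0.5):
    """Composite Consistency Index: CCI = alpha * PCS + beta * AS."""
    return alpha * pcs(auto, human) + beta * accuracy_score(auto, human)

perfect = cci([8, 6, 7, 9], [8, 6, 7, 9])   # exact agreement
shifted = cci([8, 6, 7, 9], [7, 5, 6, 8])   # same trend, offset by 1
```

A uniform offset keeps the trend score at its maximum while the accuracy term decays, which is exactly the trend-versus-value decomposition the CCI is designed to balance.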

4.1.3. Results and Analysis

Table 1 and Figure 3 present the evaluation results of five Deep Research Agents under the proposed evaluation framework. The analysis focuses on two aspects: content quality and source reliability.
Content Quality Analysis
Table 1 reports the content quality results measured along the four dimensions: Comprehensiveness, Insight, Instruction Following, and Readability. Gemini Pro achieves the highest overall performance, with a Content Quality Score (CQS) of 4.14 and a Score Rate of 38.03. This result indicates that Gemini Pro produces research reports with clearer structure and more stable logical organization. Its advantage mainly comes from its ability to maintain long-context coherence while integrating relevant evidence across sections.
GPT-4 shows strong performance in Insight and Readability, but its Instruction Following score is noticeably lower, which limits its overall CQS. Gemini Flash achieves the highest Readability score, suggesting smoother language generation, yet its consistency across the dimensions is slightly weaker than Gemini Pro. Perplexity and Qwen3 obtain lower CQS values. This result suggests that these models struggle with content completeness and logical continuity in long-form academic reports.
Source Reliability Analysis
Figure 3 reports the source reliability results measured by the Source Reliability Score (SRS). The SRS is evaluated based on the external retrieval-based factual verification score.
Overall, GPT-4 achieves the highest SRS of 74.26, indicating strong performance in generating factually supported and verifiable content. Gemini Pro follows with a score of 71.45. Gemini Flash and Qwen3 demonstrate comparable reliability, scoring 70.12 and 69.98, respectively, while Perplexity obtains the lowest score of 67.58 among the evaluated models.
Topic-Level Distribution Analysis
Figure 4 shows how the CQS and SRS are distributed across 100 reports covering 20 topics. Each data point represents a single report. The vertical spread reflects the variation in quality among the different topics. Gemini Pro exhibits a more concentrated distribution in higher score ranges. Other models show wider variance and lower upper bounds. This distribution pattern indicates that the proposed framework can distinguish report quality differences at the topic level.
Ablation Study
Table 2 reports the ablation results under consistent experimental settings. Three metrics are used: CCI, PCS, and AS. As defined above, the CCI measures overall consistency with human judgments, the PCS captures trend alignment with human rankings, and the AS measures numerical agreement with human scores; higher values indicate closer agreement.
The full Deep-Research Eval framework achieves the highest CCI, PCS, and AS values. Removing alignment leads to a clear drop in all the metrics. Removing explicit criteria also reduces performance, but the decline is less severe. These results indicate that alignment plays a more critical role, while criteria-based scoring provides additional stability.
Dimension-Level Evaluation Comparison
Figure 5 presents the dimension-level scores assigned by all the approaches across the four academic dimensions. Naive LLM assigns uniformly high scores to most models, which indicates limited discriminative ability. Deep-Research Eval produces more differentiated score distributions across the dimensions. FACE assigns relatively higher scores to the Perplexity model, which differs from the other frameworks.
Consistency Analysis Across Dimensions
Figure 6 reports the CCI scores of Deep-Research Eval, FACE, and three ablation variants across the four dimensions. Deep-Research Eval achieves the highest CCI values in all the dimensions. Removing alignment causes the most significant decline. Removing criteria also reduces consistency, but to a lesser extent. Naive LLM shows imbalanced performance, especially in the Instruction Following dimension. FACE obtains the lowest CCI values across all the dimensions.
Table 3 details the score variance across the models. For Deep-Research Eval, we report the variance from a secondary assessment, which ensures a consistent evaluation scale. Because we applied explicit grading rubrics and quantified reward mechanisms, the models achieved highly stable scores with minimal variance. By contrast, the FACE baseline exhibits substantially larger variance across repeated evaluations.
Furthermore, we measured the computational cost of the adaptive page-level scoring system. Because the evaluation operates at the page level, the total number of LLM calls depends dynamically on the page count of the candidate report. On average, evaluating a full-length report requires approximately 433.63 s using the Qwen-Plus API.

4.2. Paged-RAG

This section evaluates the proposed Paged-RAG method on retrieval tasks. We report the experimental setup and results in detail.

4.2.1. Dataset and Preprocessing

We use the CRUD-RAG [22] benchmark to evaluate Retrieval-Augmented Generation (RAG) performance. CRUD-RAG is a Chinese benchmark designed for RAG evaluation in large language models. Its retrieval corpus contains 86,834 high-quality news articles. Most of the articles were published after July 2023. This time range reduces the risk of data leakage from model pre-training and improves temporal relevance.
The benchmark follows the CRUD (Create, Read, Update, Delete) framework and defines six task types, as detailed in Table 4. We evaluate more than 5000 question–answer pairs sampled from these tasks. This setting allows us to test retrieval and generation under different information usage patterns.
The selection of this benchmark aligns with the core objective of our framework: assessing the reliability of DRA reports, which depends on factual grounding verification. We conceptualize this verification process as a specialized form of question answering. CRUD-RAG includes diverse tasks beyond basic QA, such as hallucination modification and text continuation. This multi-task environment enables a comprehensive evaluation of the effectiveness and applicable scenarios of Paged-RAG.
  • Page-level Document Processing: The preprocessing pipeline aims to group the top-p most related news articles into one retrieval unit. We define this unit as a logical “page” in Paged-RAG. In all the experiments, we set p = 5 . Each page therefore contains at most five documents.
    We use a multi-stage strategy to build page-level units. Each stage refines document similarity step by step.
    In the first stage, we apply coarse grouping. This step relies on shared named entities and time proximity. Its goal is to remove clearly unrelated documents and form candidate sets that describe the same event or topic.
    In the second stage, we further split large candidate sets. We use TF-IDF features and K-means clustering. This step improves lexical consistency and limits the size of each set. It also keeps later similarity computation efficient.
    In the final stage, we focus on the top-p constraint. For each refined candidate set, we compute pairwise semantic similarity using a neural ranking model. We then apply a greedy selection process. This process selects documents that are most similar to each other until the page reaches the size limit of p = 5 .
    Each resulting cluster forms a page-level unit in Paged-RAG. All the documents keep their original source identifiers. This design supports page-level retrieval, citation, and factual grounding during generation.
  • Optimized Evaluation Pipeline: We improve the CRUD-RAG evaluation pipeline in two ways. First, we use pipeline parallelism to process multiple evaluation tasks at the same time. Second, we apply vLLM-based batch inference for large language model execution. These changes reduce overall evaluation time.
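The final greedy stage of the page-construction pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: a bag-of-words cosine stands in for the neural ranking model, and `build_page` is a hypothetical helper name, not part of our released code.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(cnt * b[tok] for tok, cnt in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_page(candidates, p=5):
    # candidates: list of (doc_id, text) pairs from one refined candidate set.
    # Seed the page with the most similar pair, then greedily add the document
    # most similar to the current page, until the page holds p documents.
    vecs = {doc_id: Counter(text.lower().split()) for doc_id, text in candidates}
    ids = list(vecs)
    if len(ids) <= p:
        return ids
    seed = max(combinations(ids, 2),
               key=lambda pair: cosine(vecs[pair[0]], vecs[pair[1]]))
    page = list(seed)
    rest = [i for i in ids if i not in page]
    while len(page) < p and rest:
        best = max(rest, key=lambda i: sum(cosine(vecs[i], vecs[j]) for j in page))
        page.append(best)
        rest.remove(best)
    return page
```

Greedy selection keeps the page internally coherent: each added document must already be similar to everything selected so far, which is what lets a page act as a single retrieval and citation unit.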

4.2.2. Experimental Settings

The experimental configuration ensures a fair comparison between Paged-RAG and baseline methods.
  • Baseline Retrieval Methods: We compare Paged-RAG with three standard retrieval approaches. The first is BM25, which is based on sparse term statistics. The second is dense retrieval using semantic embeddings. The third is a hybrid method that combines BM25 and dense vectors. These methods represent common retrieval strategies in RAG systems.
  • Evaluation Metrics: We follow the evaluation protocol of CRUD-RAG. We assess both generation quality and factual accuracy.
    For generation quality, we use BLEU scores, including BLEU-1, BLEU-2, BLEU-3, and BLEU-4. BLEU-1 measures word-level overlap. BLEU-4 measures higher-order fluency and structure.
    For factual accuracy, we use RAGQuestEval. This metric is proposed in CRUD-RAG. It builds factual questions from the reference answer. The model then answers these questions based on its own generated text. We compute a token-level QA-F1 score between predicted and reference answers. This score reflects factual completeness and helps detect hallucinations.
  • Model: We use Qwen2.5-7B-Instruct for both generation and evaluation. One model instance generates answers in the RAG pipeline. Another instance performs LLM-based evaluation, such as question answering in RAGQuestEval. All the experiments run on a single NVIDIA RTX 4090 GPU with 48 GB of memory. To mitigate potential variance and ensure methodological transparency, all the factual grounding verification scores are averaged across three independent evaluation runs.
  • Hyperparameter Configuration: We set the decoding temperature to 0.8 for all the experiments. For retrieval, we fix the number of retrieved chunks to top-k = 10 and apply this setting to all the methods for fairness. We use vLLM (version 0.6.1) to support efficient batch inference during generation and evaluation [23]. For the Paged-RAG pipeline, the documents were segmented using a fixed chunk size of 200 tokens with an overlap of 0 tokens. We utilized the bge-m3 model to generate the semantic embeddings. All the vectorized chunks and their corresponding page identifiers were indexed and stored using the Milvus vector database (version 2.6.9).
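The token-level QA-F1 that RAGQuestEval aggregates when comparing predicted and reference answers can be sketched as follows; whitespace tokenization is a simplification of the benchmark's actual tokenizer:

```python
from collections import Counter

def qa_f1(pred: str, ref: str) -> float:
    # Token-level F1 between a predicted answer and a reference answer.
    # Overlap is counted with multiplicity via Counter intersection.
    p_toks, r_toks = pred.split(), ref.split()
    common = Counter(p_toks) & Counter(r_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(r_toks)
    return 2 * precision * recall / (precision + recall)
```

Averaging this score over all factual questions derived from the reference answer yields the QA_avg_F1 values reported in Table 5.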

4.2.3. Results and Analysis

Table 5 presents the evaluation results of several retrievers on multiple tasks under the CRUD-RAG evaluation protocol. The protocol measures both generation quality and factual accuracy. Generation quality is evaluated using BLEU scores from BLEU-1 to BLEU-4. Factual accuracy is measured by the question-based metric QA_avg_F1, which is used to identify factual omissions and hallucinated content.
Across most tasks, Paged-RAG performs better than the Dense, BM25, and Hybrid retrievers. The improvement is especially clear on question answering and long-form generation tasks. On the QuestAnswer1Docs, QuestAnswer2Docs, and QuestAnswer3Docs benchmarks, Paged-RAG shows consistent gains on all the BLEU n-gram metrics and also achieves the highest QA_avg_F1. This result indicates stronger factual completeness: Paged-RAG retrieves more relevant and informative context, so the model can answer a larger number of factual questions derived from the reference answers. This property is important for cross-document question answering.
On the ContinueWriting task, which focuses on long-context coherence, Paged-RAG again achieves the highest scores on both BLEU and QA_avg_F1. The absolute BLEU values remain low because the task is open-ended, but the relative gains indicate better support for coherent continuation and better preservation of factual information in long outputs.
On hallucination-sensitive and summarization tasks, BLEU and QA_avg_F1 show different trends. On the HalluModified task, Paged-RAG reaches the highest QA_avg_F1, while its BLEU scores are slightly lower than those of BM25. A similar pattern appears on the Summary task, where Paged-RAG again achieves the best factual accuracy but lower BLEU scores. This pattern occurs because Paged-RAG retrieves a wider range of contextual information: the retrieved content is relevant at the semantic level but can introduce lexical repetition, which lowers string-based BLEU scores while improving semantic coverage and factual completeness.
The Hybrid retriever combines Dense and BM25 signals. However, it does not consistently perform better than BM25 alone. This result suggests that simple fusion of dense and sparse retrieval signals can add noise instead of useful information.
Table 6 details the time efficiency of different retrievers. In standard RAG, the input consists solely of retrieved chunks, whereas Paged-RAG augments this context by appending the full source page. We conducted supplementary experiments to quantify this impact: the extended context results in inference latency approximately two to three times higher than that of the standard retrieval baselines.
This latency difference is most evident in the generation phase. Standard baselines generally require 480 to 650 s for text generation. Conversely, Paged-RAG takes significantly longer. For example, in the QuestAnswer1Docs task, Paged-RAG requires 1503.80 s. BM25 requires only 570.89 s. This represents a 2.6-fold increase. Similarly, in the Summary task, Paged-RAG (1585.04 s) is more than three times slower than the Dense retriever (482.74 s).
Retrieval time also increases for Paged-RAG. Fetching full source pages demands more computational resources. In the ContinueWriting task, Paged-RAG takes 175.81 s for retrieval. The Hybrid baseline only takes 39.66 s. Evaluation times, however, remain consistent across all the methods. Therefore, processing longer input prompts during LLM inference is the main computational bottleneck for Paged-RAG.

5. Limitations

We identify three specific limitations in the Deep-Research Eval framework. First, the evaluation uses a constructed reference database. This closed setting prevents us from testing open-web discovery. However, the choice is deliberate: we prioritize verifiable precision and reproducibility over broad recall, which keeps the framework reliable for fact checking in high-stakes contexts.
Second, our sample size of reports is small. Our fine-grained scoring strategy partially compensates for this limitation by generating numerous evaluation units, providing sufficient density to observe statistically significant effects. However, this segmentation approach does not resolve the limited topical diversity. It also does not improve model coverage. Furthermore, pages derived from the same report lack guaranteed statistical independence. We plan to expand the evaluation to include more topics and agent models in the future.
Finally, our observations regarding the superiority of Paged-RAG over traditional hybrid retrieval methods are currently limited to the CRUD-RAG benchmark configuration. We acknowledge that these findings provide preliminary evidence rather than a universal rule, and future work will need to validate this approach across a broader range of retrieval tasks and diverse knowledge domains.

6. Conclusions

This paper introduces Deep-Research Eval, an evaluation framework designed to assess the quality and reliability of long-form reports generated by Deep Research Agents. The framework features an adaptive fine-grained scoring system for content quality assessment. For factual verification, it employs a Paged-RAG mechanism to retrieve evidence from a Configurable Reference Database. This design supports factual verification and clear alignment between sources and generated content in complex reports.
The experimental results show that Deep-Research Eval provides stable and consistent page-level evaluations across different report structures. The framework also demonstrates strong agreement with human judgments on content quality and source reliability. In all the tested settings, it performs better than baseline methods and ablation variants across multiple evaluation metrics.
The framework allows detailed identification of errors at the page level. It also enforces explicit factual grounding through source-aware evaluation. By leveraging a modular ground truth pool, it offers a scalable, domain-adaptable method to evaluate the reliability of Deep Research Agents. These features make Deep-Research Eval a practical tool for analyzing both the strengths and the limitations of AI-generated academic reports.

Author Contributions

Conceptualization, Y.T. and Y.Z.; methodology, Y.T.; formal analysis, Y.T. and Y.H.; investigation, Y.T. and Y.H.; data curation, Y.Z.; writing—original draft preparation, Y.T. and Y.Z.; writing—review and editing, S.L. and Z.W.; visualization, Y.T. and Y.Z.; supervision, S.L. and Z.W.; project administration, Z.W.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jiangsu Province Major Scientific Project, grant number BG2024032.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xu, R.; Peng, J. A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications. arXiv 2025, arXiv:2506.12594. [Google Scholar] [CrossRef]
  2. OpenAI. Deep Research System Card. Available online: https://openai.com/ (accessed on 28 January 2026).
  3. Google Gemini Team. Gemini: A Family of Highly Capable Multimodal Models. Available online: https://deepmind.google/technologies/gemini/ (accessed on 28 January 2026).
  4. Perplexity. Introducing Perplexity Deep Research. Available online: https://www.perplexity.ai/ (accessed on 28 January 2026).
  5. Du, M.; Xu, B.; Zhu, C.; Wang, X.; Mao, Z. DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents. arXiv 2025, arXiv:2506.11763. [Google Scholar] [CrossRef]
  6. Jain, S. Human-Aligned Long-Form Evaluation (HALF-Eval): Framework for Assessing AI-Generated Content and Improvement. Amaz. Sci. 2025. Available online: https://www.amazon.science/publications/human-aligned-long-form-evaluation-half-eval-framework-for-assessing-ai-generated-content-and-improvement (accessed on 12 February 2026).
  7. Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguist. 2024, 12, 157–173. [Google Scholar] [CrossRef]
  8. Qwen Team. Deep Research Model (Qwen-Deep-Research). Available online: https://help.aliyun.com/zh/model-studio/getting-started-deep-research (accessed on 28 January 2026).
  9. Li, R.; Du, M.; Xu, B.; Zhu, C.; Wang, X.; Mao, Z. DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report. arXiv 2026, arXiv:2601.08536. [Google Scholar] [CrossRef]
  10. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023, arXiv:2312.10997. [Google Scholar]
  11. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Adv. Neural Inf. Process. Syst. (NeurIPS) 2020, 33, 9459–9474. [Google Scholar]
  12. LeVine, W.; Varjavand, B. Relevance Isn’t All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking. arXiv 2025, arXiv:2504.07104. [Google Scholar]
  13. Robertson, S.; Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr. 2009, 3, 333–389. [Google Scholar] [CrossRef]
  14. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv 2024, arXiv:2404.16130. [Google Scholar] [CrossRef]
  15. Fabbri, A.R.; Wu, C.S.; Liu, W.; Xiong, C. QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL); Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 1234–1248. [Google Scholar]
  16. Honovich, O.; Aharoni, R.; Herzig, J.; Taitelbaum, H.; Kukliansy, D.; Cohen, V.; Scialom, T.; Szpektor, I.; Hassidim, A.; Matias, Y. TRUE: Re-evaluating Factual Consistency Evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 5588–5610. [Google Scholar]
  17. Es, S.; James, J.; Anke, L.E.; Schockaert, S. RAGAs: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1234–1249. [Google Scholar]
  18. Rackauckas, Z.; Câmara, A.; Zavrel, J. Evaluating RAG-Fusion with RAGElo: An Automated Elo-Based Framework. arXiv 2024, arXiv:2406.14783. [Google Scholar]
  19. Bavaresco, A.; Bernardi, R.; Bertolazzi, L.; Elliott, D.; Fernández, R.; Gatt, A.; Ghaleb, E.; Giulianelli, M.; Hanna, M.; Koller, A.; et al. LLMs Instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. In Proceedings of the 2025 Annual Meeting of the Association for Computational Linguistics (ACL); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025. [Google Scholar]
  20. Sharma, M.; Zhang, C.B.C.; Bandi, C.; Wang, C.; Aich, A.; Nghiem, H.; Rabbani, T.; Htet, Y.; Jang, B.; Basu, S.; et al. ResearchRubrics: A Benchmark of Prompts and Rubrics for Evaluating Deep Research Agents. arXiv 2025, arXiv:2511.07685. [Google Scholar] [CrossRef]
  21. Yao, Y.; Wang, Y.; Zhang, Y.; Lu, Y.; Gu, T.; Li, L.; Zhao, D.; Wu, K.; Wang, H.; Nie, P.; et al. A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports. arXiv 2025, arXiv:2510.02190. [Google Scholar] [CrossRef]
  22. Lyu, Y.; Li, Z.; Niu, S.; Xiong, F.; Tang, B.; Wang, W.; Wu, H.; Liu, H.; Xu, T.; Chen, E. CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models. ACM Trans. Inf. Syst. 2025, 43, 1–32. [Google Scholar] [CrossRef]
  23. Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C.H.; Gonzalez, J.; Zhang, H.; Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP); Association for Computing Machinery: New York, NY, USA, 2023; pp. 425–441. [Google Scholar]
Figure 1. Process flow of report generation in a DRA. The query is processed by the DRA through decomposition, exploration, and synthesis to produce the final report.
Figure 2. Deep-Research Eval framework. The system takes a candidate report as input, performs page restructuring, and conducts parallel assessments of content quality and citation reliability to generate evaluation scores.
Figure 3. Source Reliability Score (SRS).
Figure 4. Topic-level CQS and SRS distributions across 20 topics for five different models.
Figure 5. Dimension-level evaluation results across four academic dimensions: (a) Comprehensiveness; (b) Insight; (c) Instruction Following; (d) Readability.
Figure 6. CCI scores across four academic evaluation dimensions.
Table 1. Content quality results. Bold indicates the best performance.

| Model | Comp | Insight | Follow | Read | CQS | Score Rate |
| GPT-4 | 3.8 | 4.1 | 2.63 | 5.27 | 3.95 | 35.76 |
| Perplexity | 2.7 | 2.65 | 2.42 | 4.43 | 3.05 | 20.59 |
| Gemini Flash | 3.87 | 3.87 | 2.93 | 5.67 | 4.08 | 36.31 |
| Gemini Pro | 3.88 | 4.1 | 2.98 | 5.58 | 4.14 | 38.03 |
| Qwen3 | 2.67 | 3.47 | 2.13 | 4.37 | 3.16 | 17.13 |
Table 2. Comparison and ablation study results. Bold indicates the best performance.

| Method | CCI | PCS | AS |
| Deep-Research Eval | 0.7585 | 0.9450 | 0.5720 |
| FACE | 0.2997 | 0.2738 | 0.3257 |
| Naive LLM | 0.2800 | 0.5355 | 0.0244 |
| w/o Alignment | 0.4924 | 0.8797 | 0.1051 |
| w/o Criteria | 0.6025 | 0.9525 | 0.2526 |
Table 3. Variance comparison between FACE and Deep-Research Eval across different models.

| Model | FACE Variance (s²) | Deep-Research Variance (s²) |
| Gemini Flash | 0.011022 | 1.4 × 10⁻⁶ |
| Gemini Pro | 0.000089 | 1.2 × 10⁻⁶ |
| Qwen3 | 0.001756 | 3.2 × 10⁻⁶ |
| Perplexity | 0.000289 | 6.1 × 10⁻⁶ |
| GPT | 0.002956 | 2.1 × 10⁻⁶ |
Table 4. Scenarios of the CRUD-RAG dataset.

| Task Category | CRUD Scenario | Task Description |
| Text Continuation | Create | Creative text expansion |
| 1-doc QA | Read | Fact extraction from a single document |
| 2-doc QA | Read | Information synthesis from two documents |
| 3-doc QA | Read | Information synthesis from three documents |
| Hallucination Modification | Update | Factual error correction |
| Multi-doc Summarization | Delete | Summary generation |
Table 5. Evaluation results across different tasks and retrievers. Bold indicates the best performance.

| Task Name | Retriever Name | BLEU-Avg | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | QA_avg_F1 | Length |
| QuestAnswer1Docs | Dense | 10.01 | 24.38 | 12.09 | 8.85 | 6.81 | 6.37 | 172.79 |
| | BM25 | 10.04 | 24.27 | 12.31 | 8.98 | 6.79 | 5.23 | 177.90 |
| | Hybrid | 10.63 | 25.06 | 12.90 | 9.48 | 7.25 | 5.80 | 154.38 |
| | Paged-RAG | 27.85 | 42.02 | 30.84 | 25.36 | 21.38 | 30.20 | 126.81 |
| QuestAnswer2Docs | Dense | 6.47 | 26.77 | 8.26 | 5.08 | 3.51 | 4.35 | 218.54 |
| | BM25 | 6.77 | 27.71 | 8.92 | 5.55 | 3.77 | 4.10 | 188.24 |
| | Hybrid | 6.14 | 27.32 | 8.42 | 5.04 | 3.35 | 3.32 | 181.88 |
| | Paged-RAG | 20.60 | 40.71 | 23.87 | 17.53 | 13.63 | 22.96 | 236.09 |
| QuestAnswer3Docs | Dense | 5.74 | 28.81 | 8.17 | 4.39 | 2.81 | 3.82 | 190.86 |
| | BM25 | 6.79 | 30.15 | 9.19 | 5.30 | 3.51 | 4.00 | 192.78 |
| | Hybrid | 5.56 | 28.41 | 7.70 | 4.30 | 27.86 | 3.58 | 188.03 |
| | Paged-RAG | 20.13 | 42.68 | 23.55 | 16.63 | 12.72 | 21.26 | 250.39 |
| ContinueWriting | Dense | 0.31 | 4.97 | 0.78 | 0.22 | 0.09 | 2.79 | 2157.46 |
| | BM25 | 0.34 | 5.02 | 0.85 | 0.24 | 0.11 | 2.55 | 2173.64 |
| | Hybrid | 0.34 | 5.00 | 0.83 | 0.23 | 0.10 | 2.50 | 2179.77 |
| | Paged-RAG | 1.43 | 6.51 | 2.02 | 1.09 | 0.79 | 6.67 | 2165.11 |
| HalluModified | Dense | 33.93 | 57.27 | 40.19 | 32.12 | 25.96 | 25.15 | 41.53 |
| | BM25 | 36.42 | 60.14 | 42.87 | 34.53 | 28.21 | 25.66 | 41.61 |
| | Hybrid | 35.02 | 59.08 | 41.55 | 33.10 | 27.20 | 25.78 | 38.04 |
| | Paged-RAG | 34.40 | 54.49 | 38.81 | 32.31 | 27.71 | 26.18 | 45.92 |
| Summary | Dense | 33.51 | 61.50 | 37.46 | 28.39 | 22.43 | 19.12 | 118.47 |
| | BM25 | 38.93 | 68.26 | 43.96 | 33.36 | 26.47 | 19.87 | 95.33 |
| | Hybrid | 36.02 | 64.34 | 40.40 | 30.63 | 24.34 | 19.19 | 110.35 |
| | Paged-RAG | 26.07 | 50.45 | 29.39 | 21.78 | 16.87 | 23.00 | 279.45 |
Table 6. Time efficiency results across different tasks and retrievers.

| Task Name | Retriever Name | Retrieval Time (s) | Generation Time (s) | Evaluation Time (s) |
| QuestAnswer1Docs | Dense | 26.91 | 555.07 | 541.75 |
| | BM25 | 5.38 | 570.89 | 558.16 |
| | Hybrid | 29.97 | 563.49 | 565.63 |
| | Paged-RAG | 53.42 | 1503.80 | 540.53 |
| QuestAnswer2Docs | Dense | 25.67 | 552.29 | 670.01 |
| | BM25 | 5.12 | 567.90 | 643.28 |
| | Hybrid | 29.16 | 560.71 | 659.19 |
| | Paged-RAG | 54.02 | 1492.57 | 611.02 |
| QuestAnswer3Docs | Dense | 28.17 | 554.50 | 662.92 |
| | BM25 | 5.47 | 568.00 | 647.95 |
| | Hybrid | 30.51 | 560.55 | 685.98 |
| | Paged-RAG | 53.88 | 1472.59 | 643.30 |
| ContinueWriting | Dense | 36.87 | 621.96 | 1304.58 |
| | BM25 | 8.04 | 647.86 | 1306.90 |
| | Hybrid | 39.66 | 637.29 | 1310.53 |
| | Paged-RAG | 175.81 | 1818.00 | 1290.04 |
| HalluModified | Dense | 35.54 | 615.14 | 761.07 |
| | BM25 | 6.90 | 618.20 | 732.38 |
| | Hybrid | 37.70 | 612.33 | 748.73 |
| | Paged-RAG | 88.92 | 1767.08 | 748.09 |
| Summary | Dense | 34.00 | 482.74 | 1008.73 |
| | BM25 | 6.93 | 496.00 | 1001.44 |
| | Hybrid | 36.66 | 497.39 | 1003.39 |
| | Paged-RAG | 78.31 | 1585.04 | 977.39 |

Share and Cite

MDPI and ACS Style

Tuohetiyaer, Y.; Zhu, Y.; Hu, Y.; Lu, S.; Wang, Z. Deep-Research Eval: An Automated Framework for Assessing Quality and Reliability in Long-Form Reports. Appl. Sci. 2026, 16, 2546. https://doi.org/10.3390/app16052546
