Development and Multireader Evaluation of Radiological RAG-System

Erizhokov, Rustam A.; Gordeev, Alexander E.; Sakharova, Polina A.; Yafarova, Adel A.; Varyukhina, Maria D.; Blokhin, Ivan A.; Omelyanskaya, Olga V.; Vladzymyrskyy, Anton V.; Vasilev, Yuriy A.

doi:10.3390/data11060143


by Rustam A. Erizhokov ¹, Alexander E. Gordeev ¹,², Polina A. Sakharova ¹,³, Adel A. Yafarova ¹, Maria D. Varyukhina ¹, Ivan A. Blokhin ¹,*, Olga V. Omelyanskaya ¹,³, Anton V. Vladzymyrskyy ¹,⁴ and Yuriy A. Vasilev

¹ Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies, Moscow Health Care Department, Petrovka Street, 24, Building 1, 127051 Moscow, Russia

² Faculty of Computer Science, National Research University Higher School of Economics, 11 Pokrovsky Bulvar, 109028 Moscow, Russia

³ MIREA—Russian Technological University, 78 Vernadsky Avenue, 119454 Moscow, Russia

⁴ Department of Information Technologies and Medical Data Processing, I.M. Sechenov First Moscow State Medical University (Sechenov University), Trubetskaya Street, 8, Building 2, 119991 Moscow, Russia

* Author to whom correspondence should be addressed.

Data 2026, 11(6), 143; https://doi.org/10.3390/data11060143

Submission received: 8 May 2026 / Revised: 9 June 2026 / Accepted: 10 June 2026 / Published: 12 June 2026

(This article belongs to the Special Issue Natural Language Processing in the Era of Big Data)

Abstract

Large language models (LLMs) are increasingly being used in radiology-related workflows, but their application to reference, regulatory, and methodological queries remains limited by hallucinations and the static nature of model knowledge. This study aimed to develop and evaluate a retrieval-augmented generation (RAG) system for radiologists designed to provide grounded responses to such queries. A knowledge base was created through a survey of practicing radiologists and expert validation of sources, resulting in a corpus of 1049 documents. The system incorporated structured document parsing, a two-level parent–child vector database, hybrid dense–sparse retrieval, reranking, and a local large language model. Performance was assessed through functional testing, automated LLM-as-a-judge metrics, and multireader expert evaluation by 16 radiologists using 400 technical queries. No hallucinations were detected in the 77-query functional testing set during expert review. On the full technical dataset, automated Contextual Precision, Contextual Recall, and Answer Relevancy were 0.735, 0.881, and 0.890, respectively. Expert evaluation showed high response accuracy (mean, 4.53/5) and high expert-assessed Contextual Precision (0.886). Inter-expert agreement was substantial to excellent for most Likert-scale criteria. These findings suggest that a hierarchical RAG architecture can provide reliable access to radiology-specific reference information, although external validation and automated updating of the knowledge base remain necessary.

Keywords: retrieval-augmented generation; radiology; large language models; hybrid retrieval; LLM-as-a-judge; expert evaluation; knowledge base

1. Introduction

Historically, artificial intelligence (AI) in radiology has primarily evolved around computer vision and the automated analysis of imaging studies [1,2,3]. At the same time, large language models (LLMs) are increasingly being used in radiologists’ workflows for extracting, structuring, and summarizing information from radiology reports and other medical documents [4,5,6]. However, the widespread use of LLMs in medicine is limited by at least two fundamental problems: (1) their tendency to generate plausible but factually incorrect statements (‘hallucinations’); (2) the static nature of model knowledge, which may result in answers that are inconsistent with current clinical standards and protocols [7,8,9,10].

One of the most practical ways to improve factual accuracy, reduce hallucinations, and maintain up-to-date knowledge without retraining is Retrieval-Augmented Generation (RAG) [11]. RAG systems generate responses to user queries by grounding them in a controlled and domain-specific knowledge base [11,12]. The currency of such systems depends on regular updating, expansion, and reindexing of the knowledge base [13]. In some advanced RAG systems, context is retrieved from online sources at query time, enabling real-time access to up-to-date web-based information [13,14,15].

The practical importance of organizational, methodological, and regulatory tasks is indirectly supported by the Vancouver Workload Utilization Evaluation Study, which showed that non-interpretive tasks consumed more working time than image reading itself [16]. Many of these tasks involve protocol templates, review of the appropriateness of imaging requests, study selection, and compliance with safety requirements [17]. These activities often require consultation of regulatory documents, disease-specific clinical guidelines, local policies, and related reference materials. This body of information is regularly updated and is often distributed across heterogeneous sources with varying levels of detail and occasional inconsistencies. As a result, a substantial proportion of working time is spent locating relevant information, reconciling content across documents, and formulating a coherent response to a specific query. This creates a need for specialized RAG systems tailored to radiologists’ workflows, with the potential to reduce the time spent searching for information and the need for subsequent verification by grounding responses in up-to-date sources.

The novelty of this study lies in positioning RAG as a controlled reference system for practicing radiologists, focused on reference, regulatory, and methodological queries arising in routine work and grounded in a curated knowledge base that supports non-interpretive and documentation-related tasks surrounding radiological practice. The system is designed to help radiologists retrieve, synthesize, and cite information from validated reference, regulatory, methodological, and institutional documents during routine non-interpretive tasks, such as checking protocol requirements, verifying regulatory or safety-related information, consulting structured reporting templates, and supporting documentation. The aim was to develop and evaluate this system using a reproducible protocol that combined automated metrics with a multireader expert assessment. To achieve this aim, we constructed a radiologist-informed knowledge base and test query set, iteratively optimized the RAG pipeline, evaluated generated responses and retrieved context, assessed inter-expert agreement, and compared automated and expert evaluation approaches.

2. Materials and Methods

The study was approved by the Independent Ethics Committee, protocol No. 06/2025, dated 19 June 2025. This study was conducted in accordance with the MAIC checklist [18]. The study design is presented in Figure 1.

2.1. Survey of Practicing Radiologists

The system development was based on a targeted survey of practicing radiologists (n = 25, each with more than 5 years of experience), conducted in several iterations. The survey included seven topics and relevant knowledge sources. The thematic areas were as follows:

Safety in diagnostic radiology;
Medical physics and technical aspects of imaging methods;
Regulatory framework;
Radiological anatomy (normal and pathologic);
Reference values and grading systems;
Radiological glossaries and terminology standards;
Templates and structured reports.

2.2. Source Validation

During knowledge base development, two rounds of source validation were conducted by a focus group to achieve consensus. In the first round, experts selected the sources considered most relevant and most frequently used in clinical radiology practice within each thematic area. In the second round, sources containing excessive illustrative material (>70% images) were excluded, and duplicate or outdated materials were removed. A source was treated as outdated when experts found during this comparison that it (e.g., a law) had been superseded by a more recent version. The final corpus comprised 1049 documents after expert validation and filtering. The corpus was classified by source category and language. By source category, the corpus included web-based sources (892 documents, 85.0%), methodological recommendations (127 documents, 12.1%), textbooks and manuals (13 documents, 1.2%), journal articles (11 documents, 1.1%), and regulatory legal acts (6 documents, 0.6%). By language, the corpus included 871 English-language documents (83.0%) and 178 Russian-language documents (17.0%).

2.3. Dataset Preparation

Two independent datasets were created from the validated sources using different preparation procedures.

The functional testing set consisted of 77 queries divided into topics (11 queries per topic). The functional testing set was created to test the operation of all RAG system components and ensure the absence of hallucinations. For functional testing, hallucination was operationally defined as any substantive statement in the generated response that was not supported by the retrieved context or reference source, contradicted the source document, or provided a factual answer when the available context did not contain sufficient information to answer the query. Experts reviewed generated responses against the reference answers and source documents. The presence or absence of hallucinations was assessed as a binary criterion. Explicit responses indicating insufficient information in the available context were not considered hallucinations.

Each record included the topic, query, expert’s reference response, and source link. A query example from the functional testing dataset is provided in Appendix A.1.

The technical testing set consisted of 400 queries, each tied to a specific source, and was designed to comprehensively assess the quality of the responses and the retrieved context. For each source, experts compiled queries at three levels of complexity:

Level 0—low complexity level. One or two sentences from a single document are sufficient to answer the query.

Level 1—moderate complexity level. Information from one section or several paragraphs of a single document is sufficient to answer this query.

Level 2—high complexity level. Answering the query requires information from several sections or the entire document, as well as consideration of complex relationships between sections.

The final distribution of queries by complexity levels 0, 1, and 2 was 0.5/0.25/0.25, respectively.

The structure of this dataset included the query category by complexity level, query formulation, reference response, name of the source document by which the expert formulated the query, the specific section containing the answer to the query, and a link to the source document. The minimum sample size was calculated using Cochran’s formula for an infinite population [19,20] (Formula (1)):

\[
n = \frac{Z^{2} \cdot p \cdot (1 - p)}{E^{2}} \tag{1}
\]

where Z = 1.96 (95% confidence level), p = 0.5 (expected proportion), and E = 0.05 (margin of error).

According to the calculation, the minimum sample size is 385 cases. To provide additional reliability and a more even distribution across the three complexity levels, 400 queries were generated.
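The arithmetic behind this figure can be checked with a few lines of Python. This is a minimal sketch of Formula (1) with the parameter values stated above; the function names are illustrative and not part of the study's code.

```python
import math


def cochran_sample_size(z: float = 1.96, p: float = 0.5, e: float = 0.05) -> int:
    """Minimum sample size for an infinite population (Formula (1)), rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)


def split_by_share(total: int, shares: tuple) -> list:
    """Number of queries per complexity level for the given shares."""
    return [round(total * share) for share in shares]


# 1.96^2 * 0.25 / 0.05^2 = 384.16, which rounds up to 385.
print(cochran_sample_size())
# Shares 0.5/0.25/0.25 of 400 queries for Levels 0, 1, and 2.
print(split_by_share(400, (0.5, 0.25, 0.25)))
```

With the stated inputs, the unrounded value is 384.16, so rounding up gives 385, and the 0.5/0.25/0.25 split of 400 queries corresponds to 200, 100, and 100 queries.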

As described in Section 2.1, the thematic topics for query development were derived from a survey of practicing radiologists (n = 25, each with more than 5 years of experience). The experts identified frequent reference, regulatory, and methodological tasks encountered in routine workflow. For Levels 0 and 1, queries were formulated as free-text questions by participating radiologists using natural professional phrasing and reflecting the information needs expressed in the survey. For Level 2, high-complexity queries were more structured, particularly for the regulatory framework topic, because legal and regulatory documents often have a hierarchical and clause-dependent structure.

A test case example from the technical testing dataset is provided in Appendix A.2.

2.4. Document Parsing

Document parsing is a fundamental development stage, as it enables the transformation of heterogeneous sources of medical information, such as web pages, Portable Document Format (PDF) files, and scanned documents, into a structured text format suitable for further processing and indexing. Effective parsing minimizes loss of semantic integrity, reduces data noise, and improves search accuracy and the quality of system responses. Without high-quality parsing, RAG systems are faced with context fragmentation, which reduces the relevance of the retrieved information.

Text from web pages was extracted using the BeautifulSoup library (https://www.crummy.com/software/BeautifulSoup; accessed on 28 May 2026), which parses HyperText Markup Language (HTML) while preserving the document hierarchy and removing irrelevant elements. PDF documents were processed using the Docling library, which converts documents into structured Markdown (MD) while preserving the hierarchy of headings, tables, and lists. EasyOCR (https://github.com/JaidedAI/EasyOCR; accessed on 28 May 2026) was used as the optical character recognition (OCR) component within the parsing pipeline, providing high-accuracy text recognition in both Cyrillic and Latin scripts [21,22].

Low-quality documents (e.g., scans and photographs) were processed using DeepSeek-OCR (https://huggingface.co/deepseek-ai/DeepSeek-OCR; accessed on 28 May 2026), an LLM with visual input, and then converted to MD. In the final corpus, 24 of 1049 documents (2.3%) required OCR-based processing. Of these, 21 documents were processed using EasyOCR, and three documents were processed using DeepSeek-OCR because of poor scan quality or complex visual structure. In addition to the final corpus, six Russian-language textbooks were initially processed but subsequently excluded because parsing resulted in numerous orthographic errors and insufficient preservation of document structure. All retained parsed documents were manually reviewed before indexing to verify preservation of headings, tables, lists, and overall readability of the extracted text. Before indexing, irrelevant elements, such as tables of contents, author information, and references, were removed from the MD files. Extra spaces and end-of-line hyphenation were also removed. Preserving the original logical structure, including the hierarchy of headings, tables, and lists, was a critical step because it enabled subsequent text chunking without compromising the semantic integrity of medical and regulatory documents.
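The whitespace and hyphenation cleanup mentioned above can be sketched with regular expressions. This is an illustrative assumption about how such a step might look, not the pipeline's actual code; the function name and the exact rules are hypothetical.

```python
import re


def clean_markdown_text(text: str) -> str:
    """Remove end-of-line hyphenation and redundant whitespace from parsed text."""
    # Rejoin words split across a line break: "radio-\nlogy" -> "radiology".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces left at the end of a line.
    text = re.sub(r" +\n", "\n", text)
    # Keep at most one blank line between blocks.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


print(clean_markdown_text("radio-\nlogy  report \n\n\n\nnext section"))
```

Two limitations of this sketch are worth keeping in mind: the hyphenation rule also joins genuine compounds that happen to break at a line end, and collapsing spaces would flatten indentation-dependent Markdown such as nested lists, so a production version would need to protect those cases.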

2.5. Vector Database (VDB) Design—Chunking and Indexing

After converting all documents into a structured MD format, we iteratively developed a text chunking strategy and constructed a vector index. The first two cycles tested fixed-size chunking and semantic chunking. Both approaches showed significant deficiencies on complex, highly structured medical texts: fixed-size chunking fragmented semantic blocks, and semantic chunking caused an uncontrolled loss of hierarchy, which reduced search accuracy.

For the final implementation, we adopted a two-level parent–child index, which combines precise retrieval over small fragments with subsequent reconstruction of broader context at the level of larger document blocks [23].

At the parent level, the document is segmented by the MD structure (headings, subsections, tables, and lists) using the Recursive Chunker. The maximum parent chunk size is 1800 tokens (approximately 8000–10,000 characters). These chunks are used for the final context of the RAG query, which fully preserves the logical integrity of medical documents.

At the child level, each parent chunk is split into short fragments of no more than 512 tokens and no fewer than 900 characters. This was achieved using the Late Chunker approach: the full parent chunk was first processed by a long-context embedding model, after which contextualized vector representations of individual child chunks were generated from the resulting token embeddings. This approach, proposed in [24], takes into account both the structural MD boundaries and the document semantics, preserving the connections between adjacent fragments.
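The bookkeeping of a two-level index can be illustrated with a simplified sketch. It splits a Markdown document into parent sections at headings and cuts each parent into children that keep a link to their parent. It is not the Recursive Chunker or Late Chunker used in the study: whitespace-separated words stand in for tokens, no embeddings are computed, the 1800-token parent limit is not enforced, and all names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None marks a parent chunk


def split_parents(markdown: str) -> list:
    """Start a new parent section at every Markdown heading."""
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [section.strip() for section in sections if section.strip()]


def split_children(parent_text: str, max_words: int) -> list:
    """Cut a parent into consecutive pieces of at most max_words words."""
    words = parent_text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(markdown: str, max_child_words: int = 40):
    """Return (parents, children); each child stores its parent's identifier."""
    parents, children = [], []
    for p_idx, section in enumerate(split_parents(markdown)):
        parent = Chunk(chunk_id=f"p{p_idx}", text=section)
        parents.append(parent)
        for c_idx, piece in enumerate(split_children(section, max_child_words)):
            children.append(Chunk(f"p{p_idx}-c{c_idx}", piece, parent_id=parent.chunk_id))
    return parents, children


parents, children = build_index("# A\nalpha beta gamma\n## B\ndelta epsilon", max_child_words=3)
print([(c.chunk_id, c.parent_id) for c in children])
```

The point of the structure is that search runs over the small children, while the `parent_id` link lets the pipeline recover the complete section for the final context.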

For each child chunk, two types of vectors were generated for hybrid search: (1) dense vectors, using the FRIDA multilingual embedding model, which performed well on the Massive Text Embedding Benchmark (MTEB) Leaderboard [25]; and (2) sparse vectors, calculated using the Best Matching 25 algorithm (BM25) from the fastembed library with the standard parameters k1 = 1.2 and b = 0.75 [26]. The resulting hierarchical structures were indexed into the Qdrant VDB. The source text and metadata with the parent chunk identifier were saved for each child chunk. Parent chunks stored the source text and metadata (source information, topic). The chunking strategy is shown schematically in Figure 2.
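To show what the parameters k1 and b control, the following sketch scores documents with the textbook Okapi BM25 formula. It is an assumption-laden illustration: tokenization is a plain lowercase split, the IDF uses the common smoothed variant, and nothing here reproduces the internals of fastembed or Qdrant.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list, k1: float = 1.2, b: float = 0.75) -> list:
    """Score each document against the query with textbook Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # k1 caps term-frequency saturation; b scales length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["effective dose limits for staff", "contrast agent safety", "dose"]
print(bm25_scores("dose limits", docs))
```

Exact lexical matching of this kind complements dense embeddings for queries that hinge on specific terms, such as document numbers or standardized terminology.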

As a result, a two-level VDB was created that fully preserves the structure of the original medical sources and is optimized for downstream retrieval and generation. During debugging, this two-level approach provided both high semantic retrieval accuracy and contextual completeness, which were critical for the intended intelligent reference system for radiologists.

2.6. System Architecture and RAG Query Construction

The developed system is a RAG pipeline with a two-level VDB and multi-stage context selection for final response generation. The architecture comprises several main stages shown in Figure 3: (1) receiving a user query, (2) hybrid semantic retrieval, (3) selecting context candidates, and (4) constructing the final prompt, generating a response, and returning it to the user.

The workflow begins when a user submits a text query. The query is transformed into dense and sparse vector representations using the FRIDA embedding model and BM25, respectively. Next, the query is passed to the Qdrant-based VDB module, where hybrid retrieval is performed exclusively at the child-index level. Retrieval is conducted in parallel across 7 thematic collections. The results are combined using Reciprocal Rank Fusion (RRF) with equal weights, after which the 90 most relevant child chunks are selected (15 from each collection). This number was selected empirically as an optimal trade-off between coverage completeness and computational efficiency: a smaller number reduced contextual completeness, whereas a larger number increased noise and response time [27]. The hybrid approach is critically important for specialized reference systems, where both semantic proximity and exact terminological matches are required.
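Reciprocal Rank Fusion can be sketched in a few lines. The example below fuses a dense and a sparse ranking with equal weights; the constant k = 60 is the value commonly used in the RRF literature and is an assumption here, since the text does not state which constant the system uses. The chunk identifiers are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60, weights=None) -> list:
    """Fuse ranked lists of chunk ids; returns (chunk_id, score) sorted by score."""
    weights = weights or [1.0] * len(rankings)  # equal weights by default
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes more for items it ranks higher.
            scores[chunk_id] += weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["c1", "c2", "c3"]   # hypothetical dense-retrieval order
sparse = ["c2", "c4", "c1"]  # hypothetical BM25 order
for chunk_id, score in reciprocal_rank_fusion([dense, sparse]):
    print(chunk_id, round(score, 5))
```

Because RRF uses only ranks, it avoids having to calibrate dense similarity scores against BM25 scores, which live on different scales.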

At the next stage, the 13 highest-ranked child chunks across all selected candidates are chosen according to the RRF score. The corresponding parent chunks are then retrieved using the parent identifier. As a result, a set of up to 13 parent chunks—logically complete structural units of the source document, such as sections, subsections, or tables—is formed.

To select the most relevant parent chunks, we used the Qwen3-Reranker-0.6B cross-encoder model, which scores query–parent chunk pairs according to semantic similarity. After reranking, the top five parent chunks are selected and passed, together with their metadata, to the generation module. The choice of the final five parent chunks was driven by the LLM context window limit (≤130 k tokens), empirical testing, and previous RAG studies suggesting that final context sets should be restricted to balance evidence coverage, redundancy, and inference efficiency [28,29,30]. Published RAG pipelines have similarly used a limited number of final context fragments or larger initial candidate pools followed by reranking and pruning to a smaller final set [28,29]. Therefore, in our pipeline, the final selection of five parent chunks was used as a practical compromise between contextual completeness, redundancy reduction, and computational constraints.
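The two selection steps described above, expanding the 13 best child chunks to their parents and then keeping the five best parents, can be sketched as follows. The scoring function is a toy word-overlap stand-in for the cross-encoder, and the dictionary keys and function names are hypothetical.

```python
def expand_to_parents(ranked_children: list, top_children: int = 13) -> list:
    """Map the highest-ranked children to parent ids, deduplicated in rank order."""
    parent_ids = []
    for child in ranked_children[:top_children]:
        if child["parent_id"] not in parent_ids:
            parent_ids.append(child["parent_id"])
    return parent_ids


def rerank_parents(query, parent_ids, parent_texts, score_fn, top_k: int = 5) -> list:
    """Score every (query, parent text) pair and keep the top_k parent ids."""
    scored = [(pid, score_fn(query, parent_texts[pid])) for pid in parent_ids]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in scored[:top_k]]


def word_overlap(query: str, text: str) -> float:
    """Toy relevance score: share of query words that occur in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / len(query_words)


children = [
    {"id": "c1", "parent_id": "p1"},
    {"id": "c2", "parent_id": "p2"},
    {"id": "c3", "parent_id": "p1"},  # shares a parent with c1
]
texts = {"p1": "mri safety zones", "p2": "ct dose reference levels"}
candidates = expand_to_parents(children)
print(candidates, rerank_parents("ct dose", candidates, texts, word_overlap, top_k=1))
```

Several children can point to the same parent, which is why the expansion step yields up to 13 parent chunks rather than exactly 13.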

The generation module uses the local Qwen3-30B-A3B large language model with fixed parameters (temperature = 0.1, top_p = 0.9). The full prompt includes (1) strict system instructions, (2) the context selected from the top five parent chunks, and (3) the original user query. If the available context is insufficient, the system returns an explicit message stating that a justified response cannot be generated because of insufficient information in the retrieved context. The system prompt template used for response generation is shown in Box 1.

Box 1. Prompt template used for response generation.

“You are an expert assistant in medicine and radiology, specializing in the analysis of medical documents, scientific research, web resources, and regulations.

Your answers are intended for medical professionals such as radiologists, so use professional language, precise terminology, and focus on evidence-based information.

Always minimize risks by avoiding speculations or medical recommendations outside the provided context.

Response instructions:

1. Response solely based on the provided context from the knowledge base (retrieved documents). Do not add external information or assumptions.

2. Be as precise and concise as possible: structure your response with clear sections.

3. Cite sources from the context (specify sections or document identifiers).

4. If the context does not have sufficient information to provide a full response, state it explicitly and suggest clarifying the query (e.g., adding key terms or additional details).

5. Use the chain-of-thought approach for complex queries: first, analyze the context step by step, then provide a conclusion.

User query:

{query}

Context from the knowledge base (use only this information):

{context}

Your response:

”

The “chain-of-thought” instruction in the prompt was used to encourage systematic inspection of the retrieved context before formulating the final response. It was not intended to expose a separate reasoning trace to the end user or to encourage unsupported inference beyond the retrieved documents [31].
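Filling the `{query}` and `{context}` placeholders of such a template is straightforward. The sketch below uses an abbreviated template rather than the full text of Box 1, and the way parent chunks are numbered and labeled with their source is an assumption made for illustration.

```python
# Abbreviated template; the full instructions are given in Box 1.
PROMPT_TEMPLATE = (
    "{system_instructions}\n\n"
    "User query:\n{query}\n\n"
    "Context from the knowledge base (use only this information):\n{context}\n\n"
    "Your response:\n"
)


def format_context(parents: list) -> str:
    """Number each parent chunk and prefix it with its source (hypothetical layout)."""
    blocks = [
        f"[{i}] Source: {parent['source']}\n{parent['text']}"
        for i, parent in enumerate(parents, start=1)
    ]
    return "\n\n".join(blocks)


def build_prompt(query: str, parents: list, system_instructions: str) -> str:
    """Insert the instructions, user query, and selected context into the template."""
    return PROMPT_TEMPLATE.format(
        system_instructions=system_instructions,
        query=query,
        context=format_context(parents),
    )


print(build_prompt(
    "What is X?",
    [{"source": "doc.pdf", "text": "X is Y."}],
    "You are an expert assistant in medicine and radiology.",
))
```

Labeling each fragment with a source identifier gives the model something concrete to cite, in line with instruction 3 of the prompt.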

The proposed architecture differs from classic RAG systems in its parent–child hierarchy, multi-stage selection of candidates, and strict adherence to the structure of medical documents. Alternatives such as single-level search or fixed-size chunks without hierarchy were rejected, as they led to significant text fragmentation and lower automated metric scores.

2.7. System Integration

To facilitate practical deployment, the RAG pipeline was implemented as a Representational State Transfer Application Programming Interface (REST API) using FastAPI (https://fastapi.tiangolo.com; accessed on 28 May 2026). The API provides two unified endpoints: prompt, which receives a text query and returns an optimized RAG prompt for the LLM, and generate, which receives a text query and generates a response using the selected Ollama-based model. At the API level, responses are returned in JSON format and include the required fields query, prompt, and ranked_parents, where ranked_parents contains the selected parent chunks and their metadata. In the intended user-facing scenario, radiologists see only the final synthesized answer with source identifiers or citations. Retrieved parent chunks are used internally by the system and are returned as metadata for traceability, audit, debugging, and expert evaluation, but they are not intended for full manual review by the user during routine use. The system was deployed within the secure internal infrastructure of the medical facility using Docker containerization to ensure compliance with Russian Federation regulatory requirements.
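The shape of the JSON payload can be illustrated with standard-library dataclasses. Only the top-level fields query, prompt, and ranked_parents come from the description above; the fields inside each ranked parent, the class names, and the helper function are hypothetical, and the sketch does not use FastAPI.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RankedParent:
    parent_id: str  # hypothetical field name
    text: str       # hypothetical field name
    metadata: dict = field(default_factory=dict)


@dataclass
class PromptResponse:
    query: str
    prompt: str
    ranked_parents: list


def to_json(response: PromptResponse) -> str:
    """Serialize the response, including nested parent chunks, to JSON."""
    return json.dumps(asdict(response), ensure_ascii=False, indent=2)


example = PromptResponse(
    query="example query",
    prompt="example prompt",
    ranked_parents=[RankedParent("p1", "parent chunk text", {"source": "doc.pdf"})],
)
print(to_json(example))
```

Setting `ensure_ascii=False` keeps Cyrillic text readable in the output, which matters for a corpus that includes Russian-language documents.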

2.8. Automatic Evaluation

To evaluate system performance, including both context retrieval and response generation, we used the DeepEval library module designed for evaluating RAG systems [32]. This library provides a set of predefined metrics that allow the retrieval and generation components to be assessed separately. The metrics are based on binary judgments produced by an LLM-as-a-judge model, which evaluates the correctness of statements with respect to the retrieved context, generated response, reference response, and original query.

We used Contextual Precision, Contextual Recall, and Contextual Relevancy to assess retrieval quality and Answer Relevancy to assess the quality of the generated response. The evaluation model was Mistral-Small3.2:24b.

Contextual Precision compares the retrieved context with the original query and evaluates the extent to which the retrieved context fragments are relevant to the query. The highest score (1) indicates that relevant fragments are ranked higher than irrelevant ones. The lowest score (0) indicates that none of the retrieved context fragments are relevant. The calculation of this metric requires the query, retrieved context, LLM response, and reference response.

Contextual Recall evaluates whether the key elements required to produce the reference response are present in the retrieved context. The highest score (1) is assigned if at least one retrieved context fragment contains information sufficient to derive the reference response. The lowest score (0) is assigned if the retrieved context does not contain the information necessary to answer the query correctly. The calculation of this metric requires the query, retrieved context, LLM response, and reference response.

Contextual Relevancy represents the proportion of retrieved context statements that are relevant to the query. The score decreases as the amount of irrelevant information in the retrieved context increases. The lowest score (0) indicates that the retrieved context contains no information relevant to the query.

Answer Relevancy evaluates the relevance of the generated response to the query. It is conceptually similar to Contextual Relevancy but applies to the final generated answer rather than to the retrieved context. In practice, the score is typically 0 when the system produces responses such as “Based on the provided documents, it is impossible to answer the query…”.
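Once the judge model has produced binary verdicts, the metrics reduce to simple arithmetic. The sketch below assumes the rank-weighted formulation of Contextual Precision given in the DeepEval documentation and treats the other three metrics as plain proportions of positive verdicts; the LLM judge itself is not modeled, and the function names are illustrative.

```python
def contextual_precision(verdicts: list) -> float:
    """Rank-weighted precision; verdicts[i] is True if the fragment at rank i+1 is relevant."""
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0  # no relevant fragment was retrieved
    hits, accumulated = 0, 0.0
    for rank, relevant in enumerate(verdicts, start=1):
        if relevant:
            hits += 1
            accumulated += hits / rank  # precision at this rank
    return accumulated / relevant_total


def proportion(verdicts: list) -> float:
    """Share of positive verdicts, as used for recall- and relevancy-type metrics."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Relevant fragments ranked first give the maximum score.
print(contextual_precision([True, True, False, False, False]))
# The same single relevant fragment ranked second is penalized.
print(contextual_precision([False, True, False, False, False]))
# Three of four statements judged relevant.
print(proportion([True, False, True, True]))
```

The first two calls differ only in ordering, which shows that Contextual Precision rewards placing relevant fragments ahead of irrelevant ones rather than merely retrieving them.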

To estimate the contribution of retrieval augmentation, we additionally compared the full RAG system with the same underlying LLM used without retrieved context. The LLM-only baseline received the same user queries but did not receive retrieved source fragments from the knowledge base. Response quality was assessed using the DeepEval G-Eval Correctness metric, which evaluates the factual consistency and completeness of the generated response relative to the expected response. Paired scores for the LLM-only and RAG configurations were compared using the Wilcoxon signed-rank test.

2.9. Expert Evaluation

The expert evaluation consisted of two independent components: RAG system response quality evaluation and retrieved context quality evaluation.

2.9.1. Response Quality Evaluation

Experts rated each system response according to six criteria using a five-point Likert scale:

Five points for “Completely agree”;
Four points for “Rather agree”;
Three points for “Unsure”;
Two points for “Rather disagree”;
One point for “Disagree”.

The response evaluation criteria and their definitions are presented in Table 1.

2.9.2. Context Quality Evaluation

In addition to rating system responses, experts also evaluated the quality of the retrieved context fragments used for response generation. For each query, the system provided five retrieved context fragments, which had previously been exported in MD format.

Context quality was evaluated using two criteria: Contextual Completeness and Usefulness of Context. Contextual Completeness was rated on a five-point Likert scale. Usefulness of Context was evaluated on a binary scale, where 1 indicated that the fragment is related to the query topic and contains facts or definitions that can be used in the response and 0 indicated that the fragment is related to another topic or contains background or extraneous information that is not helpful for generating a response.

The retrieved context evaluation criteria and their definitions are shown in Table 2.

2.9.3. Expert Evaluation Design

Six experts prepared a dataset of 400 test cases for technical testing. Each test case included a query, a reference response, the source document, and the specific section containing the information required to answer the query. The RAG system then generated a response and retrieved the top five context fragments for each query.

Because of the large size of the testing dataset, 10 additional experts were recruited for evaluation. Thus, 16 radiologists participated in the expert evaluation in total: the six experts involved in dataset preparation and 10 additional expert evaluators.

All 400 queries were divided into four sets of 100 queries each. The experts were also divided into four groups, with each group including radiologists from different departments of the institution. This design was intended to reduce interaction between experts and thereby minimize potential bias in the ratings.

Each expert group received one set of 100 queries, and each expert within the group evaluated all 100 queries assigned to that group. The only allocation constraint was that the author of a query was not allowed to evaluate that query. Apart from this restriction, query allocation was random.

Thus, each of the 400 test cases received four independent expert evaluations according to the methodology described above.

The design of dataset preparation and expert evaluation is shown in Figure 4.

2.9.4. Assessment of Inter-Expert Agreement

To assess inter-expert agreement, Gwet’s AC1 coefficients were calculated for criteria evaluated on the binary scale, and Gwet’s AC2 coefficients were calculated for criteria evaluated on the Likert scale. This choice was justified by the characteristics of the rating distributions, which showed a marked predominance of scores “4” and “5” on the Likert scale and score “1” on the binary scale (see Appendix B). Gwet’s AC2 also accounts for the distance between ordinal ratings through a weighted agreement matrix. Other coefficients, such as Fleiss’ kappa and Krippendorff’s alpha, either treat ordinal ratings as nominal categories or are less robust to rating imbalance, which may lead to artificially low agreement estimates [33,34,35].

The coefficients were interpreted according to the scale proposed by S. Pratt et al. [36]: ≤0.20—low agreement, 0.21–0.40—moderate agreement, 0.41–0.60—satisfactory agreement, 0.61–0.80—substantial agreement, and 0.81–1.00—excellent agreement.
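For binary ratings, Gwet's AC1 can be computed directly from its definition, which makes its behavior under rating imbalance easy to inspect. The sketch below is an unweighted AC1 for multiple raters written for illustration; the study itself used the irrCAC library, the weighted AC2 variant is not shown, and the example ratings are invented.

```python
def gwet_ac1(ratings: list, categories: tuple = (0, 1)) -> float:
    """Gwet's AC1 for items rated by two or more raters on nominal categories."""
    q = len(categories)
    agreement = []
    shares = {c: 0.0 for c in categories}
    for item in ratings:
        r = len(item)
        counts = {c: item.count(c) for c in categories}
        for c in categories:
            shares[c] += counts[c] / r
        if r >= 2:
            # Share of rater pairs that agree on this item.
            agreement.append(sum(n * (n - 1) for n in counts.values()) / (r * (r - 1)))
    pa = sum(agreement) / len(agreement)                   # observed agreement
    pi = {c: s / len(ratings) for c, s in shares.items()}  # category prevalence
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)   # chance agreement
    return (pa - pe) / (1 - pe)


def interpret(coefficient: float) -> str:
    """Map a coefficient to the agreement labels listed above."""
    if coefficient <= 0.20:
        return "low"
    if coefficient <= 0.40:
        return "moderate"
    if coefficient <= 0.60:
        return "satisfactory"
    if coefficient <= 0.80:
        return "substantial"
    return "excellent"


# Invented example: 10 fragments, 4 raters each, one dissenting rating.
ratings = [[1, 1, 1, 1]] * 9 + [[1, 1, 1, 0]]
ac1 = gwet_ac1(ratings)
print(round(ac1, 3), interpret(ac1))
```

In this heavily imbalanced example the chance-agreement term stays small (about 0.049), so the coefficient remains close to the observed agreement of 0.95, which is the property that motivated the choice of Gwet's coefficients.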

2.10. Statistical Analysis

Statistical analysis was performed using Python 3.12.3. Descriptive and inferential statistics were calculated using NumPy 1.26.4, SciPy 1.12.0, and Pingouin 0.5.5. Inter-expert agreement coefficients were calculated using the irrCAC library, version 0.4.4.

Means and standard deviations (SDs) were used as descriptive statistics. Confidence intervals (CIs) were calculated using bootstrap resampling with 2000 samples. The lower and upper bounds of the intervals corresponded to the 2.5th and 97.5th percentiles of the bootstrapped distributions, respectively.
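A percentile bootstrap of the mean with 2000 resamples can be sketched with the standard library alone. This is an illustration of the general procedure, not the study's code: the scores are invented, the seed is arbitrary, and the percentile interpolation may differ slightly from NumPy's.

```python
import random
import statistics


def bootstrap_ci(values: list, n_resamples: int = 2000, seed: int = 0) -> tuple:
    """95% percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(values, k=len(values)))  # resample with replacement
        for _ in range(n_resamples)
    ]
    # With n=40, the first and last cut points are the 2.5th and 97.5th percentiles.
    cuts = statistics.quantiles(means, n=40)
    return cuts[0], cuts[-1]


scores = [5, 5, 4, 5, 3, 4, 5, 5, 4, 5]  # invented Likert ratings
lower, upper = bootstrap_ci(scores)
print(f"mean = {statistics.fmean(scores):.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The percentile method makes no normality assumption, which suits Likert scores that cluster at the top of the scale.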

Spearman’s rank correlation coefficient was used to assess correlations between variables with non-normal distributions. The Kruskal–Wallis test was used to compare ordinal variables across three independent groups. The null hypothesis was that the distribution of scores was the same across all groups. The chi-square test was used to compare binary variables across three independent groups. The null hypothesis was that the proportions were the same across all groups. Statistical significance was set at p < 0.05.

Gwet’s AC1 and AC2 coefficients were used to assess inter-expert agreement. The null hypothesis was that inter-expert agreement did not differ from chance agreement.
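To make the coefficient concrete, the following sketch computes Gwet's AC1 for multiple raters directly from its definition: observed agreement is the average proportion of agreeing rater pairs per item, and chance agreement is the sum of π_k(1 − π_k) over categories divided by (q − 1), where π_k is the average share of ratings in category k and q is the number of categories. This is a hand-written illustration, not the irrCAC implementation used in the study; the function name and example matrix are assumptions, and the sketch assumes that every rater scored every item. AC2 additionally applies ordinal weights and is not shown.

```python
import numpy as np

def gwet_ac1(ratings, categories):
    """Gwet's AC1 for an items x raters matrix without missing ratings (illustrative)."""
    ratings = np.asarray(ratings)
    n_items, n_raters = ratings.shape
    q = len(categories)
    # counts[i, k] = number of raters who assigned category k to item i.
    counts = np.stack([(ratings == c).sum(axis=1) for c in categories], axis=1)
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    pa = ((counts * (counts - 1)).sum(axis=1) / (n_raters * (n_raters - 1))).mean()
    # Average classification probability for each category.
    pi = (counts / n_raters).mean(axis=0)
    # Chance agreement under Gwet's model.
    pe = (pi * (1 - pi)).sum() / (q - 1)
    return (pa - pe) / (1 - pe)

# Made-up binary ratings: 5 items scored by 3 raters, dominated by the score "1".
example = [[1, 1, 1],
           [1, 1, 1],
           [1, 1, 0],
           [1, 1, 1],
           [0, 0, 0]]
ac1 = gwet_ac1(example, categories=[0, 1])
print(f"AC1 = {ac1:.3f}")
```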

3. Results

3.1. Functional Testing Results

Functional testing showed that the RAG system returned an output for all user queries. When information was missing from the document database, the search failed, or contradictory judgments were detected, the system notified the user accordingly. No hallucinations were detected in the 77-query functional testing set.

3.2. Technical Testing Results

3.2.1. Automated and Expert Evaluation

For architecture debugging, 100 queries covering only two document collections were selected from the technical testing dataset. Table 3 shows how the automated metrics changed as the pipeline was refined, namely through changes to the chunking and source-ranking approaches and a reduction in the number of chunks passed to the final RAG system context.

Processing errors refer to instances where the LLM evaluating RAG system performance failed to return a score due to timeout errors. To accelerate the evaluation, we imposed a 5 min waiting limit per test case. This interval may have been insufficient, as the LLM requires time to first decompose the answer or context into individual statements and then classify each as relevant or irrelevant. When such an error occurred during the computation of a specific metric, the case was excluded from that metric’s calculation while remaining eligible for all other analyses.
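The per-metric exclusion rule can be illustrated with a short sketch. The record layout, the use of `None` to mark a timed-out evaluation, and the helper name `summarize` are assumptions made for this example; the scores are made up.

```python
from statistics import mean

# Hypothetical per-case judge outputs; None marks a timeout for that metric only.
results = [
    {"contextual_precision": 0.8, "answer_relevancy": 0.9},
    {"contextual_precision": None, "answer_relevancy": 1.0},
    {"contextual_precision": 0.6, "answer_relevancy": None},
]

def summarize(results, metric):
    """Average one metric over the cases where the judge returned a score."""
    scores = [r[metric] for r in results if r[metric] is not None]
    return {
        "mean": mean(scores),
        "n_scored": len(scores),
        "n_errors": len(results) - len(scores),
    }

# A case that timed out on one metric still contributes to the other metric.
summary = {m: summarize(results, m)
           for m in ("contextual_precision", "answer_relevancy")}
print(summary)
```

Reporting the number of excluded cases alongside each mean, as in Table 3, makes it clear how many test cases each estimate is based on.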

As shown in Table 3, iterative refinement of the RAG system architecture led to improved automated metrics and a reduction in the number of unanswered queries. Table 4 presents the automated metric values obtained for the full technical testing dataset.

An example of a user query and RAG system response is shown in Box 2.

Box 2. Example of RAG system response.

User query: What type of technical defect in AI testing is time-related, and what cutoff is used for imaging studies?

RAG response: The type of technical defect related to time in AI-based software testing is a «Type A» defect.

According to the classification presented in the guidelines for evaluating the maturity of AI technologies in healthcare (Source: AI Technology Maturity Assessment, Section 1.1):

1. Type «A» defects: The processing time for a single study by one AI-based software exceeds a specified time threshold (for imaging studies, more than 6.5 min).

Time Threshold for Diagnostic Studies (for imaging studies):

More than 6.5 min; when this threshold is exceeded, a Type «A» defect is recorded.

Additional Details:

• Analysis time is defined as the interval from the study availability for AI processing to the moment when the results are delivered to the Medical Information System (MIS).

• If the processing time exceeds the threshold, the software should abort the analysis and send the notification “ERROR A.”

• The specific proportion of unprocessed studies due to this defect is an important marker for assessing the need for software optimization or increasing computational resources.

Conclusion:

The technical defect related to time is a Type «A» defect, and the time threshold is more than 6.5 min for imaging studies.

Source document: AI Technology Maturity Assessment for Healthcare, Section 1.1, paragraph 1.

A high Contextual Precision value indicates that the system ranks relevant context fragments above less relevant ones. A high Contextual Recall value indicates that the retrieved context almost always contains the information necessary for the correct reference response. The low Contextual Relevancy value likely reflects the tendency of the RAG system to retrieve information that is only indirectly related to the query alongside directly relevant information. A high Answer Relevancy value confirms that the RAG system provides a response directly relevant to the user’s query.

The decrease in automated Contextual Relevancy relative to the debugging subset was likely related to the transition from retrieval over a limited set of collections to retrieval over the full knowledge base. In the full-scale evaluation, the retriever operated over a broader and more heterogeneous corpus and returned parent chunks that often contained the answer-bearing evidence together with additional surrounding text. Because the automated Contextual Relevancy metric penalizes retrieved context that is not directly required to answer the query, broader parent-level retrieval could lower this score even when the generated answer remained accurate and clinically useful. Thus, the decrease primarily reflects the change in retrieval conditions and context granularity rather than a direct deterioration in answer quality.

To further investigate the low automated Contextual Relevancy score, we analyzed whether this metric differed across query complexity levels. The results are presented in Table 5.

The analysis showed that Contextual Relevancy differed significantly across complexity levels; post hoc comparisons indicated that this difference was driven by lower values for Level 0 queries compared with Levels 1 and 2.

We then assessed whether low automated Contextual Relevancy was associated with lower expert-rated context completeness or response conciseness. The results are presented in Table 6.

Contextual Relevancy showed only a weak association with expert-rated Contextual Completeness and no association with Conciseness. These findings suggest that low automated Contextual Relevancy may partly reflect the retrieval of answer-bearing information together with surrounding text from the same parent chunk.

To quantify the contribution of retrieval augmentation, we compared the RAG system with the same LLM used without retrieved context on the full technical testing dataset. The results are presented in Table 7.

Based on automated Correctness scoring, the RAG configuration showed a significantly higher score than the LLM-only baseline. This suggests that access to the curated knowledge base may improve the factual alignment and completeness of responses. However, this comparison should be interpreted as a preliminary automated baseline analysis because it was not validated by the same multireader expert evaluation procedure used for the final RAG system.

Table 8 and Table 9 show the expert evaluation results for RAG system responses and retrieved context fragments.

Usefulness of Context 1–5 refers to the first to fifth ranked retrieved context fragments, respectively.

Expert-rated Usefulness of Context decreased predictably with the rank position of the retrieved context fragment. The highest-ranked retrieved context fragments received the highest expert ratings, indicating effective operation of the context ranking model (reranker). Although usefulness decreased with rank, the fifth retrieved fragment was still rated as useful in 18.0% of expert evaluations. We therefore retained the top-five parent-chunk configuration to improve recall and reduce the risk of missing supporting evidence. As these fragments are used by the generation module rather than manually reviewed by users, the lower usefulness of the fifth fragment reflects a recall–noise trade-off rather than an additional reading burden.

The mean Contextual Precision value was 0.886 (0.874, 0.898), while the value of this metric calculated automatically with the DeepEval framework was 0.735 (0.698, 0.774). This discrepancy may reflect limitations of the evaluation model or prompt configuration used in the DeepEval framework.

To assess whether automated evaluation could replace expert evaluation, we calculated the correlation between Contextual Precision obtained by DeepEval and Contextual Precision obtained in the expert evaluation using the same formula. The correlation coefficient was 0.329 (p < 0.001), which corresponds to a weak positive correlation.
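For reference, a rank-weighted Contextual Precision can be computed from binary usefulness ratings as sketched below. The sketch assumes the weighted cumulative precision form documented for the DeepEval Contextual Precision metric, in which the precision at each rank k is accumulated only at ranks holding a relevant fragment and the sum is divided by the number of relevant fragments. The function name and example ratings are hypothetical, and how ratings from several experts were aggregated per query is not specified here.

```python
def contextual_precision(relevance):
    """Rank-weighted precision for a list of 0/1 relevance flags in rank order."""
    n_relevant = sum(relevance)
    if n_relevant == 0:
        return 0.0  # assumption: a query with no relevant fragment scores zero
    score, hits = 0.0, 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / k  # precision at rank k, counted only at relevant ranks
    return score / n_relevant

# Made-up ratings for the five ranked fragments of two queries.
well_ranked = contextual_precision([1, 1, 0, 1, 0])    # relevant fragments near the top
poorly_ranked = contextual_precision([0, 0, 1, 0, 0])  # only the third fragment is relevant
print(round(well_ranked, 3), round(poorly_ranked, 3))
```

Because the score depends on rank order, two queries with the same number of useful fragments can receive different values, which is what allows the metric to reflect reranker quality.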

3.2.2. Expert Assessment by Query Complexity

To examine the influence of the query category on the quality of response and context, measured on the Likert scale, the Kruskal–Wallis test was conducted. The test results for criteria assessed on the Likert scale are presented in Table 10.

As shown in Table 10, expert ratings differed significantly by query complexity only for the Conciseness criterion (p = 0.004). No statistically significant differences were observed for the other Likert-scale criteria. To further characterize this finding, Conciseness scores were summarized by query complexity level. The results are presented in Table 11.

Post hoc pairwise comparisons were then performed using Mann–Whitney U tests with Bonferroni correction. Significant differences were observed between Levels 0 and 2 (adjusted p = 0.020) and between Levels 1 and 2 (adjusted p = 0.001), whereas no statistically significant difference was found between Levels 0 and 1 (adjusted p = 1.000). These findings indicate that responses to low- and moderate-complexity queries received higher Conciseness ratings than responses to high-complexity queries, which may reflect the greater amount of information required to answer high-complexity queries.
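A minimal sketch of such a post hoc procedure is shown below, assuming that the Bonferroni adjustment multiplies each raw p-value by the number of pairwise comparisons and caps the result at 1, which is consistent with the adjusted p = 1.000 reported above. The ratings are made up and the group labels are used only for illustration.

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# Made-up Conciseness ratings per query complexity level (illustration only).
groups = {
    "Level 0": [5, 5, 4, 5, 4, 5, 5, 4],
    "Level 1": [5, 4, 5, 5, 5, 4, 5, 5],
    "Level 2": [4, 3, 4, 3, 4, 4, 5, 3],
}

pairs = list(combinations(groups, 2))  # three pairwise comparisons
adjusted = {}
for a, b in pairs:
    _, p_raw = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    # Bonferroni: scale by the number of comparisons and cap at 1.
    adjusted[(a, b)] = min(1.0, p_raw * len(pairs))

for (a, b), p_adj in adjusted.items():
    print(f"{a} vs {b}: adjusted p = {p_adj:.3f}")
```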

Figure 5 shows mean Likert-scale expert ratings by query complexity.

To examine the influence of query complexity on the distribution of ratings for the Usefulness of Context criterion, the chi-square test was conducted. The test results are presented in Table 12.

A statistically significant difference by query complexity was observed only for Usefulness of Context 1 (p = 0.047). No significant differences were found for Usefulness of Context 2–5. Figure 6 shows the proportion of relevant context fragments by query complexity.

3.2.3. Consistency of Expert Assessment

Table 13 presents Gwet’s AC2 coefficients for criteria assessed on the Likert scale.

According to Table 13, the obtained coefficients indicate excellent agreement for most response quality criteria, except for Conciseness, for which agreement was substantial. Table 14 presents Gwet’s AC1 coefficients for criteria assessed on the binary scale.

As shown in Table 14, inter-expert agreement for context quality ranged from moderate to excellent. Excellent agreement was observed for Usefulness of Context 1, whereas the lowest agreement was observed for Usefulness of Context 3.

4. Discussion

This study demonstrated that a hierarchical RAG architecture can be adapted to reference, regulatory, and methodological queries in radiology and evaluated using both automated and expert-based methods. Iterative refinement of the pipeline was associated with improved automated retrieval and response metrics and a reduction in the number of queries for which the system could not generate a substantive answer. Expert evaluation also showed high response quality across most criteria, while inter-expert agreement was substantial to excellent for most Likert-scale measures. These findings support the working hypothesis that structure-preserving document processing, parent–child chunking, hybrid retrieval, and reranking can improve access to heterogeneous radiology-specific knowledge sources. At the same time, the weak correlation between automated and expert-derived Contextual Precision (0.329) suggests that automated metrics are useful for system debugging but should not fully replace expert evaluation.

The discrepancy between automated and expert-derived Contextual Precision highlights an important limitation of LLM-as-a-judge metrics in domain-specific medical RAG systems. Previous studies have also emphasized that automated metrics and LLM-based evaluators can support scalable assessment of medical text generation systems, but their validity depends on the task, evaluation prompt, judge model, and clinical context, and they should be complemented by expert review [37,38,39]. Automated metrics are useful for rapid comparison of pipeline configurations during iterative development, but they are sensitive to the evaluation prompt, the selected judge model, the structure and length of retrieved context, and domain-specific terminology. Expert reviewers assessed the practical usefulness of retrieved fragments for radiological reference, regulatory, and methodological queries, whereas the automated evaluator assessed textual relevance in a more formalized way. Moreover, retrieved parent chunks often contained both answer-bearing information and adjacent text from the same document section, which may have reduced automated scores despite being useful for expert interpretation. These factors may explain why expert-assessed Contextual Precision was higher than automated Contextual Precision and why the correlation between the two measures was only weak. Similar observations have been reported in RAG evaluation using institutional nuclear medicine manuals, where automated scoring was useful for comparing RAG configurations but showed limited agreement with human ratings [40]. Thus, automated metrics should be regarded as supportive tools for pipeline debugging and monitoring, rather than as substitutes for multireader expert assessment in medical applications [37,40].

Early applied studies of RAG in radiology have demonstrated several distinct use cases, including general radiology question answering, emergency radiology, contrast media consultation, nuclear medicine protocols, and radiology research support. However, these systems differ substantially in the scope and origin of their knowledge bases, retrieval strategy, evaluation design, and intended use. To clarify the positioning of the present study, we summarized the main characteristics of previously published radiology-focused RAG systems and compared them with our approach (see Appendix C).

Overall, the comparison shows that most previously published radiology-focused RAG systems were designed for either diagnostic question answering, narrow guideline-based consultation, or literature-based research support. In contrast, the present system was developed as a controlled reference engine for routine radiological practice, with a broader emphasis on reference, regulatory, and methodological documentation. Its main distinguishing features are the radiologist-informed construction of the knowledge base, expert validation of heterogeneous sources, preservation of document structure through parent–child indexing, and multireader evaluation of both generated responses and retrieved context. These differences also define the main directions for further development, including knowledge base expansion, semi-automated updating, and integration of additional retrieval and validation mechanisms.

A promising direction for further improvement is the expansion of the knowledge base and development of mechanisms for automated document updates within the VDB. In the current implementation, knowledge base updating requires several sequential steps: identification of new or revised source documents, expert assessment of source authority and relevance, exclusion of outdated or duplicate versions, document parsing and cleaning, chunking, embedding generation, reindexing in the VDB, and functional testing after reindexing. Several stages could be partially automated in future versions, including monitoring of predefined source repositories, detection of newly released or updated documents, metadata extraction, duplicate identification, parsing, and scheduled reindexing. However, full automation is not appropriate for a medical reference system because source authority, currency, and potential contradictions between documents require expert validation before the material can be included in the knowledge base. It may also be important to link text fragments with associated illustrations or to incorporate multimodal models capable of analyzing medical images and using them as part of the response. A transition from a closed RAG pipeline to an agent-based system may also be beneficial, allowing the LLM to decide autonomously when external retrieval is needed. In addition, integrating structured tools beyond purely semantic retrieval may help address cases in which semantically similar documents are not actually relevant and may broaden the range of use cases for intelligent reference systems in radiology.
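One of the steps that could be partially automated, detection of new, updated, and duplicate documents, might be sketched as follows. This is a hypothetical illustration and not part of the described system: the function names, the whitespace and case normalization, and the use of SHA-256 content hashes are all assumptions. In line with the requirement for expert validation, the output is intended only as a review queue, not as a trigger for automatic indexing.

```python
import hashlib

def content_hash(text):
    """Hash of whitespace- and case-normalized text (assumed normalization)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def classify_documents(indexed, incoming):
    """Compare incoming documents with the hashes of already indexed ones.

    indexed:  {doc_id: content hash} for documents already in the knowledge base
    incoming: {doc_id: text} for documents found in the monitored sources
    """
    known_hashes = set(indexed.values())
    report = {"new": [], "updated": [], "unchanged": [], "duplicate": []}
    for doc_id, text in incoming.items():
        digest = content_hash(text)
        if doc_id in indexed:
            status = "unchanged" if indexed[doc_id] == digest else "updated"
        elif digest in known_hashes:
            status = "duplicate"  # same content already indexed under another id
        else:
            status = "new"
        report[status].append(doc_id)
    return report

# Made-up example documents.
indexed = {"guideline_a": content_hash("Version 1 of guideline A.")}
incoming = {
    "guideline_a": "Version 2 of guideline A.",
    "guideline_a_copy": "version 1 of   guideline A.",
    "guideline_b": "First version of guideline B.",
}
report = classify_documents(indexed, incoming)
print(report)
```

Exact hashing only detects verbatim copies after normalization; near-duplicate or superseded versions would still require the expert assessment described above.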

5. Limitations

This study has several limitations. First, the current system operates only with textual sources and does not analyze illustrations or medical images. Second, knowledge base updating is currently manual and requires expert validation, parsing, cleaning, chunking, and reindexing, which may delay incorporation of new documents. Third, the comparison between the RAG system and the LLM-only baseline was based only on automated metrics and was not assessed by multireader expert evaluation. Fourth, retrieval hyperparameters were selected empirically during iterative debugging, and no full latency–performance ablation was performed. Fifth, the technical testing set was source-grounded and did not fully reproduce naturalistic workstation queries. No separate holdout-source or real-world user-log evaluation was conducted. Therefore, the results should be interpreted as performance under controlled source-grounded benchmark conditions rather than as definitive evidence of performance on unrestricted natural workstation queries. Finally, the system was tested on a single-institution corpus focused on organizational, methodological, and regulatory aspects of radiology, which limits generalizability and requires external validation.

Author Contributions

Conceptualization, I.A.B., A.V.V. and Y.A.V.; methodology, R.A.E., A.E.G., P.A.S. and A.A.Y.; software, R.A.E. and A.E.G.; validation, P.A.S., A.A.Y., M.D.V. and O.V.O.; formal analysis, A.A.Y. and A.E.G.; investigation, R.A.E., P.A.S., A.A.Y. and M.D.V.; data curation, R.A.E., A.E.G. and P.A.S.; writing—original draft preparation, A.A.Y. and P.A.S.; writing—review and editing, R.A.E., A.E.G., P.A.S., A.A.Y., M.D.V., I.A.B., O.V.O., A.V.V. and Y.A.V.; visualization, A.E.G. and R.A.E.; supervision, A.V.V. and Y.A.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was approved by the Independent Ethics Committee of the State Budget-Funded Health Care Institution of the City of Moscow “Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department” (protocol No. 06/2025, 19 June 2025).

Informed Consent Statement

Written informed consent for publication was not applicable, as the study did not involve patients or identifiable patient data.

Data Availability Statement

The source code, configurations, and parameters used in this work are available upon reasonable request to the corresponding author. The main code for the experimental pipelines was developed in Python 3.12.3. The chonkie library (https://github.com/chonkie-inc/chonkie; accessed on 28 May 2026), version 1.4.2, was used for the chunking algorithm. Qdrant (https://github.com/qdrant/qdrant; accessed on 28 May 2026), version 1.15, along with the Python qdrant-client (https://github.com/qdrant/qdrant-client; accessed on 28 May 2026), version 1.16.1, served as the vector database. All open-source models used are available on Hugging Face under their original licenses: Qwen/Qwen3-30B-A3B (https://huggingface.co/Qwen/Qwen3-30B-A3B; accessed on 28 May 2026), Qwen/Qwen3-Reranker-0.6B (https://huggingface.co/Qwen/Qwen3-Reranker-0.6B; accessed on 28 May 2026), and ai-forever/FRIDA (https://huggingface.co/ai-forever/FRIDA; accessed on 28 May 2026). The LLM and reranker were run using Ollama, version 0.17.7 (https://github.com/ollama/ollama; accessed on 28 May 2026). The sentence-transformers library (https://github.com/huggingface/sentence-transformers; accessed on 28 May 2026), version 5.1.2, was used for the FRIDA embedding model, and fastembed (https://github.com/qdrant/fastembed; accessed on 28 May 2026), version 0.7.3, was used for BM25 embeddings. In most experiments, particularly those involving standard large language models, computations were performed on NVIDIA RTX 3090 Ti and NVIDIA Quadro RTX 5000 graphics processors (NVIDIA Corporation, Santa Clara, CA, USA). The datasets for functional and technical testing, as well as a list of documents included in the knowledge base, are available only upon reasonable request to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
API	Application Programming Interface
BM25	Best Matching 25
CI	Confidence interval
HTML	HyperText Markup Language
LLM	Large Language Model
MD	Markdown
MTEB	Massive Text Embedding Benchmark
OCR	Optical character recognition
PDF	Portable Document Format
RAG	Retrieval-augmented generation
REST API	Representational State Transfer Application Programming Interface
RRF	Reciprocal Rank Fusion
SD	Standard deviation
VDB	Vector database

Appendix A

Appendix A.1

Table A1. Sample of dataset for functional testing.

ID	Query	Reference Response	Topic	Sources
1	What is the normal width of the third ventricle in adults of different age groups on CT scan?	The normal width of the third ventricle is <7 mm in adults under 60 years old and <9 mm in adults over 60 years old.	Resources on reference values for the indicators and the degree of described changes.	URL
6	What CT features should be used for the early detection of hydrocephalus in adults?	Increased size of the lateral, third, and fourth ventricles (ventriculomegaly), the Evans index (a ratio of the width of the anterior horns to the width of the skull) >0.3; areas of reduced density around the ventricles indicating CSF transudation through the ependyma with a density of 0–10 HU.	Resources on radiological anatomy (normal and pathologic).	URL

Appendix A.2

Table A2. Sample of dataset for technical testing.

ID	Query	Complexity Level	Reference Response	Source	Section(s)	Fragment of the Original Text
85	What defects are classified as type A?	0	Type A defects refer to cases in which the processing time of an AI service exceeds a predefined threshold (e.g., more than 6.5 min for radiological studies).	Assessment of the maturity of artificial intelligence technologies for healthcare.	1.1. Classification of technological defects during the operation of artificial intelligence-based software	During testing of artificial intelligence-based software, the following classification of the types of technological defects was developed: 1. Type A defects refer to cases in which the processing time of an AI service exceeds a predefined threshold (e.g., more than 6.5 min for radiological studies).
291	How can ultrasound detect carotid artery occlusion?	0	Heterogeneous hypoechoic masses in the lumen, the absence of staining in Color or Power Doppler mapping modes, and blood flow in the Doppler frequency shift spectrum.	ULTRASONIC EXAMINATION OF THE BRACHIOCEPHAL ARTERIES	Diagnosis of occlusive lesions of the carotid arteries	A criterion for occlusion of the internal carotid artery is the presence of heterogeneous hypoechoic masses in the lumen of the vessel, the absence of staining in the Color or Power Doppler mapping modes, and blood flow in the Doppler frequency shift spectrum.

Appendix B

Figure A1. Distribution of ratings by criteria (Likert scale).

Figure A2. Distribution of ratings by criteria (binary scale).

Appendix C

Table A3. Comparison of the proposed system with previously published radiology-focused RAG systems.

Study	Knowledge Base	Main Use Case	Evaluation Design	Key Difference from the Present Study
Wind et al. [12]	Radiopaedia-based evidence retrieved through a multi-step Retrieval and Reasoning (RaR) framework. The pipeline decomposed radiology questions into diagnostic concepts, retrieved targeted evidence, and synthesized structured evidence reports.	Text-based radiology question answering requiring diagnostic reasoning.	Evaluation of 25 LLMs across zero-shot prompting, conventional online RAG, and RaR on 104 expert-curated radiology questions and an independent set of 65 board-exam questions. Accuracy, hallucination/relevance metrics, latency, and the effect of retrieved context on human radiologist performance were assessed.	Multi-step, reasoning-oriented, and agent-like retrieval architecture focused on diagnostic QA.
Tayebi Arasteh et al. [15]	Real-time online retrieval from Radiopaedia. For each query, key phrases were extracted, Radiopaedia articles were retrieved, chunked, embedded, and used to construct a temporary query-specific VDB.	Radiology question answering based on textual descriptions of clinical and imaging findings. Images themselves were not processed.	Evaluation on 80 RSNA Case Collection questions and 24 expert-curated radiology questions. Multiple LLMs were tested with and without RadioRAG; accuracy, factuality, hallucination behavior, and comparison with a human radiologist were assessed using bootstrapping.	Online, source-specific RAG relying on Radiopaedia at query time.
Fukui et al. [40]	Forty Japanese institutional nuclear medicine manuals from a single hospital. Documents were chunked and indexed using dense vector retrieval and hybrid retrieval combining vector search with BM25.	Nuclear medicine procedure support, including examination protocols, patient preparation, image acquisition steps, and radiation safety.	Evaluation using 100 manually created question–answer pairs. GPT-3.5 and GPT-4o were tested with dense or hybrid retrieval. Three certified radiological technologists/medical physicists rated answers and retrieved contexts on four-point scales. Automated metrics included RAGAS factual correctness and context recall, with a comparison between human and automated scoring.	Institution-specific nuclear medicine RAG with Japanese-language manuals and combined human/RAGAS evaluation.
Fink et al. [41]	RadioGraphics Top Ten Reading List for trauma radiology, comprising 70 peer-reviewed educational articles. Sentences with surrounding context were embedded and retrieved to support GPT-4 Turbo.	Emergency/trauma radiology diagnosis, injury classification, and grading based on written radiological report findings.	Prospective proof-of-concept study using 100 synthetic radiology reports generated from 50 traumatic injuries by two radiologists. TraumaCB was compared with generic GPT-4 Turbo; three board-certified radiologists assessed accuracy, explanation quality, trustworthiness, source matching, and interrater reliability.	Task-specific RAG for trauma diagnosis/classification with source transparency.
Komenda et al. [42]	Single-source RAG based on version 10.0 of the ESUR guideline on contrast media.	Guideline-based radiology queries related to contrast media.	Expert evaluation of RAG responses to guideline-related questions.	Narrow single-guideline RAG system.
Wada et al. [43]	Curated contrast-media knowledge base derived from authoritative sources, including ACR Manual on Contrast Media, ESUR guidelines, institutional protocols, and relevant literature. Entries were reviewed and organized by radiologists; retrieval used hybrid semantic and keyword search.	Iodinated contrast media consultation, including risk assessment, contraindications, protocol selection, dosing, and safety-critical decision support.	Evaluation of 100 simulated ICM consultation scenarios. Llama 3.2 11B with and without RAG was compared with GPT-4o mini, Gemini 2.0 Flash, and Claude 3.5 Haiku. A blinded radiologist ranked responses; three LLM judges assessed clinical accuracy, safety, structure, communication, applicability, and latency. Hallucinations and response time were also analyzed.	Safety-critical, contrast-media-specific consultation system with local deployment emphasis.
Welsh et al. [44]	167,028 PubMed abstracts were retrieved using the keyword “radiology” and embedded into a local ChromaDB VDB.	Secure institutional RAG system for radiology research and literature-based question answering.	Single-blinded comparison with GPT-4-Consensus using 20 radiology-related prompts. Participants provided blinded 5-point Likert ratings for factual accuracy, citation relevance, and perceived performance; output preference and hallucination review by a board-certified radiologist were also reported.	Focused on radiology research support using a PubMed abstract corpus.

References

1. Najjar, R. Redefining Radiology: A Review of Artificial Intelligence Integration in Medical Imaging. Diagnostics 2023, 13, 2760.
2. Avanzo, M.; Stancanello, J.; Pirrone, G.; Drigo, A.; Retico, A. The Evolution of Artificial Intelligence in Medical Imaging: From Computer Science to Machine and Deep Learning. Cancers 2024, 16, 3702.
3. Vasiliev, Y.A.; Vladzymyrskyy, A.V. Artificial Intelligence in Radiology: Per Aspera ad Astra; Izdatel’skie Resheniya: Moscow, Russia, 2025.
4. Reichenpfader, D.; Müller, H.; Denecke, K. A Scoping Review of Large Language Model Based Approaches for Information Extraction from Radiology Reports. npj Digit. Med. 2024, 7, 222.
5. Flanders, A.E.; Wang, X.; Wu, C.C.; Kitamura, F.C.; Shih, G.; Mongan, J.; Peng, Y. The Evolution of Radiology Image Annotation in the Era of Large Language Models. Radiol. Artif. Intell. 2025, 7, e240631.
6. Vasilev, Y.A.; Reshetnikov, R.V.; Nanova, O.G.; Vladzymyrskyy, A.V.; Arzamasov, K.M.; Omelyanskaya, O.V.; Kodenko, M.R.; Erizhokov, R.A.; Pamova, A.P.; Seradzhi, S.R.; et al. Application of Large Language Models in Radiological Diagnostics: A Scoping Review. Digit. Diagn. 2025, 6, 268–285.
7. Wu, W.; Xu, X.; Gao, C.; Diao, X.; Li, S.; Salas, L.A.; Gui, J. Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 707–730.
8. Bluethgen, C.; Van Veen, D.; Zakka, C.; Link, K.E.; Fanous, A.H.; Daneshjou, R.; Frauenfelder, T.; Langlotz, C.P.; Gatidis, S.; Chaudhari, A. Best Practices for Large Language Models in Radiology. Radiology 2024, 315, e240528.
9. Amugongo, L.M.; Mascheroni, P.; Brooks, S.; Doering, S.; Seidel, J. Retrieval Augmented Generation for Large Language Models in Healthcare: A Systematic Review. PLoS Digit. Health 2025, 4, e0000877.
10. Salehi, S.; Singh, Y.; Horst, K.K.; Hathaway, Q.A.; Erickson, B.J. Agentic AI and Large Language Models in Radiology: Opportunities and Hallucination Challenges. Bioengineering 2025, 12, 1303.
11. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33.
12. Wind, S.; Sopa, J.; Truhn, D.; Lotfinia, M.; Nguyen, T.-T.; Bressem, K.; Adams, L.; Rusu, M.; Köstler, H.; Wellein, G.; et al. Multi-Step Retrieval and Reasoning Improves Radiology Question Answering with Large Language Models. npj Digit. Med. 2025, 8, 790.
13. Shi, Y.; Yang, T.; Chen, C.; Li, Q.; Liu, T.; Li, X.; Liu, N. SearchRAG: Can Search Engines Be Helpful for LLM-Based Medical Question Answering? In Proceedings of the 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Wuhan, China, 15–18 December 2025; pp. 4051–4056.
14. Zhao, X.; Liu, S.; Yang, S.-Y.; Miao, C. MedRAG: Enhancing Retrieval-Augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot. In Proceedings of the ACM Web Conference 2025 (WWW '25); ACM: New York, NY, USA, 2025; pp. 4442–4457.
15. Arasteh, S.T.; Lotfinia, M.; Bressem, K.; Siepmann, R.; Adams, L.; Ferber, D.; Kuhl, C.; Kather, J.N.; Nebelung, S.; Truhn, D. RadioRAG: Online Retrieval-Augmented Generation for Radiology Question Answering. Radiol. Artif. Intell. 2025, 7, e240476.
16. Dhanoa, D.; Dhesi, T.S.; Burton, K.R.; Nicolaou, S.; Liang, T. The Evolving Role of the Radiologist: The Vancouver Workload Utilization Evaluation Study. J. Am. Coll. Radiol. 2013, 10, 764–769.
17. Pierson, C.S.; Kennedy, T.A.; Bruce, R.J.; Yu, J.-P.J. Workflow Interruptions in an Era of Instant Messaging: A Detailed Analysis. Clin. Imaging 2024, 108, 110117.
18. Cerdá-Alberich, L.; Solana, J.; Mallol, P.; Ribas, G.; García-Junco, M.; Alberich-Bayarri, A.; Marti-Bonmati, L. MAIC-10 Brief Quality Checklist for Publications Using Artificial Intelligence and Medical Images. Insights Imaging 2023, 14, 11.
19. Cochran, W.G. Sampling Techniques; John Wiley & Sons: New York, NY, USA, 1977.
20. Kauermann, G.; Küchenhoff, H. Stichproben: Methoden und Praktische Anwendungen mit R; Springer: Berlin, Germany, 2011.
21. Docling. Available online: https://github.com/docling-project/docling (accessed on 28 May 2026).
22. Cheema, M.D.A.; Shaiq, M.D.; Mirza, F.; Kamal, A.; Naeem, M.A. Adapting Multilingual Vision Language Transformers for Low-Resource Urdu Optical Character Recognition (OCR). PeerJ Comput. Sci. 2024, 10, e1964.
23. Chonkie. Available online: https://github.com/chonkie-inc/chonkie (accessed on 28 May 2026).
24. Merola, C.; Singh, J. Reconstructing Context. In Knowledge-Enhanced Information Retrieval; Wang, Z., Fang, J., Frisoni, G., Dai, Z., Meng, Z., Moro, G., Eds.; KEIR 2025; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2026; Volume 16086, pp. 3–18.
Enevoldsen, K.; Chung, I.; Kerboua, I.; Kardos, M.; Mathur, A.; Stap, D.; Gala, J.; Siblini, W.; Krzemiński, D.; Winata, G.I.; et al. MMTEB: Massive Multilingual Text Embedding Benchmark. In Proceedings of the 13th International Conference on Learning Representations (ICLR 2025), Singapore, 24–28 April 2025. [Google Scholar]
FastEmbed. Available online: https://github.com/qdrant/fastembed (accessed on 28 May 2026).
Aljohani, B.; Alsanoosy, T. Enhancing Medical Question Answering with LLMs via a Hybrid Retrieval-Augmented Generation Framework. Information 2026, 17, 133. [Google Scholar] [CrossRef]
Chernogorskii, F.; Averkiev, S.; Kudraleeva, L.; Martirosian, Z.; Tikhonova, M.; Malykh, V.; Fenogenova, A. DRAGOn: Designing RAG on Periodically Updated Corpus. In Proceedings of the 21st Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop (EACL SRW 2026); Association for Computational Linguistics: Stroudsburg, PA, USA, 2026; Available online: https://aclanthology.org/2026.eacl-srw.48/ (accessed on 28 May 2026).
Wei, Q.; Yang, M.; Han, C.; Wei, J.; Zhang, M.; Shi, F.; Ning, H. QCG-Rerank: Chunks Graph Rerank with Query Expansion in Retrieval-Augmented LLMs for Tourism Domain. arXiv 2024, arXiv:2411.08724. [Google Scholar] [CrossRef]
Singh, I.S.; Aggarwal, R.; Allahverdiyev, I.; Taha, M.; Akalin, A.; Zhu, K.; O’Brien, S. ChunkRAG: A Novel LLM-Chunk Filtering Method for RAG Systems. arXiv 2024, arXiv:2410.19572. [Google Scholar] [CrossRef]
Schulhoff, S.; Ilie, M.; Balepur, N.; Kahadze, K.; Liu, A.; Si, C.; Li, Y.; Gupta, A.; Han, H.; Schulhoff, S.; et al. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques. arXiv 2025, arXiv:2406.06608. [Google Scholar] [CrossRef]
Jeffrey, I.; Kritin, V. DeepEval. Available online: https://github.com/confident-ai/deepeval (accessed on 28 May 2026).
Torsiello, B.; Giammarino, M.; Quatto, P.; Battini, M.; Mattiello, S.; Battaglini, L.; Renna, M. Evaluation of Inter-Observer Reliability in the Case of Trichotomous and Four-Level Animal-Based Welfare Indicators with Two Observers. Ital. J. Anim. Sci. 2024, 23, 938–960. [Google Scholar] [CrossRef]
Geijer, M.; Båth, M.; Wessman, C. Some Common Statistical Methods for Assessing Rater Agreement in Radiological Studies. Acta Radiol. 2025, 66, 675–683. [Google Scholar] [CrossRef]
Brummerloh, T.; Carnot, M.L.; Lange, S.; Pfänder, G. Boromir at Touché 2022: Combining Natural Language Processing and Machine Learning Techniques for Image Retrieval for Arguments. In Proceedings of the Touché Lab on Argument Retrieval at CLEF 2022, Bologna, Italy, 5–8 September 2022. [Google Scholar]
Pratt, S.; Bowen, I.; Hallowell, G.; Shipman, E.; Redpath, A. Assessment of Agreement Using the Equine Glandular Gastric Disease Grading System in 84 Cases. Vet. Med. Sci. 2022, 8, 1472–1477. [Google Scholar] [CrossRef]
Tam, T.Y.C.; Sivarajkumar, S.; Kapoor, S.; Stolyar, A.V.; Polanska, K.; McCarthy, K.R.; Osterhoudt, H.; Wu, X.; Visweswaran, S.; Fu, S.; et al. A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review. npj Digit. Med. 2024, 7, 258. [Google Scholar] [CrossRef] [PubMed]
Vasilev, Y.; Raznitsyna, I.; Pamova, A.; Burtsev, T.; Bobrovskaya, T.; Kosov, P.; Vladzymyrskyy, A.; Omelyanskaya, O.; Arzamasov, K. Evaluating Medical Text Summaries Using Automatic Evaluation Metrics and LLM-as-a-Judge Approach: A Pilot Study. Diagnostics 2026, 16, 3. [Google Scholar] [CrossRef] [PubMed]
Croxford, E.; Gao, Y.; Pellegrino, N.; Wong, K.; Wills, G.; First, E.; Liao, F.; Goswami, C.; Patterson, B.; Afshar, M. Current and Future State of Evaluation of Large Language Models for Medical Summarization Tasks. npj Health Syst. 2025, 2, 6. [Google Scholar] [CrossRef] [PubMed]
Fukui, Y.; Kawata, Y.; Kobashi, K.; Nagatani, Y.; Iguchi, H. Evaluation of a Retrieval-Augmented Generation System Using a Japanese Institutional Nuclear Medicine Manual and Large Language Model-Automated Scoring. Radiol. Phys. Technol. 2025, 18, 861–876. [Google Scholar] [CrossRef]
Fink, A.; Nattenmüller, J.; Rau, S.; Rau, A.; Tran, H.; Bamberg, F.; Reisert, M.; Kotter, E.; Diallo, T.; Russe, M.F. Retrieval-Augmented Generation Improves Precision and Trust of a GPT-4 Model for Emergency Radiology Diagnosis and Classification: A Proof-of-Concept Study. Eur. Radiol. 2025, 35, 5091–5098. [Google Scholar] [CrossRef]
Komenda, A.; Makowski, M.; Can, E.; Prucker, P.; Busch, F.; Wachter, A.; Weller, D.; Kim, S.H.; Ziegelmayer, S.; Bressem, K.; et al. Development and Evaluation of a Retrieval-Augmented Generation System for Radiology Guidelines. J. Imaging Inform. Med. 2026. [Google Scholar] [CrossRef]
Wada, A.; Tanaka, Y.; Nishizawa, M.; Yamamoto, A.; Akashi, T.; Hagiwara, A.; Hayakawa, Y.; Kikuta, J.; Shimoji, K.; Sano, K.; et al. Retrieval-Augmented Generation Elevates Local LLM Quality in Radiology Contrast Media Consultation. npj Digit. Med. 2025, 8, 395. [Google Scholar] [CrossRef]
Welsh, M.; Lopez-Rippe, J.; Alkhulaifat, D.; Khalkhali, V.; Wang, X.; Sinti-Ycochea, M.; Sotardi, S. Custom-Tailored Radiology Research via Retrieval-Augmented Generation: A Secure Institutionally Deployed Large Language Model System. Inventions 2025, 10, 55. [Google Scholar] [CrossRef]

Figure 1. Development and multireader evaluation framework of the proposed radiological RAG system.

Figure 2. Parent–child chunking and vector indexing strategy.

Figure 3. RAG retrieval and generation pipeline.

Figure 4. Dataset preparation and expert evaluation design. Solid arrows indicate the sequence of data preparation and evaluation stages. The dashed arrow indicates that the six experts involved in dataset preparation were also included among the 16 radiologists participating in the expert evaluation.

Figure 5. Mean values of criteria assessed on the Likert scale, by query complexity.

Figure 6. Proportion of relevant retrieved context fragments by query complexity.

Table 1. Criteria for evaluating RAG system responses.

No.	Criterion	Definition
1	Relevance	The RAG system’s response corresponds to the specific content of the query, regardless of whether it is correct.
2	Accuracy	The response is consistent with the provided source document or retrieved source fragment and does not contain factual inaccuracies within the scope of the available context.
3	Safety and Harm	The response carries no risk of harm to the patient, however slight. Even if a physician trusts it fully, acting on the response will not lead to negative consequences.
4	Comprehensiveness	The model’s response fully addresses all key aspects of the query; it is complete.
5	Conciseness	The response does not contain redundant or irrelevant information.
6	Correctness of Language and Terminology	The text is clear, logical, well-structured, and consistent with language standards and professional terminology.

Table 2. Criteria for evaluating retrieved context in the RAG system.

No.	Criterion	Definition
7	Contextual Completeness	The extent to which the retrieved context contains all the information necessary to generate a correct and complete response to the query without relying on external knowledge.
8	Usefulness of Context	The retrieved context contains information directly relevant to the user’s query and potentially useful for generating a response.

Table 3. Results of debugging the RAG system during iterative development.

Metric (DeepEval)	Testing 1	Testing 2	Testing 3
RAG system configuration	Fixed chunk size (2000 tokens); hybrid search (FRIDA, BM25); 15 chunks for context	Fixed chunk size (2000 tokens); hybrid search (FRIDA, BM25); 7 chunks for context	Parent–child semantic chunks; hybrid search (FRIDA, BM25); reranking; 5 chunks for context
Contextual Precision	0.824	0.805	0.938
Contextual Precision, processing errors	2	5	0
Contextual Recall	0.888	0.877	0.980
Contextual Recall, processing errors	6	9	0
Contextual Relevancy	0.337	0.328	0.685
Contextual Relevancy, processing errors	8	8	4
Answer Relevancy	0.676	0.778	0.910
Answer Relevancy, processing errors	0	0	25
Number of queries the system could not answer	27	16	4

Table 4. Results of automated testing on the full technical testing dataset.

Metric	Value (95% CI)	Processing Errors
Contextual Precision	0.735 (0.698–0.774)	2
Contextual Recall	0.881 (0.850–0.910)	8
Contextual Relevancy	0.188 (0.173–0.204)	41
Answer Relevancy	0.890 (0.868–0.910)	97
Number of queries the system could not answer	11	N/A

Table 5. Contextual Relevancy by query complexity level.

Complexity Level	Contextual Relevancy, Mean (95% CI)
0	0.146 (0.128–0.163)
1	0.233 (0.202–0.267)
2	0.220 (0.187–0.255)

Table 6. Correlation between automated Contextual Relevancy and expert-rated criteria.

Expert-Rated Criterion	Spearman r	p
Contextual Completeness	0.234	<0.001
Conciseness	−0.020	0.697
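
For readers who want to compute a rank correlation like those in Table 6 on their own ratings, the following is a minimal pure-Python sketch of Spearman's coefficient, calculated as the Pearson correlation of average ranks. The function names and the example scores are illustrative assumptions; the table does not state which software produced the reported values.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation applied to average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: automated scores and expert Likert ratings for 5 queries.
automated = [0.10, 0.25, 0.18, 0.40, 0.05]
expert = [3, 5, 4, 5, 2]
print(round(spearman_r(automated, expert), 3))
```

Average ranks are used because Likert ratings contain many ties, which the shortcut formula based on squared rank differences does not handle correctly.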

Table 7. Comparison of the Correctness metric between the LLM-only and RAG configurations on the full technical testing dataset.

Approach	Correctness, Mean (95% CI)	Test Statistic	p
LLM only	0.684 (0.663–0.704)	5132	<0.001
RAG	0.798 (0.782–0.812)	5132	<0.001

Table 8. Expert assessment of response quality.

Criterion	Mean	SD	95% CI
Relevance	4.793	0.651	(4.762, 4.823)
Accuracy	4.531	0.979	(4.485, 4.579)
Safety and Harm	4.714	0.804	(4.672, 4.754)
Comprehensiveness	4.618	0.879	(4.576, 4.661)
Conciseness	4.231	1.073	(4.177, 4.283)
Correctness of Language and Terminology	4.782	0.626	(4.753, 4.812)

Abbreviations: SD, standard deviation; CI, confidence interval.

Table 9. Expert assessment of retrieved context quality.

Criterion	Mean	SD	95% CI
Contextual Completeness	4.407	1.107	(4.353, 4.461)
Usefulness of Context 1	0.869	0.337	(0.852, 0.886)
Usefulness of Context 2	0.669	0.471	(0.646, 0.692)
Usefulness of Context 3	0.442	0.497	(0.418, 0.464)
Usefulness of Context 4	0.290	0.462	(0.267, 0.313)
Usefulness of Context 5	0.180	0.384	(0.161, 0.199)
Contextual Precision	0.886	0.253	(0.874, 0.898)

Abbreviations: SD, standard deviation; CI, confidence interval.
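
The Contextual Precision row reflects how highly the useful fragments are ranked among the five retrieved ones. The following is a minimal sketch, assuming that the expert-assessed value follows the rank-weighted formula documented for the DeepEval metric of the same name, that is, the mean of precision at rank k over the ranks that hold a useful fragment. The function name, the zero-for-no-useful-fragments convention, and the example ratings are hypothetical.

```python
def contextual_precision(useful):
    """Rank-weighted precision for one query.

    `useful` lists binary usefulness ratings in retrieval order
    (index 0 is the top-ranked fragment).
    """
    hits = 0
    total = 0.0
    for k, flag in enumerate(useful, start=1):
        if flag:
            hits += 1
            total += hits / k  # precision at rank k, counted only at useful ranks
    # Assumed convention: return 0.0 when no fragment is useful.
    return total / hits if hits else 0.0


# Hypothetical ratings for five retrieved fragments.
print(contextual_precision([1, 1, 0, 0, 0]))  # useful fragments ranked first
print(contextual_precision([0, 1, 0, 0, 0]))  # one useful fragment at rank 2
```

Under this formula the score depends on where the useful fragments sit, not on how many there are, which is consistent with a high Contextual Precision alongside low usefulness rates at ranks 4 and 5.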

Table 10. Kruskal–Wallis test results for criteria assessed on the Likert scale.

Criterion	Test Statistic	p
Relevance	0.178	0.915
Accuracy	0.384	0.825
Safety and Harm	0.142	0.931
Comprehensiveness	0.982	0.612
Conciseness	11.184	0.004
Correctness of Language and Terminology	0.856	0.652
Contextual Completeness	5.287	0.071
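
The p-values in Table 10 can be checked against the reported statistics. The following is a minimal sketch, assuming the Kruskal–Wallis statistic H is referred to a chi-square distribution with 2 degrees of freedom (three complexity levels, 0 to 2), for which the upper-tail probability has the closed form exp(−H/2). The function name is illustrative.

```python
import math


def chi2_sf_df2(statistic):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    if statistic < 0:
        raise ValueError("the test statistic must be non-negative")
    return math.exp(-statistic / 2.0)


# Test statistics copied from Table 10.
reported = {
    "Relevance": 0.178,
    "Accuracy": 0.384,
    "Safety and Harm": 0.142,
    "Comprehensiveness": 0.982,
    "Conciseness": 11.184,
    "Correctness of Language and Terminology": 0.856,
    "Contextual Completeness": 5.287,
}

for criterion, h in reported.items():
    print(f"{criterion}: H = {h}, p = {chi2_sf_df2(h):.3f}")
```

Under this assumption the sketch gives p ≈ 0.004 for Conciseness and p ≈ 0.071 for Contextual Completeness, in line with the tabulated values.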

Table 11. Conciseness by query complexity level.

Complexity Level	Conciseness, Mean (95% CI)
0	4.251 (4.162–4.339)
1	4.323 (4.218–4.429)
2	4.066 (3.957–4.178)

Table 12. Chi-square test results for criteria assessed on the binary scale.

Criterion	Test Statistic	p
Usefulness of Context 1	6.114	0.047
Usefulness of Context 2	5.099	0.078
Usefulness of Context 3	4.356	0.113
Usefulness of Context 4	4.116	0.391
Usefulness of Context 5	4.438	0.109

Table 13. Inter-expert agreement for criteria assessed on the Likert scale.

Criterion	Gwet’s AC2, Value	Gwet’s AC2, 95% CI
Relevance	0.930 *	(0.915, 0.946)
Accuracy	0.826 *	(0.797, 0.855)
Safety and Harm	0.901 *	(0.881, 0.921)
Comprehensiveness	0.874 *	(0.852, 0.896)
Conciseness	0.656 *	(0.616, 0.697)
Correctness of Language and Terminology	0.930 *	(0.915, 0.944)
Contextual Completeness	0.835 *	(0.812, 0.859)

* p < 0.001.

Table 14. Inter-expert agreement for the Usefulness of Context criterion.

Criterion	Proportion of Absolute Agreement, Value	Proportion of Absolute Agreement, 95% CI	Gwet’s AC1, Value	Gwet’s AC1, 95% CI
Usefulness of Context 1	0.734	(0.691, 0.776)	0.810 *	(0.771, 0.848)
Usefulness of Context 2	0.503	(0.455, 0.553)	0.507 *	(0.447, 0.567)
Usefulness of Context 3	0.402	(0.352, 0.452)	0.350 *	(0.294, 0.406)
Usefulness of Context 4	0.495	(0.447, 0.545)	0.653 *	(0.616, 0.691)
Usefulness of Context 5	0.562	(0.514, 0.612)	0.665 *	(0.616, 0.715)

* p < 0.001.
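
For orientation, the following is a minimal sketch of Gwet's AC1 for several raters and nominal categories, following the standard definition: observed pairwise agreement corrected by a chance term based on average category prevalence. The function name, the input layout, and the toy ratings are assumptions for illustration. The sketch does not compute confidence intervals or significance tests, and it is not presented as the procedure that produced the values above.

```python
from collections import Counter


def gwet_ac1(items, categories=(0, 1)):
    """Gwet's AC1 for nominal ratings.

    `items` holds one list of labels per rated item (one label per rater);
    items may have different numbers of raters.
    """
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")
    prevalence = {c: 0.0 for c in categories}
    agreement = []  # per-item agreement, for items with at least two ratings
    rated = 0       # items with at least one rating
    for labels in items:
        r = len(labels)
        if r == 0:
            continue
        if any(label not in prevalence for label in labels):
            raise ValueError("label outside the declared categories")
        counts = Counter(labels)
        rated += 1
        for c in categories:
            prevalence[c] += counts[c] / r
        if r >= 2:
            pairs = sum(n * (n - 1) for n in counts.values())
            agreement.append(pairs / (r * (r - 1)))
    if not agreement:
        raise ValueError("no item has two or more ratings")
    pa = sum(agreement) / len(agreement)
    # Chance agreement: sum of pi_k * (1 - pi_k), scaled by 1 / (q - 1).
    pe = sum((p / rated) * (1 - p / rated) for p in prevalence.values()) / (q - 1)
    return (pa - pe) / (1 - pe)


# Hypothetical binary usefulness ratings: four fragments, two raters each.
toy = [[1, 1], [1, 1], [1, 0], [1, 1]]
print(round(gwet_ac1(toy), 2))
```

Because the chance term shrinks when one category dominates, AC1 stays stable for skewed ratings such as the rank-1 usefulness judgments, where most fragments are marked useful.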

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Erizhokov, R.A.; Gordeev, A.E.; Sakharova, P.A.; Yafarova, A.A.; Varyukhina, M.D.; Blokhin, I.A.; Omelyanskaya, O.V.; Vladzymyrskyy, A.V.; Vasilev, Y.A. Development and Multireader Evaluation of Radiological RAG-System. Data 2026, 11, 143. https://doi.org/10.3390/data11060143

