This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

by José Guilherme Marques dos Santos 1,†, Ricardo Yang 1,†, Rui Humberto Pereira 2,3, Alexandre Sousa 3,4, Brígida Mónica Faria 3,5, Henrique Lopes-Cardoso 1,3, José Duarte 3,4, José Luís Reis 2,3, Luís Paulo Reis 1,3, Pedro Pimenta 3,6 and José Paulo Marques dos Santos 2,3,*
1 Faculty of Engineering, University of Porto, 4200-465 Porto, Portugal
2 Department of Business Administration, University of Maia, 4475-690 Maia, Portugal
3 LIACC—Artificial Intelligence and Computer Science Laboratory, University of Porto, 4200-465 Porto, Portugal
4 Department of Communication Sciences and Information Technologies, University of Maia, 4475-690 Maia, Portugal
5 School of Health, Polytechnic of Porto, 4200-072 Porto, Portugal
6 School of Technology and Management, Polytechnic Institute of Maia, 4475-690 Maia, Portugal
* Author to whom correspondence should be addressed.
† These two authors contributed equally to the study and are listed in alphabetical order.
Appl. Sci. 2026, 16(10), 5069; https://doi.org/10.3390/app16105069
Submission received: 2 April 2026 / Revised: 26 April 2026 / Accepted: 16 May 2026 / Published: 19 May 2026

Featured Application

A software platform for answering domain-specific queries over collections of PDF documents using Retrieval-Augmented Generation (RAG), applicable to administrative, legal, and regulatory document management in organizations handling sensitive or non-English documentation.

Abstract

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 21 pipeline configurations, varying the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a 50-question benchmark over a corpus of 36 Portuguese administrative documents (1706 pages, ~492K words), with LLM-as-judge scoring over 50 independent runs per configuration. Statistical significance was assessed via Wilcoxon signed-rank tests with Cohen’s d effect sizes. Two baselines bounded the results: naïve PDFLoader (86.2%) and manually curated Markdown (91.3%). Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1 ± 1.6%), surpassing even manual curation. A per-question-type analysis revealed that table-dependent questions drive the largest accuracy differences, with a 33-percentage-point gap between basic and hierarchical splitting. Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework alone. An exploratory GraphRAG implementation underperformed basic RAG (82% vs. 94.1%). These findings demonstrate that data preparation quality is the dominant factor in RAG system performance.
Keywords: Retrieval-Augmented Generation; RAG; PDF conversion; document preprocessing; data quality; chunking strategy; Docling; knowledge graph; GraphRAG; LLM
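
The abstract credits hierarchy-aware chunking with much of the accuracy gain, especially on table-dependent questions. As a rough illustration of the general idea only, the sketch below splits converted Markdown at headings and attaches the chain of ancestor headings to each chunk, so a table stays with the section titles that give it meaning. It is not the splitter used in the study: the names `Chunk` and `split_by_hierarchy` and the sample document are hypothetical, and the sketch ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re
from dataclasses import dataclass

# ATX-style Markdown heading: one to six '#' characters, a space, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading_path: list[str]  # ancestor headings, outermost first
    text: str                # body text found under the innermost heading


def split_by_hierarchy(markdown: str) -> list[Chunk]:
    """Split Markdown at headings, keeping each chunk's heading ancestry."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    body: list[str] = []

    def flush() -> None:
        # Emit the accumulated body under the current heading path, if any.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk([title for _, title in path], text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical sample document, for illustration only.
doc = """# Regulation
Intro text.
## Fees
| Item | Cost |
|------|------|
| A | 10 |
## Deadlines
Apply by May.
"""

chunks = split_by_hierarchy(doc)
for chunk in chunks:
    print(" > ".join(chunk.heading_path), "|", len(chunk.text), "chars")
```

Prepending the heading path to the chunk text before embedding is one simple way to carry this context into retrieval; whether that matches the metadata enrichment evaluated in the article is an assumption.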
