1. Introduction
Large Language Models (LLMs) [1,2,3] are trained on datasets containing millions to billions of words, enabling them to understand the structure, context, and meaning of language. This allows them to perform various language-related tasks, such as text generation, sentence interpretation, question answering, and translation [4,5].
In addition, multimodal technologies [6,7,8] that integrate visual information processing are gaining significant attention. Research is actively being conducted on tasks such as visual question answering (VQA) [9,10,11], visual content generation [12,13,14], and object recognition [15,16,17] through the integration of images and text. More recently, LLMs have been applied in specialized fields such as medical literature analysis in healthcare [18,19,20] and legal document review and advisory tasks [21,22,23]. Research has demonstrated that LLMs can fully understand a paper, including its problem statement, hypothesis, experimental results, and conclusion [24,25].
As LLMs continue to evolve, their applications have diversified, leading to increased use in academic research. This has greatly accelerated research development, resulting in a growing number of papers from which LLMs can learn. However, converting all papers into training data and training LLMs requires substantial human resources, time, and cost. While the recently introduced Retrieval-Augmented Generation (RAG) [26] enables question answering on paper contents without direct training, the complexity and specificity of academic papers can sometimes prevent RAG from providing adequate answers. In such cases, converting papers into training datasets and training LLMs becomes necessary, along with a systematic method to evaluate whether LLMs accurately understand the specialized content of papers before this process begins.
To overcome these limitations, our research proposes an automated benchmark generation system. Thoroughly evaluating paper comprehension requires not only understanding the overall content but also the elements explained in each section, particularly the visual materials supporting the research findings. To automate the assessment of paper comprehension, we have developed a system that automatically generates QA benchmarks for selected papers. Our proposed system leverages MLLMs to automatically generate QA sets for verifying paper comprehension and evaluates LLMs’ understanding of the papers using the generated benchmarks.
Our system comprises five components: semantic parsing, QA generation, VQA generation, binary choice generation, and a reviewing loop. First, it extracts text and visual materials from each section of the paper and converts them into structured content. Second, initial QA pairs are automatically generated based on the paper’s content. Third, initial VQA pairs are automatically generated based on the paper’s content and visual materials. Fourth, to evaluate model accuracy, a binary (yes/no) VQA dataset is created from the paper’s content with a 50:50 answer ratio. Fifth, a reviewer system assesses QA pair quality against six criteria: logic, relevance, completeness, accuracy, step-by-step explanation, and clarity and evidence. Finally, the QA set is iteratively improved based on feedback from the reviewer, refining the QA pairs to produce the final benchmark.
The main contributions of our research are as follows. First, we developed an automated benchmark generation system using MLLMs, which automates the process of generating QA sets to evaluate paper comprehension. Second, we proposed semantic parsing as a method for extracting paper content to capture detailed information and generate QA pairs. This approach addresses the “lost in the middle” phenomenon and enables the generation of comprehensive QA and VQA on specific paper content and visual materials. Third, we implemented a self-improving quality loop by introducing a reviewer LLM to enhance QA quality. Through these three core contributions, we can automatically generate high-quality benchmarks and systematically evaluate MLLMs’ understanding of research papers.
3. Methodology
We created an automated benchmark generation system to evaluate an MLLM’s ability to understand a paper and to select the best model. To demonstrate that it understands a paper, an MLLM needs to know exactly what the core problem is and why it is important. It also needs to clearly explain the solution and how it works, as well as the key points and experimental details of the research. A model that scores well on questions covering the following three aspects can therefore be considered to understand the paper. First, the ability to clearly identify, from the abstract, the problem underlying the research, the solution to that problem, and the scholarly contribution. Second, the ability to analyze the overall structure of the paper into sections and insightfully summarize the key contents of each section in conjunction with the abstract. Third, the ability to interpret the figures in the paper in the context of the research and explain how they relate to the overall thesis of the paper. For these evaluations, as shown in Figure 1, the system performs structural parsing of the paper and automatically generates question answering (QA) and visual question answering (VQA) datasets based on it to systematically evaluate the performance of each model.
Using the entire paper as the source content for QA generation can lead to the “lost in the middle” phenomenon [37], where QA generation for parts of the paper is missed. Furthermore, since the system was developed around an 11B model to reduce the reliance on MLLMs with large context lengths and to run on a single A6000 GPU, a more efficient content extraction approach is needed. To achieve this, we used the Mathpix API, which provides OCR capabilities, to obtain page-by-page descriptions of the text and images in PDF documents with the position of each line specified. This allowed us to organize the papers structurally by separating the abstract and the remaining sections by title and content, and to apply an extraction approach that divides the content of the paper into three key elements: ‘problem definition’, ‘solution’, and ‘main results’. We used a keyword-based extraction approach during the semantic parsing process. To extract the ‘problem definition’, we identified relevant sections based on about 15 keywords related to the problem definition, such as ‘challenge’, ‘limitation’, ‘problem’, ‘issue’, and ‘drawback’. To extract the ‘solution’, we used about 20 keywords such as ‘propose’, ‘present’, ‘introduce’, ‘develop’, and ‘design’ to identify the methodology presented by the authors. For the ‘main results’, we extracted content related to the achievements of the paper through about 15 keywords such as ‘result’, ‘show’, ‘demonstrate’, ‘achieve’, and ‘improve’.
3.1. Semantic Parsing
While a traditional abstract provides an overview of the document as a whole, our semantic parsing approach uses these keyword-based extracts to present the relevant passages corresponding to the three key elements in a structured format. In addition to summarizing the content of the document, it leverages the content of each section to generate benchmarks. In particular, for images, it matches each image with the textual description of the section it belongs to, ensuring that every image is paired with its proper textual context. The system then uses this content to generate a benchmark dataset and evaluates MLLMs with the generated benchmarks.
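The keyword-matching step of semantic parsing can be sketched as follows. This is a minimal illustration rather than the authors’ implementation: the keyword lists are abbreviated placeholders for the roughly 15–20 terms per element mentioned above, and the sentence-level matching granularity is our assumption.

```python
import re

# Illustrative keyword subsets; the full lists (~15-20 terms per element)
# described in the text are not reproduced here, so these are placeholders.
KEYWORDS = {
    "problem_definition": ["challenge", "limitation", "problem", "issue", "drawback"],
    "solution": ["propose", "present", "introduce", "develop", "design"],
    "main_results": ["result", "show", "demonstrate", "achieve", "improve"],
}

def extract_key_elements(sections: dict[str, str]) -> dict[str, list[str]]:
    """Assign each sentence of each parsed section to the key elements whose
    keywords it mentions (a sentence may support several elements)."""
    elements = {name: [] for name in KEYWORDS}
    for section_text in sections.values():
        for sentence in re.split(r"(?<=[.!?])\s+", section_text):
            lowered = sentence.lower()
            for name, words in KEYWORDS.items():
                if any(word in lowered for word in words):
                    elements[name].append(sentence.strip())
    return elements
```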
Figure 2 presents a comparative analysis of the answers generated with different parsing methods and shows that the semantic parsing technique produces significantly more comprehensive and contextually relevant answers than either full parsing or normal parsing. These results indicate that semantic parsing can extract more accurate and domain-specific information from research papers, making it an effective way to improve the quality of VQA dataset generation for paper comprehension assessment.
3.2. QA Generation
The QA generation process reflects the structural characteristics of the paper: the LLM generates answers to fixed template questions. For the Abstract, QA pairs are generated to capture the overall outline of the paper and its main findings, and for the subsequent sections, QA pairs are generated to summarize the entire contents of each section. Specialized prompts are designed for these section-specific QAs, enabling assessment of the understanding of the paper’s core concepts, methodological approach, key findings, and each individual section.
The Abstract section generates answers to the following three fixed questions:
“What is the main research problem or objective addressed in this paper?”;
“What are the key methods and approaches used in this research?”;
“What are the main findings and contributions of this paper?”.
For each subsequent section, an answer is then generated to the fixed instruction “Summarize the contents of this section”.
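The fixed prompts can be organized as a simple template set. The sketch below pairs each fixed question or instruction with the content it is answered from; the question wording is quoted from above, while the assumption that the parsed paper is available as a mapping from section titles to text (with the abstract under the key 'abstract') is ours.

```python
# Fixed question templates for QA generation (quoted from the text above).
ABSTRACT_QUESTIONS = [
    "What is the main research problem or objective addressed in this paper?",
    "What are the key methods and approaches used in this research?",
    "What are the main findings and contributions of this paper?",
]
SECTION_INSTRUCTION = "Summarize the contents of this section"

def build_qa_requests(parsed_paper: dict[str, str]) -> list[dict]:
    """Pair each fixed question/instruction with the content it should be
    answered from. `parsed_paper` maps section titles to text; storing the
    abstract under the key 'abstract' is an assumption."""
    requests = [{"question": q, "context": parsed_paper["abstract"]}
                for q in ABSTRACT_QUESTIONS]
    requests += [{"question": SECTION_INSTRUCTION, "context": text}
                 for title, text in parsed_paper.items() if title != "abstract"]
    return requests
```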
The Abstract prompt [38] focuses on understanding the overall context of the research paper and extracting its key content. It is designed to provide a concise summary of the main purpose of the study, the methodology used, and the key findings in 500 characters or less. The full set of prompts with input can be seen in Figure 3. These prompts were designed to eliminate meta-descriptions and unnecessary introductions so that key information is communicated directly and clearly, and the subsequent section prompts are carefully designed to facilitate the analysis of specific sections such as methods, results, and discussion. As shown in Figure 4, these prompts systematically summarize the key content by presenting each section with its own problem and objectives, and they extract important information from each section, such as the main methodology, experimental results, and discussion, presenting it in a clear and concise form. Both prompting systems eliminate unnecessary introductions and meta-descriptions and lead directly to the key information. In addition, all answers must end with a clear conclusion and accurately maintain the context of the document. In this way, QA pairs are generated from the content of the parsed paper.
3.3. VQA Generation
For VQA, the system integrates the visual elements of the paper with the section content to create questions that allow for a more accurate measure of comprehension. The prompting system is designed to analyze visuals and text together. It takes as input the content of the section containing each figure and the caption of the figure. At this stage, the system does not pre-classify the type of image; it automatically determines the nature of the image from the input and generates appropriate questions. The question generation prompts are carefully designed to produce meaningful questions about academic content; for example, with the same prompt structure, a figure describing the structure of a system naturally yields questions about structural understanding, while a graph showing experimental results yields questions about data interpretation. The question generation prompts in Figure 5 are designed to take into account the given image, the section content, and the image description. They leverage these inputs to generate a large number of questions covering a wide range of question types, from basic understanding to complex analysis.
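A minimal sketch of how the textual part of a VQA-generation request could be assembled from the section content and caption is shown below. The actual prompt wording is shown in Figure 5, so the template here is only a hypothetical placeholder, and the figure image itself would be passed to the MLLM separately as a vision input.

```python
def build_vqa_prompt(section_text: str, caption: str, num_questions: int = 5) -> str:
    """Assemble the text portion of a VQA-generation request.
    The wording is a placeholder, not the prompt shown in Figure 5."""
    return (
        "You are given a figure from a research paper, the section that "
        "contains it, and its caption.\n\n"
        f"Section content:\n{section_text}\n\n"
        f"Figure caption:\n{caption}\n\n"
        f"Generate {num_questions} questions that range from basic understanding "
        "to complex analysis of the figure in the context of the paper, "
        "together with specific, evidence-based answers."
    )
```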
3.4. Binary Choice Generation
In particular, each question refers to specific details of the content while avoiding duplication and remaining original and practical. To ensure effective answer generation, prompts must be concise, use complete sentences, and contain concrete content, avoiding ambiguous directives. Response quality depends on prohibiting direct repetition of words used in the question and on requiring an active, informative sentence structure.
Frequently used phrases such as “the point is” and “the main takeaway”, as well as incomplete endings with colons, are strictly limited. Instead, answers are expected to use concise, clear language while including specific details. These considerations are taken into account when constructing the prompts for generating answers; the specific sentence construction can be seen in Figure 6. The prompts for generating VQAs are designed to take into account the visual elements and the related textual context so that the connection between the visual information and the content of the answer can be assessed. The result of carrying out this process to generate a VQA is shown in Figure 7.
Binary choice evaluation is introduced to create an additional dataset for accuracy evaluation. This is achieved by partially modifying the VQA dataset construction method described above in order to evaluate the effectiveness of the proposed system. When the system generates data for accuracy evaluation, the result is a dataset of binary QA pairs with a 50% ratio of yes and no responses. Like the general VQA dataset, this dataset is intended to support an integrated analysis of visual elements and text. To this end, the prompt system generates each QA pair by taking as input the contents of the section containing each figure and the image description (caption). The questions are designed to reflect the professional knowledge and context of the paper as much as possible, including advanced domain terms and methodologies as well as the relationships between experimental conditions, parameters, and variables. In addition, the questions reflect specific numerical criteria, statistical measurements, and analysis criteria. QA pairs in the accuracy evaluation dataset contain an equal ratio of yes and no responses to minimize bias in the dataset and increase the objectivity of model evaluation. Because the answers are binary yes or no, the review process improves the quality of the QA pairs while accelerating the creation of the VQA dataset.
Questions for the accuracy assessment are generated from a set of prompts divided into two categories according to the expected answer: yes or no. Questions with a yes answer are crafted based on the following criteria. First, questions are generated based on the experimental results or methodology in the paper’s content; for instance, questions that ask about the impact of changes in a specific variable on the results or that exactly match the experimental conditions specified in the paper. Second, questions are aligned with specific numerical standards, statistical analysis results, or theoretical claims in the paper, thereby ensuring a positive response. Third, questions are formulated to demand an in-depth understanding that extends beyond basic fact checking by incorporating advanced domain terms and a professional research context.
Questions with a no answer are designed to intentionally misrepresent the content, methodology, definitions, etc., of the paper in question; false information is included to elicit a no response. The prompts for generating binary VQA data are shown in Figure 8. This process results in a binary VQA dataset for accuracy evaluation consisting of QA pairs, and a generated binary VQA sample can be seen in Figure 9. This binary VQA dataset provides a useful basis for evaluating the accuracy of the model.
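The 50:50 balancing can be sketched as follows. Truncating the larger side and shuffling is our assumption about how the equal ratio is enforced; the paper only states the resulting ratio.

```python
import random

def build_binary_set(yes_pairs: list[dict], no_pairs: list[dict],
                     seed: int = 0) -> list[dict]:
    """Combine 'yes' and 'no' QA pairs into a balanced 50:50 dataset by
    truncating the larger side, then shuffle to avoid positional bias."""
    n = min(len(yes_pairs), len(no_pairs))
    balanced = yes_pairs[:n] + no_pairs[:n]
    random.Random(seed).shuffle(balanced)
    return balanced
```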
3.5. Reviewing Loop
The primary issues encountered when generating the initial QA with the MLLM are outlined below.
First, responses often lack information. Rather than demonstrating a deep understanding of the entire text, QA pairs frequently provide general and vague descriptions, making them unreliable as meaningful assessment items. Second, the connection between the images and the content of the paper is often unclear. Many of the generated QA pairs describe only individual elements of an image and fail to provide the full contextual connection to the paper, limiting their effectiveness as an assessment tool. Third, they lack logic. Answers often simply list information or lack logical flow, reducing their quality as evaluation data.
To address these issues, we introduce an automated reviewer process utilizing the MLLM. This process consists of two phases: an evaluation phase and a feedback phase. In the evaluation phase, the reviewer MLLM provides feedback on the generated initial QA based on six evaluation items: logic, relevance, completeness, accuracy, step-by-step explanation, and clarity and evidence. Each item is designed to systematically address the aforementioned issues. Logic assesses whether the answer has a sound logical structure, while relevance checks whether the QA pair fits the intent and context of the question. Completeness requires that all aspects of the question are addressed and elaborated, and accuracy validates the factual accuracy of the answer and whether the data are interpreted correctly. Step-by-step explanation assesses whether complex concepts are developed in an organized manner, and clarity and evidence assesses whether the answer is clearly worded and presents the correct evidence. The detailed evaluation criteria can be found in Table 1.
Based on the six evaluation criteria, each item is rated on a scale from 1 to 5. All six items must score 4 or higher for the pair to be rated ‘good’. If any single item scores more than 2 but less than 4, the pair is categorized as ‘fair’; if any item scores more than 0 but less than 2, it is categorized as ‘poor’; and if the sum of all item scores is 7 or less, the pair is judged to be a generation error and excluded from the dataset. As shown in Figure 10, detailed prompts are provided to the reviewer MLLM to elicit the reasoning behind its rating along with the evaluation result.
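One possible reading of these thresholds in code is given below. This is a sketch, not the authors’ implementation; in particular, the handling of boundary scores (such as an item scoring exactly 2) is our interpretation.

```python
def classify_review(scores: dict[str, int]) -> str:
    """Map the six criterion scores (1-5 each) to a verdict, following one
    reading of the thresholds described above."""
    values = list(scores.values())
    assert len(values) == 6
    if sum(values) <= 7:
        return "generation_error"   # excluded from the dataset
    if all(v >= 4 for v in values):
        return "good"
    if any(v < 2 for v in values):
        return "poor"
    return "fair"
```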
These evaluation criteria are designed to address the key issues identified during initial QA creation. The logic criterion checks for a logical structure and non-contradictory arguments in the answer, addressing answers that merely list information or lack logical flow. The relevance criterion addresses the lack of relevance to the paper’s content by assessing whether the answer directly responds to the question requirements. The completeness item prevents incomplete responses by ensuring that all aspects of the question are addressed with sufficient detail. The accuracy item ensures factual accuracy and correct interpretation of the paper’s content. The step-by-step explanation criterion validates understanding of the paper by assessing whether complex concepts are explained systematically. The clarity and evidence item checks for clear use of terminology and appropriate presentation of evidence.
When the evaluation is not ‘good’ but ‘fair’ or ‘poor’, the feedback stage of the automatic reviewer process is initiated. During the feedback stage, the answer to the question is regenerated to reflect the results of the evaluation stage. When the feedback process is repeated, the reviewer system is designed to assign greater weight to the feedback reasons and the evaluation results. The regenerated answer then undergoes the evaluation stage once more, and this process is repeated until the re-evaluation is classified as ‘good’. For QA sets rated ‘good’, a brief summary is generated for long answers of 100 words or more, creating the final QA data. The automated reviewer process improves data quality by overcoming the aforementioned issues of missing paper information, weak links between images and paper content, and lack of logic. By scoring the generated VQA against clear criteria and providing specific feedback on items that do not meet the criteria, the completeness of the QA is continuously improved through an iterative refinement process. In addition, the problem of generating uncertain expressions such as “likely to be” and “would be better to be” has been effectively mitigated. This system ensures the consistency and reliability of the QA data while reducing the time and cost of human review and correction. In conclusion, the automated reviewer system resolves the inefficiency and quality issues in the VQA dataset generation process and enables more sophisticated and reliable evaluation. The datasets built in this way cover expertise from different fields and allow a detailed analysis of the entire paper, increasing the confidence in the MLLM paper comprehension evaluation. The results in Figure 11, Figure 12 and Figure 13 show the improvement before and after the introduction of the reviewer system.
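The feedback loop can be summarized with the following sketch, where `reviewer`, `regenerator`, and `summarizer` are hypothetical wrappers around the reviewer MLLM; the bounded number of rounds is our addition to guarantee termination, whereas the paper describes repeating until the rating is ‘good’.

```python
def review_loop(qa_pair: dict, reviewer, regenerator, summarizer,
                max_rounds: int = 5) -> dict:
    """Iteratively regenerate an answer until the reviewer rates it 'good'."""
    feedback_history = []
    for _ in range(max_rounds):
        verdict, scores, feedback = reviewer(qa_pair)
        if verdict == "good":
            break
        # Accumulated feedback is passed back in; later rounds carry more weight.
        feedback_history.append({"scores": scores, "feedback": feedback})
        qa_pair["answer"] = regenerator(qa_pair, feedback_history)
    # 'Good' answers of 100 words or more additionally receive a brief summary.
    if len(qa_pair["answer"].split()) >= 100:
        qa_pair["summary"] = summarizer(qa_pair["answer"])
    return qa_pair
```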
4. Experiments
We selected papers that provide reliable information and present detailed, systematic experimental methods and results along with visual materials. In particular, the core content of each paper, such as its methodology and experimental results, is explained with various visual materials, including formulas, graphs, and images. To evaluate the effectiveness of our system, we conducted experiments using 60 research papers from scientific fields that include visual materials, graphs, and formulas. Specifically, we selected 20 papers from each of the three fields of medical, natural, and engineering science. Each paper was required to include at least two VQA pairs.
4.1. Experiment Details
To validate the effectiveness of the benchmarks created by the proposed system, we performed zero-shot evaluations using GPT-4o [39], LLaVA 1.5-13B [40], Gemini 1.5 Pro [3], and BLIP2 [41]. GPT-4o is capable of providing a comprehensive assessment of both language and visual contexts. LLaVA 1.5 maximizes user-friendly QA performance through Visual Instruction Tuning. Gemini 1.5 Pro stands out for its visual–linguistic integration capabilities, while BLIP2 excels in retrieval and response accuracy through efficient image-to-text representation learning. We validated each of these models using the benchmark created by our system.
The benchmark generation system used the Llama-3.2-11B-Vision-Instruct model as the MLLM for QA generation and review. All experiments were performed on an NVIDIA RTX A6000 GPU with a GPU memory utilization of 0.8. These settings were chosen to achieve an optimal balance between inference speed and memory efficiency for an 11B-scale model. The context length of the model was set to 8192 tokens, and the generation parameters were temperature 0.7, top-k 50, top-p 0.9, presence penalty 0.2, and frequency penalty 0.2. These sampling parameters were tuned to generate a variety of QA patterns while maintaining academic accuracy. We capped the maximum number of generated tokens at 1024. Qualitative analysis was used to ensure that the generated QA samples were suitable for evaluating paper comprehension and to validate that state-of-the-art MLLMs can understand the content of papers and provide accurate responses using the benchmarks. We also validated the effectiveness of the benchmark by demonstrating that it produces generalized performance rankings similar to those of well-established benchmarks. In the zero-shot setup, each MLLM takes as input the parsed content of a paper, generates answers to the generated questions, and the answers are compared to the ground truth.
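The reported settings map naturally onto a vLLM-style inference configuration. The paper does not name the serving library, so the snippet below is an assumption about the stack; only the numeric values are taken from the text.

```python
from vllm import LLM, SamplingParams

# Numeric values taken from the text; the use of vLLM itself is an assumption.
llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    max_model_len=8192,            # context length
    gpu_memory_utilization=0.8,    # single NVIDIA RTX A6000
)
sampling = SamplingParams(
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    presence_penalty=0.2,
    frequency_penalty=0.2,
    max_tokens=1024,               # response length cap
)
```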
We selected ROUGE score, BLEU score, and accuracy as evaluation metrics. ROUGE and BLEU are widely used metrics in Natural Language Generation (NLG), measuring n-gram overlap and similarity between generated answers and ground truth to evaluate the linguistic quality and relevance of generated responses. Accuracy serves as a metric for determining the true/false validity of answers. To prevent bias in accuracy evaluation, we ensured an equal and unbiased distribution of yes and no responses (50% each) in the QA set. Our system-generated benchmark includes yes- or no-format QA to evaluate model comprehension capabilities, while ROUGE and BLEU evaluations were conducted on QA requiring descriptive answers to measure the precision of paper understanding. Using these three metrics in combination, we enabled the comprehensive assessment of paper comprehension levels.
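A sketch of how the three metrics could be computed is given below. The specific variants (ROUGE-L F-measure, corpus-level BLEU via sacrebleu, exact-match accuracy on yes/no answers) are our assumptions, since the paper names only ROUGE, BLEU, and accuracy.

```python
from rouge_score import rouge_scorer
import sacrebleu

def score_answers(predictions: list[str], references: list[str],
                  binary_preds: list[str], binary_refs: list[str]) -> dict:
    """Compute ROUGE-L and BLEU for descriptive answers and accuracy for the
    balanced yes/no set."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(scorer.score(ref, pred)["rougeL"].fmeasure
                  for pred, ref in zip(predictions, references)) / len(predictions)
    bleu = sacrebleu.corpus_bleu(predictions, [references]).score
    accuracy = sum(p.strip().lower() == r.strip().lower()
                   for p, r in zip(binary_preds, binary_refs)) / len(binary_preds)
    return {"rougeL": rouge_l, "bleu": bleu, "accuracy": accuracy}
```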
4.2. Quantitative Results
The zero-shot evaluation results show that the performance rankings of the models on the benchmarks generated by our system are similar to the rankings reported on the existing benchmarks MM-Vet v2 [42] and ViP-Bench [43]. In particular, the relative performance ranking of each model on the existing benchmarks remained consistent on our benchmarks, demonstrating that the evaluation criteria of the benchmarks generated by our proposed system have a level of validity and reliability similar to that of existing benchmarks built through a systematic curation process. Furthermore, the fact that this consistency in performance rankings was observed across all three evaluation domains demonstrates that our auto-generated benchmarks can serve as a reliable tool for evaluating the performance of multimodal models. The specific evaluation results can be found in Table 2, Table 3 and Table 4.
4.3. Qualitative Results
Figure 14,
Figure 15 and
Figure 16 represent a sample of the benchmarks used in the experimental evaluation. The examples presented demonstrate the system’s robust ability to accurately extract important information and generate appropriate answers in domains containing specialized knowledge. These results validate the system’s ability to thoroughly understand academic articles, generate appropriate questions to validate article comprehension, and produce accurate answers. They also confirm the system’s proficiency in accurately interpreting technical terms and complex article concepts, especially in its ability to generate QA pairs that effectively integrate textual and visual elements, such as graphs, numerical data, and images.
4.4. Human Evaluation Results
To assess the quality of the generated benchmarks, we conducted a comprehensive human evaluation study with 10 experts, each holding at least a master’s degree in a relevant field. We sampled 5k–10k data points from each domain benchmark in medicine, engineering, and natural sciences to form a total of five independent evaluation sets, each of which was independently reviewed by two evaluators to ensure the reliability of the evaluation.
The evaluation was conducted across four main dimensions to validate the quality of the paper comprehension evaluation datasets. First, accuracy assessed whether the questions and answers matched the content of the papers. Second, consistency examined whether consistent information was provided for similar inputs. Third, relevance determined whether the data points were appropriate to the topic or context. Finally, diversity assessed whether the dataset reflects different aspects of the paper. The specific numerical results of the human evaluation can be found in Table 5.
The evaluation showed that the generated benchmarks were of high quality overall. These results show that our system is able to generate benchmarks with high accuracy, consistency, and relevance to the content of the paper. However, the diversity score was lower than the other evaluation metrics, suggesting that our system tends to focus on certain parts or aspects of the paper. We discuss this in more detail in Section 6, “Limitations”.
4.5. Ablation Study
When creating the benchmark, we generate questions while examining each section of the paper, during which we extract the content necessary for answering the questions. We then conduct an ablation study on this content extraction method, evaluating and comparing scores when content is extracted using the full paper, normal parsing, and semantic parsing. The scores in Table 6 confirm that across all domains, content obtained through semantic parsing yields better paper comprehension capability.
5. Conclusions
This paper proposes a framework for assessing the level of understanding that an LLM can achieve for a particular paper. The benchmark created by our system can evaluate whether an LLM has read and fully understood a given paper. It goes beyond asking the LLM about the overall content, including the paper’s problems and solutions; it also covers detailed information from each section and requires QA on the visual materials that are necessary for understanding the paper. The benchmark generation system converts the content of the paper into material for QA generation through our proposed semantic parsing method, which mitigates the “lost in the middle” phenomenon and enables the generation of detailed QA and VQA. Introducing a reviewer improves QA generation quality by addressing issues such as a lack of causality and failure to reflect the paper’s content.
Our benchmark generation system is built on the Llama-3.2-11B-Vision model, which can run on a single A6000 GPU, and it can efficiently feed the contents of a paper into QA benchmark creation without relying on a model with an excessively large context length. Our framework allows for convenient evaluation of an LLM’s in-depth understanding of papers with minimal human intervention. However, if content extraction fails due to errors in the parsing tool, or if an image-related section exceeds three pages, the VQA output quality declines; even with the reviewing loop, improvements in these cases are insufficient, indicating the need for further research. This research provides a more efficient solution by enabling users to assess a model’s understanding of a paper and interact with it through document-based methods such as RAG, even before converting raw data into training data.
6. Limitations
Despite the effectiveness of our benchmark generation system, there are several limitations that should be addressed in future research. In the current implementation, we developed a system optimized for arXiv-formatted papers. This approach allowed us to create a standardized parsing process for our experiments. However, this is a significant limitation because papers from different conferences and journals often use different formats. These format differences can have a significant impact on the content parsing process, leading to incorrect extraction of section structures, mismatches between visual and textual descriptions, or complete parsing failure. To make the system more flexible and widely applicable across the scientific literature, future research should focus on developing more adaptive parsing methods that can handle different article formats from different publishers and societies.
In addition, the low diversity scores of the generated benchmarks revealed by the human evaluation are an important limitation. This suggests that the system tends to generate questions that focus on specific parts or aspects of the paper, which may limit how comprehensively the generated benchmarks can evaluate paper understanding. Future research should optimize the prompts to consider different aspects and content of the papers, and develop more intelligent content extraction mechanisms to improve the diversity of the generated questions and answers. In particular, developing diversity enhancement strategies that take into account domain-specific characteristics will help generate more comprehensive and effective assessment benchmarks.