Article
Peer-Review Record

LoRA-Tuned Multimodal RAG System for Technical Manual QA: A Case Study on Hyundai Staria

by Yerin Nam, Hansun Choi, Jonggeun Choi and Hyukjin Kwon *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Appl. Sci. 2025, 15(15), 8387; https://doi.org/10.3390/app15158387
Submission received: 1 July 2025 / Revised: 25 July 2025 / Accepted: 26 July 2025 / Published: 29 July 2025
(This article belongs to the Special Issue Innovations in Artificial Neural Network Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript presents a LoRA-tuned multimodal Retrieval-Augmented Generation (RAG) system for domain-specific question answering based on structured technical manuals, using the Hyundai Staria as a case study. The proposed framework is technically sound, the system is well-structured, and the results are promising. However, there are a few things that should be improved before it is ready for publication:

  1. The description of the dataset construction process in Section 3 lacks a visual pipeline or architecture diagram. Given the multi-stage nature of the process, a figure would help convey the overall structure and clarify the workflow for readers.
  2. Figure 2 includes non-English text, which is inappropriate for an English-language academic journal. All figures must be in English to ensure accessibility and consistency. This figure should be redrawn or modified accordingly.
  3. The qualitative results reported from expert evaluation—specifically the average satisfaction score of 4.4 from 20 maintenance personnel—lack necessary methodological detail. The manuscript does not specify the evaluation criteria, scoring rubric, or the dimensions assessed (e.g., usability, accuracy, clarity). Without this information, the results are difficult to interpret and cannot be independently validated. The authors should clearly describe how the qualitative evaluation was conducted, what aspects were measured, and how the scores were aggregated.
  4. The system relies on 200 manually annotated image-text pairs, which presents a clear scalability limitation. This is a critical issue when considering application to larger or more complex domains. The discussion section should address this limitation directly and explain what strategies could be used to mitigate the manual annotation burden in future work (e.g., automated alignment methods).
    In addition, while the paper repeatedly claims that the system can be extended to other domains such as defense, aerospace, and industrial maintenance, no experiments or adaptation analysis are provided beyond the Hyundai Staria case. The generalization claim is currently unsubstantiated and should be either supported with additional results or clearly qualified.

 

Author Response

Comment 1: The description of the dataset construction process in Section 3 lacks a visual pipeline or architecture diagram. Given the multi-stage nature of the process, a figure would help convey the overall structure and clarify the workflow for readers.

Response 1: We thank the reviewer for this insightful suggestion. In response, we have added a visual pipeline diagram to Section 3 (see Figure 3) that illustrates the full multi-stage process for dataset construction, including PDF extraction, text-image mapping, QA generation, and quality control stages. This visual aid clarifies the overall workflow and aligns with the reviewer’s emphasis on structural transparency.
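
For illustration only, the sketch below outlines the four pipeline stages named above in Python. Only the PDF-extraction step uses a real library (pypdf); the remaining helper functions are hypothetical stubs standing in for the text-image mapping, QA generation, and quality-control stages described in Section 3, not code from the manuscript.

```python
from pypdf import PdfReader

def extract_pages(pdf_path: str) -> list[str]:
    """Stage 1: extract raw text from each page of the manual PDF."""
    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]

def map_text_to_images(pages: list[str]) -> list[dict]:
    """Stage 2 (hypothetical stub): pair page text with mapped figure descriptions."""
    return [{"text": p, "image_description": None} for p in pages]

def generate_qa(pairs: list[dict]) -> list[dict]:
    """Stage 3 (hypothetical stub): turn each text-image pair into QA items."""
    return [{"question": "...", "answer": "...", "source": pair} for pair in pairs]

def quality_control(items: list[dict]) -> list[dict]:
    """Stage 4 (hypothetical stub): filter out incomplete or low-quality items."""
    return [item for item in items if item["question"] and item["answer"]]

def build_dataset(pdf_path: str) -> list[dict]:
    """Run the four stages end to end for one manual."""
    return quality_control(generate_qa(map_text_to_images(extract_pages(pdf_path))))
```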

Comment 2: Figure 2 includes non-English text, which is inappropriate for an English-language academic journal. All figures must be in English to ensure accessibility and consistency. This figure should be redrawn or modified accordingly.

Response 2: We appreciate the reviewer’s feedback regarding figure consistency. We have redrawn Figure 2 to ensure that all labels, annotations, and content are now presented in English. The revised figure is compliant with academic publishing standards and improves clarity for international readers.

Comment 3: The qualitative results reported from expert evaluation—specifically the average satisfaction score of 4.4 from 20 maintenance personnel—lack necessary methodological detail. The manuscript does not specify the evaluation criteria, scoring rubric, or the dimensions assessed (e.g., usability, accuracy, clarity). Without this information, the results are difficult to interpret and cannot be independently validated. The authors should clearly describe how the qualitative evaluation was conducted, what aspects were measured, and how the scores were aggregated.

Response 3: We fully agree with the reviewer’s concern. Accordingly, we have revised Section 5.5 to include a detailed description of the qualitative evaluation methodology. This includes participant demographics, the evaluation phases (orientation, survey, interviews), the four assessment dimensions (response speed, information clarity, system reliability, usability), and how the Likert scale scores were aggregated. These additions significantly enhance the methodological transparency of our evaluation.
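
For readers interested in the aggregation step, a minimal sketch follows. Only the four assessment dimensions are taken from the manuscript; the scores shown are illustrative placeholders, not the study data.

```python
from statistics import mean

# One 5-point Likert score per evaluator for each dimension assessed in
# Section 5.5; the values below are illustrative placeholders only.
ratings = {
    "response speed":      [4, 5, 4, 4, 5],
    "information clarity": [5, 4, 4, 5, 4],
    "system reliability":  [4, 4, 5, 4, 4],
    "usability":           [5, 4, 5, 4, 4],
}

per_dimension = {dim: mean(scores) for dim, scores in ratings.items()}
overall = mean(per_dimension.values())  # grand mean across the four dimensions
print(per_dimension, overall)
```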

Comment 4: The system relies on 200 manually annotated image-text pairs, which presents a clear scalability limitation. This is a critical issue when considering application to larger or more complex domains. The discussion section should address this limitation directly and explain what strategies could be used to mitigate the manual annotation burden in future work (e.g., automated alignment methods). In addition, while the paper repeatedly claims that the system can be extended to other domains such as defense, aerospace, and industrial maintenance, no experiments or adaptation analysis are provided beyond the Hyundai Staria case. The generalization claim is currently unsubstantiated and should be either supported with additional results or clearly qualified.

Response 4: We acknowledge the reviewer’s important observations. In Section 5.4, we now explicitly discuss the scalability challenges posed by manual annotation and propose automated alignment strategies using BLIP-2 and ImageBind as promising future solutions. Furthermore, we have qualified the generalization claims made in the manuscript and clarified that while the system design is domain-extensible, empirical validation across domains is a direction for future research, not a conclusion drawn from the current results.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors
  1. The paper did not provide a horizontal comparison with other mainstream RAG systems (such as LangChain, RAG with GPT-4, MuRAG, etc.), making it difficult to evaluate the relative advantages of the proposed system. We suggest adding comparative experiments with existing systems, especially performance comparisons on the same dataset, to enhance the persuasiveness of the paper.
  2. What are the criteria for selecting parameters such as rank, alpha, dropout, etc. in LoRA fine-tuning? Have ablation experiments been conducted to verify the impact of these parameters on performance?
  3. How is image information integrated into the generator? Is it through image feature vector concatenation, attention mechanism, or just as a contextual description? Please provide a detailed explanation of the fusion method.
  4. Currently, manual review is still required for image-text matching. Has the introduction of SOTA multimodal models such as BLIP and ImageBind been considered to achieve automatic image-text matching? Is the system performance stable under fully automatic alignment conditions?
  5. Regarding the requirements for offline military deployment, the authors mention model lightweighting solutions such as quantization and distillation. Is there any experimental verification of the impact of different model compression methods on system response speed and accuracy?

Author Response

Comment 1: The paper did not provide a horizontal comparison with other mainstream RAG systems (such as LangChain, RAG with GPT-4, MuRAG, etc.), making it difficult to evaluate the relative advantages of the proposed system. We suggest adding comparative experiments with existing systems, especially performance comparisons on the same dataset, to enhance the persuasiveness of the paper.

Response 1: Thank you for your insightful comment. We agree that comparative experiments with existing mainstream RAG systems (e.g., LangChain, GPT-4 RAG, MuRAG) would strengthen the persuasiveness of our work. Due to differences in model accessibility, architecture, and input modalities, direct performance benchmarking was not feasible within the current scope. However, to address this limitation, we have explicitly discussed the need for standardized benchmarking and future comparative evaluations in the revised manuscript. Specifically, we added the following statement to Section 5.5 (Limitations and Future Directions):

“Future work will prioritize the development of standardized evaluation protocols and benchmark datasets specifically designed for technical documentation question-answering, enabling systematic comparison across different architectural approaches including LangChain, GPT-4 RAG, and MuRAG.”

Comment 2: What are the criteria for selecting parameters such as rank, alpha, dropout, etc. in LoRA fine-tuning? Have ablation experiments been conducted to verify the impact of these parameters on performance?

Response 2: Thank you for your comment. We added clarification in Section 4.4, stating that LoRA-related parameters (rank=64, alpha=128, dropout=0.1) were empirically selected based on exploratory tuning, and aligned with prior studies (e.g., Hu et al. 2021) that recommend a 2:1 alpha-to-rank ratio for stability. While full ablation was outside the scope of this study, we highlighted this as a future direction in model optimization and performance trade-off analysis.
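
As a concrete illustration of this configuration, a minimal sketch using the Hugging Face PEFT library is shown below. The base model name and target modules are placeholder assumptions, not values taken from the manuscript; only the rank, alpha, and dropout values come from Section 4.4.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder base model

lora_cfg = LoraConfig(
    r=64,            # rank reported in Section 4.4
    lora_alpha=128,  # 2:1 alpha-to-rank ratio (cf. Hu et al. 2021)
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```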

Comment 3: How is image information integrated into the generator? Is it through image feature vector concatenation, attention mechanism, or just as a contextual description? Please provide a detailed explanation of the fusion method.

Response 3: Thank you for your question. In Sections 3.5 and 4.1, we now provide a detailed explanation of our multimodal fusion method. Our system incorporates image-text pairs retrieved at inference time and concatenates them into the generator's input using a fixed formatting template (e.g., `[Image: {description}]`). This textual context includes references to the linked image content, which has been pre-annotated and mapped. We also plan to explore more advanced fusion strategies, such as cross-modal attention, in future work.
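
A minimal sketch of this prompt-assembly step is given below: retrieved image-text pairs are rendered with the fixed `[Image: {description}]` template and concatenated into the generator's textual context. The function and field names are illustrative assumptions, not the system's actual identifiers.

```python
def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Concatenate retrieved text chunks and their mapped image descriptions."""
    context_blocks = []
    for item in retrieved:
        block = item["text"]
        if item.get("image_description"):  # pre-annotated, manually mapped image
            block += f"\n[Image: {item['image_description']}]"
        context_blocks.append(block)
    context = "\n\n".join(context_blocks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt(
    "How do I reset the tire pressure monitoring system?",
    [{"text": "TPMS reset procedure ...", "image_description": "TPMS reset button location"}],
)
```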

Comment 4: Currently, manual review is still required for image-text matching. Has the introduction of SOTA multimodal models such as BLIP and ImageBind been considered to achieve automatic image-text matching? Is the system performance stable under fully automatic alignment conditions?

Response 4: Thank you for raising this important point. We now address this issue explicitly in Section 5.5. The current implementation relies on 200 manually annotated image-text pairs. We have added a discussion on exploring automated alignment techniques such as BLIP-2 and ImageBind for future scalability. We note that while preliminary trials were encouraging, large-scale validation of performance stability under fully automated alignment is left as future work.
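
To make the automated-alignment idea concrete, the sketch below scores candidate captions against a manual figure. The manuscript proposes BLIP-2 and ImageBind; CLIP is used here purely as an illustrative stand-in, and the file name and captions are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("manual_figure.png")  # hypothetical figure from the manual
candidates = ["Brake fluid reservoir location", "Cabin air filter replacement"]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity scores
best = candidates[int(logits.argmax(dim=-1))]
print(best)  # candidate caption most aligned with the figure
```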

Comment 5: Regarding the requirements for offline military deployment, the authors mention model lightweighting solutions such as quantization and distillation. Is there any experimental verification of the impact of different model compression methods on system response speed and accuracy?

Response 5: Thank you for your valuable comment. We acknowledge this limitation and have added clarifying remarks in Section 5.5. Although we discussed quantization, distillation, and pruning as viable compression strategies for military deployment scenarios, detailed ablation studies or empirical validations are yet to be conducted. This is highlighted as an important direction for future empirical evaluation and deployment-oriented optimization.
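
As one example of the lightweighting options discussed, the sketch below loads a generator in 4-bit precision with bitsandbytes for offline deployment. This was not empirically validated in the paper, and the model name is a placeholder assumption.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16, store weights in 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder base model
    quantization_config=bnb_cfg,
    device_map="auto",
)
```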

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have made revisions in response to the reviewers' comments, and the manuscript is now suitable for publication.

Author Response

We thank the reviewer for the positive evaluation and constructive feedback. We have carefully revised the manuscript in accordance with the suggestions and are pleased that it is now deemed suitable for publication.
