1. Introduction
1.1. Research Background
With the increasing complexity of modern industrial and automotive systems, technical manuals have become more voluminous and structurally intricate. For example, vehicles like Hyundai’s Staria, which incorporate advanced driver assistance systems (ADAS), electronic control units, and complex powertrains, are accompanied by multi-thousand-page manuals combining both textual and visual information [1,2]. Field technicians face growing challenges in locating relevant repair procedures or diagnostics within such extensive documentation.
Traditional manual navigation methods, such as table-of-contents browsing or keyword-based search, often fall short, especially when users lack familiarity with domain-specific terminology or face multifaceted technical issues [3]. Furthermore, the separation between diagrams and textual explanations in many manuals hinders the synthesis of complete and actionable information.
These limitations are not confined to automotive domains. Similar difficulties arise across various technical fields—including aerospace, heavy machinery, and defense—where structured manuals are essential for operational safety and system maintenance. This highlights the need for intelligent and scalable question-answering systems that can interpret and contextualize multimodal manual content.
Recent advances in Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques offer promising solutions to these challenges [4,5]. By retrieving semantically relevant information and incorporating it into response generation, RAG-based systems enable more accurate and context-aware support across diverse domains. This study explores such an approach using automotive data, with the methodology being extensible to other sectors, including military and industrial applications.
1.2. Research Aims and Paper Organization
This study aims to develop a domain-adaptive multimodal Retrieval-Augmented Generation (RAG) system that enhances the accuracy, accessibility, and context-awareness of question answering over large-scale structured technical manuals. Using Hyundai Staria maintenance manuals as a representative use case, the proposed system integrates both textual and visual data to address common challenges in navigating complex documentation. While the automotive domain serves as the primary focus in this study, the methodology and system architecture are designed to be extensible to other domains, including industrial, aerospace, and military technical documents.
The specific research objectives are as follows. First, we develop a methodology to automatically extract and refine structured data from large-scale PDF manuals that combine both text and images. Second, we construct datasets for QA, RAG, and Multi-Turn scenarios, considering both single-turn and Multi-Turn question answering. Third, we design a retrieval model that captures semantic similarity between sentences and propose an improved RAG architecture optimized for domain-specific content. Fourth, we implement parameter-efficient training using the LoRA (Low-Rank Adaptation) fine-tuning technique, applied to the bLLossom-8B language model and the BAAI-bge-m3 embedding model. Fifth, we conduct a comprehensive evaluation using quantitative metrics such as BERTScore, ROUGE-L, and cosine similarity, as well as qualitative expert assessment. Sixth, we examine the potential applicability and scalability of the system to other technical domains, including but not limited to defense maintenance manuals.
The organization of this paper is as follows. Section 2 reviews related work on multimodal RAG systems and domain-specific question answering. Section 3 details the methodology for constructing and preprocessing the maintenance manual dataset. Section 4 presents the architecture of the proposed multimodal RAG system and the LoRA-based fine-tuning strategy. Section 5 discusses the experimental results on the constructed dataset, along with both quantitative and qualitative performance evaluations. Section 6 explores the broader applicability of the system to domains beyond automotive maintenance, including military technical documentation. Finally, Section 7 presents the conclusions and outlines directions for future research.
3. Data Construction Methodology
3.1. Dataset Construction Pipeline
In this study, we constructed a high-quality question-answering dataset suitable for training a multimodal RAG system by performing systematic data extraction and refinement on Hyundai Staria PDF manuals. The entire data construction process follows a systematic pipeline of PDF extraction, text processing, image–text mapping, and QA pair generation, as illustrated in Figure 1.
In the first stage, we simultaneously extracted text and images from the original PDF document using the PyMuPDF library. The analyzed Staria manual comprises a total of 522 pages, containing approximately 1.06 MB of text data and 878 images. During extraction, we preserved the original layout information, paragraph structure, and font styles to retain the document’s hierarchical structure as much as possible.
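A minimal sketch of this stage-1 extraction is shown below, using PyMuPDF’s fitz API; the file name is a hypothetical placeholder, and the record layout is illustrative rather than the study’s exact schema.

```python
import fitz  # PyMuPDF

doc = fitz.open("staria_manual.pdf")  # hypothetical file name
pages = []
for page in doc:
    # "dict" mode keeps block/line/span structure, so layout, paragraph,
    # and font information survive for the later hierarchization step.
    layout = page.get_text("dict")
    images = []
    for xref, *_ in page.get_images(full=True):
        pix = fitz.Pixmap(doc, xref)
        images.append({"xref": xref, "width": pix.width, "height": pix.height})
    pages.append({"page": page.number, "layout": layout, "images": images})

print(f"extracted {len(pages)} pages")
```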
In the second stage, we analyzed the structure of the extracted text and identified hierarchical patterns in chapter–section–subsection format. The analysis detected approximately 748 headings belonging to the upper-level numbering system (e.g., “1.”, “2.”), while lower-level structures (e.g., “1.1”, “1.1.1”) appeared only sporadically in limited sections. Accordingly, additional rule-based parsing and manual organization were performed to obtain an accurate hierarchy, as sketched below.
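The study does not publish its parsing rules, so the following regexes are illustrative stand-ins that only demonstrate the chapter/section/subsection split described above.

```python
import re

# Most-specific pattern is checked first so "1.1.1" is not read as "1."
HEADING_PATTERNS = [
    ("subsection", re.compile(r"^(\d+\.\d+\.\d+)\s+\S")),  # "1.1.1 ..."
    ("section",    re.compile(r"^(\d+\.\d+)\s+\S")),       # "1.1 ..."
    ("chapter",    re.compile(r"^(\d+)\.\s+\S")),          # "1. ..."
]

def classify_heading(line: str):
    for level, pattern in HEADING_PATTERNS:
        match = pattern.match(line.strip())
        if match:
            return level, match.group(1)
    return None

assert classify_heading("1. General information") == ("chapter", "1")
assert classify_heading("1.1 Engine compartment") == ("section", "1.1")
```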
In the third stage, we manually annotated image–text mappings to ensure semantic connectivity between each paragraph and its associated images. Approximately 200 high-quality visual–text pairs were constructed, serving as the foundation for subsequent QA pair generation.
In the fourth stage, we generated various types of question–answer pairs through template-based automation. To reflect users’ actual maintenance query situations, we included not only single-sentence queries but also Multi-Turn query types.
In the final stage, we performed response quality evaluation using a GPT-4-based large language model. Two researchers collaborated to evaluate each QA pair, focusing on technical validity, contextual consistency, and image–text correspondence, supplementing entries manually where necessary. This human–machine collaborative review ensured high consistency and efficiency in the refinement process.
3.2. Manual Data Extraction and Refinement Methods from Hyundai Staria PDF Manual
Considering the structural complexity and domain specificity of the Staria manual, we performed manual-based refinement work for both text and images. Text refinement was performed based on the following four criteria:
Maintenance relevance: Only information directly usable for actual maintenance work was selected.
Clarity: When technical descriptions were ambiguous or incomplete, meanings were clarified through researcher review.
Completeness: Procedural descriptions were organized to include all steps from start to completion.
Accuracy: To prevent technical fact errors, we referenced GPT-4 based model responses and verified through cross-review among researchers.
The refinement process included OCR error correction, technical term unification, and removal of unnecessary legal notices. For example, we unified the two competing notations for “engine oil change interval” and standardized inconsistent spellings of “torque wrench”.
3.3. QA, RAG, Multi-Turn Dataset Composition Methods and Characteristics
In this study, we constructed three types of datasets for the training and evaluation of multimodal technical question-answering systems: document-based retrieval-augmented question answering (RAG QA), Multi-Turn question answering (Multi-Turn QA), and single-turn question answering (Simple QA). Each dataset was designed to evaluate response structure, document consistency, and dialogue flow maintenance capabilities, reflecting realistic technical support scenarios across automotive and other equipment domains.
First, the RAG QA dataset consists of 5065 question–answer pairs, each containing three elements: question, context, and answer. The context is a text segment semantically matched to the query using the BAAI-bge-m3 embedding model from the Staria technical manual. This dataset is used for training and evaluating Retrieval-Augmented Generation (RAG) models, enhancing contextual alignment and answer generation quality.
Second, the Multi-Turn QA dataset includes 275 conversation scenarios, with each item containing multiple consecutive question–answer pairs (qa_list) associated with a single manual segment. The average number of dialogue turns is approximately 3.2, following common inquiry patterns such as symptom description → clarifying question → additional explanation → resolution. This dataset is used to assess a system’s abilities in context retention, pronoun resolution, and procedural reasoning. Initial responses were generated using GPT-4 and refined manually by researchers for accuracy and fluency.
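For concreteness, the two JSONL layouts can be illustrated as follows. The field names question, context, answer, and qa_list follow the structures named in Sections 3.3 and 3.4; the segment key and all example values are invented placeholders, not actual manual content.

```python
import json

rag_qa_record = {
    "question": "How do I reset the tire pressure monitoring system?",
    "context": "…manual segment retrieved as semantically matched context…",
    "answer": "…answer grounded in the context above…",
}

multi_turn_record = {
    "segment": "…single manual segment shared by all turns…",  # hypothetical key
    "qa_list": [
        {"question": "A warning light came on. What does it mean?",
         "answer": "…symptom clarification…"},
        {"question": "What should I check next?",
         "answer": "…follow-up procedural guidance…"},
    ],
}

# One JSON object per line, as required by the JSONL format.
with open("rag_qa.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(rag_qa_record, ensure_ascii=False) + "\n")
```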
The QA generation process involved the following steps. First, technical manuals in PDF format were segmented by section. Each section was further divided into smaller segments (seg1, seg2, …). Sentences were then classified into types such as problem solving, component description, and configuration guidance using rule-based regular expressions, with GPT-assisted classification applied in ambiguous cases. Prompts were then customized according to each sentence type to generate realistic and diverse queries and responses.
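A minimal sketch of the rule-based classification step follows; the study’s actual patterns and type inventory are not published, so the keyword rules below are hypothetical stand-ins.

```python
import re

# Hypothetical keyword rules; the study's actual patterns are not published.
SENTENCE_TYPE_RULES = [
    ("problem_solving",        re.compile(r"\b(if|when)\b.*\b(fail|warning|error)", re.I)),
    ("configuration_guidance", re.compile(r"\b(set|adjust|configure|install)\b", re.I)),
    ("component_description",  re.compile(r"\b(consists of|is located|is equipped)\b", re.I)),
]

def classify_sentence(sentence: str) -> str:
    for label, pattern in SENTENCE_TYPE_RULES:
        if pattern.search(sentence):
            return label
    return "ambiguous"  # ambiguous cases go to GPT-assisted classification

print(classify_sentence("If the warning lamp stays on, check the fuse."))  # problem_solving
```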
The QA types were organized into four categories: factual verification, procedural instruction, troubleshooting, and comparative analysis. All responses were designed to be purely text-based and grounded in the associated manual segment. To ensure response reliability and naturalness, we applied semantic similarity filtering, GPT summarization, and researcher-based validation.
3.4. Data Quality Management Measures
To construct high-quality question-answering datasets, this study introduced a multistage quality review system based on quantitative criteria and procedure-centered checks. The review procedure comprises three stages: manual review, GPT-based automatic evaluation, and structural verification; every QA item passed through all three stages before inclusion in the final dataset.
In the first review stage, manual filtering was performed based on consistency with actual maintenance content in documents, accuracy of technical terms, and clarity of expression. Responses containing semantic inconsistency, contextual disconnection, typos, or inappropriate schematic references were excluded, and similar sentences or duplicate responses were merged or deleted.
In the second review stage, we used GPT-4-based models to automatically evaluate sentence quality, the logical structure of responses, and the naturalness of explanations. The model graded each response against predefined criteria (coverage of question keywords, logical structure, clarity), and responses falling below the threshold were manually rewritten.
The third review stage checked data format consistency, confirming that the question, context, answer, and qa_list structures conformed to the JSONL schema. Structural problems such as line-break errors within sentences, nested keys, and missing fields were detected by automated scripts and corrected in batches.
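This structural verification can be sketched as a small script; the field names mirror the dataset schema above, while the file path and exact checks are placeholders.

```python
import json

REQUIRED_TRIPLE = {"question", "context", "answer"}

def validate_jsonl(path: str) -> list[str]:
    """Collect structural errors: unparseable lines or missing required fields."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: not valid JSON ({exc})")
                continue
            # A record must carry the QA triple or a multi-turn qa_list.
            if not (REQUIRED_TRIPLE <= record.keys() or "qa_list" in record):
                errors.append(f"line {lineno}: missing required fields")
    return errors
```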
Additionally, to ensure accuracy of image–text mapping configured for documents containing visual information, two independent reviewers conducted cross-evaluation on 50 randomly extracted pairs. Review results showed a disagreement rate of 3.7%, with final mapping accuracy measured at 96.3%. Disagreement cases were adjusted through third researcher judgment.
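As a sketch of how such cross-review agreement can be quantified, assuming binary accept/reject labels per pair (the label vectors below are placeholders, not the study’s data), scikit-learn’s cohen_kappa_score can be applied:

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder accept/reject labels for sampled image-text pairs.
reviewer_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
reviewer_b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]

# Raw agreement corresponds to the reported 96.3% mapping accuracy measure.
agreement = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)
kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"raw agreement: {agreement:.1%}, Cohen's kappa: {kappa:.2f}")
```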
Through this systematic quality management procedure, the constructed dataset attained reliability and consistency sufficient for use not only in RAG training but also as an evaluation benchmark, and its applicability to actual maintenance situations was also reviewed.
3.5. Multistage Quality Control for Multimodal RAG
The proposed RAG system extends the conventional text-only retrieval approach by integrating multimodal information. During dataset construction, each text segment was premapped to its associated images, enabling the retriever to fetch both relevant text and linked visual content in response to user queries. At the generator stage, these multimodal inputs—comprising the user query, retrieved text, and corresponding images—are processed jointly to produce enriched responses. This multimodal integration allows users to follow step-by-step procedures more intuitively, as visual elements such as button layouts, warning indicators, and component diagrams complement textual instructions. Moreover, this approach demonstrates strong applicability to procedure-intensive domains such as military technical manuals, where visual aids are critical for understanding complex operational steps.
4. Model Development and Training
4.1. Baseline Model Structure and Configuration
The multimodal maintenance question-answering system proposed in this study consists of two pathways, a Large Language Model Training Pipeline and a RAG Training Pipeline, as shown in Figure 2.
First, in the Large Language Model Training Pipeline, we performed domain continual pretraining on maintenance description text extracted from the Hyundai Staria PDF manuals. Masking ratios were set to 15% for general tokens and 25% for domain-specific terms to strengthen the language model’s understanding of technical terminology (a sketch of this dual-ratio masking follows below). Subsequently, fine-tuning was performed on a total of 8040 question–answer pairs spanning Simple QA, RAG QA, and Multi-Turn QA. LoRA-based parameter-efficient techniques were applied, training only about 0.1% of the total parameters.
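The paper does not specify the tokenizer, domain-term list, or masking mechanics, so the function below is only a minimal sketch of how a 15%/25% dual masking ratio could be realized (it omits the usual 80/10/10 mask/replace/keep split).

```python
import torch

def mask_tokens(input_ids: torch.Tensor, domain_token_ids: set,
                mask_token_id: int, general_p: float = 0.15,
                domain_p: float = 0.25):
    """Mask domain-term tokens at 25% and all other tokens at 15%."""
    labels = input_ids.clone()
    # domain_token_ids is an assumed precomputed set of technical-term token ids.
    is_domain = torch.tensor(
        [[int(t) in domain_token_ids for t in row] for row in input_ids],
        dtype=torch.bool)
    probs = torch.where(is_domain,
                        torch.full(input_ids.shape, domain_p),
                        torch.full(input_ids.shape, general_p))
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                 # unmasked positions are ignored by the loss
    masked_inputs = input_ids.clone()
    masked_inputs[masked] = mask_token_id  # replace masked positions with [MASK]
    return masked_inputs, labels
```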
The RAG Training Pipeline consists of retriever and generator modules. To improve the retriever, contrastive learning was performed on sentence-similarity data (1200 pairs) to enable semantics-based document retrieval. Each sentence pair was constructed from similarity ratings (ICC = 0.87) given by five automotive maintenance experts, and the embedding distances between similar sentences were adjusted on top of the BAAI-bge-m3 model. The generator is based on bLLossom-8B and was trained through Instruction Tuning to automatically adapt its response style to the query type (e.g., explanatory, diagnostic, procedural).
Finally, during query processing the constructed system operates as a multistage pipeline of retriever, generator, and postprocessing modules, as shown in Figure 3.
In Stage 1, user queries are vectorized via the BAAI-bge-m3 model, and cosine similarity is computed against the entire maintenance document embedding database. Importantly, each retrieved text segment is already linked with its corresponding image from the preannotated dataset, allowing the retriever to fetch multimodal context.
In Stage 2, the top 5 text–image pairs with similarity scores above 0.75 are selected as context.
In Stage 3, these multimodal contexts (text and image) and the user query are fed together into the bLLossom-8B generator, which uses Instruction Tuned templates to produce “explanatory”, “procedural”, or “comparative” responses depending on the query type.
In Stage 4, the postprocessing module enhances the output by inserting additional visual descriptions or annotations where visual references are critical.
In Stage 5, the system returns the final response to the user, complete with relevant images, and caches major queries for reuse.
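Stages 1 and 2 can be sketched as follows, assuming a precomputed, L2-normalized embedding matrix for the manual segments and a segment list whose records carry the pre-linked image paths; the variable names and record schema are illustrative assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
# segment_embeddings: (N, d) L2-normalized matrix precomputed offline.
# segments: list of {"text": ..., "image_path": ...} records (assumed schema).

def retrieve(query: str, segment_embeddings: np.ndarray, segments: list,
             top_k: int = 5, threshold: float = 0.75):
    # Stage 1: vectorize the query with bge-m3.
    q = model.encode(query, normalize_embeddings=True)
    scores = segment_embeddings @ q          # cosine similarity (both sides normalized)
    # Stage 2: keep the top 5 text-image pairs scoring above 0.75.
    order = np.argsort(scores)[::-1][:top_k]
    return [(segments[i], float(scores[i])) for i in order if scores[i] >= threshold]
```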
4.2. General RAG Model Application and Limitation Analysis
Basic RAG models combine document-based search and response generation to construct question-answering systems, offering the advantage of enabling quick access to structured information in maintenance manuals. However, when applied to military maintenance environments or similar civilian maintenance manuals, the following three major limitations were identified:
When using simple keyword-based search or non-domain-specific embeddings, the system only listed manual content at a level that failed to provide sufficient information needed for actual maintenance performance.
There was insufficient capability to provide step-by-step guidance needed in complex or urgent situations.
User-friendliness and visual supplementation functions were insufficient, leading to reduced information acceptability and work efficiency.
These limitations suggest that simple open-domain RAG structures cannot meet the information exploration needs of specialized maintenance documents.
4.3. Improved RAG Model Based on Manually Constructed Hyundai Sentence-Similarity Data
To overcome the structural limitations of basic RAG models, this study developed an improved RAG model centered on a retriever enhanced through domain sentence-similarity learning and a generator that adapts to maintenance query types. The proposed system uses a dual-pipeline structure built on manually constructed maintenance sentence-similarity data and domain question-answering data, aiming to improve retrieval performance and response relevance simultaneously.
First, for maintenance domain retriever improvement, we manually constructed a total of 1200 pairs of sentence similarity learning data from Hyundai Staria manuals. Each sentence pair was evaluated by five automotive maintenance experts on semantic similarity using a five-point scale, with interrater agreement (ICC) measured at 0.87. Based on this data, contrastive learning was performed using the BAAI-bge-m3 embedding model with MultipleNegativesRankingLoss.
During training, all sentence pairs in a batch other than the positive pair were automatically treated as negative samples, so the model learned to minimize the distance between embeddings of similar sentences and maximize the distance between dissimilar ones. This in-batch negative sampling enabled effective embedding fine-tuning without constructing explicit negative samples, securing both learning efficiency and practicality, as sketched below.
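A sketch of this contrastive fine-tuning in the sentence-transformers framework follows; the batch size of 16 (Section 4.4) and the loss function come from the paper, while the example pair, epoch count, and warmup steps are placeholders.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-m3")

# Each expert-validated similar pair becomes one positive example; within a
# batch, every other pair implicitly serves as a negative.
train_examples = [
    InputExample(texts=["Check the engine oil level with the dipstick.",
                        "Use the dipstick to inspect the engine oil level."]),
    # ... remaining expert-rated similar pairs (1200 in total)
]

loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```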
For generator configuration, we used the bLLossom-8B model as a base, performed domain continual pretraining, then applied Instruction Tuning using Simple QA, RAG QA, and Multi-Turn QA datasets. The model was trained to automatically adjust response styles and structures according to query types (definitional, procedural, comparative, etc.), with LoRA-based parameter-efficient tuning (rank = 64, alpha = 128, dropout = 0.1) training only about 0.1% of total parameters.
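The reported LoRA configuration can be expressed with the Hugging Face peft library roughly as follows; the base-model checkpoint identifier and target modules are assumptions, since the paper names bLLossom-8B but does not publish either.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed checkpoint for bLLossom-8B; the paper does not name the exact one.
base = AutoModelForCausalLM.from_pretrained("MLP-KTLim/llama-3-Korean-Bllossom-8B")

config = LoraConfig(
    r=64,                # rank, as reported in Section 4.3
    lora_alpha=128,      # alpha, giving the common 2:1 alpha-to-rank ratio
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # reports the trainable-parameter fraction
```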
4.4. Model Optimization Methods and Hyperparameter Tuning
In this study, hyperparameter tuning and training stabilization techniques were systematically applied to improve the performance of retriever and generator models.
In the retriever training stage, the learning rate, batch size (8, 16, 32), and temperature parameter (0.01, 0.05, 0.1) were treated as tuning variables, and the optimal combination was determined through grid search. Average retrieval similarity and Top-k accuracy served as evaluation criteria; experiments showed that the selected learning rate combined with a batch size of 16 and a temperature of 0.05 yielded the best convergence speed and performance.
In generator training, we quantitatively analyzed performance changes across different epoch counts during Instruction Tuning-based fine-tuning. The LoRA-related parameters (rank, alpha, dropout) were empirically optimized through exploratory experiments, and the final configuration (rank = 64, alpha = 128, dropout = 0.1) was adopted for subsequent training. These settings align with prior recommendations for large-scale parameter-efficient fine-tuning [20], where an alpha-to-rank ratio of 2:1 is commonly used to maintain training stability, as further supported by QLoRA guidelines [21]. To prevent overfitting, validation loss was continuously monitored with early stopping, terminating training when the loss failed to improve for three consecutive evaluations. As a result, the generator model performed best at three epochs, although training was run to five epochs for comparative analysis.
Additionally, Mixed Precision Training, effective large-batch learning through gradient accumulation, and memory reduction through Gradient Checkpointing were applied in parallel to ensure stability during long training runs and to minimize resource consumption.
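These stabilization techniques map onto Hugging Face TrainingArguments roughly as sketched below; the batch size, accumulation steps, and output path are placeholders, while the epoch count and early-stopping patience follow the settings described above.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="generator-lora",         # placeholder path
    num_train_epochs=5,                  # trained to five epochs for comparison
    per_device_train_batch_size=2,       # placeholder
    gradient_accumulation_steps=8,       # effective large-batch training
    fp16=True,                           # mixed precision training
    gradient_checkpointing=True,         # reduces memory occupancy
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

# Stop when validation loss fails to improve for three consecutive evaluations.
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
# Trainer(model=model, args=args, ..., callbacks=[early_stop]) then runs training.
```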
5. Experimental Results and Analysis
5.1. Selection of Quantitative Evaluation Metrics and Evaluation Methods
In this study, we constructed diverse evaluation metrics covering retrieval accuracy, response generation quality, and user satisfaction to comprehensively verify the performance of the technical manual-based question-answering system. Performance was measured using semantic similarity metrics such as BERTScore and cosine similarity, together with the structural consistency metric ROUGE-L. Additionally, the system was verified through qualitative evaluation by experts familiar with large-scale technical manuals such as those used in automotive or defense settings.
Given that procedural accuracy is particularly important in domains where operational steps must be followed precisely, we performed a manual review to qualitatively assess whether response sequences preserved correct procedural order. Through this multidimensional evaluation framework, we assessed not only technical metrics but also the system’s real-world applicability and usability.
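The three quantitative metrics can be computed with standard packages, as in the sketch below (bert-score, rouge-score, and sentence-transformers); the Korean language flag and the choice of bge-m3 for cosine similarity are assumptions consistent with the system described above.

```python
from bert_score import score as bert_score
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-m3")
scorer = rouge_scorer.RougeScorer(["rougeL"])

def evaluate(prediction: str, reference: str) -> dict:
    # BERTScore F1 between generated and reference answers.
    _, _, f1 = bert_score([prediction], [reference], lang="ko")
    # ROUGE-L F-measure for structural consistency.
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure
    # Embedding cosine similarity with normalized bge-m3 vectors.
    emb = embedder.encode([prediction, reference], normalize_embeddings=True)
    cosine = float(util.cos_sim(emb[0], emb[1]))
    return {"bertscore_f1": float(f1[0]), "rouge_l": rouge_l, "cosine": cosine}
```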
5.2. Quantitative Evaluation Results Comparison of Trained Models
This section quantitatively compares the proposed multimodal RAG system with the existing QA baseline. Evaluation used BERTScore, cosine similarity, and ROUGE-L, with experiments conducted on the same maintenance query set.
Table 1 shows the performance trajectory of the base language model trained without RAG. The base model improved steadily over epochs 1–3 and then converged, with final scores of 75.81% cosine similarity, 75.10% BERTScore, and 9.09% ROUGE-L.
Table 2 shows the training performance of the proposed system with RAG applied, confirming significant improvements on all evaluation metrics. The RAG-applied system achieved a final 78.11% in both cosine similarity and BERTScore, and improved considerably to 27.12% in ROUGE-L compared to the base model.
The introduction of RAG thus yields meaningful gains in both semantic similarity and lexical accuracy: cosine similarity improved from 75.81% to 78.11% (+2.30%p) and BERTScore from 75.10% to 78.11% (+3.01%p). The ROUGE-L metric in particular improved by 18.03%p (9.09% → 27.12%), indicating substantially better technical-term accuracy and sentence-structure consistency in maintenance-manual responses. Procedural accuracy also rose to 89.7%, confirming improvements that support actual maintenance feasibility.
5.3. Qualitative Evaluation
To verify practical applicability in the maintenance domain, we conducted a user satisfaction survey targeting 20 actual military maintenance personnel (12 enlisted personnel, 8 officers). All participants had more than 1 year of practical experience and were selected from personnel with experience using existing PDF manual systems.
Users received a system demonstration and approximately 30 min of question-answering practice (covering Simple QA, RAG QA, and Multi-Turn QA), then responded to structured surveys consisting of Likert-scale satisfaction questions and optional written comments. As summarized in Table 3, the overall average satisfaction score was 4.4, with sub-scores of 4.6 for response speed, 4.5 for information clarity, 4.3 for ease of use, and 4.2 for system reliability.
All items showed significant improvements for multimodal responses compared to text responses, with the largest improvement of 0.92 points particularly in the information clarity item. This result reflects the importance of visual information in complex maintenance procedures.
5.4. Limitations and Future Directions
While the proposed multimodal RAG system demonstrates promising performance within the automotive maintenance domain, several limitations must be acknowledged to guide future research directions.
5.4.1. Scalability and Automation Challenges
The current implementation relies on 200 manually curated image–text pairs, which presents a significant scalability bottleneck for broader domain applications. This manual annotation process, while ensuring high-quality semantic alignment (96.3% accuracy as verified through cross-reviewer evaluation), becomes prohibitively resource-intensive when scaling to comprehensive technical documentation spanning thousands of pages. Future research should investigate semi-automated or fully automated alignment techniques leveraging state-of-the-art multimodal foundation models such as BLIP-2 [22] and ImageBind [23]. Such approaches could potentially reduce human annotation requirements by 80–90% while maintaining acceptable alignment quality through contrastive learning and cross-modal attention mechanisms.
5.4.2. Comparative Evaluation and Benchmarking
The absence of standardized evaluation benchmarks for domain-specific multimodal RAG systems presents a fundamental challenge for comparative analysis. Unlike general-purpose question-answering tasks that benefit from established datasets such as SQuAD or Natural Questions, technical manual question answering lacks comprehensive benchmarks that capture the procedural complexity and multimodal nature of maintenance queries. Consequently, direct quantitative comparison with existing frameworks including LangChain, GPT-4 RAG, and MuRAG remains challenging. Future work will prioritize the development of standardized evaluation protocols and benchmark datasets specifically designed for technical documentation question answering, enabling systematic comparison across different architectural approaches.
5.4.3. Model Optimization and Deployment Constraints
While this study outlined potential model compression strategies including quantization, knowledge distillation, and pruning for resource-constrained deployment scenarios, empirical validation of their impact on system performance remains incomplete. Military and industrial deployment environments often impose strict computational constraints, requiring careful balance between model capability and resource efficiency. Future research should conduct comprehensive ablation studies examining the trade-offs between compression ratios and task performance, with particular attention to accuracy preservation under various optimization techniques. Additionally, investigation of edge computing deployment strategies and offline operational capabilities represents a critical research direction for practical field applications.
5.4.4. Cross-Domain Generalization
Although the proposed framework demonstrates effectiveness within the automotive maintenance domain, claims regarding extensibility to defense, aerospace, and industrial maintenance remain speculative without empirical substantiation. Domain transfer involves challenges including terminology variation, differences in procedural complexity, structural variations in manuals, and regulatory compliance requirements that may significantly affect system performance. Future research should systematically evaluate cross-domain adaptation strategies, including domain-specific fine-tuning protocols, transfer learning methodologies, and multidomain training approaches, with particular attention to safety-critical domains where response accuracy directly affects operational safety and mission success. Until such validation is performed, the generalizability of the current system should be interpreted as a promising direction rather than a proven outcome.
5.5. Evaluation Methodology
To rigorously assess the practicality, usability, and domain suitability of the proposed multimodal RAG system, a structured three-phase evaluation was conducted with 20 active-duty military maintenance personnel. The participant pool comprised 12 enlisted soldiers (ranks E-3 to E-6) and 8 officers (ranks O-1 to O-4), each with over one year of hands-on experience using conventional PDF-based technical manuals in operational maintenance settings.
The evaluation was structured into three distinct phases. In Phase 1, “System Orientation” (30 min), participants received standardized demonstrations of system functionality, including Simple QA, RAG QA, and Multi-Turn QA, and then gained hands-on experience with real-world queries derived from the Hyundai Staria maintenance manual. In Phase 2, “Structured Survey Assessment” (45 min), each participant solved 10 representative maintenance scenarios using the system and rated its performance across four core dimensions on a five-point Likert scale (1 = strongly disagree, 5 = strongly agree): response speed (timeliness of answer generation), information clarity (ease of understanding and completeness of system output), system reliability (confidence in accuracy), and usability (interface intuitiveness and suitability for field deployment). In Phase 3, “Semi-Structured Interviews” (15 min), open-ended feedback was gathered on system strengths, shortcomings, and suggestions for future enhancements.
The overall satisfaction score averaged 4.4 out of 5 (SD = 0.6). Dimension-specific scores were 4.6 for response speed, 4.5 for information clarity, 4.3 for usability, and 4.2 for system reliability. Enlisted personnel reported slightly higher satisfaction (4.6) than officers (4.2), possibly due to the system’s simplicity and intuitiveness. These results indicate that the proposed system offers significant advantages over traditional PDF-based manuals in terms of information retrieval speed and user comprehension.
To validate annotation consistency, interrater reliability was assessed on 50 randomly selected image–text pairs independently reviewed by two expert annotators. Semantic alignment agreement was 96.3%, with a Cohen’s κ of 0.89, indicating excellent reliability.
Qualitative feedback from the interview transcripts revealed recurring themes. Forty percent of participants suggested functional enhancements such as audio output for low-light conditions and improved glossary definitions for technical terms; thirty percent expressed concerns about relying fully on AI-generated answers without traceable manual references; twenty percent proposed improvements in image–text linkage and visual annotation accuracy; and the remaining ten percent highlighted the system’s potential as a training tool for onboarding new maintenance personnel.
While the evaluation demonstrated strong user satisfaction and operational promise, it was limited to a relatively small sample size and a single domain (automotive maintenance). Future work should include broader-scale evaluations across multiple equipment types and operational conditions to assess generalizability and robustness.
6. Discussion
6.1. Analysis of Model Strengths and Limitations Based on Experimental Results
The multimodal RAG-based maintenance question-answering system proposed in this study outperformed existing keyword search-based approaches on all quantitative metrics. In particular, the 18.03%p improvement in ROUGE-L indicates significant gains in the sentence-structure consistency and technical-term accuracy required in the maintenance domain. The 89.7% procedural accuracy further demonstrates response quality that can contribute to actual maintenance feasibility and safety assurance.
User evaluation likewise recorded high satisfaction, averaging 4.4 points, and the multimodal response method combining visual information proved effective in simultaneously improving information clarity, practicality, and perceived reliability. Because visual information is often essential in maintenance work, the system’s multimodal integration approach can compensate for the practical limitations of existing maintenance manual systems.
On the other hand, the model showed relatively lower response accuracy for queries involving complex or multiple constraints (e.g., operational queries combining environmental, temporal, and functional conditions). Additionally, the current image–text mapping consists of only 200 manually constructed pairs, limiting coverage of diverse parts and situations, and accurate interpretation of complex electronic circuits or hydraulic system diagrams remains a challenge.
6.2. Considerations for Military Maintenance Manual Application
For application to military maintenance manuals, the following three considerations exist:
Security systems based on user authentication, access control, and audit logs are essential for protecting the confidentiality of military equipment information.
Maintenance manual data structure and metadata consistency conforming to international standards such as NATO STANAG 4677 must be secured, and compatibility design considering interoperability among multinational equipment is necessary.
Lightweight models or offline mode support capable of operation in unstable network environments are needed considering field deployment possibilities.
6.3. Future Research Directions and Improvement Proposals
Subsequent research should introduce multimodal pretraining models for automatic image–text alignment. Since the current manual annotation-based alignment method has scalability constraints, automatic alignment algorithms based on cross-modal contrastive learning could secure accuracy and efficiency simultaneously. Condition-aware response generation for complex queries, predictive response modules based on time-series maintenance logs, and voice-based query modules for field input should also be considered.
In terms of model compression, field-deployment optimization can be pursued through quantization, knowledge distillation, and off-device prefetching techniques, and API design for integration with defense integrated information systems should proceed in parallel. A modular framework design enabling rapid expansion to various equipment types and maintenance domains will also be an important future task.
6.4. Research Contributions
This study proposed a multimodal RAG question-answering system optimized for the maintenance domain and designed and evaluated the model based on the structure and utilization conditions of actual military maintenance manuals. Specifically, we demonstrated system effectiveness through: first, domain dataset construction based on Hyundai Staria manuals; second, parameter-efficient RAG structure design combining similarity learning-based retrievers and LoRA-based generators; third, procedure-centered question answering.
7. Conclusions
This study proposed a LoRA-tuned multimodal Retrieval-Augmented Generation (RAG) system tailored for structured PDF-based technical manuals, with Hyundai Staria maintenance documentation as a case study. The system was designed to integrate visual and textual inputs through a multimodal retriever-generator pipeline and achieve domain-specific question answering in complex industrial environments.
Through the construction of a custom dataset and the design of a visual-textual alignment mechanism, our approach demonstrated effective handling of hierarchical and image-rich manual content. The LoRA fine-tuning strategy enabled efficient domain adaptation without incurring significant computational overhead, making the system suitable for real-world deployment scenarios with resource constraints.
The experimental evaluation—comprising both quantitative retrieval metrics and expert assessments by maintenance personnel—confirmed the performance gains in accuracy, relevance, and user satisfaction. Specifically, expert feedback highlighted the model’s usability in practical maintenance workflows, validating the system’s application in industrial and military domains.
Furthermore, this work addressed limitations commonly found in existing RAG frameworks, including non-English content handling, image-text fusion clarity, and extensibility of manual annotation. The proposed architecture is not only effective for automotive maintenance but also shows strong potential for broader application across defense maintenance platforms, such as the IETM systems currently used in NATO and ADD environments.
Overall, this research provides a robust, scalable foundation for multimodal QA in specialized domains. Future work will focus on integrating vision-language pretraining, automating annotation processes through weak supervision, and applying the framework to multilingual and low-resource defense datasets.