1. Introduction
1.1. Research Background
With the increasing complexity of modern industrial and automotive systems, technical manuals have become more voluminous and structurally intricate. For example, vehicles like Hyundai’s Staria, which incorporate advanced driver assistance systems (ADAS), electronic control units, and complex powertrains, are accompanied by multi-thousand-page manuals combining both textual and visual information [1,2]. Field technicians face growing challenges in locating relevant repair procedures or diagnostics within such extensive documentation.
Traditional manual navigation methods, such as table-of-contents browsing or keyword-based search, often fall short, especially when users lack familiarity with domain-specific terminology or face multifaceted technical issues [3]. Furthermore, the separation between diagrams and textual explanations in many manuals hinders the synthesis of complete and actionable information.
These limitations are not confined to automotive domains. Similar difficulties arise across various technical fields—including aerospace, heavy machinery, and defense—where structured manuals are essential for operational safety and system maintenance. This highlights the need for intelligent and scalable question-answering systems that can interpret and contextualize multimodal manual content.
Recent advances in Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques offer promising solutions to these challenges [4,5]. By retrieving semantically relevant information and incorporating it into response generation, RAG-based systems enable more accurate and context-aware support across diverse domains. This study explores such an approach using automotive data, with the methodology being extensible to other sectors, including military and industrial applications.
1.2. Research Aims and Paper Organization
This study aims to develop a domain-adaptive multimodal Retrieval-Augmented Generation (RAG) system that enhances the accuracy, accessibility, and context-awareness of question answering over large-scale structured technical manuals. Using Hyundai Staria maintenance manuals as a representative use case, the proposed system integrates both textual and visual data to address common challenges in navigating complex documentation. While the automotive domain serves as the primary focus in this study, the methodology and system architecture are designed to be extensible to other domains, including industrial, aerospace, and military technical documents.
The specific research objectives are as follows. First, we develop a methodology to automatically extract and refine structured data from large-scale PDF manuals that combine both text and images. Second, we construct datasets for QA, RAG, and Multi-Turn scenarios, considering both single-turn and Multi-Turn question answering. Third, we design a retrieval model that captures semantic similarity between sentences and propose an improved RAG architecture optimized for domain-specific content. Fourth, we implement parameter-efficient training using the LoRA (Low-Rank Adaptation) fine-tuning technique, applied to the bLLossom-8B language model and the BAAI-bge-m3 embedding model. Fifth, we conduct a comprehensive evaluation using quantitative metrics such as BERTScore, ROUGE-L, and cosine similarity, as well as qualitative expert assessment. Sixth, we examine the potential applicability and scalability of the system to other technical domains, including but not limited to defense maintenance manuals.
The organization of this paper is as follows. Section 2 reviews related work on multimodal RAG systems and domain-specific question answering. Section 3 details the methodology for constructing and preprocessing the maintenance manual dataset. Section 4 presents the architecture of the proposed multimodal RAG system and the LoRA-based fine-tuning strategy. Section 5 discusses the experimental results on the constructed dataset, along with both quantitative and qualitative performance evaluations. Section 6 explores the broader applicability of the system to domains beyond automotive maintenance, including military technical documentation. Finally, Section 7 presents the conclusions and outlines directions for future research.
3. Data Construction Methodology
3.1. Dataset Construction Pipeline
In this study, we constructed a high-quality question-answering dataset suitable for training a multimodal RAG system by performing systematic data extraction and refinement on Hyundai Staria PDF manuals. The entire data construction process follows a systematic pipeline of PDF extraction, text processing, image–text mapping, and QA pair generation, as illustrated in Figure 1.
In the first stage, we simultaneously extracted text and images from the original PDF document using the PyMuPDF library. The analyzed Staria manual comprises a total of 522 pages, containing approximately 1.06 MB of text data and 878 images. During extraction, we preserved the original layout information, paragraph structure, and font styles to retain the document’s hierarchical structure as much as possible.
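A minimal sketch of this stage-1 extraction is shown below, using PyMuPDF’s fitz API; the file name is a hypothetical placeholder, and the record layout is illustrative rather than the study’s exact schema.

```python
import fitz  # PyMuPDF

doc = fitz.open("staria_manual.pdf")  # hypothetical file name
pages = []
for page in doc:
    # "dict" mode keeps block/line/span structure, so layout, paragraph,
    # and font information survive for the later hierarchization step.
    layout = page.get_text("dict")
    images = []
    for xref, *_ in page.get_images(full=True):
        pix = fitz.Pixmap(doc, xref)
        images.append({"xref": xref, "width": pix.width, "height": pix.height})
    pages.append({"page": page.number, "layout": layout, "images": images})

print(f"extracted {len(pages)} pages")
```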
In the second stage, we analyzed the structure of the extracted text and identified hierarchical patterns in chapter–section–subsection format. The analysis detected approximately 748 headings belonging to the upper-level numbering system (e.g., “1.”, “2.”), while lower-level structures (e.g., “1.1”, “1.1.1”) appeared only sporadically in limited sections. Accordingly, additional rule-based parsing and manual organization were performed to obtain an accurate hierarchy, as sketched below.
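The study does not publish its parsing rules, so the following regexes are illustrative stand-ins that only demonstrate the chapter/section/subsection split described above.

```python
import re

# Most-specific pattern is checked first so "1.1.1" is not read as "1."
HEADING_PATTERNS = [
    ("subsection", re.compile(r"^(\d+\.\d+\.\d+)\s+\S")),  # "1.1.1 ..."
    ("section",    re.compile(r"^(\d+\.\d+)\s+\S")),       # "1.1 ..."
    ("chapter",    re.compile(r"^(\d+)\.\s+\S")),          # "1. ..."
]

def classify_heading(line: str):
    for level, pattern in HEADING_PATTERNS:
        match = pattern.match(line.strip())
        if match:
            return level, match.group(1)
    return None

assert classify_heading("1. General information") == ("chapter", "1")
assert classify_heading("1.1 Engine compartment") == ("section", "1.1")
```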
In the third stage, we manually annotated image–text mappings to ensure semantic connectivity between each paragraph and its associated images. Approximately 200 high-quality visual–text pairs were constructed, serving as the foundation for subsequent QA pair generation.
In the fourth stage, we generated various types of question–answer pairs through template-based automation. To reflect users’ actual maintenance query situations, we included not only single-sentence queries but also Multi-Turn query types.
In the final stage, we performed response quality evaluation using a GPT-4-based large language model. Two researchers collaborated to evaluate each QA pair, focusing on technical validity, contextual consistency, and image–text correspondence, supplementing entries manually where necessary. This human–machine collaborative review ensured high consistency and efficiency in the refinement process.
3.2. Manual Data Extraction and Refinement Methods from Hyundai Staria PDF Manual
Considering the structural complexity and domain specificity of the Staria manual, we performed manual-based refinement work for both text and images. Text refinement was performed based on the following four criteria:
Maintenance relevance: Only information directly usable for actual maintenance work was selected.
Clarity: When technical descriptions were ambiguous or incomplete, meanings were clarified through researcher review.
Completeness: Procedural descriptions were organized to include all steps from start to completion.
Accuracy: To prevent technical fact errors, we referenced GPT-4 based model responses and verified through cross-review among researchers.
The refinement process included OCR error correction, technical term unification, and removal of unnecessary legal notices. For example, we unified the two competing notations for “engine oil change interval” and standardized inconsistent spellings of “torque wrench”.
3.3. QA, RAG, Multi-Turn Dataset Composition Methods and Characteristics
In this study, we constructed three types of datasets for the training and evaluation of multimodal technical question-answering systems: document-based retrieval-augmented question answering (RAG QA), Multi-Turn question answering (Multi-Turn QA), and single-turn question answering (Simple QA). Each dataset was designed to evaluate response structure, document consistency, and dialogue flow maintenance capabilities, reflecting realistic technical support scenarios across automotive and other equipment domains.
First, the RAG QA dataset consists of 5065 question–answer pairs, each containing three elements: question, context, and answer. The context is a text segment semantically matched to the query using the BAAI-bge-m3 embedding model from the Staria technical manual. This dataset is used for training and evaluating Retrieval-Augmented Generation (RAG) models, enhancing contextual alignment and answer generation quality.
Second, the Multi-Turn QA dataset includes 275 conversation scenarios, with each item containing multiple consecutive question–answer pairs (qa_list) associated with a single manual segment. The average number of dialogue turns is approximately 3.2, following common inquiry patterns such as symptom description → clarifying question → additional explanation → resolution. This dataset is used to assess a system’s abilities in context retention, pronoun resolution, and procedural reasoning. Initial responses were generated using GPT-4 and refined manually by researchers for accuracy and fluency.
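For concreteness, the two JSONL layouts can be illustrated as follows. The field names question, context, answer, and qa_list follow the structures named in Sections 3.3 and 3.4; the segment key and all example values are invented placeholders, not actual manual content.

```python
import json

rag_qa_record = {
    "question": "How do I reset the tire pressure monitoring system?",
    "context": "…manual segment retrieved as semantically matched context…",
    "answer": "…answer grounded in the context above…",
}

multi_turn_record = {
    "segment": "…single manual segment shared by all turns…",  # hypothetical key
    "qa_list": [
        {"question": "A warning light came on. What does it mean?",
         "answer": "…symptom clarification…"},
        {"question": "What should I check next?",
         "answer": "…follow-up procedural guidance…"},
    ],
}

# One JSON object per line, as required by the JSONL format.
with open("rag_qa.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(rag_qa_record, ensure_ascii=False) + "\n")
```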
The QA generation process involved the following steps. First, technical manuals in PDF format were segmented by section. Each section was further divided into smaller segments (seg1, seg2, …). Sentences were then classified into types such as problem solving, component description, and configuration guidance using rule-based regular expressions, with GPT-assisted classification applied in ambiguous cases. Prompts were then customized according to each sentence type to generate realistic and diverse queries and responses.
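A minimal sketch of the rule-based classification step follows; the study’s actual patterns and type inventory are not published, so the keyword rules below are hypothetical stand-ins.

```python
import re

# Hypothetical keyword rules; the study's actual patterns are not published.
SENTENCE_TYPE_RULES = [
    ("problem_solving",        re.compile(r"\b(if|when)\b.*\b(fail|warning|error)", re.I)),
    ("configuration_guidance", re.compile(r"\b(set|adjust|configure|install)\b", re.I)),
    ("component_description",  re.compile(r"\b(consists of|is located|is equipped)\b", re.I)),
]

def classify_sentence(sentence: str) -> str:
    for label, pattern in SENTENCE_TYPE_RULES:
        if pattern.search(sentence):
            return label
    return "ambiguous"  # ambiguous cases go to GPT-assisted classification

print(classify_sentence("If the warning lamp stays on, check the fuse."))  # problem_solving
```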
The QA types were organized into four categories: factual verification, procedural instruction, troubleshooting, and comparative analysis. All responses were designed to be purely text-based and grounded in the associated manual segment. To ensure response reliability and naturalness, we applied semantic similarity filtering, GPT summarization, and researcher-based validation.
3.4. Data Quality Management Measures
To construct high-quality question-answering datasets, this study introduced a multistage quality review system based on quantitative criteria and procedure-centered checks. The review procedure comprises three stages: manual review, GPT-based automatic evaluation, and structural verification; every QA item passed through all three stages before inclusion in the final dataset.
In the first review stage, manual filtering was performed based on consistency with actual maintenance content in documents, accuracy of technical terms, and clarity of expression. Responses containing semantic inconsistency, contextual disconnection, typos, or inappropriate schematic references were excluded, and similar sentences or duplicate responses were merged or deleted.
In the second review stage, we used GPT-4-based models to automatically evaluate sentence quality, the logical structure of responses, and the naturalness of explanations. The model graded each response against predefined criteria (coverage of question keywords, logical structure, clarity), and responses falling below the threshold were manually rewritten.
The third review stage checked data format consistency, confirming that the question, context, answer, and qa_list structures conformed to the JSONL schema. Structural problems such as line-break errors within sentences, nested keys, and missing fields were detected by automated scripts and corrected in batches.
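This structural verification can be sketched as a small script; the field names mirror the dataset schema above, while the file path and exact checks are placeholders.

```python
import json

REQUIRED_TRIPLE = {"question", "context", "answer"}

def validate_jsonl(path: str) -> list[str]:
    """Collect structural errors: unparseable lines or missing required fields."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: not valid JSON ({exc})")
                continue
            # A record must carry the QA triple or a multi-turn qa_list.
            if not (REQUIRED_TRIPLE <= record.keys() or "qa_list" in record):
                errors.append(f"line {lineno}: missing required fields")
    return errors
```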
Additionally, to ensure accuracy of image–text mapping configured for documents containing visual information, two independent reviewers conducted cross-evaluation on 50 randomly extracted pairs. Review results showed a disagreement rate of 3.7%, with final mapping accuracy measured at 96.3%. Disagreement cases were adjusted through third researcher judgment.
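As a sketch of how such cross-review agreement can be quantified, assuming binary accept/reject labels per pair (the label vectors below are placeholders, not the study’s data), scikit-learn’s cohen_kappa_score can be applied:

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder accept/reject labels for sampled image-text pairs.
reviewer_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
reviewer_b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]

# Raw agreement corresponds to the reported 96.3% mapping accuracy measure.
agreement = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)
kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"raw agreement: {agreement:.1%}, Cohen's kappa: {kappa:.2f}")
```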
Through this systematic quality management procedure, the constructed dataset attained reliability and consistency sufficient for use not only in RAG training but also as an evaluation benchmark, and its applicability to actual maintenance situations was also reviewed.
3.5. Multistage Quality Control for Multimodal RAG
The proposed RAG system extends the conventional text-only retrieval approach by integrating multimodal information. During dataset construction, each text segment was premapped to its associated images, enabling the retriever to fetch both relevant text and linked visual content in response to user queries. At the generator stage, these multimodal inputs—comprising the user query, retrieved text, and corresponding images—are processed jointly to produce enriched responses. This multimodal integration allows users to follow step-by-step procedures more intuitively, as visual elements such as button layouts, warning indicators, and component diagrams complement textual instructions. Moreover, this approach demonstrates strong applicability to procedure-intensive domains such as military technical manuals, where visual aids are critical for understanding complex operational steps.
4. Model Development and Training
4.1. Baseline Model Structure and Configuration
The multimodal maintenance question-answering system proposed in this study consists of two pathways, a Large Language Model Training Pipeline and a RAG Training Pipeline, as shown in Figure 2.
First, in the Large Language Model Training Pipeline, we performed domain continual pretraining on maintenance description text extracted from the Hyundai Staria PDF manuals. Masking ratios were set to 15% for general tokens and 25% for domain-specific terms to strengthen the language model’s understanding of technical terminology (a sketch of this dual-ratio masking follows below). Subsequently, fine-tuning was performed on a total of 8040 question–answer pairs spanning Simple QA, RAG QA, and Multi-Turn QA. LoRA-based parameter-efficient techniques were applied, training only about 0.1% of the total parameters.
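The paper does not specify the tokenizer, domain-term list, or masking mechanics, so the function below is only a minimal sketch of how a 15%/25% dual masking ratio could be realized (it omits the usual 80/10/10 mask/replace/keep split).

```python
import torch

def mask_tokens(input_ids: torch.Tensor, domain_token_ids: set,
                mask_token_id: int, general_p: float = 0.15,
                domain_p: float = 0.25):
    """Mask domain-term tokens at 25% and all other tokens at 15%."""
    labels = input_ids.clone()
    # domain_token_ids is an assumed precomputed set of technical-term token ids.
    is_domain = torch.tensor(
        [[int(t) in domain_token_ids for t in row] for row in input_ids],
        dtype=torch.bool)
    probs = torch.where(is_domain,
                        torch.full(input_ids.shape, domain_p),
                        torch.full(input_ids.shape, general_p))
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                 # unmasked positions are ignored by the loss
    masked_inputs = input_ids.clone()
    masked_inputs[masked] = mask_token_id  # replace masked positions with [MASK]
    return masked_inputs, labels
```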
The RAG Training Pipeline consists of retriever and generator modules. To improve the retriever, contrastive learning was performed on sentence-similarity data (1200 pairs) to enable semantics-based document retrieval. Each sentence pair was constructed from similarity ratings (ICC = 0.87) given by five automotive maintenance experts, and the embedding distances between similar sentences were adjusted on top of the BAAI-bge-m3 model. The generator is based on bLLossom-8B and was trained through Instruction Tuning to automatically adapt its response style to the query type (e.g., explanatory, diagnostic, procedural).
Finally, during query processing the constructed system operates as a multistage pipeline of retriever, generator, and postprocessing modules, as shown in Figure 3.
In Stage 1, user queries are vectorized via the BAAI-bge-m3 model, and cosine similarity is computed against the entire maintenance document embedding database. Importantly, each retrieved text segment is already linked with its corresponding image from the preannotated dataset, allowing the retriever to fetch multimodal context.
In Stage 2, the top 5 text–image pairs with similarity scores above 0.75 are selected as context.
In Stage 3, these multimodal contexts (text and image) and the user query are fed together into the bLLossom-8B generator, which uses Instruction Tuned templates to produce “explanatory”, “procedural”, or “comparative” responses depending on the query type.
In Stage 4, the postprocessing module enhances the output by inserting additional visual descriptions or annotations where visual references are critical.
In Stage 5, the system returns the final response to the user, complete with relevant images, and caches major queries for reuse.
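Stages 1 and 2 can be sketched as follows, assuming a precomputed, L2-normalized embedding matrix for the manual segments and a segment list whose records carry the pre-linked image paths; the variable names and record schema are illustrative assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
# segment_embeddings: (N, d) L2-normalized matrix precomputed offline.
# segments: list of {"text": ..., "image_path": ...} records (assumed schema).

def retrieve(query: str, segment_embeddings: np.ndarray, segments: list,
             top_k: int = 5, threshold: float = 0.75):
    # Stage 1: vectorize the query with bge-m3.
    q = model.encode(query, normalize_embeddings=True)
    scores = segment_embeddings @ q          # cosine similarity (both sides normalized)
    # Stage 2: keep the top 5 text-image pairs scoring above 0.75.
    order = np.argsort(scores)[::-1][:top_k]
    return [(segments[i], float(scores[i])) for i in order if scores[i] >= threshold]
```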
4.2. General RAG Model Application and Limitation Analysis
Basic RAG models combine document-based search and response generation to construct question-answering systems, offering the advantage of enabling quick access to structured information in maintenance manuals. However, when applied to military maintenance environments or similar civilian maintenance manuals, the following three major limitations were identified:
When using simple keyword-based search or non-domain-specific embeddings, the system only listed manual content at a level that failed to provide sufficient information needed for actual maintenance performance.
There was insufficient capability to provide step-by-step guidance needed in complex or urgent situations.
User-friendliness and visual supplementation functions were insufficient, leading to reduced information acceptability and work efficiency.
These limitations suggest that simple open-domain RAG structures cannot meet the information exploration needs of specialized maintenance documents.
4.3. Improved RAG Model Based on Manually Constructed Hyundai Sentence-Similarity Data
To overcome the structural limitations of basic RAG models, this study developed an improved RAG model centered on a retriever enhanced through domain sentence-similarity learning and a generator that adapts to maintenance query types. The proposed system uses a dual-pipeline structure built on manually constructed maintenance sentence-similarity data and domain question-answering data, aiming to improve retrieval performance and response relevance simultaneously.
First, for maintenance domain retriever improvement, we manually constructed a total of 1200 pairs of sentence similarity learning data from Hyundai Staria manuals. Each sentence pair was evaluated by five automotive maintenance experts on semantic similarity using a five-point scale, with interrater agreement (ICC) measured at 0.87. Based on this data, contrastive learning was performed using the BAAI-bge-m3 embedding model with MultipleNegativesRankingLoss.
During training, all sentence pairs in a batch other than the positive pair were automatically treated as negative samples, so the model learned to minimize the distance between embeddings of similar sentences and maximize the distance between dissimilar ones. This in-batch negative sampling enabled effective embedding fine-tuning without constructing explicit negative samples, securing both learning efficiency and practicality, as sketched below.
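A sketch of this contrastive fine-tuning in the sentence-transformers framework follows; the batch size of 16 (Section 4.4) and the loss function come from the paper, while the example pair, epoch count, and warmup steps are placeholders.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-m3")

# Each expert-validated similar pair becomes one positive example; within a
# batch, every other pair implicitly serves as a negative.
train_examples = [
    InputExample(texts=["Check the engine oil level with the dipstick.",
                        "Use the dipstick to inspect the engine oil level."]),
    # ... remaining expert-rated similar pairs (1200 in total)
]

loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```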
For generator configuration, we used the bLLossom-8B model as a base, performed domain continual pretraining, then applied Instruction Tuning using Simple QA, RAG QA, and Multi-Turn QA datasets. The model was trained to automatically adjust response styles and structures according to query types (definitional, procedural, comparative, etc.), with LoRA-based parameter-efficient tuning (rank = 64, alpha = 128, dropout = 0.1) training only about 0.1% of total parameters.
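The reported LoRA configuration can be expressed with the Hugging Face peft library roughly as follows; the base-model checkpoint identifier and target modules are assumptions, since the paper names bLLossom-8B but does not publish either.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed checkpoint for bLLossom-8B; the paper does not name the exact one.
base = AutoModelForCausalLM.from_pretrained("MLP-KTLim/llama-3-Korean-Bllossom-8B")

config = LoraConfig(
    r=64,                # rank, as reported in Section 4.3
    lora_alpha=128,      # alpha, giving the common 2:1 alpha-to-rank ratio
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # reports the trainable-parameter fraction
```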
4.4. Model Optimization Methods and Hyperparameter Tuning
In this study, hyperparameter tuning and training stabilization techniques were systematically applied to improve the performance of retriever and generator models.
In the retriever training stage, the learning rate, batch size (8, 16, 32), and temperature parameter (0.01, 0.05, 0.1) were treated as tuning variables, and the optimal combination was determined through grid search. Average retrieval similarity and Top-k accuracy served as evaluation criteria; experiments showed that the selected learning rate combined with a batch size of 16 and a temperature of 0.05 yielded the best convergence speed and performance.
In generator training, we quantitatively analyzed performance changes across different epoch counts during Instruction Tuning-based fine-tuning. The LoRA-related parameters (rank, alpha, dropout) were empirically optimized through exploratory experiments, and the final configuration (rank = 64, alpha = 128, dropout = 0.1) was adopted for subsequent training. These settings align with prior recommendations for large-scale parameter-efficient fine-tuning [20], where an alpha-to-rank ratio of 2:1 is commonly used to maintain training stability, as further supported by QLoRA guidelines [21]. To prevent overfitting, validation loss was continuously monitored with early stopping, terminating training when the loss failed to improve for three consecutive evaluations. As a result, the generator model performed best at three epochs, although training was run to five epochs for comparative analysis.
Additionally, Mixed Precision Training, effective large-batch learning through gradient accumulation, and memory reduction through Gradient Checkpointing were applied in parallel to ensure stability during long training runs and to minimize resource consumption.
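These stabilization techniques map onto Hugging Face TrainingArguments roughly as sketched below; the batch size, accumulation steps, and output path are placeholders, while the epoch count and early-stopping patience follow the settings described above.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="generator-lora",         # placeholder path
    num_train_epochs=5,                  # trained to five epochs for comparison
    per_device_train_batch_size=2,       # placeholder
    gradient_accumulation_steps=8,       # effective large-batch training
    fp16=True,                           # mixed precision training
    gradient_checkpointing=True,         # reduces memory occupancy
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

# Stop when validation loss fails to improve for three consecutive evaluations.
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
# Trainer(model=model, args=args, ..., callbacks=[early_stop]) then runs training.
```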
5. Experimental Results and Analysis
5.1. Selection of Quantitative Evaluation Metrics and Evaluation Methods
In this study, we constructed diverse evaluation metrics covering retrieval accuracy, response generation quality, and user satisfaction to comprehensively verify the performance of the technical manual-based question-answering system. Performance was measured using semantic similarity metrics such as BERTScore and cosine similarity, together with the structural consistency metric ROUGE-L. Additionally, the system was verified through qualitative evaluation by experts familiar with large-scale technical manuals such as those used in automotive or defense settings.
Given that procedural accuracy is particularly important in domains where operational steps must be followed precisely, we performed a manual review to qualitatively assess whether response sequences preserved correct procedural order. Through this multidimensional evaluation framework, we assessed not only technical metrics but also the system’s real-world applicability and usability.
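The three quantitative metrics can be computed with standard packages, as in the sketch below (bert-score, rouge-score, and sentence-transformers); the Korean language flag and the choice of bge-m3 for cosine similarity are assumptions consistent with the system described above.

```python
from bert_score import score as bert_score
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-m3")
scorer = rouge_scorer.RougeScorer(["rougeL"])

def evaluate(prediction: str, reference: str) -> dict:
    # BERTScore F1 between generated and reference answers.
    _, _, f1 = bert_score([prediction], [reference], lang="ko")
    # ROUGE-L F-measure for structural consistency.
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure
    # Embedding cosine similarity with normalized bge-m3 vectors.
    emb = embedder.encode([prediction, reference], normalize_embeddings=True)
    cosine = float(util.cos_sim(emb[0], emb[1]))
    return {"bertscore_f1": float(f1[0]), "rouge_l": rouge_l, "cosine": cosine}
```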
5.2. Quantitative Evaluation Results Comparison of Trained Models
This section quantitatively compares the proposed multimodal RAG system with the existing QA baseline. Evaluation used BERTScore, cosine similarity, and ROUGE-L, with experiments conducted on the same maintenance query set.
Table 1 shows the performance trajectory of the base language model trained without RAG. The base model improved steadily over epochs 1–3 and then converged, with final scores of 75.81% cosine similarity, 75.10% BERTScore, and 9.09% ROUGE-L.
Table 2 shows the training performance of the proposed system with RAG applied, confirming significant improvements on all evaluation metrics. The RAG-applied system achieved a final 78.11% in both cosine similarity and BERTScore, and improved considerably to 27.12% in ROUGE-L compared to the base model.
The introduction of RAG thus yields meaningful gains in both semantic similarity and lexical accuracy: cosine similarity improved from 75.81% to 78.11% (+2.30%p) and BERTScore from 75.10% to 78.11% (+3.01%p). The ROUGE-L metric in particular improved by 18.03%p (9.09% → 27.12%), indicating substantially better technical-term accuracy and sentence-structure consistency in maintenance-manual responses. Procedural accuracy also rose to 89.7%, confirming improvements that support actual maintenance feasibility.
5.3. Qualitative Evaluation
To verify practical applicability in the maintenance domain, we conducted a user satisfaction survey targeting 20 actual military maintenance personnel (12 enlisted personnel, 8 officers). All participants had more than 1 year of practical experience and were selected from personnel with experience using existing PDF manual systems.
Users received a system demonstration and approximately 30 min of question-answering practice (covering Simple QA, RAG QA, and Multi-Turn QA), then responded to structured surveys consisting of Likert-scale satisfaction questions and optional written comments. As summarized in Table 3, the overall average satisfaction score was 4.4, with sub-scores of 4.6 for response speed, 4.5 for information clarity, 4.3 for ease of use, and 4.2 for system reliability.
All items showed significant improvements for multimodal responses compared to text responses, with the largest improvement of 0.92 points particularly in the information clarity item. This result reflects the importance of visual information in complex maintenance procedures.
5.4. Limitations and Future Directions
While the proposed multimodal RAG system demonstrates promising performance within the automotive maintenance domain, several limitations must be acknowledged to guide future research directions.
5.4.1. Scalability and Automation Challenges
The current implementation relies on 200 manually curated image–text pairs, which presents a significant scalability bottleneck for broader domain applications. This manual annotation process, while ensuring high-quality semantic alignment (96.3% accuracy as verified through cross-reviewer evaluation), becomes prohibitively resource-intensive when scaling to comprehensive technical documentation spanning thousands of pages. Future research should investigate semi-automated or fully automated alignment techniques leveraging state-of-the-art multimodal foundation models such as BLIP-2 [22] and ImageBind [23]. Such approaches could potentially reduce human annotation requirements by 80–90% while maintaining acceptable alignment quality through contrastive learning and cross-modal attention mechanisms.
5.4.2. Comparative Evaluation and Benchmarking
The absence of standardized evaluation benchmarks for domain-specific multimodal RAG systems presents a fundamental challenge for comparative analysis. Unlike general-purpose question-answering tasks that benefit from established datasets such as SQuAD or Natural Questions, technical manual question answering lacks comprehensive benchmarks that capture the procedural complexity and multimodal nature of maintenance queries. Consequently, direct quantitative comparison with existing frameworks including LangChain, GPT-4 RAG, and MuRAG remains challenging. Future work will prioritize the development of standardized evaluation protocols and benchmark datasets specifically designed for technical documentation question answering, enabling systematic comparison across different architectural approaches.
5.4.3. Model Optimization and Deployment Constraints
While this study outlined potential model compression strategies including quantization, knowledge distillation, and pruning for resource-constrained deployment scenarios, empirical validation of their impact on system performance remains incomplete. Military and industrial deployment environments often impose strict computational constraints, requiring careful balance between model capability and resource efficiency. Future research should conduct comprehensive ablation studies examining the trade-offs between compression ratios and task performance, with particular attention to accuracy preservation under various optimization techniques. Additionally, investigation of edge computing deployment strategies and offline operational capabilities represents a critical research direction for practical field applications.
5.4.4. Cross-Domain Generalization
Although the proposed framework demonstrates effectiveness within the automotive maintenance domain, claims regarding extensibility to defense, aerospace, and industrial maintenance remain speculative without empirical substantiation. Domain transfer involves challenges including terminology variation, differences in procedural complexity, structural variations in manuals, and regulatory compliance requirements that may significantly affect system performance. Future research should systematically evaluate cross-domain adaptation strategies, including domain-specific fine-tuning protocols, transfer learning methodologies, and multidomain training approaches, with particular attention to safety-critical domains where response accuracy directly affects operational safety and mission success. Until such validation is performed, the generalizability of the current system should be interpreted as a promising direction rather than a proven outcome.
5.5. Evaluation Methodology
To rigorously assess the practicality, usability, and domain suitability of the proposed multimodal RAG system, a structured three-phase evaluation was conducted with 20 active-duty military maintenance personnel. The participant pool comprised 12 enlisted soldiers (ranks E-3 to E-6) and 8 officers (ranks O-1 to O-4), each with over one year of hands-on experience using conventional PDF-based technical manuals in operational maintenance settings.
The evaluation was structured into three distinct phases. In Phase 1, “System Orientation” (30 min), participants received standardized demonstrations of system functionality, including Simple QA, RAG QA, and Multi-Turn QA, and then gained hands-on experience with real-world queries derived from the Hyundai Staria maintenance manual. In Phase 2, “Structured Survey Assessment” (45 min), each participant solved 10 representative maintenance scenarios using the system and rated its performance across four core dimensions on a five-point Likert scale (1 = strongly disagree, 5 = strongly agree): response speed (timeliness of answer generation), information clarity (ease of understanding and completeness of system output), system reliability (confidence in accuracy), and usability (interface intuitiveness and suitability for field deployment). In Phase 3, “Semi-Structured Interviews” (15 min), open-ended feedback was gathered on system strengths, shortcomings, and suggestions for future enhancements.
The overall satisfaction score averaged 4.4 out of 5 (SD = 0.6). Dimension-specific scores were 4.6 for response speed, 4.5 for information clarity, 4.3 for usability, and 4.2 for system reliability. Enlisted personnel reported slightly higher satisfaction (4.6) than officers (4.2), possibly due to the system’s simplicity and intuitiveness. These results indicate that the proposed system offers significant advantages over traditional PDF-based manuals in terms of information retrieval speed and user comprehension.
To validate annotation consistency, interrater reliability was assessed on 50 randomly selected image–text pairs independently reviewed by two expert annotators. Semantic alignment agreement was 96.3%, with a Cohen’s κ of 0.89, indicating excellent reliability.
Qualitative feedback from the interview transcripts revealed recurring themes. Forty percent of participants suggested functional enhancements such as audio output for low-light conditions and improved glossary definitions for technical terms; thirty percent expressed concerns about relying fully on AI-generated answers without traceable manual references; twenty percent proposed improvements in image–text linkage and visual annotation accuracy; and the remaining ten percent highlighted the system’s potential as a training tool for onboarding new maintenance personnel.
While the evaluation demonstrated strong user satisfaction and operational promise, it was limited to a relatively small sample size and a single domain (automotive maintenance). Future work should include broader-scale evaluations across multiple equipment types and operational conditions to assess generalizability and robustness.
6. Discussion
6.1. Analysis of Model Strengths and Limitations Based on Experimental Results
The multimodal RAG-based maintenance question-answering system proposed in this study outperformed existing keyword search-based approaches on all quantitative metrics. In particular, the 18.03%p improvement in ROUGE-L indicates significant gains in the sentence-structure consistency and technical-term accuracy required in the maintenance domain. The 89.7% procedural accuracy further demonstrates response quality that can contribute to actual maintenance feasibility and safety assurance.
User evaluation likewise recorded high satisfaction, averaging 4.4 points, and the multimodal response method combining visual information proved effective in simultaneously improving information clarity, practicality, and perceived reliability. Because visual information is often essential in maintenance work, the system’s multimodal integration approach can compensate for the practical limitations of existing maintenance manual systems.
On the other hand, the model showed relatively lower response accuracy for queries involving complex or multiple constraints (e.g., operational queries combining environmental, temporal, and functional conditions). Additionally, the current image–text mapping consists of only 200 manually constructed pairs, limiting coverage of diverse parts and situations, and accurate interpretation of complex electronic circuits or hydraulic system diagrams remains a challenge.
6.2. Considerations for Military Maintenance Manual Application
For application to military maintenance manuals, the following three considerations exist:
Security systems based on user authentication, access control, and audit logs are essential for protecting the confidentiality of military equipment information.
Maintenance manual data structure and metadata consistency conforming to international standards such as NATO STANAG 4677 must be secured, and compatibility design considering interoperability among multinational equipment is necessary.
Lightweight models or offline mode support capable of operation in unstable network environments are needed considering field deployment possibilities.
6.3. Future Research Directions and Improvement Proposals
Subsequent research should introduce multimodal pretraining models for automatic image–text alignment. Since the current manual annotation-based alignment method has scalability constraints, automatic alignment algorithms based on cross-modal contrastive learning could secure accuracy and efficiency simultaneously. Condition-aware response generation for complex queries, predictive response modules based on time-series maintenance logs, and voice-based query modules for field input should also be considered.
In terms of model compression, field-deployment optimization can be pursued through quantization, knowledge distillation, and off-device prefetching techniques, and API design for integration with defense integrated information systems should proceed in parallel. A modular framework design enabling rapid expansion to various equipment types and maintenance domains will also be an important future task.
6.4. Research Contributions
This study proposed a multimodal RAG question-answering system optimized for the maintenance domain and designed and evaluated the model based on the structure and utilization conditions of actual military maintenance manuals. Specifically, we demonstrated system effectiveness through: first, domain dataset construction based on Hyundai Staria manuals; second, parameter-efficient RAG structure design combining similarity learning-based retrievers and LoRA-based generators; third, procedure-centered question answering.
7. Conclusions
This study proposed a LoRA-tuned multimodal Retrieval-Augmented Generation (RAG) system tailored for structured PDF-based technical manuals, with Hyundai Staria maintenance documentation as a case study. The system was designed to integrate visual and textual inputs through a multimodal retriever-generator pipeline and achieve domain-specific question answering in complex industrial environments.
Through the construction of a custom dataset and the design of a visual-textual alignment mechanism, our approach demonstrated effective handling of hierarchical and image-rich manual content. The LoRA fine-tuning strategy enabled efficient domain adaptation without incurring significant computational overhead, making the system suitable for real-world deployment scenarios with resource constraints.
The experimental evaluation—comprising both quantitative retrieval metrics and expert assessments by maintenance personnel—confirmed the performance gains in accuracy, relevance, and user satisfaction. Specifically, expert feedback highlighted the model’s usability in practical maintenance workflows, validating the system’s application in industrial and military domains.
Furthermore, this work addressed limitations commonly found in existing RAG frameworks, including non-English content handling, image-text fusion clarity, and extensibility of manual annotation. The proposed architecture is not only effective for automotive maintenance but also shows strong potential for broader application across defense maintenance platforms, such as the IETM systems currently used in NATO and ADD environments.
Overall, this research provides a robust, scalable foundation for multimodal QA in specialized domains. Future work will focus on integrating vision-language pretraining, automating annotation processes through weak supervision, and applying the framework to multilingual and low-resource defense datasets.