Article

Domain-Adaptive Multimodal Large Language Models for Photovoltaic Fault Diagnosis via Dynamic LoRA Routing

1 State Grid Wenzhou Electric Power Supply Company, Wenzhou 317101, China
2 State Grid Ruian Electric Power Supply Company, Wenzhou 317101, China
3 School of Computer Science, Wuhan University, Wuhan 430072, China
4 School of Electrical and Automation, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
Processes 2026, 14(4), 653; https://doi.org/10.3390/pr14040653
Submission received: 9 January 2026 / Revised: 30 January 2026 / Accepted: 3 February 2026 / Published: 13 February 2026

Abstract

The reliability of photovoltaic (PV) equipment is vital for ensuring the safe and stable operation of power systems. While multimodal large language models (MLLMs) open up promising avenues for intelligent fault diagnosis, they often falter when confronted with the heterogeneity of PV data—where visual observations come from different sensor modalities (e.g., visible, infrared, and thermal) and display strong domain-dependent variations. Conventional Low-Rank Adaptation (LoRA) is not expressive enough to model such modality-aware differences, which can result in insufficient exploitation of informative patterns. To overcome this limitation, we propose PV-FaultExpert, a domain-adaptive MLLM designed specifically for PV equipment fault analysis. PV-FaultExpert is built upon DyLoRA (Dynamic Expert Routing with LoRA), a dynamic routing strategy that reformulates standard LoRA into a shared low-rank component coupled with multiple expert-specific adapters. A routing module then selects expert paths according to input characteristics, allowing the model to adapt to diverse modalities while maintaining parameter efficiency. Moreover, we construct a PV fault diagnosis dataset via ChatGPT-4o-assisted chain-of-thought reasoning and subsequent expert verification, which both supports model training and enables rigorous evaluation of our method. Extensive experiments demonstrate that PV-FaultExpert consistently surpasses strong baselines, including GPT-4 and Claude-3, across multiple evaluation criteria, producing fault analysis reports that are accurate, interpretable, and aligned with safety-critical requirements.

1. Introduction

Photovoltaic (PV) equipment plays a vital role in ensuring the safe and efficient operation of modern power systems [1]. Faults in PV systems can lead not only to energy yield loss and increased operational cost, but also to significant safety hazards such as fire risks or grid instability [2]. Traditionally, PV fault diagnosis has relied on periodic manual inspections and expert-driven analysis based on accumulated field experience. However, these approaches are inherently limited in scalability, timeliness, and consistency, especially as PV systems become increasingly large-scale, decentralized, and diverse in deployment [3].
Unlike conventional substations, PV systems operate under highly dynamic environmental conditions—fluctuating irradiance, temperature changes, dust accumulation, and partial shading. These factors lead to non-stationary operating behaviors and variable fault manifestations [4].
Compounding this, PV equipment (e.g., modules, inverters, combiner boxes, transformers) is monitored through heterogeneous data sources such as RGB images, synthesized or real thermal imagery, sensor logs, and inspection records—often collected via different sensors with varying resolutions and modalities [5]. This multimodal, cross-device, and cross-condition variability introduces substantial challenges in learning robust fault representations and generating consistent diagnostics [6].
Recent advances in Multimodal Large Language Models [7,8,9,10] (MLLMs) offer a promising new paradigm for intelligent fault analysis. These models are capable of jointly processing visual and textual inputs, allowing them to interpret equipment states holistically and reason across data modalities. When applied to PV diagnosis, MLLMs have the potential to unify image understanding, thermal interpretation, and expert-level reporting into a single, interpretable framework. Compared to rule-based systems, MLLMs can generalize across scenarios, uncover subtle fault patterns, and generate human-readable reports with actionable insights.
However, directly adapting MLLMs to real-world PV fault analysis presents two major challenges. First, the data collected in PV systems is heterogeneous and multimodal, encompassing images from diverse sensor types (e.g., visible light, infrared, thermal simulation) and exhibiting domain-specific variability across component types, vendors, and environmental conditions. While parameter-efficient methods are crucial for adapting large models, existing approaches like Low-Rank Adaptation [11] (LoRA) apply a static, “one-size-fits-all” adaptation logic to all inputs. Specifically, LoRA injects a single pair of low-rank matrices into each target layer, forcing the model to learn a unified adaptation for all data types. This makes it difficult to capture the modality-aware distinctions crucial for accurate diagnosis when facing the diverse data in PV systems. Second, naive instruction tuning on limited PV data can cause catastrophic forgetting [12] and reduce the model’s general reasoning capabilities, especially under noisy supervision.
To overcome these limitations, we propose PV-FaultExpert, a domain-adaptive multimodal fault analysis model designed specifically for PV equipment. At the core of our method is DyLoRA, a dynamic expert routing mechanism that improves upon standard LoRA by introducing a shared adapter matrix and multiple expert-specific branches [13]. Based on the input modality and content, DyLoRA selectively activates expert paths via a lightweight routing module, enabling fine-grained adaptation to diverse input types without increasing parameter overhead. Moreover, to further support our PV-FaultExpert, we construct a PV diagnosis dataset featuring over 35,000 samples with heterogeneous visual inputs and structured diagnostic reports. We employ Chain-of-Thought [14] (CoT) enhanced prompting with ChatGPT-4o [15] to generate pseudo-reports and further refine them through expert validation, ensuring data reliability and domain fidelity.
The contributions of this paper are summarized as follows:
  • We develop a multimodal large model for PV fault analysis by designing a novel fine-tuning method and constructing a dedicated dataset. To the best of our knowledge, this is the first multimodal large model specifically developed for PV fault analysis.
  • We propose DyLoRA, a dynamic expert routing mechanism that enables input-aware adaptation. It advances beyond the limitations of standard LoRA by integrating a shared low-rank matrix with multiple, expert-specific branches, significantly improving model flexibility and robustness in complex multimodal PV scenarios.
  • We construct a PV fault dataset of over 35,000 samples with CoT-based pseudo-annotations and multimodal inputs.
Extensive experiments show that PV-FaultExpert outperforms strong baselines, including GPT-4 [7], Claude-3 [8], and several open-source MLLMs, in generating accurate, interpretable, and safety-aware diagnostic reports under real-world PV conditions.
The remainder of this paper is organized as follows: Section 2 reviews the relevant literature and highlights the limitations this work seeks to overcome. Section 3 introduces the proposed method and the constructed dataset in detail. Section 4 presents the experimental results. Finally, Section 5 concludes this paper by summarizing the contributions of the proposed approach.

2. Related Work

2.1. Multimodal Large Language Models

MLLMs have emerged as a powerful paradigm for visual-linguistic reasoning, extending the capabilities of unimodal pre-training through cross-modal alignment. Early models such as BLIP [16] laid the foundation by integrating visual encoders with language understanding. Subsequent frameworks like MiniGPT-4 [17] and LLaVA [9] advanced this architecture by connecting pre-trained vision encoders [18,19] to large language models [20,21] via lightweight projection modules [16] or MLPs. This design enables visual features to be mapped into the latent space of LLMs, supporting unified multimodal reasoning and generation. Additionally, negative SFT [22] extracts negative supervision from multimodal RLHF and achieves comparable alignment using a simple SFT-style objective. Arcana [23] boosts MLLM vision ability via LoRA with disentangled vision and language adapters and a query ladder that aggregates intermediate visual features from a frozen encoder. InternVL [24] scales a vision foundation model to 6B parameters and progressively aligns it with an LLM using web-scale image–text data for broad visual-linguistic tasks.
However, PV equipment fault diagnosis presents unique challenges for MLLMs. Existing datasets primarily target generic visual or multimodal tasks, with little coverage of industrial fault scenarios. High-quality annotated data for PV-specific anomalies—such as inverter failures, cell degradation, or thermal inconsistencies—remains scarce and fragmented. This data gap restricts MLLMs from learning domain-relevant visual-linguistic patterns, thereby hindering their performance in practical diagnostic settings. To bridge this gap, we propose a novel dataset tailored for PV fault diagnosis. By integrating CoT reasoning from ChatGPT-4o [15] with expert validation, we curate a high-quality, multimodal dataset aligned with real-world PV scenarios. This resource enables more effective fine-tuning of MLLMs for facilitating robust, interpretable, and domain-aware fault analysis.

2.2. Parameter-Efficient Fine-Tuning

Parameter-efficient fine-tuning (PEFT) has emerged as a crucial strategy for adapting large-scale pre-trained models to downstream tasks [12,25,26]. While full-model fine-tuning remains effective in certain scenarios, it is often computationally expensive, memory-intensive, and prone to instability [27,28,29]. Moreover, it yields task-specific models that replicate the size of the original backbone, limiting scalability in resource-constrained environments.
To overcome these limitations, PEFT methods aim to adapt models by updating only a small subset of parameters while freezing the majority of pre-trained weights. These methods can be broadly categorized into four paradigms, i.e., additive, selective, reparameterized, and hybrid. Additive PEFT introduces lightweight task-specific modules into the model architecture. For example, adapter-based methods insert bottleneck layers within Transformer blocks. Serial Adapters [30] are placed after attention and feedforward layers, while Parallel Adapters [31] act as side branches. Another line of work, soft prompting [32,33], prepends learnable continuous vectors to the input sequence. Prefix-tuning [32], in particular, uses MLPs to generate task-specific prefixes with stable optimization. These approaches have been extended to multimodal settings, such as visual prompt tuning [34], which adapts vision transformers through visual prompts. Selective PEFT focuses on tuning only the most critical parameters of the model. Unstructured pruning applies binary masks to individual weights based on criteria like magnitude or Fisher information [35], while structured pruning [36] targets groups of parameters, such as entire layers or blocks, to align with hardware acceleration. Reparameterized PEFT approximates full weight updates using low-rank structures. LoRA (Low-Rank Adaptation) [11] freezes the original weights and introduces trainable low-rank matrices to simulate parameter shifts. Adaptive extensions such as AdaLoRA [37] dynamically adjust the rank allocation via singular value decomposition. In the multimodal domain, VLCLNet [26] leverages LoRA to fine-tune BLIP-2 while maintaining parameter efficiency. Hybrid PEFT combines multiple adaptation strategies for greater flexibility. UniPELT [38] integrates LoRA, prefix-tuning, and adapters through a unified gating mechanism. 
LLaVA-MoLE [39] introduces a sparse mixture of LoRA experts, enabling dynamic token routing to domain-specific experts and reducing domain conflict. LoRASculpt [12] further enhances LoRA by enforcing sparsity and introducing conflict mitigation regularization, effectively balancing generalization and domain specialization in multimodal settings. S-LoRA [40] is proposed as a serving system that enables thousands of concurrent LoRA adapters by paging adapters between CPU and GPU with unified memory management and optimized heterogeneous batching.
Despite these advances, there has been limited exploration of PEFT methods tailored to the heterogeneous multimodal data found in PV systems. In this work, we propose a novel fine-tuning framework that combines Mixture-of-Experts [13] (MoE) with a dynamic routing mechanism to better handle modality-aware variations in PV diagnostics.

2.3. Power System with MLLM

The application of MLLMs in power systems has undergone a clear evolution from perception-focused tasks to knowledge-enhanced reasoning. Early research efforts [41] primarily leveraged MLLMs for visual inspection in power grid infrastructure, focusing on tasks such as object localization and defect classification. For instance, MLLMs were fine-tuned on domain-specific datasets—e.g., insulator defect datasets [42]—to improve recognition performance in grid asset monitoring scenarios. Most recently, attention has shifted toward MLLM-based model generation, where foundational models are used to synthesize executable physics-based or structural representations of power system behavior. For example, the study [43] demonstrates the feasibility of using MLLMs to automate the generation of cyber-physical models for fault simulation and diagnostics. In addition, IVMMF [44] proposes an end-to-end industrial monitoring-to-maintenance framework that pairs a vision–language model for defect understanding with a domain knowledge-grounded LLM for maintenance dialogue and recommendations. GridMind [45] proposes an agentic framework that couples LLM planning with power-system tools/data to automate operations-and-planning workflows in an interpretable, modular manner. X-GridAgent [46] contributes a three-layer hierarchical agent architecture plus schema-adaptive hybrid RAG and prompt refinement to generalize power-grid analysis across diverse natural-language queries.
Despite many recent studies exploring the application of MLLMs in various industrial scenarios—ranging from manufacturing diagnostics to power grid inspection—most models are general-purpose and lack specialization for the unique characteristics of PV systems. That is, there is still no dedicated multimodal large language model specifically designed for PV equipment fault analysis.

3. Method

3.1. Overview

PEFT methods have become a cornerstone for adapting large pre-trained models, with Low-Rank Adaptation (LoRA) [11] being one of the most prominent techniques. The core idea of LoRA is to inject trainable low-rank matrices into the model's architecture, allowing for efficient adaptation without updating the entire set of original weights. Conventional LoRA performs low-rank adaptation by introducing two trainable matrices, $A$ and $B$, whose product $BA$ serves as a learnable offset to the frozen pre-trained weight $W_0$, as follows:

$y = W_0 x + \Delta y = W_0 x + B A x,$

where $y \in \mathbb{R}^d$ represents the output vector and $x \in \mathbb{R}^k$ denotes the input. The matrices $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$ define the low-rank adaptation, where the rank $r$ is significantly smaller than both $d$ and $k$. To ensure that the model behaves identically to the pre-trained baseline at initialization, $B$ is typically initialized with all zeros, while $A$ is initialized using Kaiming uniform initialization [47], resulting in an initial offset $\Delta y = 0$. However, the original LoRA design fine-tunes a unified set of parameters across all inputs, which becomes suboptimal in heterogeneous multimodal settings such as ours, where visual inputs originate from diverse sensor types and exhibit domain-specific variability.
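As a concrete sketch, the standard LoRA update above can be written in a few lines of NumPy. The dimensions and values here are purely illustrative, not the model's actual configuration:

```python
import numpy as np

def lora_forward(x, W0, A, B):
    """Standard LoRA forward pass: y = W0 x + B A x.

    W0 (d, k) stays frozen; only the low-rank pair A (r, k) and B (d, r)
    are trained, with rank r << min(d, k).
    """
    return W0 @ x + B @ (A @ x)

d, k, r = 8, 6, 2
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))   # e.g. Kaiming-uniform in practice
B = np.zeros((d, r))          # zero-init so the initial offset is 0
x = rng.normal(size=k)

# At initialization the adapted layer matches the frozen baseline exactly.
assert np.allclose(lora_forward(x, W0, A, B), W0 @ x)
```

Because $B$ starts at zero, fine-tuning begins from the pre-trained behavior and only gradually injects the low-rank offset.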
To address this limitation, we propose a dynamic expert routing mechanism that decomposes the standard LoRA structure into a central shared matrix $A$ and multiple specialized matrices $B_i$, each dynamically selected based on the input characteristics (as illustrated in Figure 1). This asymmetric structure allows the model to maintain generalizable knowledge through the shared component while enabling adaptive specialization through expert-specific branches, thereby enhancing its ability to learn from heterogeneous multimodal data. We denote this architecture as follows:

$W = W_0 + \Delta W = W_0 + \sum_{i=1}^{N} \omega_i \cdot B_i A,$

where $W_0$ represents the frozen pre-trained weight, and $\Delta W$ is the low-rank offset composed through a weighted aggregation of multiple expert branches. Each $B_i \in \mathbb{R}^{d \times r}$ corresponds to an expert-specific projection, while $A \in \mathbb{R}^{r \times k}$ is a shared transformation matrix across all experts. The scalar $\omega_i$ serves as a learned gating coefficient that dynamically adjusts the contribution of each expert $B_i$ based on the input context. The number of expert branches is defined by the hyperparameter $N$, which controls the model's capacity for specialization under heterogeneous conditions.

3.2. Dynamic Expert Routing with LoRA

To address the complexity of heterogeneous multimodal data—such as PV equipment images collected from different monitoring devices, imaging modalities (e.g., visible and infrared), sensor brands, and environmental conditions—we first determine the number of expert branches N, either manually based on domain-specific knowledge or through clustering algorithms like k-means applied to input features.
As shown in Figure 2, our DyLoRA is built upon a Mixture-of-Experts structure with dynamic routing, where each expert is implemented as a lightweight LoRA adapter B i , and all experts share a common low-rank matrix A. During training, the base model remains frozen, and only the expert modules and the routing mechanism are updated, enabling efficient parameter adaptation. The key innovation lies in the input-dependent dynamic routing mechanism, which allows the model to selectively activate and combine expert branches at inference time. Instead of statically assigning input to a fixed adapter, a gating network predicts soft routing weights conditioned on each input’s modality and characteristics. This enables DyLoRA to adaptively specialize its internal representations for different types of multimodal data while still benefiting from shared global knowledge captured by the common matrix A. The proposed DyLoRA can be expressed as follows:
$y = W_0 x + \sum_{i=1}^{N} \omega_i \, E_i A x,$
$\omega_i = \mathrm{sigmoid}(W_g x)_i,$

where the output $y$ is computed by combining the frozen pre-trained projection $W_0 x$ with dynamically weighted expert contributions. Each expert branch $E_i \in \mathbb{R}^{d \times r}$ operates on a shared matrix $A \in \mathbb{R}^{r \times k}$, while $\omega_i$ represents the input-dependent routing weight for expert $i$, allowing selective specialization. The routing weights $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_N)$ are obtained by passing the intermediate token representation $x$ through a trainable transformation $W_g \in \mathbb{R}^{r \times N}$, followed by a sigmoid layer. This mechanism enables the model to adaptively determine which expert(s) to activate based on the input characteristics.
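Under the same illustrative assumptions as before (small hypothetical dimensions, and routing applied directly to the input $x$ rather than an intermediate representation, for simplicity), the routed forward pass can be sketched as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dylora_forward(x, W0, A, experts, Wg):
    """DyLoRA-style sketch: y = W0 x + sum_i w_i * E_i (A x).

    A (r, k) is shared by all experts; each E_i (d, r) is expert-specific.
    Wg (N, k) produces N soft routing weights from the input; in the paper
    the gate operates on an intermediate token representation.
    """
    w = sigmoid(Wg @ x)        # (N,) input-dependent gates in (0, 1)
    shared = A @ x             # (r,) shared low-rank projection
    delta = sum(wi * (Ei @ shared) for wi, Ei in zip(w, experts))
    return W0 @ x + delta

d, k, r, N = 8, 6, 2, 4
rng = np.random.default_rng(1)
W0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))
experts = [np.zeros((d, r)) for _ in range(N)]   # zero-init like LoRA's B
Wg = rng.normal(size=(N, k))
x = rng.normal(size=k)

# With zero-initialized experts the routed layer reduces to the frozen base.
assert np.allclose(dylora_forward(x, W0, A, experts, Wg), W0 @ x)
```

Note that the shared projection $Ax$ is computed once and reused by every expert, which is what keeps the parameter and compute overhead close to that of a single LoRA adapter.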

3.3. Dataset Curation

To support robust multimodal fault diagnosis for PV equipment, we curate a high-quality dataset that combines real-world annotated data and self-constructed pseudo-labeled samples, as illustrated in Figure 3b. Our dataset is primarily built upon PVEL-AD [49], a large-scale benchmark comprising over 30,000 near-infrared images of PV cells with corresponding defect annotations. While PVEL-AD provides valuable bounding box and class-level information, such annotations are not sufficient for downstream report generation tasks that require structured, human-readable diagnostic content. To address this, we leverage ChatGPT-4o [15] with CoT prompting to automatically transform raw detection outputs into rich diagnostic reports. Each generated report includes four key components, i.e., fault type, cause analysis, maintenance recommendation, and a concise summary, enabling better alignment with practical inspection workflows.
Moreover, to further enhance the dataset’s modality diversity and component coverage, we additionally collect several hundred images of PV equipment (e.g., PV transformers and PV panels) and their corresponding thermal infrared views from publicly available online resources. These images capture broader equipment types and more complex visual patterns under varying real-world conditions (e.g., illumination, temperature). For these samples, we again employ ChatGPT-4o, a closed-source multimodal model, to generate pseudo-reports based on RGB and thermal content. As shown in Figure 3a, each report is generated using carefully crafted CoT-style prompts that guide the model to perform appearance analysis, thermal interpretation, and risk assessment. To ensure annotation quality, we adopt a two-stage report construction and validation workflow. First, pseudo-reports are automatically generated using CoT-enhanced prompting with ChatGPT-4o. Second, three domain experts with Master’s degrees in electrical engineering verify the generated reports, with primary focus on the self-curated subset. When a report is judged as incorrect by two or more experts, it is manually corrected and then re-entered into the expert checking process until it passes the review. This step guarantees domain fidelity and factual reliability across the dataset.
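The expert validation loop described above can be sketched as follows; the expert and correction callables are purely hypothetical stand-ins for what is, in practice, a manual review process:

```python
def expert_review(report, experts, correct_fn, reject_threshold=2):
    """Sketch of the two-stage validation workflow: a report rejected by
    `reject_threshold` or more experts is corrected and re-reviewed until
    it passes. `experts` are callables returning True on approval;
    `correct_fn` stands in for manual expert correction.
    """
    while sum(not approves(report) for approves in experts) >= reject_threshold:
        report = correct_fn(report)
    return report

# Toy example: three reviewers all require a "fault type" field.
reviewers = [lambda r: "fault type" in r for _ in range(3)]
fixed = expert_review(
    "hotspot observed on panel",
    reviewers,
    lambda r: r + "; fault type: hot spot",
)
assert "fault type" in fixed
```

A report that already satisfies all reviewers passes through unchanged, mirroring the fact that most ChatGPT-4o pseudo-reports required no correction.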
As a result, the final dataset includes structured, multimodal samples across diverse PV equipment, with each entry composed of RGB/thermal imagery and a corresponding diagnostic report. This curated dataset serves as the foundation for fine-tuning our PV-FaultExpert model, providing rich supervision signals for learning domain-specific reasoning patterns.

3.4. Loss Function

3.4.1. Pre-Training for Feature Alignment

Given the characteristics of our dataset—which includes pseudo-labeled diagnostic reports and heterogeneous multimodal imagery captured from various sensor types and modalities—establishing robust image–text correspondence is particularly crucial. The inherent noise in pseudo-labels, combined with the distributional shifts across different data sources (e.g., visible and thermal imaging from distinct devices), makes direct instruction tuning suboptimal. To address these challenges, we introduce a contrastive pre-training stage to explicitly align visual and textual representations. This step ensures that the model learns to associate semantically relevant visual cues with their corresponding diagnostic descriptions, even when labels are weak or data is heterogeneous. We follow the vision–language alignment framework of LLaVA [9], which optimizes the following contrastive objective:
$\mathcal{L}_{\mathrm{align}} = -\dfrac{1}{M} \sum_{i=1}^{M} \log \dfrac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{M} \exp(\mathrm{sim}(v_i, t_j)/\tau)},$

where $v_i$ and $t_i$ represent the image and text embeddings of the $i$-th pair, $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity, and $\tau$ is a learnable temperature parameter. This alignment mechanism provides a strong initialization for downstream expert-aware instruction tuning.
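A minimal NumPy sketch of this contrastive objective, assuming pre-computed embedding matrices and a fixed temperature for illustration:

```python
import numpy as np

def alignment_loss(V, T, tau=0.07):
    """InfoNCE-style image-text alignment loss over M embedding pairs.

    V, T: (M, dim) image / text embeddings. Cosine similarity reduces to
    a dot product after L2 normalization; tau is the temperature.
    """
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    logits = (V @ T.T) / tau                         # (M, M) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))                  # -1/M sum_i log p(t_i | v_i)
```

Matched image–text pairs sit on the diagonal of the similarity matrix, so the loss is minimized exactly when each image embedding is closest to its own report embedding.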

3.4.2. Fine-Tuning on Fault Analysis

We perform diagnostic report generation based on multimodal inputs from PV equipment, where visual and textual modalities must be jointly modeled to enable accurate fault description. Given an input image $I$, we first encode it using a pre-trained vision encoder (i.e., CLIP ViT-L/14 [18]), obtaining visual features $V = \mathrm{Encoder}_{\mathrm{vis}}(I)$. These features are then projected into the language embedding space through a learnable projection layer,

$H_0 = \mathrm{Proj}(\mathrm{Encoder}_{\mathrm{vis}}(I)),$
$\hat{y}_t = \mathrm{Decoder}_{\mathrm{LLM}}(\hat{y}_{<t}, H_0, T),$

where $H_0$ is the projected visual representation and $T = \{t_1, t_2, \ldots, t_n\}$ is the tokenized instruction prompt. The decoder [20] generates the output sequence token by token, conditioned on both the visual input and the prompt.
To train the model, we use a supervised cross-entropy objective that minimizes the difference between the predicted and ground-truth tokens,

$\mathcal{L}_{\mathrm{gen}} = -\dfrac{1}{l} \sum_{i=1}^{l} \sum_{j=1}^{V_d} y_{ij} \log(\hat{y}_{ij}),$

where $l$ is the length of the report, $V_d$ is the vocabulary size, $y_{ij}$ is the ground-truth one-hot indicator, and $\hat{y}_{ij}$ is the predicted probability of the $j$-th vocabulary token at position $i$.
Overall, our model is trained by jointly minimizing an objective consisting of the report generation loss and the multimodal alignment loss that promotes cross-modal semantic consistency,

$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{gen}} + \mathcal{L}_{\mathrm{align}}.$
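For illustration, the generation loss can be sketched as follows; the shapes are toy-sized, and in practice this is the standard cross-entropy applied to the decoder's per-token distributions:

```python
import numpy as np

def generation_loss(y_onehot, y_prob):
    """Token-level cross-entropy: L_gen = -1/l sum_i sum_j y_ij log(y^_ij).

    y_onehot, y_prob: (l, V_d) ground-truth one-hot rows and predicted
    token distributions (each row sums to 1). A small epsilon guards log(0).
    """
    l = y_onehot.shape[0]
    return -np.sum(y_onehot * np.log(y_prob + 1e-12)) / l

# The total objective simply adds the alignment term: L_total = L_gen + L_align.
```

A confident, correct prediction drives the loss toward zero, while a uniform distribution over the vocabulary yields the worst-case $\log V_d$ per token.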

4. Experiments

4.1. MLLMs for Fault Analysis

MLLMs sometimes lack sufficient electrical domain expertise and contextual relevance in fault diagnosis. For example, in analyzing faults within PV equipment—such as a damaged transformer breather—the model may generate a generic response like “Breather damage detected, please repair promptly,” without providing root cause analysis or specific repair guidance. By integrating a domain-specific knowledge base tailored for PV equipment, these models can produce more professional, accurate, and actionable fault analysis reports. Specifically, the analysis report includes the fault type, fault cause, and maintenance recommendations. An example report is as follows: (i) Fault Type: Damage to the breather of a PV panel transformer. (ii) Fault Cause: Excessive internal pressure leading to aging or failure of the breather’s sealing material. (iii) Maintenance Recommendations: Inspect the breather for cracks or physical damage, and replace it if necessary. Check the internal pressure of the transformer to ensure it remains within the normal operating range.
MLLMs that have been fine-tuned with domain-specific knowledge are capable of providing more precise fault descriptions, identifying specific causes, and offering targeted maintenance recommendations, thereby greatly enhancing the professionalism and applicability of the analysis. Moreover, our proposed efficient fine-tuning method leverages dynamic routing to effectively integrate heterogeneous multimodal data, further enhancing the LLM’s capability for fault representation learning.

4.2. Datasets

The dataset used in this study is primarily derived from PVEL-AD [49], a large-scale dataset containing over 30,000 near-infrared images of PV cells, along with corresponding defect annotations. ChatGPT-4o [15], with the help of CoT prompting, converts raw detection annotations into structured diagnostic reports that combine fault types, cause analysis, and maintenance recommendations to support more interpretable and actionable downstream applications. Furthermore, to enrich the dataset, we collected several hundred images of PV equipment (e.g., PV transformers and PV panels) and their corresponding thermal images from publicly available online sources. We then employed ChatGPT-4o, a state-of-the-art closed-source multimodal model, to generate pseudo-labels in the form of diagnostic reports based on the visual content. To ensure the accuracy and reliability of these generated reports, three electrical experts were invited to conduct a secondary review, during which any unreasonable or inaccurate content was carefully identified and corrected. All three experts followed the same inspection and evaluation criteria defined in the Chinese national standard Code of Operation for Photovoltaic Power Station (GB/T 38335-2019) [50]. The self-curated dataset in this paper is available upon request by contacting the authors.

4.3. Evaluation Metrics

The evaluation measures model performance on the test set by computing similarity scores between the generated reports and our annotated reference reports. Traditional metrics such as BLEU [51] and ROUGE [52] are primarily designed for tasks like machine translation and text summarization. However, due to the unique characteristics and domain-specific requirements of PV fault analysis, these general-purpose metrics fall short in effectively evaluating the quality of generated diagnostic reports. To structure our assessment, we followed the comprehensive evaluation framework proposed in [53,54]. Drawing insights from discussions with electrical power experts and referencing evaluation criteria from related domains such as healthcare [55,56], education [57], and software engineering [58], we then identified four essential qualities that define a high-quality PV fault analysis report:
  • Accuracy evaluates whether the generated report correctly identifies the fault type and cause. Let $T_R$ be the set of key diagnostic elements (e.g., fault types, causes) from the reference report, and $T_{\hat{R}}$ the corresponding set from the generated report. Accuracy is defined as follows:
    $\mathrm{Accuracy} = \dfrac{|T_R \cap T_{\hat{R}}|}{|T_R|}.$
  • Clarity evaluates the readability and structural coherence of the generated text. Following recent studies [59,60,61], we use ChatGPT-4o to score the syntactic fluency of the predicted reports.
  • Completeness measures whether the generated report includes all necessary information that appears in the ground truth. Let $C_R$ be the set of expected components and analysis items from the reference report, and $C_{\hat{R}}$ the corresponding set from the generated report. The computation follows the same set-overlap formulation as Accuracy,
    $\mathrm{Completeness} = \dfrac{|C_R \cap C_{\hat{R}}|}{|C_R|}.$
  • Practicality assesses whether the report provides clear and actionable maintenance instructions. Let $A_R$ be the set of expected repair actions in the reference report, and $A_{\hat{R}}$ the set extracted from the generated report. Similar to Accuracy, it is computed from the overlap between expected and generated repair instructions:
    $\mathrm{Practicality} = \dfrac{|A_R \cap A_{\hat{R}}|}{|A_R|}.$
  • Average Score is a metric for the report's overall quality, defined as the arithmetic mean of the four criteria above:
    $\mathrm{Average\ Score} = \dfrac{\mathrm{Acc.} + \mathrm{Cla.} + \mathrm{Com.} + \mathrm{Pra.}}{4}.$
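Since Accuracy, Completeness, and Practicality all share the same set-overlap form, a single helper suffices to compute them; the element names below are invented for illustration:

```python
def overlap_score(reference, generated):
    """Shared set-overlap formulation behind Accuracy, Completeness, and
    Practicality: |R ∩ R^| / |R|, measured against the reference set.
    """
    reference, generated = set(reference), set(generated)
    if not reference:
        return 0.0
    return len(reference & generated) / len(reference)

# e.g., the generated report recovers one of two reference fault causes:
assert overlap_score({"hot spot", "bypass diode failure"},
                     {"hot spot", "soiling"}) == 0.5
```

Because the denominator is the reference set, spurious extra items in the generated report do not inflate the score; they are instead penalized indirectly through the expert Clarity rating.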
These four dimensions form the foundation of our evaluation framework for PV fault reports. To support consistent and objective assessment, we developed a five-point scoring system for each criterion, ranging from 1 (poor) to 5 (excellent), so that report quality can be evaluated quantitatively and independently. To ensure uniformity across all metrics, Clarity is rated directly on this 1-to-5 scale, while the remaining objective metrics are first computed as a value $x$ in the range $[0, 1]$ and then mapped to the final scale using a logarithmic transformation, which better reflects the greater significance of score improvements at the higher end of the spectrum. The scaling function is defined as:
$\mathrm{Score}_{\mathrm{scaled}} = 1 + (\alpha - 1) \cdot \dfrac{\log(x + 1)}{\log(2)},$

where $\alpha$ is the maximum score of the scale ($\alpha = 5$).
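A direct transcription of this scaling function, with the paper's $\alpha = 5$ as the default:

```python
import math

def scale_score(x, alpha=5):
    """Map a raw metric x in [0, 1] onto the 1-to-alpha reporting scale:
    1 + (alpha - 1) * log(x + 1) / log(2).
    """
    return 1.0 + (alpha - 1) * math.log(x + 1.0) / math.log(2.0)

assert scale_score(0.0) == 1.0               # worst raw score maps to 1
assert abs(scale_score(1.0) - 5.0) < 1e-12   # perfect raw score maps to 5
```

Because the map is concave, a mid-range raw score such as $x = 0.5$ lands at about 3.34 rather than the linear midpoint of 3.0.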

4.4. Implementation Details

We adopted LLaVA-1.5-7B as the base architecture for developing our fault analysis model. To adapt it to our constructed dataset, we applied our proposed PEFT method. Training was conducted on four NVIDIA A100 GPUs (Nvidia, Santa Clara, CA, USA), with a learning rate of $1 \times 10^{-4}$, a batch size of 10, and 20 epochs. The fine-tuning configuration used a LoRA rank of 64, an alpha value of 16, and a dropout rate of 0.05, enabling efficient adaptation of the model to the PV domain without incurring significant computational cost.

4.5. Baselines

To evaluate the effectiveness of our proposed model, we conducted a comparative study with six representative baseline models, including GPT-4 [7], Claude-3 [62,63,64], LLaVA-1.5-7B [9,65], VisualGLM-6B [66,67], Qwen2-VL-7B [10,68], and MiniGPT-4 [17,69]. We randomly selected 1000 samples from the dataset as the testing set, covering various fault types observed in PV equipment. Each baseline model was used to generate corresponding fault analysis reports based on these samples. To ensure an objective evaluation, three engineers with expertise in PV systems and electrical engineering were invited to review and rate the generated reports according to a predefined set of assessment criteria.

4.6. Reasoning Analysis

As shown in Figure 4, our PV-FaultExpert effectively interprets heterogeneous multimodal inputs and generates structured diagnostic reports that align closely with human expert reasoning. The model accurately identifies key visual attributes from RGB images, incorporates auxiliary cues from thermal data, and integrates domain knowledge to provide fault descriptions, risk assessments, and safety-aware conclusions. These results demonstrate the model’s capability not only to capture fine-grained visual features but also to produce coherent and actionable outputs, highlighting its practical utility in real-world PV equipment inspection.

4.7. Comparative Results

Model performance on PV fault analysis requires not only an accurate understanding of fault types, but also the ability to generate clear, complete, and practical diagnostic reports. To ensure a fair comparison, all baseline models were evaluated on the same dataset using a standardized protocol covering four key dimensions: Accuracy, Clarity, Completeness, and Practicality. As shown in Table 1, our approach first establishes its superiority over traditional deep learning methods, significantly outperforming both classic CNN-based and transformer-based architectures. Our proposed model, PV-FaultExpert, trained with domain-specific knowledge and a PEFT strategy, achieves the best overall performance across all metrics. Specifically, PV-FaultExpert outperforms GPT-4 and Claude-3 by average margins of 0.98 and 0.92, respectively, demonstrating significant advantages over even the most advanced closed-source models. Additionally, when comparing PV-FaultExpert with LLaVA-1.5-7B (fine-tuned), which was also adapted to our dataset, we observe consistent improvements of 0.77 in Accuracy, 0.71 in Clarity, 0.84 in Completeness, and 0.89 in Practicality. These results highlight the effectiveness of our fine-tuning method and the importance of incorporating domain-specific knowledge into multimodal large models.
Furthermore, to enable a fairer evaluation of report generation in more realistic settings, we constructed a harder, more realistic test subset: five invited experts manually annotated 50 PV multimodal images in free writing styles, entirely without AI assistance. We then re-evaluated all methods on this harder test set. As shown in Table 1, performance decreases for all methods on this subset, with the drop most pronounced for the Clarity metric, which can be attributed to the increased variability introduced by diverse and unconstrained writing styles.

4.8. Ablation Study

As shown in Table 2, the full version of our model, PV-FaultExpert, outperforms all ablated variants across all evaluation dimensions. The fusion step integrates specialized insights from multiple experts by aggregating their outputs via a router-determined weighted sum, forming a coherent diagnostic signal that is essential for analyzing complex, multimodal faults. Removing this fusion operation after routing (Row a) leads to a noticeable drop in average score from 4.41 to 3.93, confirming the importance of integrating the selected experts’ outputs into a single, coherent diagnosis. The router enables conditional computation by analyzing each input to weight and activate only the most relevant experts for a given task. Further removing the dynamic routing mechanism (Row b) results in a decrease of 0.16 in the overall average, showing that the router plays a critical role in assigning each input to the most relevant expert(s) and ensuring the right expertise is applied to the right problem. The mixture-of-experts (MoE) design enables specialization by using an ensemble of low-rank adapters to overcome the representational bottleneck of processing heterogeneous data. When the MoE module is disabled (Row c), performance drops even further to 3.53, indicating that the MoE framework is essential for cultivating distinct “experts” capable of mastering the unique features of each data modality. These results demonstrate that each component of PV-FaultExpert contributes positively to model performance, and that the full architecture is essential for generating high-quality PV fault analysis reports.
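The router-weighted fusion described above can be sketched in a few lines. This is a toy, dependency-free illustration with hand-picked weights, not the paper’s implementation: a shared down-projection A is computed once, each expert applies its own up-projection B_i, and sigmoid routing scores weight the sum.

```python
import math

# Toy DyLoRA update: shared A with rank r = 2, three expert B_i matrices,
# and a linear router followed by a sigmoid (all weights illustrative).
d, r = 3, 2
A = [[0.1, 0.2, 0.3],
     [0.0, -0.1, 0.2]]                       # shared down-projection (r x d)
experts = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],    # B_1 (d x r)
    [[0.0, 1.0], [1.0, 0.0], [0.2, 0.2]],    # B_2
    [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]],    # B_3
]
router = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]                   # routing weights (n_experts x d)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dylora_delta(x):
    """Router-weighted sum of expert outputs B_i (A x); the result is
    added to the frozen layer's output, as in standard LoRA."""
    gates = [1.0 / (1.0 + math.exp(-s)) for s in matvec(router, x)]  # sigmoid
    shared = matvec(A, x)                    # computed once, shared by experts
    out = [0.0] * d
    for g, B in zip(gates, experts):
        out = [o + g * b for o, b in zip(out, matvec(B, shared))]
    return out

delta = dylora_delta([1.0, 0.5, -0.5])
```

Sharing A keeps the parameter count close to a single LoRA adapter, while the per-expert B_i matrices provide the modality-specific capacity the ablation measures.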
In addition, to investigate the impact of expert quantity in our proposed DyLoRA framework, we conduct ablation studies with varying numbers of expert branches N ∈ {1, 2, 3, 4, 5}. As shown in Figure 5, model performance improves steadily from N = 1 to N = 3, reaching the highest average score of 4.41 at N = 3. This suggests that incorporating multiple specialized expert branches helps the model better adapt to the heterogeneity of PV multimodal data. However, increasing the number of experts beyond three does not lead to further gains; in fact, a slight performance drop is observed. This may be attributed to over-fragmentation of the data across too many experts, which reduces the effectiveness of individual branches. Therefore, selecting an appropriate number of expert adapters is essential to balance model capacity and generalization. We also use a routing confusion matrix to verify whether DyLoRA experts specialize in different modalities. Specifically, we re-ran inference on the hard test set described above and recorded the routing scores after the sigmoid activation, from which we constructed a routing confusion matrix (Table 3). The results show that, after fine-tuning, the experts in DyLoRA indeed focus on different modalities, providing empirical evidence of expert specialization and further indicating that DyLoRA is beneficial for handling heterogeneous multimodal data.
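A routing confusion matrix of this kind can be built by averaging the post-sigmoid routing scores per input modality. The records below are hypothetical numbers for illustration only, assuming three experts:

```python
from collections import defaultdict

# Hypothetical (modality, per-expert routing score) records logged at inference.
records = [
    ("visible",  [0.9, 0.1, 0.2]),
    ("visible",  [0.8, 0.2, 0.1]),
    ("infrared", [0.1, 0.9, 0.3]),
    ("thermal",  [0.2, 0.3, 0.8]),
]

def routing_confusion(records, n_experts=3):
    """Average post-sigmoid routing score of each expert per modality."""
    sums = defaultdict(lambda: [0.0] * n_experts)
    counts = defaultdict(int)
    for modality, scores in records:
        counts[modality] += 1
        sums[modality] = [s + x for s, x in zip(sums[modality], scores)]
    return {m: [s / counts[m] for s in sums[m]] for m in sums}

matrix = routing_confusion(records)
# matrix["visible"] -> [0.85, 0.15, 0.15]: expert 1 dominates visible inputs.
```

A strongly diagonal matrix (one dominant expert per modality) is the signature of expert specialization reported in Table 3.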

4.9. Computational Efficiency

As our experimental results show (see Table 4), while our multimodal large model approach is more computationally demanding than traditional deep learning methods (e.g., transformers or CNNs with LSTMs), the performance it achieves is substantially superior. Leveraging large models to push the boundaries of performance in various downstream tasks has become a common and well-established trend in recent research [55,72]. On the one hand, in real-world applications, it is standard practice to deploy large models on cloud-based infrastructure, which effectively meets their computational requirements. On the other hand, a significant and growing body of work is investigating the distillation of knowledge from large models into smaller, domain-specific ones [73,74]. Recent studies such as SmolVLM [75], Nano-R1, and TinyLLaVA [76] have demonstrated that large multimodal model knowledge can be distilled into compact multimodal models for edge deployment. After quantization, these models can even run on resource-constrained devices with CPU-only configurations and around 4 GB of memory. Exploring such techniques could further enhance the deployment possibilities of powerful models like ours. In summary, the rich knowledge inherent in large models is invaluable for specialized, vertical domains such as PV fault analysis, and the ongoing research into making these models more efficient and accessible is a highly promising direction.

5. Conclusions

In this work, we propose PV-FaultExpert, a domain-adaptive multimodal model tailored for PV equipment fault analysis. Built upon a curated dataset of multimodal information related to PV equipment, our method incorporates a PEFT strategy to effectively adapt large models to the PV system. At the core of our approach is a LoRA-based fine-tuning method enhanced with a mixture-of-experts mechanism, which enables the model to selectively attend to different components during training. This adaptive focus improves both the precision of fine-tuning and the efficiency of inference, making the model more capable of handling domain-specific challenges in PV diagnostics. Experiments demonstrate that PV-FaultExpert consistently outperforms several recent popular AI models across multiple evaluation metrics. These results confirm the effectiveness of our design in generating more accurate and actionable fault analysis reports for PV equipment.

6. Limitations

Our approach is based on pseudo-labeled reports generated with LLM assistance and subsequently verified by experts, which may still introduce residual bias from the synthetic annotation pipeline. Although we further evaluate on a small human-annotated subset, larger fully human-labeled benchmark datasets would provide a stronger basis for validating real-world diagnostic reliability. In addition, the current dataset is collected under specific hardware settings and operational conditions. PV fault appearances and thermal patterns may vary across module vendors, inverter types, installation configurations, and geographic regions with different climate and irradiance characteristics. As a result, the generalization of the proposed model may be limited when deployed in unseen domains. Future work will expand cross-vendor and cross-region data coverage and explore domain adaptation strategies.

Author Contributions

Conceptualization, J.W. and Y.C.; methodology, J.W. and Q.M.; software, Q.M.; validation, Q.M. and J.W.; formal analysis, J.Z.; investigation, M.Y.; resources, J.Z. and M.C.; data curation, Q.M. and M.C.; writing—original draft preparation, J.W. and Q.M.; writing—review and editing, J.W. and Q.M.; visualization, Y.C.; supervision, J.Z. and M.C.; project administration, J.W.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of State Grid Corporation of China, grant number 5400-20241918A-1-1-ZN.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Junjian Wu was employed by State Grid Wenzhou Electric Power Supply Company. Author Yiwei Chen was employed by State Grid Ruian Electric Power Supply Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare that this study received funding from State Grid Corporation of China. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

References

  1. Osmani, K.; Haddad, A.; Lemenand, T.; Castanier, B.; Ramadan, M. A review on maintenance strategies for PV systems. Sci. Total Environ. 2020, 746, 141753. [Google Scholar]
  2. Aram, M.; Zhang, X.; Qi, D.; Ko, Y. A state-of-the-art review of fire safety of photovoltaic systems in buildings. J. Clean. Prod. 2021, 308, 127239. [Google Scholar] [CrossRef]
  3. Shafiullah, M.; Ahmed, S.D.; Al-Sulaiman, F.A. Grid integration challenges and solution strategies for solar PV systems: A review. IEEE Access 2022, 10, 52233–52257. [Google Scholar] [CrossRef]
  4. Kheirrouz, M.; Melino, F.; Ancona, M.A. Fault detection and diagnosis methods for green hydrogen production: A review. Int. J. Hydrogen Energy 2022, 47, 27747–27774. [Google Scholar] [CrossRef]
  5. Chang, Z.; Han, T. Prognostics and health management of photovoltaic systems based on deep learning: A state-of-the-art review and future perspectives. Renew. Sustain. Energy Rev. 2024, 205, 114861. [Google Scholar] [CrossRef]
  6. Li, Y.; Cao, Y.; Cui, X.; Zhang, Y.; Mukhopadhyay, S.C.; Li, Y.; Cui, L.; Liu, Z.; Li, S. Semantic consistency guided hybrid-invariant transformer for domain adaptation in multi-view echo quality assessment. IEEE Trans. Instrum. Meas. 2025, 74, 4008519. [Google Scholar]
  7. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  8. Anthropic. Claude 3 Haiku: Our Fastest Model Yet. 2024. Available online: https://www.anthropic.com/news/claude-3-haiku (accessed on 30 July 2025).
  9. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  10. Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar]
  11. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  12. Liang, J.; Huang, W.; Wan, G.; Yang, Q.; Ye, M. LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  13. Xue, F.; Shi, Z.; Wei, F.; Lou, Y.; Liu, Y.; You, Y. Go wider instead of deeper. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022. [Google Scholar]
  14. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  15. OpenAI. ChatGPT-4o. 2025. Available online: https://chat.openai.com/ (accessed on 30 July 2025).
  16. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning (PMLR), Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
  17. Zhu, D.; Ye, M.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv 2023, arXiv:2304.10592. [Google Scholar]
  18. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021. [Google Scholar]
  19. Fang, Y.; Wang, W.; Xie, B.; Sun, Q.; Wu, L.; Wang, X.; Huang, T.; Wang, X.; Cao, Y. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  20. Chiang, W.L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J.E.; et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. 2023. Available online: https://lmsys.org/blog/2023-03-30-vicuna/ (accessed on 14 April 2023).
  21. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  22. Zhu, K.; Wang, Y.; Sun, Y.; Chen, Q.; Liu, J.; Zhang, G.; Wang, J. Continual sft matches multimodal rlhf with negative supervision. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  23. Sun, Y.; Zhang, H.; Chen, Q.; Zhang, X.; Sang, N.; Zhang, G.; Wang, J.; Li, Z. Improving multi-modal large language model through boosting vision capabilities. arXiv 2024, arXiv:2410.13733. [Google Scholar] [CrossRef]
  24. Chen, Z.; Wu, J.; Wang, W.; Su, W.; Chen, G.; Xing, S.; Zhong, M.; Zhang, Q.; Zhu, X.; Lu, L.; et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  25. Wang, Q.; Zhang, J.; Du, J.; Zhang, K.; Li, R.; Zhao, F.; Zou, L.; Xie, C. A fine-tuned multimodal large model for power defect image-text question-answering. Signal Image Video Process. 2024, 18, 9191–9203. [Google Scholar]
  26. Mu, J.; Wang, W.; Liu, W.; Yan, T.; Wang, G. Multimodal Large Language Model with LoRA Fine-Tuning for Multimodal Sentiment Analysis. ACM Trans. Intell. Syst. Technol. 2024, 16, 139. [Google Scholar]
  27. Han, Z.; Gao, C.; Liu, J.; Zhang, J.; Zhang, S.Q. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv 2024, arXiv:2403.14608. [Google Scholar]
  28. Wang, J.; Song, Q.; Qian, L.; Li, H.; Peng, Q.; Zhang, J. SubstationAI: Multimodal Large Model-Based Approaches for Analyzing Substation Equipment Faults. arXiv 2024, arXiv:2412.17077. [Google Scholar]
  29. Biderman, D.; Portes, J.; Ortiz, J.J.G.; Paul, M.; Greengard, P.; Jennings, C.; King, D.; Havens, S.; Chiley, V.; Frankle, J.; et al. Lora learns less and forgets less. arXiv 2024, arXiv:2405.09673. [Google Scholar] [CrossRef]
  30. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
  31. He, J.; Zhou, C.; Ma, X.; Berg-Kirkpatrick, T.; Neubig, G. Towards a unified view of parameter-efficient transfer learning. arXiv 2021, arXiv:2110.04366. [Google Scholar]
  32. Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190. [Google Scholar] [CrossRef]
  33. Liu, X.; Ji, K.; Fu, Y.; Du, Z.; Yang, Z.; Tang, J. P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks. arXiv 2021, arXiv:2110.07602. [Google Scholar]
  34. Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022. [Google Scholar]
  35. Das, S.S.S.; Zhang, R.H.; Shi, P.; Yin, W.; Zhang, R. Unified low-resource sequence labeling by sample-aware dynamic sparse finetuning. arXiv 2023, arXiv:2311.03748. [Google Scholar]
  36. Lawton, N.; Kumar, A.; Thattai, G.; Galstyan, A.; Steeg, G.V. Neural architecture search for parameter-efficient fine-tuning of large pre-trained language models. arXiv 2023, arXiv:2305.16597. [Google Scholar]
  37. Zhang, Q.; Chen, M.; Bukharin, A.; Karampatziakis, N.; He, P.; Cheng, Y.; Chen, W.; Zhao, T. Adalora: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv 2023, arXiv:2303.10512. [Google Scholar]
  38. Mao, Y.; Mathias, L.; Hou, R.; Almahairi, A.; Ma, H.; Han, J.; Yih, W.t.; Khabsa, M. Unipelt: A unified framework for parameter-efficient language model tuning. arXiv 2021, arXiv:2110.07577. [Google Scholar]
  39. Chen, S.; Jie, Z.; Ma, L. Llava-mole: Sparse mixture of lora experts for mitigating data conflicts in instruction finetuning mllms. arXiv 2024, arXiv:2401.16160. [Google Scholar]
  40. Sheng, Y.; Cao, S.; Li, D.; Hooper, C.; Lee, N.; Yang, S.; Chou, C.; Zhu, B.; Zheng, L.; Keutzer, K.; et al. S-lora: Serving thousands of concurrent lora adapters. arXiv 2023, arXiv:2311.03285. [Google Scholar]
  41. Lopes, F.; Rocha, P.; Coelho, A. Towards Automated Visual Inspection of Electrical Grid Assets for the Smart Grid - An Application to HV Insulators. In Proceedings of the 2024 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), Oslo, Norway, 17–20 September 2024. [Google Scholar]
  42. Zheng, J.; Wu, H.; Zhang, H.; Wang, Z.; Xu, W. Insulator-Defect Detection Algorithm Based on Improved YOLOv7. Sensors 2022, 22, 8801. [Google Scholar] [CrossRef]
  43. Merkelbach, S.; Diedrich, A.; Sztyber-Betley, A.; Travé-Massuyès, L.; Chanthery, E.; Niggemann, O.; Dumitrescu, R. Using Multi-Modal LLMs to Create Models for Fault Diagnosis (Short Paper). In Proceedings of the 35th International Conference on Principles of Diagnosis and Resilient Systems (DX 2024), Vienna, Austria, 4–7 November 2024; Schloss Dagstuhl–Leibniz-Zentrum für Informatik: Wadern, Germany, 2024. [Google Scholar]
  44. Wang, H.; Li, C.; Li, Y.F.; Tsung, F. An Intelligent Industrial Visual Monitoring and Maintenance Framework Empowered by Large-Scale Visual and Language Models. IEEE Trans. Ind. Cyber-Phys. Syst. 2024, 2, 166–175. [Google Scholar] [CrossRef]
  45. Jin, H.; Kim, K.; Kwon, J. GridMind: LLMs-Powered Agents for Power System Analysis and Operations. In Proceedings of the SC’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, MO, USA, 16–21 November 2025. [Google Scholar]
  46. Wen, Y.; Chen, X. X-GridAgent: An LLM-Powered Agentic AI System for Assisting Power Grid Analysis. arXiv 2025, arXiv:2512.20789. [Google Scholar]
  47. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  48. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  49. Su, B.; Zhou, Z.; Chen, H. PVEL-AD: A large-scale open-world dataset for photovoltaic cell anomaly detection. IEEE Trans. Ind. Inform. 2022, 19, 404–413. [Google Scholar]
  50. GB/T 38335-2019; Code of Operation for Photovoltaic Power Station. Standards Press of China: Beijing, China, 2019.
  51. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002. [Google Scholar]
  52. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004. [Google Scholar]
  53. Lynch, S.; Savary-Bataille, K.; Leeuw, B.; Argyle, D. Development of a questionnaire assessing health-related quality-of-life in dogs and cats with cancer. Vet. Comp. Oncol. 2011, 9, 172–182. [Google Scholar] [CrossRef]
  54. Lahat, A.; Sharif, K.; Zoabi, N.; Shneor Patt, Y.; Sharif, Y.; Fisher, L.; Shani, U.; Arow, M.; Levin, R.; Klang, E. Assessing generative pretrained transformers (GPT) in clinical decision-making: Comparative analysis of GPT-3.5 and GPT-4. J. Med. Internet Res. 2024, 26, e54571. [Google Scholar] [CrossRef]
  55. Zhang, H.; Chen, J.; Jiang, F.; Yu, F.; Chen, Z.; Li, J.; Chen, G.; Wu, X.; Zhang, Z.; Xiao, Q.; et al. Huatuogpt, towards taming language model to be a doctor. arXiv 2023, arXiv:2305.15075. [Google Scholar] [CrossRef]
  56. Tiffin, P.A.; Finn, G.M.; McLachlan, J.C. Evaluating professionalism in medical undergraduates using selected response questions: Findings from an item response modelling study. BMC Med. Educ. 2011, 11, 43. [Google Scholar] [CrossRef]
  57. Ghafourian, Y.; Hanbury, A.; Knoth, P. Readability measures as predictors of understandability and engagement in searching to learn. In Linking Theory and Practice of Digital Libraries, Proceedings of the 27th International Conference on Theory and Practice of Digital Libraries, TPDL 2023, Zadar, Croatia, 26–29 September 2023; Springer: Cham, Switzerland, 2023; pp. 173–181. [Google Scholar]
  58. Karunaratne, S.; Dharmarathna, D. A review of comprehensiveness, user-friendliness, and contribution for sustainable design of whole building environmental life cycle assessment software tools. Build. Environ. 2022, 212, 108784. [Google Scholar] [CrossRef]
  59. Bai, Y.; Ying, J.; Cao, Y.; Lv, X.; He, Y.; Wang, X.; Yu, J.; Zeng, K.; Xiao, Y.; Lyu, H.; et al. Benchmarking foundation models with language-model-as-an-examiner. Adv. Neural Inf. Process. Syst. 2023, 36, 78142–78167. [Google Scholar]
  60. Li, Z.; Xu, X.; Shen, T.; Xu, C.; Gu, J.C.; Lai, Y.; Tao, C.; Ma, S. Leveraging large language models for NLG evaluation: Advances and challenges. arXiv 2024, arXiv:2401.07103. [Google Scholar]
  61. Bavaresco, A.; Bernardi, R.; Bertolazzi, L.; Elliott, D.; Fernández, R.; Gatt, A.; Ghaleb, E.; Giulianelli, M.; Hanna, M.; Koller, A.; et al. Llms instead of human judges? a large scale empirical study across 20 nlp evaluation tasks. arXiv 2024, arXiv:2406.18403. [Google Scholar] [CrossRef]
  62. Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv 2022, arXiv:2204.05862. [Google Scholar] [CrossRef]
  63. Ziegler, D.M.; Stiennon, N.; Wu, J.; Brown, T.B.; Radford, A.; Amodei, D.; Christiano, P.; Irving, G. Fine-tuning language models from human preferences. arXiv 2019, arXiv:1909.08593. [Google Scholar]
  64. Askell, A.; Bai, Y.; Chen, A.; Drain, D.; Ganguli, D.; Henighan, T.; Jones, A.; Joseph, N.; Mann, B.; DasSarma, N.; et al. A general language assistant as a laboratory for alignment. arXiv 2021, arXiv:2112.00861. [Google Scholar] [CrossRef]
  65. Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  66. Du, Z.; Qian, Y.; Liu, X.; Ding, M.; Qiu, J.; Yang, Z.; Tang, J. Glm: General language model pretraining with autoregressive blank infilling. arXiv 2021, arXiv:2103.10360. [Google Scholar]
  67. Ding, M.; Yang, Z.; Hong, W.; Zheng, W.; Zhou, C.; Yin, D.; Lin, J.; Zou, X.; Shao, Z.; Yang, H.; et al. Cogview: Mastering text-to-image generation via transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 19822–19835. [Google Scholar]
  68. Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966. [Google Scholar]
  69. Chen, J.; Zhu, D.; Shen, X.; Li, X.; Liu, Z.; Zhang, P.; Krishnamoorthi, R.; Chandra, V.; Xiong, Y.; Elhoseiny, M. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv 2023, arXiv:2310.09478. [Google Scholar]
  70. Xue, Y.; Xu, T.; Rodney Long, L.; Xue, Z.; Antani, S.; Thoma, G.R.; Huang, X. Multimodal recurrent model with attention for automated radiology report generation. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2018: 21st International Conference, Granada, Spain, 16–20 September 2018, Proceedings, Part I; Springer: Cham, Switzerland, 2018; pp. 457–466. [Google Scholar]
  71. Chen, Z.; Song, Y.; Chang, T.H.; Wan, X. Generating radiology reports via memory-driven transformer. arXiv 2020, arXiv:2010.16056. [Google Scholar]
  72. Yang, D.; Wei, J.; Xiao, D.; Wang, S.; Wu, T.; Li, G.; Li, M.; Wang, S.; Chen, J.; Jiang, Y.; et al. Pediatricsgpt: Large language models as chinese medical assistants for pediatric applications. Adv. Neural Inf. Process. Syst. 2024, 37, 138632–138662. [Google Scholar]
  73. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  74. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  75. Marafioti, A.; Zohar, O.; Farré, M.; Noyan, M.; Bakouch, E.; Cuenca, P.; Zakka, C.; Allal, L.B.; Lozhkov, A.; Tazi, N.; et al. Smolvlm: Redefining small and efficient multimodal models. arXiv 2025, arXiv:2504.05299. [Google Scholar] [CrossRef]
  76. Zhou, B.; Hu, Y.; Weng, X.; Jia, J.; Luo, J.; Liu, X.; Wu, J.; Huang, L. Tinyllava: A framework of small-scale large multimodal models. arXiv 2024, arXiv:2402.14289. [Google Scholar]
Figure 1. Comparison of standard LoRA and our proposed DyLoRA architecture. (a) Standard LoRA injects low-rank adaptation by optimizing two trainable matrices A and B on frozen weights. (b) DyLoRA introduces a shared matrix A and multiple expert-specific matrices B i , dynamically selected via a routing mechanism based on heterogeneous multimodal representations. This design enhances adaptability to heterogeneous input modalities while preserving parameter efficiency. MHSA refers to the MultiHead Self-Attention module.
Figure 2. The overall architecture of our proposed DyLoRA. Given heterogeneous multimodal inputs (e.g., visible, thermal, infrared images) and task instructions, we first extract unified multimodal representations. These representations are fed into a shared low-rank adapter A and multiple expert-specific adapters B i , where a trainable dynamic router predicts expert weights based on semantic content. The selected expert outputs are aggregated to adaptively generate downstream task outputs, achieving both parameter efficiency and robust generalization to diverse modalities. Attention, LN, and FNN refer to MultiHead Self-Attention module, Layer Normalization, and Feedforward Neural Network [48], respectively.
Figure 3. Illustration of the data construction pipeline. (a) We leverage domain-specific CoT prompts with ChatGPT-4o to automatically generate diverse power-related Q&A pairs, synthesize missing modalities (e.g., thermal from visible), and produce pseudo-reports for multimodal PV equipment analysis. (b) The annotation pipeline refines raw detection annotations into structured reports using step-by-step prompting, transforming bounding box metadata into rich semantic annotations that capture defect type, cause, and spatial extent.
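Step (b) of the pipeline can be illustrated with a toy transformation. The function below is a hypothetical sketch of refining one raw detection annotation into a structured report entry; the field names (`bbox`, `defect`, `cause`) and the output schema are illustrative assumptions, not the dataset's actual format.

```python
def annotation_to_report(ann):
    """Turn raw bounding-box metadata into a structured report entry
    capturing defect type, likely cause, and spatial extent.
    The schema here is illustrative, not the paper's actual one."""
    x1, y1, x2, y2 = ann["bbox"]
    area = (x2 - x1) * (y2 - y1)  # spatial extent in pixels
    return {
        "defect_type": ann["defect"],
        "likely_cause": ann.get("cause", "unspecified"),
        "spatial_extent_px": area,
        "summary": (
            f"{ann['defect']} detected in region {ann['bbox']} "
            f"covering {area} px^2."
        ),
    }
```

In the actual pipeline this structured record would then be expanded into a free-form report by step-by-step prompting, rather than by string templating alone.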
Figure 4. Example outputs generated by PV-FaultExpert on multimodal PV equipment data. Each case includes the heterogeneous multimodal input and the predicted fault analysis report. The results demonstrate PV-FaultExpert’s ability to interpret heterogeneous multimodal inputs and produce detailed, safety-aware assessments.
Figure 5. The performance of our DyLoRA under different numbers of expert branches.
Table 1. Quantitative comparison results (higher is better) for PV fault analysis.
| Method | Acc. | Cla. | Com. | Pra. | Avg. |
|---|---|---|---|---|---|
| MRG [70] | 2.23 / 1.12 | 2.85 / 0.98 | 2.60 / 1.18 | 2.10 / 1.10 | 2.45 / 1.10 |
| MDTransformer [71] | 2.92 / 1.65 | 3.58 / 1.05 | 3.61 / 2.43 | 3.36 / 1.78 | 3.49 / 1.73 |
| GPT-4 [7] | 3.55 / 2.60 | 3.11 / 2.10 | 2.78 / 1.16 | 2.98 / 1.32 | 2.95 / 1.80 |
| Claude-3 [8] | 3.42 / 2.24 | 3.58 / 1.98 | 3.61 / 2.31 | 3.36 / 2.22 | 3.49 / 2.19 |
| LLaVA1.5-7B [9] | 3.12 / 1.82 | 3.16 / 1.68 | 3.28 / 2.01 | 3.06 / 1.92 | 3.15 / 1.86 |
| VisualGLM-6B [66] | 3.05 / 2.01 | 3.18 / 1.77 | 3.12 / 1.98 | 3.04 / 1.98 | 3.10 / 1.94 |
| Qwen2-VL-7B [10] | 2.94 / 1.72 | 2.88 / 1.53 | 3.10 / 2.12 | 3.01 / 1.87 | 2.98 / 1.81 |
| MiniGPT-4 [69] | 2.55 / 1.68 | 2.62 / 1.04 | 2.70 / 1.88 | 2.65 / 1.51 | 2.63 / 1.53 |
| LLaVA1.5-7B † | 3.68 / 2.39 | 3.55 / 2.00 | 3.71 / 2.59 | 3.50 / 2.43 | 3.61 / 2.35 |
| PV-FaultExpert (ours) † | 4.45 / 2.97 | 4.26 / 2.57 | 4.55 / 2.86 | 4.39 / 2.74 | 4.41 / 2.81 |
† indicates that the method was pre-trained and fine-tuned on our curated dataset. LLaVA1.5-7B† was fine-tuned using standard LoRA, while our PV-FaultExpert employed the proposed DyLoRA strategy. In each cell, the score before the slash is measured on the AI-assisted annotated test set, while the score after the slash is measured on a human-annotated test subset of 50 samples whose reports are written in free, unconstrained styles.
Table 2. Ablation studies on our proposed efficient fine-tuning method.
| Method | Acc. | Cla. | Com. | Pra. | Avg. |
|---|---|---|---|---|---|
| PV-FaultExpert (DyLoRA) † | 4.45 | 4.26 | 4.55 | 4.39 | 4.41 |
| ↪ (a) w/o Fusion | 4.01 | 3.82 | 4.02 | 3.87 | 3.93 |
| ↪ (b) w/o Router | 3.88 | 3.70 | 3.85 | 3.66 | 3.77 |
| ↪ (c) w/o MoE | 3.62 | 3.48 | 3.61 | 3.40 | 3.53 |
† indicates that the method was pre-trained and fine-tuned on our curated dataset. The arrows denote ablation variants derived from the full PV-FaultExpert.
Table 3. Routing confusion matrix on the hard test set (human-annotated test subset).
| Modalities | Expert #1 | Expert #2 | Expert #3 |
|---|---|---|---|
| Infrared + Text | 0.54 | 0.33 | 0.20 |
| RGB + Text | 0.24 | 0.67 | 0.29 |
| Thermal + Text | 0.26 | 0.33 | 0.58 |
| RGB + Thermal + Text | 0.13 | 0.69 | 0.51 |
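A table of this shape can be produced by averaging, per modality combination, the softmax weight the router assigns to each expert across test samples. The helper below is an illustrative sketch of that aggregation; the paper's exact protocol (e.g., averaging weights vs. counting top-1 selections) is not specified here, and the sample numbers in the test are synthetic.

```python
import numpy as np

def routing_matrix(weights_by_modality):
    """Aggregate per-sample router weights into a modality-by-expert table:
    each row is the mean routing weight each expert receives for one input
    modality combination. `weights_by_modality` maps a modality label to an
    (n_samples, n_experts) array of softmax routing weights."""
    return {
        modality: np.asarray(w).mean(axis=0)
        for modality, w in weights_by_modality.items()
    }
```

Inspecting such a matrix shows whether the router has learned a modality-to-expert specialization, as the dominant diagonal in Table 3 suggests.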
Table 4. Computational efficiency of the evaluated models.
| Method | GFLOPs | Runtime (s) | Memory (GB) |
|---|---|---|---|
| MRG (CNN with LSTM) [70] | 5.5 | 0.0048 | 2.60 |
| MDTransformer [71] | 20.6 | 0.2832 | 3.55 |
| LLaVA1.5-7B [9] | 578.4 | 0.6091 | 15.62 |
| MiniGPT-4 [69] | 2107.9 | 0.7574 | 12.67 |
| PV-FaultExpert (ours) † | 584.2 | 0.6219 | 15.74 |
† indicates that the method was pre-trained and fine-tuned on our curated dataset.
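For readers who want to reproduce columns of this kind, the sketch below shows one common recipe: count FLOPs analytically for a layer and measure wall-clock runtime over repeated forward passes. It profiles only a single dense layer as a stand-in; the paper's exact measurement protocol (input resolution, batch size, hardware) is not specified here.

```python
import time
import numpy as np

def profile_linear(d_in, d_out, seq_len, repeats=10):
    """Rough profiling sketch for one dense layer: analytic GFLOPs plus
    averaged wall-clock runtime, analogous in spirit to the GFLOPs and
    Runtime columns of Table 4."""
    W = np.random.randn(d_out, d_in).astype(np.float32)
    x = np.random.randn(seq_len, d_in).astype(np.float32)
    # One multiply and one add per weight per token.
    gflops = 2.0 * seq_len * d_in * d_out / 1e9
    t0 = time.perf_counter()
    for _ in range(repeats):
        _ = x @ W.T
    runtime = (time.perf_counter() - t0) / repeats
    return gflops, runtime
```

Memory footprints such as the last column are typically read from the runtime's device statistics rather than computed analytically.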

Share and Cite

Wu, J.; Chen, Y.; Min, Q.; Chen, M.; Zhao, J.; Ye, M. Domain-Adaptive Multimodal Large Language Models for Photovoltaic Fault Diagnosis via Dynamic LoRA Routing. Processes 2026, 14, 653. https://doi.org/10.3390/pr14040653