Article

Error-Guided Multimodal Sample Selection with Hallucination Suppression for LVLMs

1 Information and Communication Branch, State Grid Jiangsu Electric Power Company Ltd., Nanjing 210024, China
2 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed.
Computers 2025, 14(12), 564; https://doi.org/10.3390/computers14120564
Submission received: 19 November 2025 / Revised: 7 December 2025 / Accepted: 12 December 2025 / Published: 17 December 2025

Abstract

Building high-quality multimodal instruction datasets is often time-consuming and costly. Recent studies have shown that a small amount of carefully selected high-quality data can be more effective for improving LVLM performance than large volumes of low-quality data. Based on these observations, we propose an error-guided multimodal sample selection framework with hallucination suppression for LVLM fine-tuning. First, semantic embeddings of queries are clustered to form balanced subsets that preserve task diversity. A visual contrastive decoding module is then used to reduce hallucinations and expose genuinely difficult examples. For closed-ended tasks, such as object detection, we estimate sample value using prediction accuracy; for open-ended question answering, we use the perplexity of generated responses as a difficulty signal. Within each cluster, high-error or high-perplexity samples are preferentially selected to construct a compact yet informative training set. Experiments on the InsPLAD detection benchmark and the PowerQA visual question answering dataset show that our method consistently outperforms random sampling under the same data budget, achieving higher F1, cosine similarity, BLEU (Bilingual Evaluation Understudy), and GPT-4o-based evaluation scores. This demonstrates that hallucination-aware, uncertainty-driven data selection can improve LVLM robustness and data efficiency.

1. Introduction

Large Vision-Language Models (LVLMs), such as GPT-4o [1], Gemini 2.5 [2], and Qwen2.5-VL [3], have demonstrated outstanding performance across a wide range of visual understanding and generation tasks, while being able to follow human instructions efficiently and safely. During training, LVLMs typically undergo two core stages: first, pretraining on large-scale image–text pairs, and second, fine-tuning on visual instruction datasets [4,5,6]. Of the two, the instruction-tuning stage plays a crucial role in aligning LVLMs with human instructions [7]. By training on visual instruction–output datasets, visual instruction tuning can effectively narrow the gap between LVLMs and diverse human intents [8]. Specifically, visual instruction tuning not only enables model outputs to better align with human preferences, thereby improving controllability and safety, but also allows LVLMs to quickly adapt to specific domains or acquire specialized knowledge without requiring extensive computational resources or architectural modifications [9,10].
In early studies, visual instruction fine-tuning mainly focused on the construction of large-scale instruction datasets [11,12]. Currently, there are two primary approaches: one is based on existing multimodal datasets, for example, converting image–text pairs into image–instruction–output triplets [13,14]; the other leverages high-performance multimodal large models (e.g., GPT-4o) to generate large-scale visual instruction training sets based on pre-designed visual instruction prompt templates [4,15,16]. Despite the existence of various methods for constructing large-scale visual instruction datasets, these datasets still have certain limitations in terms of quantity, diversity, and creativity.
Therefore, during the supervised fine-tuning stage, selecting high-quality datasets is critical for model training [17,18]. Research has shown that although instruction fine-tuning generally relies on large amounts of data, data quality is often more important than quantity [19,20]. For example, on MiniGPT-4, using only a small amount of high-quality instruction data can achieve excellent results, indicating that LVLMs already acquire rich world knowledge during the pretraining stage, and only a small set of high-quality instruction data is required during the instruction-tuning stage to significantly enhance the model’s capabilities [21].
Manually selecting instruction data usually incurs high costs and easily introduces human bias. Therefore, developing efficient automated methods to select instruction data is of great significance. However, because this task involves complex factors and multidimensional considerations, achieving automated data selection remains quite challenging. Recent studies propose various automated methods for selecting instruction training data [22,23,24]. PRISM introduces a training-independent data selection method that quantifies the internal visual encoding features of multimodal large models through Pearson correlation analysis and calculates task-specific relevance scores to identify high-value data instances [25]. S2L trains small models, clusters their loss trajectories, and samples from these clusters to guide data selection for large models. This method demonstrates that, during fine-tuning, samples from the same loss trajectory cluster have similar gradients, enabling efficient data selection [26]. DataTailor proposes a unified framework that selects data based on three key principles—informativeness, uniqueness, and representativeness—to jointly optimize the quality and diversity of multimodal data [27].
Although existing data selection methods achieve certain improvements in enhancing model performance, they still exhibit several limitations. On the one hand, these methods primarily rely on static metrics or pre-trained features, lacking direct feedback on the model’s actual errors and task performance. On the other hand, in open-ended tasks or complex instruction scenarios, data selection struggles to balance task diversity, sample difficulty, and model weaknesses, which in turn limits the effectiveness of fine-tuning. To address these challenges, this study proposes an error-guided multimodal data selection method. Specifically, it first clusters samples based on the semantic representations of queries. Next, it introduces a hallucination suppression module to mitigate hallucination issues during model generation. Finally, it conducts systematic evaluation of model outputs. The evaluation strategy is divided into two categories: for closed-ended tasks (e.g., multiple-choice questions), it directly selects samples that the model frequently mispredicts based on output accuracy to enhance fine-tuning; for open-ended tasks, where the correctness of outputs is difficult to measure precisely, it uses the perplexity (PPL) of each response as a quantitative indicator of output difficulty and selects high-PPL samples proportionally to each category in the dataset, ensuring coverage of different task types and difficulty levels.
In this method, the hallucination suppression module plays a key role. Unlike directly using the outputs of large models, this module suppresses false information during generation, significantly reducing unreasonable or erroneous outputs. If certain samples still fail to produce correct results after hallucination suppression, these samples are considered highly difficult and valuable for learning. In other words, such data are not only complex and challenging but also effectively expose the model’s weaknesses in multimodal understanding and reasoning.
The method employs strategies of error prioritization, category balancing, and difficulty control to achieve efficient and targeted data selection. Its advantages lie in both enhancing the model’s performance on error-prone areas and maintaining data diversity and representativeness. Experimental results show that this method not only improves fine-tuning efficiency and model performance but also significantly strengthens the robustness and generalization ability of multimodal large models. Overall, our main contributions are three-fold:
  • An error-guided multimodal training data selection method: This approach selects samples based on model error-prone instances and output difficulty, effectively improving visual instruction fine-tuning performance.
  • A hallucination suppression module: This module enhances the effectiveness of data selection. By identifying high-difficulty samples, it exposes the model’s weaknesses in multimodal understanding and reasoning, thereby increasing the learning value of the selected data.
  • Extensive experimental validation: Results demonstrate that the proposed strategy not only significantly improves fine-tuning efficiency and model performance but also enhances the robustness and generalization capability of multimodal large models.

2. Related Work

2.1. LVLMs

In recent years, the rapid advancement of Large Language Models (LLMs) has not only led to breakthroughs in natural language processing [28,29], but also sparked growing interest from both academia and industry in multimodal intelligence, particularly in vision–language interaction [30,31,32,33]. Against this backdrop, how to effectively integrate visual information into LLMs, thereby equipping them with cross-modal understanding and reasoning abilities, has become one of the central challenges in multimodal AI research [34,35].
LVLMs typically follow a paradigm that leverages large-scale image–text paired data for training. In this process, a projection module is often introduced to map image features extracted by a vision encoder into the embedding space of LLMs, thus achieving alignment and fusion between visual and linguistic modalities [4,5]. This design enables LLMs to process images and text jointly within a shared semantic space, thereby endowing them with cross-modal modeling capabilities. From the training perspective, LVLMs go beyond basic cross-modal alignment and rely heavily on large-scale vision–language instruction tuning to further enhance their ability to follow human intent and instructions [6,8]. Recent works have also explored alignment with human preference, reinforcement learning-based optimization, as well as multi-stage training paradigms, which collectively improve the robustness and generalization ability of LVLMs [36,37]. In terms of applications, LVLMs have demonstrated strong performance across a variety of tasks, including visual question answering, image captioning, visual reasoning, visual perception, and decision support. With their growing capabilities, LVLMs are being extended to more complex scenarios, such as multimodal dialogue, video understanding, cross-modal content generation, and even scientific data analysis, highlighting their broad potential for real-world applications [1,7,32].

2.2. Instruction Construction

The Self-Instruct method randomly selects a small set of instances from an initial task pool as exemplars to guide large language models in generating new instructions along with their corresponding input-output pairs [38]. In contrast, Evol-Instruct employs a progressive modification strategy on the original instructions, allowing for more precise control over the difficulty and complexity of the generated instructions [39]. Unlike these two approaches, Tree-Instruct extends existing instructions by guiding the language model to append a specified number of new nodes in the semantic tree of an instruction, rather than directly manipulating the textual sequence, thereby enabling structured control over instruction expansion [40].
On the other hand, some studies focus on improving the performance of large language models using smaller but higher-quality sets of instruction exemplars [41,42]. LIMA demonstrates remarkable performance using only approximately one thousand high-quality data points [17]. Instruction Mining proposes a linear-rule-based method to select high-quality instructions; however, it requires partitioning the data into multiple intervals, which limits its ability to evaluate individual samples at a fine-grained level [22]. InstructionGPT-4 introduces a set of metrics for assessing the quality of multimodal instruction data and, based on these metrics, develops a data selector that automatically identifies and filters low-quality vision–language data [21]. AlpaGasus leverages an external, powerful model, GPT-4o, to directly evaluate each example. Although this approach has been empirically validated, a significant limitation is that it cannot account for inherent differences among the models being fine-tuned and relies excessively on ChatGPT's preferences [43]. To more clearly illustrate the differences between our approach and existing methods, we compare the characteristics of various methods along five dimensions in Table 1. As shown in Table 1, our approach is a training-free, automated data selection scheme that does not rely on proprietary large models, and it leverages clustering and hallucination mitigation strategies to balance data diversity while selecting high-quality training data.

3. Method

The core idea of our method lies in an error-guided training data selection mechanism. Specifically, samples on which the model makes mistakes during prediction are fed back into the training process, thereby further improving model performance and effectively reducing computational resource consumption. As shown in Figure 1, the overall framework of the method consists of four key stages: clustering, hallucination suppression, evaluation, and data selection. In this section, we elaborate on our method and discuss the motivation behind it.

3.1. Clustering

Introducing clustering in the data selection process has significant advantages. First, clustering structures the original data space into partitions, which makes the sample distribution clearer and avoids bias caused by excessive redundant samples during data filtering. Second, clustering ensures that representative samples from different categories or semantic clusters are effectively preserved, which improves the diversity and coverage of the selected data and prevents the model from falling into “local optima” or overfitting to a few feature patterns during training. In addition, when error-guided data filtering is applied on top of the clustering results, it identifies high-value samples more precisely and ensures that the selected data both reflect the model’s weaknesses and maintain broad representativeness. This staged filtering strategy maintains the quality of training data while further reducing redundancy and computational cost. The specific procedure is as follows:
Let the dataset be $D = \{(q_k, i_k, a_k)\}_{k=1}^{N}$, where $q_k$, $i_k$, and $a_k$ denote the query, image, and answer of the k-th sample, and $N$ is the total number of samples. For each query $q_k$, its representation is obtained using the LLaMA2 model [28]:
$$QH_k = \mathrm{LLaMA2}(q_k) \in \mathbb{R}^{L_k \times h},$$
where $L_k$ is the token length of $q_k$ and $h$ is the hidden dimension. The feature vector of $q_k$ is taken as the representation of the last token:
$$f_k = QH_k[L_k, :] \in \mathbb{R}^{h}.$$
All query feature vectors are stacked to form the query feature matrix for the entire dataset:
$$QM = (f_1, f_2, \ldots, f_N) \in \mathbb{R}^{N \times h}.$$
The feature matrix $QM$ is then clustered using K-means with $P$ clusters, resulting in cluster centroids $\{c_j\}_{j=1}^{P}$ and cluster assignments $\{y_k\}_{k=1}^{N}$:
$$y_k = j \quad \text{if } f_k \text{ belongs to the } j\text{-th cluster}.$$
Each cluster $C_j$ can be expressed as:
$$C_j = \{(q_k, i_k, a_k) \mid y_k = j\}.$$
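The clustering step can be summarized with the following minimal sketch. It assumes a Hugging Face causal language model as a stand-in for the LLaMA2 encoder and scikit-learn's KMeans for the P-way clustering; the checkpoint name and helper names such as embed_queries are illustrative, not the exact implementation.

```python
# Minimal sketch of the query-clustering step (Section 3.1).
# Assumptions: a Hugging Face LLaMA-family model stands in for the LLaMA2 encoder,
# and scikit-learn's KMeans performs the P-way clustering.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

def embed_queries(queries, model_name="meta-llama/Llama-2-7b-hf", device="cuda"):
    """Return one feature vector per query: the hidden state of its last token."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    feats = []
    with torch.no_grad():
        for q in queries:
            inputs = tokenizer(q, return_tensors="pt").to(device)
            hidden = model(**inputs).last_hidden_state   # shape (1, L_k, h)
            feats.append(hidden[0, -1, :].cpu())          # f_k = QH_k[L_k, :]
    return torch.stack(feats).numpy()                     # QM in R^{N x h}

def cluster_queries(query_matrix, num_clusters=10, seed=0):
    """K-means over the query feature matrix; returns cluster labels y_k."""
    km = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10)
    return km.fit_predict(query_matrix)
```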

3.2. Hallucination Suppression

During the data selection process, we input the candidate dataset into a multimodal large model optimized with hallucination suppression techniques. Samples that the model can generate correctly indicate that these problems are relatively easy to learn through training and can be considered ordinary training samples. In contrast, samples that still cannot be generated correctly even after applying hallucination suppression methods indicate higher complexity and difficulty, representing high-value hard samples. Incorporating these hard samples into training can effectively enhance the model’s performance and robustness in complex scenarios. By combining hallucination suppression with hard sample mining, this approach can accurately select the data with the greatest training value for the model, thereby achieving more efficient model optimization.
In this study, hallucination suppression is implemented using Visual Contrastive Decoding (VCD) [44]. The detailed procedure is as follows:
1. Dual-Path Decoding Mechanism
  • Standard Decoding Path: For an input image $i$ and a query $q$, the output at time step $t$ is $\mathrm{logit}_{M_\theta}(r_t \mid q, i, r_{<t})$.
  • Perturbed Decoding Path: The original image $i$ is perturbed, for example by adding random noise or applying a mask, yielding a perturbed image $i'$; the output at time step $t$ is $\mathrm{logit}_{M_\theta}(r_t \mid q, i', r_{<t})$.
2. Contrastive Decoding: The final contrastive distribution is computed as:
$$p_{vcd}(r_t \mid q, i, i') = \mathrm{softmax}\big((1 + \alpha)\,\mathrm{logit}_{M_\theta}(r_t \mid q, i, r_{<t}) - \alpha\,\mathrm{logit}_{M_\theta}(r_t \mid q, i', r_{<t})\big),$$
where $\alpha$ denotes the contrastive coefficient.
It is worth noting that this approach is not limited to visual contrastive decoding and can be integrated with other hallucination suppression strategies, enhancing both flexibility and scalability.
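For clarity, one VCD decoding step can be sketched as follows. The model_logits callable and the Gaussian-noise distortion are placeholders for the actual LVLM interface and the perturbation used in practice; this is a schematic sketch of the logit combination rather than the exact implementation.

```python
# Schematic sketch of one Visual Contrastive Decoding step (Section 3.2).
# `model_logits` is a placeholder for the LVLM forward pass that returns
# next-token logits given (query, image, generated prefix); the Gaussian-noise
# distortion is one simple choice of image perturbation.
import torch

def distort_image(image: torch.Tensor, noise_std: float = 0.5) -> torch.Tensor:
    """Build the perturbed image i' by adding Gaussian noise to the pixel tensor."""
    return image + noise_std * torch.randn_like(image)

def vcd_next_token_distribution(model_logits, query, image, prefix, alpha: float = 1.0):
    """p_vcd = softmax((1 + alpha) * logits(i) - alpha * logits(i'))."""
    logits_clean = model_logits(query, image, prefix)                 # logits with the original image
    logits_noisy = model_logits(query, distort_image(image), prefix)  # logits with the perturbed image
    contrastive = (1 + alpha) * logits_clean - alpha * logits_noisy
    return torch.softmax(contrastive, dim=-1)
```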

3.3. Evaluation

In the evaluation stage, our goal is to measure the degree of inconsistency between the model’s responses and the reference answers. For tasks with well-defined answers, such as multiple-choice questions, evaluation can be performed directly by checking whether each response is correct. However, for open-ended questions, accurate evaluation is challenging due to the lack of a unique correct answer.
Recent studies have shown that LLMs possess a certain degree of self-assessment capability: when the model is confident about a question, the probability of generating the corresponding answer is relatively high; for unfamiliar or difficult questions, the probability is relatively low, indicating lower confidence. Based on this observation, we use PPL to quantify the model’s “uncertainty” for each question. A higher PPL indicates that the model’s response is more likely to be incorrect, highlighting questions that may require further training or optimization. The PPL is computed as follows:
$$\mathrm{PPL}(r) = \exp\!\Big(-\frac{1}{M}\sum_{i=1}^{M} \log P_\theta(r_i \mid r_{<i})\Big),$$
where $r = (r_1, r_2, \ldots, r_M)$ is the model's response sequence, $P_\theta(r_i \mid r_{<i})$ denotes the probability assigned by the model to the i-th token given the preceding tokens $r_{<i}$, and $M$ is the sequence length.
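A minimal sketch of this computation with a Hugging Face causal language model is given below; the prompt/response handling is simplified and the function name response_perplexity is illustrative.

```python
# Sketch of response-level perplexity (Section 3.3), assuming a causal LM that
# scores the response tokens conditioned on the prompt via teacher forcing.
import torch
import torch.nn.functional as F

def response_perplexity(model, tokenizer, prompt: str, response: str, device="cuda") -> float:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    response_ids = tokenizer(response, return_tensors="pt", add_special_tokens=False).input_ids.to(device)
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits                         # (1, T, vocab)
    # log P(r_i | r_<i): logits at position t-1 predict the token at position t
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    resp_lp = token_lp[:, -response_ids.size(1):]                # keep only response tokens
    return torch.exp(-resp_lp.mean()).item()                     # PPL = exp(-(1/M) * sum log P)
```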

3.4. Data Selection

In the data selection stage, we first sort all samples according to the scores obtained in the evaluation phase and prioritize the selection of high-scoring "hard samples" for subsequent model training. To ensure representativeness and balance in the selected dataset, we introduce a proportional sampling strategy based on clustering. Specifically, let the target number of selected samples be $T$, and assume the data is clustered into $P$ groups, with the i-th cluster containing $n_i$ samples. The proportion of the i-th cluster is given by:
$$r_i = \frac{n_i}{\sum_{j=1}^{P} n_j}.$$
Accordingly, the number of samples selected from the i-th cluster is:
$$k_i = T \times r_i,$$
where $k_i$ denotes the number of samples chosen from cluster $i$. Finally, we select the top $k_i$ hard samples from each cluster in descending order of their scores, thereby ensuring both the prioritization of high-value data and the preservation of distributional consistency across clusters.
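The whole selection procedure can be summarized by the following sketch; the helper name select_hard_samples and the rounding of $k_i$ are illustrative choices rather than part of the formal definition.

```python
# Sketch of the cluster-proportional hard-sample selection (Section 3.4):
# within each cluster, samples are ranked by their difficulty score
# (error indicator or perplexity) and the top k_i = T * r_i are kept.
from collections import defaultdict

def select_hard_samples(samples, labels, scores, budget):
    """samples: data points; labels: cluster id per sample;
    scores: difficulty per sample (higher = harder); budget: target size T."""
    clusters = defaultdict(list)
    for sample, label, score in zip(samples, labels, scores):
        clusters[label].append((score, sample))
    total = len(samples)
    selected = []
    for label, members in clusters.items():
        quota = round(budget * len(members) / total)           # k_i = T * r_i (rounded)
        members.sort(key=lambda pair: pair[0], reverse=True)   # hardest first
        selected.extend(sample for _, sample in members[:quota])
    return selected
```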

4. Experiments

To evaluate the effectiveness of our proposed method, we conduct extensive experiments on two datasets: the publicly available InsPLAD dataset and a self-constructed multimodal question answering dataset in the power domain.

4.1. Dataset Description

4.1.1. InsPLAD

The InsPLAD dataset is a widely used benchmark dataset for object detection in the power grid domain and contains 7979 training samples and 2626 test samples [45]. An example is shown in Figure 2.

4.1.2. PowerQA

To evaluate the generative capability of our method, we construct a visual question-answering dataset focused on power grid equipment, named PowerQA. During the dataset construction process, we first apply various editing and augmentation operations to the collected images to improve their quality and diversity. Then, GPT-4o generates image captions and constructs initial question–answer pairs. To ensure the accuracy and reliability of the data, we perform manual annotation and filtering. The final dataset contains 7000 high-quality question–answer pairs, of which 6000 are used for training and 1000 are used for testing. An example is shown in Figure 3.

4.2. Model and Training Details

We use the open-source multimodal large language model Qwen2.5-VL-3B [3] and fine-tune it with the open-source framework LLaMA-Factory. All experiments are run on a server equipped with four NVIDIA RTX 3090 GPUs. For fine-tuning, we adopt the LoRA method to efficiently adapt the model to our specific task [46].

4.3. Evaluation Metrics

To comprehensively assess the performance of our proposed method, we use a set of standard and advanced metrics designed for the specific tasks.

4.3.1. InsPLAD Metrics

We use Precision, Recall, and F1-score, which are calculated from true positives ($TP$), false positives ($FP$), and false negatives ($FN$):
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
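These formulas translate directly into code; the zero-division guards below are a practical convenience added for this sketch.

```python
# Detection metrics computed directly from TP/FP/FN counts.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```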

4.3.2. PowerQA Metrics

Since PowerQA is an open-ended question-answering task, we adopt three evaluation metrics to measure the quality of the generated answers.
  • Cosine Similarity: Measures the similarity between two non-zero vectors in an inner product space. We use a TF-IDF vectorization to represent the model-generated and ground-truth answers (a small computation sketch for this metric and BLEU is given after this list).
    $$\mathrm{CosineSimilarity}(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|}$$
  • BLEU Score: A standard metric for evaluating the quality of machine-generated text by measuring the correspondence between it and a set of reference texts.
  • GPT-4o Score: Given the limitations of automated metrics in capturing semantic nuance and hallucination, we prompt the GPT-4o model to act as an evaluator, scoring the quality of our model's generated answers on a scale from 0 to 10. The scoring criteria strictly follow these rules:
    - 10: Semantically equivalent to the label, with consistent key information and no contradictions.
    - 8–9: Basically equivalent, with only minor stylistic differences.
    - 6–7: Generally consistent but missing important details or with slight deviations.
    - 4–5: Partially matching; many key points are missing or ambiguous.
    - 1–3: Largely inconsistent or conflicting with the label.
    - 0: Clearly contradictory, fabricated facts, or completely irrelevant.
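To make the two automatic metrics concrete, a minimal sketch is given below. It assumes scikit-learn for TF-IDF vectorization and NLTK for sentence-level BLEU; the tokenization and smoothing settings shown are illustrative rather than the exact configuration used in the experiments.

```python
# Sketch of the two automatic PowerQA metrics: TF-IDF cosine similarity between
# a generated answer and its reference, and a sentence-level BLEU score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def tfidf_cosine(prediction: str, reference: str) -> float:
    """Cosine similarity between TF-IDF vectors of prediction and reference."""
    vectors = TfidfVectorizer().fit_transform([prediction, reference])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

def bleu(prediction: str, reference: str) -> float:
    """Sentence-level BLEU with simple whitespace tokenization and smoothing."""
    smoother = SmoothingFunction().method1
    return sentence_bleu([reference.split()], prediction.split(),
                         smoothing_function=smoother)
```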

4.3.3. Baselines

  • Base: Directly uses the original LVLM to generate outputs, without any additional fine-tuning or data selection.
  • Random_N: Randomly selects N samples from the training set for fine-tuning, where N denotes the number of selected samples.
  • AlpaGasus_N: Uses the AlpaGasus method to select N samples from the training set for fine-tuning.
  • Choice_N: The method proposed in this paper, which selects N high-value samples from the training set for fine-tuning.

4.4. Results

4.4.1. Main Results

As shown in Table 2, our data selection method consistently and significantly outperforms both the random sampling and AlpaGasus-based baselines on the InsPLAD dataset. With 2000 samples, the F1 score increases from 0.520 (Random_2000) and 0.592 (AlpaGasus_2000) to 0.689, i.e., gains of 0.169 and 0.097 over the two baselines. At 5000 samples, the F1 score further rises to 0.809, compared with 0.730 and 0.738, yielding improvements of 0.079 and 0.071, respectively. These consistent gains demonstrate that our method is more effective at selecting informative and impactful data under the same budget, leading to a more efficient fine-tuning process.
On PowerQA, the models fine-tuned with our selected data also outperform those trained on randomly sampled or AlpaGasus-selected subsets. For 3000 samples, the BLEU score improves from 21.49 and 21.72 to 21.88, while the GPT-4o score increases from 5.65 and 5.69 to 5.75. With 5000 samples, our method achieves the highest cosine similarity (0.1925) and a competitive GPT-4o score of 5.82, compared to 0.1916/5.77 (Random_5000) and 0.1920/5.85 (AlpaGasus_5000), while maintaining a similar BLEU level. These results further validate that providing more semantically rich and challenging data points for fine-tuning leads to consistently better overall performance.

4.4.2. Data Selection Metrics

To further investigate the impact of different data selection metrics on the performance of our model, we conduct an additional experiment on PowerQA. We first cluster the original training data into ten categories according to the questions. We then use four different criteria—Cosine score, BLEU score, GPT-4o score, and PPL score—to select a proportional number of data from each cluster. The results of this comparative analysis, based on these four selection criteria, are summarized in Table 3.
The results clearly indicate a significant variation in model performance depending on the selection metric used. The metrics based on lexical and semantic similarity, Cosine Similarity and BLEU score, result in the lowest performance, with final GPT-4o scores of 5.28 and 5.30, respectively. In contrast, selecting data based on the GPT-4o score or PPL leads to a substantial improvement across all evaluation metrics. Notably, using PPL as the selection criterion yields the best overall performance, with a Cosine Similarity of 0.1817, a BLEU score of 21.88, and a GPT-4o score of 5.75. This finding suggests that model-centric metrics, which better capture a model's uncertainty or qualitative shortcomings, are more effective for identifying high-value data for fine-tuning.

4.4.3. Class Numbers

To further examine the impact of the number of data clusters on model performance, we conduct comparative experiments with different numbers of clusters, and report the results in Table 4. We observe that, compared with the baseline model without clustering (class_num = 0), introducing clustering substantially improves performance. For example, when the number of clusters increases from 0 to 10, the GPT-4o score rises from 5.58 to 5.75, the BLEU score increases from 21.19 to 21.88, and the cosine similarity improves from 0.1718 to 0.1817. These results indicate that clustering the training data helps obtain more balanced and representative subsets for fine-tuning, thereby improving the overall performance of the model.
When we further increase the number of clusters to 30 and 40, the metrics exhibit only minor fluctuations relative to the 10-cluster setting: the GPT-4o scores for 10 and 30 clusters are identical, and the cosine similarity and BLEU scores are also nearly the same; the 40-cluster configuration shows a slight decrease, but the difference remains small. This finding suggests that, under our current setup, a moderate number of clusters (e.g., 10 or 30) already captures most of the performance gains brought by clustering, while using more clusters yields only marginal additional improvements.

4.4.4. Generalization

We first validate the proposed method on the InsPLAD and PowerQA datasets, both of which consist of power-grid-related data. To further assess the generalization ability of our method, we adopt the general multimodal large model evaluation benchmark MM-Vet [47]. MM-Vet defines six core vision-language capabilities—recognition, OCR, knowledge, language generation, spatial understanding, and mathematical reasoning—and examines 16 key integrated capability combinations derived from these abilities.
As shown in Figure 4, under different training data sizes (N = 1500, 3000, 5000), our method consistently outperforms Random and AlpaGasus on MM-Vet. This result indicates that our method also achieves superior performance on general multimodal data, further demonstrating its strong generalization ability.

5. Case Study

To provide a qualitative evaluation of our proposed data selection method, we conduct a case study across four representative tasks from our PowerQA dataset. These tasks, chosen for their practical relevance and typicality in the power industry domain, include anomaly detection on power equipment, foreign object detection on large-scale facilities, image captioning of power scenes, and Optical Character Recognition (OCR) for identifying equipment models. The diverse nature of these tasks demonstrates the generalizability of our method across common vision-language tasks.
Figure 5 visually compares the output of a model fine-tuned using our method against one fine-tuned on randomly selected data. The results consistently show that our approach significantly enhances the quality of the model’s responses. For instance, in the anomaly detection task, our model successfully identifies minute abnormalities that were missed by the randomly fine-tuned model. In the image captioning task, our model’s descriptions are not only more accurate but also richer and more semantically comprehensive. Furthermore, in the OCR task, our model demonstrates a notable improvement in the accuracy of technical terminology and equipment model numbers, which are crucial for real-world applications.

6. Discussion

Our experiments on InsPLAD and PowerQA show that hallucination-aware data selection can make visual instruction tuning more efficient and more effective. Compared with random sampling, our method consistently chooses samples that are more informative and challenging, leading to higher F1 scores on InsPLAD and better cosine similarity, BLEU, and GPT-4o scores on PowerQA under the same training cost. These results support our hypothesis that error-centric and model-aware criteria are more important than simply increasing dataset size, especially in specialized domains such as power grids.
Our comparison of selection metrics shows that simple lexical or semantic similarity measures (e.g., cosine similarity, BLEU) are not very sensitive to subtle hallucinations or missing key details. Metrics derived from model uncertainty and external LLM evaluation (perplexity and GPT-4o scores) align better with the model’s true difficulties. The strong performance of perplexity-based selection suggests that uncertainty-aware sampling is a practical and scalable way to mine high-value data without expensive human labels. At the same time, clustering helps maintain diversity: by selecting hard examples from different clusters, we balance focusing on errors with preserving coverage of various task types.

7. Limitation and Future Work

This work still has several limitations in terms of data usage and hallucination mitigation. On the one hand, we adopt a one-shot, static data selection pipeline: we first select a set of hard samples based on model errors and uncertainty, and then fine-tune the model solely on this fixed subset, without exploring adaptive mechanisms that adjust the ratio of easy and hard samples as training progresses. On the other hand, we do not perform end-to-end joint optimization between hallucination mitigation and data selection.
Future work may include designing adaptive curricula that adjust the mix of hard and easy samples over time, extending hallucination-aware selection to other modalities (e.g., video, 3D, time series), and incorporating richer expert feedback from practitioners in different application scenarios. Another promising direction is to jointly optimize data selection and model architecture, for example by adding lightweight adapters specialized for hallucination-prone sample clusters. Overall, our hallucination-aware data selection framework provides a simple yet effective bridge between error analysis, data curation, and instruction tuning for LVLMs, and points toward more data-centric strategies for reliable multimodal systems.

8. Conclusions

This paper addresses the problem of data selection for LVLMs during the instruction fine-tuning stage and proposes a hallucination-aware multimodal training data selection method. Studies show that, during fine-tuning, the importance of high-quality data often outweighs data quantity; a small amount of high-quality instruction data can significantly enhance the model’s response capability. To overcome the high cost and potential bias of traditional manual selection, our method integrates semantic clustering, a hallucination suppression module, and a systematic evaluation mechanism to achieve efficient and precise data selection. Specifically, semantic clustering is first employed to group samples, ensuring class balance and data diversity. Next, the hallucination suppression module effectively identifies the model’s weaknesses in multimodal understanding and reasoning, enabling targeted selection of high-difficulty and high-value training samples. For different task types, the method adopts differentiated evaluation strategies: for closed-ended tasks, output error rates guide data selection; for open-ended tasks, perplexity measures sample difficulty to ensure coverage across task types and difficulty levels while maintaining data diversity and representativeness. Experimental results demonstrate that this method significantly improves fine-tuning efficiency and model performance, while enhancing the robustness and generalization ability of multimodal large models, providing a practical solution for constructing high-quality training data efficiently.

Author Contributions

Conceptualization, H.C., L.S. and Y.Z.; Methodology, X.C., T.F. and Y.Z.; Writing—original draft, X.C. and T.F.; Writing—review and editing, H.C., L.S., Y.Z. and T.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Projects of State Grid Jiangsu Electric Power Company Ltd., under Grant J2024168.

Data Availability Statement

This study uses two datasets: InsPLAD and PowerQA. The InsPLAD dataset has been publicly available since 26 February 2023 at: https://github.com/andreluizbvs/InsPLAD/tree/main. Data available on request due to restrictions: The PowerQA dataset is constructed by the authors with the support of the company, and the ownership of the data belongs to the company. The dataset is available on request from the corresponding author, subject to company ownership and internal approval requirements, and the time of public release is currently undetermined.

Conflicts of Interest

Authors Huanyu Cheng and Linjiang Shang are employed by the State Grid Jiangsu Electric Power Company Ltd., Information and Communication Branch, Nanjing, China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
  2. Comanici, G.; Bieber, E.; Schaekermann, M.; Pasupat, I.; Sachdeva, N.; Dhillon, I.; Blistein, M.; Ram, O.; Zhang, D.; Rosen, E.; et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv 2025, arXiv:2507.06261.
  3. Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL technical report. arXiv 2025, arXiv:2502.13923.
  4. Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26296–26306.
  5. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592.
  6. Xu, H.; Ye, Q.; Yan, M.; Shi, Y.; Ye, J.; Xu, Y.; Li, C.; Bi, B.; Qian, Q.; Wang, W.; et al. mPLUG-2: A modularized multi-modal foundation model across text, image and video. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 38728–38748.
  7. Dai, W.; Li, J.; Li, D.; Tiong, A.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 49250–49267.
  8. Wang, W.; Gao, Z.; Gu, L.; Pu, H.; Cui, L.; Wei, X.; Liu, Z.; Jing, L.; Ye, S.; Shao, J.; et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv 2025, arXiv:2508.18265.
  9. Zhang, Y.F.; Yu, T.; Tian, H.; Fu, C.; Li, P.; Zeng, J.; Xie, W.; Shi, Y.; Zhang, H.; Wu, J.; et al. MM-RLHF: The next step forward in multimodal LLM alignment. arXiv 2025, arXiv:2502.10391.
  10. Yu, T.; Zhang, H.; Li, Q.; Xu, Q.; Yao, Y.; Chen, D.; Lu, X.; Cui, G.; Dang, Y.; He, T.; et al. RLAIF-V: Open-source AI feedback leads to super GPT-4V trustworthiness. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 19985–19995.
  11. Liu, F.; Lin, K.; Li, L.; Wang, J.; Yacoob, Y.; Wang, L. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv 2023, arXiv:2306.14565.
  12. Ye, J.; Xu, H.; Liu, H.; Hu, A.; Yan, M.; Qian, Q.; Zhang, J.; Huang, F.; Zhou, J. mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv 2024, arXiv:2408.04840.
  13. Zhao, B.; Wu, B.; He, M.; Huang, T. SVIT: Scaling up visual instruction tuning. arXiv 2023, arXiv:2307.04087.
  14. Chen, J.; Zhang, T.; Liu, C.; Ding, H.; Shi, Y.; Cheng, F.; Xiao, H.; Wen, B.; Yang, F.; Gao, T.; et al. TaskGalaxy: Scaling multi-modal instruction fine-tuning with tens of thousands vision task types. arXiv 2025, arXiv:2502.09925.
  15. Liu, J.; Huang, X.; Zheng, J.; Liu, B.; Wang, J.; Yoshie, O.; Liu, Y.; Li, H. MM-Instruct: Generated visual instructions for large multimodal model alignment. arXiv 2024, arXiv:2406.19736.
  16. Wang, J.; Meng, L.; Weng, Z.; He, B.; Wu, Z.; Jiang, Y.G. To see is to believe: Prompting GPT-4V for better visual instruction tuning. arXiv 2023, arXiv:2311.07574.
  17. Zhou, C.; Liu, P.; Xu, P.; Iyer, S.; Sun, J.; Mao, Y.; Ma, X.; Efrat, A.; Yu, P.; Yu, L.; et al. LIMA: Less is more for alignment. Adv. Neural Inf. Process. Syst. 2023, 36, 55006–55021.
  18. Liu, Z.; Zhou, K.; Zhao, W.X.; Gao, D.; Li, Y.; Wen, J.R. Less is more: Data value estimation for visual instruction tuning. arXiv 2024, arXiv:2403.09559.
  19. Wu, S.; Lu, K.; Xu, B.; Lin, J.; Su, Q.; Zhou, C. Self-evolved diverse data sampling for efficient instruction tuning. arXiv 2023, arXiv:2311.08182.
  20. Li, Y.; Hui, B.; Xia, X.; Yang, J.; Yang, M.; Zhang, L.; Si, S.; Chen, L.H.; Liu, J.; Liu, T.; et al. One-shot learning as instruction data prospector for large language models. arXiv 2023, arXiv:2312.10302.
  21. Wei, L.; Jiang, Z.; Huang, W.; Sun, L. InstructionGPT-4: A 200-instruction paradigm for fine-tuning MiniGPT-4. arXiv 2023, arXiv:2308.12067.
  22. Cao, Y.; Kang, Y.; Sun, L. Instruction mining: High-quality instruction data selection for large language models. arXiv 2023, arXiv:2307.06290.
  23. Li, M.; Zhang, Y.; Li, Z.; Chen, J.; Chen, L.; Cheng, N.; Wang, J.; Zhou, T.; Xiao, J. From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning. arXiv 2023, arXiv:2308.12032.
  24. Du, Q.; Zong, C.; Zhang, J. MoDS: Model-oriented data selection for instruction tuning. arXiv 2023, arXiv:2311.15653.
  25. Bi, J.; Wang, Y.; Yan, D.; Xiao, X.; Hecker, A.; Tresp, V.; Ma, Y. PRISM: Self-pruning intrinsic selection method for training-free multimodal data selection. arXiv 2025, arXiv:2502.12119.
  26. Yang, Y.; Mishra, S.; Chiang, J.; Mirzasoleiman, B. SmallToLarge (S2L): Scalable data selection for fine-tuning large language models by summarizing training trajectories of small models. Adv. Neural Inf. Process. Syst. 2024, 37, 83465–83496.
  27. Yu, Q.; Shen, Z.; Yue, Z.; Wu, Y.; Qin, B.; Zhang, W.; Li, Y.; Li, J.; Tang, S.; Zhuang, Y. Mastering collaborative multi-modal data selection: A focus on informativeness, uniqueness, and representativeness. arXiv 2024, arXiv:2412.06293.
  28. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
  29. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv 2025, arXiv:2501.12948.
  30. Hong, W.; Yu, W.; Gu, X.; Wang, G.; Gan, G.; Tang, H.; Cheng, J.; Qi, J.; Ji, J.; Pan, L.; et al. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv 2025, arXiv:2507.01006.
  31. Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv 2024, arXiv:2409.12191.
  32. Team, K.; Du, A.; Yin, B.; Xing, B.; Qu, B.; Wang, B.; Chen, C.; Zhang, C.; Du, C.; Wei, C.; et al. Kimi-VL technical report. arXiv 2025, arXiv:2504.07491.
  33. Yao, Y.; Yu, T.; Zhang, A.; Wang, C.; Cui, J.; Zhu, H.; Cai, T.; Li, H.; Zhao, W.; He, Z.; et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv 2024, arXiv:2408.01800.
  34. Hong, W.; Wang, W.; Ding, M.; Yu, W.; Lv, Q.; Wang, Y.; Cheng, Y.; Huang, S.; Ji, J.; Xue, Z.; et al. CogVLM2: Visual language models for image and video understanding. arXiv 2024, arXiv:2408.16500.
  35. Wu, S.; Fei, H.; Qu, L.; Ji, W.; Chua, T.S. NExT-GPT: Any-to-any multimodal LLM. In Proceedings of the Forty-first International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024.
  36. Yin, S.; Fu, C.; Zhao, S.; Li, K.; Sun, X.; Xu, T.; Chen, E. A survey on multimodal large language models. Natl. Sci. Rev. 2024, 11, nwae403.
  37. Nguyen, T.T.; Wilson, C.; Dalins, J. Aligning large vision-language models by deep reinforcement learning and direct preference optimization. arXiv 2025, arXiv:2509.06759.
  38. Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N.A.; Khashabi, D.; Hajishirzi, H. Self-Instruct: Aligning language models with self-generated instructions. arXiv 2022, arXiv:2212.10560.
  39. Xu, C.; Sun, Q.; Zheng, K.; Geng, X.; Zhao, P.; Feng, J.; Tao, C.; Lin, Q.; Jiang, D. WizardLM: Empowering large pre-trained language models to follow complex instructions. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
  40. Zhao, Y.; Yu, B.; Hui, B.; Yu, H.; Huang, F.; Li, Y.; Zhang, N.L. A preliminary study of the intrinsic relationship between complexity and alignment. arXiv 2024, arXiv:2308.05696.
  41. Shengyu, Z.; Linfeng, D.; Xiaoya, L.; Sen, Z.; Xiaofei, S.; Shuhe, W.; Jiwei, L.; Hu, R.; Tianwei, Z.; Wu, F.; et al. Instruction tuning for large language models: A survey. arXiv 2023, arXiv:2308.10792.
  42. Li, M.; Zhang, Y.; He, S.; Li, Z.; Zhao, H.; Wang, J.; Cheng, N.; Zhou, T. Superfiltering: Weak-to-strong data filtering for fast instruction-tuning. arXiv 2024, arXiv:2402.00530.
  43. Chen, L.; Li, S.; Yan, J.; Wang, H.; Gunaratna, K.; Yadav, V.; Tang, Z.; Srinivasan, V.; Zhou, T.; Huang, H.; et al. AlpaGasus: Training a better Alpaca with fewer data. arXiv 2023, arXiv:2307.08701.
  44. Leng, S.; Zhang, H.; Chen, G.; Li, X.; Lu, S.; Miao, C.; Bing, L. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 13872–13882.
  45. Vieira e Silva, A.L.B.; de Castro Felix, H.; Simões, F.P.M.; Teichrieb, V.; dos Santos, M.; Santiago, H.; Sgotti, V.; Lott Neto, H. InsPLAD: A dataset and benchmark for power line asset inspection in UAV images. Int. J. Remote Sens. 2023, 44, 7294–7320.
  46. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Chen, W. LoRA: Low-rank adaptation of large language models. ICLR 2022, 1, 3.
  47. Yu, W.; Yang, Z.; Li, L.; Wang, J.; Lin, K.; Liu, Z.; Wang, X.; Wang, L. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv 2023, arXiv:2308.02490.
Figure 1. Overall pipeline of the hallucination-aware data selection method. First, the candidate dataset is clustered to ensure diversity and representativeness of samples. Then, the clustered data is fed into the LVLM, and a hallucination suppression mechanism is introduced during generation. Next, the responses generated by the model are evaluated: for tasks with clear ground-truth answers (such as multiple-choice questions), accuracy serves as the evaluation metric; for open-ended generation tasks, perplexity is used as the measure. Based on these evaluation scores, a proportional number of high-quality samples are selected from each category, thereby constructing a more valuable multimodal training set. Finally, this filtered dataset is used to fine-tune the multimodal large model, improving its overall performance and reliability.
Figure 2. An InsPLAD data sample illustrating the requirements of this task: the LVLM must not only identify the bounding-box coordinates of objects but also determine their categories.
Figure 3. This is a sample of PowerQA data. LVLMs need to perform question answering based on the content of power-related images. Only with sufficient domain knowledge in the power industry can the model provide correct answers.
Figure 4. Comparison of different methods on MM-Vet performance.
Figure 5. We compare the performance of the proposed method and the random selection method across four different question-answering scenarios in the power domain.
Table 1. Comparison of instruction data selection characteristics across different methods, where ✓ indicates that a method has the corresponding capability and × indicates that it does not.

Method               Automation   Training-Free   API Independence   Clustering   Hallucination Mitigation
LIMA                 ×            ✓               ✓                  ×            ×
Instruction Mining   ✓            ×               ✓                  ×            ×
InstructionGPT-4     ✓            ×               ✓                  ×            ×
AlpaGasus            ✓            ✓               ×                  ×            ×
Our Method           ✓            ✓               ✓                  ✓            ✓
Table 2. Comparison of the performance of baseline methods and our method across different datasets.

InsPLAD
Method            Precision   Recall   F1
Base              0.318       0.288    0.302
Random_2000       0.530       0.510    0.520
AlpaGasus_2000    0.617       0.570    0.592
Choice_2000       0.700       0.678    0.689
Random_3000       0.609       0.601    0.605
AlpaGasus_3000    0.721       0.709    0.715
Choice_3000       0.763       0.752    0.758
Random_5000       0.736       0.724    0.730
AlpaGasus_5000    0.747       0.728    0.738
Choice_5000       0.815       0.803    0.809

PowerQA
Method            Cos      BLEU    GPTScore
Base              0.0847   8.48    4.06
Random_1500       0.1650   20.59   5.48
AlpaGasus_1500    0.1673   20.59   5.43
Choice_1500       0.1669   20.66   5.50
Random_3000       0.1774   21.49   5.65
AlpaGasus_3000    0.1817   21.72   5.69
Choice_3000       0.1817   21.88   5.75
Random_5000       0.1916   23.03   5.77
AlpaGasus_5000    0.1920   23.12   5.85
Choice_5000       0.1925   23.01   5.82
Table 3. Performance of different data selection metrics on PowerQA.

Selection Metric   Cos      BLEU    GPTScore
Cos                0.1405   16.08   5.28
BLEU               0.1432   16.18   5.30
GPTScore           0.1781   21.71   5.65
PPL                0.1817   21.88   5.75
Table 4. Ablation study of the number of clusters (class_num).

class_num   Cos      BLEU    GPTScore
0           0.1718   21.19   5.58
10          0.1817   21.88   5.75
30          0.1818   21.96   5.75
40          0.1810   21.69   5.70