Review

Symmetry-Aware Advances in Multimodal Large Language Models: Architectures, Training, and Evaluation

1
Department of Computer Science and Technology, Sichuan University, Chengdu 610207, China
2
College of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China
*
Author to whom correspondence should be addressed.
Symmetry 2025, 17(9), 1400; https://doi.org/10.3390/sym17091400
Submission received: 19 July 2025 / Revised: 20 August 2025 / Accepted: 22 August 2025 / Published: 28 August 2025

Abstract

With the exponential growth of multimodal data, the limitations of traditional unimodal models in cross-modal understanding and complex scenario reasoning have become increasingly evident. Built upon the foundation of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) retain strong reasoning abilities and demonstrate unique capabilities in multimodal understanding. This survey provides a comprehensive overview of the current research landscape of MLLMs. It systematically analyzes mainstream model architectures, training, fine-tuning strategies, and task classifications, while offering a structured account of evaluation methodologies. Beyond synthesis, the paper highlights emerging trends that aim for balanced integration across modalities, tasks, and components, and critically examines key challenges together with potential solutions. The survey specifically emphasizes recent reasoning-oriented MLLMs, with a focus on DeepSeek-R1, analyzing their design paradigms and contributions from the perspective of symmetric reasoning capabilities. Overall, this work offers a comprehensive overview of cutting-edge advancements and lays a foundation for the future development of MLLMs, especially those guided by symmetry principles.

1. Introduction

In the era of artificial intelligence, MLLMs have become a central driver toward artificial general intelligence. The rapid growth of multimodal data exposes the inherent limitations of unimodal models in cross-modal understanding and reasoning. By extending LLMs, MLLMs offer innovative solutions to these challenges, underscoring the need for theoretical perspectives that reveal structural patterns and guide future model development. The concept of symmetry—broadly defined as structural or functional balance across modalities, components, or learning processes—provides such a framework. In the context of multimodal learning, symmetry manifests in several key aspects, including cross-modal alignment, encoder–decoder consistency, and unified architectures that process heterogeneous inputs in a modality-invariant manner. Representative models such as CLIP [1] and Flamingo [2] have achieved breakthrough progress in tasks like image–text retrieval and visual question answering by constructing joint cross-modal semantic spaces through pretraining.
Research in multimodal learning has mainly followed two technical pathways: the first, represented by discriminative models such as CLIP and ALIGN [3], leverages contrastive learning to build cross-modal joint embedding spaces; the second adopts generative architectures like OFA [4], unifying multimodal tasks within a sequence-to-sequence framework. Both paradigms reflect an implicit pursuit of structural or functional symmetry. The technical evolution of MLLMs exhibits three key trends: first, model architectures have advanced from early dual-tower contrastive learning approaches (e.g., CLIP) to mixture-of-experts designs built on the Transformer [5] architecture (e.g., VLMo [6]); second, novel training paradigms such as multimodal instruction tuning and reinforcement learning have significantly enhanced the models’ capacity to comprehend and execute new instructions; third, application scenarios have expanded from basic vision-language tasks to specialized domains.
Recently, MLLMs have made substantial breakthroughs in complex reasoning and multi-task processing. Many of these breakthroughs are closely tied to improved cross-modal alignment and deeper bidirectional reasoning, both of which resonate with symmetry-aware design principles. The DeepSeek-R1 model [7] represents a cutting-edge advancement in this field. Based on the DeepSeek-V3 [8] architecture, DeepSeek-R1 introduces systematic optimization under a staged reinforcement learning framework, significantly enhancing performance in complex reasoning tasks and outperforming peer models in comprehensive evaluations. The Grok-3 series of models has further pushed the boundaries of reasoning-focused MLLMs. The year 2025 has also witnessed remarkable advances in MLLMs, exemplified by VideoLLaMA 3 [9] for image and video understanding and VITA-1.5 [10] for real-time vision–speech interaction, each reflecting progress in modality integration and task-specific capabilities.
Despite this technical progress, several core tensions continue to hinder the development of multimodal intelligence. First, the heterogeneity of modalities makes it difficult to balance representation alignment efficiency against the preservation of fine-grained semantics; studies such as ViLT [11] have demonstrated that simple feature-level fusion is inadequate for supporting complex reasoning tasks. Second, the ever-increasing scale of models creates tension with limited computational resources, prompting the emergence of lightweight solutions such as MobileLLaMA [12]. Moreover, issues like answer quality bottlenecks caused by noisy data, multimodal hallucination, and representational bias call for systematic governance mechanisms. Addressing these challenges may benefit from explicitly considering symmetry and asymmetry across modalities, data distributions, and task formulations.
Given the rapid development and promising prospects of this field, this survey aims to provide a systematic review of the current state of research on MLLMs, offering both theoretical guidance and practical references for researchers in related domains. This survey is primarily based on the literature retrieved from Google Scholar, IEEE Xplore, ACM Digital Library, and arXiv, with emphasis on representative studies from the past five years published in leading international venues and directly related to MLLM architectures, training methods, and evaluation. Works lacking empirical evidence, showing weak relevance to multimodality, or being outdated were excluded to ensure systematic coverage and reproducibility. Additional information is provided in Appendix A.
The structure of this survey is as follows: Section 2 provides a comprehensive review of key elements in MLLMs, including mainstream model architectures and a comparative analysis of representative models, with LLaVA [13] used as a case study. Section 3 discusses complete training and fine-tuning strategies, focusing on methods employed in classical models such as CLIP and reasoning-oriented MLLMs like DeepSeek-R1. Section 4 examines the categorization of tasks and systematic evaluation methods for MLLMs. Section 5 analyzes major challenges in current research and application, identifying bottlenecks and proposing corresponding solutions. Section 6 explores future research directions from various perspectives. Finally, Section 7 summarizes the development of MLLMs and presents future outlooks. Throughout this review, we emphasize how the concept of symmetry provides a unifying perspective for understanding and advancing MLLM research.

2. Mainstream Model Architectures

Typical MLLMs can be abstracted into three core components: a pretrained modality encoder, a pretrained LLM, and a modality connector that bridges the two (refer to Figure 1). These components play critical and interrelated roles. The modality encoder is responsible for converting raw multimodal data, such as visual or auditory signals, into feature representations that can be processed by downstream models. The LLM serves as the central intelligence engine, leveraging its powerful language understanding and generation capabilities to perform complex semantic reasoning. Acting as the bridge, the modality connector ensures effective alignment between encoded features and the LLM’s input space, thereby enabling the model to comprehend and integrate multimodal information. The synergy among these three components propels the development and application of multimodal technologies. Notably, many mainstream MLLM architectures intrinsically incorporate symmetry-oriented design principles. These include encoder–decoder structural balance, which maintains bidirectional information flow and architectural parity; modality-parallel processing pathways, which enable simultaneous and structurally analogous treatment of different input modalities; and the use of shared latent representation spaces, which enforce alignment and transformation invariance across modalities. Such symmetry-aware features facilitate more unified, interpretable, and scalable multimodal systems by reducing modality-specific biases and promoting consistent cross-modal reasoning.

2.1. Modality Encoders

As the perceptual front-end of MLLMs, modality encoders convert raw multimodal inputs (e.g., images, audio, video) into low-dimensional semantic features. The design of modality encoders must strike a balance between computational efficiency and cross-modal alignment to ensure high-quality understanding by downstream LLMs. To enhance both performance and efficiency, many existing works adopt pretrained encoders that have achieved semantic alignment across modalities in other tasks. Vision encoders are a major research focus. For example, the CLIP series of models employs a ViT-based architecture trained with image–text contrastive learning on large-scale datasets to achieve cross-modal alignment. In contrast, Osprey [14] adopts ConvNeXt-L [15], a convolution-based encoder, to exploit higher resolutions and multi-level feature representations. The choice of encoder often depends on factors such as input resolution, parameter size, and pretraining data. Empirical studies have shown that input resolution tends to have a greater impact on model performance than the composition of training data. Current methods for improving input resolution fall into two main categories:
  • Direct Scaling: Feeding higher-resolution images directly into the encoder.
  • Image Partitioning: Splitting high-resolution images into smaller patches, which are then processed individually by a lower-resolution encoder. For example, Monkey [16] divides large images into multiple patches and combines them with a downsampled global image, allowing the patches to capture local features while the low-resolution image retains global context.
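The image-partitioning idea can be sketched in a few lines. The snippet below is a minimal illustration, not Monkey's actual implementation: it assumes a grayscale image stored as a nested list, non-overlapping square tiles, and a nearest-neighbour resize for the global view. All function names (`partition_boxes`, `downsample_nearest`, `encoder_inputs`) are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive
Image = List[List[int]]          # assumption: 2D grid of pixel values


def partition_boxes(width: int, height: int, tile: int) -> List[Box]:
    """Crop boxes that cover the image with tile x tile windows.

    Edge tiles are clipped to the image border rather than padded.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


def crop(image: Image, box: Box) -> Image:
    """Extract the sub-image described by box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def downsample_nearest(image: Image, out_h: int, out_w: int) -> Image:
    """Nearest-neighbour resize, used here to build the low-resolution global view."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def encoder_inputs(image: Image, tile: int) -> Tuple[List[Image], Image]:
    """Return (local tiles, global view), both sized for a low-resolution encoder."""
    height, width = len(image), len(image[0])
    tiles = [crop(image, box) for box in partition_boxes(width, height, tile)]
    global_view = downsample_nearest(image, tile, tile)
    return tiles, global_view
```

Each tile would then be encoded separately, with the global view supplying the context that individual tiles lack.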
For other modalities, representative encoders have also been developed. For instance, Pengi [17] uses CLAP [18] as an audio encoder. ImageBind [19] serves as a unified encoder for multiple modalities, including images, text, audio, depth, thermal imagery, and inertial measurement unit data.

2.2. Pretrained LLMs

Compared to training from scratch, using pretrained LLMs is more efficient and practical. These models are trained on massive web-scale corpora and have embedded extensive world knowledge, exhibiting strong generalization and reasoning abilities. Most LLMs follow the autoregressive decoder-only design popularized by GPT-3 [20]. Among the early widely adopted models is the FlanT5 [21] series, an encoder–decoder family that has been applied in works such as BLIP-2 [22]. Currently, the LLaMA [23] series stands as a representative open-source LLM, widely used in academia.
Most LLMs are built on the Transformer architecture, though recent advancements include the Mamba [24] architecture, which introduces a selective state-space model whose parameters depend on the input. This model scales linearly in both inference cost and memory usage with respect to sequence length. Furthermore, thanks to efficient GPU implementations, Mamba offers faster training while maintaining high performance. Similar to Transformers, Mamba is pretrained via next-token prediction for language modeling.

2.3. Modality Connectors

Since LLMs are inherently designed to process only textual data, a bridging mechanism is required to connect them with other modalities. Training an MLLM from scratch is costly; thus, a more practical approach is to introduce trainable connectors that integrate pretrained vision encoders with LLMs. Another feasible strategy is to employ expert models to convert visual inputs into textual descriptions, which are then fed into LLMs.
Various learnable connectors have been employed in MLLMs, ranging from simple linear layers and multi-layer perceptrons (MLPs) to advanced Transformer-based solutions such as Q-Former, or conditional cross-attention layers integrated directly into the LLM.

2.3.1. Learnable Connectors

A direct approach to projecting visual features into the text embedding space is a linear or MLP projection. LLaMA-Adapter [25] uses a single linear layer for multimodal alignment, while LLaVA-1.5 [26] adopts a two-layer MLP, demonstrating enhanced multimodal capabilities. Despite its simplicity, linear projection remains effective in both early and advanced MLLM frameworks [27,28,29], making it a widely used and efficient technique for aligning visual features with text embeddings.
Q-Former, first introduced in BLIP-2, is a Transformer-based architecture that has since been adopted in several methods [30,31]. It features a dual-layer self-attention module, enabling alignment between visual and textual representations. Q-Former interacts with visual features through learnable queries and cross-attention mechanisms. Models like mPLUG-Owl2 [32] simplify the Q-Former structure and propose a visual abstraction component that compresses visual information into learnable tokens, enhancing the semantic richness of visual representations. Similarly, Qwen2-VL [33] compresses visual features using learnable queries and integrates 2D positional encodings.
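To make the role of learnable queries concrete, the sketch below shows how a fixed number of query vectors can compress an arbitrary number of visual tokens through cross-attention. It is a deliberately reduced illustration and not the Q-Former itself: it assumes a single head, identity key/value projections, and no self-attention or feed-forward layers. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries: List[Vector], visual_tokens: List[Vector]) -> List[Vector]:
    """Each query attends over all visual tokens and returns a weighted average.

    Assumption: keys and values are the visual tokens themselves (no learned
    projections), so the output has one vector per query regardless of how
    many visual tokens come in.
    """
    dim = len(queries[0])
    outputs = []
    for query in queries:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in visual_tokens])
        outputs.append(
            [sum(w * token[j] for w, token in zip(weights, visual_tokens)) for j in range(dim)]
        )
    return outputs
```

In a trained connector the queries are parameters learned jointly with the projection weights. Here they are plain inputs so that the compression effect (many tokens in, a fixed number out) is easy to see.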
The cross-attention layer approach, first proposed in Flamingo, incorporates dense cross-attention blocks within pretrained LLM layers. While this adds a significant number of trainable parameters and increases computational cost, it improves training stability and performance. To reduce overhead, this strategy is often combined with Perceiver-based [34] modules that compress visual tokens before feeding them into the LLM. This method has been adopted in multiple recent models [35,36].

2.3.2. Expert Models

Expert models serve as intermediaries that convert multimodal inputs directly into natural language descriptions, allowing LLMs to understand multimodal content without requiring additional retraining. This strategy greatly enhances model flexibility and applicability across modalities. For example, image captioning models can transform visual elements such as objects, scenes, and actions into descriptive text. Similarly, models like VideoChat-Text [37] extract motion features from videos using pretrained vision models and combine them with speech recognition outputs, converting both visual and auditory content into detailed textual descriptions. This enables LLMs to comprehend video content effectively.
However, despite their convenience, expert models are less flexible than learnable connectors. The conversion process inevitably results in information loss. For instance, in the case of video-to-text conversion, the complex spatiotemporal relationships and dynamic variations present in video content may be distorted or omitted, leading to incomplete semantic representations. Therefore, practical applications must carefully balance the convenience of expert models against the need for semantic fidelity.

2.4. Representative MLLMs

The “representative MLLMs” discussed in this survey include MLLMs with broad impact in both academia and industry. They are considered representative because they achieved state-of-the-art results on major benchmarks and introduced methodological innovations that have substantially influenced subsequent research.
In this section, we take LLaVA as a representative example to illustrate the typical architecture of an MLLM:
  • Encoding: LLaVA employs the pretrained Vicuna model as the language backbone $f_{\phi}$ and uses the ViT-L/14 encoder from CLIP as the vision encoder $g$.
  • Modality Alignment: For the input image $X_v$, the visual encoder $g$ first produces the visual feature representation $Z_v$, which is subsequently transformed via a linear projection layer $W$ into $H_v$. The language instruction $X_q$ is converted into a textual feature representation $H_q$ through a tokenizer. The visual path can be written as
      $$Z_v = g(X_v), \qquad H_v = W \cdot Z_v$$
  • Connection: The visual feature $H_v$ and the textual representation $H_q$ are concatenated and fed into the language model $f_{\phi}$ to produce the final output $X_a$. This process can be formally expressed as
    $$X_a = f_{\phi}\big(\mathrm{concat}(H_v, H_q)\big) = f_{\phi}\big(\mathrm{concat}(W Z_v, H_q)\big) = f_{\phi}\big(\mathrm{concat}(W\, g(X_v), H_q)\big)$$
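The projection-and-concatenation step in the list above can be sketched as follows. This is a toy illustration under stated assumptions, not LLaVA's code: features are plain Python lists, the projection matrix `W` is given rather than learned, and the language model itself is left out, so the function returns only the sequence that would be passed to it. The names `matvec` and `build_llm_input` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, z: Vector) -> Vector:
    """Apply the linear projection W to one visual feature vector."""
    if any(len(row) != len(z) for row in W):
        raise ValueError("W must have as many columns as the visual feature dimension")
    return [sum(w * x for w, x in zip(row, z)) for row in W]


def build_llm_input(visual_features: List[Vector], W: Matrix,
                    text_embeddings: List[Vector]) -> List[Vector]:
    """Project each visual token (H_v = W Z_v) and prepend it to the text tokens H_q."""
    projected = [matvec(W, z) for z in visual_features]
    embed_dim = len(W)
    if any(len(h) != embed_dim for h in text_embeddings):
        raise ValueError("text embeddings must match the projected dimension")
    # Concatenation along the sequence axis: visual tokens first, then text tokens.
    return projected + text_embeddings
```

Placing the visual tokens before the text tokens is an assumption of this sketch; the essential point is that, after projection, both kinds of token live in the same embedding space and form one sequence.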
Moreover, the field of MLLMs has been developing rapidly, with numerous innovative architectures emerging (refer to Figure 2).

2.5. Classification of MLLM Architectures

Based on whether the visual and language modalities possess equivalent representational capacity in terms of dedicated parameters and computational resources, and whether these modalities interact within deep neural networks, four prototypical architectures of MLLMs can be identified (refer to Figure 3).
Visual–semantic embedding models such as SCAN [38] fall into the first category in Figure 3. These models employ separate encoders for images and texts, with the image encoder typically being more complex. The similarity between the two modalities is then computed using simple dot products or shallow attention layers.
CLIP represents the second category, as it uses independent but equally complex Transformer-based encoders for each modality. However, the interaction between image and text features remains shallow (e.g., via dot product). Although CLIP demonstrates remarkable zero-shot performance in image–text retrieval, its effectiveness on other downstream tasks remains limited. For instance, on NLVR2 [39], fine-tuning with dot-product multimodal representations yields only 50.99 ± 0.38 accuracy, close to chance level, indicating insufficient capacity for complex reasoning tasks.
In contrast, some vision-language models employ deep Transformer layers to model cross-modal interactions between image and text features. Certain modulation-based vision-language models [40] belong to the third category in Figure 3. ViLT [11] is the first model representative of the fourth category, where the image embedding layer is as shallow and computationally lightweight as textual token embeddings. Conventional VLP models (e.g., UNITER [41], Pixel-BERT [42]) rely on convolutional backbones, leading to high latency (∼900 ms for UNITER-Base, ∼60 ms for Pixel-BERT-R50). In contrast, ViLT removes CNNs by employing lightweight linear embeddings and a Transformer, reducing inference time to ∼15 ms while achieving competitive performance (76.1 on NLVR2, 83.5 R@1 on Flickr30K). This demonstrates that MLLMs can improve efficiency without sacrificing effectiveness by minimizing modality-specific encoders and emphasizing cross-modal interaction.
After presenting the taxonomy of MLLMs, we further compare representative models in terms of their architectural designs and task specializations. Table 1 summarizes the key components of different MLLMs.

3. Training and Fine-Tuning Strategies

3.1. Pretraining

Pretraining serves as the initial stage in the training of MLLMs. Its primary objective is to align different modalities and enable the model to acquire multimodal world knowledge. A common pretraining strategy involves freezing the visual encoder and the LLM, while training only the learnable interface modules [54]. This approach aims to align modalities while preserving the knowledge encoded in the pretrained components. Some methods [55], however, unfreeze more components to introduce additional trainable parameters, thereby enhancing modality alignment. Many pretraining strategies implicitly reflect symmetry-related principles, such as encouraging modality-invariant representations, aligning visual and textual semantics, or balancing the contribution of each modality through joint learning objectives. Specifically, pretraining an MLLM can be carried out in either a single-stage or multi-stage manner.

3.1.1. Single-Stage Pretraining

LLaMA-Adapter introduces additional trainable parameters to encapsulate visual knowledge while managing instruction tuning for textual inputs. The model is jointly trained on image–text pairs and instructions across different parameter subsets. Koh [56] proposes a model that adjusts the overall loss by incorporating two contrastive loss functions for image–text retrieval; during training, only three linear layers are updated. Flamingo and its open-source variants [36] train cross-attention layers and Perceiver-based components to connect visual features to the frozen backbone of the LLM.

3.1.2. Multi-Stage Pretraining

In the first stage of multi-stage training, the objective is typically to align image features with the text embedding space. The model's output after this stage is often fragmented, so the second stage focuses on enhancing its multimodal dialogue capabilities. LLaVA is among the first models to adopt a visual instruction tuning paradigm. In its first stage, only the multimodal adapter is trainable, while the second stage fine-tunes both the adapter and the language model parameters. MiniGPT-4 [44] trains the linear alignment layers in both stages. After initial training, it uses the model itself to curate and refine high-quality training data for the second stage. Unlike previous approaches, mPLUG-Owl2 updates the visual backbone during the initial stage to capture both low-level and high-level visual features. In the second stage, it jointly trains on both unimodal (text-only) and multimodal data to enhance alignment.
VLMo adopts a phased pretraining strategy: it first trains visual experts and self-attention modules using a BEiT-style [57] masked image modeling method on a large image dataset to learn visual representations; it then freezes these modules and trains text experts on a large-scale text corpus using masked language modeling to capture deep semantic features. Finally, it conducts joint training on image–text pairs to achieve cross-modal learning and improve generalization and representation capacity. VideoLLaMA 3 [9] also serves as a representative multi-stage pretrained model, progressively building multimodal understanding through staged visual adaptation, modality alignment, and task fusion, highlighting the effectiveness of multi-stage pretraining for complex modality integration.
The choice of training strategy is closely tied to data quality. For short and noisy caption data, lower image resolution can be used to accelerate training; for long, high-quality data, higher resolution is recommended to mitigate hallucinations. ShareGPT4V [58] shows that unfreezing the visual encoder significantly improves alignment when high-quality caption data is used in pretraining.
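A staged schedule of this kind is largely a question of which modules receive gradient updates in which stage. The sketch below encodes the LLaVA-style two-stage recipe described above as data. The module names and the `Stage` class are hypothetical conventions chosen for illustration, not identifiers from any released codebase.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Assumption: the model is viewed as three coarse modules.
MODULES = ("vision_encoder", "connector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: FrozenSet[str]


def requires_grad(stage: Stage) -> Dict[str, bool]:
    """Map every module to True (updated) or False (frozen) for this stage."""
    unknown = stage.trainable - set(MODULES)
    if unknown:
        raise ValueError(f"unknown modules: {sorted(unknown)}")
    return {module: module in stage.trainable for module in MODULES}


# LLaVA-style schedule as described in the text: stage 1 trains only the
# adapter; stage 2 trains the adapter together with the language model.
LLAVA_STYLE: List[Stage] = [
    Stage("feature alignment", frozenset({"connector"})),
    Stage("visual instruction tuning", frozenset({"connector", "llm"})),
]
```

Under the same convention, a recipe that unfreezes the visual backbone in its first stage would simply add `"vision_encoder"` to that stage's `trainable` set.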

3.2. Fine-Tuning

3.2.1. Instruction Tuning

Instruction tuning aims to enhance the model’s ability to understand and follow user instructions, enabling better generalization to unseen tasks and improving zero-shot performance. This approach has achieved notable success in natural language processing, exemplified by models such as FLAN [21]. A multimodal instruction sample typically consists of three components: an instruction, an input, and an output. The instruction is a natural language description of the task; the input can be an image–text pair [59] or a single modality [60]; and the output is the model-generated response. Instruction templates are highly flexible and can be manually designed to suit different tasks. They can also be extended to multi-turn dialogue settings. Formally, a multimodal instruction sample can be represented as a triplet (I, M, R), where I, M, and R denote the instruction, modality input, and ground-truth response, respectively. The model generates predicted responses based on the instruction and input. SVLA [61] applies end-to-end joint instruction tuning using multimodal instruction data to align speech, vision, and language modules, exemplifying the effectiveness of instruction tuning in multimodal integration.
Instruction data can be collected via data adaptation, self-instruction, and dataset mixing. Task-specific datasets are a key source of high-quality instruction data. For example, the VQA dataset consists of (image, question, answer) triplets, which can be naturally transformed into instruction samples. Notably, answers in existing datasets tend to be short, which may limit the model’s generative capabilities. Possible solutions include: explicitly specifying output length in the instruction and using GPT to extend the answer. To support real-world scenarios such as multi-turn dialogue, some studies employ self-instruction methods [62], where LLMs generate instruction-following data based on a small set of human-labeled examples. LLaVA extends this method to the multimodal domain by converting images into text descriptions and bounding boxes, prompting GPT-4 to generate instruction data, resulting in the LLaVA-Instruct-150k dataset. With the advent of GPT-4V, some studies now leverage it to generate higher-quality training data.
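As a small illustration of data adaptation, the sketch below turns a VQA-style (image, question, answer) triplet into an instruction sample of the form (I, M, R). The field names, the `vqa_to_instruction` helper, and the wording of the length hint are assumptions made for this example; they are not taken from a specific dataset release.

```python
from dataclasses import dataclass


@dataclass
class InstructionSample:
    instruction: str      # I: natural-language task description
    modality_input: str   # M: here, an identifier or path for the image
    response: str         # R: ground-truth response


# Hypothetical length hint; it states the expected output length explicitly
# so that short dataset answers do not bias the model toward terse replies
# on open-ended prompts.
SHORT_ANSWER_HINT = "Answer the question using a single word or phrase."


def vqa_to_instruction(image_id: str, question: str, answer: str,
                       short_answer: bool = True) -> InstructionSample:
    """Convert one VQA triplet into an (I, M, R) instruction sample."""
    instruction = question.strip()
    if short_answer:
        instruction = f"{instruction} {SHORT_ANSWER_HINT}"
    return InstructionSample(
        instruction=instruction,
        modality_input=image_id,
        response=answer.strip(),
    )
```

For example, `vqa_to_instruction("img_001.jpg", "What color is the bus?", "red")` yields a sample whose instruction carries the explicit length hint and whose response is the original short answer.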
Studies show that the quality of instruction-tuning data significantly impacts model performance. Lynx [63] finds that models pretrained on smaller but higher-quality datasets outperform those trained on larger yet noisier datasets. Experimental results also suggest that diversified prompts improve model performance. Among task types, visual reasoning tasks offer more significant performance gains than captioning or Q&A. Moreover, increasing instruction complexity appears to be more effective than simply expanding task diversity or adding fine-grained spatial annotations.

3.2.2. Parameter-Efficient Fine-Tuning (PEFT)

When a pretrained LLM needs to be adapted to a specific domain or application, PEFT provides an effective alternative to full model tuning. Instead of updating the entire model, PEFT methods introduce a small number of additional trainable parameters. LaVi [64] introduces an internal visual feature modulation mechanism within the language model, enabling efficient multimodal integration and exemplifying the use of parameter-efficient fine-tuning in vision–language tasks.
Among these approaches, prompt tuning [65,66] learns a small set of vectors that are prepended to the input text as soft prompts. In contrast, Low-Rank Adaptation (LoRA) [67] constrains the number of new parameters by learning low-rank matrices that approximate the weight updates. LoRA is complementary to quantization: QLoRA [68] trains LoRA adapters on top of a base model whose weights are stored in 4-bit precision, which reduces memory consumption well below that of standard half-precision models.
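The low-rank idea behind LoRA can be made concrete with a small sketch. It assumes the usual formulation in which a frozen weight $W$ is augmented by a scaled product of two small matrices, $h = Wx + \frac{\alpha}{r} B A x$, with $A$ of shape $r \times d_{in}$ and $B$ of shape $d_{out} \times r$. The function names are hypothetical and the code uses plain lists rather than a tensor library.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(M: Matrix, x: Vector) -> Vector:
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Frozen path W x plus the scaled low-rank update (alpha / r) * B (A x)."""
    rank = len(A)
    base = matvec(W, x)                 # frozen pretrained weights
    update = matvec(B, matvec(A, x))    # trainable low-rank path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Dict[str, int]:
    """Compare trainable parameters: full fine-tuning vs. a rank-r adapter."""
    return {"full": d_out * d_in, "lora": rank * (d_in + d_out)}
```

For a 4096 × 4096 projection and rank 8, `lora_param_counts(4096, 4096, 8)` gives 16,777,216 parameters for full tuning against 65,536 for the adapter, roughly 0.4% of the original count.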

3.2.3. Alignment Tuning

Alignment tuning is commonly applied when models need to align with human preferences in specific contexts. Currently, two main techniques dominate this area: Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO).
RLHF leverages reinforcement learning algorithms to fine-tune LLMs based on human feedback. This approach incorporates human-annotated preferences as supervision signals to guide model behavior during training. The RLHF pipeline typically involves three key stages: supervised fine-tuning, reward modeling, and reinforcement learning.
Researchers have also extended RLHF to multimodal settings. For example, LLaVA-RLHF [69] collects preference pairs based on human feedback to fine-tune the model, resulting in reduced hallucinations and more aligned outputs. On the other hand, DPO simplifies the pipeline by directly optimizing a binary preference loss, thereby avoiding explicit reward modeling required in traditional methods.
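The binary preference loss used by DPO can be written compactly. The sketch below assumes the standard per-pair form, $-\log \sigma\big(\beta[(\log \pi_\theta(y_w) - \log \pi_{ref}(y_w)) - (\log \pi_\theta(y_l) - \log \pi_{ref}(y_l))]\big)$, where $y_w$ is the preferred response and $y_l$ the rejected one. Inputs are sequence-level log-probabilities; the default `beta` is an illustrative value, not a recommendation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    The margin measures how much more the policy prefers the chosen response
    over the rejected one, relative to the frozen reference model.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))
```

When the policy equals the reference model, the margin is zero and the loss is $\log 2$; it falls as the policy shifts probability toward the preferred response. No separate reward model appears anywhere in the computation, which is the simplification the text refers to.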

3.2.4. Representative Model Case Studies

This section presents the complete training and fine-tuning pipelines of CLIP and a recent reasoning model, DeepSeek-R1, providing readers with a comprehensive understanding of training practices.
CLIP is a foundational MLLM that has inspired numerous variants. Its text encoder typically adopts a Transformer architecture, using self-attention mechanisms to capture dependencies between words and generate text embeddings. For visual encoding, CLIP employs the Vision Transformer (ViT) [70], which encodes images via self-attention and projects the resulting representation into the same embedding space as the text vector through a fully connected layer.
CLIP was pretrained on a massive dataset containing 400 million image–text pairs collected from the internet. This diverse dataset spans a wide range of domains and topics, allowing the model to learn rich multimodal representations. Because the dataset is so large, the original CLIP training recipe uses only light image augmentation (a random square crop of the resized image) and relies mainly on the scale and diversity of the data for robustness to image distortions and noise.
The core idea behind CLIP’s learning objective is contrastive learning—maximizing the similarity between matched pairs while minimizing the similarity between mismatched ones. Specifically, CLIP adopts the InfoNCE loss, formulated as follows:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\cos\left(z_i^{I}, z_i^{T}\right)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\cos\left(z_i^{I}, z_j^{T}\right)/\tau\right)}$$
Here, $N$ denotes the batch size, $z_i^{I}$ and $z_i^{T}$ represent the embedding vectors of the $i$-th image and its corresponding text, and $\cos\left(z_i^{I}, z_i^{T}\right)$ indicates the cosine similarity between two vectors. The parameter $\tau$ is a temperature scalar used to scale the cosine similarity values. Contrastive pretraining methods promote bidirectional semantic consistency across modalities, which aligns with functional notions of symmetry in cross-modal representation learning. The CLIP model has demonstrated broad applicability across various domains, including image retrieval, text generation, image classification, and multimodal analysis tasks.
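The loss above can be checked numerically with a short sketch. It implements only the image-to-text direction exactly as written in the formula; in practice the same term is usually also computed in the text-to-image direction and the two are averaged. The temperature default and the function names are illustrative assumptions.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def info_nce(image_embs: List[Vector], text_embs: List[Vector], tau: float = 0.07) -> float:
    """Image-to-text InfoNCE over a batch of N matched (image, text) pairs."""
    if len(image_embs) != len(text_embs):
        raise ValueError("need one text embedding per image embedding")
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(image_embs[i], text_embs[j]) / tau for j in range(n)]
        # log-sum-exp with the maximum subtracted for numerical stability
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += logits[i] - log_denominator
    return -total / n
```

Two sanity checks follow from the formula: if every embedding in the batch is identical, the loss equals $\log N$, and if matched pairs are far more similar than mismatched ones, the loss approaches zero.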
DeepSeek-R1 is a recently released reasoning-oriented MLLM, optimized from DeepSeek-V3 to enhance reasoning capabilities. This section details its training and fine-tuning strategies. The DeepSeek team employed a multi-stage reinforcement learning pipeline, starting with the DeepSeek-R1-Zero phase, in which reinforcement learning driven by rule-based rewards was applied to the base model. The final DeepSeek-R1 stage involved a systematic multi-phase training procedure. The goal is to improve reasoning, problem-solving, and context-aware response generation.
Each stage is analyzed in turn below. In the DeepSeek-R1-Zero phase, DeepSeek-V3 acts as the agent, generating answers and reasoning processes. Correct responses or logically sound reasoning chains receive positive rewards, guiding the model to update its policy and promote higher-reward generations.
Traditional reinforcement learning frameworks often incorporate a critic model to assess the quality of the agent’s output and provide feedback. Typically, the critic is of comparable scale and complexity to the policy model. However, DeepSeek takes a different approach with its Group-based Reinforcement Policy Optimization (GRPO) algorithm. Rather than employing a critic model of equal size, GRPO estimates the baseline directly from group-level scores. For each question q , the GRPO algorithm samples a set of outputs o 1 , o 2 , , o G from the previous policy and then updates the current policy π θ by maximizing the following objective:
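$$
J_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\ \operatorname{clip}\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\ 1 - \varepsilon,\ 1 + \varepsilon \right) A_i \right) - \beta\, D_{KL}\left( \pi_\theta \,\|\, \pi_{ref} \right) \right) \right],
$$
$$
D_{KL}\left( \pi_\theta \,\|\, \pi_{ref} \right) = \frac{\pi_{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log \frac{\pi_{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1,
$$
In this formulation, $\varepsilon$ and $\beta$ are predefined hyperparameters, while $A_i$ represents the computed advantage. A group of rewards $\{r_1, r_2, \ldots, r_G\}$ is assigned to the corresponding outputs within each sample group:
$$
A_i = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \ldots, r_G\})}{\operatorname{std}(\{r_1, r_2, \ldots, r_G\})}
$$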
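The group-relative advantage and the clipped objective can be made concrete with a small numerical sketch. It operates on per-output log-probabilities rather than a real policy network. The use of the population standard deviation, the small stabilizing constant `eps`, and the default values of `clip_eps` and `beta` are assumptions made for illustration, since the formulation above does not fix them.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / std(r), computed within one sampled group.

    eps is an assumed stabilizer for groups whose rewards are all equal.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]

def kl_estimate(logp_new, logp_ref):
    """Per-output estimator pi_ref/pi_theta - log(pi_ref/pi_theta) - 1."""
    log_ratio = logp_ref - logp_new
    return math.exp(log_ratio) - log_ratio - 1.0

def grpo_objective(logp_new, logp_old, logp_ref, rewards,
                   clip_eps=0.2, beta=0.04):
    """Value of the GRPO objective for one question and its G outputs.

    logp_* hold log pi(o_i | q) under the current, old, and reference
    policies. clip_eps and beta are illustrative hyperparameter values.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(ratio * adv, clipped * adv)
        total += surrogate - beta * kl_estimate(ln, lr)
    return total / len(rewards)
```

Because the baseline is the group mean, no separate critic is evaluated: an output is reinforced only to the extent that its reward exceeds that of its siblings sampled for the same question.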
The reward serves as the source of the training signal and determines the direction of optimization. To train DeepSeek-R1-Zero, the team employed a rule-based reward system, which primarily includes two types of rewards: accuracy reward and format reward.
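A rule-based reward of this kind can be sketched as follows. The sketch assumes that the reasoning is wrapped in `<think>...</think>` tags followed by the final answer, that correctness is checked by exact string match against a reference, and that the two rewards are summed with equal weight. These choices, along with all function names, are illustrative assumptions and not a description of the actual DeepSeek reward code.

```python
import re

# Assumed output template: reasoning inside <think> tags, then the answer.
THINK_PATTERN = re.compile(r"^\s*<think>(.+?)</think>\s*(.+?)\s*$", re.DOTALL)

def format_reward(output):
    """1.0 if the output follows the assumed <think>...</think> template."""
    return 1.0 if THINK_PATTERN.match(output) else 0.0

def accuracy_reward(output, reference):
    """1.0 if the final answer exactly matches the reference string."""
    match = THINK_PATTERN.match(output)
    answer = match.group(2) if match else output
    return 1.0 if answer.strip() == reference.strip() else 0.0

def rule_based_reward(output, reference):
    """Equal-weight sum of accuracy and format rewards (an assumption)."""
    return accuracy_reward(output, reference) + format_reward(output)
```

For verifiable tasks such as mathematics, the exact-match check would typically be replaced by a task-specific verifier; the structure of the reward stays the same.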
DeepSeek-R1-Zero excels in reasoning but struggles with readability and language consistency. To enhance its performance, the DeepSeek team employed cold-start data collection and Supervised Fine-Tuning (SFT). They provided the DeepSeek-V3 Base model with chain-of-thought exemplars, enabling structured reasoning and reducing inconsistencies. Direct prompting was used to ensure the model explicitly detailed its reasoning process.
The team also refined outputs from DeepSeek-R1-Zero, creating a high-quality dataset for fine-tuning the DeepSeek-V3 Base model. Reinforcement learning, with an extended reward mechanism including language consistency, was reintroduced. Approximately 600,000 optimal reasoning samples were retained after rigorous filtering. An additional 200,000 non-reasoning samples were included to bolster general capabilities. The final stage involved reinforcement learning across all scenarios to align the model with human preferences, enhancing usefulness and harmlessness.
The model produced by this multi-stage training was officially named DeepSeek-R1; it outperformed both open-source and proprietary counterparts on several popular benchmark datasets (see Figure 4).
Following the description of the training pipeline of DeepSeek-R1, we next provide a broader comparative perspective by examining the performance of representative MLLMs across key evaluation dimensions. As illustrated in Figure 5, the radar chart highlights the relative strengths and weaknesses of these models in tasks such as mathematics, knowledge-intensive reasoning, code generation, and general-purpose inference.

4. Task Taxonomy and Evaluation Methods for MLLMs

4.1. Task Taxonomy

In the field of MLLMs, tasks can be broadly categorized into two fundamental types based on their intrinsic logic and core objectives: understanding tasks and generation tasks. These two categories are closely interconnected and mutually reinforcing, forming the foundational framework for the technological development and practical applications of MLLMs. We further summarize the representative benchmarks and commonly used evaluation metrics associated with each category. Table 2 provides an overview of task types, benchmarks, and evaluation measures, which serves as a reference for systematic comparison. From a symmetry-oriented perspective, task taxonomies can benefit from maintaining a balanced coverage across modalities, interaction types, and cognitive levels, which ensures that MLLMs are evaluated under structurally diverse yet functionally comparable scenarios.

4.1.1. Understanding Tasks

Understanding tasks aim to extract deep semantic information from diverse and complex multimodal inputs, accurately analyze the intrinsic associations among modalities, and achieve deep cross-modal alignment. This process involves the systematic analysis and comprehensive understanding of multimodal data such as images, videos, texts, and audio, laying the cognitive foundation for subsequent interaction and content generation. Specifically, understanding tasks encompass the following closely coupled and interdependent sub-directions:
  • Deep Visual Content Analysis: This focuses on a comprehensive understanding of images and videos. In visual question answering (VQA), the model is required to respond accurately to natural language questions based on the given image or video content. For example, BLIP [92] significantly improves performance on VQA tasks through a joint training strategy for vision-language models, enabling the model to better capture the semantic association between vision and language. GLaMM [93] introduces a region encoder to support fine-grained visual grounding in question answering, allowing the model to interpret and respond to specific regions within an image. Image captioning tasks aim to generate precise and descriptive text for input images. Region understanding and visual grounding tasks focus on locating specific regions in images and extracting their semantic content. VistaLLM [94] and LLaFS adopt point-sequence encoding approaches to process polygonal regions, providing key technical support for precise region localization.
  • Cross-modal Association and Retrieval: This focuses on establishing tight correlations between different modalities within a shared embedding space and enabling efficient bidirectional retrieval. In cross-modal retrieval, CLIP leverages contrastive learning to align image–text features, laying a solid foundation for image-to-text and text-to-image retrieval. ImageBind further expands these capabilities by incorporating images, videos, and audio, significantly enriching the application scenarios and possibilities of cross-modal retrieval.
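Bidirectional retrieval in a shared embedding space can be sketched in a few lines. The example assumes that image and text embeddings have already been produced by some encoder and that matching pairs share the same index. The helper names are hypothetical, and Recall@K is used only as a common illustrative retrieval metric.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(queries, candidates):
    """Cosine similarity of every query against every candidate."""
    q = [normalize(v) for v in queries]
    c = [normalize(v) for v in candidates]
    return [[sum(a * b for a, b in zip(qi, cj)) for cj in c] for qi in q]

def recall_at_k(sim, k):
    """Fraction of queries whose true match (same index) is in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

# Toy embeddings: the third text lies closer to the second image.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
texts = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.1, 0.9, 0.3]]

image_to_text = similarity_matrix(images, texts)
text_to_image = similarity_matrix(texts, images)
```

In this toy case, image-to-text Recall@1 is perfect while text-to-image Recall@1 is not, which shows why both retrieval directions are usually reported separately.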

4.1.2. Generation Tasks

The core objective of generation tasks is to creatively generate output in one modality based on input from another modality. These tasks demand high-quality, relevant, and controllable content generation. Generation tasks include:
  • Image-related Generation and Editing: Image generation tasks aim to create high-quality images based on text descriptions. GILL [53] maps frozen MLLMs to diffusion models to achieve multimodal image generation, offering new paradigms for such tasks. DALL·E 2 [95], based on advanced diffusion models, enables high-resolution image generation, significantly advancing visual quality. Image editing tasks modify images locally or globally according to instructions or text. InstructPix2Pix [96] combines LLMs to generate editing instructions, offering more intelligent and flexible editing capabilities. MLLM-Guided Editing [97] further utilizes multimodal dialogue understanding to refine editing requirements, allowing personalized and precise modifications.
  • Cross-modal and Domain-specific Generation: Multimodal generation tasks aim to generate content across modalities, such as text-to-video or image-to-text. NExT-GPT [98], a major breakthrough in this domain, supports arbitrary modality-to-modality generation, expanding the application scope and innovation potential. VITA-1.5 [10] enables interactive generation across speech and vision modalities, supporting speech output from both speech and image inputs, and exemplifies an end-to-end approach to multimodal generation tasks. Domain-specific generation focuses on areas like healthcare and design, leveraging domain knowledge and multimodal data. For instance, Medical Image Synthesis [99] combines medical knowledge bases to generate high-quality diagnostic images, serving as a valuable tool in medical diagnosis. Robot Action Generation [100] translates visual inputs into robotic action commands, enabling intelligent decision-making and control in complex environments, driving real-world deployment of robotic technologies.

4.2. Evaluation Methods

Evaluation plays a critical role in the research and development of MLLMs. It is not only essential for measuring model performance, but also drives optimization and technical advancement. Symmetry principles may also inform the design of evaluation protocols, such as using bidirectional tests or balanced benchmark construction, to reduce modality or task biases and promote more reliable comparisons. Compared to traditional LLMs, MLLMs exhibit more complex characteristics, necessitating a comprehensive evaluation of their multifunctionality, emergent capabilities, and adaptability to intricate interaction scenarios.
Currently, evaluation methods are mainly divided into closed-ended and open-ended approaches. Each is suited to different settings and comes with its own strengths and limitations.

4.2.1. Closed-Ended Evaluation

Closed-ended evaluation focuses on tasks where answers are predefined within a limited set of options. This method relies heavily on structured datasets and standardized metrics, such as accuracy and CIDEr [101] scores. Its main advantage lies in its quantifiability and reproducibility, offering objective benchmarks for model performance.
In a closed-ended evaluation, benchmark datasets play a pivotal role. In tasks such as VQA, image captioning, and scientific reasoning, models are typically evaluated under zero-shot or fine-tuned settings. Comprehensive benchmarks have also emerged to provide a broader assessment. MME [102] covers 14 perceptual and cognitive tasks, ensuring fairness and effectiveness. MMBench [74] evaluates multidimensional capabilities by mapping open responses to predefined options, enriching the evaluation perspectives. LENS [103] introduces a hierarchical visual reasoning benchmark consisting of approximately 3400 real-world images and over 60,000 human-crafted QA pairs across perception, understanding, and reasoning levels. It serves as a closed-set evaluation tool to systematically assess model performance across varying depths of reasoning. Additionally, specific capability tests, such as POPE [104], use adversarial samples to detect hallucinations and assess model reliability in content generation.
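For closed-ended hallucination probes that pose yes/no questions about object presence, the standard binary classification metrics apply directly. The sketch below assumes that model outputs have already been mapped to the strings "yes" or "no"; the function name and the returned dictionary layout are illustrative assumptions.

```python
def binary_probe_metrics(predictions, labels):
    """Accuracy, precision, recall, F1, and yes-ratio for yes/no answers.

    predictions and labels are equal-length lists of "yes" / "no".
    "yes" is treated as the positive class.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == "yes" and y == "yes")
    fp = sum(1 for p, y in pairs if p == "yes" and y == "no")
    tn = sum(1 for p, y in pairs if p == "no" and y == "no")
    fn = sum(1 for p, y in pairs if p == "no" and y == "yes")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A yes-ratio far above the share of positive labels suggests a
        # tendency to affirm objects that are not present.
        "yes_ratio": (tp + fp) / len(pairs),
    }
```

Reporting the yes-ratio alongside accuracy helps separate a model that is genuinely accurate from one that simply answers "yes" to most questions.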

4.2.2. Open-Ended Evaluation

Open-ended evaluation requires models to generate free-form responses, which places higher demands on semantic understanding, reasoning, and creativity and makes the assessment itself more complex than in the closed-ended setting.
Human evaluation remains a key approach in open-ended settings. Experts design question sets covering various dimensions, including visual understanding and complex reasoning, and subjectively assess the quality of model responses. For instance, GPT4Tools [105] assesses the logical coherence of model outputs from multiple perspectives, offering qualitative insights.
GPT-based scoring enables automatic evaluation of response relevance and accuracy using LLMs. With advancements in vision-language capabilities, GPT-4V has emerged, capable of analyzing visual content directly, thereby improving evaluation reliability. Woodpecker [106] scores models based on image–text alignment, offering evaluations more reflective of real-world multimodal tasks. HumaniBench [107] is a human-centered benchmark for MLLMs, covering dimensions such as fairness, ethics, and linguistic inclusivity, and is well-suited for open-set evaluation of social alignment.
However, open-ended evaluation also presents several challenges. On one hand, it relies on subjective criteria or proxy models, potentially introducing bias and affecting result accuracy. On the other hand, human annotation is resource-intensive, while automated evaluation methods still require further validation to ensure robustness and reliability.
Together, the task taxonomy and evaluation methods provide a structured lens through which the capabilities of MLLMs can be assessed. By linking different categories of tasks with appropriate evaluation protocols, they establish a consistent basis for analyzing progress and identifying areas that require further improvement.

5. Key Challenges in MLLM Research

Despite the remarkable progress in research and practical applications of MLLMs, several key challenges remain unresolved. These challenges significantly affect the further enhancement of model performance, broader application deployment, and the sustainable development of the underlying technology. Many of the challenges faced by MLLMs—ranging from hallucination to data noise and ethical risks—can be understood through the lens of representational and functional asymmetries across modalities, model components, and learning signals. The following sections examine these key challenges in detail.

5.1. Multimodal Hallucination

Multimodal hallucination is a prominent issue that severely impacts the reliability and accuracy of model outputs in real-world applications. According to current studies, multimodal hallucination can be broadly categorized into three types:
  • Existence hallucination: the most fundamental type, where the model incorrectly claims the presence of objects that do not actually exist in the image, leading to fundamental misinterpretations.
  • Attribute hallucination: This refers to incorrect descriptions of object attributes, such as misidentifying the color of a dog. Since attribute descriptions are based on object existence, this type is often closely related to existence hallucination.
  • Relation hallucination: a more complex type where the model misrepresents relationships between objects—such as relative positions or interactions—misleading the understanding of spatial and contextual relationships within a scene.
Multimodal hallucination often arises from asymmetric learning signals across modalities, where dominant modalities override weaker ones. Mitigating such an imbalance through symmetry-aware learning objectives may reduce hallucinated outputs and improve semantic fidelity. To mitigate hallucinations, strategies have been proposed across three stages:
  • Pre-generation mitigation: This involves fine-tuning with curated data. For instance, LLaVA-RLHF [108] leverages human preference data and reinforcement learning to reduce hallucinations and generate responses that better match human expectations. LRV-Instruction [109] introduces a fine-grained visual instruction tuning dataset with negative instructions to guide models toward image-faithful outputs.
  • In-process correction: Architectural design or feature representations are modified to reduce hallucinations. For example, VCD [110] attributes hallucinations to dataset bias and strong language priors, proposing a decode-then-contrast design. HACL [111] explores visual and language embedding spaces using contrastive learning to pull aligned cross-modal pairs closer and push apart hallucinated representations.
  • Post-generation correction: Methods like Woodpecker provide hallucination correction after the output is generated. Woodpecker is a training-free, general hallucination correction framework that integrates expert models to supplement image context and employs a stepwise correction pipeline with intermediate result inspection.
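The in-process idea of contrasting two decoding passes can be sketched at the level of a single next-token distribution. The sketch assumes a contrastive-decoding rule of the form (1 + alpha) * logits_original - alpha * logits_distorted, combined with a plausibility cutoff relative to the most likely token under the original input. The parameter names `alpha` and `plausibility`, their defaults, and the toy logits are assumptions for illustration and should not be read as the exact procedure of any cited method.

```python
import math

def softmax(logits):
    """Numerically stable softmax; -inf entries receive probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_next_token_probs(logits_original, logits_distorted,
                                 alpha=1.0, plausibility=0.1):
    """Contrast logits from the clean image against a distorted image.

    Tokens that stay likely even when the image is distorted are assumed
    to be driven by language priors and are down-weighted.
    """
    p_orig = softmax(logits_original)
    cutoff = plausibility * max(p_orig)
    contrasted = [
        (1.0 + alpha) * lo - alpha * ld
        for lo, ld in zip(logits_original, logits_distorted)
    ]
    # Keep only tokens that were plausible under the original input.
    masked = [
        c if p >= cutoff else float("-inf")
        for c, p in zip(contrasted, p_orig)
    ]
    return softmax(masked)
```

With toy logits in which one token depends on the image and another is favored by the language prior alone, the contrasted distribution shifts probability mass toward the image-grounded token.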

5.2. Challenges of Noisy Data

Automatically collected image–text pairs from the internet often contain various forms of noise, significantly hindering model training and generalization. Without human curation, textual descriptions may not semantically align with corresponding images, resulting in mislabeled or inaccurate annotations that provide misleading signals during training.
Noise and inconsistencies in multimodal data can amplify asymmetries in representation learning, leading to biased or unstable predictions. Encouraging structural or distributional symmetry during training may help improve model resilience to noise. To counter noisy data, researchers have developed several techniques. For instance, ALBEF [112] improves robustness to noisy web data using momentum distillation, where a “momentum model” generates pseudo-targets as additional supervision, thereby reducing reliance on noisy labels. BLIP incorporates a captioning and filtering module that generates synthetic captions and filters them based on image–text matching quality, extracting high-quality pairs from large-scale web data. MVCD [107] introduces a training-free, inference-time contrastive decoding strategy that compares output distributions between original and perturbed visual inputs, enhancing the robustness of multimodal reasoning under noisy visual data and demonstrating a practical approach to tackling input noise.
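The momentum distillation idea can be sketched with two small helpers: an exponential moving average that maintains the momentum model, and a mixing step that blends the possibly noisy one-hot label with the momentum model's pseudo-targets. The momentum coefficient `m`, the mixing weight `alpha`, and the function names are illustrative assumptions; training against the mixed target matches a weighted sum of cross-entropy and KL terms only up to a constant that does not depend on the online model.

```python
import math

def ema_update(momentum_params, online_params, m=0.995):
    """Move the momentum model slowly toward the online model."""
    return [m * pm + (1.0 - m) * po
            for pm, po in zip(momentum_params, online_params)]

def distilled_targets(one_hot, pseudo_targets, alpha=0.4):
    """Blend the web label with the momentum model's soft prediction."""
    return [(1.0 - alpha) * y + alpha * q
            for y, q in zip(one_hot, pseudo_targets)]

def cross_entropy(targets, probs):
    """Cross-entropy between a target distribution and predicted probs.

    Assumes probs are strictly positive wherever targets are non-zero.
    """
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0.0)
```

When a web label disagrees with a confident and plausible prediction, the mixed target penalizes the online model less than the raw label would, which is the mechanism by which reliance on noisy labels is reduced.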

5.3. Computational and Storage Costs

The development of MLLMs demands extensive computational resources and storage, presenting a major bottleneck for widespread deployment. Traditional vision-language pretraining approaches rely on convolutional neural networks (CNNs) like ResNet and Faster R-CNN [113], which incur high computational costs for region extraction and other steps.
Imbalanced architectural complexity or processing pathways across modalities can lead to inefficient computation and scaling bottlenecks. Designing symmetric or unified processing blocks may reduce redundancy and facilitate more efficient parameter sharing. To reduce these costs, several innovations have emerged. Vision Transformer divides images into patches and uses transformers for processing, avoiding convolutions and improving efficiency. ViLT further simplifies the architecture by removing CNNs and region features, instead using simple linear projections for patch embedding and fusion with text, achieving strong downstream task performance with reduced cost. ALBEF splits its text-side transformer into two parts: the first processes unimodal text without cross-attention, reducing complexity; the second performs cross-modal attention over the image encoder's outputs.
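The patch-based processing mentioned above can be illustrated with a short sketch that splits a single-channel image into non-overlapping patches and applies a linear projection to each flattened patch. Nested Python lists stand in for tensors, and the function names and the requirement that the image size be divisible by the patch size are simplifying assumptions.

```python
def num_patches(height, width, patch):
    """Number of non-overlapping patch tokens for an image."""
    return (height // patch) * (width // patch)

def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patches.

    Patches are emitted in row-major order; each has patch * patch values.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([
                image[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ])
    return patches

def linear_project(patches, weight):
    """Project each flattened patch with a (patch*patch) x d weight matrix."""
    dim = len(weight[0])
    return [
        [sum(p[k] * weight[k][j] for k in range(len(p))) for j in range(dim)]
        for p in patches
    ]
```

For example, a 224 x 224 image with 16 x 16 patches yields 196 tokens, so the visual front end reduces to a single matrix multiplication per patch instead of a convolutional backbone with region proposals.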

5.4. Safety and Ethics

As MLLMs are increasingly applied in critical domains, they face growing risks of adversarial attacks. Malicious inputs, such as adversarial samples, can disrupt model performance and lead to incorrect or harmful outputs, undermining decision-making reliability.
Moreover, these models may absorb various societal biases from training data—such as those related to gender, race, or culture. Studies have shown that web-scraped data can lead to models generating inappropriate or biased content. Although efforts have been made in text-to-image generation to mitigate such biases [114,115], further exploration is necessary for MLLMs.
These biases can manifest in model outputs, resulting in unfair treatment of certain groups and violating ethical principles. For example, facial recognition performance may vary across racial groups, and text generation may contain gender stereotypes. Ethical risks often stem from asymmetric treatment of modalities, tasks, or user groups during training and deployment. Incorporating symmetry-aware fairness constraints—such as balanced representation, response parity, or equal risk exposure—may contribute to safer and more equitable MLLM systems.
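One simple way to quantify the kind of disparity described here is to compare a performance metric across user groups. The sketch below computes per-group accuracy and the largest gap between any two groups. The record format, the group labels, and the choice of accuracy as the metric are illustrative assumptions and not a prescribed fairness standard.

```python
from collections import defaultdict

def group_accuracy(records):
    """Accuracy per group from (group, is_correct) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        if is_correct:
            correct[group] += 1
    return {g: correct[g] / totals[g] for g in totals}

def parity_gap(per_group):
    """Largest difference in the metric between any two groups."""
    values = list(per_group.values())
    return max(values) - min(values)
```

A gap close to zero indicates that the measured behavior is balanced across the groups considered; it says nothing about groups or failure modes that were not measured.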
These challenges underscore not only the technical limitations of current MLLMs but also the broader concerns of reliability, efficiency, and ethical responsibility. Recognizing and systematically addressing these issues will lay the foundation for sustainable progress in the field.

6. Future Research Directions for MLLMs

Building on the significant progress achieved thus far, future research in MLLMs holds immense potential to overcome existing technical bottlenecks, expand application boundaries, and advance towards more efficient, intelligent, and practical systems. Addressing the limitations of current MLLMs will increasingly require principled frameworks that go beyond empirical scaling. Among these, symmetry, conceived as a structural and functional balance across modalities, temporal sequences, and representational layers, offers a compelling lens through which to guide future research. Symmetry-informed approaches have the potential to enhance robustness, coherence, and generalization in multimodal reasoning by enforcing consistency across modalities, aligning semantic structures, and balancing information flow. In the following, we outline four directions where symmetry-aware principles may play an important role in shaping the next generation of MLLMs.

6.1. Enhancing Long-Range Cross-Modal Reasoning

Long-range cross-modal reasoning poses unique challenges in maintaining coherent semantic dependencies over extended sequences. Symmetry principles—particularly those promoting temporal and structural alignment across modalities—may support the development of models capable of sustaining consistent reasoning across diverse inputs and time spans. Traditional vision-language models have laid the foundation for dynamic reasoning through explorations in modality fusion mechanisms. For instance, VLMO proposes a hybrid modality expert architecture that combines modality-specific experts with shared attention, supporting task-driven encoding mode switching. BEiT V3 achieves modality-adaptive encoding under a unified architecture, effectively addressing issues such as manual task format conversion and inefficient parameter sharing across modalities.
DeepSeek-R1 extended its hybrid expert routing mechanism to the multimodal domain. Its reinforcement learning-based dynamic expert selection strategy improves cross-modal resource allocation and demonstrates the effectiveness of dynamically adjusting modality weights in mathematical reasoning tasks. Grok-3 reinforces the stability of modality interaction through ultra-large-scale training.
Furthermore, multimodal in-context learning expands upon traditional in-context learning by integrating multimodal inputs such as images and texts. This enables inference with minimal annotation and promotes reasoning transfer using only a few examples, with performance heavily influenced by example quantity and order. Multimodal chain-of-thought reasoning mimics human step-by-step reasoning by generating reasoning steps and final answers for a given task. It enables complex multimodal reasoning by combining image interpretation with textual understanding, supporting decomposition and solution of intricate tasks, and significantly enhancing the model’s reasoning capabilities in complex multimodal scenarios.

6.2. Finer-Grained Cross-Modal Interaction

To enhance user–agent interaction, research increasingly focuses on supporting finer-grained multimodal inputs and outputs. Achieving fine-grained multimodal understanding requires tightly coupled interactions between modalities at multiple levels of abstraction. Symmetry-guided mechanisms, such as bidirectional attention balancing and shared representational mapping, offer promising means to enhance the precision and reciprocity of such interactions.
On the input side, Ferret [116] introduces a hybrid representation framework that supports diverse forms such as points, bounding boxes, and sketches to improve fine-grained object grounding. Osprey integrates a pretrained segmentation model to enable precise part-level localization with a single click, overcoming the limitations of pixel-level interaction.
On the output side, LISA [117] achieves mask-level visual understanding and reasoning, enabling pixel-wise semantic segmentation tasks for more precise and accurate model outputs.
Recent VLA models such as Helix [118], GR00T N1 [119], and Gemini Robotics [120] demonstrate fine-grained cross-modal interaction through unified architectures. By integrating perception, instruction, and control via cross-modal attention and aligned representations, they exemplify precise coordination across semantic and action layers. Finer-grained interaction across modalities is expected to be a key future direction. By exploring improved alignment strategies, interaction mechanisms, and fusion techniques, the performance and applicability of MLLMs can be further enhanced.

6.3. New Application Paradigms

As MLLMs are increasingly deployed in open-ended and dynamic scenarios, ensuring stable adaptation across novel tasks and modalities becomes essential. Here, symmetry-aware adaptation strategies can help maintain functional balance, reduce modality dominance, and support consistent performance across shifting application contexts. In graphical user interface interactions, CogAgent [121] supports GUI element parsing for automating app operations. Domain-specific adaptation is another prominent trend. In document understanding, leveraging structural data without OCR improves table recognition accuracy. In medical imaging, LLaVA-Med incorporates medical knowledge graphs and achieves 89% clinical consistency on radiological VQA tasks, demonstrating the potential of MLLMs in domain-specific applications.

6.4. Expansion Toward Omni-MLLMs

Omni-MLLMs represent a major extension of MLLMs in terms of modality breadth. Unlike earlier models that mainly focused on text–image tasks, Omni-MLLMs integrate a wider range of modalities within a unified framework, including speech [122], video, 3D environments, and IMU sensor signals [123]. These models also exhibit fine-grained cross-modal interaction abilities: ImageBind-LLM [124] supports joint reasoning across images, audio, and 3D data, and CoDi-2 [125] extends the capability to “Any-to-Any” generation by producing both audio and image outputs from interleaved multimodal contexts. At the same time, the application scope of Omni-MLLMs has been significantly broadened, ranging from real-time multimodal dialogue [126] to world simulation [127] and multi-sensor autonomous driving [128]. In addition to open-source models, several closed-source systems have also pushed forward omni-modal capabilities in both research and industry. Nevertheless, despite the rapid emergence of Omni-MLLMs, a systematic evaluation framework, standardized data construction methods, and a deeper understanding of cross-modal generalization remain underexplored, underscoring important directions for future research.
Overall, future research on MLLMs is converging toward more general, fine-grained, and versatile systems that can operate across diverse modalities and tasks. Such directions highlight the field’s ambition to move closer to robust and truly universal multimodal intelligence.

7. Conclusions

This survey has reviewed the landscape of MLLMs, covering their architectures, training strategies, evaluation methodologies, and task classifications. We highlighted major challenges—such as multimodal hallucination, noisy data, computational costs, and safety concerns—and connected them to open research opportunities. Future directions point to enhancing long-range reasoning, enabling finer-grained cross-modal interaction, exploring new application paradigms, and expanding toward Omni-MLLMs. By aligning current limitations with these emerging trends, this work not only consolidates the present state of MLLM research but also outlines concrete paths toward more reliable, efficient, and generalizable multimodal intelligence.

Author Contributions

Investigation, X.L.; data curation, X.L.; writing—original draft preparation, X.L.; writing—review and editing, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLMs: Large Language Models
MLLMs: Multimodal Large Language Models
MLPs: Multi-layer perceptrons
PEFT: Parameter-Efficient Fine-Tuning
LoRA: Low-Rank Adaptation
RLHF: Reinforcement Learning from Human Feedback
DPO: Direct Preference Optimization
GRPO: Group Relative Policy Optimization
SFT: Supervised Fine-Tuning
VQA: Visual Question Answering

Appendix A. Specifications Table

Subject: Computer Science
Specific subject area: Artificial intelligence
Type of data: Figures (processed)
Data collection: All data used in this review were manually extracted from peer-reviewed literature, open-access research articles, and official technical reports. Figures were compiled and reorganized by the authors for clarity and comparative purposes. No experimental instruments were involved.
Data source location: The data were collected from publicly available literature sources and open-access repositories; no new data were generated by the authors.
Data accessibility: No new raw data were generated in this study. All data used are available in the original publications cited in the References section.
  Repository name: not applicable
  Data identification number: not applicable
  Direct URL to data: not applicable
  Instructions for accessing these data: refer to cited articles in the References section.
Related research article: None
Previously published dataset or data descriptor: This review did not reuse any previously published dataset. All referenced datasets are described and cited in the original works reviewed in this article.

References

  1. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Vienna, Austria, 1 July 2021; pp. 8748–8763. [Google Scholar]
  2. Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M. Flamingo: A Visual Language Model for Few-Shot Learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
  3. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; Duerig, T. Scaling up Visual and Vision-Language Representation Learning with Noisy Text Supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Vienna, Austria, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
  4. Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; Yang, H. Ofa: Unifying Architectures, Tasks, and Modalities through a Simple Sequence-to-Sequence Learning Framework. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 25–27 July 2022; pp. 23318–23340. [Google Scholar]
  5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  6. Bao, H.; Wang, W.; Dong, L.; Liu, Q.; Mohammed, O.K.; Aggarwal, K.; Som, S.; Piao, S.; Wei, F. Vlmo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts. Adv. Neural Inf. Process. Syst. 2022, 35, 32897–32912. [Google Scholar]
  7. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X. Deepseek-R1: Incentivizing Reasoning Capability in Llms via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
  8. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437. [Google Scholar]
  9. Zhang, B.; Li, K.; Cheng, Z.; Hu, Z.; Yuan, Y.; Chen, G.; Leng, S.; Jiang, Y.; Zhang, H.; Li, X.; et al. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding 2025. arXiv 2025, arXiv:2501.13106. [Google Scholar]
  10. Fu, C.; Lin, H.; Wang, X.; Zhang, Y.-F.; Shen, Y.; Liu, X.; Cao, H.; Long, Z.; Gao, H.; Li, K.; et al. VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction. arXiv 2025, arXiv:2501.01957. [Google Scholar]
  11. Kim, W.; Son, B.; Kim, I. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual Event, 18–24 July 2021; pp. 5583–5594. [Google Scholar]
  12. Kan, K.B.; Mun, H.; Cao, G.; Lee, Y. Mobile-Llama: Instruction Fine-Tuning Open-Source Llm for Network Analysis in 5g Networks. IEEE Netw. 2024, 38, 76–83. [Google Scholar] [CrossRef]
  13. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar]
  14. Yuan, Y.; Li, W.; Liu, J.; Tang, D.; Luo, X.; Qin, C.; Zhang, L.; Zhu, J. Osprey: Pixel Understanding with Visual Instruction Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28202–28211. [Google Scholar]
  15. Cherti, M.; Beaumont, R.; Wightman, R.; Wortsman, M.; Ilharco, G.; Gordon, C.; Schuhmann, C.; Schmidt, L.; Jitsev, J. Reproducible Scaling Laws for Contrastive Language-Image Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2818–2829. [Google Scholar]
  16. Li, Z.; Yang, B.; Liu, Q.; Ma, Z.; Zhang, S.; Yang, J.; Sun, Y.; Liu, Y.; Bai, X. Monkey: Image Resolution and Text Label Are Important Things for Large Multi-Modal Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26763–26773. [Google Scholar]
  17. Deshmukh, S.; Elizalde, B.; Singh, R.; Wang, H. Pengi: An Audio Language Model for Audio Tasks. Adv. Neural Inf. Process. Syst. 2023, 36, 18090–18108. [Google Scholar]
  18. Elizalde, B.; Deshmukh, S.; Al Ismail, M.; Wang, H. CLAP: Learning Audio Concepts from Natural Language Supervision. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023), Rhodes, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
  19. Girdhar, R.; El-Nouby, A.; Liu, Z.; Singh, M.; Alwala, K.V.; Joulin, A.; Misra, I. ImageBind: One Embedding Space to Bind Them All. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15180–15190. [Google Scholar]
  20. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  21. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S. Scaling Instruction-Finetuned Language Models. J. Mach. Learn. Res. 2024, 25, 1–53. [Google Scholar]
  22. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
  23. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  24. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  25. Zhang, R.; Han, J.; Liu, C.; Zhou, A.; Lu, P.; Qiao, Y.; Li, H.; Gao, P. LLaMA-Adapter: Efficient Fine-Tuning of Large Language Models with Zero-Initialized Attention. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  26. Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved Baselines with Visual Instruction Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26296–26306. [Google Scholar]
  27. Wang, W.; Lv, Q.; Yu, W.; Hong, W.; Qi, J.; Wang, Y.; Ji, J.; Yang, Z.; Zhao, L.; Song, X. CogVLM: Visual Expert for Pretrained Language Models. Adv. Neural Inf. Process. Syst. 2024, 37, 121475–121499. [Google Scholar]
  28. Chen, S.; Chen, X.; Zhang, C.; Li, M.; Yu, G.; Fei, H.; Zhu, H.; Fan, J.; Chen, T. LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 26428–26438. [Google Scholar]
  29. Lin, J.; Yin, H.; Ping, W.; Molchanov, P.; Shoeybi, M.; Han, S. VILA: On Pre-training for Visual Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 26689–26699. [Google Scholar]
  30. Chen, G.; Shen, L.; Shao, R.; Deng, X.; Nie, L. LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 26540–26550. [Google Scholar]
  31. Hu, W.; Xu, Y.; Li, Y.; Li, W.; Chen, Z.; Tu, Z. BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2024), Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 2256–2264. [Google Scholar]
  32. Ye, Q.; Xu, H.; Ye, J.; Yan, M.; Hu, A.; Liu, H.; Qian, Q.; Zhang, J.; Huang, F. mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 13040–13051. [Google Scholar]
  33. Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar]
  34. Jaegle, A.; Gimeno, F.; Brock, A.; Vinyals, O.; Zisserman, A.; Carreira, J. Perceiver: General Perception with Iterative Attention. In Proceedings of the International Conference on Machine Learning (ICML 2021), Virtual Event, 18–24 July 2021; Volume 139, pp. 4651–4664. [Google Scholar]
  35. Chen, D.; Liu, J.; Dai, W.; Wang, B. Visual Instruction Tuning with Polite Flamingo. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2024), Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 17745–17753. [Google Scholar]
  36. Laurençon, H.; Saulnier, L.; Tronchon, L.; Bekman, S.; Singh, A.; Lozhkov, A.; Wang, T.; Karamcheti, S.; Rush, A.; Kiela, D. OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents. Adv. Neural Inf. Process. Syst. 2023, 36, 71683–71702. [Google Scholar]
  37. Li, K.; He, Y.; Wang, Y.; Li, Y.; Wang, W.; Luo, P.; Wang, Y.; Wang, L.; Qiao, Y. VideoChat: Chat-Centric Video Understanding. arXiv 2024, arXiv:2305.06355. [Google Scholar]
  38. Van Gansbeke, W.; Vandenhende, S.; Georgoulis, S.; Proesmans, M.; Van Gool, L. SCAN: Learning to Classify Images Without Labels. In Computer Vision—ECCV 2020; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; pp. 268–285. ISBN 978-3-030-58606-5. [Google Scholar]
  39. Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-Text Dataset For Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 1, pp. 2556–2565. [Google Scholar]
  40. Perez, E.; Strub, F.; de Vries, H.; Dumoulin, V.; Courville, A. FiLM: Visual Reasoning with a General Conditioning Layer. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2018), New Orleans, LA, USA, 2–7 February 2018; Volume 32, pp. 3942–3951. [Google Scholar]
  41. Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: UNiversal Image-TExt Representation Learning. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12375, pp. 104–120. ISBN 978-3-030-58576-1. [Google Scholar]
  42. Huang, Z.; Zeng, Z.; Liu, B.; Fu, D.; Fu, J. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. arXiv 2020, arXiv:2004.00849. [Google Scholar]
  43. Ye, Q.; Xu, H.; Xu, G.; Ye, J.; Yan, M.; Zhou, Y.; Wang, J.; Hu, A.; Shi, P.; Shi, Y.; et al. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arXiv 2024, arXiv:2304.14178. [Google Scholar]
  44. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv 2023, arXiv:2304.10592. [Google Scholar]
  45. Dai, W.; Li, J.; Li, D.; Tiong, A.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 49250–49267. [Google Scholar]
  46. Li, B.; Zhang, Y.; Chen, L.; Wang, J.; Pu, F.; Cahyono, J.A.; Yang, J.; Li, C.; Liu, Z. Otter: A Multi-Modal Model with in-Context Instruction Tuning. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 7543–7557. [Google Scholar] [CrossRef]
  47. Chen, K.; Zhang, Z.; Zeng, W.; Zhang, R.; Zhu, F.; Zhao, R. Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic. arXiv 2023, arXiv:2306.15195. [Google Scholar]
  48. Chen, J.; Zhu, D.; Shen, X.; Li, X.; Liu, Z.; Zhang, P.; Krishnamoorthi, R.; Chandra, V.; Xiong, Y.; Elhoseiny, M. MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning. arXiv 2023, arXiv:2310.09478. [Google Scholar]
  49. Chen, X.; Djolonga, J.; Padlewski, P.; Mustafa, B.; Changpinyo, S.; Wu, J.; Ruiz, C.R.; Goodman, S.; Wang, X.; Tay, Y.; et al. PaLI-X: On Scaling up a Multilingual Vision and Language Model. arXiv 2023, arXiv:2305.18565. [Google Scholar]
  50. Zhao, L.; Yu, E.; Ge, Z.; Yang, J.; Wei, H.; Zhou, H.; Sun, J.; Peng, Y.; Dong, R.; Han, C.; et al. ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning. arXiv 2023, arXiv:2307.09474. [Google Scholar] [CrossRef]
  51. Zheng, K.; He, X.; Wang, X.E. MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens. arXiv 2023, arXiv:2310.02239. [Google Scholar]
  52. Jin, Y.; Xu, K.; Xu, K.; Chen, L.; Liao, C.; Tan, J.; Huang, Q.; Chen, B.; Lei, C.; Liu, A.; et al. Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization. arXiv 2023, arXiv:2309.04669. [Google Scholar]
  53. Koh, J.Y.; Fried, D.; Salakhutdinov, R.R. Generating Images with Multimodal Language Models. Adv. Neural Inf. Process. Syst. 2023, 36, 21487–21506. [Google Scholar]
  54. Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. Adv. Neural Inf. Process. Syst. 2023, 36, 28541–28564. [Google Scholar]
  55. Wang, W.; Chen, Z.; Chen, X.; Wu, J.; Zhu, X.; Zeng, G.; Luo, P.; Lu, T.; Zhou, J.; Qiao, Y. VisionLLM: Large Language Model Is Also an Open-Ended Decoder for Vision-Centric Tasks. Adv. Neural Inf. Process. Syst. 2023, 36, 61501–61513. [Google Scholar]
  56. Koh, J.Y.; Salakhutdinov, R.; Fried, D. Grounding Language Models to Images for Multimodal Inputs and Outputs. In Proceedings of the International Conference on Machine Learning (ICML 2023), Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 17283–17300. [Google Scholar]
  57. Wang, W.; Bao, H.; Dong, L.; Bjorck, J.; Peng, Z.; Liu, Q.; Aggarwal, K.; Mohammed, O.K.; Singhal, S.; Som, S.; et al. Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 18–22 June 2023; pp. 19175–19186. [Google Scholar]
  58. Chen, L.; Li, J.; Dong, X.; Zhang, P.; He, C.; Wang, J.; Zhao, F.; Lin, D. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. In Lecture Notes in Computer Science; Springer Nature Switzerland: Cham, Switzerland, 2025; pp. 370–387. ISBN 978-3-031-72642-2. [Google Scholar]
  59. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar]
  60. Karpathy, A.; Fei-Fei, L. Deep Visual-Semantic Alignments for Generating Image Descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; pp. 3128–3137. [Google Scholar]
  61. Huynh, N.D.; Bouadjenek, M.R.; Razzak, I.; Hacid, H.; Aryal, S. SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation. arXiv 2025, arXiv:2503.24164. [Google Scholar]
  62. Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N.A.; Khashabi, D.; Hajishirzi, H. Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv 2023, arXiv:2212.10560. [Google Scholar]
  63. Zeng, Y.; Zhang, H.; Zheng, J.; Xia, J.; Wei, G.; Wei, Y.; Zhang, Y.; Kong, T.; Song, R. What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2024), Mexico City, Mexico, 16–21 June 2024; Volume 1, pp. 7930–7957. [Google Scholar]
  64. Yue, T.; Guo, L.; Tang, Y.; Zhao, Z.; Zhu, X.; Huang, H.; Liu, J. LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation. arXiv 2025, arXiv:2506.16691. [Google Scholar]
  65. Chen, P.-Y. Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2024), Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 22584–22591. [Google Scholar]
  66. Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT Understands, Too. AI Open 2024, 5, 208–215. [Google Scholar] [CrossRef]
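  67. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the Tenth International Conference on Learning Representations (ICLR 2022), Virtual Event, 25–29 April 2022. [Google Scholar]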
  68. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. Adv. Neural Inf. Process. Syst. 2023, 36, 10088–10115. [Google Scholar]
  69. Sun, Z.; Shen, S.; Cao, S.; Liu, H.; Li, C.; Shen, Y.; Gan, C.; Gui, L.-Y.; Wang, Y.-X.; Yang, Y.; et al. Aligning Large Multimodal Models with Factually Augmented RLHF. arXiv 2023, arXiv:2309.14525. [Google Scholar] [CrossRef]
  70. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  71. Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 16–20 June 2019; pp. 3195–3204. [Google Scholar]
  72. Zellers, R.; Bisk, Y.; Farhadi, A.; Choi, Y. From Recognition to Cognition: Visual Commonsense Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 16–20 June 2019; pp. 6720–6731. [Google Scholar]
  73. Masry, A.; Long, D.X.; Tan, J.Q.; Joty, S.; Hoque, E. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. arXiv 2022, arXiv:2203.10244. [Google Scholar] [CrossRef]
  74. Liu, Y.; Duan, H.; Zhang, Y.; Li, B.; Zhang, S.; Zhao, W.; Yuan, Y.; Wang, J.; He, C.; Liu, Z.; et al. MMBench: Is Your Multi-Modal Model an All-Around Player? In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science; Springer Nature Switzerland: Cham, Switzerland, 2025; Volume 15064, pp. 216–233. ISBN 978-3-031-72657-6. [Google Scholar]
  75. Chen, L.; Li, J.; Dong, X.; Zhang, P.; Zang, Y.; Chen, Z.; Duan, H.; Wang, J.; Qiao, Y.; Lin, D. Are We on the Right Way for Evaluating Large Vision-Language Models? Adv. Neural Inf. Process. Syst. 2024, 37, 27056–27087. [Google Scholar]
  76. Yue, X.; Ni, Y.; Zhang, K.; Zheng, T.; Liu, R.; Zhang, G.; Stevens, S.; Jiang, D.; Ren, W.; Sun, Y.; et al. MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 9556–9567. [Google Scholar]
  77. Lu, P.; Bansal, H.; Xia, T.; Liu, J.; Li, C.; Hajishirzi, H.; Cheng, H.; Chang, K.-W.; Galley, M.; Gao, J. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. arXiv 2024, arXiv:2310.02255. [Google Scholar] [CrossRef]
  78. Yu, W.; Yang, Z.; Li, L.; Wang, J.; Lin, K.; Liu, Z.; Wang, X.; Wang, L. MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. arXiv 2024, arXiv:2308.02490. [Google Scholar]
  79. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA, 14–19 June 2020; pp. 9726–9735. [Google Scholar]
  80. Cho, J.; Zala, A.; Bansal, M. DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2–6 October 2023; pp. 3043–3054. [Google Scholar]
  81. Bakr, E.M.; Sun, P.; Shen, X.; Khan, F.F.; Li, L.E.; Elhoseiny, M. HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2–6 October 2023; pp. 20041–20053. [Google Scholar]
  82. Ghosh, D.; Hajishirzi, H.; Schmidt, L. GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment. Adv. Neural Inf. Process. Syst. 2023, 36, 52132–52152. [Google Scholar]
  83. Hu, X.; Wang, R.; Fang, Y.; Fu, B.; Cheng, P.; Yu, G. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment. arXiv 2024, arXiv:2403.05135. [Google Scholar] [CrossRef]
  84. Wang, S.; Saharia, C.; Montgomery, C.; Pont-Tuset, J.; Noy, S.; Pellegrini, S.; Onoe, Y.; Laszlo, S.; Fleet, D.J.; Soricut, R. Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 18–22 June 2023; pp. 18359–18369. [Google Scholar]
  85. Wu, X.; Yu, D.; Huang, Y.; Russakovsky, O.; Arora, S. ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty. Adv. Neural Inf. Process. Syst. 2024, 37, 86004–86047. [Google Scholar]
  86. Sheynin, S.; Polyak, A.; Singer, U.; Kirstain, Y.; Zohar, A.; Ashual, O.; Parikh, D.; Taigman, Y. Emu Edit: Precise Image Editing via Recognition and Generation Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 8871–8879. [Google Scholar]
  87. Basu, S.; Saberi, M.; Bhardwaj, S.; Chegini, A.M.; Massiceti, D.; Sanjabi, M.; Hu, S.X.; Feizi, S. EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods. arXiv 2023, arXiv:2310.02426. [Google Scholar]
  88. Huang, Y.; Xie, L.; Wang, X.; Yuan, Z.; Cun, X.; Ge, Y.; Zhou, J.; Dong, C.; Huang, R.; Zhang, R. SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 8362–8371. [Google Scholar]
  89. Yu, Q.; Chow, W.; Yue, Z.; Pan, K.; Wu, Y.; Wan, X.; Li, J.; Tang, S.; Zhang, H.; Zhuang, Y. AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR 2025), Nashville, TN, USA, 11–15 June 2025; pp. 26125–26135. [Google Scholar]
  90. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv 2017, arXiv:1706.08500. [Google Scholar]
  91. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved Techniques for Training GANs. arXiv 2016, arXiv:1606.03498. [Google Scholar] [CrossRef]
  92. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the International Conference on Machine Learning (ICML 2022), Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 12888–12900. [Google Scholar]
  93. Rasheed, H.; Maaz, M.; Shaji, S.; Shaker, A.; Khan, S.; Cholakkal, H.; Anwer, R.M.; Xing, E.; Yang, M.-H.; Khan, F.S. GLaMM: Pixel Grounding Large Multimodal Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 13009–13018. [Google Scholar]
  94. Pramanick, S.; Han, G.; Hou, R.; Nag, S.; Lim, S.-N.; Ballas, N.; Wang, Q.; Chellappa, R.; Almahairi, A. Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 14076–14088. [Google Scholar]
  95. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125. [Google Scholar] [CrossRef]
  96. Brooks, T.; Holynski, A.; Efros, A.A. InstructPix2Pix: Learning to Follow Image Editing Instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 18–22 June 2023; pp. 18392–18402. [Google Scholar]
  97. Kubota, K.; Togo, R.; Maeda, K.; Ogawa, T.; Haseyama, M. MLLM-Based Automatic Exploration of Editing Prompt for High Engagement Image Generation. In Proceedings of the 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE), Kitakyushu, Japan, 29 October–1 November 2024; IEEE: New York, NY, USA, 2024; pp. 1165–1166. [Google Scholar]
  98. Wu, S.; Fei, H.; Qu, L.; Ji, W.; Chua, T.-S. NExT-GPT: Any-to-Any Multimodal LLM. In Proceedings of the Forty-first International Conference on Machine Learning (ICML 2024), Vancouver, BC, Canada, 21–27 July 2024; Volume 235, pp. 53366–53397. [Google Scholar]
  99. Nie, D.; Trullo, R.; Lian, J.; Wang, L.; Petitjean, C.; Ruan, S.; Wang, Q.; Shen, D. Medical Image Synthesis with Deep Convolutional Adversarial Networks. IEEE Trans. Biomed. Eng. 2018, 65, 2720–2730. [Google Scholar] [CrossRef]
  100. Yoshino, K.; Wakimoto, K.; Nishimura, Y.; Nakamura, S. Caption Generation of Robot Behaviors Based on Unsupervised Learning of Action Segments. In Conversational Dialogue Systems for the Next Decade; Lecture Notes in Electrical Engineering; Springer: Singapore, 2021; pp. 227–241. ISBN 978-981-15-8394-0. [Google Scholar]
  101. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. CIDEr: Consensus-based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
  102. Li, B.; Ge, Y.; Ge, Y.; Wang, G.; Wang, R.; Zhang, R.; Shan, Y. Seed-Bench: Benchmarking Multimodal Large Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 13299–13308. [Google Scholar]
  103. Yao, R.; Zhang, B.; Huang, J.; Long, X.; Zhang, Y.; Zou, T.; Wu, Y.; Su, S.; Xu, Y.; Zeng, W.; et al. LENS: Multi-Level Evaluation of Multimodal Reasoning with Large Language Models. arXiv 2025, arXiv:2505.15616. [Google Scholar] [CrossRef]
  104. Li, Y.; Du, Y.; Zhou, K.; Wang, J.; Zhao, W.X.; Wen, J.-R. Evaluating Object Hallucination in Large Vision-Language Models. arXiv 2023, arXiv:2305.10355. [Google Scholar] [CrossRef]
  105. Yang, R.; Song, L.; Li, Y.; Zhao, S.; Ge, Y.; Li, X.; Shan, Y. GPT4Tools: Teaching Large Language Model to Use Tools via Self-Instruction. Adv. Neural Inf. Process. Syst. 2023, 36, 71995–72007. [Google Scholar]
  106. Yin, S.; Fu, C.; Zhao, S.; Xu, T.; Wang, H.; Sui, D.; Shen, Y.; Li, K.; Sun, X.; Chen, E. Woodpecker: Hallucination Correction for Multimodal Large Language Models. Sci. China Inf. Sci. 2024, 67, 220105. [Google Scholar] [CrossRef]
  107. Raza, S.; Narayanan, A.; Khazaie, V.R.; Vayani, A.; Chettiar, M.S.; Singh, A.; Shah, M.; Pandya, D. HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation. arXiv 2025, arXiv:2505.11454. [Google Scholar]
  108. Yu, T.; Yao, Y.; Zhang, H.; He, T.; Han, Y.; Cui, G.; Hu, J.; Liu, Z.; Zheng, H.-T.; Sun, M. RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-Grained Correctional Human Feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 13807–13816. [Google Scholar]
  109. Liu, F.; Lin, K.; Li, L.; Wang, J.; Yacoob, Y.; Wang, L. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning. arXiv 2023, arXiv:2306.14565. Available online: https://arxiv.org/abs/2306.14565 (accessed on 21 August 2025).
  110. Leng, S.; Zhang, H.; Chen, G.; Li, X.; Lu, S.; Miao, C.; Bing, L. Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 13872–13882. [Google Scholar]
  111. Jiang, C.; Xu, H.; Dong, M.; Chen, J.; Ye, W.; Yan, M.; Ye, Q.; Zhang, J.; Huang, F.; Zhang, S. Hallucination Augmented Contrastive Learning for Multimodal Large Language Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 27036–27046. [Google Scholar]
  112. Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705. [Google Scholar]
  113. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  114. Schramowski, P.; Brack, M.; Deiseroth, B.; Kersting, K. Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 18–22 June 2023; pp. 22522–22531. [Google Scholar]
  115. Poppi, S.; Poppi, T.; Cocchi, F.; Cornia, M.; Baraldi, L.; Cucchiara, R. Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models. In Computer Vision—ECCV 2024; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2025; pp. 340–356. ISBN 978-3-031-73667-4. [Google Scholar]
  116. You, H.; Zhang, H.; Gan, Z.; Du, X.; Zhang, B.; Wang, Z.; Cao, L.; Chang, S.-F.; Yang, Y. Ferret: Refer and Ground Anything Anywhere at Any Granularity. arXiv 2023, arXiv:2310.07704. [Google Scholar] [CrossRef]
  117. Lai, X.; Tian, Z.; Chen, Y.; Li, Y.; Yuan, Y.; Liu, S.; Jia, J. LISA: Reasoning Segmentation via Large Language Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 9579–9589. [Google Scholar]
  118. Figure AI. Helix: A Vision-Language-Action Model for Generalist Humanoid Control. Figure AI News 2024. Available online: https://www.figure.ai/news/helix (accessed on 21 August 2025).
  119. NVIDIA; Bjorck, J.; Castañeda, F.; Cherniadev, N.; Da, X.; Ding, R.; Fan, L.J.; Fang, Y.; Fox, D.; Hu, F.; et al. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv 2025, arXiv:2503.14734. [Google Scholar]
  120. Gemini Robotics Team; Abeyruwan, S.; Ainslie, J.; Alayrac, J.-B.; Arenas, M.G.; Armstrong, T.; Balakrishna, A.; Baruch, R.; Bauza, M.; Blokzijl, M.; et al. Gemini Robotics: Bringing AI into the Physical World. arXiv 2025, arXiv:2503.20020. [Google Scholar] [CrossRef]
  121. Hong, W.; Wang, W.; Lv, Q.; Xu, J.; Yu, W.; Ji, J.; Wang, Y.; Wang, Z.; Dong, Y.; Ding, M. CogAgent: A Visual Language Model for GUI Agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 14281–14290. [Google Scholar]
  122. Zhang, D.; Li, S.; Zhang, X.; Zhan, J.; Wang, P.; Zhou, Y.; Qiu, X. SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. arXiv 2023, arXiv:2305.11000. [Google Scholar]
  123. Su, Y.; Lan, T.; Li, H.; Xu, J.; Wang, Y.; Cai, D. PandaGPT: One Model To Instruction-Follow Them All. arXiv 2023, arXiv:2305.16355. [Google Scholar] [CrossRef]
  124. Han, J.; Zhang, R.; Shao, W.; Gao, P.; Xu, P.; Xiao, H.; Zhang, K.; Liu, C.; Wen, S.; Guo, Z.; et al. ImageBind-LLM: Multi-Modality Instruction Tuning. arXiv 2023, arXiv:2309.03905. [Google Scholar]
  125. Tang, Z.; Yang, Z.; Khademi, M.; Liu, Y.; Zhu, C.; Bansal, M. CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; pp. 27425–27434. [Google Scholar]
  126. Xie, Z.; Wu, C. Mini-Omni2: Towards Open-Source GPT-4o with Vision, Speech and Duplex Capabilities. arXiv 2024, arXiv:2410.11190. [Google Scholar]
  127. Ge, Z.; Huang, H.; Zhou, M.; Li, J.; Wang, G.; Tang, S.; Zhuang, Y. WorldGPT: Empowering LLM as Multimodal World Model. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; ACM: New York, NY, USA, 2024; pp. 7346–7355. [Google Scholar]
  128. Wang, W.; Xie, J.; Hu, C.; Zou, H.; Fan, J.; Tong, W.; Wen, Y.; Wu, S.; Deng, H.; Li, Z.; et al. DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving. arXiv 2023, arXiv:2312.09245. [Google Scholar]
Figure 1. The overall architecture of a representative MLLM.
Figure 2. Timeline of the development of representative MLLMs.
Figure 3. Classification of four prototypical MLLMs.
Figure 4. Training pipeline of DeepSeek-R1.
Figure 5. Radar chart of MLLMs’ performance across key evaluation dimensions.
Table 1. Comparison of MLLMs in architecture and capabilities.
Table 1. Comparison of MLLMs in architecture and capabilities.
| Model | Visual Encoder | Adapter | LLM Backbone | Input→Output (I→O) | Main Tasks |
|---|---|---|---|---|---|
| BLIP-2 [22] | EVA ViT-g | Q-Former | Flan-T5/OPT | I + T→T | VQA, Captioning, Retrieval |
| Flamingo [2] | NFNet-F6 | LLM MH-Attn | Chinchilla-1.4B/7B/70B | I + V + T→T | VQA, Captioning |
| LLaVA [35] | CLIP ViT-L | Linear | Vicuna-7B/13B | I + T→T | VQA, Captioning |
| mPLUG-Owl [43] | CLIP ViT-L | Q-Former | LLaMA-7B | I + T→T | Visual Dialogue, VQA |
| MiniGPT-4 [44] | EVA ViT-g | Linear | Vicuna-13B | I + T→T | VQA, Captioning |
| InstructBLIP [45] | EVA ViT-g | Q-Former | Flan-T5/Vicuna | I + V + T→T | VQA, Captioning |
| Otter [46] | CLIP ViT-L | LLM MH-Attn | LLaMA-7B | I + T→T | VQA, Captioning |
| Shikra [47] | CLIP ViT-L | Linear | Vicuna-7B/13B | I + T→T + I | VQA, Captioning, Referring |
| CogVLM [27] | EVA ViT-E | MLP | Vicuna-v1.5-7B | I + T→T | VQA, Captioning, REC |
| VILA [29] | CLIP ViT-L | Linear | LLaMA-2-7B/13B | I + T→T | VQA, Captioning |
| LLaVA-1.5 [26] | CLIP ViT-L | MLP | Vicuna-v1.5-7B/13B | I + T→T | VQA, Captioning |
| MiniGPT-v2 [48] | EVA ViT-g | Linear | LLaMA-2-Chat-7B | I + T→T | VQA, Captioning, Referring |
| PaLI-X [49] | ViT-22B | LLM MH-Attn | UL2-32B | I + T→T | Multilingual, VQA |
| ChatSpot [50] | CLIP ViT-L | Linear | Vicuna-7B/LLaMA | I + T→T | VQA, Captioning, Referring |
| MiniGPT-5 [51] | EVA ViT-g | Q-Former | Vicuna-7B | I + T→I + T | Image Generation |
| LaVIT [52] | EVA ViT-g | LLM MH-Attn | LLaMA-7B | I + T→I + T | Captioning, Image Generation |
| GILL [53] | CLIP ViT-L | Linear | OPT-6.7B | I + T→I + T | Retrieval, Image Generation |
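
To make the "Adapter" column of Table 1 more concrete, the following is a minimal sketch of what a "Linear" adapter does: it projects each visual-encoder patch feature into the embedding width of the LLM so that visual and text tokens can share one input sequence. The function name, the toy dimensions, and the choice to place visual tokens before text tokens are all illustrative assumptions; this is not the implementation of any model listed in the table.

```python
def linear_adapter(patch_features, weight, bias):
    """Project each patch feature (length d_vis) to length d_llm.

    weight has shape d_llm x d_vis and bias has length d_llm.
    Hypothetical helper for illustration only.
    """
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in patch_features
    ]


# Toy shapes (assumed): 2 image patches, d_vis = 3, d_llm = 2.
patches = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.5, -0.5]

visual_tokens = linear_adapter(patches, weight, bias)

# One already-embedded text token of width d_llm (made-up values).
text_tokens = [[0.1, 0.2]]

# Assumed ordering: visual tokens first, then text tokens.
llm_input = visual_tokens + text_tokens
print(visual_tokens)
print(len(llm_input))
```

An "MLP" adapter would, under the same assumptions, stack two such projections with a nonlinearity in between. The "Q-Former" and "LLM MH-Attn" entries rely on attention rather than a per-patch projection and are not covered by this sketch.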
Table 2. Summary of task taxonomy, benchmarks, and evaluation metrics for MLLMs.
| Task Taxonomy | Benchmark | Evaluation Metrics |
|---|---|---|
| Understanding Tasks | VQA [59], OK-VQA [71], VCR [72], ChartQA [73], VSR [13], MMBench [74], MMStar [75], MMMU [76], MathVista [77], MM-Vet [78] | Accuracy, Top-k Accuracy [79], Precision, F1 Score, mAP, NDCG |
| Generation Tasks | PaintSkills [80], HRS-Bench [81], GenEval [82], DPG-Bench [83], EditBench [84], ConceptMix [85], Emu-Edit [86], EditVal [87], Reason-Edit [88], AnyEdit [89] | CIDEr, SPICE, FID [90], IS [91], Relevance, Faithfulness, GPTScore |
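
As a rough illustration of two of the understanding-task metrics named in Table 2, the sketch below computes Top-k Accuracy and a set-based Precision/Recall/F1 on toy data. The function names and sample data are hypothetical, and individual benchmarks may define these metrics differently (for example, with answer normalization or soft matching), so this should be read as a generic formulation rather than any benchmark's official scorer.

```python
def top_k_accuracy(ranked_predictions, labels, k=1):
    """Fraction of samples whose true label is among the first k ranked predictions."""
    if not labels:
        raise ValueError("labels must be non-empty")
    hits = sum(1 for ranked, y in zip(ranked_predictions, labels) if y in ranked[:k])
    return hits / len(labels)


def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for a single sample."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (made up): each inner list is a model's answers, best guess first.
ranked = [["cat", "dog", "fox"], ["red", "blue", "green"], ["two", "three", "four"]]
labels = ["cat", "blue", "five"]

top1 = top_k_accuracy(ranked, labels, k=1)
top2 = top_k_accuracy(ranked, labels, k=2)
precision, recall, f1 = precision_recall_f1(["a", "b", "c"], ["b", "c", "d", "e"])
print(top1, top2, precision, recall, f1)
```

With k = 1 this reduces to plain accuracy, which is why Table 2 can list both under the same task family.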
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, X.; Liu, H. Symmetry-Aware Advances in Multimodal Large Language Models: Architectures, Training, and Evaluation. Symmetry 2025, 17, 1400. https://doi.org/10.3390/sym17091400

