
595 Results Found

  • Article
  • Open Access
9 Citations
5,238 Views
14 Pages

Using Augmented Small Multimodal Models to Guide Large Language Models for Multimodal Relation Extraction

  • Wentao He,
  • Hanjie Ma,
  • Shaohua Li,
  • Hui Dong,
  • Haixiang Zhang and
  • Jie Feng

10 November 2023

Multimodal Relation Extraction (MRE) is a core task for constructing Multimodal Knowledge Graphs (MKGs). Most current research is based on fine-tuning small-scale single-modal image and text pre-trained models, but we find that image-text datasets fr...

  • Article
  • Open Access
2 Citations
1,810 Views
22 Pages

Synthesizing Olfactory Understanding: Multimodal Language Models for Image–Text Smell Matching

  • Sergio Esteban-Romero,
  • Iván Martín-Fernández,
  • Manuel Gil-Martín and
  • Fernando Fernández-Martínez

18 August 2025

Olfactory information, crucial for human perception, is often underrepresented compared to visual and textual data. This work explores methods for understanding smell descriptions within a multimodal context, where scent information is conveyed indir...

  • Review
  • Open Access
65 Citations
36,010 Views
30 Pages

From Large Language Models to Large Multimodal Models: A Literature Review

  • Dawei Huang,
  • Chuan Yan,
  • Qing Li and
  • Xiaojiang Peng

11 June 2024

With the deepening of research on Large Language Models (LLMs), significant progress has been made in recent years on the development of Large Multimodal Models (LMMs), which are gradually moving toward Artificial General Intelligence. This paper aim...

  • Article
  • Open Access
8 Citations
5,246 Views
16 Pages

VL-Meta: Vision-Language Models for Multimodal Meta-Learning

  • Han Ma,
  • Baoyu Fan,
  • Benjamin K. Ng and
  • Chan-Tong Lam

16 January 2024

Multimodal learning is a promising area in artificial intelligence (AI) that can make the model understand different kinds of data. Existing works try to re-train a new model based on pre-trained models, which requires much data, computation pow...

  • Article
  • Open Access
  • 1 Citation
3,743 Views
22 Pages

Multimodal AI for UAV: Vision–Language Models in Human–Machine Collaboration

  • Maroš Krupáš,
  • Ľubomír Urblík and
  • Iveta Zolotová

6 September 2025

Recent advances in multimodal large language models (MLLMs)—particularly vision–language models (VLMs)—introduce new possibilities for integrating visual perception with natural-language understanding in human–machine collabo...

  • Review
  • Open Access
6,189 Views
26 Pages

28 August 2025

With the exponential growth of multimodal data, the limitations of traditional unimodal models in cross-modal understanding and complex scenario reasoning have become increasingly evident. Built upon the foundation of Large Language Models (LLMs), Mu...

  • Article
  • Open Access
7 Citations
4,622 Views
10 Pages

Multimodal Food Image Classification with Large Language Models

  • Jun-Hwa Kim,
  • Nam-Ho Kim,
  • Donghyeok Jo and
  • Chee Sun Won

20 November 2024

In this study, we leverage advancements in large language models (LLMs) for fine-grained food image classification. We achieve this by integrating textual features extracted from images using an LLM into a multimodal learning framework. Specifically,...

  • Article
  • Open Access
1,713 Views
27 Pages

18 November 2025

Large Language Models (LLMs) and Large Multimodal Models (LMMs) are significantly influencing scientific research, including archaeology—a discipline dealing with uniquely complex, multimodal data. This comprehensive review systematically examines re...

  • Article
  • Open Access
9 Citations
6,862 Views
17 Pages

The diffusion of Multimodal Large Language Models (MLLMs) has opened new research directions in the context of video content understanding and classification. Emotion recognition from videos aims to automatically detect human emotions such as anxiety...

  • Article
  • Open Access
3 Citations
6,119 Views
13 Pages

MLKGC: Large Language Models for Knowledge Graph Completion Under Multimodal Augmentation

  • Pengfei Yue,
  • Hailiang Tang,
  • Wanyu Li,
  • Wenxiao Zhang and
  • Bingjie Yan

29 April 2025

Knowledge graph completion (KGC) is a critical task for addressing the incompleteness of knowledge graphs and supporting downstream applications. However, it faces significant challenges, including insufficient structured information and uneven entit...

  • Review
  • Open Access
8 Citations
13,392 Views
27 Pages

12 February 2025

Large language models (LLMs) and large vision models (LVMs) have driven significant advancements in natural language processing (NLP) and computer vision (CV), establishing a foundation for multimodal large language models (MLLMs) to integrate divers...

  • Article
  • Open Access
12 Citations
11,991 Views
47 Pages

24 March 2025

The rapid development of large language models (LLMs) and multimodal large models (MLMs) has introduced transformative opportunities for autonomous driving systems. These advanced models provide robust support for the realization of more intelligent,...

  • Review
  • Open Access
115 Views
40 Pages

A Comparative Study of Emotion Recognition Systems: From Classical Approaches to Multimodal Large Language Models

  • Mirela-Magdalena Grosu (Marinescu),
  • Octaviana Datcu,
  • Ruxandra Tapu and
  • Bogdan Mocanu

27 January 2026

Emotion recognition in video (ERV) aims to infer human affect from visual, audio, and contextual signals and is increasingly important for interactive and intelligent systems. Over the past decade, ERV has evolved from handcrafted features and task-s...

  • Article
  • Open Access
6 Citations
4,943 Views
28 Pages

Multimodal AI and Large Language Models for Orthopantomography Radiology Report Generation and Q&A

  • Chirath Dasanayaka,
  • Kanishka Dandeniya,
  • Maheshi B. Dissanayake,
  • Chandira Gunasena and
  • Ruwan Jayasinghe

Access to high-quality dental healthcare remains a challenge in many countries due to limited resources, lack of trained professionals, and time-consuming report generation tasks. An intelligent clinical decision support system (ICDSS), which can mak...

  • Article
  • Open Access
869 Views
21 Pages

Background: Multimodal large language models (LLMs) are increasingly being explored as clinical support tools, yet their capacity for orthodontic biomechanical reasoning has not been systematically evaluated. This retrospective study assessed their a...

  • Review
  • Open Access
14 Citations
7,869 Views
15 Pages

26 December 2023

This paper provides a comprehensive overview of affective computing systems for facial expression recognition (FER) research in naturalistic contexts. The first section presents an updated account of user-friendly FER toolboxes incorporating state-of...

  • Article
  • Open Access
5 Citations
9,531 Views
30 Pages

4 April 2025

This study investigates prompt engineering (PE) strategies to mitigate hallucination, a key limitation of multimodal large language models (MLLMs). To address this issue, we explore five prominent multimodal PE techniques: in-context learning (ICL),...

  • Article
  • Open Access
29 Citations
9,591 Views
20 Pages

11 September 2024

Large Language Models (LLMs) combined with visual foundation models have demonstrated significant advancements, achieving intelligence levels comparable to human capabilities. This study analyzes the latest Multimodal LLMs (MLLMs), including Multimod...

  • Article
  • Open Access
4 Citations
2,879 Views
15 Pages

24 March 2025

Background/Objectives: Image-based food energy estimation is essential for user-friendly food tracking applications, enabling individuals to monitor their dietary intake through smartphones or AR devices. However, existing deep learning approaches st...

  • Article
  • Open Access
5 Citations
6,236 Views
41 Pages

Predictive maintenance in industrial settings increasingly demands systems capable of integrating heterogeneous data streams while balancing computational efficiency and contextual reasoning. This paper introduces a novel framework leveraging Large L...

  • Feature Paper
  • Article
  • Open Access
2 Citations
3,680 Views
27 Pages

Mitigating Context Bias in Vision–Language Models via Multimodal Emotion Recognition

  • Constantin-Bogdan Popescu,
  • Laura Florea and
  • Corneliu Florea

20 August 2025

Vision–Language Models (VLMs) have become key contributors to the state of the art in contextual emotion recognition, demonstrating a superior ability to understand the relationship between context, facial expressions, and interactions in image...

  • Article
  • Open Access
2,828 Views
18 Pages

MLLM-Search: A Zero-Shot Approach to Finding People Using Multimodal Large Language Models

  • Angus Fung,
  • Aaron Hao Tan,
  • Haitong Wang,
  • Bensiyon Benhabib and
  • Goldie Nejat

28 July 2025

Robotic search of people in human-centered environments, including healthcare settings, is challenging, as autonomous robots need to locate people without complete or any prior knowledge of their schedules, plans, or locations. Furthermore, robots ne...

  • Review
  • Open Access
12 Citations
4,594 Views
24 Pages

This study aims to comprehensively review and empirically evaluate the application of multimodal large language models (MLLMs) and Large Vision Models (VLMs) in object detection for transportation systems. In the first fold, we provide a background a...

  • Review
  • Open Access
2 Citations
5,342 Views
25 Pages

1 August 2025

Emotion regulation is essential for mental health. However, many people ignore their own emotional regulation or are deterred by the high cost of psychological counseling, which poses significant challenges to making effective support widely availabl...

  • Article
  • Open Access
38 Citations
6,973 Views
20 Pages

Using Multimodal Large Language Models (MLLMs) for Automated Detection of Traffic Safety-Critical Events

  • Mohammad Abu Tami,
  • Huthaifa I. Ashqar,
  • Mohammed Elhenawy,
  • Sebastien Glaser and
  • Andry Rakotonirainy

2 September 2024

Traditional approaches to safety event analysis in autonomous systems have relied on complex machine and deep learning models and extensive datasets for high accuracy and reliability. However, the emergence of multimodal large language models (MLLMs) of...

  • Article
  • Open Access
248 Views
20 Pages

Background: Multimodal large language models (MLLMs) integrating multiple AI systems and unimodal large language models (LLMs) represent distinct approaches to clinical decision support. Their comparative performance against human clinical experts in...

  • Article
  • Open Access
4 Citations
2,427 Views
19 Pages

Facial Analysis for Plastic Surgery in the Era of Artificial Intelligence: A Comparative Evaluation of Multimodal Large Language Models

  • Syed Ali Haider,
  • Srinivasagam Prabha,
  • Cesar A. Gomez-Cabello,
  • Sahar Borna,
  • Ariana Genovese,
  • Maissa Trabilsy,
  • Adekunle Elegbede,
  • Jenny Fei Yang,
  • Andrea Galvao and
  • Antonio Jorge Forte
  • + 1 author

16 May 2025

Background/Objectives: Facial analysis is critical for preoperative planning in facial plastic surgery, but traditional methods can be time consuming and subjective. This study investigated the potential of Artificial Intelligence (AI) for objective...

  • Article
  • Open Access
2,194 Views
15 Pages

Managing traffic flow through urban intersections is challenging. Conflicts involving a mix of different vehicles with blind spots make intersections relatively vulnerable to crashes. This paper presents a new framework based on a fine-tuned Multimo...

  • Article
  • Open Access
743 Views
22 Pages

In the realm of children’s education, multimodal large language models (MLLMs) are already being utilized to create educational materials for young learners. But how significant are the differences between image-based fairy tales generated by M...

  • Article
  • Open Access
16 Citations
3,941 Views
28 Pages

Visual Reasoning and Multi-Agent Approach in Multimodal Large Language Models (MLLMs): Solving TSP and mTSP Combinatorial Challenges

  • Mohammed Elhenawy,
  • Ahmad Abutahoun,
  • Taqwa I. Alhadidi,
  • Ahmed Jaber,
  • Huthaifa I. Ashqar,
  • Shadi Jaradat,
  • Ahmed Abdelhay,
  • Sebastien Glaser and
  • Andry Rakotonirainy

Multimodal Large Language Models (MLLMs) harness comprehensive knowledge spanning text, images, and audio to adeptly tackle complex problems. This study explores the ability of MLLMs in visually solving the Traveling Salesman Problem (TSP) and Multip...

  • Review
  • Open Access
5 Citations
7,319 Views
22 Pages

31 October 2022

Because pre-trained models are not limited by the scale of data annotation and can learn general semantic information, they perform well in tasks related to natural language processing and computer vision. In recent years, more and more attention ha...

  • Article
  • Open Access

Background/Objectives: Multimodal large language models (MLLMs) offer potential for automated fracture detection, yet their diagnostic stability under repeated inference remains underexplored. This study evaluates the diagnostic accuracy, stability,...

  • Article
  • Open Access
25 Citations
5,722 Views
19 Pages

10 October 2024

The integration of thermal imaging data with multimodal large language models (MLLMs) offers promising advancements for enhancing the safety and functionality of autonomous driving systems (ADS) and intelligent transportation systems (ITS). This stud...

  • Article
  • Open Access
2 Citations
3,472 Views
19 Pages

6 September 2021

Over the last few years, there has been an increase in studies that consider experiential (visual) information by building multi-modal language models and representations. Several studies have shown that language acquisition in humans starts...

  • Article
  • Open Access
4 Citations
1,669 Views
23 Pages

Background: Periodontitis is a multifactorial disease leading to the loss of clinical attachment and alveolar bone. The diagnosis of periodontitis involves a clinical examination and radiographic evaluation, including panoramic images. Panoramic radi...

  • Article
  • Open Access
  • 1 Citation
3,819 Views
17 Pages

5 January 2025

Large visual language models like Contrastive Language-Image Pre-training (CLIP), despite their excellent performance, are highly vulnerable to adversarial examples. This work investigates the accuracy and robustness of visual langua...

  • Article
  • Open Access
31 Citations
13,491 Views
30 Pages

Multimodal Large Language Model-Based Fault Detection and Diagnosis in Context of Industry 4.0

  • Khalid M. Alsaif,
  • Aiiad A. Albeshri,
  • Maher A. Khemakhem and
  • Fathy E. Eassa

12 December 2024

In this paper, we propose a novel multimodal large language model-based fault detection and diagnosis framework that addresses the limitations of traditional fault detection and diagnosis approaches. The proposed framework leverages the Generative P...

  • Article
  • Open Access
1,042 Views
29 Pages

17 September 2025

Driving safety hinges on the dynamic interplay between task demand and driving capability, yet these concepts lack a unified, quantifiable formulation. In this work, we present a framework based on a multimodal large language model that transforms he...

  • Article
  • Open Access
4 Citations
6,413 Views
20 Pages

13 May 2025

In modern manufacturing, making accurate and timely decisions requires the ability to effectively handle multiple types of data. This paper presents a multimodal system designed specifically for smart manufacturing applications. The system combines v...

  • Article
  • Open Access
325 Views
31 Pages

7 January 2026

Forest fire monitoring in remote sensing imagery has long relied on traditional perception models that primarily focus on detection or segmentation. However, such approaches fall short in understanding complex fire dynamics, including contextual reas...

  • Article
  • Open Access
748 Views
22 Pages

10 December 2025

Eye tracking scanpaths encode the temporal sequence and spatial distribution of eye movements, offering insights into visual attention and aesthetic perception. However, analysing scanpaths still requires substantial manual effort and specialised exp...

  • Feature Paper
  • Article
  • Open Access
2 Citations
3,581 Views
44 Pages

Harmonizer: A Universal Signal Tokenization Framework for Multimodal Large Language Models

  • Amin Amiri,
  • Alireza Ghaffarnia,
  • Nafiseh Ghaffar Nia,
  • Dalei Wu and
  • Yu Liang

29 May 2025

This paper introduces Harmonizer, a universal framework designed for tokenizing heterogeneous input signals, including text, audio, and video, to enable seamless integration into multimodal large language models (LLMs). Harmonizer employs a unified a...

  • Article
  • Open Access
279 Views
16 Pages

8 January 2026

The escalating volume of waste electrical and electronic equipment (WEEE) poses a significant global environmental challenge. The disassembly of printed circuit boards (PCBs), a critical step for resource recovery, remains inefficient due to limitati...

  • Article
  • Open Access
1,194 Views
16 Pages

6 November 2025

Background: Artificial intelligence (AI) has shown significant promise in augmenting diagnostic capabilities across medical specialties. Recent advancements in generative AI allow for synthesis and interpretation of complex clinical data including im...

  • Article
  • Open Access
2 Citations
1,192 Views
26 Pages

Urban Greening Analysis: A Multimodal Large Language Model for Pinpointing Vegetation Areas in Adverse Weather Conditions

  • Hanzhang Liu,
  • Shijie Yang,
  • Chengwu Long,
  • Jiateng Yuan,
  • Qirui Yang,
  • Jiahua Fan,
  • Bingnan Meng,
  • Zhibo Chen,
  • Fu Xu and
  • Chao Mou

14 June 2025

Urban green spaces are an important part of the urban ecosystem and hold significant ecological value. To effectively protect these green spaces, urban managers urgently need to identify them and monitor their changes. Common urban vegetation positio...

  • Article
  • Open Access
4 Citations
6,721 Views
25 Pages

This study presents a comparative analysis of several multimodal large language models (LLMs) for no-reference image quality assessment, with a particular focus on images containing authentic distortions. We evaluate three models developed by OpenAI...

  • Article
  • Open Access
3 Citations
1,671 Views
30 Pages

AutoGEEval: A Multimodal and Automated Evaluation Framework for Geospatial Code Generation on GEE with Large Language Models

  • Huayi Wu,
  • Zhangxiao Shen,
  • Shuyang Hou,
  • Jianyuan Liang,
  • Haoyue Jiao,
  • Yaxian Qing,
  • Xiaopu Zhang,
  • Xu Li,
  • Zhipeng Gui and
  • Longgang Xiang
  • + 1 author

Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, there remains a lack of standardized tools for automatic evaluation in this domain. To address this gap, we p...

  • Article
  • Open Access
  • 1 Citation
3,985 Views
22 Pages

2 February 2021

Due to the development of computer vision and natural language processing technologies in recent years, there has been a growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data...

  • Article
  • Open Access
3 Citations
3,499 Views
15 Pages

7 March 2019

When traditional knowledge retrieval algorithms are used to analyze whether the feature input of words in a multi-modal natural language library is symmetrical, the symmetry of the words cannot be analyzed, resulting in inaccurate analysis results. A feature...

  • Article
  • Open Access
5,688 Views
22 Pages

5 October 2025

Autonomous driving in complex real-world environments requires robust perception, reasoning, and physically feasible planning, which remain challenging for current end-to-end approaches. This paper introduces VLA-MP, a unified vision-language-action...
