270 Results Found

  • Article
  • Open Access
1,642 Views
17 Pages

Fusion-Optimized Multimodal Entity Alignment with Textual Descriptions

  • Chenchen Wang,
  • Chaomurilige,
  • Yu Weng,
  • Xuan Liu and
  • Zheng Liu

24 June 2025

Multimodal knowledge graph entity alignment is a fundamental task in knowledge fusion and integration: it identifies entities that are semantically equivalent but represented differently across knowledge graphs. Previous entity alignme...

  • Article
  • Open Access
2 Citations
2,605 Views
17 Pages

A Study on Generative Models for Visual Recognition of Unknown Scenes Using a Textual Description

  • Jose Martinez-Carranza,
  • Delia Irazú Hernández-Farías,
  • Victoria Eugenia Vazquez-Meza,
  • Leticia Oyuki Rojas-Perez and
  • Aldrich Alfredo Cabrera-Ponce

27 October 2023

In this study, we investigate the application of generative models to assist artificial agents, such as delivery drones or service robots, in visualising unfamiliar destinations solely based on textual descriptions. We explore the use of generative m...

  • Article
  • Open Access
3 Citations
2,351 Views
19 Pages

12 April 2024

The current research on text-guided 3D synthesis predominantly utilizes complex diffusion models, posing significant challenges in tasks like terrain generation. This study ventures into the direct synthesis of text-to-3D terrain in a zero-shot fashi...

  • Article
  • Open Access
3 Citations
2,657 Views
27 Pages

10 March 2025

In the early stages of architectural design, architects convert initial ideas into concrete design schemes, which heavily rely on their creativity and consume considerable time. Therefore, generative design methods based on artificial intelligence ar...

  • Article
  • Open Access
2 Citations
2,229 Views
20 Pages

11 December 2023

With the rapid growth of social media, the volume of textual content keeps increasing. Unstructured texts are a rich source of latent spatial information. Extracting such information is useful in query processing, geographical information retrieval (GIR), a...

  • Article
  • Open Access
5 Citations
3,423 Views
38 Pages

A New Generative Model for Textual Descriptions of Medical Images Using Transformers Enhanced with Convolutional Neural Networks

  • Artur Gomes Barreto,
  • Juliana Martins de Oliveira,
  • Francisco Nauber Bernardo Gois,
  • Paulo Cesar Cortez and
  • Victor Hugo Costa de Albuquerque

The automatic generation of descriptions for medical images has sparked increasing interest in the healthcare field due to its potential to assist professionals in the interpretation and analysis of clinical exams. This study explores the development...

  • Article
  • Open Access
5 Citations
2,721 Views
17 Pages

Visual Description Augmented Integration Network for Multimodal Entity and Relation Extraction

  • Min Zuo,
  • Yingjun Wang,
  • Wei Dong,
  • Qingchuan Zhang,
  • Yuanyuan Cai and
  • Jianlei Kong

18 May 2023

Multimodal Named Entity Recognition (MNER) and multimodal Relationship Extraction (MRE) play an important role in processing multimodal data and understanding entity relationships across textual and visual domains. However, irrelevant image informati...

  • Article
  • Open Access
15 Citations
8,563 Views
15 Pages

Enhanced Image Captioning with Color Recognition Using Deep Learning Methods

  • Yeong-Hwa Chang,
  • Yen-Jen Chen,
  • Ren-Hung Huang and
  • Yi-Ting Yu

26 December 2021

Automatically describing the content of an image is an interesting and challenging task in artificial intelligence. In this paper, an enhanced image captioning model—including object detection, color analysis, and image captioning—is prop...

  • Article
  • Open Access
2 Citations
1,863 Views
14 Pages

30 June 2025

Crop diseases pose a significant threat to agricultural productivity and global food security. Timely and accurate disease identification is crucial for improving crop yield and quality. While most existing deep learning-based methods focus primarily...

  • Article
  • Open Access
1 Citation
1,897 Views
19 Pages

A Picture May Be Worth a Hundred Words for Visual Question Answering

  • Yusuke Hirota,
  • Noa Garcia,
  • Mayu Otani,
  • Chenhui Chu and
  • Yuta Nakashima

31 October 2024

How far can textual representations go in understanding images? In image understanding, effective representations are essential. Deep visual features from object recognition models currently dominate various tasks, especially Visual Question Answerin...

  • Article
  • Open Access
1,557 Views
13 Pages

18 November 2024

In the realm of computer vision and animation, the generation of human motion from textual descriptions represents a frontier of significant challenge and potential. This paper introduces MLUG, a groundbreaking framework poised to transform motion sy...

  • Article
  • Open Access
2 Citations
6,891 Views
22 Pages

22 July 2022

Technological problems related to everyday work elements are real, and IT professionals can solve them. However, when they encounter a problem, they must go to a platform where they can detail the category and textual description of the incident so t...

  • Article
  • Open Access
777 Views
15 Pages

Lightweight Multimodal Adapter for Visual Object Tracking

  • Vasyl Borsuk,
  • Vitaliy Yakovyna and
  • Nataliya Shakhovska

Visual object tracking is a fundamental computer vision task recently extended to multimodal settings, where natural language descriptions complement visual information. Existing multimodal trackers typically rely on large-scale transformer architect...

  • Article
  • Open Access
7 Citations
4,523 Views
10 Pages

Multimodal Food Image Classification with Large Language Models

  • Jun-Hwa Kim,
  • Nam-Ho Kim,
  • Donghyeok Jo and
  • Chee Sun Won

20 November 2024

In this study, we leverage advancements in large language models (LLMs) for fine-grained food image classification. We achieve this by integrating textual features extracted from images using an LLM into a multimodal learning framework. Specifically,...

  • Article
  • Open Access
31 Citations
4,971 Views
14 Pages

Deep Unsupervised Embedding for Remote Sensing Image Retrieval Using Textual Cues

  • Mohamad M. Al Rahhal,
  • Yakoub Bazi,
  • Taghreed Abdullah,
  • Mohamed L. Mekhalfi and
  • Mansour Zuair

14 December 2020

Compared to image-image retrieval, text-image retrieval has been less investigated in the remote sensing community, possibly because of the complexity of appropriately tying textual data to respective visual representations. Moreover, a single image...

  • Article
  • Open Access
2,663 Views
19 Pages

25 August 2025

Open-vocabulary object detection (OVOD) aims to localize and recognize objects in images by leveraging category-specific textual inputs, including both known and novel categories. While existing methods excel in general scenarios, their performance s...

  • Article
  • Open Access
12 Citations
3,628 Views
18 Pages

23 January 2024

Image captioning is a technique that enables the automatic extraction of natural language descriptions about the contents of an image. On the one hand, information in the form of natural language can enhance accessibility by reducing the expertise re...

  • Article
  • Open Access
4 Citations
4,223 Views
17 Pages

Generation of Custom Textual Model Editors

  • Eugene Syriani,
  • Daniel Riegelhaupt,
  • Bruno Barroca and
  • Istvan David

6 November 2021

Textual editors are omnipresent in all software tools. Editors provide basic features, such as copy-pasting and searching, or more advanced features, such as error checking and text completion. Current technologies in model-driven engineering can aut...

  • Article
  • Open Access
1 Citation
2,098 Views
20 Pages

8 January 2024

Collating vast test reports is a time-consuming and laborious task in crowdsourced testing. Crowdsourced test reports are usually presented in two ways, one as text and the other as images, which have symmetrical content. Researchers have proposed ma...

  • Article
  • Open Access
2,467 Views
18 Pages

The paper demonstrates a novel methodology for Content-Based Image Retrieval (CBIR), which shifts the focus from conventional domain-specific image queries to more complex text-based query processing. Latent diffusion models are employed to interpret...

  • Article
  • Open Access
26 Citations
7,164 Views
19 Pages

Automatic Authorship Detection Using Textual Patterns Extracted from Integrated Syntactic Graphs

  • Helena Gómez-Adorno,
  • Grigori Sidorov,
  • David Pinto,
  • Darnes Vilariño and
  • Alexander Gelbukh

29 August 2016

We apply the integrated syntactic graph feature extraction methodology to the task of automatic authorship detection. This graph-based representation allows integrating different levels of language description into a single structure. We extract text...

  • Article
  • Open Access
3,417 Views
25 Pages

OVSLT: Advancing Sign Language Translation with Open Vocabulary

  • Ai Wang,
  • Junhui Li,
  • Wuyang Luan and
  • Lei Pan

Hearing impairments affect approximately 1.5 billion individuals worldwide, highlighting the critical need for effective communication tools between deaf and hearing populations. Traditional sign language translation (SLT) models predominantly rely o...

  • Article
  • Open Access
1,125 Views
26 Pages

Text-based person search (TPS), a critical technology for security and surveillance, aims to retrieve target individuals from image galleries using textual descriptions. The existing methods face two challenges: (1) ambiguous attribute–noun ass...

  • Article
  • Open Access
2,236 Views
21 Pages

21 March 2024

Image captioning, also recognized as the challenge of transforming visual data into coherent natural language descriptions, has persisted as a complex problem. Traditional approaches often suffer from semantic gaps, wherein the generated textual desc...

  • Article
  • Open Access
78 Views
22 Pages

4 January 2026

Remote Sensing Visual Grounding (RSVG) requires fine-grained understanding of language descriptions to localize the specific image regions. Conventional methods typically employ a pipeline of separate visual and textual encoders and a fusion module....

  • Article
  • Open Access
43 Citations
727 Views
17 Pages

Gaze Transitions when Learning with Multimedia

  • Krzysztof Krejtz,
  • Andrew T. Duchowski,
  • Izabela Krejtz,
  • Agata Kopacz and
  • Piotr Chrząstowski-Wachtel

10 February 2016

Eye tracking methodology is used to examine the influence of interactive multimedia on the allocation of visual attention and its dynamics during learning. We hypothesized that an interactive simulation promotes more organized switching of attention...

  • Article
  • Open Access
2 Citations
1,468 Views
23 Pages
Electronics 2024, 13(24), 4905; https://doi.org/10.3390/electronics13244905
(registering DOI)

12 December 2024

Although significant progress has been made in sentiment analysis tasks based on image–text data, existing methods still have limitations in capturing cross-modal correlations and detailed information. To address these issues, we propose a Mult...

  • Article
  • Open Access
95 Views
19 Pages

Do LLMs Speak BPMN? An Evaluation of Their Process Modeling Capabilities Based on Quality Measures

  • Panagiotis Drakopoulos,
  • Panagiotis Malousoudis,
  • Nikolaos Nousias,
  • George Tsakalidis and
  • Kostas Vergidis

Large Language Models (LLMs) are emerging as powerful tools for automating business process modeling, promising to streamline the translation of textual process descriptions into Business Process Model and Notation (BPMN) diagrams. However, the exten...

  • Article
  • Open Access
2 Citations
3,359 Views
22 Pages

Fraud exists on both legitimate e-commerce platforms and illicit dark web marketplaces, impacting both environments. Detecting fraudulent vendors proves challenging, despite clients’ reporting scams to platform administrators and specialised fo...

  • Article
  • Open Access
1,138 Views
21 Pages

With the rapid development of multimodal prompt learning in unsupervised domains, prompt tuning has demonstrated significant potential for dense counting tasks. However, existing supervised methods heavily rely on annotated data, limiting their gener...

  • Article
  • Open Access
2 Citations
2,624 Views
23 Pages

Extraction of Event-Related Information from Text for the Representation of Cultural Heritage

  • Emmanouil Ntafotis,
  • Emmanouil Zidianakis,
  • Nikolaos Partarakis and
  • Constantine Stephanidis

9 November 2022

In knowledge representation systems for Cultural Heritage (CH) there is a vast amount of curated textual information for CH objects and sites. However, the large-scale study of the accumulated knowledge is difficult as long as it is provided in the f...

  • Article
  • Open Access
2 Citations
3,643 Views
17 Pages

4 May 2023

Multi-modal deep learning methods have achieved great improvements in visual grounding; their objective is to localize text-specified objects in images. Most of the existing methods can localize and classify objects with significant appearance differ...

  • Article
  • Open Access
4 Citations
5,644 Views
17 Pages

12 November 2018

Image caption generation is a fundamental task to build a bridge between image and its description in text, which is drawing increasing interest in artificial intelligence. Images and textual sentences are viewed as two different carriers of informat...

  • Article
  • Open Access
743 Views
16 Pages

28 October 2025

We propose a point cloud-based framework for open-vocabulary object pose estimation, called Pov9D. Existing approaches are predominantly RGB-based and often rely on texture or appearance cues, making them susceptible to pose ambiguities when objects...

  • Article
  • Open Access
9 Citations
5,065 Views
14 Pages

Image Caption Generation via Unified Retrieval and Generation-Based Method

  • Shanshan Zhao,
  • Lixiang Li,
  • Haipeng Peng,
  • Zihang Yang and
  • Jiaxuan Zhang

8 September 2020

Image captioning is a multi-modal transduction task, translating the source image into the target language. Numerous dominant approaches primarily employed the generation-based or the retrieval-based method. These two kinds of frameworks have their a...

  • Article
  • Open Access
1 Citation
2,415 Views
13 Pages

13 June 2023

Although image recognition technologies are developing rapidly with deep learning, conventional recognition models trained by supervised learning with class labels do not work well when test inputs from untrained classes are given. For example, a rec...

  • Article
  • Open Access
5 Citations
5,558 Views
21 Pages

Multi-modal data are widely available for online real estate listings. Announcements can contain various forms of data, including visual data and unstructured textual descriptions. Nonetheless, many traditional real estate pricing models rely solely...

  • Article
  • Open Access
784 Views
17 Pages

19 November 2025

Knowledge graphs (KGs) have emerged as fundamental infrastructures for organizing structured information across a wide range of AI applications. Practically, KGs are often incomplete, which limits their effectiveness. Knowledge Graph Completion (KGC)...

  • Article
  • Open Access
8 Citations
5,183 Views
31 Pages

Multimodal Classification of Safety-Report Observations

  • Georgios Paraskevopoulos,
  • Petros Pistofidis,
  • Georgios Banoutsos,
  • Efthymios Georgiou and
  • Vassilis Katsouros

7 June 2022

Modern businesses are obligated to conform to regulations to prevent physical injuries and ill health for anyone present on a site under their responsibility, such as customers, employees and visitors. Safety officers (SOs) are engineers, who perform...

  • Article
  • Open Access
1,609 Views
14 Pages

14 December 2024

The endeavor of spatial position reasoning effectively simulates the sensory and comprehension faculties of artificial intelligence, especially within the purview of multimodal modeling that fuses imagery with linguistic data. Recent progress in visu...

  • Article
  • Open Access
2 Citations
2,355 Views
28 Pages

Accessible IoT Dashboard Design with AI-Enhanced Descriptions for Visually Impaired Users

  • George Alex Stelea,
  • Livia Sangeorzan and
  • Nicoleta Enache-David

The proliferation of the Internet of Things (IoT) has led to an abundance of data streams and real-time dashboards in domains such as smart cities, healthcare, manufacturing, and agriculture. However, many current IoT dashboards emphasize complex vis...

  • Article
  • Open Access
2 Citations
2,047 Views
18 Pages

30 October 2023

Given a textual query, text-based person re-identification is supposed to search for the targeted pedestrian images from a large-scale visual database. Due to the inherent heterogeneity between different modalities, it is challenging to measure the c...

  • Article
  • Open Access
8 Citations
3,335 Views
21 Pages

22 May 2023

Recommender systems are challenged with providing accurate recommendations that meet the diverse preferences of users. The main information sources for these systems are the utility matrix and textual sources, such as item descriptions, users’...

  • Article
  • Open Access
1,300 Views
19 Pages

MedLangViT: A Language–Vision Network for Medical Image Segmentation

  • Yiyi Wang,
  • Jia Su,
  • Xinxiao Li and
  • Eisei Nakahara

Precise medical image segmentation is crucial for advancing computer-aided diagnosis. Although deep learning-based medical image segmentation is now widely applied in this field, the complexity of human anatomy and the diversity of pathological manif...

  • Article
  • Open Access
349 Citations
28,047 Views
22 Pages

COVID-19 Public Sentiment Insights and Machine Learning for Tweets Classification

  • Jim Samuel,
  • G. G. Md. Nawaz Ali,
  • Md. Mokhlesur Rahman,
  • Ek Esawi and
  • Yana Samuel

11 June 2020

Along with the Coronavirus pandemic, another crisis has manifested itself in the form of mass fear and panic phenomena, fueled by incomplete and often inaccurate information. There is therefore a tremendous need to address and better understand COVID...

  • Article
  • Open Access
1,446 Views
18 Pages

17 July 2025

Many radar applications rely primarily on visual classification for their evaluations. However, new research is integrating textual descriptions alongside visual input and showing that such multimodal fusion improves contextual understanding. A criti...

  • Article
  • Open Access
813 Views
18 Pages

SATrack: Semantic-Aware Alignment Framework for Visual–Language Tracking

  • Yangyang Tian,
  • Liusen Xu,
  • Zhe Li,
  • Liang Jiang,
  • Cen Chen and
  • Huanlong Zhang

4 October 2025

Visual–language tracking often faces challenges like target deformation and confusion caused by similar objects. These issues can disrupt the alignment between visual inputs and their textual descriptions, leading to cross-modal semantic drift...

  • Article
  • Open Access
473 Views
17 Pages

24 November 2025

Text-to-image person re-identification (T2I-ReID) aims to retrieve pedestrians from images/videos based on textual descriptions. However, most methods implicitly assume that training image–text pairs are correctly aligned, while in practice, is...

  • Article
  • Open Access
2 Citations
3,528 Views
19 Pages

7 February 2025

This paper investigates points of vulnerability in the decisions made by backers and campaigners in crowdfund pledges in an attempt to facilitate a sustainable entrepreneurial ecosystem by increasing the rate of good projects being funded. In doing s...

  • Article
  • Open Access
304 Views
15 Pages

27 November 2025

Accurate classification of focal liver lesions (FLLs) is crucial for reliable clinical decision-making. Inspired by contrastive vision-language models such as CLIP and MedCLIP, we propose Liver-VLM for FLLs classification, trained on a dedicated mult...
