Multi-Modal Deep Learning and Its Applications

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 November 2023) | Viewed by 23873

Special Issue Editors

Guest Editor
Shenzhen Key Lab for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
Interests: natural language processing; image captioning; text–image retrieval; visual storytelling

Guest Editor
School of Information Engineering, Ningxia University, Yinchuan 750021, China
Interests: facial analysis; medical imaging; metric learning; representation learning; self-supervised learning; reinforcement learning; deep learning

Guest Editor
College of Computer and Information Science, Southwest University, Chongqing 400715, China
Interests: data mining; pattern recognition; image processing

Guest Editor
School of Computing and Augmented Intelligence, Arizona State University, Tempe Campus, Tempe, AZ 85287-8809, USA
Interests: computer science education; service-oriented computing; programming languages; dependable computing; robotics and embedded systems

Special Issue Information

Dear Colleagues,

The Sixth Asian Conference on Artificial Intelligence Technology (ACAIT-2022) will be held in Changzhou, China. The conference invites the submission of substantial, original, and unpublished research papers on Artificial Intelligence (AI) applications in image analysis, video analysis, medical image processing, intelligent vehicles, natural language processing, and other AI-enabled applications.

Multi-modal learning, an important sub-area of AI, has recently attracted considerable attention due to its broad applications in the multimedia community. Early studies relied heavily on feature engineering, which is time-consuming and labor-intensive. With the advancement of deep learning, great efforts have been made to improve the performance of multi-modal applications using multi-modal deep learning. Nevertheless, deep learning techniques have not yet bridged the heterogeneity gaps between different modalities (i.e., computer vision, natural language processing, speech, and heterogeneous signals). The goal of this Special Issue is to collect contributions on multi-modal deep learning and its applications.

Papers for this Special Issue, entitled “Multi-Modal Deep Learning and Its Applications”, will focus on (but are not limited to) the following topics:

  • Deep learning for cross-modality data (e.g., video captioning, cross-modal retrieval, and video generation);
  • Deep learning for video processing;
  • Multi-modal representation learning;
  • Unified multi-modal pre-training;
  • Multi-modal metric learning;
  • Multi-modal medical imaging;
  • Unsupervised/self-supervised approaches in modality alignment;
  • Model-agnostic approaches in modality fusion;
  • Co-training, transfer learning, and zero-shot learning;
  • Industrial visual inspection.

Dr. Min Yang
Dr. Hao Liu
Prof. Dr. Shanxiong Chen
Dr. Yinong Chen
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep multi-modal learning
  • deep video processing
  • multi-modal representation learning
  • unified multi-modal pre-training
  • multi-modal metric learning
  • multi-modal medical imaging
  • unsupervised modality alignment
  • self-supervised modality alignment
  • model-agnostic modality fusion
  • co-training
  • transfer learning
  • zero-shot learning
  • industrial visual inspection

Published Papers (10 papers)


Research


14 pages, 406 KiB  
Article
English Speech Emotion Classification Based on Multi-Objective Differential Evolution
by Liya Yue, Pei Hu, Shu-Chuan Chu and Jeng-Shyang Pan
Appl. Sci. 2023, 13(22), 12262; https://doi.org/10.3390/app132212262 - 13 Nov 2023
Cited by 1 | Viewed by 625
Abstract
Speech signals involve speakers’ emotional states and language information, which is very important for human–computer interaction that recognizes speakers’ emotions. Feature selection is a common method for improving recognition accuracy. In this paper, we propose a multi-objective optimization method based on differential evolution (MODE-NSF) that maximizes recognition accuracy and minimizes the number of selected features (NSF). First, Mel-frequency cepstral coefficient (MFCC) features and pitch features are extracted from speech signals. Then, the proposed algorithm implements feature selection, where the NSF guides the initialization, crossover, and mutation of the algorithm. We used four English speech emotion datasets, together with K-nearest neighbor (KNN) and random forest (RF) classifiers, to validate the performance of the proposed algorithm. The results illustrate that MODE-NSF is superior to other multi-objective algorithms in terms of hypervolume (HV), inverted generational distance (IGD), Pareto optimal solutions, and running time. MODE-NSF achieved an accuracy of 49% using eNTERFACE05, 53% using the Ryerson audio-visual database of emotional speech and song (RAVDESS), 76% using the Surrey audio-visual expressed emotion (SAVEE) database, and 98% using the Toronto emotional speech set (TESS). MODE-NSF obtained good recognition results, which provides a basis for the establishment of emotional models.
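
To make the optimization loop concrete, here is a minimal sketch of multi-objective differential evolution over binary feature masks, scoring each mask by KNN error and feature count and keeping the non-dominated solutions. It is not the authors' MODE-NSF implementation: the digits dataset stands in for extracted MFCC/pitch features, and the mutation, crossover, and binarization settings are illustrative.

```python
# Sketch of multi-objective DE for feature selection (illustrative, not MODE-NSF itself).
import numpy as np
from sklearn.datasets import load_digits              # stand-in for MFCC/pitch features
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
n_feat, pop_size, n_gen = X.shape[1], 20, 15
rng = np.random.default_rng(0)

def evaluate(mask):
    """Objectives to minimize: classification error and number of selected features."""
    if mask.sum() == 0:
        return 1.0, n_feat
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask.astype(bool)], y, cv=3).mean()
    return 1.0 - acc, int(mask.sum())

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

pop = (rng.random((pop_size, n_feat)) < 0.5).astype(float)
fit = [evaluate(ind) for ind in pop]

for _ in range(n_gen):
    for i in range(pop_size):
        a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
        mutant = a + 0.8 * (b - c)                                   # DE/rand/1 mutation
        mutant = (1 / (1 + np.exp(-mutant)) > rng.random(n_feat)).astype(float)  # binarize
        cross = rng.random(n_feat) < 0.9                             # binomial crossover
        trial = np.where(cross, mutant, pop[i])
        f_trial = evaluate(trial)
        if dominates(f_trial, fit[i]):                               # Pareto-based replacement
            pop[i], fit[i] = trial, f_trial

front = {f for f in fit if not any(dominates(g, f) for g in fit)}
print(sorted(front))    # (error, number of selected features) pairs on the front
```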

27 pages, 5750 KiB  
Article
Sound-to-Imagination: An Exploratory Study on Cross-Modal Translation Using Diverse Audiovisual Data
by Leonardo A. Fanzeres and Climent Nadeu
Appl. Sci. 2023, 13(19), 10833; https://doi.org/10.3390/app131910833 - 29 Sep 2023
Viewed by 852
Abstract
The motivation of our research is to explore the possibilities of automatic sound-to-image (S2I) translation for enabling a human receiver to visually infer occurrences of sound-related events. We expect the computer to ‘imagine’ scenes from captured sounds, generating original images that depict the sound-emitting sources. Previous studies on similar topics opted for simplified approaches using data with low content diversity and/or supervision/self-supervision for training. In contrast, our approach involves performing S2I translation using thousands of distinct and unknown scenes, using sound class annotations solely for data preparation, just enough to ensure aural–visual semantic coherence. To model the translator, we employ an audio encoder and a conditional generative adversarial network (GAN) with a deep densely connected generator. Furthermore, we present a solution using informativity classifiers for quantitatively evaluating the generated images. This allows us to analyze the influence of network-bottleneck variation on the translation process, highlighting a potential trade-off between informativity and pixel space convergence. Despite the complexity of the specified S2I translation task, we were able to generalize the model enough to obtain more than 14%, on average, of interpretable and semantically coherent images translated from unknown sounds.
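
The core idea of conditioning image generation on sound can be sketched as an audio encoder feeding a conditional GAN generator, as below. The module shapes and layer choices are assumptions for illustration and do not reproduce the paper's densely connected generator or its informativity classifiers.

```python
# Sketch of conditioning an image generator on an audio embedding (illustrative only).
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes a log-mel spectrogram (B, 1, 64, T) into a fixed-size embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
    def forward(self, spec):
        return self.net(spec)

class ConditionalGenerator(nn.Module):
    """Maps noise concatenated with the audio embedding to a 64x64 RGB image."""
    def __init__(self, z_dim=100, a_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + a_dim, 256, 4), nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())
    def forward(self, z, audio_emb):
        cond = torch.cat([z, audio_emb], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(cond)

spec = torch.randn(4, 1, 64, 128)                  # batch of spectrograms
emb = AudioEncoder()(spec)
img = ConditionalGenerator()(torch.randn(4, 100), emb)
print(img.shape)                                   # torch.Size([4, 3, 64, 64])
```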

14 pages, 5415 KiB  
Article
Polarformer: Optic Disc and Cup Segmentation Using a Hybrid CNN-Transformer and Polar Transformation
by Yaowei Feng, Zhendong Li, Dong Yang, Hongkai Hu, Hui Guo and Hao Liu
Appl. Sci. 2023, 13(1), 541; https://doi.org/10.3390/app13010541 - 30 Dec 2022
Cited by 1 | Viewed by 1504
Abstract
The segmentation of the optic disc (OD) and optic cup (OC) is used in the automatic diagnosis of glaucoma. However, the spatially ambiguous boundary and semantically uncertain region-of-interest area in pictures may lead to the degradation of the performance of precise segmentation of the OC and OD. Unlike most existing methods, including the variants of CNNs (Convolutional Neural Networks) and U-Net, which limit the contributions of rich global features, we instead propose a hybrid CNN-transformer and polar transformation network, dubbed Polarformer, which aims to extract discriminative and semantic features for robust OD and OC segmentation. Our Polarformer exploits contextualized features among all input units and models the correlation of structural relationships under the paradigm of the transformer backbone. More specifically, our learnable polar transformer module optimizes the polar transformations by sampling images in the Cartesian space and then mapping them back to the polar coordinate system for masked-image reconstruction. Extensive experimental results show that our Polarformer achieves superior performance in comparison to most state-of-the-art methods on three publicly available datasets.
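
The polar transformation at the heart of this approach can be illustrated with a small differentiable Cartesian-to-polar resampler built on grid_sample. The fixed center and radius below are placeholders, whereas the paper learns the transformation parameters; this sketch is not the authors' module.

```python
# Sketch of a differentiable Cartesian-to-polar transform for fundus images (illustrative).
import math
import torch
import torch.nn.functional as F

def to_polar(img, center, max_radius, out_h=256, out_w=256):
    """Resample img (B, C, H, W) onto a polar grid: rows = radius, cols = angle."""
    B, C, H, W = img.shape
    r = torch.linspace(0, 1, out_h).view(out_h, 1) * max_radius       # radius axis
    theta = torch.linspace(0, 2 * math.pi, out_w).view(1, out_w)      # angle axis
    x = center[0] + r * torch.cos(theta)                              # (out_h, out_w)
    y = center[1] + r * torch.sin(theta)
    # grid_sample expects coordinates normalized to [-1, 1]
    grid = torch.stack([2 * x / (W - 1) - 1, 2 * y / (H - 1) - 1], dim=-1)
    grid = grid.unsqueeze(0).expand(B, -1, -1, -1)
    return F.grid_sample(img, grid, align_corners=True)

fundus = torch.randn(2, 3, 512, 512)
polar = to_polar(fundus, center=(256.0, 256.0), max_radius=200.0)
print(polar.shape)   # torch.Size([2, 3, 256, 256])
```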

21 pages, 7315 KiB  
Article
Self-Supervised Video Representation and Temporally Adaptive Attention for Audio-Visual Event Localization
by Yue Ran, Hongying Tang, Baoqing Li and Guohui Wang
Appl. Sci. 2022, 12(24), 12622; https://doi.org/10.3390/app122412622 - 09 Dec 2022
Cited by 2 | Viewed by 1354
Abstract
Localizing the audio-visual events in video requires a combined judgment of visual and audio components. To integrate multimodal information, existing methods modeled the cross-modal relationships by feeding unimodal features into attention modules. However, these unimodal features are encoded in separate spaces, resulting in a large heterogeneity gap between modalities. Existing attention modules, on the other hand, ignore the temporal asynchrony between vision and hearing when constructing cross-modal connections, which may lead to the misinterpretation of one modality by another. Therefore, this paper aims to improve event localization performance by addressing these two problems and proposes a framework that feeds audio and visual features encoded in the same semantic space into a temporally adaptive attention module. Specifically, we develop a self-supervised representation method to encode features with a smaller heterogeneity gap by matching corresponding semantic cues between synchronized audio and visual signals. Furthermore, we develop a temporally adaptive cross-modal attention based on a weighting method that dynamically channels attention according to the time differences between event-related features. The proposed framework achieves state-of-the-art performance on the public audio-visual event dataset and the experimental results not only show that our self-supervised method can learn more discriminative features but also verify the effectiveness of our strategy for assigning attention.
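
A minimal way to express temporally aware cross-modal attention is to bias the attention logits by the time offset between audio and visual segments, as sketched below. The fixed linear decay controlled by sigma is an assumption; the paper's module assigns its weighting dynamically, so this is only a rough analogue.

```python
# Sketch of cross-modal attention with a temporal-distance penalty (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalCrossAttention(nn.Module):
    def __init__(self, dim=256, sigma=2.0):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.sigma = sigma  # controls how quickly attention decays with time offset

    def forward(self, visual, audio):
        """visual, audio: (B, T, dim) segment features aligned to the same T timesteps."""
        B, T, D = visual.shape
        scores = self.q(visual) @ self.k(audio).transpose(1, 2) / D ** 0.5   # (B, T, T)
        t = torch.arange(T, dtype=torch.float32)
        time_gap = (t.view(T, 1) - t.view(1, T)).abs()                       # |t_visual - t_audio|
        scores = scores - time_gap / self.sigma        # down-weight asynchronous pairs
        attn = F.softmax(scores, dim=-1)
        return attn @ self.v(audio)                    # audio context attended per visual step

vis, aud = torch.randn(2, 10, 256), torch.randn(2, 10, 256)
print(TemporalCrossAttention()(vis, aud).shape)   # torch.Size([2, 10, 256])
```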

16 pages, 880 KiB  
Article
EFAFN: An Efficient Feature Adaptive Fusion Network with Facial Feature for Multimodal Sarcasm Detection
by Yukuan Sun, Hangming Zhang, Shengjiao Yang and Jianming Wang
Appl. Sci. 2022, 12(21), 11235; https://doi.org/10.3390/app122111235 - 05 Nov 2022
Cited by 2 | Viewed by 1659
Abstract
Sarcasm often manifests itself in implicit language and exaggerated expressions, such as an elongated word, a sarcastic phrase, or a change of tone. Most research on sarcasm detection has recently been based on text and image information. In this paper, we argue that most image data input to the sarcasm detection model is redundant, for example, complex background information and foreground information irrelevant to sarcasm detection. Since facial details contain emotional changes and social characteristics, we should pay more attention to the image data of the face area. We therefore treat text, audio, and face images as three modalities and propose a multimodal deep-learning model to tackle this problem. Our model extracts the text, audio, and image features of face regions and then uses our proposed feature fusion strategy to fuse these three modal features into one feature vector for classification. To enhance the model’s generalization ability, we use the IMGAUG image enhancement tool to augment the public sarcasm detection dataset MUStARD. Experiments show that although using a simple supervised method is effective, using a feature fusion strategy and image features from face regions can further improve the F1 score from 72.5% to 79.0%.
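
The three-modality fusion can be sketched as projecting text, audio, and face features into a shared space and mixing them with learned gates before classification. The feature dimensions and the softmax gating below are illustrative choices, not the paper's exact fusion strategy or feature extractors.

```python
# Sketch of adaptive fusion of text, audio, and face-image features (illustrative).
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Projects each modality to a shared space and mixes them with learned gates."""
    def __init__(self, text_dim=768, audio_dim=128, face_dim=512, dim=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, dim) for d in (text_dim, audio_dim, face_dim)])
        self.gate = nn.Linear(3 * dim, 3)
        self.classifier = nn.Linear(dim, 2)            # sarcastic vs. non-sarcastic

    def forward(self, text, audio, face):
        feats = [p(x) for p, x in zip(self.proj, (text, audio, face))]        # each (B, dim)
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)  # (B, 3)
        fused = sum(w.unsqueeze(-1) * f for w, f in zip(weights.unbind(-1), feats))
        return self.classifier(fused)

logits = AdaptiveFusion()(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 512))
print(logits.shape)   # torch.Size([4, 2])
```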

17 pages, 2187 KiB  
Article
An Abstract Summarization Method Combining Global Topics
by Zhili Duan, Ling Lu, Wu Yang, Jinghui Wang and Yuke Wang
Appl. Sci. 2022, 12(20), 10378; https://doi.org/10.3390/app122010378 - 14 Oct 2022
Cited by 3 | Viewed by 1217
Abstract
Existing abstractive summarization methods only focus on the correlation between the original words and the summary words, ignoring the topics’ influence on the summaries. To this end, an abstract summarization method combining global topic information, ACGT, is proposed. A topic information extractor, based on Latent Dirichlet Allocation, is constructed to extract key topic information from the original text, and an attention module is built to fuse key topic information with the original text representation. The summary is then generated by combining a pointer generation network and a coverage mechanism. With evaluation metrics of ROUGE-1, ROUGE-2, and ROUGE-L, the experimental results of ACGT on the English dataset CNN/Daily Mail are 0.96%, 2.44%, and 1.03% higher than those of the baseline model, respectively. On the Chinese dataset LCSTS, ACGT outperforms the baseline method by 1.19%, 1.03%, and 0.85%, respectively. Our results demonstrate that the performance of summaries is significantly correlated with the number of topics that are introduced. Case studies show that the introduction of topic information can improve both the coverage of original text topics and the fluency of summaries.
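
The global topic component can be illustrated with an off-the-shelf LDA model: the sketch below extracts document-topic vectors and top topic words of the kind that could condition a summarizer's attention. The toy corpus and topic count are placeholders, and the pointer-generator and coverage parts of ACGT are not reproduced.

```python
# Sketch of extracting global topic information with LDA (illustrative, not ACGT itself).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the storm caused flooding along the coast and thousands were evacuated",
    "the team won the championship after a dramatic overtime goal",
    "the central bank raised interest rates to curb rising inflation",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)

doc_topics = lda.transform(counts)         # (n_docs, n_topics): per-document topic vector
print(doc_topics.round(2))

# Top words per topic: the kind of key topic information an attention module could fuse
# with the original text representation.
terms = vectorizer.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = comp.argsort()[-4:][::-1]
    print(f"topic {k}:", [terms[i] for i in top])
```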

13 pages, 2630 KiB  
Article
MFFAMM: A Small Object Detection with Multi-Scale Feature Fusion and Attention Mechanism Module
by Zhong Qu, Tongqiang Han and Tuming Yi
Appl. Sci. 2022, 12(18), 8940; https://doi.org/10.3390/app12188940 - 06 Sep 2022
Cited by 5 | Viewed by 1885
Abstract
To address the low detection accuracy and poor localization of small objects in single-stage object detection algorithms, we improve the backbone network of SSD (Single Shot MultiBox Detector) and present an improved SSD model based on multi-scale feature fusion and an attention mechanism module in this paper. Firstly, we enhance the feature extraction ability of the shallow network through a feature fusion method that is beneficial to small object recognition. Secondly, the RFB (Receptive Field Block) is used to expand the object’s receptive field and extract richer semantic information. After feature fusion, the attention mechanism module is added to enhance the feature information of important objects and suppress irrelevant information. The experimental results show that our algorithm achieves 80.7% and 51.8% mAP on the classic PASCAL VOC 2007 dataset and the MS COCO 2017 dataset, which are 3.2% and 10.6% higher than the original SSD algorithm. Our algorithm greatly improves the accuracy of object detection and meets real-time requirements.
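
The shallow-deep feature fusion plus attention idea can be sketched as upsampling a deep feature map, concatenating it with a shallow one, and re-weighting the fused channels. The SE-style channel attention and the channel sizes below are illustrative assumptions; the RFB module and the SSD detection heads are omitted.

```python
# Sketch of fusing shallow and deep feature maps with channel attention (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseWithAttention(nn.Module):
    def __init__(self, shallow_ch=256, deep_ch=512, out_ch=256):
        super().__init__()
        self.fuse = nn.Conv2d(shallow_ch + deep_ch, out_ch, 3, padding=1)
        # Squeeze-and-excitation style channel attention on the fused map
        self.se = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(out_ch, out_ch // 8, 1), nn.ReLU(),
                                nn.Conv2d(out_ch // 8, out_ch, 1), nn.Sigmoid())

    def forward(self, shallow, deep):
        deep_up = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear",
                                align_corners=False)       # upsample deep map to shallow size
        fused = F.relu(self.fuse(torch.cat([shallow, deep_up], dim=1)))
        return fused * self.se(fused)                       # re-weight fused channels

shallow, deep = torch.randn(1, 256, 38, 38), torch.randn(1, 512, 19, 19)
print(FuseWithAttention()(shallow, deep).shape)   # torch.Size([1, 256, 38, 38])
```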

16 pages, 2568 KiB  
Article
Typhoon Track Prediction Based on Deep Learning
by Jia Ren, Nan Xu and Yani Cui
Appl. Sci. 2022, 12(16), 8028; https://doi.org/10.3390/app12168028 - 11 Aug 2022
Cited by 5 | Viewed by 2127
Abstract
China is located in the northwest Pacific region, where typhoons occur frequently; every year typhoons make landfall, causing economic losses of varying scale and even casualties. How to predict typhoon paths more accurately has therefore become an important research topic. This paper predicts the paths of typhoons formed in the South China Sea based on deep learning. We combine a CNN and an LSTM to build a C-LSTM typhoon path prediction model, using the typhoon paths and related meteorological variables recorded in the South China Sea from 1949 to 2021 as the dataset, and using the Granger causality test to select features and thereby reduce the dimensionality of the data. Finally, comparative experiments against an LSTM-only typhoon path prediction model show that the proposed model produces smaller prediction errors.
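
A minimal C-LSTM sketch: a small CNN encodes per-timestep meteorological grids, and an LSTM consumes the resulting sequence together with past track positions to predict the next position. The variable count, grid size, and single-step head are assumptions for illustration, not the paper's configuration.

```python
# Sketch of a CNN-LSTM (C-LSTM) typhoon track predictor (illustrative).
import torch
import torch.nn as nn

class CLSTM(nn.Module):
    def __init__(self, n_vars=4, hidden=128):
        super().__init__()
        # CNN encodes a per-timestep grid of meteorological variables (e.g., wind, pressure).
        self.cnn = nn.Sequential(
            nn.Conv2d(n_vars, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(input_size=32 + 2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)               # predicted (latitude, longitude)

    def forward(self, grids, track):
        """grids: (B, T, n_vars, H, W) meteorological fields; track: (B, T, 2) past positions."""
        B, T = grids.shape[:2]
        feats = self.cnn(grids.flatten(0, 1)).view(B, T, -1)    # per-step CNN features
        out, _ = self.lstm(torch.cat([feats, track], dim=-1))
        return self.head(out[:, -1])                            # next-step position

pred = CLSTM()(torch.randn(2, 8, 4, 32, 32), torch.randn(2, 8, 2))
print(pred.shape)   # torch.Size([2, 2])
```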

16 pages, 18041 KiB  
Article
Cervical Cell Segmentation Method Based on Global Dependency and Local Attention
by Gang Li, Chengjie Sun, Chuanyun Xu, Yu Zheng and Keya Wang
Appl. Sci. 2022, 12(15), 7742; https://doi.org/10.3390/app12157742 - 01 Aug 2022
Cited by 10 | Viewed by 2012
Abstract
The refined segmentation of nuclei and the cytoplasm is the most challenging task in the automation of cervical cell screening. The U-shape network structure has demonstrated great superiority in the field of biomedical imaging. However, the classical U-Net network cannot effectively utilize mixed domain information and contextual information, and fails to achieve satisfactory results in this task. To address the above problems, a module based on global dependency and local attention (GDLA) is proposed in this study for contextual information modeling and feature refinement. It consists of three components computed in parallel: a global dependency module, a spatial attention module, and a channel attention module. The global dependency module models global contextual information to capture a priori knowledge of cervical cells, such as the positional dependence of the nuclei and cytoplasm, and the closure and uniqueness of the nuclei. The spatial attention module combines contextual information to extract cell boundary information and refine target boundaries. The channel and spatial attention modules are used to adapt the input information and make it easy to identify subtle but dominant differences between similar objects. Comparative and ablation experiments are conducted on the Herlev dataset, and the experimental results demonstrate the effectiveness of the proposed method, which surpasses the most popular existing channel attention, hybrid attention, and context networks in terms of nuclei and cytoplasm segmentation metrics, achieving better segmentation performance than most previous advanced methods.
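
The three parallel branches described above can be sketched as a non-local global-context branch, a spatial-attention branch, and a channel-attention branch whose outputs are summed. The exact branch definitions below are common generic formulations assumed for illustration, not the paper's GDLA module.

```python
# Sketch of parallel global-context, spatial, and channel attention branches (illustrative).
import torch
import torch.nn as nn

class GDLASketch(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # Global dependency: non-local style context gathered with a 1x1 attention map
        self.context = nn.Conv2d(ch, 1, 1)
        self.transform = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(), nn.Conv2d(ch, ch, 1))
        # Spatial attention over pooled channel statistics
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        # Channel attention (squeeze-and-excitation)
        self.channel = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                     nn.Conv2d(ch, ch // 8, 1), nn.ReLU(),
                                     nn.Conv2d(ch // 8, ch, 1), nn.Sigmoid())

    def forward(self, x):
        B, C, H, W = x.shape
        attn = torch.softmax(self.context(x).view(B, 1, -1), dim=-1)        # (B, 1, HW)
        ctx = (attn @ x.view(B, C, -1).transpose(1, 2)).transpose(1, 2)     # (B, C, 1)
        global_out = x + self.transform(ctx.view(B, C, 1, 1))               # broadcast add
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        spatial_out = x * self.spatial(pooled)
        channel_out = x * self.channel(x)
        return global_out + spatial_out + channel_out    # parallel branches, summed

print(GDLASketch()(torch.randn(1, 64, 128, 128)).shape)   # torch.Size([1, 64, 128, 128])
```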

Review


35 pages, 5184 KiB  
Review
A Survey of Information Extraction Based on Deep Learning
by Yang Yang, Zhilei Wu, Yuexiang Yang, Shuangshuang Lian, Fengjie Guo and Zhiwei Wang
Appl. Sci. 2022, 12(19), 9691; https://doi.org/10.3390/app12199691 - 27 Sep 2022
Cited by 20 | Viewed by 8154
Abstract
As a core task and an important link in the fields of natural language understanding and information retrieval, information extraction (IE) can structure and semanticize unstructured multi-modal information. In recent years, deep learning (DL) has attracted considerable research attention for IE tasks. Deep learning-based entity relation extraction techniques have gradually surpassed traditional feature- and kernel-function-based methods in terms of the depth of feature extraction and model accuracy. In this paper, we explain the basic concepts of IE and DL, primarily expounding on the research progress and achievements of DL technologies in the field of IE. At the task level, we cover three aspects, namely entity relation extraction, event extraction, and multi-modal information extraction, and provide a comparative analysis of the various extraction techniques. We also summarize the prospects and development trends of DL in the field of IE as well as difficulties requiring further study. At the method level, we believe that research can be carried out in the directions of multi-model and multi-task joint extraction, knowledge-enhanced information extraction, and multi-modal information fusion. At the model level, further research should focus on strengthening theoretical foundations, making models lightweight, and improving generalization ability.
