A Short Survey on Deep Learning for Multimodal Integration: Applications, Future Perspectives and Challenges
Abstract
1. Introduction
2. Multimodality: A Definition
1. Dimension 1, Taxonomy: concerns a taxonomical definition of the concept of multimodality.
2. Dimension 2, Fusion: focuses instead on the types of fusion to be applied to the multimodal features.
2.1. Dimension 1: Taxonomy
1. Representation: this concept addresses the need to represent and summarize the input features in a suitable way, irrespective of the differences existing between the modalities. For example, language is often represented symbolically, while visual modalities are often represented as signals.
2. Translation: this point concerns the capability of translating one modality into another. This task is particularly crucial, as it presupposes that a relationship between the modalities exists; if this is not the case, it should be carefully taken into account in the multimodal DL implementation.
3. Alignment: this concept refers to the need to identify the direct relationships between sub-elements of the different modalities of the same input data. In [15], the authors discuss the need to measure the similarity between the modalities used in the experimental setting (a minimal similarity-based sketch of this idea follows this list).
4. Fusion: this research challenge tackles the need to join information from two or more modalities in order to perform a certain prediction task. We expand on this in Section 2.2.
5. Co-learning: this research challenge entails the need to transfer knowledge between the modalities used in the experimental setting. This is said to be particularly relevant [15] for several types of DL approaches, such as zero-shot learning, in which models are tested on examples never seen during training. This research aspect is also particularly relevant when one of the modalities has a limited amount of resources (for example, a limited amount of labelled data). Several deep-learning extensions have been proposed for the co-learning research field; a relevant overview of this concept is reported in [17].
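To make the alignment idea concrete, the following minimal NumPy sketch computes a cosine-similarity matrix between the sub-elements of two modalities and derives a greedy matching from it. The embeddings, their dimensionality and the matching rule are purely hypothetical assumptions for illustration, not taken from any of the surveyed works.

```python
# Minimal similarity-based alignment sketch between sub-elements of two modalities
# (e.g. video frames and words). All values and dimensions are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=0)
frame_emb = rng.normal(size=(5, 32))   # 5 video-frame embeddings in a shared space
word_emb = rng.normal(size=(7, 32))    # 7 word embeddings in the same space

# Cosine similarity between every (frame, word) pair.
frame_norm = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
word_norm = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
similarity = frame_norm @ word_norm.T  # shape (5, 7)

# Greedy alignment: for each frame, pick the most similar word.
alignment = similarity.argmax(axis=1)
print("frame -> word alignment:", alignment)
```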
2.2. Dimension 2: Fusion
1. Early fusion: this fusion happens when the multimodal input data are fused prior to the application of the AI model; the fusion process takes place at the initial step, before the dataset is used as input to the DL algorithm. In other words, the fusion happens directly on the raw data. If a pre-processing feature-extraction step is performed instead of using the raw data, then the merging step is said to be performed at the feature level.
2. Late fusion: in this case, the fusion step takes place after the application of the AI algorithm, and the data of each modality are processed separately. More specifically, such an approach considers the different modalities as single streams. The drawback is that the possible conditional relationships existing among the different modalities are not considered during the learning process.
3. Hybrid fusion: this fusion takes place when the multimodal input data are fused both before and after the application of the relevant AI algorithm. This approach stands in the middle between early and late fusion, taking place halfway through the DNN model considered. It can be particularly suitable when modalities with consistent dimensions need to be merged, as well as when the modalities are of a very different nature and therefore require a first pre-processing step and a later merging procedure during the training process. A minimal code sketch contrasting these three strategies follows this list.
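As a concrete illustration of the three strategies, the following minimal PyTorch sketch contrasts an early-fusion, a late-fusion and a hybrid-fusion classifier for a hypothetical audio-visual task. The layer sizes, feature dimensions and output-averaging rule are assumptions for illustration and do not reproduce any specific model from the surveyed literature.

```python
# Minimal sketch of early, late and hybrid fusion (illustrative assumptions only).
import torch
import torch.nn as nn


class EarlyFusionNet(nn.Module):
    """Early fusion: modalities are concatenated before any learning takes place."""

    def __init__(self, audio_dim=40, visual_dim=512, n_classes=10):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, audio, visual):
        fused = torch.cat([audio, visual], dim=-1)  # fusion on the (raw or extracted) inputs
        return self.classifier(fused)


class LateFusionNet(nn.Module):
    """Late fusion: each modality is a separate stream; only the outputs are merged."""

    def __init__(self, audio_dim=40, visual_dim=512, n_classes=10):
        super().__init__()
        self.audio_stream = nn.Sequential(nn.Linear(audio_dim, 64), nn.ReLU(), nn.Linear(64, n_classes))
        self.visual_stream = nn.Sequential(nn.Linear(visual_dim, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, audio, visual):
        # Cross-modal interactions are never modelled: the two streams only meet here.
        return 0.5 * (self.audio_stream(audio) + self.visual_stream(visual))


class HybridFusionNet(nn.Module):
    """Hybrid fusion: unimodal encoders first, fusion of intermediate features halfway."""

    def __init__(self, audio_dim=40, visual_dim=512, n_classes=10):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 64), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, 64), nn.ReLU())
        self.head = nn.Linear(64 + 64, n_classes)

    def forward(self, audio, visual):
        fused = torch.cat([self.audio_enc(audio), self.visual_enc(visual)], dim=-1)
        return self.head(fused)


if __name__ == "__main__":
    audio, visual = torch.randn(8, 40), torch.randn(8, 512)
    for net in (EarlyFusionNet(), LateFusionNet(), HybridFusionNet()):
        print(type(net).__name__, net(audio, visual).shape)  # each: torch.Size([8, 10])
```

In the late-fusion sketch, the two streams never interact before their outputs are merged, which is precisely the conditional-relationship limitation noted above.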
More specifically, the merging strategies can be of different types:
1. Data-level fusion: this is defined as the case in which the modalities are fused before the learning process. In other words, features are computed independently for each single modality, and the fusion step takes place when they are already in feature-extracted form [20,21]. This is part of the early-fusion approach.
2. Decision-level fusion: in this case, the decision scores are first computed separately for each AI model applied. Subsequently, the individual decisions are fused into one, using, for example, ensembles of methods or majority-voting approaches [22,23]. This is part of the late-fusion approach.
3. Score-level fusion: in this case, the fusion mechanism happens at the level of the probabilistic output scores of the neural networks. It differs from decision-level fusion in that the scores themselves are fused, whereas in the previous case majority voting and ensembling approaches are applied [24]; a single decision is then taken only after the fused scores have been produced. This is part of the late-fusion approach. The numerical sketch after this list contrasts the two.
4. Hybrid-level fusion: in this case, the characteristics of feature-level and decision-level fusion are merged. In particular, a comparison takes place between the decisions taken by single classifiers and the decisions taken by classifiers when the modalities are fused at the feature level. Such approaches therefore rely on the idea of improving the performance of single classifiers through feature engineering [25,26]. This is part of the hybrid-fusion approach.
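The difference between decision-level and score-level fusion can be made explicit with a small numerical example. The following NumPy sketch uses hypothetical class-probability outputs for three modalities, deliberately chosen so that the two strategies disagree; it is not taken from any of the cited systems.

```python
# Decision-level vs. score-level fusion for one sample and three modalities.
# The probability values are hypothetical and chosen for illustration.
import numpy as np

audio_scores = np.array([0.10, 0.60, 0.30])   # per-class probabilities, audio model
visual_scores = np.array([0.45, 0.40, 0.15])  # visual model
text_scores = np.array([0.48, 0.42, 0.10])    # text model
all_scores = np.stack([audio_scores, visual_scores, text_scores])

# Decision-level fusion: each modality first takes its own decision (argmax),
# then the individual decisions are combined by majority voting.
votes = all_scores.argmax(axis=1)                          # [1, 0, 0]
decision_level = np.bincount(votes, minlength=3).argmax()  # class 0 wins the vote

# Score-level fusion: the probabilistic scores are fused first (here by averaging),
# and a single decision is taken only afterwards.
fused_scores = all_scores.mean(axis=0)                     # [0.343, 0.473, 0.183]
score_level = fused_scores.argmax()                        # class 1

print("decision-level fusion:", decision_level, "| score-level fusion:", score_level)
```

Majority voting discards the confidence carried by the individual scores, which is why, in this constructed example, the two strategies reach different decisions.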
3. Multimodality Deep-Learning Applications
3.1. Audio-Visual Multimodal Applications for Speech Recognition
3.2. Multimodal Applications for Sentiment Analysis
3.3. Forensic Applications: Multimodal Deepfake Detection
3.4. Computer Vision and Multimodality: Image Segmentation and Reconstruction
4. Discussion and Conclusions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| ANN | Artificial neural networks |
| MLP | Multi-layer perceptrons |
| DL | Deep learning |
| ML | Machine learning |
| AI | Artificial intelligence |
| CNN | Convolutional neural networks |
| RNN | Recurrent neural network |
References
1. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
2. Cheng, H.D.; Jiang, X.H.; Sun, Y.; Wang, J. Color image segmentation: Advances and prospects. Pattern Recognit. 2001, 34, 2259–2281.
3. Dimitri, G.M.; Spasov, S.; Duggento, A.; Passamonti, L.; Toschi, N. Unsupervised stratification in neuroimaging through deep latent embeddings. In Proceedings of the 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Montreal, QC, Canada, 20–24 July 2020; pp. 1568–1571.
4. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Van Der Laak, J.A.; Van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88.
5. Cicaloni, V.; Spiga, O.; Dimitri, G.M.; Maiocchi, R.; Millucci, L.; Giustarini, D.; Bernardini, G.; Bernini, A.; Marzocchi, B.; Braconi, D.; et al. Interactive alkaptonuria database: Investigating clinical data to improve patient care in a rare disease. FASEB J. 2019, 33, 12696–12703.
6. Iqbal, T.; Qureshi, S. The survey: Text generation models in deep learning. J. King Saud Univ. Comput. Inf. Sci. 2020, 34, 2515–2528.
7. He, X.; Deng, L. Deep learning for image-to-text generation: A technical overview. IEEE Signal Process. Mag. 2017, 34, 109–116.
8. Bianchini, M.; Dimitri, G.M.; Maggini, M.; Scarselli, F. Deep neural networks for structured data. In Computational Intelligence for Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2018; pp. 29–51.
9. Summaira, J.; Li, X.; Shoib, A.M.; Li, S.; Abdul, J. Recent Advances and Trends in Multimodal Deep Learning: A Review. arXiv 2021, arXiv:2105.11087.
10. Van Leeuwen, T. Multimodality. In The Routledge Handbook of Applied Linguistics; Routledge: London, UK, 2011; pp. 668–682.
11. Jewitt, C.; Bezemer, J.; O’Halloran, K. Introducing Multimodality; Routledge: London, UK, 2016.
12. Bateman, J.; Wildfeuer, J.; Hiippala, T. Multimodality: Foundations, Research and Analysis–A Problem-Oriented Introduction; Walter de Gruyter GmbH & Co KG: Berlin, Germany, 2017.
13. Bernsen, N.O. Multimodality theory. In Multimodal User Interfaces; Springer: Berlin/Heidelberg, Germany, 2008; pp. 5–29.
14. Bertelson, P.; De Gelder, B. The psychology of multimodal perception. In Crossmodal Space and Crossmodal Attention; Spence, C., Driver, J., Eds.; Oxford University Press: Oxford, UK, 2004; pp. 141–177.
15. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
16. Gadzicki, K.; Khamsehashari, R.; Zetzsche, C. Early vs late fusion in multimodal convolutional neural networks. In Proceedings of the 2020 IEEE 23rd International Conference on Information Fusion (FUSION), Rustenburg, South Africa, 6–9 July 2020; pp. 1–6.
17. Rahate, A.; Walambe, R.; Ramanna, S.; Kotecha, K. Multimodal co-learning: Challenges, applications with datasets, recent advances and future directions. Inf. Fusion 2022, 81, 203–239.
18. Snoek, C.G.; Worring, M.; Smeulders, A.W. Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, Singapore, 6–11 November 2005; pp. 399–402.
19. D’mello, S.K.; Kory, J. A review and meta-analysis of multimodal affect detection systems. ACM Comput. Surv. 2015, 47, 1–36.
20. Castellano, G.; Kessous, L.; Caridakis, G. Emotion recognition through multiple modalities: Face, body gesture, speech. In Affect and Emotion in Human-Computer Interaction; Springer: Berlin/Heidelberg, Germany, 2008; pp. 92–103.
21. D’mello, S.K.; Graesser, A. Multimodal semi-automated affect detection from conversational cues, gross body language, and facial features. User Model. User Adapt. Interact. 2010, 20, 147–187.
22. Kanluan, I.; Grimm, M.; Kroschel, K. Audio-visual emotion recognition using an emotion space concept. In Proceedings of the 2008 16th European Signal Processing Conference, Lausanne, Switzerland, 25–29 August 2008; pp. 1–5.
23. Salur, M.U.; Aydın, İ. A soft voting ensemble learning-based approach for multimodal sentiment analysis. Neural Comput. Appl. 2022, 34, 18391–18406.
24. Aizi, K.; Ouslim, M. Score level fusion in multi-biometric identification based on zones of interest. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 1498–1509.
25. Mansoorizadeh, M.; Moghaddam Charkari, N. Multimodal information fusion application to human emotion recognition from face and speech. Multimed. Tools Appl. 2010, 49, 277–297.
26. Chetty, G.; Wagner, M.; Goecke, R. A multilevel fusion approach for audiovisual emotion recognition. Emot. Recognit. Pattern Anal. Approach 2015, 2015, 437–460.
27. Metallinou, A.; Wollmer, M.; Katsamanis, A.; Eyben, F.; Schuller, B.; Narayanan, S. Context-sensitive learning for enhanced audiovisual emotion classification. IEEE Trans. Affect. Comput. 2012, 3, 184–198.
28. Giacobe, N.A. Application of the JDL data fusion process model for cyber security. In Proceedings of the Multisensor, Multisource Information Fusion: Architectures, Algorithms, and Applications 2010, SPIE, Orlando, FL, USA, 7–8 April 2010; Volume 7710, pp. 209–218.
29. McGurk, H.; MacDonald, J. Hearing lips and seeing voices. Nature 1976, 264, 746–748.
30. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal deep learning. In Proceedings of the ICML, Bellevue, WA, USA, 28 June–2 July 2011.
31. Srivastava, N.; Salakhutdinov, R.R. Multimodal learning with deep Boltzmann machines. Adv. Neural Inf. Process. Syst. 2012, 25.
32. Fukui, A.; Park, D.H.; Yang, D.; Rohrbach, A.; Darrell, T.; Rohrbach, M. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv 2016, arXiv:1606.01847.
33. Makino, T.; Liao, H.; Assael, Y.; Shillingford, B.; Garcia, B.; Braga, O.; Siohan, O. Recurrent neural network transducer for audio-visual speech recognition. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 905–912.
34. Petridis, S.; Stafylakis, T.; Ma, P.; Tzimiropoulos, G.; Pantic, M. Audio-visual speech recognition with a hybrid CTC/attention architecture. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 513–520.
35. Zhou, P.; Yang, W.; Chen, W.; Wang, Y.; Jia, J. Modality attention for end-to-end audio-visual speech recognition. In Proceedings of the ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6565–6569.
36. Tao, F.; Busso, C. Gating neural network for large vocabulary audiovisual speech recognition. IEEE ACM Trans. Audio Speech Lang. Process. 2018, 26, 1290–1302.
37. Petridis, S.; Stafylakis, T.; Ma, P.; Cai, F.; Tzimiropoulos, G.; Pantic, M. End-to-end audiovisual speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 6548–6552.
38. Ranganathan, H.; Chakraborty, S.; Panchanathan, S. Multimodal emotion recognition using deep learning architectures. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–9.
39. Pan, Z.; Luo, Z.; Yang, J.; Li, H. Multi-modal attention for speech emotion recognition. arXiv 2020, arXiv:2009.04107.
40. Khare, A.; Parthasarathy, S.; Sundaram, S. Self-supervised learning with cross-modal transformers for emotion recognition. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Virtual, 19–22 January 2021; pp. 381–388.
41. Liu, G.; Tan, Z. Research on multi-modal music emotion classification based on audio and lyric. In Proceedings of the 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 12–14 June 2020; Volume 1, pp. 2331–2335.
42. Cambria, E.; Hazarika, D.; Poria, S.; Hussain, A.; Subramanyam, R. Benchmarking multimodal sentiment analysis. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing, Budapest, Hungary, 17–23 April 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 166–179.
43. Lee, J.H.; Kim, H.J.; Cheong, Y.G. A multi-modal approach for emotion recognition of TV drama characters using image and text. In Proceedings of the 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), Busan, Korea, 19–22 February 2020; pp. 420–424.
44. Ortega, J.D.; Senoussaoui, M.; Granger, E.; Pedersoli, M.; Cardinal, P.; Koerich, A.L. Multimodal fusion with deep neural networks for audio-video emotion recognition. arXiv 2019, arXiv:1907.03196.
45. Dhaouadi, S.; Khelifa, M.M.B. A multimodal physiological-based stress recognition: Deep Learning models’ evaluation in gamers’ monitoring application. In Proceedings of the 2020 5th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Sousse, Tunisia, 2–5 September 2020; pp. 1–6.
46. Bizzego, A.; Gabrieli, G.; Esposito, G. Deep neural networks and transfer learning on a multivariate physiological signal dataset. Bioengineering 2021, 8, 35.
47. Ray, A.; Mishra, S.; Nunna, A.; Bhattacharyya, P. A Multimodal Corpus for Emotion Recognition in Sarcasm. arXiv 2022, arXiv:2206.02119.
48. Lomnitz, M.; Hampel-Arias, Z.; Sandesara, V.; Hu, S. Multimodal Approach for DeepFake Detection. In Proceedings of the 2020 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA, 13–15 October 2020.
49. Lewis, J.K.; Toubal, I.E.; Chen, H.; Sandesera, V.; Lomnitz, M.; Hampel-Arias, Z.; Prasad, C.; Palaniappan, K. Deepfake video detection based on spatial, spectral, and temporal inconsistencies using multimodal deep learning. In Proceedings of the 2020 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA, 13–15 October 2020.
50. Mittal, T.; Bhattacharya, U.; Chandra, R.; Bera, A.; Manocha, D. Emotions don’t lie: An audio-visual deepfake detection method using affective cues. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2823–2832.
51. Khalid, H.; Kim, M.; Tariq, S.; Woo, S.S. Evaluation of an audio-video multimodal deepfake dataset using unimodal and multimodal detectors. In Proceedings of the 1st Workshop on Synthetic Multimedia-Audiovisual Deepfake Generation and Detection, Virtual, 24 October 2021; pp. 7–15.
52. Cai, Z.; Stefanov, K.; Dhall, A.; Hayat, M. Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization. arXiv 2022, arXiv:2204.06228.
53. Zhang, W.; Wu, Y.; Yang, B.; Hu, S.; Wu, L.; Dhelim, S. Overview of multi-modal brain tumor MR image segmentation. Healthcare 2021, 9, 1051.
54. Pemasiri, A.; Nguyen, K.; Sridharan, S.; Fookes, C. Multi-modal semantic image segmentation. Comput. Vis. Image Underst. 2021, 202, 103085.
55. Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Glaeser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1341–1360.
56. Hong, D.; Yao, J.; Meng, D.; Xu, Z.; Chanussot, J. Multimodal GANs: Toward crossmodal hyperspectral–multispectral image segmentation. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5103–5113.
57. Dimitri, G.M.; Spasov, S.; Duggento, A.; Passamonti, L.; Lió, P.; Toschi, N. Multimodal and multicontrast image fusion via deep generative models. Inf. Fusion 2022, 88, 146–160.
58. Falvo, A.; Comminiello, D.; Scardapane, S.; Scarpiniti, M.; Uncini, A. A multimodal deep network for the reconstruction of T2W MR images. In Progresses in Artificial Intelligence and Neural Systems; Springer: Berlin/Heidelberg, Germany, 2021; pp. 423–431.
59. Abdullah, S.M.S.A.; Ameen, S.Y.A.; Sadeeq, M.A.; Zeebaree, S. Multimodal emotion recognition using deep learning. J. Appl. Sci. Technol. Trends 2021, 2, 52–58.
60. Parry, J.; Palaz, D.; Clarke, G.; Lecomte, P.; Mead, R.; Berger, M.; Hofer, G. Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 1656–1660.
61. Park, C.Y.; Cha, N.; Kang, S.; Kim, A.; Khandoker, A.H.; Hadjileontiadis, L.; Oh, A.; Jeong, Y.; Lee, U. K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations. Sci. Data 2020, 7, 1–16.
62. Mirsky, Y.; Lee, W. The creation and detection of deepfakes: A survey. ACM Comput. Surv. 2021, 54, 1–41.
63. Agarwal, S.; Farid, H.; Gu, Y.; He, M.; Nagano, K.; Li, H. Protecting World Leaders Against Deep Fakes. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 16–20 June 2019; Volume 1, p. 38.
64. Amerini, I.; Galteri, L.; Caldelli, R.; Del Bimbo, A. Deepfake video detection through optical flow based CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Virtual, 11–17 October 2019.
65. Khalid, H.; Tariq, S.; Kim, M.; Woo, S.S. FakeAVCeleb: A novel audio-video multimodal deepfake dataset. arXiv 2021, arXiv:2108.05080.
66. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2021; pp. 8821–8831.
67. Qiao, T.; Zhang, J.; Xu, D.; Tao, D. MirrorGAN: Learning text-to-image generation by redescription. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1505–1514.
68. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
69. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
70. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
71. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
72. Abady, L.; Dimitri, G.; Barni, M. Detection and Localization of GAN Manipulated Multi-spectral Satellite Images. In Proceedings of the ESANN 2022 Proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium and Online Event, 5–7 October 2022.
73. Yuan, K.; Zhuang, X.; Schaefer, G.; Feng, J.; Guan, L.; Fang, H. Deep-learning-based multispectral satellite image segmentation for water body detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 7422–7434.
74. Abbessi, R.; Verrier, N.; Taddese, A.M.; Laroche, S.; Debailleul, M.; Lo, M.; Courbot, J.B.; Haeberlé, O. Multimodal image reconstruction from tomographic diffraction microscopy data. J. Microsc. 2022; online ahead of print.
75. Filipović, M.; Barat, E.; Dautremer, T.; Comtat, C.; Stute, S. PET reconstruction of the posterior image probability, including multimodal images. IEEE Trans. Med. Imaging 2018, 38, 1643–1654.
| Application | Model | Multi-Modal Features | Reference |
|---|---|---|---|
| Speech Recognition | HMM | visual + audio | McGurk et al. [29] |
| | Restricted Boltzmann machines | audio + video | Ngiam et al. [30] |
| | Deep Boltzmann Machines | image + text | Srivastava et al. [31] |
| | Multimodal Compact Bilinear pooling (MCB) | visual + text | Fukui et al. [32] |
| | RNN | audio + visual | Makino et al. [33] |
| | Attention-based DL | audio + visual | Petridis et al. [34] |
| | Seq2Seq | audio + visual | Zhou et al. [35] |
| | DL with gating layer | audio + visual | Tao et al. [36] |
| | Bidirectional Gated Recurrent Units | audio + visual | Petridis et al. [37] |
| Sentiment Analysis | Convolutional Deep Belief Network | face + body gesture + voice + physiological signals | Ranganathan et al. [38] |
| | cLSTM-MMA | visual + textual | Pan et al. [39] |
| | Transformers | audio + visual + text | Khare et al. [40] |
| | LSTM | lyric + audio | Liu et al. [41] |
| | CNN | visual + text + audio | Cambria et al. [42] |
| | CNN | image + text | Lee et al. [43] |
| | FC | audio + video | Ortega et al. [44] |
| | LSTM and FC | physiological | Dhaouadi et al. [45] |
| | FC | physiological | Bizzego et al. [46] |
| | FC | text + audio + video | Ray et al. [47] |
| Forensic Applications | MLP, SincNet, Xception | audio + video | Lomnitz et al. [48] |
| | LSTM + MLP | multimodal spectral features | Lewis et al. [49] |
| | Siamese Networks | audio + video | Mittal et al. [50] |
| | Xception, VGG, SincNet | video + audio | Khalid et al. [51] |
| | CNN-based | video + audio | Cai et al. [52] |
| Computer Vision | multiple DNN | brain MRI scans | Zhang et al. [53] |
| | Mask R-CNN | visible, X-ray, thermal, infrared radiation | Pemasiri et al. [54] |
| | multiple DL | several sensors | Feng et al. [55] |
| | GAN | multispectral | Hong et al. [56] |
| | CNN separable convolutions | brain MRI T1 + MRI T2 + CT | Dimitri et al. [57] |
| | Multi-Modal U-Net | T2W and FLAIR brain scans | Falvo et al. [58] |