Sensors
  • Article
  • Open Access

21 February 2023

Effective Techniques for Multimodal Data Fusion: A Comparative Analysis

1 Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa Street 75, 00-662 Warsaw, Poland
2 WeSub, Adama Branickiego Street 17, 02-972 Warsaw, Poland
3 Faculty of Management, Warsaw University of Technology, Narbutta Street 85, 02-524 Warsaw, Poland
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Smart and Predictive Strategies in Data Fusion

Abstract

Data processing in robotics is currently challenged by the effective building of multimodal and common representations. Tremendous volumes of raw data are available and their smart management is the core concept of multimodal learning in a new paradigm for data fusion. Although several techniques for building multimodal representations have been proven successful, they have not yet been analyzed and compared in a given production setting. This paper explored three of the most common techniques, (1) the late fusion, (2) the early fusion, and (3) the sketch, and compared them in classification tasks. Our paper explored different types of data (modalities) that could be gathered by sensors serving a wide range of sensor applications. Our experiments were conducted on Amazon Reviews, MovieLens25M, and MovieLens1M datasets. Their outcomes allowed us to confirm that the choice of fusion technique for building multimodal representation is crucial to obtain the highest possible model performance resulting from the proper modality combination. Consequently, we designed criteria for choosing this optimal data fusion technique.

1. Introduction

The emergence of our hyper-connected and hyper-digitalized world (IoT, ubiquitous sensing, etc.) requires any education organization to have the ability to handle a system that produces huge amounts of different data. A key area of research in multimodal data is the process of building multimodal representations, the quality of which determines the modeling and prediction of organizational learning.
Multimodal learning involves working with data that contain multiple modalities, such as text, images, audio, and numerical or behavioral data. Interest in multimodal learning began in the 1980s and has grown since then. In one of the first papers on the subject [1], the authors demonstrated that acoustic and visual data could be successfully combined in a speech recognition sensor system. By integrating both modalities, they outperformed the audio-only model.
Further research has shown that modalities can be complementary and a carefully chosen data fusion leads to significant model performance improvements [2,3,4]. Furthermore, modalities might naturally coexist, and the analyzed machine learning (ML) problem cannot be solved unless all modalities are considered [5]. Finally, recent improvements in multisensor solutions [6] have offered high-quality modalities, which require an appropriate fusion technique to fit to the ML problem.
Choosing, or even designing, the right data fusion technique is fundamental to determining the quality and performance of data fusion strategies to be undertaken by data scientist teams in organizations. Moreover, data fusion strategies are the central concept of multimodal learning. This concept could be applicable to numerous machine learning tasks with all kinds of modalities (a so-called universal technique). According to [4], a universal approach has not been established yet. The research work conducted by the authors shows that all current state-of-the-art data fusion models suffer from at least one of two main issues: they are either task-specific or too complicated, or they lack interpretability and flexibility.
Our paper explores different types of data (modalities) that can be gathered by sensors serving a wide range of sensor applications. Data processing in robotics is currently challenged by addressing the urgent research problem: how can we effectively build multimodal and common representation in sensor systems? Consequently, this topic impacts the effectiveness of machine learning modeling for sensor networks.
We believe that the development of a universal technique requires analysis and many research experiments to understand the differences between the existing and most often used data fusion techniques. We claim that effective multimodal data fusion depends on the type of technique used to build this representation. We further develop decision criteria that define this dependency. We examine the most common techniques: late fusion, early fusion, and sketch. In the first technique, late fusion, all the modalities are learned independently and are combined right before the model makes a decision. The second technique, early fusion, combines all modalities first and then trains a single model on the combined input. The last one, sketch, is similar to early fusion; however, modalities are transformed into a common space instead of being concatenated. We perform the comparison in classification tasks. Finally, we define criteria and recommendations for choosing a multimodal data fusion technique. The chosen criteria comprise (1) the impact each modality has on the analyzed machine learning (ML) problem, (2) the type of the task, and (3) the amount of memory used during the training and prediction phases. We conduct experiments on three datasets, Amazon Reviews [7], MovieLens25M, and MovieLens1M [8]. Together they encapsulate three modalities (textual, visual, and graph data), each of which can bring unique information to the ML model.
In our paper, we contribute to the contemporary discussion on how to investigate the application of multimodal data fusion techniques in different production settings. We confirm that the choice of data fusion technique is crucial for maximizing the performance of the multimodal model. Consequently, we show conditions under which the combination of modalities outperforms any one of them used separately. Finally, we design decision criteria for choosing the optimal data fusion technique in classification tasks.
The remainder of the paper is organized as follows. Section 2 states the problems of multimodality, and Section 3 presents the employed datasets. Section 4 and Section 5 describe the experiments with the chosen multimodal fusion techniques and their results. We provide conclusions in Section 6 and identify areas for future research in Section 7.

3. Datasets

To evaluate different multimodal representations, we decided to compare them in classification problems, where we can quickly assess the amount of information each modality brought. Furthermore, we wanted to analyze production settings where the multimodality is expected to benefit the model, i.e., a model based on several modalities should obtain better performance than the unimodal model.
We found three datasets—Amazon Reviews, MovieLens25M, and MovieLens1M—applicable to our research design. They encapsulate three modality types in total: textual, visual, and graphs. Furthermore, in these datasets, each modality should be highly informative, e.g., it should be possible to classify a single product using any of its characteristics, i.e., the description, the title, or the image.
Table 2 presents the overall characteristics of the three used datasets (Amazon Reviews, MovieLens25M, and MovieLens1M) and the employed machine learning tasks. In our experiments, Amazon Reviews and MovieLens25M were split at 0.6 / 0.2 / 0.2 into the training, validation, and test sets, respectively. With such a division, the results on test sets are representative (12,000 samples) while maintaining enough observations for training. The split of 0.8 / 0.2 (training and test sets, respectively) was used for MovieLens1M. Instead of one predefined validation set, k-fold cross-validation (on the training set) was used for MovieLens1M. We opted for cross-validation because this dataset was much smaller than the other two.
Table 2. Dataset characteristics.
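The 0.6 / 0.2 / 0.2 split described above can be illustrated with a minimal sketch. It assumes the Amazon Reviews subset holds 60,000 products (12 categories of 5000 sampled products each); the helper name `split_indices`, the seed, and the plain random shuffle are illustrative assumptions, not the procedure used in the experiments.

```python
import random

def split_indices(n_samples, fractions, seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into consecutive parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for fraction in fractions[:-1]:
        stop = start + round(n_samples * fraction)
        parts.append(indices[start:stop])
        start = stop
    parts.append(indices[start:])  # the last part takes the remainder
    return parts

# Assumed size: 12 categories x 5000 sampled products.
n_products = 12 * 5000
train, val, test = split_indices(n_products, (0.6, 0.2, 0.2))
print(len(train), len(val), len(test))
```

Under these assumptions the test part contains 12,000 samples, which matches the figure quoted in the text.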
The Amazon Reviews dataset consists of products from Amazon. The data used in experiments are only a small subset of the original dataset (jmcauley.ucsd.edu/data/amazon/, accessed on 15 January 2021), i.e., the “Clothing, Shoes, and Jewellery” category. Namely, we randomly sampled 5000 products for each of the following categories: Headdress, Boots, Jewellery, Wallet, Shirt, Dress, Underwear, Pants, Watch, Jacket, Sweater, and Luggage. Intuitively, products within this category should be divergent enough. Furthermore, products within the fashion field can be expected to contain meaningful descriptions, titles, and, most importantly, images. The dataset extracted from Amazon Reviews is characterized by missing modalities. Around 25% of descriptions are missing and 3% of titles are empty; however, all the records contain images.
The MovieLens25M dataset is the largest of the MovieLens family (https://grouplens.org/datasets/movielens/, accessed on 7 February 2021). It consists of 25 million ratings given to 62,000 movies by 162,000 users. Each movie is associated with genres and with one IMDb identifier. The metadata (the movie’s poster and plot, used for the visual and textual embeddings, respectively) were collected using this identifier. Some movies were assigned incorrect IMDb identifiers and were therefore removed. After preprocessing, the dataset in our experiment consisted of 60,763 unique movies. The available modalities are the movie poster (image), the movie plot (text), and views (graph).
Some modalities are missing: every movie has a plot, but 63 movies lack a poster and 3231 movies were not rated by any user in MovieLens25M. Missing modalities are proportional to class sizes, so the largest number of missing modalities is in the Drama class. Here, classes are highly imbalanced. Around 40% of all observations are dramas and 27% are comedies, whereas film-noir and IMAX movies are associated with less than 1% of observations. There is also a high number of movies that are not assigned to any specific category, the (no genres listed) class.
MovieLens1M is a smaller version of the previously described dataset and consists of 6040 users that have rated approximately 4000 movies. However, in addition to user ratings, their demographic data are also available. This allows for investigating dependencies between watched content and user characteristics such as gender, age, or occupation. After preprocessing for our experiment, only two modalities are utilized: textual (the plot) and visual (the poster). Again, the data were downloaded based on IMDb identifiers associated with movies. There are no missing modalities. Classes are imbalanced; around 71.7% of users are male. This imbalance is taken into account during the evaluation by using the MCC metric.

4. Research Design and Experiments

We performed three experiments with three classification tasks: multiclass, multilabel, and binary (see Table 2). The type of task was determined by the characteristics of the prepared dataset. The methods used to carry out the tests are described for each experiment (for textual embeddings, refer to https://github.com/flairNLP/flair, accessed on 15 January 2021).

4.1. Multiclass Classification—Amazon Reviews Dataset

Analyses on the Amazon Reviews Dataset consider the multiclass classification task based on two modalities—textual and visual. Classes are evenly distributed, and therefore the accuracy metric is used. The primary purpose of the experiments on this dataset is to compare multimodal fusion techniques based on neural networks:
  • The late fusion, where modalities are treated independently. Here, models’ architectures are based on [26];
  • Early fusion models, where embeddings of each modality are concatenated at the input level;
  • The sketch, including the classical and binarized versions.
Their overall architectures in our research design are visible in Figure 2. These approaches use Adam as the optimizer and a categorical cross-entropy loss function. The batch size was set to 32. All models were trained for ten epochs, except for the binarized sketch models, which were trained for 20 epochs. The learning rate differs per approach and equals $10^{-4}$, $10^{-3}$, $10^{-5}$, and $10^{-4}$ in the late fusion, the early fusion, the classical, and the binarized sketch approach, respectively.
Figure 2. Here, we present the architectures used in the Amazon Reviews dataset. They are depicted in their trimodal forms. Bimodal and unimodal architectures look the same. The only difference is that unimodal models do not have a model on top of their outputs in the late fusion approach. Above, the output means the probabilities of observations belonging to respective classes.

4.1.1. Late Fusion Models

Adapting the ideas from [26] from multilabel to multiclass classification requires slight modifications, i.e., a different loss function and a different number of neurons within layers. Both the description and title are represented with GloVe. Image embeddings are created with ResNet50. All multimodal models are identical. The outputs from unimodal models were fed into a dense layer, with 20 units and ReLU as the activation function. Finally, a 12-unit dense layer for product classification was added. Therefore, the only difference between bimodal and trimodal models is the input size, which was equal to 24 and 36, respectively (two or three concatenated 12-class outputs).
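The fusion model placed on top of the unimodal outputs can be sketched as a forward pass for the bimodal case. This is a minimal illustration with untrained random weights, assuming each unimodal model emits a 12-class probability vector and that the final 12-unit layer uses softmax; the helper names (`dense`, `random_layer`, `late_fusion_head`) are hypothetical and do not come from [26].

```python
import math
import random

N_CLASSES = 12  # product categories
HIDDEN = 20     # units in the fusion layer

def dense(inputs, weights, biases):
    """Fully connected layer: one weight row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(n_out, n_in, rng):
    """Untrained placeholder weights; a real model would learn these."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def late_fusion_head(unimodal_outputs, layers):
    # Concatenate the per-modality class probabilities, then classify again.
    fused = [p for output in unimodal_outputs for p in output]
    (w1, b1), (w2, b2) = layers
    return softmax(dense(relu(dense(fused, w1, b1)), w2, b2))

rng = random.Random(0)
# Stand-ins for the outputs of a title model and an image model.
title_probs = softmax([rng.gauss(0, 1) for _ in range(N_CLASSES)])
image_probs = softmax([rng.gauss(0, 1) for _ in range(N_CLASSES)])
layers = [random_layer(HIDDEN, 2 * N_CLASSES, rng),  # bimodal input size: 24
          random_layer(N_CLASSES, HIDDEN, rng)]
fused_probs = late_fusion_head([title_probs, image_probs], layers)
print(len(fused_probs), round(sum(fused_probs), 6))
```

Because the unimodal models are trained separately, only this small head has to be fitted when modalities are combined.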

4.1.2. Early Fusion

Here, each modality is first transformed into a numerical vector. BERT (bert-base-uncased) is used for both the titles and descriptions, resulting in 768-dimensional vectors. For image embeddings, ResNet50 was used with its last layer removed, which results in 2048-dimensional vectors. In this approach, models have the same architecture, whether unimodal or multimodal. The only difference is that in the multimodal case the appropriate embeddings are concatenated, resulting in input embedding sizes of 1536 (textual + textual), 2816 (textual + image), and 3584 (trimodal). The model architecture consists of three blocks, each comprising a dense layer with 64 units and ReLU as the activation function, followed by a dropout layer with the rate set to 0.1. Subsequently, one more dense layer with 64 units and ReLU is added, this time without dropout. Finally, there is a 12-unit dense layer with softmax, which is responsible for classification.
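The input sizes quoted above follow directly from concatenating the per-modality embeddings. The short sketch below reproduces that arithmetic with placeholder vectors; the function name `early_fusion` is hypothetical, and real embeddings would come from BERT and ResNet50 rather than zero vectors.

```python
TEXT_DIM = 768    # bert-base-uncased embedding size
IMAGE_DIM = 2048  # ResNet50 output after removing the last layer

def early_fusion(*embeddings):
    """Concatenate per-modality embeddings into one input vector."""
    fused = []
    for embedding in embeddings:
        fused.extend(embedding)
    return fused

# Placeholder vectors standing in for real embeddings.
title = [0.0] * TEXT_DIM
description = [0.0] * TEXT_DIM
image = [0.0] * IMAGE_DIM

sizes = {
    "title + description": len(early_fusion(title, description)),
    "title + image": len(early_fusion(title, image)),
    "title + description + image": len(early_fusion(title, description, image)),
}
print(sizes)
```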

4.1.3. Sketch Representation

Similar to the early fusion approach, BERT is used for the textual and ResNet50 for the visual data. Firstly, embeddings were transformed into sketches. The sketch depth was equal to 128 and the width was equal to 512. Then, sketches were flattened and concatenated in multimodal cases. The architecture is the same regardless of the number of modalities. The model starts with two identical blocks consisting of a dense layer, a dropout layer, and batch normalization. The dense layer in the first block consists of 1024 neurons and the second consists of 512 neurons. After the second block, there is a 128-dimensional dense layer, followed by a dropout layer. These dense layers use ReLU, and the dropout rate is set to 0.2. Finally, a 12-dimensional dense layer with softmax is responsible for classification. The binarized version uses the same architecture to provide a fair comparison with the classical sketch.
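One way to picture a sketch of a given depth and width is as `depth` independent hash partitions, each assigning the embedding to one of `width` buckets. The example below uses plain random-hyperplane hashing as a stand-in, which is an assumption made for illustration; the encoder actually used in the experiments may differ. All names (`make_hyperplanes`, `sketch`, `flatten`) and the small toy sizes are hypothetical.

```python
import random

def make_hyperplanes(dim, depth, width, seed=0):
    """Draw random hyperplanes for `depth` partitions of `width` buckets each."""
    bits = width.bit_length() - 1
    if 2 ** bits != width:
        raise ValueError("width must be a power of two in this sketch")
    rng = random.Random(seed)
    return [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
            for _ in range(depth)]

def sketch(vector, hyperplanes, width):
    """Return a depth x width matrix with a single 1 per row."""
    rows = []
    for partition in hyperplanes:
        bucket = 0
        for plane in partition:
            # The side of each hyperplane contributes one bit of the bucket id.
            side = sum(p * v for p, v in zip(plane, vector)) >= 0
            bucket = (bucket << 1) | int(side)
        row = [0] * width
        row[bucket] = 1
        rows.append(row)
    return rows

def flatten(matrix):
    return [value for row in matrix for value in row]

# Small sizes for a quick run; the text uses depth 128 and width 512.
DIM, DEPTH, WIDTH = 16, 4, 8
planes = make_hyperplanes(DIM, DEPTH, WIDTH)
rng = random.Random(1)
# For simplicity, both toy modalities share one embedding size here.
text_vec = [rng.gauss(0, 1) for _ in range(DIM)]
image_vec = [rng.gauss(0, 1) for _ in range(DIM)]
text_sketch = sketch(text_vec, planes, WIDTH)
image_sketch = sketch(image_vec, planes, WIDTH)
# Multimodal case: flatten each sketch and concatenate.
multimodal_input = flatten(text_sketch) + flatten(image_sketch)
print(len(multimodal_input))
```

With the sizes from the text, each flattened sketch would have 128 × 512 = 65,536 entries regardless of the embedding size of the underlying modality.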

4.2. Multilabel Classification—MovieLens25M

The task on the MovieLens25M dataset is multilabel genre classification; its classes are highly imbalanced and correlated. Therefore, threshold-independent metrics are used for evaluation: the area under curve (AUC) and the mean average precision (mAP). Such an approach makes analyzed solutions independent from class threshold selection at testing time. Furthermore, to take the class imbalance into account, micro versions of these metrics are used: micro-AUC and micro-mAP.
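Micro-averaging pools every (movie, genre) pair into a single binary problem before the metric is computed, so frequent genres contribute more pairs than rare ones. The sketch below shows this on toy data, assuming the usual rank-based definition of AUC and uninterpolated average precision; the function names and the data are illustrative, and the pairwise AUC loop is only suitable for small inputs.

```python
def micro_flatten(y_true, y_score):
    """Pool every (sample, class) pair into one binary problem."""
    labels = [label for row in y_true for label in row]
    scores = [score for row in y_score for score in row]
    return labels, scores

def roc_auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def average_precision(labels, scores):
    """Mean of precision@k over the ranks k at which a positive appears."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy data: rows are 3 movies, columns are 3 genres.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
y_score = [[0.9, 0.2, 0.1], [0.8, 0.4, 0.3], [0.7, 0.1, 0.6]]
labels, scores = micro_flatten(y_true, y_score)
micro_auc = roc_auc(labels, scores)
micro_map = average_precision(labels, scores)
print(round(micro_auc, 4), round(micro_map, 4))  # 0.9 0.8875
```

Neither function needs a decision threshold, which is the property the text relies on.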
The sketch and the early fusion representations were considered for a comparison. They both exploit the same architecture, which is described in the following section. Figure 3 illustrates both of the approaches.
Figure 3. Architectures used in the MovieLens25M dataset.
MovieLens25M encapsulates three modalities: graph, text, and images. The first one is encoded with Cleora. Cleora’s hyperparameters are set as follows: the embedding size was 1024 and the number of iterations was 1. According to [27], such parameters are optimal in preserving strong similarities between entities. Text data are represented with BERT (bert-base-cased) and visual data with ResNet50. Both sketch hyperparameters (depth and width) were set to 128.
The models consist of four dense layers. The first three use ReLU and have 1024, 512, and 128 neurons, respectively. The final, 20-dimensional layer is used for multilabel classification and therefore uses sigmoid. Models use the binary cross-entropy loss function and the Adam optimizer with a learning rate equal to $10^{-5}$; its other parameters are left at their defaults. Each model was trained for 20 epochs, apart from multimodal early fusion models and unimodal graph models, which were trained for 30 epochs. The batch size was set to 64. To obtain robust results, every model was trained five times using the training and validation data and evaluated on the test data.
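The combination of a sigmoid output layer and binary cross-entropy treats each genre as an independent yes/no decision, which is what allows one movie to carry several labels. A minimal sketch of that loss for a single sample follows; the function names, the clipping constant `eps`, and the toy logits are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def binary_cross_entropy(targets, logits, eps=1e-12):
    """Mean binary cross-entropy over the labels of one sample."""
    total = 0.0
    for t, z in zip(targets, logits):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)

# Toy sample with three genres, two of which apply.
targets = [1, 0, 1]
uninformed = binary_cross_entropy(targets, [0.0, 0.0, 0.0])  # every p = 0.5
confident = binary_cross_entropy(targets, [4.0, -4.0, 4.0])  # mostly correct
print(round(uninformed, 6), round(confident, 6))
```

With all logits at zero every genre gets probability 0.5, so the loss equals ln 2; confident correct predictions push it towards zero.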

4.3. Binary Classification—MovieLens1M

The task on MovieLens1M is a binary classification, predicting people’s gender based on their behavioral data (here, movies they had watched).
In the sketch representation, every movie is represented with the sketch and the user as a sum of such sketches, see Figure 4. However, unlike the recommender scenario, the output is a number between 0 and 1, representing the probability that the user is a male or a female.
Figure 4. The architecture of gender classification. The user is represented as a sum of sketches. Then, a width-wise L2 normalization is performed. Finally, the sketch is flattened and fed into the classifier.
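The three steps named in the caption (sum, width-wise L2 normalization, flatten) can be sketched as follows. The example assumes that "width-wise" means each depth row is normalized along its width axis; the function name `user_sketch` and the toy sketches are hypothetical.

```python
import math

def user_sketch(movie_sketches):
    """Sum depth x width movie sketches, L2-normalize each row, then flatten."""
    depth = len(movie_sketches[0])
    width = len(movie_sketches[0][0])
    summed = [[sum(s[d][w] for s in movie_sketches) for w in range(width)]
              for d in range(depth)]
    flat = []
    for row in summed:
        norm = math.sqrt(sum(v * v for v in row))
        # Rows that are all zero stay zero instead of dividing by zero.
        flat.extend(v / norm if norm > 0 else 0.0 for v in row)
    return flat

# Two toy movie sketches with depth 2 and width 3.
movie_a = [[1, 0, 0], [0, 1, 0]]
movie_b = [[1, 0, 0], [0, 0, 1]]
features = user_sketch([movie_a, movie_b])
print([round(v, 4) for v in features])  # [1.0, 0.0, 0.0, 0.0, 0.7071, 0.7071]
```

Normalizing each row keeps users who watched many movies on the same scale as users who watched few.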
Every user has rated (watched) at least 20 movies. Movies are represented with the plot and the poster, as in MovieLens25M. Then, users are represented by the content they watch. Once a user sketch is created, it is flattened and fed into logistic regression with the regularization parameter C set to 0.1. The sketch width was equal to 512, while the depth was adjusted to the modality type, equal to 210 for the movie plot and 128 for the movie poster. In the bimodal model, the flattened sketches are concatenated. As a result of the class imbalance, the Matthews correlation coefficient (MCC) is used for evaluation. Movie plots are embedded with BERT (bert-base-cased) and posters are embedded with ResNet50.
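To see why MCC suits an imbalanced binary task, consider a toy set of ten users of whom seven belong to the majority class. The sketch below computes MCC from the confusion matrix; returning 0 when the denominator vanishes is a convention adopted here, and the labels and predictions are invented for illustration.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention used here when a row or column is empty
    return (tp * tn - fp * fn) / denominator

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1] * 7 + [0] * 3
always_majority = [1] * 10                # ignores the input entirely
informed = [1] * 6 + [0] + [0, 0, 1]      # one mistake in the majority class, one in the minority
print(accuracy(y_true, always_majority), matthews_corrcoef(y_true, always_majority))
print(accuracy(y_true, informed), round(matthews_corrcoef(y_true, informed), 4))
```

The majority-class predictor reaches 0.7 accuracy but an MCC of 0, while the informed predictor scores 11/21 ≈ 0.524, so MCC separates the two far more clearly than accuracy does.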

5. Results

Table 3, Table 4 and Table 5 show our results for the different tasks and datasets. We can see that only in one dataset, Amazon Reviews, do all the used modalities improve the final results. The tests on the MovieLens datasets show that posters reduce performance; intuitively, posters do not carry much substantial information about movies.
Table 3. Amazon Reviews dataset—results.
Table 4. MovieLens25M—results.
Table 5. MovieLens1M—gender classification results.
Since each modality’s impact on the model performance is not so clear in the Amazon Reviews dataset, we decided to explore them thoroughly concerning each category, see Figure 5. The analyzed results were obtained by the late fusion approach, which achieved the best results.
Figure 5. The late fusion models results concerning categories.
The bimodal and trimodal models outperform unimodal models in every category (accuracy metric). The most significant benefit from exploiting multiple modalities can be noticed within classes where unimodal models yield similar results, for instance, the Wallet class. On the other hand, categories where one of the unimodal approaches is clearly better do not gain much from multimodality, e.g., the Watch class.
Surprisingly, despite the absence of around 25% of the data concerning the Sweater category, the description model significantly outperforms the image model within this class. This example shows that each modality, even ones with missing data, can yield unique information and be helpful until proven otherwise.
In most cases, the trimodal model does not significantly outperform the bimodal model based on title and image. Furthermore, training the trimodal model takes twice as long as training the bimodal model. In this study, this is not a concern, since training involves only 48,000 observations. However, in real-world scenarios, the trimodal model could be too expensive for the slight improvement it provides. Hence, only titles and images can be considered valuable in this task.
In the case of gender classification, the highest MCC (0.543) indicates that the proposed architecture might be successfully used to solve demographic classification problems. It could be a powerful tool, for instance, for e-commerce or user-generated platforms. By modeling user characteristics, companies could personalize their offers to maximize their profits. On the other hand, users will benefit from personalized content. However, more experiments on larger datasets should be performed to confirm the usefulness in the production scenario.
To the best of the authors’ knowledge, our work classifies users based on the content of their interests for the first time (see experiment on MovieLens1M for gender classification). To date, the existing approaches [28] have considered only the behavioral data, which in the case of MovieLens1M is present as ratings that users give to movies. Our experiment shows that it is possible to model user characteristics with the content in which the users are interested. Though only simple classifiers were tested, the results indicate that our approach is reasonable.
All the tests show that combining several modalities boosts model performance; however, the data must be meaningful for the given task. If possible, additional modalities are suggested, as they may provide new insight into the analyzed data. Our research concludes that multiple data modalities should be utilized cautiously, as some might yield little information (here, movie posters are redundant for genre and gender classification).
The main conclusions from the tests are summarized in Table 6 and explained in the following. These conclusions are made upon the following criteria: each modality’s influence on the analyzed machine learning (ML) problem, the type of the ML task, and the memory constraints during the training and predicting phase.
Table 6. Comparing technique selection criteria for building multimodal representations.
Tests on the Amazon Reviews dataset indicate that late fusion performs best in nearly all scenarios. Here, the reason is probably the type of textual data. Many descriptions and titles contain words that are sufficient on their own for a good prediction. Therefore, in situations where one of the modalities is dominant, we believe that the best approach would be late fusion. The model efficiently exploits the information from the most informative modality and can expand it using additional modalities.
On the other hand, the early fusion approach can reveal information from the combination of the modalities (interaction between modalities), which can be interdependent. Therefore, in scenarios where several modalities affect the model equally, their interactions might reveal hidden information and hence should be learned.
Our experiments showed that the sketch performs worse in typical classification problems. However, this representation can be the least memory-consuming way to store data representations. Furthermore, when the number of used modalities increased, the performance of sketch-based models almost always increased. Possibly, this improvement could be maintained until reaching state-of-the-art performance. If such behavior were confirmed, the sketch representation could offer ground-breaking perspectives in the world of Big Data with highly accurate models and optimized storage consumption.

6. Conclusions

This paper explored three of the most common techniques for building multimodal data representations, (1) the late fusion, (2) the early fusion, and (3) the sketch, and compared them in classification tasks. Our paper explored different types of data (modalities) that could be gathered by sensors serving a wide range of applications. Our experiments were conducted on Amazon Reviews, MovieLens25M, and MovieLens1M datasets. The results showed that a data fusion strategy is crucial for maximizing multimodal model efficiency. In the case of the Amazon Reviews dataset, the late fusion approach outperformed the other approaches by approximately 3%.
Results on all datasets confirmed that multimodality could outperform any unimodal representation. The performance of models improved from 0.919 to 0.969 in accuracy on the Amazon Reviews dataset and from 0.907 to 0.918 in AUC on the MovieLens25M dataset. However, our experiments on both MovieLens datasets indicated the importance of meaningful input data to the given task. Analyses showed that movie posters did not contain much substantial information about movies.
To the best of our knowledge, we are the first to propose three unique selection criteria for choosing a technique appropriate to the ML task (see Table 6), which should be taken into account while building multimodal representation. These criteria are the impact each modality has on the task, the type of the task, and the amount of used memory. Furthermore, using these criteria, we summarized the three explored techniques, offering recommendations that can be used by all scientists working on any multimodal task.
Our work did not exhaustively explore the problem of choosing the proper technique for building multimodal representations. However, we confirmed that such a choice matters, and we designed recommendation criteria that could support it. In the next section, we present the directions of future research which, in combination with our work, could set the basis for developing a universal multimodal fusion technique.

7. Future Research Directions

Given the recent improvements in sensor systems, more high-quality modalities are available. This creates both opportunities and challenges that should be addressed to develop a universal multimodal data fusion technique. We highlighted three major areas of research: (1) designing a multimodal benchmark, (2) extending multimodal applications, and (3) increasing multimodal interpretability.
Designing a multimodal benchmark, similar to GLUE [29] for NLP applications, is crucial for establishing the standards of multimodal representation research. It would confirm all conclusions made so far in previous and future works on this subject. Multibench [30] could be considered a promising candidate for such a benchmark. It is a collection of fifteen datasets which span across seven different applications. Extending Multibench with new applications and establishing state-of-the-art models along with a dynamic leaderboard could be the next step.
Identifying new multimodal applications is essential to create a universal data fusion technique. For instance, recent research on biomedical applications [31,32] exposed the limitations and strengths of existing multimodal fusion strategies. Consequently, exploring other applications would highlight the work that still needs to be carried out in the field of building multimodal representation.
Multimodal interpretability is the process of understanding multimodal deep neural nets [33,34]. Despite their outstanding performance, the complexity, opaqueness, and black-box nature of the deep neural nets limit their social acceptance and usability. Hence, understanding the impact of individual modalities, and especially their interactions, is fundamental for any multimodal application.

Author Contributions

Conceptualization, M.P., A.W. and S.S.-R.; methodology, M.P., A.W. and S.S.-R.; software, M.P.; validation, M.P.; formal analysis, M.P., A.W. and S.S.-R.; investigation, M.P., A.W. and S.S.-R.; resources, M.P.; data curation, M.P.; writing—original draft preparation, M.P., A.W. and S.S.-R.; writing—review and editing, M.P., A.W. and S.S.-R.; visualization, M.P.; supervision, A.W.; project administration, M.P.; funding acquisition, A.W. and S.S.-R. All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by the Centre for Priority Research Area Artificial Intelligence and Robotics of Warsaw University of Technology within the Excellence Initiative: Research University (IDUB) programme (grant no 1820/27/Z01/POB2/2021) and The National Centre for Research and Development in Poland, Smart Growth Operational Program for 2014-2020, Digital Innovations: POIR.01.01.01-00-0066/22.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available: Amazon Reviews—https://jmcauley.ucsd.edu/data/amazon/, DOI: 10.1145/2872427.2883037, accessed on 15 January 2021; MovieLens1M, MovieLens25M—https://grouplens.org/datasets/movielens/, DOI: 10.1145/2827872, accessed on 7 February 2021.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
MDPI: Multidisciplinary digital publishing institute
ML: Machine learning
AUC: Area under curve
NN: Nearest neighbour
mAP: Mean average precision
MCC: Matthews correlation coefficient

References

  1. Yuhas, B.; Goldstein, M.; Sejnowski, T. Integration of acoustic and visual speech signals using neural networks. IEEE Commun. Mag. 1989, 27, 65–71. [Google Scholar] [CrossRef]
  2. Baltrusaitis, T.; Ahuja, C.; Morency, L.P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef] [PubMed]
  3. Cao, W.; Feng, W.; Lin, Q.; Cao, G.; He, Z. A Review of Hashing Methods for Multimodal Retrieval. IEEE Access 2020, 8, 15377–15391. [Google Scholar] [CrossRef]
  4. Gao, J.; Li, P.; Chen, Z.; Zhang, J. A Survey on Deep Learning for Multimodal Data Fusion. Neural Comput. 2020, 32, 829–864. [Google Scholar] [CrossRef]
  5. Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf. Fusion 2023, 91, 424–444. [Google Scholar] [CrossRef]
  6. Tsanousa, A.; Bektsis, E.; Kyriakopoulos, C.; González, A.G.; Leturiondo, U.; Gialampoukidis, I.; Karakostas, A.; Vrochidis, S.; Kompatsiaris, I. A Review of Multisensor Data Fusion Solutions in Smart Manufacturing: Systems and Trends. Sensors 2022, 22, 1734. [Google Scholar] [CrossRef]
  7. He, R.; McAuley, J. Ups and Downsm: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. In Proceedings of the 25th International Conference on World Wide Web, Montréal, QC, Canada, 11–15 May 2016; pp. 507–517. [Google Scholar]
  8. Harper, F.M.; Konstan, J.A. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 2016, 5, 1–19. [Google Scholar] [CrossRef]
  9. Varshney, K. Trust in Machine Learning, Manning Publications, Shelter Island, Chapter 4 Data Sources and Biases, Section 4.1 Modalities. Available online: https://livebook.manning.com/book/trust-in-machine-learning/chapter-4/v-2/ (accessed on 23 March 2021).
  10. Zhang, Y.; Sidibé, D.; Morel, O.; Mériaudeau, F. Deep multimodal fusion for semantic image segmentation: A survey. Image Vis. Comput. 2021, 105, 104042.
  11. El-Sappagh, S.; Abuhmed, T.; Riazul Islam, S.; Kwak, K.S. Multimodal multitask deep learning model for Alzheimer’s disease progression detection based on time series data. Neurocomputing 2020, 412, 197–215.
  12. Jaiswal, M.; Bara, C.P.; Luo, Y.; Burzo, M.; Mihalcea, R.; Provost, E.M. MuSE: A Multimodal Dataset of Stressed Emotion. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; European Language Resources Association: Paris, France, 2020; pp. 1499–1510.
  13. Singh, A.; Natarajan, V.; Shah, M.; Jiang, Y.; Chen, X.; Batra, D.; Parikh, D.; Rohrbach, M. Towards VQA Models That Can Read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 8309–8318.
  14. Rychalska, B.; Basaj, D.B.; Dabrowski, J.; Daniluk, M. I know why you like this movie: Interpretable Efficient Multimodal Recommender. arXiv 2020, arXiv:2006.09979.
  15. Laenen, K.; Moens, M.F. A Comparative Study of Outfit Recommendation Methods with a Focus on Attention-based Fusion. Inf. Process. Manag. 2020, 57, 102316.
  16. Salah, A.; Truong, Q.T.; Lauw, H.W. Cornac: A Comparative Framework for Multimodal Recommender Systems. J. Mach. Learn. Res. 2020, 21, 1–5.
  17. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Toronto, ON, Canada, 2019; pp. 4171–4186.
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778.
  19. Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828.
  20. Srivastava, N.; Salakhutdinov, R. Multimodal learning with deep Boltzmann machines. J. Mach. Learn. Res. 2014, 15, 2949–2980.
  21. Frank, S.; Bugliarello, E.; Elliott, D. Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers. arXiv 2021, arXiv:2109.04448.
  22. Gallo, I.; Calefati, A.; Nawaz, S. Multimodal Classification Fusion in Real-World Scenarios. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 36–41.
  23. Kiela, D.; Grave, E.; Joulin, A.; Mikolov, T. Efficient Large-Scale Multi-Modal Classification. arXiv 2018, arXiv:1802.02892.
  24. Bayoudh, K.; Knani, R.; Hamdaoui, F.; Mtibaa, A. A survey on deep multimodal learning for computer vision: Advances, trends, applications, and datasets. Vis. Comput. 2022, 38, 2939–2970.
  25. Dabrowski, J.; Rychalska, B.; Daniluk, M.; Basaj, D.; Goluchowski, K.; Babel, P.; Michalowski, A.; Jakubowski, A. An efficient manifold density estimator for all recommendation systems. arXiv 2020, arXiv:2006.01894.
  26. Wirojwatanakul, P.; Wangperawong, A. Multi-Label Product Categorization Using Multi-Modal Fusion Models. arXiv 2019, arXiv:1907.00420.
  27. Rychalska, B.; Basaj, D.; Dabrowski, J.; Daniluk, M. Cleora: A Simple, Strong and Scalable Graph Embedding Scheme. arXiv 2021, arXiv:2102.02302.
  28. De Cnudde, S.; Martens, D.; Evgeniou, T.; Provost, F. A benchmarking study of classification techniques for behavioral data. Int. J. Data Sci. Anal. 2020, 9, 131–173.
  29. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; Association for Computational Linguistics: Toronto, ON, Canada, 2018; pp. 353–355.
  30. Liang, P.P.; Lyu, Y.; Fan, X.; Wu, Z.; Cheng, Y.; Wu, J.; Chen, L.Y.; Wu, P.; Lee, M.A.; Zhu, Y.; et al. MultiBench: Multiscale Benchmarks for Multimodal Representation Learning. In Proceedings of the Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1); 2021. Available online: https://arxiv.org/abs/2107.07502 (accessed on 7 February 2023).
  31. Stahlschmidt, S.R.; Ulfenborg, B.; Synnergren, J. Multimodal deep learning for biomedical data fusion: A review. Briefings Bioinform. 2022, 23, bbab569.
  32. Zhang, Y.D.; Dong, Z.; Wang, S.H.; Yu, X.; Yao, X.; Zhou, Q.; Hu, H.; Li, M.; Jiménez-Mesa, C.; Ramirez, J.; et al. Advances in multimodal data fusion in neuroimaging: Overview, challenges, and novel orientation. Inf. Fusion 2020, 64, 149–187.
  33. Sleeman, W.C.; Kapoor, R.; Ghosh, P. Multimodal Classification: Current Landscape, Taxonomy and Future Directions. ACM Comput. Surv. 2023, 55, 1–31.
  34. Liang, P.P.; Zadeh, A.; Morency, L.P. Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. arXiv 2022, arXiv:2209.03430.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
