Zero-Shot Multimodal Sentiment Analysis Using LVLMs as a Triage Signal for Video Platform Moderation
Abstract
1. Introduction
2. Related Works
2.1. Video Content Classification
2.2. Text-Based Sentiment Analysis
2.3. Visual-Based Sentiment Analysis
2.4. Multimodal Sentiment Analysis and Aggregation Approaches
2.5. Large Vision–Language Models
3. Materials and Methods
3.1. Datasets
3.2. Preprocessing
3.2.1. Utterance Segmentation
3.2.2. Audio Transcription and Text Extraction
3.2.3. Data Cleaning
3.2.4. Frame Extraction
3.3. Sentiment Annotation
3.4. Video-Level Sentiment Aggregation
3.5. Models
- LLaVA-OneVision-7B: A multimodal model developed by combining CLIP (Contrastive Language-Image Pretraining) with LLaMA [14]. The model maps text and frames into a shared embedding space, which helps it capture sentiment cues from both modalities. In this study, LLaVA-OneVision-7B processes the text extracted from each video together with uniformly sampled frames obtained from utterance segmentation. The model was tested in a zero-shot setting, that is, used directly without additional training, to evaluate its ability to recognize sentiment in data it has not seen before.
- Qwen2.5-VL-7B: Vision–language model with more sophisticated cross-modal attention capabilities to connect information from text and frames [17]. This model was developed to handle various natural language processing tasks with stronger visual comprehension.
3.6. Experimental Setup
- Computational Resource: This research leverages cloud computing resources powered by an A100 GPU with 40 GB of VRAM and 83 GB of system RAM.
- Unified Experimental Design for Multimodal Input Configurations: The experimental design includes three input configurations: (i) text-only, (ii) vision-only using uniformly sampled frames, and (iii) multimodal input combining both text and visual information. All configurations follow the same zero-shot prompting protocol, ensuring consistency across experiments.
- Frame-Resolution Trade-off: To investigate the impact of temporal and visual resolution on prediction performance, we designed an experiment that varied the number of video frames (n-frames) used as input while adjusting the image resolution to manage computational complexity. To determine the optimal number of frames, we conducted a preliminary experiment using 50 randomly sampled videos from each episode to represent the overall data. The number of frames was incrementally set to 10 frames per sample. Because memory and processing demands grow with the frame count, the image resolution was proportionally reduced from 360p to 240p, and eventually to 144p. The models were evaluated under the same inference protocol and prompting settings across all configurations to ensure a fair comparison. Accuracy was recorded for each configuration to assess the trade-off between temporal information and image quality under resource constraints.
- Zero-Shot: The models are used directly, without additional training or task-specific fine-tuning, to measure how well they recognize sentiment from their pretraining alone. Each model receives text and frames as input and predicts either positive or negative sentiment. Because no training or fine-tuning takes place, the entire dataset is used for evaluation rather than being divided into training and test subsets. We conducted two experiments under this setup to evaluate LLaVA-OneVision-7B and Qwen2.5-VL-7B in multimodal sentiment analysis. Experiment 1 tests LLaVA-OneVision-7B with raw text and image inputs extracted from video, evaluating how well the model leverages its pretrained multimodal understanding to infer sentiment. Experiment 2 applies the same setup to Qwen2.5-VL-7B, analyzing its ability to process rescaled video frames and accompanying text through its cross-modal attention framework. Together, these experiments provide insight into the initial performance of each model when applied to previously unseen multimodal data.
- Unimodal Baselines: We include several additional baseline configurations to enable a comprehensive evaluation. First, for the text-only baselines, we evaluate multiple large language models, including Qwen3.5-4B and Phi-3.5-mini, using only transcribed text inputs. This setting is designed to assess the contribution of linguistic information independently of any visual cues. Second, for the video-only baselines, we evaluate LLaVA-NeXT-Video-7B as a dedicated visual model that operates solely on sampled video frames. This configuration allows us to examine the extent to which sentiment can be inferred from visual information alone. Finally, we consider unimodal variants of LVLMs, where the models are evaluated in both text-only and video-only modes. This analysis aims to provide further insight into modality-specific behavior within the same model architecture.
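
The uniform frame sampling and stepwise resolution reduction described in the setup above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `uniform_frame_indices` and `resolution_for`, the bin-centre sampling rule, and the frame-count thresholds at which the frame height drops from 360p to 240p to 144p are all hypothetical, since the source does not specify them.

```python
from typing import List


def uniform_frame_indices(total_frames: int, n_frames: int) -> List[int]:
    """Return up to n_frames indices spread evenly over a clip.

    The clip is split into equal-width bins and the centre of each bin is
    taken, so the very first and very last frames are not over-represented.
    """
    if total_frames <= 0 or n_frames <= 0:
        return []
    n = min(n_frames, total_frames)  # a short clip cannot yield more frames than it has
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


def resolution_for(n_frames: int) -> int:
    """Hypothetical schedule: lower the frame height as the frame count grows.

    The thresholds (10 and 20 frames) are assumptions for illustration only.
    """
    if n_frames <= 10:
        return 360
    if n_frames <= 20:
        return 240
    return 144


# Example: a roughly two-second utterance at an assumed 25 fps has about 50 frames.
indices = uniform_frame_indices(total_frames=50, n_frames=10)
height = resolution_for(n_frames=10)
print(indices, height)
```

The same index list can be reused for the vision-only and multimodal configurations, so that the two differ only in whether the transcript is supplied alongside the frames.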
3.7. Evaluation Metrics
3.8. Reproducibility and Resources
4. Results
4.1. Exploring the Optimal n-Frames
4.2. Comparative Performance of the Models
4.2.1. Text-Only Performance (LLMs vs. LVLMs)
4.2.2. Video-Only Performance (Visual Capability of LVLMs)
4.3. Ablation Study on Input Modalities
4.4. Error Analysis
5. Discussion
5.1. Implications
5.2. Limitations
- Limitations of the Dataset: The dataset is limited in scope, as it is derived from utterance-level segments of a single Indonesian TV series. While this setting provides realistic conversational data, it may not capture the diversity of sentiment expressions across domains, languages, and content types. Therefore, the findings should be interpreted as an exploratory case study rather than a generalizable benchmark.
- Dependence on a single annotator: A key limitation of this study lies in the use of a single annotator without inter-annotator agreement or adjudication procedures. While this approach ensures internal consistency in labeling, it does not capture the variability of human interpretation, which is particularly important for subjective tasks such as sentiment analysis. As a result, the annotations should be interpreted as a single-annotator reference rather than a fully validated ground truth. This may introduce bias and affect the reliability of the evaluation. Future work should incorporate multiple annotators, measure inter-annotator agreement, and include adjudication protocols to improve annotation reliability and better capture the subjective nature of sentiment interpretation.
- Limitations of zero-shot inference: Although LVLMs provide promising results without task-specific fine-tuning, they may struggle with subtle or ambiguous sentiment cues, particularly when textual and visual signals conflict. This indicates that additional validation, and potentially domain-specific adaptation, may be required for robust real-world use.
- Platform coverage and computational cost: The analysis focuses on YouTube video data, which may not fully represent the diverse multimodal sentiment expressions across platforms. Generalisation of these findings can be improved by including a broader dataset covering various social media sources and content types. In addition, the computational cost of refining large-scale models remains a challenge, as it requires large GPU resources that can limit accessibility for smaller research teams or organisations.
- Mismatch Between Annotation Cues and Model Inputs: Another limitation of this study lies in the mismatch between annotation and model inputs. While the annotator may consider vocal cues such as intonation during labeling, the models operate only on textual and visual inputs. This discrepancy may introduce noise in the evaluation, as certain sentiment cues present during annotation are not accessible to the model. Future work may address this limitation by incorporating audio-aware multimodal models or by restricting annotation criteria to match model inputs more closely.
- Absence of statistical significance testing and repeated runs: This study reports descriptive performance metrics for zero-shot inference, including accuracy, Macro-F1, precision, recall, and class-wise F1-score. However, we did not conduct repeated stochastic inference runs or formal statistical significance testing. Therefore, observed differences between input configurations and models should be interpreted cautiously, particularly when performance gaps are small. Future work should include repeated evaluations, confidence intervals, and paired statistical tests, such as bootstrap resampling or approximate randomization tests, to provide stronger evidence for performance differences.
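
The paired bootstrap resampling suggested above as future work can be sketched as follows. This is an illustrative outline only, not an analysis performed in the study: the function name `paired_bootstrap_diff`, the percentile interval, the number of resamples, and the toy correctness flags are all assumptions.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_diff(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap interval for accuracy(A) - accuracy(B).

    correct_a[i] and correct_b[i] are 1 if the respective system classified
    sample i correctly and 0 otherwise. Both systems must be scored on the
    same samples, which are resampled together (paired design).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[max(0, int((alpha / 2) * n_resamples))]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)]
    observed = (sum(correct_a) - sum(correct_b)) / n
    return observed, lower, upper


# Toy flags for 50 samples (not data from the study).
system_a = [1] * 40 + [0] * 10
system_b = [1] * 30 + [0] * 20
print(paired_bootstrap_diff(system_a, system_b))
```

If the resulting interval excludes zero, the accuracy gap is unlikely to be an artefact of which utterances happened to be sampled. With only 50 utterances per episode, such intervals can be expected to be wide, which is one reason small gaps should be read cautiously.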
5.3. Future Research
- Fine-Tuning with Domain-Specific Datasets and Stronger Evaluation Protocols: Although the evaluated models show promising zero-shot capabilities, fine-tuning with domain-specific datasets—particularly those containing child-friendly and inappropriate content samples—could significantly enhance their classification accuracy. Creating annotated multimodal datasets tailored for sentiment analysis in child safety applications would help improve model robustness and mitigate biases. Future work should also include stronger supervised and fine-tuned baselines, repeated evaluations, confidence intervals, and formal statistical significance testing, such as bootstrap resampling or paired significance tests, to provide more reliable evidence of performance differences across models and input modalities.
- Integration of Additional Modalities: The current study primarily focuses on text and visual sentiment analysis. Future research should explore the incorporation of audio-based sentiment analysis, as tone of voice, background sounds, and speech inflections can provide crucial emotional cues. Combining text, frames, and audio may lead to a more comprehensive understanding of sentiment in video content.
- Real-Time Processing and Efficiency Optimization: Given the computational demands of large-scale multimodal models, future studies should explore optimization techniques to enable real-time sentiment analysis. Efficient inference strategies, model distillation, and low-rank adaptation (LoRA) could be employed to reduce latency and improve deployment feasibility on resource-constrained devices.
- Application in Broader Content Moderation Use Cases: While this study focuses on child safety, the insights gained from LVLM-based sentiment analysis can be extended to other content moderation applications, such as hate speech detection, misinformation filtering, and toxicity analysis. Future work could explore how these models perform in diverse content regulation scenarios, further refining their effectiveness across multiple domains.
- Incorporating Annotator Consensus and Label Reliability: A key direction for future research is to explicitly consider annotator agreement and consensus in dataset construction and evaluation. Inter-annotator agreement should be measured using metrics such as Cohen’s Kappa or Fleiss’ Kappa. In addition, consensus-based labeling strategies should be introduced, in which ambiguous samples are re-evaluated. Furthermore, model performance should be analyzed separately for high-agreement and low-agreement samples. This approach would help distinguish whether classification errors arise from model limitations or from inherent ambiguity in the data.
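
For two annotators, Cohen’s Kappa compares the observed agreement with the agreement expected by chance from each annotator’s label frequencies. The sketch below is a minimal pure-Python illustration; the function name `cohens_kappa` and the ten toy labels are hypothetical, as the study itself used a single annotator.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example: 10 utterances, 8 agreements (6 positive, 2 negative), 2 disagreements.
annotator_1 = ["pos"] * 6 + ["neg"] * 2 + ["pos", "neg"]
annotator_2 = ["pos"] * 6 + ["neg"] * 2 + ["neg", "pos"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))  # prints 0.52
```

In this toy case the raw agreement is 0.80, but because both annotators label 70% of items positive, chance agreement is 0.58 and kappa drops to about 0.52. The same effect would matter for a dataset as positive-skewed as the one used here.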
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 55, 5731–5780.
- Zhang, H. A comprehensive survey on multimodal sentiment analysis: Techniques, models, and applications. Adv. Eng. Innov. 2024, 12, 47–52.
- Wu, S.; Wang, X.; Wang, L.; He, D.; Dang, J. Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content. arXiv 2024, arXiv:2412.10460.
- Soleymani, M.; Garcia, D.; Jou, B.; Schuller, B.; Chang, S.F.; Pantic, M. A survey of multimodal sentiment analysis. Image Vis. Comput. 2017, 65, 3–14.
- Wang, N.; Wang, Q. Dynamic Weighted Gating for Enhanced Cross-Modal Interaction in Multimodal Sentiment Analysis. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 21, 1–19.
- Xie, Y.; Zhu, Z.; Lu, X.; Huang, Z.; Xiong, H. InfoEnh: Towards Multimodal Sentiment Analysis via Information Bottleneck Filter and Optimal Transport Alignment. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 9073–9083.
- Deng, Y.; Li, Y.; Xian, S.; Li, L.; Qiu, H. MuAL: Enhancing multimodal sentiment analysis with cross-modal attention and difference loss. Int. J. Multimed. Inf. Retr. 2024, 13, 31.
- Wang, W.; Ding, L.; Shen, L.; Luo, Y.; Hu, H.; Tao, D. Wisdom: Improving multimodal sentiment analysis by fusing contextual world knowledge. In Proceedings of the 32nd ACM International Conference on Multimedia; ACM: New York, NY, USA, 2024; pp. 2282–2291.
- Zhao, S.; Jiang, J.; Tang, W.; Zhu, J.; Chen, H.; Xu, P.; Schuller, B.W.; Tao, J.; Yao, H.; Ding, G. Multi-source multi-modal domain adaptation. Inf. Fusion 2025, 117, 102862.
- Shi, Y.; Cai, J.; Liao, L. Multi-task learning and mutual information maximization with crossmodal transformer for multimodal sentiment analysis. J. Intell. Inf. Syst. 2024, 63, 1–19.
- Chen, J.; Yu, K.; Wang, F.; Zhou, Z.; Bi, Y.; Zhuang, S.; Zhang, D. Temporal convolutional network-enhanced real-time implicit emotion recognition with an innovative wearable fNIRS-EEG dual-modal system. Electronics 2024, 13, 1310.
- Al-Tameemi, I.K.S.; Feizi-Derakhshi, M.R.; Pashazadeh, S.; Asadpour, M. Interpretable multimodal sentiment classification using deep multi-view attentive network of image and text data. IEEE Access 2023, 11, 91060–91081.
- Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf. Fusion 2023, 91, 424–444.
- Li, B.; Zhang, Y.; Guo, D.; Zhang, R.; Li, F.; Zhang, H.; Zhang, K.; Zhang, P.; Li, Y.; Liu, Z.; et al. LLaVA-OneVision: Easy visual task transfer. arXiv 2024, arXiv:2408.03326.
- Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966.
- Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv 2024, arXiv:2409.12191.
- Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923.
- Tanantong, T.; Yongwattana, P. A convolutional neural network framework for classifying inappropriate online video contents. IAES Int. J. Artif. Intell. 2023, 12, 124–136.
- Balat, M.; Gabr, M.; Bakr, H.; Zaky, A.B. TikGuard: A Deep Learning Transformer-Based Solution for Detecting Unsuitable TikTok Content for Kids. In 2024 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES); IEEE: Piscataway, NJ, USA, 2024; pp. 337–340.
- Zhao, C.; Yang, L.; Kuang, J.; Yan, Z. Protecting Children from Violent Short Videos: A Child-Attentive Multimodal Multitask Learning Approach. In Pacific Asia Conference on Information Systems; AIS Electronic Library: Atlanta, GA, USA, 2025; Available online: https://aisel.aisnet.org/pacis2025/aiandml/aiandml/19/ (accessed on 1 May 2026).
- Xu, Y. Research for the methods of text sentiment analysis. IET Conf. Proc. 2025, 2024, 185–191.
- Jiao, S. Research on text sentiment analysis in natural language processing. In Proceedings of the International Conference on Electrical Engineering and Intelligent Control (EEIC 2024); IET: Stevenage, UK, 2024; Volume 2024, pp. 161–167.
- Raza, A.A.; Habib, A.; Ashraf, J.; Javed, M. Semantic orientation based decision making framework for big data analysis of sporadic news events. J. Grid Comput. 2019, 17, 367–383.
- Wilson, T.; Wiebe, J.; Hoffmann, P. Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Comput. Linguist. 2009, 35, 399–433.
- Taboada, M.; Brooke, J.; Tofiloski, M.; Voll, K.; Stede, M. Lexicon-based methods for sentiment analysis. Comput. Linguist. 2011, 37, 267–307.
- Pang, B.; Lee, L.; Vaithyanathan, S. Thumbs up? Sentiment classification using machine learning techniques. arXiv 2002, arXiv:cs/0205070.
- Song, J.; Kim, K.T.; Lee, B.; Kim, S.; Youn, H.Y. A novel classification approach based on Naïve Bayes for Twitter sentiment analysis. KSII Trans. Internet Inf. Syst. TIIS 2017, 11, 2996–3011.
- Naz, S.; Sharan, A.; Malik, N. Sentiment classification on twitter data using support vector machine. In Proceedings of the 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI); IEEE: Piscataway, NJ, USA, 2018; pp. 676–679.
- Kim, Y. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882.
- Tai, K.S.; Socher, R.; Manning, C.D. Improved semantic representations from tree-structured long short-term memory networks. arXiv 2015, arXiv:1503.00075.
- Yue, W.; Li, L. Sentiment analysis using word2vec-cnn-bilstm classification. In Proceedings of the 2020 Seventh International Conference on Social Networks Analysis, Management and Security (SNAMS); IEEE: Piscataway, NJ, USA, 2020; pp. 1–5.
- Huynh, V.T.; Yang, H.J.; Lee, G.S.; Kim, S.H. End-to-end learning for multimodal emotion recognition in video with adaptive loss. IEEE MultiMedia 2021, 28, 59–66.
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1: Long and short papers, pp. 4171–4186.
- Machajdik, J.; Hanbury, A. Affective image classification using features inspired by psychology and art theory. In Proceedings of the 18th ACM International Conference on Multimedia; ACM: New York, NY, USA, 2010; pp. 83–92.
- Verma, B.; Meel, P.; Vishwakarma, D.K. Visual Sentiment Recognition via Popular Deep Models on the Memotion Dataset. In Proceedings of the 2024 IEEE 9th International Conference for Convergence in Technology (I2CT); IEEE: Piscataway, NJ, USA, 2024; pp. 1–7.
- Yang, J.; She, D.; Sun, M. Joint Image Emotion Classification and Distribution Learning via Deep Convolutional Neural Network. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17); IJCAI Inc.: Pasadena, CA, USA, 2017; pp. 3266–3272.
- Sai, P.T.; Sri, G.H.; Surekha, T.L. Sentiment Recognition in Images leveraging ResNet18 vs Vit Architecture. In Proceedings of the 2024 Second International Conference on Advances in Information Technology (ICAIT); IEEE: Piscataway, NJ, USA, 2024; Volume 1, pp. 1–7.
- Jhadi, K.; Tiwari, N.; Chawla, M. Visual Sentiment-based on FER for Improving Feedback Analysis using Transfer Learning. In Proceedings of the 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT); IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
- Limami, F.; Hdioud, B.; Oulad Haj Thami, R. Contextual emotion detection in images using deep learning. Front. Artif. Intell. 2024, 7, 1386753.
- Liu, S.; Li, T. A Review of Multimodal Sentiment Analysis in Online Public Opinion Monitoring. Informatics 2026, 13, 10.
- Li, H.; Lu, Y.; Zhu, H. Multi-modal sentiment analysis based on image and text fusion based on cross-attention mechanism. Electronics 2024, 13, 2069.
- Zhan, Z.; Cao, D.; Chen, Z.; Cheng, H.; Yu, Z. Multimodal sentiment analysis based on slice aggregation and dynamic fusion. CCF Trans. Pervasive Comput. Interact. 2025, 7, 474–493.
- Almousa, O.; Tashtoush, Y.; AlSobeh, A.; Zahariev, P.; Darwish, O. SiAraSent: From Features to Deep Transformers for Large-Scale Arabic Sentiment Analysis. Big Data Cogn. Comput. 2026, 10, 49.
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
- Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen technical report. arXiv 2023, arXiv:2309.16609.
- Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. DeepSeek-V3 technical report. arXiv 2024, arXiv:2412.19437.
- Nasution, A.H.; Onan, A. ChatGPT Label: Comparing the Quality of Human-Generated and LLM-Generated Annotations in Low-Resource Language NLP Tasks. IEEE Access 2024, 12, 71876–71900.
- Guidotti, D.; Pandolfo, L.; Pulina, L. Discovering sentiment insights: Streamlining tourism review analysis with Large Language Models. Inf. Technol. Tour. 2025, 27, 227–261.
- Dmonte, A.; Ko, E.; Zampieri, M. An Evaluation of Large Language Models in Financial Sentiment Analysis. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData); IEEE: Piscataway, NJ, USA, 2024; pp. 4869–4874.
- Água, M.; António, N.; Carrasco, M.P.; Rassal, C. Large Language Models Powered Aspect-Based Sentiment Analysis for Enhanced Customer Insights. Tour. Manag. Stud. 2025, 21, 1–19.
- Zhou, C.; Song, D.; Tian, Y.; Wu, Z.; Wang, H.; Zhang, X.; Yang, J.; Yang, Z.; Zhang, S. A Comprehensive Evaluation of Large Language Models on Aspect-Based Sentiment Analysis. arXiv 2024, arXiv:2412.02279.
- Khalila, Z.; Nasution, A.H.; Monika, W.; Onan, A.; Murakami, Y.; Radi, Y.B.I.; Osmani, N.M. Investigating Retrieval-Augmented Generation in Quranic Studies: A Study of 13 Open-Source Large Language Models. Int. J. Adv. Comput. Sci. Appl. 2025, 16.
- Thresa Jeniffer, J.; Swetha, M.; Raghuvaran, E.; Deepa, R.; Surendran, R. Enhancing Sentiment Analysis with Multimodal Large Language Models. In 2025 6th International Conference for Emerging Technology (INCET); IEEE: Piscataway, NJ, USA, 2025.
- Tian, Y.; Song, Y.; Zhang, Y. Multimodal Aspect-Based Sentiment Analysis with Plugin-Enhanced Large Language Models. IEEE Trans. Neural Netw. Learn. Syst. 2026, 37, 1575–1589.
- Nishimura, T.; Nakada, S.; Kondo, M. Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 21, 1–22.
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592.
- Pérez-Rosas, V.; Mihalcea, R.; Morency, L.P. Utterance-level multimodal sentiment analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; Volume 1: Long Papers, pp. 973–982.
- Heydarian, M.; Doyle, T.E.; Samavi, R. MLCM: Multi-label confusion matrix. IEEE Access 2022, 10, 19083–19095.





| Episode | Positive (Count) | Positive (Percentage) | Negative (Count) | Negative (Percentage) | Average Number of Words | Average Video Duration (s) |
|---|---|---|---|---|---|---|
| Eps1 | 37 | 74% | 13 | 26% | 6.58 | 2.76 |
| Eps2 | 36 | 72% | 14 | 28% | 4.96 | 1.88 |
| Eps3 | 41 | 82% | 9 | 18% | 4.86 | 1.99 |
| Eps4 | 50 | 100% | 0 | 0% | 5.56 | 2.10 |
| Eps5 | 43 | 86% | 7 | 14% | 5.12 | 1.88 |
| Eps6 | 37 | 74% | 13 | 26% | 5.44 | 2.05 |
| Eps7 | 31 | 62% | 19 | 38% | 5.36 | 1.94 |
| All Episodes | 275 | 78.57% | 75 | 21.43% | 5.41 | 2.09 |

| Episode | Model | Acc. | P(+) | R(+) | F1(+) | P(−) | R(−) | F1(−) | Macro-F1 |
|---|---|---|---|---|---|---|---|---|---|
| Eps1 | LLaVA-OneVision-7B | 74.00% | 0.96 | 0.68 | 0.79 | 0.50 | 0.92 | 0.65 | 0.72 |
| Eps1 | Qwen2.5-VL-7B | 26.00% | 0.00 | 0.00 | 0.00 | 0.26 | 1.00 | 0.41 | 0.21 |
| Eps1 | Phi-3.5-mini (3.8B) | 66.00% | 1.00 | 0.54 | 0.70 | 0.43 | 1.00 | 0.60 | 0.65 |
| Eps1 | Qwen3.5-4B | 62.00% | 1.00 | 0.49 | 0.65 | 0.41 | 1.00 | 0.58 | 0.62 |
| Eps2 | LLaVA-OneVision-7B | 64.00% | 1.00 | 0.50 | 0.67 | 0.44 | 1.00 | 0.61 | 0.64 |
| Eps2 | Qwen2.5-VL-7B | 28.00% | 0.00 | 0.00 | 0.00 | 0.28 | 1.00 | 0.44 | 0.22 |
| Eps2 | Phi-3.5-mini (3.8B) | 44.00% | 1.00 | 0.22 | 0.36 | 0.33 | 1.00 | 0.50 | 0.43 |
| Eps2 | Qwen3.5-4B | 52.00% | 1.00 | 0.33 | 0.50 | 0.37 | 1.00 | 0.54 | 0.52 |
| Eps3 | LLaVA-OneVision-7B | 56.00% | 1.00 | 0.46 | 0.63 | 0.29 | 1.00 | 0.45 | 0.54 |
| Eps3 | Qwen2.5-VL-7B | 18.00% | 0.00 | 0.00 | 0.00 | 0.18 | 1.00 | 0.31 | 0.15 |
| Eps3 | Phi-3.5-mini (3.8B) | 36.00% | 1.00 | 0.22 | 0.36 | 0.22 | 1.00 | 0.36 | 0.36 |
| Eps3 | Qwen3.5-4B | 50.00% | 1.00 | 0.39 | 0.56 | 0.26 | 1.00 | 0.42 | 0.49 |
| Eps4 | LLaVA-OneVision-7B | 36.00% | 1.00 | 0.36 | 0.53 | 0.00 | 0.00 | 0.00 | 0.26 |
| Eps4 | Qwen2.5-VL-7B | 0.00% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Eps4 | Phi-3.5-mini (3.8B) | 22.00% | 1.00 | 0.22 | 0.36 | 0.00 | 0.00 | 0.00 | 0.18 |
| Eps4 | Qwen3.5-4B | 34.00% | 1.00 | 0.34 | 0.51 | 0.00 | 0.00 | 0.00 | 0.26 |
| Eps5 | LLaVA-OneVision-7B | 68.00% | 1.00 | 0.63 | 0.77 | 0.30 | 1.00 | 0.47 | 0.62 |
| Eps5 | Qwen2.5-VL-7B | 14.00% | 0.00 | 0.00 | 0.00 | 0.14 | 1.00 | 0.25 | 0.12 |
| Eps5 | Phi-3.5-mini (3.8B) | 42.00% | 1.00 | 0.33 | 0.49 | 0.19 | 1.00 | 0.33 | 0.41 |
| Eps5 | Qwen3.5-4B | 58.00% | 1.00 | 0.51 | 0.68 | 0.25 | 1.00 | 0.40 | 0.54 |
| Eps6 | LLaVA-OneVision-7B | 84.00% | 0.97 | 0.81 | 0.88 | 0.63 | 0.92 | 0.75 | 0.82 |
| Eps6 | Qwen2.5-VL-7B | 28.00% | 1.00 | 0.03 | 0.05 | 0.27 | 1.00 | 0.42 | 0.24 |
| Eps6 | Phi-3.5-mini (3.8B) | 74.00% | 0.96 | 0.68 | 0.79 | 0.50 | 0.92 | 0.65 | 0.72 |
| Eps6 | Qwen3.5-4B | 86.00% | 0.97 | 0.84 | 0.90 | 0.67 | 0.92 | 0.77 | 0.84 |
| Eps7 | LLaVA-OneVision-7B | 62.00% | 1.00 | 0.39 | 0.56 | 0.50 | 1.00 | 0.67 | 0.61 |
| Eps7 | Qwen2.5-VL-7B | 38.00% | 0.00 | 0.00 | 0.00 | 0.38 | 1.00 | 0.55 | 0.28 |
| Eps7 | Phi-3.5-mini (3.8B) | 58.00% | 1.00 | 0.32 | 0.49 | 0.47 | 1.00 | 0.64 | 0.57 |
| Eps7 | Qwen3.5-4B | 52.00% | 1.00 | 0.23 | 0.37 | 0.44 | 1.00 | 0.61 | 0.49 |
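
The per-class columns in the table above (precision, recall, F1, and Macro-F1) all follow from the four cells of a binary confusion matrix. The sketch below shows that arithmetic. It rests on assumptions: the function name `binary_report` is hypothetical; the example counts (25 true positives, 12 false negatives, 12 true negatives, 1 false positive) are not reported in the source but are inferred as one set of counts consistent with the Eps1 LLaVA-OneVision-7B row and that episode's 37/13 class split; and scoring an undefined ratio as 0 is a convention chosen because it reproduces the zero entries in the Eps4 rows.

```python
from typing import Dict


def binary_report(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy, per-class precision/recall/F1 and Macro-F1 from confusion counts.

    tp/fn count positive samples predicted positive/negative;
    tn/fp count negative samples predicted negative/positive.
    """

    def safe_div(num: float, den: float) -> float:
        return num / den if den else 0.0  # undefined ratios are scored as 0

    def f1(precision: float, recall: float) -> float:
        return safe_div(2 * precision * recall, precision + recall)

    p_pos, r_pos = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
    p_neg, r_neg = safe_div(tn, tn + fn), safe_div(tn, tn + fp)
    f1_pos, f1_neg = f1(p_pos, r_pos), f1(p_neg, r_neg)
    return {
        "accuracy": safe_div(tp + tn, tp + fn + tn + fp),
        "p_pos": p_pos, "r_pos": r_pos, "f1_pos": f1_pos,
        "p_neg": p_neg, "r_neg": r_neg, "f1_neg": f1_neg,
        "macro_f1": (f1_pos + f1_neg) / 2,
    }


# Inferred counts for Eps1 (37 positive, 13 negative utterances).
eps1 = binary_report(tp=25, fn=12, tn=12, fp=1)
print({name: round(value, 2) for name, value in eps1.items()})

# An episode with no negative samples, where every positive is recovered.
no_negatives = binary_report(tp=50, fn=0, tn=0, fp=0)
print(round(no_negatives["accuracy"], 2), round(no_negatives["macro_f1"], 2))
```

The second call illustrates why Eps4 should be read with care: with no negative samples, F1(−) is 0 by this convention, so Macro-F1 cannot exceed 0.50 even when accuracy is 100%.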

| Episode | Model | Acc. | P(+) | R(+) | F1(+) | P(−) | R(−) | F1(−) | Macro-F1 |
|---|---|---|---|---|---|---|---|---|---|
| Eps1 | LLaVA-OneVision-7B | 74.00% | 0.74 | 1.00 | 0.85 | 0.00 | 0.00 | 0.00 | 0.43 |
| Eps1 | Qwen2.5-VL-7B | 26.00% | 0.00 | 0.00 | 0.00 | 0.26 | 1.00 | 0.41 | 0.21 |
| Eps1 | LLaVA-NeXT-Video-7B | 86.00% | 0.88 | 0.95 | 0.91 | 0.80 | 0.62 | 0.70 | 0.81 |
| Eps2 | LLaVA-OneVision-7B | 72.00% | 0.72 | 1.00 | 0.84 | 0.00 | 0.00 | 0.00 | 0.42 |
| Eps2 | Qwen2.5-VL-7B | 28.00% | 0.00 | 0.00 | 0.00 | 0.28 | 1.00 | 0.44 | 0.22 |
| Eps2 | LLaVA-NeXT-Video-7B | 58.00% | 1.00 | 0.42 | 0.59 | 0.40 | 1.00 | 0.57 | 0.58 |
| Eps3 | LLaVA-OneVision-7B | 82.00% | 0.82 | 1.00 | 0.90 | 0.00 | 0.00 | 0.00 | 0.45 |
| Eps3 | Qwen2.5-VL-7B | 18.00% | 0.00 | 0.00 | 0.00 | 0.18 | 1.00 | 0.31 | 0.15 |
| Eps3 | LLaVA-NeXT-Video-7B | 54.00% | 0.95 | 0.46 | 0.62 | 0.27 | 0.89 | 0.41 | 0.52 |
| Eps4 | LLaVA-OneVision-7B | 100.00% | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.50 |
| Eps4 | Qwen2.5-VL-7B | 0.00% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Eps4 | LLaVA-NeXT-Video-7B | 52.00% | 1.00 | 0.52 | 0.68 | 0.00 | 0.00 | 0.00 | 0.34 |
| Eps5 | LLaVA-OneVision-7B | 86.00% | 0.86 | 1.00 | 0.92 | 0.00 | 0.00 | 0.00 | 0.46 |
| Eps5 | Qwen2.5-VL-7B | 14.00% | 0.00 | 0.00 | 0.00 | 0.14 | 1.00 | 0.25 | 0.12 |
| Eps5 | LLaVA-NeXT-Video-7B | 54.00% | 0.81 | 0.60 | 0.69 | 0.06 | 0.14 | 0.08 | 0.39 |
| Eps6 | LLaVA-OneVision-7B | 74.00% | 0.74 | 1.00 | 0.85 | 0.00 | 0.00 | 0.00 | 0.43 |
| Eps6 | Qwen2.5-VL-7B | 26.00% | 0.00 | 0.00 | 0.00 | 0.26 | 1.00 | 0.41 | 0.21 |
| Eps6 | LLaVA-NeXT-Video-7B | 90.00% | 0.94 | 0.92 | 0.93 | 0.79 | 0.85 | 0.81 | 0.87 |
| Eps7 | LLaVA-OneVision-7B | 62.00% | 0.62 | 1.00 | 0.77 | 0.00 | 0.00 | 0.00 | 0.39 |
| Eps7 | Qwen2.5-VL-7B | 38.00% | 0.00 | 0.00 | 0.00 | 0.38 | 1.00 | 0.55 | 0.28 |
| Eps7 | LLaVA-NeXT-Video-7B | 44.00% | 1.00 | 0.10 | 0.18 | 0.40 | 1.00 | 0.58 | 0.38 |

| Episode | Model | Text-Only F1(+) | Text-Only F1(−) | Text-Only Macro-F1 | Vision-Only F1(+) | Vision-Only F1(−) | Vision-Only Macro-F1 | Multimodal F1(+) | Multimodal F1(−) | Multimodal Macro-F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Eps1 | LLaVA-OneVision-7B | 0.79 | 0.65 | 0.72 | 0.85 | 0.00 | 0.43 | 0.97 | 0.92 | 0.95 |
| Eps1 | Qwen2.5-VL-7B | 0.00 | 0.41 | 0.21 | 0.00 | 0.41 | 0.21 | 0.00 | 0.41 | 0.21 |
| Eps2 | LLaVA-OneVision-7B | 0.67 | 0.61 | 0.64 | 0.84 | 0.00 | 0.42 | 0.91 | 0.82 | 0.87 |
| Eps2 | Qwen2.5-VL-7B | 0.00 | 0.44 | 0.22 | 0.00 | 0.44 | 0.22 | 0.00 | 0.44 | 0.22 |
| Eps3 | LLaVA-OneVision-7B | 0.63 | 0.45 | 0.54 | 0.90 | 0.00 | 0.45 | 0.81 | 0.58 | 0.70 |
| Eps3 | Qwen2.5-VL-7B | 0.00 | 0.31 | 0.15 | 0.00 | 0.31 | 0.15 | 0.00 | 0.31 | 0.15 |
| Eps4 | LLaVA-OneVision-7B | 0.53 | 0.00 | 0.26 | 1.00 | 0.00 | 0.50 | 0.78 | 0.00 | 0.39 |
| Eps4 | Qwen2.5-VL-7B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Eps5 | LLaVA-OneVision-7B | 0.77 | 0.47 | 0.62 | 0.92 | 0.00 | 0.46 | 0.93 | 0.70 | 0.81 |
| Eps5 | Qwen2.5-VL-7B | 0.00 | 0.25 | 0.12 | 0.00 | 0.25 | 0.12 | 0.00 | 0.25 | 0.12 |
| Eps6 | LLaVA-OneVision-7B | 0.88 | 0.75 | 0.82 | 0.85 | 0.00 | 0.43 | 0.96 | 0.89 | 0.92 |
| Eps6 | Qwen2.5-VL-7B | 0.05 | 0.42 | 0.24 | 0.00 | 0.41 | 0.21 | 0.00 | 0.41 | 0.21 |
| Eps7 | LLaVA-OneVision-7B | 0.56 | 0.67 | 0.61 | 0.77 | 0.00 | 0.38 | 0.73 | 0.75 | 0.74 |
| Eps7 | Qwen2.5-VL-7B | 0.00 | 0.55 | 0.28 | 0.00 | 0.55 | 0.28 | 0.00 | 0.55 | 0.28 |
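
To compare input configurations at a glance, the per-episode Macro-F1 values in the table above can be averaged per modality. The sketch below does this for LLaVA-OneVision-7B. The unweighted mean across episodes is an assumption made for illustration and is not a statistic reported in the source; the variable names are hypothetical, and the values are copied from the table.

```python
from statistics import mean

# Macro-F1 of LLaVA-OneVision-7B for Eps1..Eps7, copied from the table above.
macro_f1 = {
    "text_only": [0.72, 0.64, 0.54, 0.26, 0.62, 0.82, 0.61],
    "vision_only": [0.43, 0.42, 0.45, 0.50, 0.46, 0.43, 0.38],
    "multimodal": [0.95, 0.87, 0.70, 0.39, 0.81, 0.92, 0.74],
}

# Unweighted mean over the seven episodes (each episode has 50 samples).
summary = {modality: round(mean(scores), 2) for modality, scores in macro_f1.items()}
print(summary)

# Number of episodes in which the multimodal input beats both unimodal inputs.
wins = sum(
    m > max(t, v)
    for t, v, m in zip(macro_f1["text_only"], macro_f1["vision_only"], macro_f1["multimodal"])
)
print(wins)
```

Under this averaging, the multimodal configuration comes out at roughly 0.77, against 0.60 for text-only and 0.44 for vision-only. Because Eps4 contains no negative samples, its Macro-F1 values are capped at 0.50 and pull all three means down.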
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Hanafiah, A.; Monika, W.; Nasution, A.H.; Onan, A.; Murakami, Y.; Nasution, H.O. Zero-Shot Multimodal Sentiment Analysis Using LVLMs as a Triage Signal for Video Platform Moderation. Digital 2026, 6, 40. https://doi.org/10.3390/digital6020040

