Zero-Shot Multimodal Sentiment Analysis Using LVLMs as a Triage Signal for Video Platform Moderation

Hanafiah, Anggi; Monika, Winda; Nasution, Arbi Haza; Onan, Aytuğ; Murakami, Yohei; Nasution, Hafiza Oktasia

doi:10.3390/digital6020040

by Anggi Hanafiah ¹,*, Winda Monika ², Arbi Haza Nasution ¹, Aytuğ Onan ³, Yohei Murakami ⁴ and Hafiza Oktasia Nasution ⁵

¹ Department of Informatics Engineering, Universitas Islam Riau, Pekanbaru 28284, Indonesia

² Department of Library Information, Universitas Lancang Kuning, Pekanbaru 28266, Indonesia

³ Department of Computer Engineering, Faculty of Engineering, Izmir Institute of Technology, Izmir 35430, Turkey

⁴ Faculty of Information Science and Engineering, Ritsumeikan University, Ibaraki, Osaka 567-8570, Japan

⁵ Department of Management, Universitas Riau, Pekanbaru 28292, Indonesia

* Author to whom correspondence should be addressed.

Digital 2026, 6(2), 40; https://doi.org/10.3390/digital6020040

Submission received: 18 February 2026 / Revised: 8 May 2026 / Accepted: 13 May 2026 / Published: 16 May 2026

Abstract

Children increasingly consume online video content, creating a growing need for scalable approaches to support content moderation workflows. However, directly identifying harmful or policy-violating content, such as violence, sexual content, or self-harm, remains a complex task that typically requires specialized classifiers and domain-specific annotations. In this context, sentiment analysis can provide complementary information by capturing affective signals expressed through language and visual cues. This study does not treat sentiment polarity as a direct indicator of unsafe or policy-violating content. Instead, it explores multimodal sentiment analysis as an auxiliary triage signal that may help prioritize content for human review or identify segments requiring further inspection. This paper investigates the feasibility of using large vision–language models (LVLMs) for zero-shot multimodal sentiment analysis on utterance-aligned video segments. We evaluate two LVLMs, LLaVA-OneVision-7B and Qwen2.5-VL-7B, under three input settings: text-only, vision-only, and multimodal, using a conversational TV-series dataset consisting of short utterance-level video segments and transcripts. The results show that multimodal sentiment inference can provide useful screening signals without task-specific fine-tuning, although the benefits are model-dependent. LLaVA-OneVision-7B consistently outperforms Qwen2.5-VL-7B and benefits more clearly from combining textual and visual inputs, whereas Qwen2.5-VL-7B shows limited improvement across modality settings. We also analyze the trade-off between frame sampling and image resolution. Finally, we discuss limitations related to dataset scope, annotation subjectivity, class imbalance, and the need for broader validation before real-world deployment.

Keywords: large vision–language models; zero-shot inference; multimodal sentiment analysis; video moderation triage; online aggression; violence prevention

1. Introduction

Sentiment analysis has become a crucial area in natural language processing (NLP), widely applied in opinion mining, customer feedback analysis, and emotion detection [1]. Traditional sentiment analysis has been performed using textual data [2], but with the increasing prevalence of multimodal content, there is a growing need for models that can process multiple modalities simultaneously [3]. Multimodal sentiment analysis (MSA) extends conventional sentiment analysis by incorporating multiple modalities, such as text, frames, and videos, to capture richer contextual and emotional information and to improve sentiment classification and prediction [4]. This is critical because human emotions are often expressed not just through language, but also through visual signals such as facial expressions, gestures, and body language. Several studies have attempted to integrate multiple modalities in multimodal sentiment analysis, employing techniques such as the Text-Guided Multimodal Fusion Module (TGMFM) and Dynamic Gating Module (DGM) in the DWGCI model [5,6,7], as well as masking techniques [8], domain adaptation [9], and cross-modal attention with a difference loss to fuse information from text and images [10]. These efforts also include the use of external knowledge, innovative techniques [11], explainability and visualization tools [12], and contributions to datasets and benchmarks [13]. However, these studies still face significant challenges, such as the risk of overfitting on small datasets due to complex model architectures and a heavy reliance on high-quality labeled data for effective domain adaptation.

In addition, they struggle with noisy or conflicting input across modalities and depend on extensive external knowledge, which may not always be accessible or relevant. To address these limitations, the potential of large vision–language models (LVLMs) is systematically explored. LVLMs are trained on large-scale multimodal data, enabling them to learn rich cross-modal representations and perform vision–language tasks within a unified framework. Their end-to-end alignment of text and visual information can improve generalization compared to modular pipelines that rely on separately engineered features.

Multimodal sentiment analysis is conducted using LLaVA-OneVision-7B and Qwen2.5-VL-7B. LLaVA-OneVision-7B is a vision–language model that builds upon CLIP and LLaMA architectures, enabling it to process textual and visual data within a shared embedding space [14]. Meanwhile, Qwen2.5-VL-7B employs cross-modal attention mechanisms, allowing it to dynamically align information from text and frames for more accurate sentiment inference [15,16,17]. Both models are tested in a zero-shot setting, in which they classify sentiment without fine-tuning.

This work aims to evaluate the effectiveness of LLaVA-OneVision-7B and Qwen2.5-VL-7B in classifying sentiment from Indonesian TV-series videos distributed via YouTube, segmented into utterances with aligned transcripts and uniformly sampled frames. This paper focuses on multimodal sentiment as an auxiliary signal for content screening. Specifically, it examines binary sentiment polarity classification (affective valence), aiming to determine whether an utterance conveys a positive or negative orientation, rather than identifying distinct emotion categories. While this study is motivated by the broader context of safer online video consumption, it does not aim to directly detect harmful or policy-violating content. Instead, we position multimodal sentiment polarity as an auxiliary signal that may complement existing moderation systems. Sentiment analysis captures affective cues expressed through language and visual context, which may help identify segments that warrant further inspection.

It is important to note that sentiment polarity is not equivalent to safety classification. Harmful content categories such as violence, self-harm, or sexual content require specialized detection models. Furthermore, this study is conducted on a dataset derived from a single Indonesian TV series and should therefore be interpreted as an exploratory case study in a controlled setting. The goal is not to establish generalizable conclusions about content moderation or multimodal sentiment robustness, but rather to evaluate the feasibility of using LVLMs to extract multimodal sentiment signals in a zero-shot setting. Within this scope, sentiment predictions may serve as a preliminary triage signal that could support, but not replace, broader moderation workflows.

The contributions are as follows: (i) a reproducible pipeline for constructing utterance-aligned multimodal samples from videos using uniformly sampled frames and transcripts; (ii) a comprehensive multimodal evaluation across text, vision, and combined settings; (iii) a zero-shot evaluation comparing two LVLMs under a fixed prompting protocol for multimodal sentiment classification; and (iv) an aggregation heuristic that summarizes utterance-level sentiment predictions into a video-level score, intended as an auxiliary screening signal for content analysis rather than a direct safety classification mechanism. We focus on binary sentiment polarity classification (affective valence), where the objective is to determine whether an utterance expresses a positive or negative orientation, rather than identifying discrete emotion categories.

2. Related Works

2.1. Video Content Classification

Recent studies on video content classification have focused on identifying harmful or inappropriate material, particularly in the context of online platforms and child safety. These approaches typically involve supervised learning models trained to detect specific categories such as violence, explicit content, or self-harm. Convolutional neural networks (CNNs), transformer-based architectures, and multimodal fusion techniques have been widely used to classify video content based on visual and textual features [18,19,20].

While these methods are effective for predefined categories, they often require large annotated datasets and domain-specific labels, which may not generalize well across different contexts. In addition, such systems are typically designed for direct classification of harmful content, rather than providing intermediate signals that can support human moderation workflows. This limitation motivates the exploration of complementary signals, such as sentiment polarity, which may capture affective cues present in multimodal content.

2.2. Text-Based Sentiment Analysis

Traditional text-based sentiment analysis is a crucial component of natural language processing (NLP) that aims to interpret and classify emotions expressed in textual data. This process is essential for applications such as customer feedback analysis, social media monitoring, and public opinion tracking.

Early text sentiment analysis approaches primarily relied on dictionary-based methods, where words were assigned sentiment scores based on predefined lexicons [21,22]. These methods evolved to incorporate contextual relationships between words. For instance, some studies predicted sentiment by using adjectives as indicators of positive or negative polarity [23,24]. Subsequent research refined this approach by introducing the Semantic Orientation Calculator (SO-CAL), which uses lexicons and takes into account reinforcement and disconfirmation factors to improve the accuracy of sentiment classification [25].

With the development of machine learning, Pang et al. [26] conducted early sentiment analysis research that focused on unimodal data representation for textual sentiment classification, introducing machine learning methods including Naive Bayes [27], Support Vector Machine [28], and the Maximum Entropy classifier.

Among deep learning approaches, Kim et al. [29] applied a Convolutional Neural Network to sentiment classification. Tai et al. [30] considered the structure of complex text features and introduced Tree-LSTM for sentence sentiment classification. Yue et al. [31] proposed a hybrid model called Word2vec-BiLSTM-CNN, which utilises the capabilities of convolutional neural networks and Bidirectional Long Short-Term Memory to capture two-way dependencies in text.

Recent advances in large models have improved sentiment analysis performance with word embedding models and pre-trained models. Word embedding techniques such as Word2Vec [32] and GloVe [33] map words into high-dimensional vector spaces based on semantic similarity. BERT further advances sentiment classification by utilising self-attention mechanisms to capture long-range dependencies and nuanced contextual meanings [34].

2.3. Visual-Based Sentiment Analysis

Visual-based sentiment analysis leverages machine learning and deep learning techniques to interpret emotions and sentiments from visual data. Visual sentiment analysis has evolved significantly, especially in its ability to infer emotions from images. The initial approach relied on low-level visual features such as colour and texture [35].

The integration of deep learning models such as CNNs [36] for visual-based sentiment analysis has also gained traction, enabling the extraction of more abstract and high-level emotional features. Yang et al. [37] developed a multi-task framework to optimise the recognition of visual emotions by considering several emotions in an image. Transfer learning has further improved visual sentiment analysis by selectively focusing on relevant image regions [38]. These models are designed to capture intricate details and contextual information, enabling more accurate sentiment classification [39,40].

2.4. Multimodal Sentiment Analysis and Aggregation Approaches

Multimodal sentiment analysis (MSA) extends traditional text-based sentiment analysis by integrating multiple modalities such as text, images, and video to capture richer affective signals. This integration enables more comprehensive sentiment inference by leveraging complementary and shared information across modalities [41]. Prior research has demonstrated that multimodal fusion techniques, such as cross-modal attention, hierarchical fusion, and feature-level integration, can significantly improve sentiment classification performance compared to unimodal approaches.

Early work in multimodal sentiment analysis primarily focused on feature extraction and fusion strategies. For example, cross-attention-based models integrate textual and visual features to capture interactions between modalities, improving classification accuracy in multimodal datasets [42]. Similarly, hierarchical and dynamic fusion approaches have been proposed to model temporal and contextual dependencies across utterances, enabling better representation of multimodal signals in sequential data [43].

Beyond feature fusion, recent studies have explored aggregation mechanisms that summarize instance-level predictions into higher-level representations. Such approaches are particularly relevant in scenarios involving multi-utterance or long-form content, where sentiment is distributed across segments rather than expressed in a single instance. For example, slice-based aggregation methods group temporally correlated segments and combine them into higher-level representations to improve sentiment consistency over time [43]. These methods highlight the importance of capturing temporal structure and contextual coherence in multimodal sentiment analysis.

In large-scale sentiment analysis frameworks, aggregation is also used to summarize fine-grained predictions into document-level or corpus-level representations. For instance, the SiAraSent framework integrates multiple feature representations and deep learning models to produce robust sentiment predictions at scale, demonstrating how aggregation can support large-scale sentiment modeling pipelines [44]. However, such systems typically rely on extensive training data and are designed for deployment in well-defined domains.

Despite these advances, most existing aggregation approaches are tightly coupled with supervised learning settings and domain-specific datasets. Their applicability to zero-shot multimodal settings, particularly using large vision–language models (LVLMs), remains underexplored. Furthermore, aggregation mechanisms in prior work are often optimized for prediction accuracy rather than interpretability or exploratory analysis.

In contrast, this study adopts a lightweight aggregation strategy that summarizes utterance-level sentiment predictions into a video-level representation without additional training. Unlike prior work that aims to build deployable sentiment systems, the proposed aggregation is intended as an exploratory analytical tool. It does not assume alignment with moderation policies or safety labels, but instead provides a simplified representation of how sentiment signals are distributed across video content. This distinction is important to avoid overinterpretation and to position the contribution within an experimental and analytical context rather than an operational one.

2.5. Large Vision–Language Models

The recent emergence of LLMs has introduced a new paradigm in sentiment analysis. LLMs such as GPT-4 [45], LLaMA [46], Qwen [47], and DeepSeek [48] have demonstrated superior performance in NLP tasks and have been effectively used for sentiment analysis. These models can process vast amounts of unstructured text data, providing insights into customer sentiments and preferences without the need for extensive training or labeled datasets. Recent studies have compared human and LLM-generated annotations across multilingual NLP tasks, revealing that LLMs can produce competitive annotations in simpler tasks like sentiment analysis [49]. Recent studies have shown that LLMs outperform traditional deep learning models in sentiment analysis, particularly in cases where domain adaptation and contextual understanding are crucial. LLMs have been applied across various domains, including tourism [50], finance [51], hospitality [52,53], and Quranic studies [54], demonstrating their versatility and effectiveness in sentiment analysis tasks.

Large Language Models (LLMs) are increasingly incorporated into multimodal frameworks that jointly process text, audio, and visual inputs for sentiment analysis, enabling more comprehensive modeling of human affective expression. Recent studies demonstrate that integrating multiple modalities helps address limitations of text-only approaches, particularly in capturing ambiguity, sarcasm, and contextual nuances, thereby improving sentiment prediction accuracy and robustness [55]. For instance, multimodal LLM-based frameworks that fuse textual, visual, and audio features have shown enhanced performance in extracting nuanced emotional signals across diverse applications [55]. Similarly, plugin-enhanced architectures further improve multimodal understanding by aligning cross-modal representations and incorporating structured knowledge, allowing LLMs to better capture complex relationships between modalities and refine sentiment interpretation at a more granular level [56].

Large Vision–Language Models (LVLMs) are an extension of Large Language Models (LLMs) capable of processing and understanding multimodal data, such as text and images [57,58]. Unlike conventional NLP models that only handle textual input, LVLMs feature architectures that integrate visual understanding with natural language processing. These models rely on techniques such as cross-modal attention and shared embedding spaces to link textual representations with visual features from images or video frames.

One of the key advantages of LVLMs in multimodal sentiment analysis is their ability to capture emotional nuances that cannot be identified solely through text. For example, in a video, the dialogue may appear neutral or even positive, but facial expressions and body language may convey a different emotional tone. By combining text and visual processing, LVLMs can provide a more comprehensive sentiment analysis.

Compared to traditional supervised multimodal models, LVLMs offer improved generalization and reduced dependence on labeled datasets, making them particularly suitable for zero-shot evaluation scenarios. This capability is important in contexts where annotated multimodal data is limited or expensive to obtain. LVLMs, including Qwen2.5-VL-7B and LLaVA-OneVision-7B, are limited to video understanding through visual analysis and lack support for audio processing. This constraint arises from their architecture and the use of vision encoders, which extract information solely from video frames without any mechanisms for interpreting or processing audio content [14,17]. To incorporate audio understanding in video analysis, models with audio capabilities, such as OpenAI Whisper, would need to be integrated.

However, challenges remain in using LVLMs, including high computational demands and limitations in understanding complex visual contexts. Models like LLaVA-OneVision-7B and Qwen2.5-VL-7B still rely on available training data, meaning their performance can vary depending on the type and quality of the data used. Additionally, interpreting sentiment from frames can be ambiguous if there is no strong correlation between textual and visual elements. To address these challenges, further research in fine-tuning techniques and multimodal dataset enhancements is needed to improve model accuracy and robustness across various scenarios.

3. Materials and Methods

Multimodal sentiment polarity classification is achieved by integrating text and image data from YouTube videos using LLaVA-OneVision-7B and Qwen2.5-VL-7B. The methodology consists of data preprocessing, model evaluation under a unified zero-shot prompting protocol, and performance analysis using a confusion matrix.

Multimodal inputs consist of textual transcripts and uniformly sampled frames extracted from video data. Since LVLMs cannot directly process raw video streams, each video is first segmented into utterance-level clips, as illustrated in Figure 1. For each utterance segment, we extract (i) a transcript using Whisper and (ii) a set of uniformly sampled frames representing the visual context.

The extracted transcripts and uniformly sampled frames are then provided to the LVLMs using fixed prompt templates for zero-shot sentiment classification. Specifically, Figure 2 presents the text-only prompt, Figure 3 presents the vision-only prompt, and Figure 4 presents the multimodal prompt that combines transcript and visual-frame inputs. Both models are evaluated under the same inference protocol, where each utterance segment is assigned a binary sentiment polarity label (positive or negative).

Although the overall evaluation setup is shared, minor model-specific preprocessing steps are applied to ensure compatibility with each model’s input requirements. For example, Qwen2.5-VL-7B requires image rescaling to match its expected input format, while LLaVA-OneVision-7B processes images in their original form. These differences do not alter the evaluation protocol but are necessary for proper model execution. The utterance-level predictions are subsequently aggregated into a video-level sentiment score, which is interpreted as an auxiliary screening signal for content analysis. Although multimodal cues may reflect underlying emotional expressions, both the model outputs and evaluation are strictly limited to binary sentiment polarity labels.
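As a rough illustration of the confusion-matrix analysis mentioned above, the following sketch derives the four confusion counts and a few common summary metrics from gold and predicted polarity labels. The function names, the label strings, and the choice to treat the negative class as the class of interest are assumptions made for illustration; they are not details taken from the study.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="negative"):
    """Return (tp, fp, fn, tn), treating `positive` as the class of interest."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = Counter()
    for gold, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if gold == positive else "fp"] += 1
        else:
            counts["fn" if gold == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def binary_metrics(y_true, y_pred, positive="negative"):
    """Accuracy, precision, recall, and F1 for the class of interest."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Because the dataset is skewed toward positive labels, per-class precision and recall are likely to be more informative than accuracy alone.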

3.1. Datasets

The dataset, sourced from the Indonesian TV series “Tetangga Masa Gitu” on YouTube, was segmented into 1452 utterances. The selected videos were chosen based on predefined criteria to ensure the presence of rich textual and visual content suitable for sentiment analysis. Specifically, this study uses seven episodes (Episodes 1–7) of the series. From an initial pool of 1452 utterances, a stratified random sampling strategy was applied by selecting 50 utterances per episode, resulting in a final dataset of 350 utterances that capture diverse conversational contexts and character interactions.
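The per-episode sampling described above can be sketched as follows. This is a minimal illustration, assuming each utterance is a dictionary with a hypothetical `episode` key; the fixed seed and the function name are likewise assumptions, since the study does not report these details.

```python
import random
from collections import defaultdict


def sample_per_episode(utterances, per_episode=50, seed=0):
    """Draw the same number of utterances at random from every episode."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_episode = defaultdict(list)
    for utt in utterances:
        by_episode[utt["episode"]].append(utt)

    sampled = []
    for episode in sorted(by_episode):
        pool = by_episode[episode]
        if len(pool) < per_episode:
            raise ValueError(
                f"episode {episode} has only {len(pool)} utterances")
        # Sampling without replacement inside each episode (stratum).
        sampled.extend(rng.sample(pool, per_episode))
    return sampled
```

With seven episodes and 50 utterances per episode, the function returns 350 records, matching the final dataset size reported here.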

To better understand the characteristics of the dataset, an exploratory data analysis (EDA) was conducted, as summarized in Table 1. The results show that the dataset is imbalanced toward positive sentiment, with positive labels dominating across most episodes. For instance, Episode 4 contains exclusively positive samples (100%), while other episodes exhibit varying proportions, with positive sentiment ranging from 62% to 86%. This imbalance reflects the natural conversational tone of the series, which tends to emphasize neutral-to-positive interactions.

In addition, the dataset consists of short conversational utterances, with an average transcript length of approximately 5.41 words per utterance, indicating concise dialogue typical of informal spoken interactions. From the visual perspective, each utterance corresponds to short video segments with an average duration of approximately 2.09 s, providing limited but meaningful visual cues such as facial expressions and gestures.
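Summary statistics of this kind (share of positive labels per episode, mean transcript length in words, mean segment duration) can be computed with a few lines of code. The sketch below assumes a hypothetical record layout with `episode`, `label`, `text`, `start`, and `end` fields and counts words by whitespace splitting; the study does not specify how its figures were computed.

```python
from collections import defaultdict
from statistics import mean


def summarize(utterances):
    """Per-episode positive share plus mean word count and duration."""
    per_episode = defaultdict(lambda: {"positive": 0, "negative": 0})
    for utt in utterances:
        per_episode[utt["episode"]][utt["label"]] += 1

    positive_share = {
        episode: counts["positive"] / (counts["positive"] + counts["negative"])
        for episode, counts in sorted(per_episode.items())
    }
    return {
        "positive_share": positive_share,
        # Word count by whitespace splitting (an assumption).
        "mean_words": mean(len(utt["text"].split()) for utt in utterances),
        # Duration in seconds from the start and end timestamps.
        "mean_duration": mean(utt["end"] - utt["start"] for utt in utterances),
    }
```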

This dataset was not specifically curated to represent malicious or policy-violating content. The purpose of using this dataset is not to directly model unsafe content, but to evaluate the ability of LVLMs to capture multimodal sentiment signals in realistic conversational contexts. The series provides a variety of emotional expressions and contextual interactions, which are useful for studying sentiment inference from a combination of textual and visual modalities. Speech was transcribed using OpenAI Whisper (base model) with default parameters, and the resulting transcripts were aligned with corresponding video segments using timestamp information. Each utterance was annotated with a binary sentiment polarity label (positive or negative) based on textual and visual cues. This dataset is designed to support multimodal sentiment analysis by combining information from two modalities: text (transcripts) and visual data (sampled frames). The videos vary in duration and are provided at a minimum resolution of 720p to ensure adequate visual quality. They include diverse facial expressions and gestures that can support sentiment interpretation.

3.2. Preprocessing

The preprocessing pipeline was implemented in Python 3.14.5. Audio transcription was performed using OpenAI Whisper (base model) configured for Indonesian, while frame extraction was conducted using PyAV. Text cleaning included the removal of empty segments, non-lexical utterances, incomplete fragments, and very short transcripts that were unlikely to carry meaningful sentiment cues. Video segmentation was based on start–end timestamps produced during transcription. For visual input, uniformly sampled frames were extracted at equal intervals from each utterance segment. The number of frames per segment was varied between 10 and 60 during the frame–resolution trade-off experiment, with the final selected configuration using 30 frames at 240p resolution.

3.2.1. Utterance Segmentation

Each video is segmented into utterance-level clips based on transcript time boundaries produced by Whisper. This segmentation is intended to reduce sentiment mixing within longer segments and improve alignment between textual and visual cues [59]. However, it does not guarantee that each utterance contains a single dominant sentiment, as mixed or ambiguous expressions may still occur in natural dialogue.
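To make the segmentation step concrete, the sketch below converts a transcription result into utterance records with time boundaries. It assumes a result dictionary shaped like the output of the open-source `whisper` package's `transcribe` call, whose `segments` entries carry `start`, `end`, and `text`; the record layout and the function name are hypothetical.

```python
def segments_to_utterances(result, video_id):
    """Turn Whisper-style segments into utterance records with boundaries."""
    utterances = []
    for seg in result.get("segments", []):
        utterances.append({
            "video_id": video_id,
            "index": len(utterances),    # position of the utterance in the video
            "start": float(seg["start"]),  # clip start time in seconds
            "end": float(seg["end"]),      # clip end time in seconds
            "text": seg["text"].strip(),   # transcript used as the text modality
        })
    return utterances
```

Each record can then be used both to cut the clip from the source video and to look up the matching transcript.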

3.2.2. Audio Transcription and Text Extraction

Audio was transcribed using OpenAI Whisper configured for Indonesian. We store the transcript together with start–end timestamps to preserve alignment with video frames. The transcript is used as the text modality in multimodal sentiment inference.

3.2.3. Data Cleaning

We remove utterances that are unlikely to carry meaningful sentiment cues. Specifically, we exclude segments with extremely short transcripts, incomplete fragments, or non-lexical content (e.g., noise-only segments). This step reduces input noise and improves the reliability of sentiment labeling.
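A filter of this kind might look like the sketch below. The specific rules are assumptions for illustration only: the minimum word count, the treatment of bracketed tags as noise, and the use of a trailing ellipsis as a proxy for an incomplete fragment are not taken from the study, which does not report its exact thresholds.

```python
import re

# Empty or punctuation-only transcripts.
NON_LEXICAL = re.compile(r"^\W*$")
# Bracketed tags such as "[musik]" or "(tertawa)" (hypothetical noise markers).
BRACKETED = re.compile(r"^[\[(][^\])]*[\])]$")


def is_usable(text, min_words=2):
    """Return True if a transcript is likely to carry a sentiment cue."""
    stripped = text.strip()
    if NON_LEXICAL.match(stripped) or BRACKETED.match(stripped):
        return False
    # A trailing ellipsis is treated as a sign of an incomplete fragment.
    if stripped.endswith(("...", "…")):
        return False
    # Very short transcripts are dropped; the threshold is an assumption.
    return len(stripped.split()) >= min_words
```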

3.2.4. Frame Extraction

Since LVLMs cannot process raw video directly, we extract uniformly sampled frames for each utterance-level clip. Given a clip interval $[t_s, t_e]$, we sample $N$ frames uniformly across the interval to preserve temporal coverage while controlling computation. We evaluate multiple values of $N$ (Section 4.1) and adjust image resolution to fit GPU memory constraints.
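One way to realize uniform sampling is to take the midpoint of each of N equal sub-intervals of the clip, as in the sketch below. This midpoint scheme is an assumption; the text does not state whether frames are taken at midpoints, at interval boundaries, or by another rule, and the function name is hypothetical.

```python
def uniform_timestamps(t_start, t_end, n_frames):
    """Return n_frames timestamps spread evenly over [t_start, t_end]."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    if t_end <= t_start:
        raise ValueError("t_end must be greater than t_start")
    step = (t_end - t_start) / n_frames
    # Midpoint of each sub-interval, so no frame sits exactly on a boundary.
    return [t_start + (i + 0.5) * step for i in range(n_frames)]
```

For the reported average segment duration of about 2.09 s and N = 30, this scheme would place frames roughly 0.07 s apart. The returned timestamps would then be passed to a video decoder to fetch the nearest frames.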

3.3. Sentiment Annotation

In this study, the task is strictly defined as binary sentiment polarity classification (affective valence), where each utterance is labeled as either positive or negative. Although emotion-related cues such as happiness, anger, or sadness may be considered during the annotation process, these cues are used solely as interpretative guidance to help the annotator understand the overall affective tone. Emotion categories are not part of the label space, and the model is not trained or evaluated on emotion classification. In the experimental setup, emotion-related information is optionally incorporated at the prompting level to enrich contextual understanding, but the final prediction remains a binary sentiment polarity label. This distinction is maintained throughout the study to ensure conceptual clarity between sentiment polarity and emotion recognition.

Sentiment labels serve as gold-standard references for evaluating LVLM predictions produced in a zero-shot inference setting. The annotation process in this study was conducted by a single annotator following a consistent labeling guideline, with the primary objective of ensuring internal consistency across the dataset in an exploratory setting. To mitigate subjectivity, the annotation guideline emphasizes observable cues from both textual content and visual context, and utterances with ambiguous or insufficient information were minimized during preprocessing.

It is important to note that the models receive only textual transcripts and visual inputs (uniformly sampled frames), without access to audio signals. During annotation, the annotator may consider multimodal cues such as facial expressions and contextual language, and in some cases vocal characteristics inferred from the video context. This introduces a potential modality mismatch between annotation and model input, which may contribute to evaluation noise, as certain cues available to the annotator are not accessible to the model.

Despite these measures, sentiment labeling remains inherently subjective. Therefore, the resulting annotations should be interpreted as a single-annotator reference rather than a consensus-based ground truth. While this design supports controlled comparison of model behavior under a fixed annotation condition, it does not capture the variability of human interpretation, which should be addressed in future work through multi-annotator protocols and agreement analysis.

3.4. Video-Level Sentiment Aggregation

To extend utterance-level predictions into a higher-level representation, we introduce a simple aggregation mechanism that summarizes sentiment outputs across all utterances within a video. Each utterance is assigned a binary sentiment label (positive or negative), and the video-level sentiment score is computed as the proportion of utterances classified as negative.

Formally, let $U = \{u_{1}, u_{2}, \dots, u_{n}\}$ denote the set of utterances in a video, and let $y_{i} \in \{0, 1\}$ represent the predicted sentiment label for utterance $u_{i}$, where 1 indicates negative sentiment. The aggregated video-level score $S$ is defined as:

$$S = \frac{1}{n} \sum_{i = 1}^{n} y_{i} \qquad (1)$$

This formulation provides a continuous measure reflecting the relative prevalence of negative sentiment within a video. The aggregation is computationally simple and enables a compact summary of multimodal sentiment signals derived from utterance-level predictions.
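The aggregation in Equation (1) amounts to the mean of the binary labels. A minimal sketch follows, assuming labels are already encoded as integers with 1 for negative and 0 for positive; the function name is hypothetical.

```python
def video_score(labels):
    """Proportion of utterances predicted negative (label 1) in one video."""
    if not labels:
        raise ValueError("a video needs at least one utterance")
    for y in labels:
        if y not in (0, 1):
            raise ValueError(f"labels must be 0 or 1, got {y!r}")
    # S = (1 / n) * sum(y_i), as in Equation (1).
    return sum(labels) / len(labels)
```

For example, a video with four utterances of which two are predicted negative yields S = 0.5, while a video with no negative predictions yields S = 0.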

It is important to emphasize that this aggregation mechanism is intended as a conceptual illustration rather than an operational decision-making tool. The resulting score does not directly correspond to content safety categories such as violence, self-harm, or policy violations. Instead, it reflects only the distribution of sentiment polarity as inferred by the model.

In this work, the aggregated score is used to explore how multimodal sentiment patterns may be represented at the video level. No predefined thresholds (e.g., allow, review, restrict) are applied, and the score is not interpreted as an actionable moderation signal. Any practical deployment of such an aggregation in content moderation workflows would require careful calibration against domain-specific ground truth, such as platform policies, human review decisions, or labeled moderation datasets.

The aggregation approach is conceptually related to prior sentiment modeling frameworks that summarize instance-level predictions into higher-level representations. However, in contrast to large-scale sentiment systems designed for deployment, the present work focuses on exploratory analysis in a controlled experimental setting. As such, the aggregated sentiment score should be interpreted as an auxiliary analytical signal that may support further investigation of video content, rather than as a standalone indicator of content safety or policy compliance.

3.5. Models

We use two large vision–language models (LVLMs), LLaVA-OneVision-7B and Qwen2.5-VL-7B, to perform sentiment analysis based on information from text and frames. These two models were chosen for their ability to relate the text and visual modalities and to operate in a zero-shot setting without task-specific training.

LLaVA-OneVision-7B: A multimodal model developed with a combination of CLIP (Contrastive Language-Image Pretraining) and LLaMA [14]. This model is designed to relate text and frames by mapping them into a shared embedding space, which helps it capture sentiment cues from both modalities.
LLaVA-OneVision-7B is used to process text extracted from video and visual information from uniformly sampled frames obtained from utterance segmentation. This model was tested in a zero-shot setting, where it was used directly without additional training, to evaluate its ability to recognize sentiment in previously unseen data.
Qwen2.5-VL-7B: A vision–language model with more sophisticated cross-modal attention capabilities for connecting information from text and frames [17]. This model was developed to handle a variety of natural language processing tasks with stronger visual comprehension.

Qwen2.5-VL-7B processes text and image data differently from LLaVA-OneVision-7B. One of the main differences is the preprocessing requirement: this model requires image inputs to be rescaled to fit the accepted format. This model was also tested in a zero-shot setting to ensure a fair comparison.

3.6. Experimental Setup

To systematically evaluate the effectiveness of multimodal sentiment analysis using Large Vision–Language Models (LVLMs), a series of experiments were conducted under a zero-shot inference setting. The experimental design aims to analyze how different modalities—text-only, vision-only, and multimodal inputs—affect sentiment classification performance across multiple episodes. Rather than focusing solely on overall accuracy, the experiments are designed to examine model robustness, modality-specific behavior, and class balance under realistic conversational video scenarios. In addition, the study compares several model families, including text-based Large Language Models (LLMs) and multimodal LVLMs, to better understand the contribution of visual information in sentiment inference. Since the proposed framework is positioned as an exploratory multimodal sentiment analysis approach and auxiliary triage-oriented signal rather than a fully supervised moderation system, all evaluations are performed without task-specific fine-tuning using a consistent prompt-based zero-shot protocol.

Computational Resource: This research leverages cloud computing resources powered by an A100 GPU with 40 GB of VRAM and 83 GB of system RAM.
Unified Experimental Design for Multimodal Input Configurations: The experimental design includes three input configurations: (i) text-only, (ii) vision-only using uniformly sampled frames, and (iii) multimodal input combining both text and visual information. All configurations follow the same zero-shot prompting protocol, ensuring consistency across experiments.
Frame-Resolution Trade-off: To investigate the impact of temporal and visual resolution on prediction performance, we designed an experiment that varied the number of video frames (n-frames) used as input while adjusting the image resolution to manage computational complexity. To determine a suitable number of frames, we conducted a preliminary experiment using 50 randomly sampled videos from each episode to represent the overall data. The number of frames was increased in increments of 10 frames per sample. Due to increasing memory and processing demands with higher frame counts, the image resolution was reduced from 360p to 240p, and eventually to 144p. The models were evaluated under the same inference protocol and prompting settings across all configurations to ensure a fair comparison. Accuracy was recorded for each configuration to assess the trade-off between temporal information and image quality under resource constraints.
Zero-Shot: The models are used directly without additional training to measure how well they can recognize sentiment from pretraining alone. Each model receives text and frames as input and produces a prediction of positive or negative sentiment. Because no training or fine-tuning is performed, the entire dataset is used for evaluation rather than being divided into training and test subsets. We conducted two experiments to evaluate LLaVA-OneVision-7B and Qwen2.5-VL-7B in multimodal sentiment analysis. Experiment 1 tests LLaVA-OneVision-7B with raw text and image inputs extracted from video, evaluating how well the model leverages its pretrained multimodal understanding to infer sentiment. Experiment 2 applies the same setup to Qwen2.5-VL-7B, analyzing its ability to process rescaled video frames and accompanying text through its cross-modal attention framework. These experiments provide insight into the initial performance of each model when applied to previously unseen multimodal data.
Unimodal Baselines: We include several additional baseline configurations to enable a comprehensive evaluation. First, for the text-only baselines, we evaluate multiple large language models, including Qwen3.5-4B and Phi-3.5-mini, using only transcribed text inputs. This setting is designed to assess the contribution of linguistic information independently of any visual cues. Second, for the video-only baselines, we evaluate LLaVA-NeXT-Video-7B as a dedicated visual model that operates solely on sampled video frames. This configuration allows us to examine the extent to which sentiment can be inferred from visual information alone. Finally, we consider unimodal variants of LVLMs, where the models are evaluated in both text-only and video-only modes. This analysis aims to provide further insight into modality-specific behavior within the same model architecture.
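The three input configurations and the binary output format can be sketched as follows. This is an illustrative outline only: the instruction text, the `Segment` class, and the functions `build_input` and `parse_label` are hypothetical names introduced here, and the prompt actually used in the study is not reproduced.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical instruction text; the study's actual prompt is not reproduced here.
INSTRUCTION = (
    "Classify the sentiment of this segment. "
    "Answer with one word: Positive or Negative."
)


@dataclass
class Segment:
    """One utterance-level segment: its transcript and its sampled frames."""
    transcript: str
    frame_paths: List[str] = field(default_factory=list)


def build_input(segment: Segment, mode: str) -> Dict[str, object]:
    """Assemble the model input for 'text', 'vision', or 'multimodal' mode."""
    if mode not in ("text", "vision", "multimodal"):
        raise ValueError(f"unknown mode: {mode}")
    prompt = INSTRUCTION
    if mode in ("text", "multimodal"):
        prompt += f"\nTranscript: {segment.transcript}"
    frames = list(segment.frame_paths) if mode in ("vision", "multimodal") else []
    return {"prompt": prompt, "frames": frames}


def parse_label(raw_output: str) -> Optional[str]:
    """Map free-form model output to a label; None if it is ambiguous."""
    text = raw_output.strip().lower()
    has_positive = "positive" in text
    has_negative = "negative" in text
    if has_positive == has_negative:  # neither label, or both
        return None
    return "Positive" if has_positive else "Negative"
```

Keeping the instruction identical across modes and changing only which inputs are attached is one way to ensure that differences between configurations reflect the modality rather than the prompt.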

3.7. Evaluation Metrics

A confusion matrix was used to evaluate model performance. The confusion matrix is an evaluation method commonly used in classification tasks to measure model performance by comparing model predictions against actual labels [60]. It displays the number of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), which are then used to calculate other evaluation metrics, such as accuracy, precision, recall, and F1-score. The confusion matrix provides a detailed overview of how the model classifies sentiment in the dataset, enabling a more in-depth analysis of classification errors. With positive sentiment treated as the positive class, a high number of False Positives (FP) means that the model too often classifies negative sentiment as positive, which leads to under-detection of the more critical negative sentiment. Conversely, a high number of False Negatives (FN) means that the model labels many positive utterances as negative.

From the confusion matrix, we calculate accuracy as the ratio of the number of correct predictions to the total number of test samples. In addition, we use precision, which measures the proportion of correct positive predictions compared to all positive predictions, as well as recall, which measures the extent to which the model can capture all samples in a particular sentiment class. To balance precision and recall, we also calculate the F1-score, which is the harmonic mean of these two metrics.
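The metrics described above can be derived from the four confusion-matrix counts as in the following sketch. It assumes that positive sentiment is the positive class, as the False Positive description above implies, and that a metric with a zero denominator is reported as 0.0; both the function name `binary_report` and the example counts are hypothetical. Macro-F1, used in the results, is the unweighted mean of the two class-wise F1-scores.

```python
from typing import Dict


def _safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 when the denominator is zero (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


def binary_report(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, class-wise precision/recall/F1, and Macro-F1 from counts."""
    # Positive class: TP are correct positives, FP are negatives called positive.
    precision_pos = _safe_div(tp, tp + fp)
    recall_pos = _safe_div(tp, tp + fn)
    # Negative class: TN are correct negatives, FN are positives called negative.
    precision_neg = _safe_div(tn, tn + fn)
    recall_neg = _safe_div(tn, tn + fp)
    f1_pos = _safe_div(2 * precision_pos * recall_pos, precision_pos + recall_pos)
    f1_neg = _safe_div(2 * precision_neg * recall_neg, precision_neg + recall_neg)
    return {
        "accuracy": _safe_div(tp + tn, tp + tn + fp + fn),
        "precision_pos": precision_pos,
        "recall_pos": recall_pos,
        "f1_pos": f1_pos,
        "precision_neg": precision_neg,
        "recall_neg": recall_neg,
        "f1_neg": f1_neg,
        "macro_f1": (f1_pos + f1_neg) / 2,
    }


# Hypothetical counts for illustration only.
report = binary_report(tp=40, tn=30, fp=10, fn=20)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```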

3.8. Reproducibility and Resources

All computational resources and experimental configurations are documented to ensure transparency and reproducibility. All inference runs were conducted in Google Colab Pro on a single NVIDIA A100 GPU (40 GB VRAM) with 83 GB system RAM, using CUDA 12.5. We evaluated two large vision–language models, LLaVA-OneVision-7B and Qwen2.5-VL-7B, in a zero-shot setting without any fine-tuning or task-specific customization, consistent with the study’s evaluation protocol. All models were loaded and executed via the Hugging Face Transformers library. Because LVLMs do not directly ingest raw video streams, each video was preprocessed via utterance-level segmentation, audio transcription using Whisper, text cleaning, and uniform frame sampling to form the visual representation. For visual input, we extracted 30 frames at 240p per segment using uniform sampling (frames evenly spaced across the segment duration), a configuration chosen to balance temporal coverage and visual quality under computational constraints. Each model received paired transcript text and uniformly sampled frames and was prompted to output a binary sentiment label, “Positive” or “Negative”, for evaluation. We did not apply additional decoding-parameter tuning (e.g., temperature or top-p), so the reported results reflect the default inference behavior of our Transformers-based implementation. All scripts for video data preprocessing, utterance segmentation, audio transcription, uniform frame sampling, sentiment annotation, zero-shot inference, and evaluation are publicly available in an online repository (https://github.com/anggihanafiah-ops/Multimodal-Sentiment-Analysis, accessed on 8 May 2026).
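Uniform sampling, in which frames are evenly spaced across the segment duration, can be sketched as an index computation. The bin-centre convention used below is one of several reasonable choices and is an assumption; the released scripts may select indices differently. The function name `uniform_frame_indices` is hypothetical.

```python
from typing import List


def uniform_frame_indices(total_frames: int, n_samples: int) -> List[int]:
    """Pick evenly spaced frame indices from a segment.

    The segment is split into equal-width bins and the frame at the centre
    of each bin is selected. Short segments return every frame once.
    """
    if total_frames <= 0 or n_samples <= 0:
        raise ValueError("total_frames and n_samples must be positive")
    n = min(n_samples, total_frames)
    step = total_frames / n
    return [int(step * (i + 0.5)) for i in range(n)]


# Hypothetical 300-frame segment sampled at 30 frames, as in the 30-frame setting.
indices = uniform_frame_indices(total_frames=300, n_samples=30)
print(indices[:3], indices[-1])
```

Because the spacing depends only on the segment length, this scheme does not guarantee that the most emotionally informative moment of an utterance is among the selected frames.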

4. Results

4.1. Exploring the Optimal n-Frames

The experimental results, illustrated in Figure 5, highlight a clear trade-off between prediction accuracy and image resolution as the number of frames increases. As the number of frames rises, computational complexity also increases, necessitating lower image resolutions to maintain efficiency.

The blue line in Figure 5 represents prediction accuracy, which peaks at 68% when using 30 frames at 240p resolution. This configuration achieves the best balance, as it provides sufficient temporal information while maintaining acceptable visual quality under computational constraints. Increasing the frame count beyond 30 offers more temporal context but requires reducing the resolution (e.g., 144p at 60 frames), which leads to a decline in accuracy due to the loss of critical visual details. Conversely, using fewer frames (e.g., 10 or 20) retains higher resolution but fails to provide enough temporal information, resulting in suboptimal performance.

The red line represents the resolution used at each frame count, which progressively decreases to manage computational load: 360p at 10 frames, 240p from 20 to 50 frames, and 144p at 60 frames. This adjustment reflects the necessary trade-off between maintaining resolution and increasing temporal granularity.

Thus, the optimal configuration is 30 frames at 240p resolution, where model performance is maximized without overly sacrificing visual quality. These findings emphasize the importance of balancing temporal and spatial information to achieve the best accuracy while maintaining computational efficiency.

4.2. Comparative Performance of the Models

This section presents a comprehensive comparison of all evaluated models across text-only, video-only, and multimodal configurations, based on the results shown in Table 2 and Table 3. The analysis highlights both model-specific behavior and modality contributions to sentiment classification performance.

4.2.1. Text-Only Performance (LLMs vs. LVLMs)

As shown in Table 2, large language models (LLMs) such as Qwen3.5-4B and Phi-3.5-mini demonstrate competitive performance across multiple episodes, confirming that textual information is a strong baseline for sentiment analysis. For example, Qwen3.5-4B achieves the highest accuracy in Episode 6 (86%), outperforming several LVLM configurations. However, despite strong accuracy, text-only models often exhibit imbalanced behavior, with precision and recall skewed toward a single class. This is reflected in cases where precision or recall reaches 1.00 for one class while remaining low for the other, indicating limited robustness in handling diverse sentiment expressions. LVLMs operating in text-only mode (e.g., LLaVA-OneVision-7B) achieve slightly lower peak accuracy than the best-performing LLMs but provide more balanced performance across classes, suggesting better generalization under zero-shot conditions.

4.2.2. Video-Only Performance (Visual Capability of LVLMs)

In the video-only setting, LVLM-based models show distinct behavior. As shown in Table 3, LLaVA-OneVision-7B achieves high recall for the positive class across all episodes (often reaching 1.00), resulting in relatively high accuracy (up to 100% in Episode 4). However, this performance is misleading, as the model consistently fails to capture negative sentiment, with F1(−) equal to 0.00 in most cases. This indicates a strong bias toward the dominant class. In contrast, LLaVA-NeXT-Video-7B demonstrates significantly improved performance in the video-only setting. It achieves the highest accuracy in several episodes (e.g., 86% in Episode 1 and 90% in Episode 6) and, more importantly, maintains non-zero F1 scores for both positive and negative classes. This results in substantially higher Macro-F1 scores (e.g., 0.81 in Episode 1 and 0.87 in Episode 6), indicating more balanced and reliable predictions. Qwen2.5-VL-7B, on the other hand, performs poorly across all video-only experiments, often producing trivial predictions with near-zero F1 scores for one class. This highlights significant differences in visual understanding capabilities across LVLM architectures.

4.3. Ablation Study on Input Modalities

Table 4 presents the ablation study comparing Text-Only, Vision-Only, and Multimodal configurations across all seven episodes. The evaluation focuses on class-wise F1-score and Macro-F1 to better assess the balance between positive and negative sentiment recognition. The results indicate that, for LLaVA-OneVision-7B, combining textual and visual modalities improves sentiment classification performance over either single modality in most episodes.

Overall, the Multimodal configuration achieved the highest performance in most episodes. As shown in Table 4, LLaVA-OneVision-7B obtained the best Macro-F1 scores under the Multimodal setting in Eps1 (0.95), Eps2 (0.87), Eps3 (0.70), Eps5 (0.81), Eps6 (0.92), and Eps7 (0.74). These results indicate that combining transcript information with visual context provides richer semantic and emotional cues for sentiment inference. The multimodal setting also consistently produced higher F1-scores for both positive and negative classes, suggesting improved balance in classification performance.

The Text-Only configuration showed relatively stable and balanced performance across episodes. For example, LLaVA-OneVision-7B achieved Macro-F1 scores of 0.72 in Eps1, 0.64 in Eps2, and 0.82 in Eps6. Compared to Vision-Only, the Text-Only setting produced more consistent negative sentiment detection, as reflected by higher F1(−) values across most episodes. This suggests that textual dialogue contains clearer sentiment indicators and contextual semantics that are important for recognizing conversational polarity.

In contrast, the Vision-Only configuration exhibited inconsistent performance despite occasionally achieving strong positive sentiment recognition. For instance, Vision-Only reached a perfect F1(+) score of 1.00 in Eps4 and high F1(+) values in Eps3 (0.90) and Eps5 (0.92). However, the negative sentiment performance frequently collapsed, with F1(−) values dropping to 0.00 in several episodes. Consequently, the Macro-F1 scores remained substantially lower than the Multimodal configuration, such as 0.43 in Eps1 and 0.42 in Eps2. These findings indicate that visual information alone is insufficient for robust sentiment understanding in conversational video content, particularly when negative emotions are subtle, implicit, or highly context-dependent.

Another important observation is the significant performance gap between LLaVA-OneVision-7B and Qwen2.5-VL-7B across all modality settings. Qwen2.5-VL-7B consistently produced very low F1(+) scores, frequently reaching 0.00 across Text-Only, Vision-Only, and Multimodal experiments. Although the model often achieved relatively higher F1(−) values, this behavior suggests a strong bias toward predicting the negative class. As a result, its Macro-F1 scores remained substantially lower than those of LLaVA-OneVision-7B in all episodes.

The ablation study highlights the complementary relationship between textual and visual information in multimodal sentiment analysis. Textual transcripts contribute contextual semantics and explicit emotional expressions, while visual frames provide facial expressions, gestures, and scene-level cues that strengthen sentiment interpretation. The integration of both modalities enables the model to capture more comprehensive emotional signals, resulting in improved classification robustness and better class balance.

These findings further suggest that multimodal LVLMs have strong potential as an initial screening signal for video content understanding. Rather than relying solely on explicit textual sentiment or isolated visual patterns, multimodal integration enables more context-aware interpretation of conversational dynamics, which is essential for scalable content filtering and automated video moderation systems.

4.4. Error Analysis

The error analysis reveals several important behavioral patterns and limitations across text-only, vision-only, and multimodal configurations. Table 4 summarizes Macro-F1 performance across modalities and episodes, enabling a detailed examination of model robustness, modality dependence, and class imbalance effects.

One of the most dominant error patterns is the tendency of several models to collapse toward predicting a single class. This issue is particularly severe in Qwen2.5-VL-7B, which consistently produces extremely low or near-zero F1(+) scores across all modalities and episodes. For example, in Episodes 1, 2, 3, 4, 5, and 7, the model achieves F1(+) = 0.00 in both text-only and multimodal settings, indicating a complete failure to recognize positive sentiment samples. Although the model occasionally achieves moderate F1(−) values (e.g., 0.55 in Episode 7), the resulting Macro-F1 scores remain very low, ranging only from 0.00 to 0.28 across most episodes.

This behavior suggests prediction collapse caused by poor calibration under zero-shot inference. Rather than producing balanced predictions, the model appears to default to a single label regardless of the input. The effect is easy to overlook in imbalanced episodes, where a model can achieve non-trivial accuracy despite failing to classify one class entirely.

The vision-only configuration exposes another important limitation. LLaVA-OneVision-7B achieves extremely high F1(+) scores in nearly all episodes, reaching 1.00 in Episode 4 and above 0.85 in Episodes 1, 2, 3, 5, and 6. However, these results are accompanied by F1(−) = 0.00 across almost all episodes, resulting in relatively low Macro-F1 values despite strong positive-class performance.

This indicates that the visual-only model heavily favors positive sentiment predictions and struggles to identify negative emotional cues from sampled frames alone. The problem likely arises because facial expressions and gestures associated with negative sentiment are either subtle, visually ambiguous, or insufficiently represented in uniformly sampled frames.

The issue is particularly evident in Episode 4, where the model achieves perfect F1(+) = 1.00 but completely fails on the negative class. This demonstrates that strong performance on a single class can produce misleadingly high accuracy or F1 values when evaluated without balanced metrics such as Macro-F1.
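A short worked example makes this effect concrete. The counts below are hypothetical and are not taken from the dataset: consider an episode with 90 positive and 10 negative utterances in which every utterance is predicted as positive.

```latex
% Hypothetical episode: 90 positive and 10 negative utterances,
% with every utterance predicted as positive.
\begin{align*}
\text{Accuracy} &= \frac{90 + 0}{100} = 0.90 \\
P_{+} &= \frac{90}{90 + 10} = 0.90, \qquad R_{+} = \frac{90}{90 + 0} = 1.00 \\
F1_{+} &= \frac{2 \cdot 0.90 \cdot 1.00}{0.90 + 1.00} \approx 0.947 \\
F1_{-} &= 0 \quad \text{(no negative utterance is ever predicted)} \\
\text{Macro-F1} &= \frac{F1_{+} + F1_{-}}{2} \approx 0.474
\end{align*}
```

Accuracy and F1(+) both look strong in this case, while Macro-F1 falls below 0.5 because the negative class contributes nothing.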

Compared to the vision-only setting, text-only models generally produce more balanced predictions. LLaVA-OneVision-7B achieves moderate Macro-F1 values ranging from 0.26 to 0.82 across episodes, indicating that textual information provides relatively stable sentiment cues.

However, several text-only errors remain associated with implicit sentiment, sarcasm, contextual ambiguity, and short conversational utterances. In these situations, textual information alone is insufficient to fully capture emotional meaning. For example, in Episodes 3 and 7, text-only Macro-F1 decreases to 0.54 and 0.61 respectively, suggesting that sentiment interpretation depends partially on non-verbal context.

Similarly, Qwen2.5-VL-7B performs poorly in text-only mode, again exhibiting severe class imbalance behavior. These findings suggest that zero-shot sentiment classification using textual information alone remains highly dependent on model calibration and contextual understanding.

The multimodal configuration provides the strongest and most balanced performance for LLaVA-OneVision-7B in most episodes. Multimodal Macro-F1 scores exceed both text-only and vision-only settings in six of the seven episodes, reaching 0.95 in Episode 1, 0.87 in Episode 2, 0.81 in Episode 5, and 0.92 in Episode 6; Episode 4 is the exception, as discussed below.

Importantly, multimodal integration substantially improves negative sentiment recognition compared to the vision-only configuration. For example, in Episode 1, F1(−) improves from 0.00 in the vision-only setting to 0.92 in the multimodal setting. Similar improvements are observed in Episodes 2, 5, 6, and 7. This indicates that textual information helps compensate for missing or ambiguous visual cues, while visual information enriches contextual interpretation when text alone is insufficient.

The multimodal configuration also reduces extreme prediction bias by producing more balanced F1 scores between positive and negative classes. Unlike the unimodal settings, multimodal predictions are less likely to collapse toward a single class, demonstrating improved robustness and cross-modal complementarity.

Despite these improvements, multimodal models still exhibit several limitations. In Episode 4, multimodal performance remains relatively weak, with Macro-F1 reaching only 0.39 and F1(−) = 0.00. This indicates that multimodal integration cannot fully compensate for severe imbalance, ambiguous emotional context, or weak alignment between textual and visual signals.

Some errors also occur when the modalities provide conflicting information. For instance, dialogue may appear neutral or positive while facial expressions imply frustration, sarcasm, or discomfort. In such cases, the model may prioritize one modality over the other, resulting in incorrect predictions.

Another important limitation is the reliance on uniformly sampled frames. Since sampled frames do not necessarily capture the most emotionally informative moments, important visual sentiment cues may be omitted during extraction. This limitation may partially explain performance variability across episodes.

Overall, LLaVA-OneVision-7B demonstrates substantially stronger robustness than Qwen2.5-VL-7B across all modalities. While Qwen2.5-VL-7B repeatedly collapses toward trivial predictions, LLaVA-OneVision-7B maintains relatively stable Macro-F1 performance and benefits significantly from multimodal integration.

These differences suggest that multimodal sentiment classification performance depends not only on modality combination, but also on the effectiveness of multimodal alignment mechanisms within the LVLM architecture itself.

5. Discussion

It is important to note that sentiment-based screening has inherent limitations when applied to content moderation. Sentiment polarity does not directly indicate the presence of harmful or policy-violating content, and relying solely on sentiment may lead to misinterpretation of context. Therefore, the proposed approach should be interpreted as a supportive signal within a broader moderation pipeline, rather than a standalone decision-making system.

The results indicate that zero-shot LVLMs can provide a useful multimodal screening signal by capturing sentiment cues from both transcripts and uniformly sampled frames. However, the observed performance should not be interpreted as sufficient for fully automated moderation decisions. Instead, these models are best positioned as triage components that can prioritize videos for human review or route borderline cases to downstream, task-specific detectors. In operational settings, the final moderation decision should incorporate platform policy rules, risk thresholds, and additional safety classifiers beyond sentiment.

Across our experiments, the results demonstrate that LLaVA-OneVision-7B consistently outperforms Qwen2.5-VL-7B in terms of accuracy and Macro-F1 across all episodes, indicating its stronger capability in capturing multimodal sentiment patterns. However, a deeper analysis reveals that high accuracy alone does not necessarily reflect robust sentiment understanding, especially under imbalanced class distributions. In the multimodal setting, LLaVA-OneVision-7B achieves relatively strong performance for both positive and negative classes in most episodes, although its performance still varies depending on the distribution of sentiment labels and the ambiguity of the utterance-level cues. Conversely, Qwen2.5-VL-7B shows a stronger single-class prediction tendency, frequently identifying samples as negative while failing to recognize positive utterances. These findings highlight that model behavior is strongly influenced by class distribution and model-specific prediction tendencies.

The study also confirms that class imbalance plays a critical role in shaping model behavior. Episodes dominated by positive samples may lead to inflated accuracy scores, while minority-class detection remains a challenging aspect of evaluation. Furthermore, the results suggest that multimodal integration is model-dependent: LLaVA-OneVision-7B benefits from combining textual and visual cues, whereas Qwen2.5-VL-7B does not show comparable gains under the current zero-shot prompting setup. Overall, this work emphasizes the importance of class-aware evaluation and provides empirical evidence that multimodal sentiment models still require methodological improvements to achieve more balanced and reliable sentiment classification.

Finally, our frame–resolution analysis highlights a practical trade-off between temporal coverage and visual quality. Increasing the number of frames provides more temporal evidence but requires lower resolution under fixed computational budgets, which can reduce the clarity of fine-grained facial or scene cues. These findings suggest that efficient deployment should jointly consider sampling strategy, resolution, and inference cost rather than maximizing one factor alone.

Future work may explore extending this approach to multi-class emotion recognition; however, such extensions are beyond the scope of this study, which focuses exclusively on sentiment polarity.

5.1. Implications

A key implication is that utterance-level multimodal sentiment predictions can be aggregated into a video-level score that supports moderation workflows. Rather than replacing existing moderation systems, LVLM-based sentiment screening may serve as a supportive signal for content triage, for example by helping prioritize videos for human review, flag uncertain segments for further inspection, or assist in routing borderline cases to specialized safety models. Because no thresholds are defined in this study, any thresholds applied to such a score would need to be calibrated to platform safety policies, age guidelines, and operational risk tolerance before deployment.

The findings of this study indicate that Large Vision-Language Models (LVLMs) hold significant potential as initial filtering or screening signals in video content analysis pipelines. In the multimodal setting, models such as LLaVA-OneVision-7B demonstrate strong performance in identifying dominant sentiment patterns, achieving high accuracy and near-perfect recall for the majority class. This capability suggests that LVLMs can be effectively utilized in early-stage filtering systems, where the primary objective is to rapidly process large volumes of video data and identify content that requires further analysis. In such scenarios, LVLMs function not as final decision-makers, but as high-level signal generators that provide preliminary insights into content characteristics.

A key advantage observed in this study is the ability of LVLMs to exhibit high sensitivity toward specific sentiment classes. LLaVA-OneVision-7B demonstrates consistently high recall for positive content, whereas Qwen2.5-VL-7B shows strong recall for negative content. From a filtering perspective, this behavior can be leveraged to design systems that prioritize detection of target content categories (e.g., harmful, inappropriate, or emotionally intense content), reduce the search space for downstream processing, and enable efficient large-scale content monitoring. Thus, LVLMs can act as high-recall screening mechanisms that minimize the amount of relevant content missed, even if some misclassification occurs.

Our findings suggest that zero-shot LVLMs can serve as an auxiliary multimodal signal in settings where labeled training data are limited. However, applying sentiment-based screening to real-world child safety decisions requires additional validation, stronger taxonomies beyond sentiment polarity, and careful evaluation across diverse datasets and platforms.

5.2. Limitations

Although the present research has yielded promising results, several limitations should be noted:

Limitations of the Dataset: The dataset is limited in scope, as it is derived from utterance-level segments of a single Indonesian TV series. While this setting provides realistic conversational data, it may not capture the diversity of sentiment expressions across domains, languages, and content types. Therefore, the findings should be interpreted as an exploratory case study rather than a generalizable benchmark.
Dependence on a single annotator: A key limitation of this study lies in the use of a single annotator without inter-annotator agreement or adjudication procedures. While this approach ensures internal consistency in labeling, it does not capture the variability of human interpretation, which is particularly important for subjective tasks such as sentiment analysis. As a result, the annotations should be interpreted as a single-annotator reference rather than a fully validated ground truth. This may introduce bias and affect the reliability of the evaluation. Future work should incorporate multiple annotators, measure inter-annotator agreement, and include adjudication protocols to improve annotation reliability and better capture the subjective nature of sentiment interpretation.
Limitations of zero-shot inference: Although LVLMs provide promising results without task-specific fine-tuning, they may struggle with subtle or ambiguous sentiment cues, particularly when textual and visual signals conflict. This indicates that additional validation, and potentially domain-specific adaptation, may be required for robust real-world use.
Platform coverage and computational cost: The analysis focuses on YouTube video data, which may not fully represent the diverse multimodal sentiment expressions found across platforms. Generalization of these findings could be improved by including a broader dataset covering various social media sources and content types. In addition, the computational cost of fine-tuning large-scale models remains a challenge, as it requires substantial GPU resources that can limit accessibility for smaller research teams or organizations.
Mismatch Between Annotation Cues and Model Inputs: Another limitation of this study lies in the mismatch between annotation and model inputs. While the annotator may consider vocal cues such as intonation during labeling, the models operate only on textual and visual inputs. This discrepancy may introduce noise in the evaluation, as certain sentiment cues present during annotation are not accessible to the model. Future work may address this limitation by incorporating audio-aware multimodal models or restricting annotation criteria to match model inputs more closely.
Absence of statistical significance testing and repeated runs: This study reports descriptive performance metrics for zero-shot inference, including accuracy, Macro-F1, precision, recall, and class-wise F1-score. However, we did not conduct repeated stochastic inference runs or formal statistical significance testing. Therefore, observed differences between input configurations and models should be interpreted cautiously, particularly when performance gaps are small. Future work should include repeated evaluations, confidence intervals, and paired statistical tests, such as bootstrap resampling or approximate randomization tests, to provide stronger evidence for performance differences.
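As an illustration of the paired bootstrap mentioned in the last item above, the following minimal sketch resamples utterances with replacement and reports a percentile confidence interval for the Macro-F1 difference between two systems. The function names, the label encoding (1 = positive, 0 = negative), and the toy predictions are assumptions made for illustration; this is not the evaluation code used in this study.

```python
import random

def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def paired_bootstrap_ci(y_true, pred_a, pred_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for Macro-F1(A) - Macro-F1(B) over resampled utterances."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        # The same resampled indices are used for both systems (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(macro_f1(t, a) - macro_f1(t, b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Toy example (hypothetical predictions): 37 positive and 13 negative utterances.
y_true = [1] * 37 + [0] * 13
system_a = list(y_true)   # a system that happens to be always correct
system_b = [0] * 50       # a system that always predicts "negative"
ci_low, ci_high = paired_bootstrap_ci(y_true, system_a, system_b, n_resamples=500)
```

If the resulting interval excludes zero, the observed difference is unlikely to be an artifact of which utterances were sampled. With only 50 utterances per episode, intervals for smaller performance gaps can be expected to be wide.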

5.3. Future Research

Future work includes expanding evaluation to diverse datasets (different genres, languages, and platforms), adding supervised baselines and task-specific classifiers, performing prompt and threshold sensitivity analyses, and investigating cost-effective inference strategies for large-scale moderation pipelines.

Future research directions include:

Fine-Tuning with Domain-Specific Datasets and Stronger Evaluation Protocols: Although the evaluated models show promising zero-shot capabilities, fine-tuning with domain-specific datasets—particularly those containing child-friendly and inappropriate content samples—could significantly enhance their classification accuracy. Creating annotated multimodal datasets tailored for sentiment analysis in child safety applications would help improve model robustness and mitigate biases. Future work should also include stronger supervised and fine-tuned baselines, repeated evaluations, confidence intervals, and formal statistical significance testing, such as bootstrap resampling or paired significance tests, to provide more reliable evidence of performance differences across models and input modalities.
Integration of Additional Modalities: The current study primarily focuses on text and visual sentiment analysis. Future research should explore the incorporation of audio-based sentiment analysis, as tone of voice, background sounds, and speech inflections can provide crucial emotional cues. Combining text, frames, and audio may lead to a more comprehensive understanding of sentiment in video content.
Real-Time Processing and Efficiency Optimization: Given the computational demands of large-scale multimodal models, future studies should explore optimization techniques to enable real-time sentiment analysis. Efficient inference strategies, model distillation, and low-rank adaptation (LoRA) could be employed to reduce latency and improve deployment feasibility on resource-constrained devices.
Application in Broader Content Moderation Use Cases: While this study focuses on child safety, the insights gained from LVLM-based sentiment analysis can be extended to other content moderation applications, such as hate speech detection, misinformation filtering, and toxicity analysis. Future work could explore how these models perform in diverse content regulation scenarios, further refining their effectiveness across multiple domains.
Incorporating Annotator Consensus and Label Reliability: A key direction for future research is to explicitly consider annotator agreement and consensus in dataset construction and evaluation. First, inter-annotator agreement should be measured using metrics such as Cohen’s Kappa or Fleiss’ Kappa. Second, consensus-based labeling strategies should be introduced so that ambiguous samples are re-evaluated. Third, model performance should be analyzed separately for high-agreement and low-agreement samples.
This approach would help distinguish whether classification errors arise from model limitations or from inherent ambiguity in the data.
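To make the agreement measurement concrete, the sketch below computes Cohen’s Kappa for two annotators from first principles, as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The two label lists are invented purely for illustration (the present study used a single annotator, so no such data exist), and the function name is an assumption.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical labels for ten utterances (not data from this study).
annotator_1 = ["pos", "pos", "neg", "pos", "neg", "pos", "pos", "neg", "pos", "pos"]
annotator_2 = ["pos", "pos", "neg", "neg", "neg", "pos", "pos", "pos", "pos", "pos"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's Kappa: {kappa:.2f}")
```

Here the annotators agree on 8 of 10 items (p_o = 0.80), but with a 7:3 label split for each annotator the chance agreement is already p_e = 0.58, so the Kappa value is considerably lower than the raw agreement rate. Fleiss’ Kappa would be the corresponding choice for more than two annotators.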

Given the moderate accuracy observed in a zero-shot setting and the limited dataset scope, readiness for fully automated enforcement is not claimed. Any real-world use should treat LVLM outputs as supportive signals and validate performance under platform-specific safety requirements.

6. Conclusions

This study examined the feasibility of using large vision-language models (LVLMs), specifically LLaVA-OneVision-7B and Qwen2.5-VL-7B, for zero-shot multimodal sentiment analysis on utterance-aligned conversational video segments. The evaluation considered text-only, vision-only, and multimodal settings to assess the contribution of each input modality.

The findings show that multimodal integration can improve sentiment classification performance, but the benefit is model-dependent. LLaVA-OneVision-7B generally benefits from combining textual transcripts and visual frames, achieving stronger overall and class-balanced performance than its unimodal configurations in most episodes. In contrast, Qwen2.5-VL-7B shows limited sensitivity to input modality changes and tends to produce less balanced predictions under the current zero-shot prompting setup.

These results indicate that LVLMs can extract useful affective signals from conversational video data without task-specific fine-tuning. However, the analysis also shows that accuracy alone is insufficient for evaluating performance in imbalanced sentiment datasets. Class-wise metrics and Macro-F1 remain necessary to identify model-specific prediction tendencies and to avoid overestimating performance based on dominant sentiment classes.
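A short worked example illustrates the point. Using the overall class counts reported in Table 1 (275 positive and 75 negative utterances), the sketch below scores a hypothetical classifier that always predicts the positive class. This trivial baseline is an assumption introduced for illustration and is not one of the evaluated models.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there are no true positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Overall class counts from Table 1.
n_pos, n_neg = 275, 75

# Hypothetical baseline: every utterance is predicted as positive.
accuracy = n_pos / (n_pos + n_neg)
f1_pos = f1_score(tp=n_pos, fp=n_neg, fn=0)   # all negatives become false positives
f1_neg = f1_score(tp=0, fp=0, fn=n_neg)       # no negative is ever recovered
macro_f1 = (f1_pos + f1_neg) / 2

print(f"accuracy={accuracy:.4f}  F1(+)={f1_pos:.2f}  F1(-)={f1_neg:.2f}  Macro-F1={macro_f1:.2f}")
```

The baseline reaches roughly 78.6% accuracy while its Macro-F1 is only 0.44, because the negative class contributes an F1 of zero. The same pattern is visible in the vision-only rows for LLaVA-OneVision-7B in Table 3, where R(+) is 1.00 and F1(−) is 0.00.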

From a practical perspective, this study does not position sentiment polarity as a direct indicator of harmful or policy-violating content. Instead, LVLM-based multimodal sentiment analysis is better understood as an auxiliary triage signal that may help prioritize videos or utterance-level segments for further review. Such systems should therefore be integrated into broader moderation pipelines involving calibrated thresholds, specialized safety classifiers, and human validation.
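One way to picture this triage role is sketched below: per-segment sentiment labels are aggregated into a negative-segment ratio for each video, and videos at or above a review threshold are queued for human inspection, highest ratio first. The aggregation rule, the 0.3 threshold, and all identifiers are hypothetical choices for illustration; they are neither the aggregation procedure nor a calibrated threshold from this study.

```python
from dataclasses import dataclass

@dataclass
class VideoTriage:
    video_id: str
    negative_ratio: float

def build_review_queue(segment_labels, threshold=0.3):
    """Rank videos for human review by their share of negative segments.

    segment_labels maps a video id to a list of 'positive'/'negative' labels.
    The threshold is an illustrative placeholder, not a calibrated value.
    """
    queue = []
    for video_id, labels in segment_labels.items():
        if not labels:
            continue  # no segments, nothing to score
        ratio = sum(label == "negative" for label in labels) / len(labels)
        if ratio >= threshold:
            queue.append(VideoTriage(video_id, ratio))
    # Highest negative ratio first, so reviewers see those videos earliest.
    return sorted(queue, key=lambda item: item.negative_ratio, reverse=True)

# Hypothetical model outputs for three videos.
predictions = {
    "video_a": ["positive", "positive", "negative", "positive"],
    "video_b": ["negative", "negative", "positive"],
    "video_c": ["positive"] * 5,
}
review_queue = build_review_queue(predictions)
```

In this sketch the queue only orders content for reviewers. It makes no removal decision, which matches the supportive role described above.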

Given the limited dataset and exploratory nature of the study, the findings should not be generalized to broader domains without further validation. Future work should include larger and more diverse datasets, multiple annotators with agreement analysis, and extended multimodal inputs such as audio. Additionally, improvements in prompt design, model calibration, and visual sampling strategies may further enhance performance. Overall, this work provides an initial step toward understanding how LVLMs can be used for multimodal sentiment analysis in realistic video settings, while highlighting important directions for future research.

Author Contributions

Conceptualization, A.H., A.H.N., W.M. and H.O.N.; methodology, A.H., A.H.N. and W.M.; software, A.H. and A.H.N.; validation, A.H., A.H.N., W.M., H.O.N. and A.O.; formal analysis, A.H., A.H.N., W.M. and H.O.N.; investigation, A.H. and A.H.N.; resources, A.H., A.H.N. and Y.M.; data curation, A.H. and H.O.N.; writing—original draft preparation, A.H., A.H.N. and W.M.; writing—review and editing, A.H., A.H.N., W.M. and H.O.N.; visualization, A.H. and A.H.N.; supervision, A.O. and Y.M.; funding acquisition, A.H. and A.H.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Directorate of Research and Community Service, Universitas Islam Riau, under contract number 928/KONTRAK/P-K-U/DPPM-UIR/10-2024.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available at https://github.com/anggihanafiah-ops/Multimodal-Sentiment-Analysis (accessed on 2 May 2026).

Acknowledgments

During the preparation of this work, the authors used OpenAI ChatGPT 5.5 for language editing and refinement. After using this tool, the authors reviewed, revised, and edited the content as needed, and take full responsibility for the content of the published article.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Overview of the proposed multimodal sentiment analysis pipeline.

Figure 2. A zero-shot prompt template for multimodal sentiment classification using textual input.

Figure 3. A zero-shot prompt template for multimodal sentiment classification using vision input.

Figure 4. A zero-shot prompt template for multimodal sentiment classification using multimodal input.

Figure 5. Trade-off Between Prediction Accuracy and Image Resolution.

Table 1. Summary statistics of the evaluation dataset across episodes, including class distribution and multimodal characteristics (utterance length and duration).

Episode	Positive (Count)	Positive (%)	Negative (Count)	Negative (%)	Average Number of Words	Average Video Duration (s)
Eps1	37	74%	13	26%	6.58	2.76
Eps2	36	72%	14	28%	4.96	1.88
Eps3	41	82%	9	18%	4.86	1.99
Eps4	50	100%	0	0%	5.56	2.10
Eps5	43	86%	7	14%	5.12	1.88
Eps6	37	74%	13	26%	5.44	2.05
Eps7	31	62%	19	38%	5.36	1.94
All Episodes	275	78.57%	75	21.43%	5.41	2.09

Table 2. Text-only performance comparison across episodes (best values in bold).

Model	Acc.	P(+)	R(+)	F1(+)	P(−)	R(−)	F1(−)	Macro-F1
Eps1
LLaVA-OneVision-7B	74.00%	0.96	0.68	0.79	0.50	0.92	0.65	0.72
Qwen2.5-VL-7B	26.00%	0.00	0.00	0.00	0.26	1.00	0.41	0.21
Phi-3.5-mini (3.8B)	66.00%	1.00	0.54	0.70	0.43	1.00	0.60	0.65
Qwen3.5-4B	62.00%	1.00	0.49	0.65	0.41	1.00	0.58	0.62
Eps2
LLaVA-OneVision-7B	64.00%	1.00	0.50	0.67	0.44	1.00	0.61	0.64
Qwen2.5-VL-7B	28.00%	0.00	0.00	0.00	0.28	1.00	0.44	0.22
Phi-3.5-mini (3.8B)	44.00%	1.00	0.22	0.36	0.33	1.00	0.50	0.43
Qwen3.5-4B	52.00%	1.00	0.33	0.50	0.37	1.00	0.54	0.52
Eps3
LLaVA-OneVision-7B	56.00%	1.00	0.46	0.63	0.29	1.00	0.45	0.54
Qwen2.5-VL-7B	18.00%	0.00	0.00	0.00	0.18	1.00	0.31	0.15
Phi-3.5-mini (3.8B)	36.00%	1.00	0.22	0.36	0.22	1.00	0.36	0.36
Qwen3.5-4B	50.00%	1.00	0.39	0.56	0.26	1.00	0.42	0.49
Eps4
LLaVA-OneVision-7B	36.00%	1.00	0.36	0.53	0.00	0.00	0.00	0.26
Qwen2.5-VL-7B	0.00%	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Phi-3.5-mini (3.8B)	22.00%	1.00	0.22	0.36	0.00	0.00	0.00	0.18
Qwen3.5-4B	34.00%	1.00	0.34	0.51	0.00	0.00	0.00	0.26
Eps5
LLaVA-OneVision-7B	68.00%	1.00	0.63	0.77	0.30	1.00	0.47	0.62
Qwen2.5-VL-7B	14.00%	0.00	0.00	0.00	0.14	1.00	0.25	0.12
Phi-3.5-mini (3.8B)	42.00%	1.00	0.33	0.49	0.19	1.00	0.33	0.41
Qwen3.5-4B	58.00%	1.00	0.51	0.68	0.25	1.00	0.40	0.54
Eps6
LLaVA-OneVision-7B	84.00%	0.97	0.81	0.88	0.63	0.92	0.75	0.82
Qwen2.5-VL-7B	28.00%	1.00	0.03	0.05	0.27	1.00	0.42	0.24
Phi-3.5-mini (3.8B)	74.00%	0.96	0.68	0.79	0.50	0.92	0.65	0.72
Qwen3.5-4B	86.00%	0.97	0.84	0.90	0.67	0.92	0.77	0.84
Eps7
LLaVA-OneVision-7B	62.00%	1.00	0.39	0.56	0.50	1.00	0.67	0.61
Qwen2.5-VL-7B	38.00%	0.00	0.00	0.00	0.38	1.00	0.55	0.28
Phi-3.5-mini (3.8B)	58.00%	1.00	0.32	0.49	0.47	1.00	0.64	0.57
Qwen3.5-4B	52.00%	1.00	0.23	0.37	0.44	1.00	0.61	0.49

Table 3. Video-only performance comparison across episodes (best values in bold).

Model	Acc.	P(+)	R(+)	F1(+)	P(−)	R(−)	F1(−)	Macro-F1
Eps1
LLaVA-OneVision-7B	74.00%	0.74	1.00	0.85	0.00	0.00	0.00	0.43
Qwen2.5-VL-7B	26.00%	0.00	0.00	0.00	0.26	1.00	0.41	0.21
LLaVA-NeXT-video-7B	86.00%	0.88	0.95	0.91	0.80	0.62	0.70	0.81
Eps2
LLaVA-OneVision-7B	72.00%	0.72	1.00	0.84	0.00	0.00	0.00	0.42
Qwen2.5-VL-7B	28.00%	0.00	0.00	0.00	0.28	1.00	0.44	0.22
LLaVA-NeXT-video-7B	58.00%	1.00	0.42	0.59	0.40	1.00	0.57	0.58
Eps3
LLaVA-OneVision-7B	82.00%	0.82	1.00	0.90	0.00	0.00	0.00	0.45
Qwen2.5-VL-7B	18.00%	0.00	0.00	0.00	0.18	1.00	0.31	0.15
LLaVA-NeXT-video-7B	54.00%	0.95	0.46	0.62	0.27	0.89	0.41	0.52
Eps4
LLaVA-OneVision-7B	100.00%	1.00	1.00	1.00	0.00	0.00	0.00	0.50
Qwen2.5-VL-7B	0.00%	0.00	0.00	0.00	0.00	0.00	0.00	0.00
LLaVA-NeXT-video-7B	52.00%	1.00	0.52	0.68	0.00	0.00	0.00	0.34
Eps5
LLaVA-OneVision-7B	86.00%	0.86	1.00	0.92	0.00	0.00	0.00	0.46
Qwen2.5-VL-7B	14.00%	0.00	0.00	0.00	0.14	1.00	0.25	0.12
LLaVA-NeXT-video-7B	54.00%	0.81	0.60	0.69	0.06	0.14	0.08	0.39
Eps6
LLaVA-OneVision-7B	74.00%	0.74	1.00	0.85	0.00	0.00	0.00	0.43
Qwen2.5-VL-7B	26.00%	0.00	0.00	0.00	0.26	1.00	0.41	0.21
LLaVA-NeXT-video-7B	90.00%	0.94	0.92	0.93	0.79	0.85	0.81	0.87
Eps7
LLaVA-OneVision-7B	62.00%	0.62	1.00	0.77	0.00	0.00	0.00	0.39
Qwen2.5-VL-7B	38.00%	0.00	0.00	0.00	0.38	1.00	0.55	0.28
LLaVA-NeXT-video-7B	44.00%	1.00	0.10	0.18	0.40	1.00	0.58	0.38

Table 4. Ablation study of zero-shot sentiment classification performance across input modalities. Best results per row are highlighted in bold.

Episode	Model	Text-Only			Vision-Only			Multimodal
Episode	Model	F1(+)	F1(−)	Macro-F1	F1(+)	F1(−)	Macro-F1	F1(+)	F1(−)	Macro-F1
Eps1	LLaVA-OneVision-7B	0.79	0.65	0.72	0.85	0.00	0.43	0.97	0.92	0.95
Eps1	Qwen2.5-VL-7B	0.00	0.41	0.21	0.00	0.41	0.21	0.00	0.41	0.21
Eps2	LLaVA-OneVision-7B	0.67	0.61	0.64	0.84	0.00	0.42	0.91	0.82	0.87
Eps2	Qwen2.5-VL-7B	0.00	0.44	0.22	0.00	0.44	0.22	0.00	0.44	0.22
Eps3	LLaVA-OneVision-7B	0.63	0.45	0.54	0.90	0.00	0.45	0.81	0.58	0.70
Eps3	Qwen2.5-VL-7B	0.00	0.31	0.15	0.00	0.31	0.15	0.00	0.31	0.15
Eps4	LLaVA-OneVision-7B	0.53	0.00	0.26	1.00	0.00	0.50	0.78	0.00	0.39
Eps4	Qwen2.5-VL-7B	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Eps5	LLaVA-OneVision-7B	0.77	0.47	0.62	0.92	0.00	0.46	0.93	0.70	0.81
Eps5	Qwen2.5-VL-7B	0.00	0.25	0.12	0.00	0.25	0.12	0.00	0.25	0.12
Eps6	LLaVA-OneVision-7B	0.88	0.75	0.82	0.85	0.00	0.43	0.96	0.89	0.92
Eps6	Qwen2.5-VL-7B	0.05	0.42	0.24	0.00	0.41	0.21	0.00	0.41	0.21
Eps7	LLaVA-OneVision-7B	0.56	0.67	0.61	0.77	0.00	0.38	0.73	0.75	0.74
Eps7	Qwen2.5-VL-7B	0.00	0.55	0.28	0.00	0.55	0.28	0.00	0.55	0.28

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
