Article

Machine Learning with Self-Assessment Manikin Valence Scale for Fine-Grained Sentiment Analysis

by Lindung Parningotan Manik 1,2,*, Harry Susianto 3, Arawinda Dinakaramani 4, R. Niken Pramanik 5 and Totok Suhardijanto 5,†

1 Research Center for Data and Information Sciences, National Research and Innovation Agency, Bandung 40135, Indonesia
2 Faculty of Information Technology, Universitas Nusa Mandiri, Jakarta 13620, Indonesia
3 Faculty of Psychology, Universitas Indonesia, Depok 16424, Indonesia
4 Faculty of Computer Science, Universitas Indonesia, Depok 16424, Indonesia
5 Department of Linguistics, Faculty of Humanities, Universitas Indonesia, Depok 16424, Indonesia
* Author to whom correspondence should be addressed.
† The author passed away prior to the submission of this paper. This is one of his last works.
Information 2025, 16(7), 562; https://doi.org/10.3390/info16070562
Submission received: 29 April 2025 / Revised: 9 June 2025 / Accepted: 24 June 2025 / Published: 30 June 2025

Abstract

Traditional sentiment analysis methods use lexicons or machine learning models to classify text as positive or negative. These approaches are unable to capture nuance or intensity in short or informal texts. We propose a novel method that uses the Self-Assessment Manikin (SAM) valence scale, which provides a continuous measurement of sentiment, ranging from extremely positive to extremely negative. We describe the development of a lexicon of emotion-laden words with SAM valence scales and investigate its application to fine-grained sentiment analysis. We also propose a lexicon-based polarity approach to complement textual features in machine learning models trained to predict a numerical sentiment label for a given text. This method is evaluated using a new dataset of short texts with sentiment labels based on expert ratings, which are predicted using various machine learning fusion mechanisms. The lexicon-based polarity method is found to provide improvements of 0.250, 0.999, and 0.261 in the mean squared error for classical machine learning, RNN, and transformer-based architectures, respectively.

1. Introduction

Sentiment analysis, also known as opinion mining, is a form of natural language processing (NLP) that attempts to determine the underlying emotional tone of a passage of text. It has received considerable attention, with diverse applications in studying social media [1,2,3,4], consumer feedback [5,6,7], film reviews [8,9], and political discourse [10]. Traditional sentiment analysis approaches have successfully classified text as positive, negative, or neutral. However, there is a growing demand for more nuanced and fine-grained approaches to capture the subtle variations in emotions expressed by individuals.
Machine learning techniques are widely used to perform sentiment analysis, employing large annotated datasets and models to achieve impressive results. However, most existing methods rely on coarse-grained sentiment labels and overlook many complexities of human emotional responses. Two persistent challenges hinder progress. First, numeric sentiment scores derived from psycholinguistic lexicons are interpretable but notoriously brittle when confronted with context, informal language, or domain shift. Second, deep contextual models excel at capturing subtle linguistic cues, but they often require large, task-specific datasets to generalize reliably and offer little transparency regarding the role of explicit affective knowledge. Bridging this gap is of fundamental importance, especially in low-resource or culturally diverse settings where labeled data are scarce. This research paper proposes a novel approach that incorporates continuous psychometric scores into machine learning models.
The first objective of this study is to evaluate whether lexicon-based sentiment analysis can predict nuanced affective scores. Several lexicon-based approaches, such as SentiStrength [1], Valence Aware Dictionary and Sentiment Reasoner (VADER) [2], and SentiWordNet [11,12], can generate continuous sentiment scores. However, the lexicons available for low-resource languages are limited. Typically, either the texts are translated into English or English lexicons are translated into the relevant language before performing the analysis. Moreover, the existing lexicons were created using a combination of human annotators and machine learning. Their accuracy depends strongly on the quality of seed words or initial annotations, and errors in the initial set can propagate to the expanded lexicon. In addition, automated techniques may miss subtle nuances in sentiment, necessitating manual adjustments to improve the quality of the lexicon. Thus, to achieve the first objective, we propose another lexicon-based method to predict sentiment granularity within a short text. The lexicons contain emotion-laden words and were created using a psychometric instrument called the Self-Assessment Manikin (SAM) [13], in which individuals self-evaluate their emotional response to stimuli using a dimensional approach, measuring the valence of emotions on a scale from 1 (extremely negative) to 9 (extremely positive). Although SAM has been applied extensively in psychology and affective computing, its potential for sentiment analysis remains largely unexplored.
The second objective of this study is to investigate whether a polarity score derived from the results of the first objective can serve as a useful feature for machine learning models to predict continuous sentiment labels. Recent work on hybrid approaches to sentiment analysis has yielded promising results [3,4,5,6,7,8,9,14,15,16]. However, machine and deep learning models still face challenges in distinguishing sentiment-bearing words and capturing fine-grained sentiment information, particularly in low-resource languages like Indonesian, for which large annotated corpora are scarce. In classical sentiment analysis, lexicon knowledge can compensate for limited training data [17]. Therefore, we explore various mechanisms to incorporate lexicon-based polarity into models for fine-grained sentiment analysis. By including SAM valence scores in machine learning models, we aim to achieve a more nuanced analysis than conventional binary classification. The proposed methodology seeks to identify a spectrum of emotions, ranging from strongly negative to neutral to strongly positive, to provide a more nuanced understanding of the sentiments expressed in the textual content. We also evaluate the models’ explainability by computing the importance of the fused feature.
Our contribution is twofold. First, we extend the existing Indonesian version of Affective Norms for English Words (ANEW) by providing valence scores for 750 Indonesian words. ANEW is a database that provides ratings of valence, arousal, and dominance for a large set of English words [18]. The Indonesian version of ANEW, which was introduced in [19], contains 1490 Indonesian words. Another lexicon called Indonesian Valence and Arousal Words (IVAW) [20] has 1024 words, which are direct translations of the English words in ANEW. These were both developed through extensive psycholinguistic experiments in which human participants assessed the emotional characteristics of a list of words.
Second, we investigate the importance of the polarity score derived from lexicon-based sentiment analysis as an additional input to textual features for various machine learning algorithms, including recent deep learning approaches such as recurrent neural network (RNN) and transformer-based architectures, in performing fine-grained sentiment analysis. Transformer-based models, such as Bidirectional Encoder Representations from Transformers (BERT) and its family, represent the state of the art in text classification tasks [21,22], including sentiment analysis. We seek to implement and assess multiple methods for combining text-derived features with scalar lexicon-based polarity, such as simple concatenation and element-wise gated fusion, to improve the performance of these models further.
The structure of the remainder of this paper is as follows: Section 2 provides an overview of related work and various sentiment analysis techniques. Section 3 describes the methodology and experimental setup used to evaluate the proposed approach. Section 4 presents the results, which are discussed in Section 5. Section 6 summarizes the findings, highlights this study’s contributions, and outlines potential future research directions.

2. Literature Review

This section describes the theoretical foundations of this study, including fine-grained and lexicon-based sentiment analysis, as well as SAM and its adaptation for sentiment analysis.

2.1. Fine-Grained Sentiment Analysis

In contrast to traditional sentiment analysis, which classifies text into broad categories such as positive, negative, and neutral, fine-grained sentiment analysis seeks to provide more specific and detailed sentiment labels. For example, sentiment may be classified as strongly positive, moderately positive, slightly positive, neutral, slightly negative, moderately negative, or strongly negative. Alternatively, sentiment intensities can be described with numerical values [23].
Researchers have shifted their focus to fine-grained sentiment analysis in recognition of the inadequacies of conventional methods in capturing the subtle variations of human emotions. According to [24], affect is a general construct that refers to physiological changes and states of mind, whereas an emotion is a prototypical form of affect that is generally encoded in a particular language. One area of study in fine-grained sentiment analysis is the investigation of multidimensional emotion models. It was argued in [25] that the emotional world is not restricted to a two-dimensional valence–arousal space and that additional emotional dimensions need to be considered. This prompted the incorporation of dimensional emotion models into sentiment analysis to capture a wider variety of emotions.

2.2. Lexicon-Based Sentiment Analysis

Some early studies focused on developing lexicon-based methods for determining the polarity of emotions. For example, each word in the VADER lexicon is assigned a valence score representing its emotional polarity and intensity [2]. The scores range from −1 (the most negative) to +1 (the most positive), with 0 representing neutrality. In the SentiStrength lexicon, the score of each word ranges from −5 (extremely negative) to +5 (extremely positive) [1].
Since each word in the lexicons is mapped to a numerical value, these sentiment analysis methods also assign an overall numerical value to a text, in addition to a classical sentiment category. For example, VADER’s compound score represents the overall sentiment of a given text, ranging from −1 (most negative) to +1 (most positive). Similarly, one of SentiStrength’s outputs is a sentiment score ranging from −4 (extremely negative) to +4 (extremely positive).
However, these lexicons were created using a semi-supervised approach. For example, the VADER and SentiStrength lexicons were created using a combination of manual annotation and automatic methods. Although manual annotation of sentiment scores for a large number of words can be time-consuming and expensive, this approach ensures a high level of precision in assigning sentiment labels, as human annotators can understand the context and nuances of each word, resulting in accurate sentiment scores. It can also provide fine-grained sentiment scores, allowing for a more comprehensive analysis of intensity, capturing subtle variations in sentiment expressions.

2.3. Self-Assessment Manikin

SAM is a psychometric tool in which human annotators self-assess their emotional state along the valence dimension, from extremely negative to extremely positive [13]. The scale is defined with manikins: pictures of smiling, neutral, and frowning faces. Respondents indicate their emotional response to a given stimulus, a word in this context, by selecting a number between one (extremely negative) and nine (extremely positive) [18].

2.4. Machine Learning-Based Sentiment Analysis

Recent years have seen significant advancements in sentiment analysis, driven by a growing interest in understanding human emotions and opinions using textual data. Text is typically classified into discrete positive, negative, and neutral sentiment categories using conventional machine learning techniques, such as support vector machines (SVMs) or naive Bayes classifiers. Through seminal works in opinion mining and sentiment analysis, ref. [26] popularized the use of machine learning algorithms for sentiment classification.
Although lexicon-based approaches are computationally efficient, they are unable to capture context-specific nuance. With the advent of deep learning, neural network models have shown remarkable success in various NLP tasks, including sentiment analysis. Using transformer-based models, which are based on the multihead attention mechanism, ref. [27] introduced a deep and broad pretraining approach for text representation, demonstrating state-of-the-art efficacy across various NLP tasks. These models have demonstrated remarkable capabilities in capturing contextual information. However, challenges remain in distinguishing sentiment-bearing words and capturing fine-grained sentiment information.

2.5. Integrating Lexicon Features into Deep Learning Pipelines

Recently, researchers have explored various architectures that integrate lexicons and deep learning. For example, in [28], BiLSTM models were enhanced by incorporating sentiment lexicons through attention mechanisms, leading to improved performance in text classification tasks. Similarly, in [7], a hybrid model was proposed called LeBERT, which combines a sentiment lexicon and BERT embeddings with a CNN classifier. In [29], a sentiment- and context-aware hybrid deep neural network was developed that incorporates a wide-coverage sentiment lexicon alongside BERT for text sentiment classification. Moreover, ref. [16] describes a method that integrates a RoBERTa transformer with a CNN and a lexicon-based module for analyzing bullet screen comments. Each of these studies found that integrating lexicon-derived sentiment features into machine learning models can achieve more robust sentiment classification with improved accuracy.
Although prior studies have used lexicons in classification settings, few have applied them to continuous sentiment prediction, particularly in low-resource settings. Integrating psychometric instruments such as the SAM valence scores is a promising strategy for achieving granular sentiment analysis [4]. Unlike traditional sentiment analysis, which categorizes text with discrete sentiment labels, SAM allows for a more fine-grained representation of emotions along the valence dimension. The proposed method combines the benefits of deep learning and SAM valence scales to capture the complexity and variability of emotional expressions.
In multimodal sentiment analysis and emotion recognition, late fusion (decision-level fusion) refers to training independent models for each modality or data source and combining their outputs for a final prediction [30]. One advantage of late fusion is that it circumvents the need for tightly synchronized or aligned features across modalities. However, late fusion may neglect cross-modal interactions since decisions are made mostly in isolation, and it can fail to capture intermodal correlations as effectively as joint feature fusion [31]. Despite this drawback, late fusion remains popular because of its simplicity and robustness to missing modalities. If one modality model fails or if some data are missing, others can still contribute.

3. Materials and Methods

This section presents the methods used to acquire datasets, implement the proposed polarity algorithms, set up the experiments, and evaluate the results.

3.1. Dataset Acquisition

We acquired two datasets: a set of emotion-laden words annotated with SAM valence scores and one of short Indonesian texts (posts on X, formerly Twitter) annotated with numerical sentiment intensities. The datasets were labeled by humans. These annotations might be subjective, leading to variations in perceived scores. To mitigate this, we obtained multiple annotations per word or text instance. Each annotator assigned each word or text instance a score from one to nine. The aim of the algorithms developed in this study was to predict the mean of these sentiment labels. We took steps to ensure the privacy and anonymity of data sources and obtained informed consent from the human annotators.
For the first dataset, we selected 750 Indonesian words at random from the X posts. Undergraduate students enrolled in Sastra Indonesia (Indonesian Studies) were hired to rate each word in the dataset using a SAM valence scale. Each word was annotated by at least 45 male and 45 female students. They were all native Indonesian speakers and received IDR 50,000 (USD 5) as a token of appreciation. They were instructed to rate their emotional response from 1 (very negative, “you feel unhappy, upset, melancholic, despairing, or bored”) to 9 (very positive, “you feel happy, excited, pleased, or hopeful”) for each word. This instruction was a modification of that in [19]. Each word was then labeled with a valence score equal to the mean of the valence responses.
For the second dataset, we chose 2851 posts that contain at least one word in the lexicon acquired in this study or that in [19]. Each post was annotated independently by three linguistic experts, who were native speakers and teachers of Indonesian; they received IDR 2,000,000 (USD 200) as a token of appreciation. They were instructed to provide a sentiment intensity on the same scale as the SAM valence scale for the emotion-laden words, from 1 (very negative) to 9 (very positive), for each post.
We used two metrics to assess the acquired datasets: the standard deviation (SD) and the intraclass correlation coefficient (ICC). The SD can provide insight into variability, which indirectly reflects consistency. The ICC, specifically ICC2, was used to assess consistency and agreement among annotators. Generally, ICC values are interpreted as follows:
  • 0.00–0.40: poor agreement;
  • 0.41–0.60: moderate agreement;
  • 0.61–0.80: good agreement;
  • 0.81–1.00: excellent agreement.
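As an illustration of how ICC2 can be computed from such annotations, here is a minimal sketch using the pingouin library; the words, annotators, and scores below are hypothetical:

```python
import pandas as pd
import pingouin as pg

# Long-format ratings: one row per (word, annotator) pair, scores on the 1-9 scale
ratings = pd.DataFrame({
    "word":      ["senang", "senang", "senang", "kecewa", "kecewa", "kecewa"],
    "annotator": ["A1", "A2", "A3", "A1", "A2", "A3"],
    "score":     [8, 7, 8, 2, 3, 2],
})
icc = pg.intraclass_corr(data=ratings, targets="word",
                         raters="annotator", ratings="score")
print(icc.loc[icc["Type"] == "ICC2", ["ICC", "F", "pval", "CI95%"]])
```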

3.2. Analysis Methodology

Our investigation began with preparing the datasets and ensuring split integrity. Before performing the analysis, preprocessing steps were carried out: all letters were converted to lowercase, the texts were tokenized, tokens that only contained numbers or punctuation were removed, URLs and links were removed, typing errors were corrected, and slang was replaced with standard language. Then, we curated a common partition into training, validation, and test sets for all experiments to ensure that any performance differences observed between approaches were attributable to architectural choices rather than fortuitous overlaps of training and test data or distributional drift.
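As a rough illustration, the following is a minimal sketch of these preprocessing steps; the regular expressions and the slang map are our assumptions, and the study’s typo correction was a manual step not reproduced here:

```python
import re

def preprocess(text: str, slang_map: dict[str, str]) -> list[str]:
    """Lowercase, strip URLs, tokenize, drop number-only tokens, normalize slang."""
    text = re.sub(r"https?://\S+", " ", text.lower())  # remove URLs and links
    tokens = re.findall(r"\w+", text)                  # keep word tokens; punctuation-only tokens never match
    tokens = [t for t in tokens if not t.isdigit()]    # remove number-only tokens
    return [slang_map.get(t, t) for t in tokens]       # replace slang with standard forms

print(preprocess("Keretanya TELAT bgt http://t.co/x 2019", {"bgt": "banget"}))
# -> ['keretanya', 'telat', 'banget']
```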
The analysis began with a lexicon-based polarity estimation step because, unlike opaque neural embeddings, valence scores derived from a psycholinguistic lexicon provide an interpretable, human-validated sentiment signal. We used a lexicon-based polarity algorithm to obtain an interpretable scalar sentiment prior derived from the affective norms of the emotion-laden words we acquired. This algorithm yields a noise-robust sentiment intensity estimate without requiring labeled training data. In our pipeline, this prior acts as domain-agnostic auxiliary knowledge that can be fused with learned text features.
Textual features were then extracted with three complementary encoders—term frequency–inverse document frequency (TF-IDF), random word embeddings, and contextual transformer embeddings—because each captures fundamentally different linguistic evidence. We also benchmarked three model families (conventional regressors, RNNs, and transformers) to control for architecture capacity. Combining TF-IDF and conventional regressors allowed us to establish the limits to the abilities of classical machine learning when given only surface term importance and polarity. We also coupled random word embeddings with RNNs to test sequence-sensitive architectures that learn representations from scratch. Because embeddings are not pretrained, any improvement attributable to polarity fusion suggests that external priors can compensate for data scarcity in low-resource settings by guiding early representation learning. Furthermore, transformer-based contextual embeddings provided a high-capacity pretrained baseline that encodes rich sentiment cues. For the deep learning approaches, such as RNNs and transformers, we explored two fusion methods.
We also investigated the impact of the lexicon-based polarity feature on the machine learning models by computing the feature importance value alongside the individual word importance. The values were obtained by calculating the mean absolute Shapley additive explanations (SHAP) value for each feature. SHAP is a method of explaining the output of a machine learning model that uses concepts from cooperative game theory to attribute a model’s predictions to its features [32]. SHAP values can be zero, positive, or negative. A zero value suggests that the feature has little impact on the prediction. A positive value indicates that the feature contributes positively to the prediction, pushing the polarity higher. A negative value indicates that the feature contributes negatively, pulling the polarity lower.
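As a sketch of how these mean absolute SHAP values can be obtained for a fitted tree-based regressor (the variables `model` and `X_test` are assumed to come from the experiments described below):

```python
import numpy as np
import shap  # SHAP library [32]

# Assumes `model` is a fitted tree-based regressor (e.g., random forest) and
# `X_test` is a dense feature matrix whose last column is the lexicon polarity.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)      # shape: (n_samples, n_features)
importance = np.abs(shap_values).mean(axis=0)    # mean |SHAP| per feature
print("polarity importance:", importance[-1])    # polarity appended as the last column
```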

3.2.1. Lexicon-Based Polarity Algorithm

We performed a lexicon-based fine-grained sentiment analysis of emotion-laden words. The polarity was computed using a simple algorithm, shown in Algorithm 1, which relies on valence scores from lexicons to estimate the overall polarity of an input text. Thus, the scores are representative of the sentiment expressed by the words, and the algorithm uses these scores to calculate the polarity of the entire text. In this study, we used the first dataset combined with existing lexicons from [19] to estimate the texts’ polarity in the second dataset. We applied the algorithm to determine sentiment polarity from predefined valence scores associated with the words.
Algorithm 1: Lexicon-based polarity algorithm
Data: text T, lexicon of emotion-laden words L
Result: polarity p
[Pseudocode shown as an image in the original article; a runnable sketch follows the description below.]
The algorithm takes as input a text T for which polarity needs to be computed and a lexicon L containing emotion-laden words. The goal is to compute the text’s overall polarity p from the valence scores of individual words present in the lexicon. Two variables are initialized: $count_w$, which keeps track of the number of words in the text that have valence scores, and $sum_p$, which accumulates the sum of the valence scores of those words. The algorithm loops over each word $w_i$ in the input text T and checks whether the lexicon L contains the word. If so, $count_w$ is incremented by one, the valence score of $w_i$ is looked up in the lexicon L and assigned to the variable $v_i$, and $v_i$ is added to $sum_p$. After processing all words in the text, the algorithm calculates and returns the polarity p by dividing $sum_p$ by $count_w$.
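A minimal Python sketch of Algorithm 1 follows; it assumes the lexicon is a dictionary mapping words to mean SAM valence scores and that the text has already been preprocessed into whitespace-separated tokens. The neutral fallback for texts without lexicon words is our addition:

```python
def lexicon_polarity(text: str, lexicon: dict[str, float], neutral: float = 5.0) -> float:
    """Algorithm 1: mean valence of the emotion-laden words found in the text."""
    count_w, sum_p = 0, 0.0
    for w in text.split():        # loop over each word w_i in the text T
        if w in lexicon:          # only words with a valence score contribute
            count_w += 1
            sum_p += lexicon[w]   # v_i: the word's SAM valence score
    # Fallback for texts with no lexicon words (our assumption; the study's
    # dataset only includes texts containing at least one lexicon word)
    return sum_p / count_w if count_w else neutral

# Illustrative lexicon entries with hypothetical scores on the 1-9 SAM scale
lexicon = {"senang": 8.1, "kecewa": 2.3, "nyaman": 7.6}
print(lexicon_polarity("sangat kecewa tapi tetap nyaman", lexicon))  # -> 4.95
```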

3.2.2. Combination of Lexicon-Based Polarity and Machine Learning

We incorporated the polarity obtained by the lexicon-based polarity algorithm as an additional feature, alongside the textual features, in various machine learning algorithms. Textual features were extracted to represent the input text for the fine-grained sentiment analysis models. Three common feature extraction techniques were considered:
  • TF-IDF: We represented each text instance as a vector of term frequencies, disregarding word order, and used TF-IDF weighting to highlight important terms. The TF-IDF approach can yield superior classification accuracy because it focuses solely on word relevance at the level of individual terms [33].
  • Word embeddings: We constructed a vocabulary by converting each text into a sequence of word indices and padding them to a fixed length. These sequences were then fed into a trainable embedding layer, which was initialized randomly and learned from scratch, resulting in dense vector representations. The word embeddings aim to capture both syntactic and semantic features [34].
  • Contextual (word) embeddings: Taking advantage of recent advances in deep learning, we integrated transformer-based models to generate dynamic, context-aware word representations. Unlike static embeddings, these models consider the entire sentence, adapting to the nuances of word usage within different contexts [22].
We explored various machine learning algorithms and deep learning architectures for fine-grained sentiment analysis with the SAM valence scores:
  • Conventional machine learning algorithms: We used linear support vector regression (SVR), regression tree (RT), random forest (RF), and gradient boosting (GB) algorithms to build a regression model that predicts the sentiment label. The TF-IDF method was used to extract textual features in addition to the lexicon-based polarity.
  • RNNs: We employed various RNN architectures, namely, vanilla RNN, long short-term memory (LSTM), and gated recurrent unit (GRU), to capture sequential information in the text by inputting word embeddings in addition to the lexicon-based polarity.
  • Transformer-based architectures: We used pretrained transformer-based models, namely, BERT [35], DistilBERT [36], and Robustly Optimized BERT Pretraining Approach (RoBERTa) [37]. In these architectures, transformer layers process the contextual embeddings alongside the lexicon-based polarity to generate rich representations.

3.2.3. Fusion of Text Features and Lexicon-Based Polarity for Deep Learning

For the RNN and transformer-based architectures, we used two fusion methods. In the first method, the final state of the architecture layer, which is the extracted text feature, is simply concatenated with the lexicon-based polarity scalar. Supposing that $T$ is the text feature vector and $p$ is the polarity scalar feature, the fused feature $f_1$ is obtained by concatenating $T$ and $p$:

$$f_1 = T \oplus p,$$

where $\oplus$ denotes concatenation. The final output $y$ is computed with a linear layer:

$$y = W f_1 + b,$$

where $W$ and $b$ are the weight matrix and bias vector, respectively.
In the second method, the lexicon-based polarity is passed through a small multilayer perceptron (MLP), yielding a polarity embedding $v$; an element-wise gate is then computed over the concatenated features to fuse them using a weighted combination, and a final fully connected layer produces the output. Supposing that $W$ and $b$ are the learnable parameters, the gate $g$ is computed using

$$g = \sigma\bigl(W (T \oplus p) + b\bigr),$$

where $\sigma(\cdot)$ denotes the sigmoid function, ensuring that the entries of $g$ lie between 0 and 1. We experimented with two weighted combinations of $T$ and $v$ for obtaining the fused feature,

$$f_{2a} = g \odot T + (1 - g) \odot v$$

and

$$f_{2b} = (1 - g) \odot T + g \odot v,$$

where $\odot$ represents element-wise multiplication.
The first method, a simple concatenation of the textual features and the lexicon-based polarity, served as a minimal-cost integration scheme, increasing dimensionality but adding virtually no parameters. We hypothesized that simple concatenation might not capture modality-specific relevance or interaction. In contrast, the gated combination in the second method can dynamically balance the contributions of the text and numeric inputs, providing a learnable interaction mechanism that jointly considers the textual features and the lexicon-based polarity. This adaptive fusion approach could lead to a lower mean squared error (MSE) and better generalization, especially when the modalities contribute unequally to the target variable.
The baseline architecture without polarity ($f_0$), an architecture with polarity and the first fusion method ($f_1$), and an architecture with polarity and the second fusion method ($f_{2a}$ and $f_{2b}$) are depicted in Figure 1. All architectures start with a word (or contextualized) embedding layer followed by a sequence encoder that can be realized with an RNN or transformer block. In the baseline architecture $f_0$, the contextual text representation $T$ produced by this encoder is regularized through a dropout layer and sent directly to a fully connected layer, whose output node delivers the final prediction. The first fusion architecture augments this pipeline with external prior knowledge: after the same dropout stage, $T$ is concatenated with the scalar lexicon-based polarity $p$, and the fused vector $f_1$ is fed into a dense layer to generate the output, allowing the model to weigh statistical text cues and lexicon sentiment simultaneously. The most expressive design, the second fusion architecture, processes the two modalities more symmetrically: the polarity $p$ first passes through its own dense-ReLU-dropout branch, producing a higher-dimensional embedding that can better match the scale of $T$; the transformed polarity vector is then concatenated with $T$ and passes through a dense-sigmoid layer that yields a refined joint feature ($f_{2a}$ or $f_{2b}$), which finally passes through a second dense layer to emit the prediction. This enables the network to learn nonlinear interactions between text semantics and lexicon polarity while retaining regularization at multiple points.
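To make the gated fusion concrete, the following is a minimal PyTorch sketch of the second method under the equations above; the layer dimensions and dropout rate are illustrative assumptions rather than the paper’s exact hyperparameters:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated fusion of a text feature T with a scalar lexicon polarity p.
    Sketch only: dimensions are assumptions, not the paper's settings."""

    def __init__(self, text_dim: int = 128, dropout: float = 0.2):
        super().__init__()
        # Dense-ReLU-dropout branch lifting the scalar p to an embedding v
        self.polarity_mlp = nn.Sequential(
            nn.Linear(1, text_dim), nn.ReLU(), nn.Dropout(dropout)
        )
        self.gate = nn.Linear(text_dim + 1, text_dim)  # g = sigmoid(W(T ⊕ p) + b)
        self.head = nn.Linear(text_dim, 1)             # final regression layer

    def forward(self, T: torch.Tensor, p: torch.Tensor, swap: bool = False) -> torch.Tensor:
        v = self.polarity_mlp(p)                                  # polarity embedding v
        g = torch.sigmoid(self.gate(torch.cat([T, p], dim=-1)))   # element-wise gate
        # f_2a weights the text by g; f_2b (swap=True) weights it by (1 - g)
        f = g * T + (1 - g) * v if not swap else (1 - g) * T + g * v
        return self.head(f).squeeze(-1)                           # predicted intensity

# Usage with random stand-ins for encoder output and Algorithm 1 polarities
T = torch.randn(4, 128)           # text features from an RNN/transformer encoder
p = torch.rand(4, 1) * 8 + 1      # lexicon polarities on the 1-9 SAM scale
y_hat = GatedFusion()(T, p)       # shape: (4,)
```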

3.2.4. Evaluation Method and Experiment Setup

To evaluate the performance of the sentiment analysis models, we computed the MSE when applying the trained model to the test data. The MSE is calculated using
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$
where $n$ is the number of data points in the dataset, $y_i$ is the actual (observed) value of the target variable for the $i$th data point, and $\hat{y}_i$ is the predicted value of the target variable for the $i$th data point. The result is nonnegative, and smaller MSE values imply superior model performance because they indicate that the predictions are closer to the actual values.
The algorithms were implemented in Python 3. Several standard libraries, such as sklearn, were used to implement the conventional machine learning and neural network algorithms. For example, train_test_split from sklearn.model_selection was used to split the dataset automatically. To ensure that the observed differences in model performance were not caused by arbitrary randomization, we set all random seeds to the fixed value of 42, including the random_state parameter of train_test_split, numpy.random.seed, and torch’s manual_seed and cuda.manual_seed_all. Furthermore, the benchmark and deterministic parameters of torch.backends.cudnn were set to false and true, respectively. These settings ensured that the generated models performed the same if the experiments were repeated, because the initial model parameters received the same random values each time.
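A minimal sketch of this seeding configuration, matching the settings described above:

```python
import random

import numpy as np
import torch

SEED = 42

random.seed(SEED)                  # Python's own RNG, included for completeness
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# Passing random_state=SEED to train_test_split reproduces the same partition:
# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED)
```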

4. Results

This section describes the obtained dataset, presents sentiment label prediction results yielded by standalone lexicon-based methods and various fusion approaches, and evaluates the performance of these methods.

4.1. Datasets

Two datasets were acquired in this study: the emotion-laden words dataset and the labeled sentiment dataset. The raw data comprise posts on X (formerly Twitter) from June to October 2019 related to Indonesian railway transportation feedback.

4.1.1. Emotion-Laden Words Dataset

The lexicon acquired in this study contains 750 emotion-laden words, selected randomly from the raw dataset. It was ensured that the chosen words were not in [19]. The data were divided into three sets: A, B, and C. Each set contains 250 words and was assigned to different annotators. In this study, 136 students, whose ages ranged from 17 to 23, with a mean of 19.74, were hired to rate each word on the SAM valence scale. A sample of acquired emotion-laden words is presented in Table 1, and the distribution of annotators is shown in Table 2.
Histograms of the valence scores and standard deviations of valence responses between annotators for each word are shown in Figure 2. The low standard deviations suggest that the annotators tended to agree closely. Furthermore, the ICC2 value of each subset is shown in Table 3. These are between 0.55 and 0.56, suggesting moderate consistency among the annotators. High values of the F-statistic, between 61.5 and 62.8, and a p-value of 0.0 indicate significant variability between annotators when compared with the variability within each annotator’s scores, supporting a finding of moderate agreement. The confidence interval (CI) listed for each subset in the table is calculated such that, if the procedure were repeated, 95% of the resulting CIs would contain the true value.
In addition to the dataset acquired in this study, we used the one from [19], which contains 1490 words annotated by 56 university students, 30 female and 26 male. Since the raw data are not available, the ICC cannot be computed. However, the standard deviation and the mean of valence responses for each word are provided. Histograms of the standard deviations and valence scores are shown in Figure 3. The pattern looks similar to that of the dataset we acquired in this study. Combining our dataset with that of [19] yielded a lexicon of 2240 words, which was used for the subsequent experiments.

4.1.2. Labeled Sentiment Dataset

A sample of the labeled dataset is shown in Table 4. It can be seen that the positive and negative reviews have different sentiment intensities. The positive reviews have sentiment labels between 7 and 9, whereas the negative reviews have labels between 1 and 3. A box plot for the sentiment intensities given by the three experts, anonymized as X, Y, and Z, and a histogram of the standard deviations of sentiment intensities across experts are given in Figure 4. The plots look slightly different, but the similarity of the box plots for experts X and Z suggests that these experts were aligned in their judgments. High deviations could show areas of potential inconsistency of intensities across posts and greater disagreement among experts. However, like the valence responses, only a few instances had a higher deviation.
To investigate the consistency and agreement further, we computed the ICC for the second dataset. The results are presented in Table 5. An ICC2 of 0.576 suggests that the experts agreed to some extent but showed substantial variability in their judgments, as in the first dataset. The F-statistic and p-value suggest that the observed ICC was statistically significant, meaning that the moderate level of agreement among the experts was unlikely to be due to chance. The confidence interval of [0.41, 0.69] reinforces that the level of agreement was moderate, as it includes values at both the lower and upper bounds of the moderate category.
Therefore, considering the standard deviation and ICC results, we decided to use all the data and ignore the possibility of outliers in the subsequent experiments. Each X post was annotated with a sentiment label, calculated as the average of sentiment intensities given by the experts. These served as the ground truth to evaluate the methods used to predict sentiment labels. A histogram of the sentiment labels is shown in Figure 5.

4.2. Lexicon-Based Polarity Analysis

A box plot and histogram of the results of predicting sentiment labels using the lexicon-based polarity approach are depicted in Figure 6. Both plots compare the prediction results with the ground truth. The corresponding MSE results are shown in Figure 7. The MSE over all predictions was 2.659. We also analyzed the performance when predicting the sentiment of texts containing increasing numbers of words from the lexicons. For example, the MSE increased to 3.935 when predicting the sentiment of the 275 texts containing at least ten words from the lexicons. The best performance, however, was obtained for texts containing at least three or four lexicon words, for which the MSE was 2.6 over 2612 and 2222 texts, respectively.

4.3. Fusion Approaches

This section presents the results of various approaches to fusing lexicon-based polarity scores with textual features for conventional machine learning, RNN, and transformer-based architectures.

4.3.1. Combination of TF-IDF and Lexicon-Based Polarity

First, we built machine learning models by training conventional machine learning algorithms, such as linear support vector regression (SVR), regression tree (RT), random forest (RF), and gradient boosting (GB), using TF-IDF alone, as the baseline, as well as the combination of TF-IDF and the lexicon-based polarity. The method used to evaluate this hybrid approach is illustrated in Figure 8. Raw texts with sentiment labels were first prepared through standard preprocessing steps. The preprocessed texts were then converted into numerical vectors using TF-IDF, while a parallel branch consulted a sentiment lexicon to aggregate emotion-laden words into a single polarity score p. These statistical TF-IDF vectors and the polarity feature were concatenated, yielding a hybrid feature set paired with the original sentiment labels. The resulting dataset was split in an 80:20 ratio into training and test partitions. Each chosen machine learning regressor (SVR, RT, RF, and GB) was fitted to the training set, after which SHAP was applied to the trained model to quantify each feature’s contribution and enable interpretability. Finally, the model accuracy was evaluated on the held-out test set by computing the MSE, providing an unbiased measure of how closely the predicted sentiment intensities matched the true labels.
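The following is a minimal scikit-learn sketch of this pipeline; the toy texts, labels, and polarity values are hypothetical stand-ins for the study’s data:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data for illustration; the study used 2851 annotated X posts.
texts = ["kereta cepat dan nyaman", "jadwal kacau dan kotor",
         "perjalanan biasa saja", "petugas ramah dan membantu",
         "antrean panjang membuat kesal"]
labels = np.array([8.0, 2.0, 5.0, 7.5, 2.5])              # expert mean intensities (1-9)
polarity = np.array([[7.5], [2.1], [5.0], [7.8], [2.4]])  # Algorithm 1 output per text

X_text = TfidfVectorizer().fit_transform(texts)       # sparse TF-IDF vectors
X = hstack([X_text, csr_matrix(polarity)]).tocsr()    # append the polarity feature

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)        # the paper's 80:20 split
model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
print("MSE:", mean_squared_error(y_test, model.predict(X_test)))
```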
The MSE results are presented in Table 6. The largest absolute improvement in MSE when the lexicon-based polarity was included in the feature set, relative to the baseline, was obtained for RT, with a reduction of 0.25. The smallest improvement was observed for RF. The feature importance values are shown in Figure 9. In addition to the polarity, we identified the other five most important features for each experiment, which were words in the textual features, represented by w1, w2, w3, w4, and w5, with the highest SHAP values. These words were not necessarily the same across all experiments. The lexicon-based polarity was consistently found to be the most important feature when included in the feature set, with values of 0.1880, 0.3043, 0.3778, and 0.4975 for SVR, GB, RF, and RT, respectively.

4.3.2. Combination of Word Embeddings and Lexicon-Based Polarity

In the second experiment on hybrid approaches, we trained various RNN architectures with word embeddings. Instead of working with continuous embeddings directly, the input texts were first converted into one-hot encoded vectors. A custom model wrapper then computed the embedding on the fly by multiplying the one-hot input with the embedding matrix. The implementation allowed SHAP values to be calculated directly with the one-hot encoded inputs. SHAP values corresponding to the active hot index (the actual token) for each token in the sequence were extracted, aggregated, and averaged for each feature. Working with one-hot representations allowed the SHAP values to be tied directly to individual feature indices, providing a more direct interpretation of the impact of each feature.
The evaluation method used for this hybrid approach is illustrated in Figure 10. Raw texts with numeric sentiment labels were first preprocessed, then converted into fixed-length padded sequences suitable for recurrent neural networks; in parallel, the same preprocessed text was matched against a sentiment lexicon to aggregate an overall polarity score p. The padded token sequences and the polarity scalar were merged to form a composite feature set paired with the original labels, which were split in a 70:15:15 ratio into training, validation, and test partitions. Multiple RNN architectures (vanilla RNN, LSTM, and GRU) were trained on the training data and monitored with the validation set; once fitted, SHAP was employed to quantify the influence of individual tokens and the lexicon feature on each prediction, providing interpretability. Finally, the trained model’s generalization ability was assessed on the held-out test set by computing the MSE between the predicted and true sentiment intensities.
The MSE results are presented in Table 7. Using $f_1$, the performance improved by 0.068, 0.079, and 0.345 over the baseline for vanilla RNN, LSTM, and GRU, respectively. Using $f_{2a}$, the corresponding improvements in the MSE were 0.254, 0.841, and 0.726. The MSE differences between the baseline ($f_0$) and $f_{2b}$ were 0.191, 0.999, and 0.696 for vanilla RNN, LSTM, and GRU, respectively.
Furthermore, when using $f_1$, the feature importance values of the lexicon-based polarity were 0.0306, 0.0348, and 0.0304 for vanilla RNN, LSTM, and GRU, respectively. When using $f_{2a}$, the corresponding values were 0.1887, 0.2314, and 0.2028. Finally, for $f_{2b}$, they were 0.1874, 0.2063, and 0.2021. Comparisons of feature importance between the polarity and the five most important words in the text feature (w1, w2, w3, w4, and w5) for $f_1$, $f_{2a}$, and $f_{2b}$ are depicted in Figure 11. These words were not necessarily the same in all experiments.

4.3.3. Combination of the Transformer-Based Embeddings and the Lexicon-Based Polarity

In the next hybrid approach, we trained transformer-based architectures with various pretrained models and the lexicon-based polarity. The first model we used was IndoBERT, trained on twelve corpora of formal and informal Indonesian sentences, which comprise 23.43 GB of data, four billion words, and around 250 million sentences [38]. The second model was a DistilBERT variant (https://huggingface.co/cahya/distilbert-base-indonesian, accessed on 26 June 2025), which was pretrained on 522 MB of Indonesian Wikipedia pages and 1 GB of Indonesian newspaper articles. Lastly, we used a RoBERTa variant (https://huggingface.co/flax-community/indonesian-roberta-base, accessed on 26 June 2025) trained on the OSCAR dataset [39,40].
The method used to evaluate this hybrid approach is illustrated in Figure 12. Raw texts paired with numeric sentiment labels were first preprocessed and then passed to a transformer tokenizer to obtain dense contextualized vectors; in parallel, the same preprocessed texts were scored against an emotion-word lexicon to derive an aggregate polarity value p. These contextual features and the polarity scalar were concatenated into a hybrid feature set, which, together with the original labels, was partitioned in a 70:15:15 ratio into training, validation, and test sets. Transformer-based architectures were fine-tuned on the training data and monitored with the validation set; after fitting, a SHAP analysis quantified the contributions of each token and the lexicon for interpretability. Finally, the predictive quality was assessed on the held-out test data by computing the MSE between the model’s sentiment estimates and the ground-truth intensities.
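As an illustration, here is a minimal sketch of extracting contextual features with the DistilBERT variant cited above and forming the concatenation fusion $f_1$; the example text and polarity value are hypothetical:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "cahya/distilbert-base-indonesian"  # the DistilBERT variant cited above
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

batch = tokenizer(["kereta datang tepat waktu"], return_tensors="pt",
                  padding=True, truncation=True)
with torch.no_grad():
    T = encoder(**batch).last_hidden_state[:, 0, :]  # first-token contextual feature
p = torch.tensor([[6.8]])                            # Algorithm 1 polarity (hypothetical)
f1 = torch.cat([T, p], dim=-1)                       # fused feature fed to a dense head
```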
The MSE results are presented in Table 8. When using $f_1$, the MSE increased by 0.075, 0.103, and 0.050 from the baseline for IndoBERT, DistilBERT, and RoBERTa, respectively. For $f_{2a}$, the corresponding improvements were 0.120, 0.261, and 0.030. The MSE differences between the baseline ($f_0$) and $f_{2b}$ were 0.091, −0.115, and 0.101 for IndoBERT, DistilBERT, and RoBERTa, respectively. When using $f_1$, the corresponding feature importance values for the lexicon-based polarity were 0.0071, 0.0072, and 0.0051. When using $f_{2a}$, these values were 0.2204, 0.1149, and 0.1559, and for $f_{2b}$, they were 0.2096, 0.1488, and 0.1696. Figure 13 shows comparisons of feature importance between the polarity and the five most important words in the text features (w1, w2, w3, w4, and w5) for $f_1$, $f_{2a}$, and $f_{2b}$. These words were not necessarily the same across all experiments.

5. Discussion

In this section, we discuss the theoretical and practical implications of the results obtained with each of the proposed methods.

5.1. Theoretical Implications

In the following section, we discuss the theoretical implications of our findings, outlining how they extend, refine, and challenge existing conceptual frameworks in the field.

5.1.1. Lexicon-Based Polarity Algorithm

The performance of the lexicon-based polarity algorithm was comparable to that of the RT algorithm, with an MSE of around 2.66. Restricting the evaluation to texts containing at least three emotion-laden words, rather than at least one, reduced the MSE slightly to 2.60. However, Figure 7 shows that the MSE increased as the required number of lexicon words rose further. This might be because short texts that naturally include many emotion-laden words were less common and might represent more extreme or heterogeneous sentiment expressions.
The lowest MSE was observed when the texts had around three to four emotion-laden words. This aligns with [41], in which it was observed that beyond a certain feature count, adding more features did not improve the performance or even harm it. There might be an optimal zone where a trade-off is achieved between having enough sentiment indicators and maintaining a large enough or representative sample.

5.1.2. Conventional Machine Learning Algorithms

The results shown in Table 6 indicate that incorporating the lexicon-based polarity as an additional feature consistently improved the predictive performance for all the models trained by classical machine learning algorithms. This finding agrees with [8], in which SentiWordNet scores were integrated as features in an SVM model for film reviews and blog posts, with the results showing that the model benefited from fusing lexical features with sentiment information. Studies in other low-resource languages, such as Bulgarian [9] and Arabic [3], yielded similar results. This also corroborates findings in [23], in which ensemble regression was applied to financial data, including lexicon features, to predict fine-grained sentiment scores, showing that even before the advent of transformers, ensembles with lexicons excelled at sentiment analysis.
From Figure 9, it can be seen that the polarity was the most important feature in all experiments. Furthermore, the feature had high importance values for the RT and RF algorithms. This suggests that it provided a strong and easily exploitable signal for these models. In particular, the RT model benefited significantly, indicating that the polarity was useful in the single-tree structure for making splits. However, the high importance value of the feature for RF did not translate into a large overall performance gain. One plausible explanation is that the marginal gain from including the polarity might have been small, even though the feature was used frequently. Moreover, it was found in [42] that tree SHAP values have similar biases to the Gini importance in RF models, tending to over-credit certain features without achieving a corresponding performance gain.

5.1.3. RNN Architectures

The improvements in the performance of the RNN architectures were similar to those of the classical machine learning algorithms; the MSE was reduced significantly when the lexicon-based polarity was included in the model. LSTM benefited the most, with improvements of 0.841 and 0.999 when using the fusion methods $f_{2a}$ and $f_{2b}$, respectively. This indicates that the fusion features $f_{2a}$ and $f_{2b}$ were better than $f_1$, which can also be inferred from the higher feature importance values. This result aligns with [4], in which valence and arousal were incorporated into a BiLSTM, leading to significantly improved performance in sentiment classification. This finding also corroborates that of [5], in which lexicon-based features were similarly integrated with word embeddings before or within an LSTM classifier, resulting in high accuracy because the lexicon’s prior polarity knowledge guided the training of the network.
Despite their increased complexity, the RNN architectures did not outperform conventional machine learning algorithms. This agrees with [43,44], in which well-tuned SVM and RF models achieved performance similar to or better than RNNs, likely because the algorithms could effectively leverage the available features even with limited data. Moreover, unlike the transformer-based architectures, we did not use pretrained embeddings, such as Word2Vec [45] or GloVe [46], but trained the RNN architectures from scratch. Using pretrained weights could significantly boost the performance of RNNs, especially in low-resource settings, because they might help capture complex semantic and syntactic relationships.

5.1.4. Transformer-Based Architectures

The transformer-based models outperformed the others. However, unlike in the experiments with conventional machine learning algorithms and RNNs, mixed results were obtained when the lexicon-based polarity was incorporated into these models. In particular, when using $f_1$, the feature hurt the performance. The naive concatenation did not allow the model to learn complex interactions between the text and numeric modalities. This aligns with [15], in which a similar late fusion with a simple concatenation was proposed, resulting in small performance improvements, if any.
Although the results obtained using $f_{2a}$ and $f_{2b}$ were mixed, the fused features generally improved the performance of the transformer-based architectures. An intermediate nonlinear fusion step, such as a gating mechanism or an MLP, might help in adapting and weighing the importance of each modality. These results align with [16], in which a similar late fusion strategy was proposed: the contributions of a lexicon-based classifier and a RoBERTa+CNN classifier were weighted explicitly, yielding a more nuanced bullet screen comment classifier that outperformed a pure RoBERTa model. The sentiment was computed via both approaches, and then their outputs were weighted by giving more weight to lexicon signals for short texts. However, as with LeBERT [7], the performance improvement from using $f_{2a}$ and $f_{2b}$ was relatively small.
It can also be observed that the fused features $f_1$, $f_{2a}$, and $f_{2b}$ had lower importance values in all transformer-based architectures. When pretrained models were fine-tuned on the dataset, the benefit of adding the lexicon-based polarity became marginal or even negligible. Although it is intuitively expected that lexicon features could be useful, their contribution is often overshadowed by robust text features like deep embeddings, reducing their utility unless the text features are limited [14]. This suggests that the lexicon signals were largely redundant with what the model learned from the text itself.
In the future, it may be worthwhile to try incorporating the polarity before or during the transformer encoding, as an additional token embedding. For example, in [6], a two-channel CNN–LSTM model was proposed in which one branch processed text sequences while another processed lexicon indications, and these were merged in parallel (an intermediate fusion). This parallel early fusion led to an improved performance on user reviews by combining semantic features with lexicon cues. Furthermore, in [47], early fusion was explored in a multimodal context, and a Tabular-Text Transformer (TTT) was proposed. In this model, all numeric features are encoded via a bespoke distance-to-quantile embedding scheme, which is then fed alongside textual embeddings into transformer layers that perform full pairwise self-attention across modalities, improving performance in tasks like financial risk assessment and medical diagnosis.

5.2. Practical Implications

The interpretability findings have several practical implications regarding the dominance of the lexicon-based polarity, the trade-offs between added complexity and performance gains, and the practicality of the SAM valence score for fine-grained sentiment analysis.

5.2.1. The Effect of Lexicon-Based Polarity

The lexicon-based polarity alone was not sufficient to predict the sentiment intensity in our study. This result contrasts with [48], in which lexicon-based methods alone achieved near-state-of-the-art performance in classical sentiment analysis. The discrepancy could be due to the difference in task: continuous label estimation is naturally harder than classification. In our case, the polarity needed to be combined with machine learning to yield better performance. Our SHAP-based interpretability analysis revealed that the polarity feature dominated both the conventional and transformer models, albeit in different ways. In the linear and tree-based regressors, polarity often emerged as the most influential input: its SHAP values were much larger than those of the TF-IDF terms, indicating that once the model knows the average valence of words, it leans heavily on that cue, sometimes at the expense of more nuanced textual patterns. This aligns with [49], which indicated that a curated polarity feature can receive the highest importance in traditional models like SVR or random forest, outweighing hundreds of TF-IDF features. By contrast, the transformer models showed a more balanced distribution in which the polarity still contributed meaningfully, but its relative importance was tempered by contextual embeddings.
In scenarios with limited labeled examples or constrained hardware, classical regressors augmented with a lexicon-based polarity feature often suffice. Because the polarity exerts a strong influence, a simple pipeline (compute polarity, extract TF-IDF, fit a machine learning model) may be preferable, at the cost of slightly lower performance. When ample annotated data and computing resources are available, however, transformer-based models remain preferable. By integrating polarity with contextual embeddings, they can avoid overreliance on a single feature, making them robust to domain shift. In all cases, lexicon-based polarity should be treated as a primary feature since it generalizes well across domains. Its human-curated valence scores provide a stable prior, and domain-specific augmentations (such as adding slang or industry jargon) could further improve performance in specialized applications.

5.2.2. Fusion of Lexicon-Based Polarity into Deep Neural Networks

The incremental gains afforded by integrating SAM valence scores into deep learning models must be weighed against the additional model complexity and resource demands, particularly in low-resource settings where computation, data, and annotation budgets are constrained. Although gated fusion could provide a minor MSE reduction, it introduces tens to hundreds of thousands of extra parameters, requiring longer inference times and more intricate hyperparameter tuning. In many production or edge-device deployments, these costs translate directly into slower responses, higher energy use, or increased hardware requirements. The complexity of neural network architectures and tuning could be a barrier to wider adoption [50].
By contrast, our simpler fusion method, the static concatenation, added only a few parameters and converged more quickly, making it far more amenable to scenarios with limited GPU access or strict latency constraints. In light of this, we recommend a tiered approach. The simplest fusion should be implemented first to validate that the lexicon-based polarity prior delivers value. In addition to a simple concatenation, another simple variant, like residual fusion, may be worth considering. After that, a more elaborate gating mechanism or deep fusion network should only be pursued if the application’s performance requirements and resource budget justify the engineering overhead. This pragmatic balance should ensure that the innovative use of SAM valence scores remains both impactful and broadly accessible across diverse deployment environments.

5.2.3. Continuing or Stopping Human-Rated Valence

Even in an era where the transformer-based systems can approximate sentiment scores with impressive accuracy, there remain compelling reasons to continue collecting human-rated valence, for example, via the SAM scale. No matter how powerful a transformer becomes, its internal understanding of emotion is ultimately a statistical abstraction learned from existing text corpora. In contrast, SAM ratings, where real people explicitly report how pleasant or unpleasant they find a word on a standardized scale, remain the closest approach to directly measuring felt emotion. Collecting fresh SAM data, ideally from the target population, provides a true ground truth against which to calibrate or fine-tune models. Without updated human ratings, we risk accumulating subtle biases: for example, sarcasm or novel slang could fool a transformer into misestimating valence, whereas a fresh set of SAM responses from actual users would flag that discrepancy.
Moreover, a SAM run from 2000 may no longer reflect how people in 2025 interpret certain common words. If we stop collecting new SAM data, our lexicon-based priors become stale. By contrast, regularly soliciting SAM judgments, especially in under-resourced languages, ensures that our models continue to match current human affect. For instance, Indonesian audiences might interpret certain loanwords or slang differently from decade to decade; collecting SAM data every few years helps capture those shifts. Furthermore, in many high-stakes settings, such as mental-health screening, legal text analysis, and medical narratives [51], domain experts demand transparency: “Why did the system think this paragraph was highly negative?” Having a library of SAM-anchored valence scores means that when the model flags a text as negative, we can say, “This phrase historically rates 1.8/9 on average for valence among actual respondents”, which can be far more compelling to a clinician or regulator than “The transformer’s hidden state leaned negative”. In practice, a model that highlights lexicon-derived contributions is easier to explain and trust than a purely black-box deep model [52].
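As a toy illustration of such a lexicon-anchored explanation (reusing the hypothetical `lexicon` from the earlier sketch; the words and scores are illustrative, not entries from our published resource):

```python
def explain(text: str) -> list[tuple[str, float]]:
    """List the lexicon words found in a text with their human-rated valence."""
    return [(w, lexicon[w]) for w in text.lower().split() if w in lexicon]

# A text flagged as negative can be justified by its matched valence scores:
print(explain("fasilitas buruk dan panas"))  # [('buruk', 2.37), ('panas', 3.15)]
```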
Nevertheless, if a transformer-based model already matches or even slightly exceeds human-rated reliability in a particular domain, the incremental benefit of new SAM labels may be marginal. In settings where budget constraints or volunteer fatigue are pressing concerns, it might be more cost-effective to rely on transformer-generated estimates than to conduct large-scale SAM campaigns. Organizing a robust SAM study (recruiting participants, ensuring demographic representativeness, paying for responses, and cleaning the data) can be expensive and time-consuming. If the zero-shot output of a preexisting lexicon or transformer already performs well enough, for example, within an acceptable error margin for a given application, the return on investment of new SAM data could be low.

6. Conclusions

In this study, we have explored methods for fine-grained sentiment analysis in a low-resource language setting. Our two core research objectives were (1) to evaluate whether lexicon-based sentiment analysis can substantively predict sentiment intensities and (2) to investigate whether the polarity derived from lexicon-based sentiment analysis can meaningfully augment machine-learned sentiment representations. By considering increasingly complex fusion methods, from classical regressors to RNNs and transformers, we have fulfilled both objectives, demonstrating that, even in the era of powerful transformers, human-curated affective priors retain unique value. We have also extended the Indonesian version of ANEW and used the acquired emotion-laden words labeled with SAM valence scores in the analysis. We have proposed a simple lexicon-based polarity algorithm and several approaches to infusing the computed polarity into machine learning algorithms and deep neural network architectures. Although we obtained mixed results, the fusion methods generally outperformed the baseline models.
Unlike previous studies that have focused on classification tasks, we have analyzed sentiment intensity regression performance from both architectural and interpretability perspectives, incorporating SHAP values to assess modality contributions. The lexicon-based polarity prior, although interpretable, yielded a higher MSE when used alone, confirming that raw valence scores cannot capture contextual nuances. However, the lexicon-derived polarity provided a strong predictive signal for the classical regressors: the polarity often dominated the TF-IDF features, allowing the regressors to achieve near-state-of-the-art accuracy. The transformer-based models still benefited from the polarity despite its lower SHAP magnitude, revealing that polarity can subtly influence deep feature representations in ways not captured by raw attribution scores.
Nevertheless, our experiments focused exclusively on Indonesian datasets and a single lexicon source. Although the general methodology should be transferable, the coverage of domain-specific lexicons remains a potential blind spot. Furthermore, by depending on a single valence dictionary, we may have overlooked nuanced sentiment aspects that a multilexicon ensemble could capture. In the future, we will explore the combination of multiple heterogeneous lexicons to cover broader semantic ground and evaluate fusion strategies that learn to weight the contribution of each lexicon. In addition to early and residual fusions, as suggested earlier, future studies should also consider other fusion approaches for deep neural networks. For example, hybrid architectures that combine early and late fusion or attention-based fusion layers could further enhance robustness and interpretability.

Author Contributions

Conceptualization, T.S., H.S. and L.P.M.; methodology, H.S. and L.P.M.; software, A.D. and L.P.M.; validation, L.P.M. and A.D.; formal analysis, A.D. and L.P.M.; data curation, T.S., R.N.P. and H.S.; writing—original draft preparation, L.P.M.; writing—review and editing, A.D., L.P.M., H.S. and R.N.P.; investigation, L.P.M. and A.D.; resources, T.S., H.S. and R.N.P.; visualization, L.P.M.; supervision, T.S. and H.S.; project administration, T.S. and R.N.P.; funding acquisition, T.S. and R.N.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Universitas Indonesia via the Publikasi Terindeks Internasional (PUTI) 2020 research grant no. PENG-1/UN2.RST/PPM.00.00/2020.

Institutional Review Board Statement

The Committee on Research Ethics at the Faculty of Psychology, Universitas Indonesia, decided that the study complies with the ethical standards in the discipline of psychology, Universitas Indonesia’s Research Ethical Code of Conduct, and the Indonesian Psychology Association’s Ethical Code of Conduct (No: 927/FPsi.Komite Etik/PDP.04.00/2020 and date of approval: 16 November 2020).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.15291830.

Acknowledgments

In memoriam of Totok Suhardijanto, who devoted his life to advancing computational linguistics in Bahasa Indonesia. May he rest in peace.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Thelwall, M.; Buckley, K.; Paltoglou, G. Sentiment Strength Detection for the Social Web. J. Am. Soc. Inf. Sci. Technol. 2012, 63, 163–173. [Google Scholar] [CrossRef]
  2. Hutto, C.; Gilbert, E. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. Proc. Int. AAAI Conf. Web Soc. Media 2014, 8, 216–225. [Google Scholar] [CrossRef]
  3. Nagoudi, E.M.B. ARB-SEN at SemEval-2018 Task1: A New Set of Features for Enhancing the Sentiment Intensity Prediction in Arabic Tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, New Orleans, LA, USA, 5–6 June 2018; Apidianaki, M., Mohammad, S.M., May, J., Shutova, E., Bethard, S., Carpuat, M., Eds.; ACL: Stroudsburg, PA, USA, 2018; pp. 364–368. [Google Scholar] [CrossRef]
  4. Cheng, Y.Y.; Chen, Y.M.; Yeh, W.C.; Chang, Y.C. Valence and Arousal-Infused Bi-Directional LSTM for Sentiment Analysis of Government Social Media Management. Appl. Sci. 2021, 11, 880. [Google Scholar] [CrossRef]
  5. Gupta, V.; Rattan, P. Enhancing Sentiment Analysis in Restaurant Reviews: A Hybrid Approach Integrating Lexicon-Based Features and LSTM Networks. Int. J. Intell. Syst. Appl. Eng. 2024, 12, 185–198. [Google Scholar]
  6. Li, W.; Zhu, L.; Shi, Y.; Guo, K.; Cambria, E. User reviews: Sentiment analysis using lexicon integrated two-channel CNN–LSTM family models. Appl. Soft Comput. 2020, 94, 106435. [Google Scholar] [CrossRef]
  7. Mutinda, J.; Mwangi, W.; Okeyo, G. Sentiment Analysis of Text Reviews Using Lexicon-Enhanced Bert Embedding (LeBERT) Model with Convolutional Neural Network. Appl. Sci. 2023, 13, 1445. [Google Scholar] [CrossRef]
  8. Singh, V.; Piryani, R.; Uddin, A.; Waila, P. Sentiment analysis of Movie reviews and Blog posts. In Proceedings of the 2013 3rd IEEE International Advance Computing Conference (IACC), Ghaziabad, India, 22–23 February 2013; pp. 893–898. [Google Scholar] [CrossRef]
  9. Kapukaranov, B.; Nakov, P. Fine-Grained Sentiment Analysis for Movie Reviews in Bulgarian. In Proceedings of the International Conference Recent Advances in Natural Language Processing, Hissar, Bulgaria, 7–9 September 2015; Mitkov, R., Angelova, G., Bontcheva, K., Eds.; Incoma Ltd.: Shoumen, Bulgaria, 2015; pp. 266–274. [Google Scholar]
  10. Manik, L.P.; Febri Mustika, H.; Akbar, Z.; Kartika, Y.A.; Ridwan Saleh, D.; Setiawan, F.A.; Atman Satya, I. Aspect-Based Sentiment Analysis on Candidate Character Traits in Indonesian Presidential Election. In Proceedings of the 2020 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), Tangerang, Indonesia, 18–20 November 2020; pp. 224–228. [Google Scholar] [CrossRef]
  11. Esuli, A.; Sebastiani, F. SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, 22–28 May 2006. [Google Scholar]
  12. Baccianella, S.; Esuli, A.; Sebastiani, F. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, 17–23 May 2010; Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., Tapias, D., Eds.; European Language Resources Association (ELRA): Paris, France, 2010. [Google Scholar]
  13. Bradley, M.M.; Lang, P.J. Measuring emotion: The self-assessment manikin and the semantic differential. J. Behav. Ther. Exp. Psychiatry 1994, 25, 49–59. [Google Scholar] [CrossRef]
  14. Deshmane, A.A.; Friedrichs, J. TSA-INF at SemEval-2017 Task 4: An Ensemble of Deep Learning Architectures Including Lexicon Features for Twitter Sentiment Analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017; Bethard, S., Carpuat, M., Apidianaki, M., Mohammad, S.M., Cer, D., Jurgens, D., Eds.; ACL: Stroudsburg, PA, USA, 2017; pp. 802–806. [Google Scholar] [CrossRef]
  15. Manik, L.P.; Susianto, H.; Dinakaramani, A.; Pramanik, N.; Suhardijanto, T. Can Lexicon-Based Sentiment Analysis Boost Performances of Transformer-Based Models? In Proceedings of the 2023 7th International Conference on New Media Studies (CONMEDIA), Bali, Indonesia, 6–8 December 2023; pp. 314–319. [Google Scholar] [CrossRef]
  16. Liu, Y.; Wang, S.; Yu, S. A Bullet Screen Sentiment Analysis Method That Integrates the Sentiment Lexicon with RoBERTa-CNN. Electronics 2024, 13, 3984. [Google Scholar] [CrossRef]
  17. Koto, F.; Beck, T.; Talat, Z.; Gurevych, I.; Baldwin, T. Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian’s, Malta, 17–22 March 2024; Graham, Y., Purver, M., Eds.; ACL: Stroudsburg, PA, USA, 2024; pp. 298–320. [Google Scholar]
  18. Bradley, M.M.; Lang, P.J. Affective Norms for English Words (ANEW): Instruction Manual and Affective Ratings; Technical report; The Center for Research in Psychophysiology, University of Florida: Gainesville, FL, USA, 1999. [Google Scholar]
  19. Sianipar, A.; van Groenestijn, P.; Dijkstra, T. Affective Meaning, Concreteness, and Subjective Frequency Norms for Indonesian Words. Front. Psychol. 2016, 7, 1907. [Google Scholar] [CrossRef]
  20. Hulliyah, K.; Sukmana, H.T.; Bakar, N.S.A.; Ismail, A.R. Indonesian Affective Word Resources Construction in Valence and Arousal Dimension for Sentiment Analysis. In Proceedings of the 2018 6th International Conference on Cyber and IT Service Management (CITSM), Parapat, Indonesia, 7–9 August 2018; pp. 1–5. [Google Scholar] [CrossRef]
  21. Manik, L.P.; Akbar, Z.; Mustika, H.F.; Indrawati, A.; Rini, D.S.; Fefirenta, A.D.; Djarwaningsih, T. Out-of-Scope Intent Detection on A Knowledge-Based Chatbot. Int. J. Intell. Eng. Syst. 2021, 14, 446–457. [Google Scholar] [CrossRef]
  22. Kurniasih, A.; Manik, L.P. On the Role of Text Preprocessing in BERT Embedding-based DNNs for Classifying Informal Texts. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 927–934. [Google Scholar] [CrossRef]
  23. Jiang, M.; Lan, M.; Wu, Y. ECNU at SemEval-2017 Task 5: An Ensemble of Regression Algorithms with Effective Features for Fine-Grained Sentiment Analysis in Financial Domain. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017; Bethard, S., Carpuat, M., Apidianaki, M., Mohammad, S.M., Cer, D., Jurgens, D., Eds.; ACL: Stroudsburg, PA, USA, 2017; pp. 888–893. [Google Scholar] [CrossRef]
  24. Barrett, L.F. How Emotions Are Made: The Secret Life of the Brain; Houghton Mifflin Harcourt: Boston, MA, USA, 2017. [Google Scholar]
  25. Fontaine, J.R.J.; Scherer, K.R.; Roesch, E.B.; Ellsworth, P.C. The world of emotions is not two-dimensional. Psychol. Sci. 2007, 18, 1050–1057. [Google Scholar] [CrossRef]
  26. Pang, B.; Lee, L. Opinion Mining and Sentiment Analysis; Foundations and Trends® in Information Retrieval Series; Now Publishers: Norwell, MA, USA, 2008. [Google Scholar]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017; NIPS’17. pp. 6000–6010. [Google Scholar]
  28. Ma, Y.; Peng, H.; Cambria, E. Targeted Aspect-Based Sentiment Analysis via Embedding Commonsense Knowledge into an Attentive LSTM. Proc. AAAI Conf. Artif. Intell. 2018, 32, 5876–5883. [Google Scholar] [CrossRef]
  29. Khan, J.; Ahmad, N.; Khalid, S.; Ali, F.; Lee, Y. Sentiment and Context-Aware Hybrid DNN With Attention for Text Sentiment Classification. IEEE Access 2023, 11, 28162–28179. [Google Scholar] [CrossRef]
  30. Cai, Y.; Li, X.; Zhang, Y.; Li, J.; Zhu, F.; Rao, L. Multimodal sentiment analysis based on multi-layer feature fusion and multi-task learning. Sci. Rep. 2025, 15, 2126. [Google Scholar] [CrossRef]
  31. Li, H.; Lu, Y.; Zhu, H. Multi-Modal Sentiment Analysis Based on Image and Text Fusion Based on Cross-Attention Mechanism. Electronics 2024, 13, 2069. [Google Scholar] [CrossRef]
  32. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017; NIPS’17. pp. 4768–4777. [Google Scholar]
  33. Sihombing, A.; Indrawati, A.; Yaman, A.; Trianggoro, C.; Manik, L.P.; Akbar, Z. A scientific expertise classification model based on experts’ self-claims using the semantic and the TF-IDF approach. In Proceedings of the 2022 International Conference on Computer, Control, Informatics and Its Applications, New York, NY, USA, 22–23 November 2022; IC3INA ’22. pp. 301–305. [Google Scholar] [CrossRef]
  34. Manik, L.P.; Ferti Syafiandini, A.; Mustika, H.F.; Fatchuttamam Abka, A.; Rianto, Y. Evaluating the Morphological and Capitalization Features for Word Embedding-Based POS Tagger in Bahasa Indonesia. In Proceedings of the 2018 International Conference on Computer, Control, Informatics and its Applications (IC3INA), Tangerang, Indonesia, 1–2 November 2018; pp. 49–53. [Google Scholar] [CrossRef]
  35. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  36. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020, arXiv:cs.CL/1910.01108. [Google Scholar]
  37. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:cs.CL/1907.11692. [Google Scholar]
  38. Wilie, B.; Vincentio, K.; Winata, G.I.; Cahyawijaya, S.; Li, X.; Lim, Z.Y.; Soleman, S.; Mahendra, R.; Fung, P.; Bahar, S.; et al. IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, 4–7 December 2020; Wong, K.F., Knight, K., Wu, H., Eds.; ACL: Stroudsburg, PA, USA, 2020; pp. 843–857. [Google Scholar] [CrossRef]
  39. Ortiz Suárez, P.J.; Sagot, B.; Romary, L. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019, Cardiff, UK, 22 July 2019; pp. 9–16. [Google Scholar] [CrossRef]
  40. Ortiz Suárez, P.J.; Romary, L.; Sagot, B. A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 1703–1714. [Google Scholar] [CrossRef]
  41. Rani, S.; Singh Gill, N.; Gulia, P. Analyzing impact of number of features on efficiency of hybrid model of lexicon and stack based ensemble classifier for twitter sentiment analysis using WEKA tool. Indones. J. Electr. Eng. Comput. Sci. 2021, 22, 1041. [Google Scholar] [CrossRef]
  42. Loecher, M. Debiasing SHAP scores in random forests. AStA Adv. Stat. Anal. 2024, 108, 427–440. [Google Scholar] [CrossRef]
  43. Das, R.K.; Islam, M.; Hasan, M.M.; Razia, S.; Hassan, M.; Khushbu, S.A. Sentiment analysis in multilingual context: Comparative analysis of machine learning and hybrid deep learning models. Heliyon 2023, 9, e20281. [Google Scholar] [CrossRef] [PubMed]
  44. Ashbaugh, L.; Zhang, Y. A Comparative Study of Sentiment Analysis on Customer Reviews Using Machine Learning and Deep Learning. Computers 2024, 13, 340. [Google Scholar] [CrossRef]
  45. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:cs.CL/1301.3781. [Google Scholar]
  46. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Moschitti, A., Pang, B., Daelemans, W., Eds.; ACL: Stroudsburg, PA, USA, 2014; pp. 1532–1543. [Google Scholar] [CrossRef]
  47. Bonnier, T. Revisiting Multimodal Transformers for Tabular Data with Text Fields. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; ACL: Stroudsburg, PA, USA, 2024; pp. 1481–1500. [Google Scholar] [CrossRef]
  48. Catelli, R.; Pelosi, S.; Esposito, M. Lexicon-Based vs. Bert-Based Sentiment Analysis: A Comparative Study in Italian. Electronics 2022, 11, 374. [Google Scholar] [CrossRef]
  49. Muhammad, S.H.; Brazdil, P.; Jorge, A. Incremental Approach for Automatic Generation of Domain-Specific Sentiment Lexicon. In Proceedings of the Advances in Information Retrieval; Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F., Eds.; Springer: Cham, Switzerland, 2020; pp. 619–623. [Google Scholar] [CrossRef]
  50. Alahmadi, K.; Alharbi, S.; Chen, J.; Wang, X. Generalizing sentiment analysis: A review of progress, challenges, and emerging directions. Soc. Netw. Anal. Min. 2025, 15, 45. [Google Scholar] [CrossRef]
  51. Loh, H.W.; Ooi, C.P.; Seoni, S.; Barua, P.D.; Molinari, F.; Acharya, U.R. Application of explainable artificial intelligence for healthcare: A systematic review of the last decade (2011–2022). Comput. Methods Programs Biomed. 2022, 226, 107161. [Google Scholar] [CrossRef]
  52. Islam, M.S.; Kabir, M.N.; Ghani, N.A.; Zamli, K.Z.; Zulkifli, N.S.A.; Rahman, M.M.; Moni, M.A. Challenges and future in deep learning for sentiment analysis: A comprehensive review and a proposed novel hybrid approach. Artif. Intell. Rev. 2024, 57, 62. [Google Scholar] [CrossRef]
Figure 1. Deep neural network architectures: (a) Baseline without p (f0). (b) With f1. (c) With f2a and f2b.
Figure 2. Histogram of the first dataset: (a) SD of valence responses across annotators. (b) Valence scores.
Figure 3. Reprinted from [19]: (a) Histogram of the standard deviation of valence responses across annotators. (b) Histogram of valence scores.
Figure 4. Sentiment intensities: (a) Box plot. (b) Histogram of the standard deviation.
Figure 5. Histogram of sentiment labels.
Figure 6. Sentiment label prediction results using the lexicon-based polarity algorithm: (a) Box plot. The orange line is the median, and the circles are outliers, individual points unusually far from the bulk of the data. (b) Histogram.
Figure 7. Plot of the number of emotion-laden words found in the text vs. MSE vs. number of texts.
Figure 8. Method used to evaluate machine learning algorithms.
Figure 9. Feature importance values of the lexicon-based polarity p and the five most important words in the conventional machine learning models.
Figure 10. Method used to evaluate RNN architectures.
Figure 11. Comparison of feature importance between text features and the lexicon-based polarity p for different RNN architectures: (a) With f1. (b) With f2a. (c) With f2b.
Figure 12. Method used to evaluate transformer-based architectures.
Figure 13. Comparison of feature importance between text features and the lexicon-based polarity p for different transformer-based architectures: (a) With f1. (b) With f2a. (c) With f2b.
Table 1. Sample of emotion-laden words.

Indonesian | English | Valence Score | Valence SD
bersih | clean | 8.11 | 1.46
buruk | bad | 2.37 | 1.66
cerdas | smart | 8.02 | 1.54
komplain | complaint | 3.02 | 1.45
konsisten | consistent | 7.67 | 1.49
nakal | naughty | 2.39 | 1.64
pandai | clever | 7.78 | 1.87
panas | hot | 3.15 | 1.59
penuh | full | 5.24 | 1.99
saksi | witness | 5.59 | 1.80
Table 2. Distribution of annotators for the first dataset.

Set | Men | Women | Total
A | 20 | 26 | 46
B | 11 | 34 | 45
C | 12 | 33 | 45
Table 3. Intraclass correlation coefficient (ICC) values for the first dataset.

Set | ICC2 | F-Statistic | p-Value | 95% CI
A | 0.56099 | 62.784823 | 0.0 | [0.52, 0.61]
B | 0.546806 | 61.519679 | 0.0 | [0.50, 0.59]
C | 0.552751 | 62.640933 | 0.0 | [0.51, 0.60]
Table 4. Sample of the labeled dataset.

Indonesian | English | X | Y | Z | Label
Min @KAI121 kereta api #Gumarang tujuan Jakarta-Surabaya itu sudah busuk sebusuknya kereta. banyak berhentinya, fasilitas payah. Memalukan. Segera peremajaan dan buang ke laut kereta busuk itu! | Admin @KAI121 the #Gumarang train from Jakarta to Surabaya is as rotten as a train can be. Lots of stops, poor facilities. Embarrassing. Begin refurbishment immediately and throw that rotten train into the sea! | 1 | 2 | 1 | 1.33
@mrtjakarta AC di stasiun bundaran HI tidak nyala dan panas sekali di dalam peron. Kereta juga mengalami gangguan operasional. Gimana nih MRT Jakarta? Baru operasional udah ada gangguan operasional. | @mrtjakarta The AC at the HI roundabout station is not working, and it is very hot on the platform. The train is also experiencing operational disruptions. Why MRT Jakarta? It has only just started operating, and there are already operational disruptions. | 4 | 3 | 2 | 3.00
Tadi liat eskalator di Stasiun Sudirman sdh berfungsi lagi. Terima kasih @CommuterLine smoga eskalator2 yg rusak di stasiun lainnya pun segera diperbaiki. Ayo jadikan stasiun commuter line jadi stasiun yg ramah lansia and ibu hamil. | Earlier, I saw that the escalator at Sudirman Station is working again. Thank you @CommuterLine, hopefully the broken escalators at other stations will be repaired soon. Let’s make commuter line stations friendly to older people and pregnant women. | 8 | 7 | 8 | 7.67
@KAI121 orang tua dan anak2 bgtu nyaman naik keretaapi | @KAI121 parents and children are so comfortable riding the train | 8 | 8 | 9 | 8.33
Table 5. Intraclass correlation coefficient (ICC) for the second dataset.

Metric | Value
ICC2 | 0.576
F-statistic | 6.618
p-value | 0.0
95% CI | [0.41, 0.69]
Table 6. Comparison of the performance of machine learning models when using TF-IDF alone and when combining TF-IDF with the lexicon-based polarity p.

Algorithm | MSE without p | MSE with p | Diff.
SVR | 1.231 | 1.138 | 0.093
RT | 2.996 | 2.746 | 0.250
RF | 1.492 | 1.485 | 0.007
GB | 1.590 | 1.441 | 0.149
Table 7. Comparison of the performance of RNNs when using word embeddings alone and when combining them with the lexicon-based polarity p.

Architecture | MSE (f0) | MSE (f1) | Δ(f0, f1) | MSE (f2a) | Δ(f0, f2a) | MSE (f2b) | Δ(f0, f2b)
Vanilla RNN | 2.778 | 2.710 | 0.068 | 2.456 | 0.254 | 2.587 | 0.191
LSTM | 2.778 | 2.699 | 0.079 | 1.937 | 0.841 | 1.779 | 0.999
GRU | 2.612 | 2.267 | 0.345 | 1.886 | 0.726 | 1.916 | 0.696
Table 8. Comparison of the performance of transformer-based architectures when using contextual embeddings alone and when combining them with the lexicon-based polarity p.

Architecture | MSE (f0) | MSE (f1) | Δ(f0, f1) | MSE (f2a) | Δ(f0, f2a) | MSE (f2b) | Δ(f0, f2b)
IndoBERT | 1.014 | 1.089 | −0.075 | 0.894 | 0.120 | 0.923 | 0.091
DistilBERT | 1.295 | 1.398 | −0.103 | 1.034 | 0.261 | 1.410 | −0.115
RoBERTa | 1.137 | 1.187 | −0.050 | 1.107 | 0.030 | 1.036 | 0.101