Large Language Model-Based Topic-Level Sentiment Analysis for E-Grocery Consumer Reviews

Wangsa, Julizar Isya Pandu; Agung, Yudhistira Jinawi; Rahmi, Safira Raissa; Murfi, Hendri; Hariadi, Nora; Nurrohmah, Siti; Satria, Yudi; Za’in, Choiru

doi:10.3390/bdcc9080194

by Julizar Isya Pandu Wangsa ¹, Yudhistira Jinawi Agung ¹, Safira Raissa Rahmi ¹, Hendri Murfi ¹,*, Nora Hariadi ¹, Siti Nurrohmah ¹, Yudi Satria ¹ and Choiru Za’in ²

¹ Department of Mathematics, Universitas Indonesia, Depok 16424, Indonesia

² Computer Science and Information Technology, La Trobe University, Melbourne 32935, Australia

* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2025, 9(8), 194; https://doi.org/10.3390/bdcc9080194

Submission received: 23 May 2025 / Revised: 16 July 2025 / Accepted: 21 July 2025 / Published: 23 July 2025


Abstract

Customer sentiment analysis plays a pivotal role in the digital economy by offering comprehensive insights that inform strategic business decisions, optimize digital marketing initiatives, and improve overall customer satisfaction. We propose a large language model-based topic-level sentiment analysis framework. We employ a BERT-based model to generate contextualized vector representations of the documents, and then clustering algorithms are automatically applied to group documents into topics. Once the topics are formed, a GPT model is used to perform sentiment classification on the content related to each topic. The simulations show the effectiveness of this approach, where selecting appropriate clustering techniques yields more semantically coherent topics. Furthermore, topic-level sentiment polarization shows that 31.7% of all negative sentiment concentrates on the shopping experience, despite an overall positive sentiment trend.

Keywords: topic-level sentiment analysis; large language models; clustering; consumer analytics; e-grocery

1. Introduction

Customer sentiment analysis has become crucial in the digital economy because it provides deep insights that influence business strategy, digital marketing, and customer satisfaction. Companies can make more targeted, data-driven strategic decisions by mining opinions on social media and in online reviews. Studies have shown that leveraging sentiment analysis can support effective business decision-making [1]. In digital marketing, sentiment analysis is a powerful research tool for understanding consumer preferences. When performed in real time, it guides the adjustment of campaign strategies and target markets more accurately. In addition, by processing customer feedback automatically, companies can measure and improve customer satisfaction quickly. Sentiment analysis has been shown to provide valuable insights into customer satisfaction levels, preferences, and areas of service or product improvement [2]. Implementing customer sentiment analysis helps businesses design more responsive strategies, strengthen digital marketing effectiveness, and drive better customer experiences.

The increasing penetration of the internet and digital technologies has accelerated the development of Indonesia’s digital economy, with e-commerce emerging as the leading sector. In 2024, the e-commerce industry reached a gross merchandise value (GMV) of USD 65 billion, reflecting its significant role in the national economy [3]. Within this sector, e-groceries have gained prominence as consumers increasingly seek convenience and time efficiency in daily shopping. Platforms like Segari, Sayurbox, and KlikIndomaret have responded with innovations like real-time delivery and tailored shopping experiences [4], which makes sentiment analysis and topic detection compelling for the e-grocery industry. These methods enable a deeper, automated understanding of customer opinions, supporting data-driven decision-making for business improvement.

Transformer is a neural network model that has significantly transformed the field of natural language processing (NLP) and has become the foundational model for the development of large language models (LLMs) [5]. One of the essential aspects of a transformer is the self-attention mechanism for text representation in capturing contextual relationships among words in a text. Unlike previous text representation approaches, such as long short-term memory (LSTM) [6], Transformer allows more efficient parallel processing; thus, Transformer accelerates model training at a large scale. One of the pre-trained LLMs based on self-attention is bidirectional encoder representations from transformers (BERT) [7]. With an encoder-based architecture, BERT produces contextual text representations and becomes the foundation for many NLP tasks, including sentiment analysis.

Enhanced with a fully connected layer, BERT-NN outperforms other standard sentiment classification techniques, i.e., logistic regression, SentiWordNet, and LSTM [8]. In the age of generative AI, pre-trained decoder-based LLMs, such as the generative pre-trained transformer (GPT) [9], also have excellent potential to be adapted to sentiment analysis problems. The autoregressive mechanism of these models, which predicts the next word based on the probability distribution of previous tokens, allows the model to perform sentiment analysis by capturing context gradually. This mechanism enables the model to understand sentiment patterns by considering the entire sentence structure. Recent simulations show that generative models can compete with and sometimes exceed the accuracy of conventional methods [8,9].

Sentiment analysis is generally performed at the document level. Still, in some situations, more detailed approaches such as aspect-based sentiment analysis (ABSA) [10,11] or topic-level sentiment analysis (TLSA) [12,13,14,15,16,17,18,19,20] are needed to gain deeper insights into customer reviews. ABSA aims to recognize specific aspects in a text and determine the sentiment associated with them. In contrast, TLSA is more general, namely, analysis of overall sentiment towards a topic. Thus, TLSA is more global and suitable for public opinion analysis. Once each document has a topic and sentiment, the results can be aggregated to see the dominant sentiment per topic. TLSA enables businesses to identify sentiments based on various topics that matter, providing more relevant insights for decision-making.

In this paper, we propose an LLM-based TLSA framework: a BERT model is used to obtain vector representations of documents, and then clustering algorithms are applied to group documents into topics automatically. Once the topics are formed, a GPT model is used to perform sentiment classification on the content related to each topic. This work makes contributions from both methodological and practical perspectives:

We propose a TLSA framework that leverages the power of LLM-based models (BERT, GPT) for both subtasks: topic detection and sentiment analysis. The advantage is that the resulting topics are more meaningful, since contextual embeddings of BERT-based clustering can capture meaning beyond word co-occurrence of TFIDF-based LDA. Furthermore, sentiment analysis leverages transfer learning from pre-trained GPT models that already understand many nuances of language. This framework does not need a sentiment dictionary or labelled data training, unlike previous frameworks.
In this work, we demonstrate BERT-based soft clustering, which has not been considered before for topic detection. We comprehensively evaluate BERT-based clustering methods in the topic detection task of LLM-based TLSA. The results show that soft clustering, i.e., fuzzy c-means (FCM), produces more coherent topics in structured datasets. In contrast to soft clustering, HDBSCAN produces more diverse topics, and K-Means appears to be the most balanced method for achieving coherence and diversity. Thus, by selecting clustering methods, BERT-based clustering can produce more semantically coherent topics.
From a practical perspective, this research highlights the TLSA of the Indonesian e-grocery sector using real-world customer feedback data. The study identifies specific customer concerns such as affordability and packaging, fresh fruit and vegetable delivery, overall service experience, fresh produce quality, and fresh shopping satisfaction. Furthermore, topic-based sentiment polarization adds analytical depth by revealing what issues matter to customers and how these issues elicit divergent emotional reactions, enabling more targeted managerial responses. For example, this research reveals that 31.7% of all negative sentiment concentrates on the topic of the “shopping experience,” despite an overall positive sentiment trend.

The structure of this paper is as follows. In Section 2, we present the related works. The methods and methodology of the topic-level sentiment analysis are briefly explained in Section 3. We describe the result and discussion in Section 4. Finally, a general conclusion about the results is presented in Section 5.

2. Related Works

One of the TLSA approaches is the two-stage pipeline. The first stage is topic detection to extract latent topics from the text corpus, usually using a generative method such as LDA [21]. In the second stage, the resulting topics are analyzed for sentiment, e.g., using a lexicon-based method or a separate classifier, to determine whether each topic tends to be talked about positively or negatively.

Lexicon-based methods are widely used for sentiment analysis because they do not require labeled training data. After the topics are obtained, the sentiment score per topic is calculated using a sentiment dictionary. For example, Zhang et al. used LDA to find topics from Yelp restaurant reviews, then measured each topic’s sentiment with the VADER dictionary, and finally calculated the frequency of positive/negative words in the topic [14]. Qiao and Williams extracted topics from tweets about global warming using LDA and then applied the NRC emotion lexicon to estimate the total sentiment polarity of tweets on each topic [22]. Several other studies have also used LDA to find review topics on Twitter or even for airlines, then measured the sentiment per topic with tools such as SentiStrength, VADER, or TextBlob [16,17,18]. This lexicon-based approach is relatively simple and can be applied to various domains without requiring special model training. Still, its accuracy is highly dependent on the completeness of the sentiment dictionary and is less able to understand the context of the sentence.

Several other studies have used separate machine learning models for sentiment classification after obtaining topics, as an alternative to lexicons. Along with the development of deep learning, several text data representations that consider contextual aspects have been adopted for sentiment analysis in TLSA. For example, Jelodar et al. studied COVID-19 discussions on Reddit forums by first extracting topics via LDA and then training a long short-term memory (LSTM) model to classify user sentiment on each topic [17]. A similar approach was taken by Uthirapathy and Sandanam to Twitter data about climate change: they applied LDA to obtain key topics, then used a fine-tuned BERT model to categorize sentiment for each topic in the tweets [16]. Pathak et al. proposed a deep learning model that integrates topic extraction and sentiment classification at the sentence level. They employed the Online Latent Semantic Indexing (LSI) technique to represent topics as vectors and incorporated this topic information into a neural network using a topic-level attention mechanism to determine the sentiment of each sentence [23]. Gui et al. proposed a multi-task mutual learning framework that performs sentiment classification and topic detection jointly. Their model aligns word-level sentiment attention with topic distributions in a shared architecture, enabling the integration of topic and sentiment signals during training. This approach enhances sentiment prediction accuracy and the interpretability of the generated topics [24].

Alternatives to the LSI and LDA algorithms have also been used to detect topics. Garcia and Berton used the Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture (GSDMM) and then measured topic sentiment with the semi-supervised CrystalFeel method [15]. However, these methods only work on non-negative data representations. Thus, they are less flexible in adapting to the latest text data representation methods, such as BERT, which considers the contextual aspects of sentences. Therefore, clustering becomes an alternative topic detection method, because it offers greater flexibility for integration with a broader range of text representations [25]. Tounsi et al. implemented a BERT-based clustering approach called BERTopic to identify topics and adopted a rule-based and lexicon-driven method to detect sentiment [26]. While BERTopic has leveraged BERT and clustering methods for topic detection, it is based on hard clustering, which assumes that each document pertains to only one topic [27]. This assumption may not align with the practical reality of documents containing diverse thematic content. In our TLSA framework, we also consider soft clustering, such as fuzzy c-means (FCM), for topic detection [28,29,30].

For sentiment analysis, our TLSA framework utilizes transfer learning from a pre-trained generative model that already understands the nuances of language, namely the large language model GPT, to perform sentiment classification of content related to each topic. Unlike BERT, GPT allows us to detect sentiments without training from scratch. Thus, it does not need labeled data training. The model also supports multiple languages natively to detect sentiments in different languages. In addition, the autoregressive model can explain the reasons behind the sentiment to produce a more descriptive interpretation. The performance of GPT models has been evaluated against several baseline methods to highlight their comparative advantages. The evaluation results indicated that GPT models excel in capturing and interpreting sentiment across diverse contexts, particularly due to their robustness in handling linguistic challenges such as ambiguity, negation, informal expressions, and modern abbreviations commonly found in social media discourse. These aspects have traditionally posed significant challenges in the field of sentiment analysis [31].

Our TLSA framework leverages the power of LLM-based models (BERT, GPT) for both subtasks: topic detection and sentiment analysis. The advantage is that the topics generated tend to be more meaningful, because the contextual embeddings from BERT-based clustering can capture meaning in the context beyond just word co-occurrence of TFIDF-based LDA. The comprehensive choice of clustering methods, both hard and soft clustering, can produce more semantically coherent topics. Furthermore, sentiment analysis leverages transfer learning from pre-trained GPT models that already understand many nuances of language and do not require training from scratch. This also means that this framework does not need a sentiment dictionary and labelled data training like the previous frameworks.

3. Methods

In our framework, we consider one of the TLSA approaches called the two-stage pipeline. The first stage performs BERT-based clustering for topic detection. Once the topics are formed by BERT-based clustering, Generative AI, such as GPT, is used to perform sentiment analysis of the content related to each topic in the second stage. The results are aggregated to see the dominant sentiment per topic (Algorithm 1).

Algorithm 1 Topic-level sentiment analysis.

Input: $A$, $p$, $K$
Output: $R_{+}(k)$, $R_{-}(k)$, $t_{k}$

$r_{nk}, t_{k} = \mathrm{Topic\_Detection}(A, p, K)$
$s_{nk} = \mathrm{Sentiment\_Analysis}(A, r_{nk}, t_{k})$
$R_{+}(k), R_{-}(k) = \mathrm{Sentiment\_Topic\_Aggregation}(s_{nk}, r_{nk}, K)$

3.1. Topic Detection

There are several steps in the topic detection method, as shown in Algorithm 2. The first is the transformation step, where BERT transforms the textual data $A$ into a contextual representation $\hat{A}$ (Step 1). Second, a dimension reduction method reduces the dimension of the text representation (Step 2), where $p$ is the target dimension and $\tilde{A}$ is the lower-dimensional representation of the textual data. Next, a clustering method is applied in the lower-dimensional space. In this step, we extract the membership indicators $r_{nk}$ of each document ($n$) to each cluster ($k$), as shown in Step 3, where $K$ is the number of clusters. We use these membership indicators and c-TFIDF to extract the most frequent words of each cluster ($t_{k}$) from the textual data in the original word space ($A$) (Step 4). Topic interpretation can be performed manually or using generative AI such as GPT (Step 5).

Algorithm 2 Clustering-based topic detection.

Input: $A$, $p$, $K$
Output: $r_{nk}$, $t_{k}$

1. $\hat{A} = \mathrm{Text\_Representation}(A)$
2. $\tilde{A} = \mathrm{Dimension\_Reduction}(\hat{A}, p)$
3. $r_{nk} = \mathrm{Clustering}(\tilde{A}, K)$
4. $t_{k} = \mathrm{Topic\_Representation}(A, r_{nk})$
5. $t_{k} = \mathrm{Topic\_Interpretation}(A, t_{k})$

3.1.1. Text Representation

BERT is a pre-trained large language model built from a stack of transformer encoders [7]. Before passing through the encoders, a document is tokenized into a sequence of tokens. The next step is adding the special tokens [CLS], [SEP], and [PAD]. The [CLS] token is placed at the beginning of the input sequence and functions to represent the entire sentence, particularly in classification tasks. The [SEP] token is inserted at the end of a sentence or between two sentences to indicate separation, which is especially useful in sentence-pair tasks. The [PAD] token fills the sequence with empty tokens when the total number of tokens is less than a fixed length (e.g., 100 tokens), ensuring consistent input size for batch processing. If the document has more than 100 tokens, it is truncated to exactly 100 tokens. Embedding is the process that converts each token into a vector. There are three components of embedding, i.e., token, segment, and position embedding. The final embedding is the sum of the three embeddings, which becomes the input for the encoders of the BERT model. The core of the encoders is a self-attention mechanism that aims to produce a contextual text representation.
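The fixed-length input preparation described above can be illustrated with a minimal sketch. It is not the authors' code: the whitespace tokenizer is a stand-in assumption (BERT uses WordPiece subword tokenization), the function name `prepare_input` is hypothetical, and the choice to count [CLS] and [SEP] toward the 100-token limit, as well as the attention mask, are assumptions borrowed from common BERT practice rather than details stated in the text.

```python
# Special tokens named in the text. The whitespace tokenizer below is a
# stand-in assumption; BERT itself uses WordPiece subword tokenization.
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def prepare_input(document, max_len=100):
    """Return (tokens, attention_mask), both of length exactly max_len.

    Assumption: [CLS] and [SEP] count toward max_len, so at most
    max_len - 2 document tokens are kept.
    """
    words = document.split()              # toy tokenizer, not WordPiece
    body = words[: max_len - 2]           # truncate long documents
    tokens = [CLS] + body + [SEP]
    n_real = len(tokens)
    tokens += [PAD] * (max_len - n_real)  # pad short documents
    # 1 marks a real token, 0 marks padding (assumed, as in common practice).
    mask = [1] * n_real + [0] * (max_len - n_real)
    return tokens, mask


tokens, mask = prepare_input("fresh vegetables arrived late", max_len=8)
print(tokens)
print(mask)
```

In practice a library tokenizer would handle these steps; the sketch only makes the roles of the three special tokens concrete.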

There are two versions of BERT, i.e., BERTBASE and BERTLARGE. BERTBASE consists of 12 encoders and about 110 million parameters, while BERTLARGE has 24 encoders with about 340 million parameters. The pre-trained BERTBASE and BERTLARGE transform each document into contextual representation vectors of 768 and 1024 dimensions, respectively [7]. There is also a BERT developed specifically for the Indonesian language, known as IndoBERT. Its specifications resemble BERTBASE, but it has been trained with an Indonesian text corpus [32].

3.1.2. Dimension Reduction

Latent Semantic Analysis (LSA), Deep Autoencoder (DA), and Uniform Manifold Approximation and Projection (UMAP) are three commonly used dimensionality reduction methods in natural language processing. LSA uses singular value decomposition (SVD) to identify latent structures in text data and represents documents in a lower-dimensional space based on these latent structures [33]. DA is an artificial neural network consisting of two main components, an encoder and a decoder, each built from multiple hidden layers. Its main goal is to represent data in a lower-dimensional form (encoding) and then reconstruct the original data (decoding) from that representation [34]. UMAP is a nonlinear method based on topology and graph theory. It transforms high-dimensional datasets into a reduced-dimensional form with improved preservation of the data’s intrinsic local and global relationships, making it practical for visualization and clustering [35].

3.1.3. Clustering

K-Means, Fuzzy C-Means (FCM), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Hierarchical DBSCAN (HDBSCAN) are clustering methods used to group data based on feature similarities. K-Means is a partitioning method that divides data into k clusters by minimizing the distance between data points and the cluster centroid [36]. Fuzzy C-Means extends this concept by allowing each data point to have a degree of membership in several clusters, not just one [37]. DBSCAN uses data density to form clusters, recognize arbitrary cluster shapes, and identify outliers [38]. HDBSCAN is a development of DBSCAN that forms a cluster hierarchy and is more adaptive to density variations, resulting in more stable clusters without explicitly specifying radius parameters [39].
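To make the difference between hard and soft assignment concrete, the following sketch computes fuzzy c-means memberships for fixed centroids using the standard FCM membership update. It is an illustration under stated assumptions, not the implementation used in the experiments: the function name `fcm_memberships`, the Euclidean distance, and the toy data are all chosen here for the example, and the centroid update step of the full algorithm is omitted.

```python
import math


def fcm_memberships(points, centroids, m=2.0):
    """Fuzzy c-means membership matrix r[n][k] for fixed centroids.

    Standard update: r_nk = 1 / sum_j (d_nk / d_nj) ** (2 / (m - 1)),
    where d_nk is the Euclidean distance from point n to centroid k
    and m > 1 is the fuzzification constant.
    """
    if m <= 1:
        raise ValueError("fuzzification constant m must be greater than 1")
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [math.dist(x, c) for c in centroids]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centroids: share full
            # membership among them and give zero to the rest.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            memberships.append([h / sum(hits) for h in hits])
            continue
        row = [1.0 / sum((d_k / d_j) ** exponent for d_j in dists)
               for d_k in dists]
        memberships.append(row)
    return memberships


centroids = [(0.0, 0.0), (1.0, 0.0)]
print(fcm_memberships([(0.25, 0.0)], centroids, m=2.0))  # soft assignment
print(fcm_memberships([(0.25, 0.0)], centroids, m=1.1))  # nearly hard
```

Each row sums to one. As the fuzzification constant approaches one, the memberships approach a hard, K-Means-like assignment, while larger values spread membership across clusters.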

3.1.4. Topic Representation

c-TFIDF is a modification of the TFIDF method that can be used to find the topic representation of a group of textual data. From clustered textual data, c-TFIDF finds the words with the highest weights using Equation (1) [27].

$$w_{t,k} = tf_{t,k} \times \log\left(1 + \frac{K}{tf_{t}}\right) \qquad (1)$$

where $tf_{t,k}$ is the frequency of word $t$ in cluster $k$, $K$ is the number of clusters, and $tf_{t}$ is the frequency of word $t$ across all clusters. One is added inside the logarithm so that the output is positive. This process yields the top ten words that are used to represent a topic [40].
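As a worked illustration of Equation (1), the sketch below ranks the words of each cluster by their c-TFIDF weight. It assumes hard cluster assignments, pre-tokenized documents, and that $tf_{t}$ is the total count of word $t$ over all clusters; the function name `c_tfidf_top_words` and the toy reviews are hypothetical.

```python
import math
from collections import Counter


def c_tfidf_top_words(clusters, top_n=10):
    """Rank the words of each cluster by w = tf_{t,k} * log(1 + K / tf_t).

    clusters: list of clusters, each a list of tokenized documents.
    Assumption: tf_t is the total count of word t over all clusters.
    """
    K = len(clusters)
    tf_per_cluster = [Counter(tok for doc in docs for tok in doc)
                      for docs in clusters]
    tf_total = Counter()
    for tf in tf_per_cluster:
        tf_total.update(tf)

    top_words = []
    for tf in tf_per_cluster:
        weights = {t: f * math.log(1 + K / tf_total[t]) for t, f in tf.items()}
        # Sort by descending weight, breaking ties alphabetically.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        top_words.append(ranked[:top_n])
    return top_words


clusters = [
    [["fresh", "fruit", "fast"], ["fresh", "fruit"]],
    [["late", "delivery", "fast"], ["late", "delivery", "delivery"]],
]
print(c_tfidf_top_words(clusters, top_n=2))
```

The word "fast" occurs in both clusters, so it receives a lower weight than words concentrated in a single cluster and drops out of the top two in each.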

3.1.5. Topic Interpretation

Top words from each topic are usually manually interpreted by humans to become complete sentences that interpret the topic. Generative AI, such as GPT, allows the process of interpreting the top words into topic sentences to be performed automatically. We provide a prompt containing top words and several documents from the same cluster representing the topic as context. With this context, GPT can better understand the subject being discussed and will produce relevant topic interpretation sentences based on the top words and documents provided. Table 1 gives an example of a prompt for GPT to interpret the topic of a collection of top words.

3.2. Sentiment Analysis

Generative AI, such as GPT, has excellent potential for sentiment analysis because it combines self-attention and autoregression, allowing the model to understand context in depth while producing accurate predictions. Self-attention allows GPT to capture relationships between words and recognize essential elements that influence overall sentiment. Meanwhile, autoregression ensures that the model builds understanding gradually, with each token prediction based on the previous tokens, resulting in a more contextual and logical interpretation. With its ability to understand context and produce structured output, GPT has excellent potential to be adopted in various sentiment analysis applications [8,9]. With in-context learning (ICL), GPT can perform sentiment analysis directly from prompts, without the high computational costs of fine-tuning, as shown in Table 2 [9]. The prompt can also be expanded to a specific topic, such as “I am conducting sentiment analysis on customer reviews of [APPLICATION] about the topic of [TOPIC]”.
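A topic-specific prompt of this kind could be assembled as in the sketch below. Only the opening sentence is taken from the text; the instruction line, the output format, the helper names `build_sentiment_prompt` and `parse_sentiment`, and the example application name are hypothetical, and no particular GPT API is assumed or called.

```python
def build_sentiment_prompt(application, topic, review):
    """Fill the topic-specific prompt template with one review."""
    opening = (
        f"I am conducting sentiment analysis on customer reviews of "
        f"{application} about the topic of {topic}."
    )
    # Hypothetical instruction and layout; the prompt used in the paper
    # (its Table 2) is not reproduced here.
    instruction = "Classify the review as 'positive' or 'negative'. Answer with one word."
    return f"{opening}\n{instruction}\nReview: {review}\nSentiment:"


def parse_sentiment(reply):
    """Map a free-text model reply to '+', '-', or None if it is unclear."""
    text = reply.strip().lower()
    if text.startswith("positive"):
        return "+"
    if text.startswith("negative"):
        return "-"
    return None


prompt = build_sentiment_prompt("ExampleGrocer", "shopping experience",
                                "The app kept crashing at checkout.")
print(prompt)
```

Returning `None` for unclear replies keeps ambiguous model output out of the later aggregation step instead of forcing it into a polarity.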

3.3. Sentiment–Topic Aggregation

Sentiment–topic aggregation combines sentiment analysis and topic detection results to obtain the proportion of positive and negative sentiments on each detected topic. This aggregation process is stated in Algorithm 3.

Algorithm 3 Sentiment-topic aggregation.

Input: $s_{nk}$, $r_{nk}$, $K$
Output: $R_{+}(k)$, $R_{-}(k)$

$rt_{+}(k) = \sum_{n \mid s_{nk} = +} r_{nk}, \quad \forall k \in \{1, 2, \dots, K\}$
$rt_{-}(k) = \sum_{n \mid s_{nk} = -} r_{nk}, \quad \forall k \in \{1, 2, \dots, K\}$
$R_{+}(k) = \frac{rt_{+}(k)}{rt_{+}(k) + rt_{-}(k)} \times 100\%, \quad \forall k \in \{1, 2, \dots, K\}$
$R_{-}(k) = \frac{rt_{-}(k)}{rt_{+}(k) + rt_{-}(k)} \times 100\%, \quad \forall k \in \{1, 2, \dots, K\}$
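The aggregation in Algorithm 3 can be sketched in a few lines. The data layout is an assumption made for the example: `s[n][k]` holds the sentiment label ('+' or '-') of document n for topic k, and `r[n][k]` holds the membership of document n in topic k (0 or 1 for hard clustering, a value in [0, 1] for soft clustering). The function name and the handling of empty topics are also choices made here.

```python
def aggregate_sentiment(s, r, K):
    """Return a list of (R_plus, R_minus) percentages, one pair per topic.

    s[n][k]: sentiment label '+' or '-' of document n for topic k.
    r[n][k]: membership of document n in topic k.
    """
    results = []
    for k in range(K):
        rt_pos = sum(r[n][k] for n in range(len(r)) if s[n][k] == "+")
        rt_neg = sum(r[n][k] for n in range(len(r)) if s[n][k] == "-")
        total = rt_pos + rt_neg
        if total == 0:
            # No document belongs to this topic; the ratio is undefined.
            results.append((None, None))
        else:
            results.append((100.0 * rt_pos / total, 100.0 * rt_neg / total))
    return results


# Two documents with soft memberships over two topics.
r = [[0.75, 0.25], [0.25, 0.75]]
s = [["+", "-"], ["-", "-"]]
print(aggregate_sentiment(s, r, K=2))
```

With hard clustering the sums reduce to document counts per topic, so the same function covers both the K-Means/HDBSCAN and the FCM settings.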

4. Experiments

This section presents the experiments in two main parts. First, Section 4.1 provides the analysis of BERT-based clustering for topic detection in the LLM-based TLSA framework. Section 4.1.1 examines the parameter sensitivity of the BERT-based Fuzzy C-Means method, which has not been previously considered for topic detection, to understand the contribution of each parameter to model performance. Next, Section 4.1.2 performs a comparative evaluation of BERT-based clustering methods for topic detection.

Second, Section 4.2 provides a practical perspective, namely applying the LLM-based TLSA framework to the Indonesian e-grocery sector using real-world customer feedback data. This experiment aims to identify dominant topics in user reviews and measure sentiment polarization within each topic, generating in-depth thematic insights for decision-making in the digital economy.

4.1. Analysis of BERT-Based Clustering for Topic Detection

This simulation uses standard datasets, i.e., AG News, Yahoo! Answers, and R2 data. The dataset was initially collected by Zhang et al. [41]. AG News has four classes with 4000 data points in total, 1000 per class. R2 has two classes with 5859 data points; the first class comprises 3724 data points, and the second includes 2125 data points. Yahoo! Answers has ten classes totaling 10,000 data points, 1000 per class. The first preprocessing step combines all the lines in a document into one line. Secondly, any word with a hashtag pattern followed by a number is removed. The third preprocessing step removes HTML or XML-related text and code from documents. Lastly, whitespaces are removed, and repetitive punctuation marks are replaced with a single punctuation mark.

4.1.1. Sensitivity Analysis of the BERT-Based Fuzzy C-Means

The FCM method works well for data dimensions up to five [42]. For dimensions greater than five, FCM should start with a well-initialized prototype. The dimension reduction method aims to control the dimensions of the FCM input. If the desired dimension is low, more information will be lost. Therefore, the data dimension is included for hyperparameter tuning, with a space of [2, 10]. The fuzzification constant is another parameter to be considered for the parameter tuning. Zhou et al. suggested that the fuzzification constant should be within the interval [2, 3.5] [43]. Subakti et al. use a fuzzification constant of 1.1 to solve text clustering problems [44]. Therefore, parameter tuning of the degree of fuzziness will be carried out with a parameter space of (1, 3.5]. The parameters and parameter space are shown in Table 3. Other parameters are set as constants, i.e., a maximum number of iterations (T) equal to 10⁴ and a threshold (ε) of 10⁻⁵.

Figure 1 shows the sensitivity analysis of the BERT-based FCM model based on the first-order sensitivity index, total sensitivity index, and confidence intervals. Our simulation indicates that the total sensitivity index’s confidence interval has lower bounds greater than 0 for all parameters, i.e., X1, X2, and X3. This means that all parameters can be declared as sensitive parameters.

In the AG News dataset, parameters X2 and X3 contribute significantly to model performance, with the total sensitivity index of each approaching 0.6. Although X2 shows a relatively significant direct contribution (first order around 0.3), X3 contributes more through interactions with other parameters (low first order, but high total effect). Meanwhile, parameter X1 shows a minor contribution overall. This result indicates that tuning parameters X2 and X3 can significantly improve model performance on the AG News dataset. In contrast, in the R2 dataset, parameter X1 is dominant, with first-order and total sensitivity indices approaching the maximum value (1.0) and narrow confidence intervals, indicating that its influence is mostly direct and reliably estimated. Parameters X2 and X3 make minimal, almost negligible contributions. Therefore, for this dataset, optimization only needs to focus on parameter X1, because the other parameters do not play a significant role in model performance. In the Yahoo! Answers dataset, parameter X2 has a large total effect (close to 1.0), but its direct contribution is only about 0.25, indicating that it acts mainly through interactions with other parameters. Although its first-order index is low, parameter X3 also has a sizeable total effect (~0.65). Parameter X1 contributes moderately. Thus, tuning parameters X2 and X3 would be the right strategy to improve model performance on this dataset.

The results of this experiment indicate that only specific parameters make a dominant contribution to model performance on certain datasets. For example, parameters X2 and X3 contribute predominantly to AG News and Yahoo! Answers, while X1 is more dominant in the R2 dataset. These results indicate that a one-size-fits-all approach to parameter tuning is not optimal. Therefore, researchers can avoid exploring parameters insensitive to model performance, save computational resources, and speed up experimentation.

4.1.2. Comparative Evaluation of BERT-Based Clustering

The clustering methods produce topics represented by their top 10 most frequent words. Topic coherence is a commonly employed quantitative metric for assessing the interpretability of topics. Our simulations use a topic coherence measure called topic coherence-word2vec (TC-W2V) [45]. Suppose topic $t$ consists of $Z$ words $\{t_{1}, t_{2}, \dots, t_{Z}\}$; the TC-W2V of topic $t$ is given by Equation (2).

$$\mathrm{TC\text{-}W2V}(t) = \frac{1}{\binom{Z}{2}} \sum_{j=2}^{Z} \sum_{i=1}^{j-1} \mathrm{similarity}(wv_{j}, wv_{i}) \qquad (2)$$

where $wv_{j}$ and $wv_{i}$ are the vectors of the words $t_{j}$ and $t_{i}$ constructed by a word2vec model. In this simulation, we use pre-trained vectors trained on a part of the Google News dataset consisting of about 100 billion words (https://code.google.com/archive/p/word2vec/ (accessed on 16 May 2025)). This word2vec model contains 300-dimensional vectors for three million words and phrases.
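Equation (2) is the mean similarity over all pairs of a topic's top words, which the sketch below computes. It assumes cosine similarity as the similarity function and uses tiny hand-made two-dimensional vectors in place of the pre-trained 300-dimensional word2vec vectors; the function names and toy vocabulary are hypothetical, and out-of-vocabulary words are not handled.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def tc_w2v(top_words, vectors):
    """Mean pairwise similarity over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))  # Z-choose-2 unordered pairs
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Toy vectors standing in for word2vec embeddings (assumption).
vectors = {"apple": (1.0, 0.0), "pear": (1.0, 0.0), "truck": (0.0, 1.0)}
print(tc_w2v(["apple", "pear"], vectors))           # closely related words
print(tc_w2v(["apple", "pear", "truck"], vectors))  # one unrelated word
```

Adding an unrelated word lowers the score because two of the three pairs now have zero similarity.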

Topic diversity refers to the percentage of unique words among all top words extracted from the topics. The quality score is obtained by multiplying the topic coherence score by topic diversity [46].
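Both quantities are simple to compute once the top words are available. The sketch below returns topic diversity as a fraction in [0, 1] rather than a percentage, which is an assumption made here for convenience; the function names and the example topics are hypothetical.

```python
def topic_diversity(topics):
    """Share of unique words among all top words of all topics."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)


def quality_score(coherence, diversity):
    """Quality score as the product of topic coherence and topic diversity."""
    return coherence * diversity


topics = [["price", "cheap"], ["cheap", "fresh"]]
diversity = topic_diversity(topics)  # 3 unique words out of 4
print(diversity, quality_score(0.5, diversity))
```

A diversity of 1.0 means that no top word is shared between topics, while lower values indicate overlapping topics.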

Model parameter tuning is performed by grid search for comparative evaluation. Parameter tuning was performed five times to obtain the five scores, and the final score was the average of the five scores. Table 4 shows the performance of the models in terms of topic detection. First, we compare the soft clustering FCM and the baseline hard clustering HDBSCAN. On the AG News dataset, the FCM method obtains slightly higher topic coherence than the HDBSCAN method, by 4.5%. This slight difference indicates that the FCM-based approach can produce more coherent topics than the HDBSCAN method in short news datasets such as AG News. This advantage is most likely due to the nature of FCM, which allows a document to have membership levels in more than one topic, thus capturing overlaps between topics that are more common in news datasets. Regarding topic diversity, all methods have identical scores. These results indicate that the methods can capture equivalent levels of topic diversity.

On the R2 dataset, the HDBSCAN method performs 3.11% better than FCM. This difference indicates that the HDBSCAN-based clustering approach, which groups text based on density in vector space, is more effective at capturing topic structure in this dataset. One possible reason is the characteristics of the R2 dataset, which has a more disjointed topic distribution, so the fuzzy clustering approach is less than optimal for separating documents into distinctive topics. Figure 2 shows the visualization of AG News and the R2 ground truth label with text data representation from BERT. However, when the fuzzification constant approaches one, K-Means can reach topic coherence better than HDBSCAN. Regarding topic diversity, all methods have maximum scores for the R2 dataset. These results indicate that the methods can capture total topic diversity in the R2 datasets.

For Yahoo! Answers, FCM again shows superior topic coherence, 4.13% higher than that of HDBSCAN. However, HDBSCAN is far superior in terms of topic diversity, with a score of 0.946 compared with only 0.694 for FCM. These results indicate that FCM tends to group documents into narrower clusters, resulting in more coherent topics but less diversity among topics. Higher topic coherence in FCM means that the words in each topic have stronger semantic relationships, so the resulting topics are more meaningful and easier to interpret. However, lower topic diversity indicates that many of the same words are used in different topics, so the variation between topics is lower. In contrast, higher topic diversity in HDBSCAN indicates that this method can better capture the diversity of concepts in the Yahoo! Answers dataset, which has various categories of questions and answers. However, its lower topic coherence means that the resulting topics have more scattered words, which may be less meaningful or more challenging to interpret.

Meanwhile, K-Means offers a balance between FCM and HDBSCAN on the Yahoo! Answers dataset. In terms of topic coherence, K-Means performs better than HDBSCAN and even surpasses FCM, which usually excels in semantic relatedness between words due to its soft clustering approach. In terms of topic diversity, K-Means produces better topic diversity than FCM, which tends to create overlapping and less diverse topics. However, the topic diversity of K-Means is still below that of HDBSCAN, which is very explorative in mapping the diversity of data structures. Therefore, K-Means can be considered a balance point, combining the advantages of both soft and hard clustering approaches. K-Means maintains strong semantics in each topic, like FCM, while offering more diverse topic separations, like HDBSCAN. This makes K-Means an effective method when a compromise is needed between semantic depth and thematic coverage in topic detection.
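The relative coherence differences between FCM and HDBSCAN quoted in this subsection can be recomputed from the mean values in Table 4. The short sketch below uses only those means; the helper name `relative_gain` is hypothetical, and the standard deviations in the table are ignored.

```python
def relative_gain(score: float, baseline: float) -> float:
    """Relative improvement of `score` over `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0


# Mean topic coherence values copied from Table 4.
coherence = {
    "AG News": {"FCM": 0.1924, "HDBSCAN": 0.1841},
    "R2": {"FCM": 0.1796, "HDBSCAN": 0.1852},
    "Yahoo! Answers": {"FCM": 0.2419, "HDBSCAN": 0.2323},
}

ag_news_gain = relative_gain(coherence["AG News"]["FCM"],
                             coherence["AG News"]["HDBSCAN"])
r2_gain = relative_gain(coherence["R2"]["HDBSCAN"],
                        coherence["R2"]["FCM"])
yahoo_gain = relative_gain(coherence["Yahoo! Answers"]["FCM"],
                           coherence["Yahoo! Answers"]["HDBSCAN"])
```

The three values come out at roughly 4.5%, 3.1%, and 4.1%, in line with the percentages stated in the text.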

4.2. Analysis of Indonesian E-Grocery Customer Reviews

The era of the digital economy in Indonesia is growing rapidly, along with increasing adoption of technology and widespread internet penetration. Digitalization has driven various sectors to transform, including e-commerce. The e-commerce sector is the most significant contributor to Indonesia’s digital economy, with 11% growth, reaching a gross merchandise value (GMV) of US$65 billion in 2024 [3]. One of the increasingly popular trends is e-groceries, an online daily shopping service. The growth of e-groceries in Indonesia is driven by changes in people’s consumption patterns that increasingly prioritize convenience, efficiency, and security. Various e-grocery services make it easy to shop for household needs online: Segari, AlloFresh, KlikIndomaret, Sayurbox, and Alfagift. The e-commerce platforms have competed to provide e-grocery services with innovations such as instant delivery, subscription-based ordering, and personalized user experience [4].

The development of e-groceries in Indonesia has provided various benefits for the community, such as a more efficient shopping experience because they do not have to face traffic jams, rain, flooding, vehicle parking limitations, and fees, or carry heavy shopping items. In addition, e-grocery services also provide an omnichannel concept that facilitates the convergence of online and offline shopping environments, such as the “buy online, pick up in-store” feature that makes it easier for customers. With these various advantages, e-groceries are becoming a practical solution and are increasingly in demand by the community to meet their daily needs. Moreover, the development of e-groceries plays a role in creating sustainable and resilient cities by increasing the efficiency of food distribution, reducing food waste, and reducing carbon emissions through green logistics such as optimal route-based delivery and electric vehicles. In addition, e-groceries support city resilience by ensuring the accessibility of necessities, especially in emergencies, and strengthening local economies through partnerships with small farmers and suppliers. This integration enables a more efficient, inclusive, and environmentally friendly urban ecosystem [47].

Customer sentiment analysis in e-groceries applications is becoming increasingly important in improving user experience and business competitiveness in the digital era. With more and more customers leaving reviews and comments on e-groceries platforms, sentiment analysis can help companies understand user satisfaction, preferences, and common issues. This data allows businesses to respond quickly to complaints, improve service quality, and adjust marketing strategies based on customer opinions. In addition, sentiment analysis can also identify consumption trends and product preferences, which can be used for more efficient service personalization and stock management. With sentiment analysis tools, applications can provide more accurate and automated insights into customer perceptions, helping e-groceries maintain customer loyalty and drive business growth [48].

This simulation aimed to analyze Indonesia’s e-grocery customer sentiments in 2023, a phase of consolidation and intense competition in the e-grocery industry post-pandemic. In this phase, the trend of online shopping is increasing due to the improvement of digital infrastructure and changes in consumption patterns. The year 2023 is also a moment when e-grocery platforms face the challenges of inflation, logistics efficiency, and shifts in customer loyalty due to many choices of services. This analysis draws upon topic-level sentiment modeling to identify key aspects influencing customer satisfaction and dissatisfaction, thereby uncovering patterns potentially relevant to decision-making.

This simulation examines 3078 customer reviews of the Segari e-groceries application, gathered from the Play Store between 1 January 2023 and 31 December 2023. This experiment uses BERT-based clustering for topic detection and GPT for sentiment analysis. We also use GPT to interpret the topic representations generated by topic detection to avoid subjectivity. Finally, we perform sentiment–topic aggregation for comprehensive customer review analysis.

The optimal number of topics is determined by comparing coherence scores in the form of TC-W2V among topic numbers ranging from 1 to 20. The results of this coherence score comparison are shown in Figure 3, which shows that the optimal coherence score of 0.236 was obtained for five topics; thus, five topics were used in this study. The interpretation of each topic is presented in Table 5.
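A minimal sketch of this selection step follows, assuming that TC-W2V is computed as the mean pairwise cosine similarity between the word vectors of each topic's top words, averaged over topics [45]. The toy two-dimensional vectors and the neighboring coherence scores are hypothetical; only the score of 0.236 for five topics comes from the text.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def tc_w2v(topics: Sequence[Sequence[str]], vectors: Dict[str, Vector]) -> float:
    """Mean pairwise similarity of top words, averaged over topics.

    Each topic must contain at least two top words.
    """
    per_topic: List[float] = []
    for words in topics:
        pairs = list(combinations(words, 2))
        total = sum(cosine(vectors[a], vectors[b]) for a, b in pairs)
        per_topic.append(total / len(pairs))
    return sum(per_topic) / len(per_topic)


def best_topic_count(scores: Dict[int, float]) -> int:
    """Number of topics with the highest coherence score."""
    return max(scores, key=lambda k: scores[k])


# Hypothetical word vectors standing in for a trained word2vec model.
toy_vectors = {"fresh": (1.0, 0.0), "fruit": (1.0, 0.0), "price": (0.0, 1.0)}
toy_score = tc_w2v([["fresh", "fruit"], ["fresh", "price"]], toy_vectors)

# Only the value for five topics is reported in the text; the others are made up.
chosen = best_topic_count({4: 0.21, 5: 0.236, 6: 0.22})
```

In practice the scores dictionary would hold one coherence value per candidate number of topics, and the maximizer would be read off as in Figure 3.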

Next, we use zero-shot prompting with GPT for sentiment analysis. Zero-shot refers to a setting where no demonstrations are provided in the prompt; the model is given only natural language instructions describing the task. In our experiments, sentiment analysis was performed using the GPT-3.5-Turbo model via the OpenAI API, without any fine-tuning or additional training data. No parameter tuning was conducted, except for explicitly setting the temperature to 0 to make the outputs as deterministic and reproducible as possible. This adjustment reduces randomness in the model’s responses and maintains consistency across evaluations, without altering the model’s underlying weights. We evaluated this zero-shot prompting method on labeled reviews from a related application, consisting of 1572 positive and 430 negative reviews. The method produced good results, with an accuracy of 87%, a sensitivity of 89%, and a specificity of 78%.
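The three evaluation measures are related through the class sizes, which allows a quick consistency check of the reported figures. The sketch below assumes that "positive" is treated as the positive class; the function names are hypothetical, and the rates used are the rounded percentages from the text, so the result is only approximate.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of actual positive reviews that were classified as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of actual negative reviews that were classified as negative."""
    return true_neg / (true_neg + false_pos)


def accuracy_from_rates(sens: float, spec: float, n_pos: int, n_neg: int) -> float:
    """Accuracy implied by sensitivity, specificity, and the class sizes."""
    return (sens * n_pos + spec * n_neg) / (n_pos + n_neg)


# Rounded rates and class sizes as reported in the text.
implied_accuracy = accuracy_from_rates(0.89, 0.78, n_pos=1572, n_neg=430)
```

With these inputs the implied accuracy is close to 0.87, which agrees with the reported accuracy of 87%.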

After classifying the sentiment of all 3078 reviews, 2600 reviews (84.5%) were found to have positive sentiment, while 478 reviews (15.5%) had negative sentiment, as shown in Figure 4. This distribution indicates that e-grocery services in Indonesia, here represented by Segari, received mostly positive sentiment in 2023. However, 15.5% of the reviews still express negative sentiment. A more detailed analysis is therefore needed to understand the sources of this negative sentiment. To this end, we conduct further analysis using the extracted topics. The percentages of positive and negative sentiments associated with each topic are compared, as shown in Figure 5.
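One plausible way to carry out the sentiment–topic aggregation is to compute, for each sentiment class, the share of its reviews that falls into each topic. The sketch below follows this reading; the function name and the six toy records are hypothetical and are not taken from the Segari data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def sentiment_share_by_topic(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """For each sentiment, the share of its reviews assigned to each topic."""
    counts = Counter(records)  # keys are (topic, sentiment) pairs
    totals: Counter = Counter()
    for (_topic, sentiment), n in counts.items():
        totals[sentiment] += n
    shares: Dict[str, Dict[str, float]] = {}
    for (topic, sentiment), n in counts.items():
        shares.setdefault(sentiment, {})[topic] = n / totals[sentiment]
    return shares


# Hypothetical (topic, sentiment) pairs, one per review.
toy_records = [
    ("Topic 1", "positive"), ("Topic 1", "positive"), ("Topic 3", "positive"),
    ("Topic 2", "negative"), ("Topic 3", "negative"), ("Topic 3", "negative"),
]
shares = sentiment_share_by_topic(toy_records)
```

In this toy example two thirds of the negative reviews fall into Topic 3, which is the kind of concentration that a topic-level view makes visible even when the overall distribution is mostly positive.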

4.2.1. Findings and Discussion

The sentiment distribution shows that, while most users expressed positive feedback, some sentiment polarization cannot be ignored. For example, while 84.5% of the reviews reflect positive sentiment overall, the shopping experience stands out, accounting for 31.7% of all negative sentiment. This finding is important, as it signals both organizational strengths and potential areas of vulnerability. More nuanced insights are captured through topic-level sentiment analysis, which reveals five dominant topics: affordability and packaging (Topic 1), fresh fruit and vegetable delivery (Topic 2), overall service experience (Topic 3), fresh produce quality (Topic 4), and fresh shopping satisfaction (Topic 5).

Each topic represents a cluster of customer concerns and perceptions that reflect underlying operational or service-related issues. For instance, sentiment polarity in Topics 2 and 4 suggests that delivery reliability and product freshness remain critical quality dimensions. Negative sentiment within these topics may indicate gaps in cold-chain logistics or inconsistent product sourcing. Conversely, positive feedback in Topics 1 and 5 implies that competitive pricing and the convenience of shopping for fresh products online are compelling value propositions that can be further strengthened.

From a decision analytics perspective, these insights identify organizational risks, notably customer churn due to lapses in delivery quality or perceived product degradation, and strategic opportunities, such as capitalizing on pricing strategies and refining the user shopping experience. Exploiting this historical review data allows Segari to take preemptive actions in risk mitigation, customer retention, and targeted investment.

In sum, the findings emphasize the value of topic-based sentiment analytics in providing actionable intelligence for e-grocery platforms. Integrating such topic-level sentiment analytics into customer experience dashboards can help managers detect emerging issues earlier and design targeted responses, aligning directly with the objectives of Decision Analytics in leveraging transactional data to inform strategic actions.

4.2.2. Potential Biases and Their Impact on the Conclusions

While the topic-level sentiment analysis of Segari’s e-grocery reviews yields valuable insights, it is essential to critically examine potential biases that may affect the validity and generalizability of the conclusions. These biases include language coverage limitations, platform skew, and sentiment polarity constraints.

The analysis focuses solely on customer reviews written in Indonesian, as collected from the Google Play Store. This language constraint may overlook bilingual or English-language feedback, which could contain different expressions of dissatisfaction or nuanced praise. Moreover, the GPT and BERT models used in this study may exhibit varied performance depending on the language used, especially if the pretraining data was skewed toward English. Consequently, specific sentiment cues or cultural expressions unique to Indonesian might not be fully captured or may be misinterpreted, leading to potential sentiment misclassification.

The dataset is exclusively sourced from the Play Store, potentially excluding perspectives from iOS users or other feedback channels such as social media, email complaints, or in-app support tickets. As user behavior and demographics vary across platforms, this limitation may introduce a sampling bias. For instance, Android users may be more price-sensitive or likelier to leave public feedback than users from other ecosystems. This bias may over- or under-represent specific sentiment patterns or topic concerns, thus influencing the thematic interpretation and sentiment proportions.

The sentiment classification employed a binary structure, positive or negative, with no provision for neutral or mixed sentiments. This forced polarity may flatten nuanced user feedback, such as a review stating, “the delivery was fast but the vegetables were not fresh,” which contains both positive and negative aspects. Excluding a neutral or compound sentiment class potentially obscures valuable ambiguity, which is especially relevant in user-generated content, where sentiment is often contextually mixed. As a result, the binary classification may overstate sentiment polarization and affect the perception of customer satisfaction levels.

These biases can influence the strategic inferences derived from the analysis. For example, over-represented positive sentiment might lead to overestimating user satisfaction, causing delayed recognition of service quality issues. Platform or demographic exclusion can limit the scope of customer understanding, reducing the efficacy of targeted interventions. Similarly, ignoring neutral sentiments can obscure early warning signs in customer feedback.

5. Conclusions

This study proposes a large language model-based topic-level sentiment analysis framework to explore customer perceptions of the Indonesian e-grocery sector. Experiments on three benchmark datasets show that each clustering method has its strengths and weaknesses, depending on the characteristics of the data. Fuzzy C-Means effectively maintains semantic coherence, HDBSCAN excels in topic diversity, and K-Means exhibits the most balanced performance. Fuzzy C-Means extends K-Means by allowing each data point to have a degree of membership in several clusters, not just one. This approach better reflects the practical nature of documents, which often encompass multiple thematic elements. GPT performs sentiment analysis from instructions alone, without the high computational cost of fine-tuning, and also helps interpret topics based on the top words of each cluster.

In addition to the methodological contribution, applying this method to the Segari e-grocery review data in Indonesia shows high potential for practical application. By analyzing 3078 user reviews across the full calendar year of 2023, this research uncovers five core topics that provide a structured understanding of the areas that drive customer satisfaction, such as affordability and ease of shopping, as well as those that pose operational risks, particularly in the domains of delivery and product quality. The findings reveal that customer sentiment is not uniformly distributed across topics, indicating that different components of the service lifecycle impact user satisfaction in varied ways. This heterogeneity underscores the need for tailored interventions: logistics and quality control improvements to mitigate dissatisfaction, and reinforcement of competitive pricing and service convenience to amplify strengths. Integrating topic-level sentiment analysis into decision-making processes enables e-grocery platforms like Segari to transition from reactive to proactive strategies.

6. Limitations and Future Work

6.1. Limitations

Despite the promising results of our proposed large language model-based topic-level sentiment analysis (TLSA) framework, this study has several limitations that should be acknowledged.

6.1.1. Binary Sentiment Classification

Our framework adopts a binary sentiment classification (positive/negative) via GPT zero-shot prompts. This approach simplifies implementation but fails to capture neutral or mixed sentiments, which are common in user-generated content. Reviews expressing both satisfaction and complaints (e.g., “delivery was fast but items were damaged”) are forced into a single sentiment class, potentially biasing the sentiment–topic aggregation. Future work should consider multi-class or continuous sentiment scoring to better reflect real-world sentiment nuance.

6.1.2. Interpretability in Soft Clustering

While soft clustering using Fuzzy C-Means (FCM) allows more flexible topic memberships and often improves topic coherence, it complicates interpretability. The probabilistic nature of topic assignment can hinder the clarity of topic boundaries, making it challenging to present actionable insights to non-technical stakeholders. More interpretable soft clustering methods or hybrid strategies that balance flexibility and clarity could be explored in future research.

6.1.3. Computational Complexity

The framework combines large models (BERT for representation and GPT for sentiment and interpretation), dimensionality reduction, and clustering algorithms, resulting in significant computational demands. This limitation affects scalability, especially for real-time or large-scale applications. Future efforts may benefit from investigating lightweight or distilled transformer models and more efficient clustering pipelines without sacrificing performance.

6.1.4. Limited Domain and Language Scope

The empirical evaluation focuses on Indonesian e-grocery reviews from a single platform (Google Play Store). This scope limits generalizability to other domains, languages, or customer bases. Although IndoBERT partially addresses the language issue, further studies should evaluate the framework in multilingual and multi-platform contexts to ensure broader applicability.

6.2. Future Work

To address the identified limitations, future research should focus on enhancing the methodological flexibility of the framework. This includes implementing multilingual sentiment models, integrating multi-platform data sources, and applying multi-class sentiment classification techniques. These improvements aim to produce more comprehensive and inclusive sentiment insights for varied customer segments.

In addition, future directions may include (1) leveraging interpretable deep clustering or topic modeling approaches; (2) optimizing the pipeline for deployment in low-resource environments; (3) using alternative models such as LLaMA, Alpaca, or Mixtral, as these models are continuously evolving and may offer performance comparable to GPT with lower computational requirements; and (4) validating the proposed framework across diverse languages, application domains, and user demographics. These enhancements will contribute to both the methodological robustness and the practical relevance of topic-level sentiment analysis (TLSA) in real-world decision-making systems.

Author Contributions

J.I.P.W.: methodology, software, investigation. S.R.R.: methodology, software, investigation. Y.J.A.: methodology, software, investigation. H.M.: conceptualization, supervision, writing—original draft preparation. N.H.: supervision. S.N.: supervision. Y.S.: validation. C.Z.: validation, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

The Directorate of Research and Development, Universitas Indonesia, funded this research under Hibah PUTI Q1 2023 (Grant No. NKB-476/UN2.RST/HKP.05.00/2023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study were gathered from customer reviews of the Segari e-groceries application on the Play Store between 1 January 2023 and 31 December 2023.

Acknowledgments

We acknowledge the use of ChatGPT 3.5 to assist with text translation and to improve this manuscript’s clarity, flow, and grammar, as well as the use of GPT in our research as described.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Grande-Ramírez, J.R.; Roldán-Reyes, E.; Aguilar-Lasserre, A.A.; Juárez-Martínez, U. Integration of Sentiment Analysis of Social Media in the Strategic Planning Process to Generate the Balanced Scorecard. Appl. Sci. 2022, 12, 12307.
2. Nasrabadi, N.; Wicaksono, H.; Valilai, O.F. Shopping marketplace analysis based on customer insights using social media analytics. MethodsX 2024, 13, 102868.
3. Indonesia.go.id. Konsumsi Masyarakat Masih Tinggi, Produk Lokal Paling Laku di Harbolnas 2024. 2025. Available online: https://indonesia.go.id/kategori/editorial/8889/konsumsi-masyarakat-masih-tinggi-produk-lokal-paling-laku-di-harbolnas (accessed on 24 March 2025).
4. Titipku Research Team. Indonesia E-Grocery Report 2022; Titipku Research Team: Tangerang, Indonesia, 2022.
5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
6. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
7. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
8. Alaparthi, S.; Mishra, M. Bidirectional Encoder Representations from Transformers (BERT): A sentiment analysis odyssey. arXiv 2020.
9. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pretraining. 2018. Available online: https://gluebenchmark.com/leaderboard (accessed on 31 March 2025).
10. Krugmann, J.O.; Hartmann, J. Sentiment Analysis in the Age of Generative AI. Cust. Needs Solut. 2024, 11, 3.
11. Qiu, Y.; Jin, Y. ChatGPT and fine-tuned BERT: A comparative study for developing intelligent design support systems. Intell. Syst. Appl. 2024, 21, 200308.
12. Marrese-Taylor, E.; Velásquez, J.D.; Bravo-Marquez, F. A novel deterministic approach for aspect-based opinion mining in tourism products reviews. Expert Syst. Appl. 2014, 41, 7764–7775.
13. Thet, T.T.; Na, J.-C.; Khoo, C.S.G. Aspect-based sentiment analysis of movie reviews on discussion boards. J. Inf. Sci. 2010, 36, 823–848.
14. Zhang, S.; Ly, L.; Mach, N.; Amaya, C. Topic Modeling and Sentiment Analysis of Yelp Restaurant Reviews. Int. J. Inf. Syst. Serv. Sect. 2022, 14, 1–16.
15. Garcia, K.; Berton, L. Topic detection and sentiment analysis in Twitter content related to COVID-19 from Brazil and the USA. Appl. Soft Comput. 2021, 101, 107057.
16. Uthirapathy, S.E.; Sandanam, D. Topic Modelling and Opinion Analysis On Climate Change Twitter Data Using LDA and BERT Model. Procedia Comput. Sci. 2023, 218, 908–917.
17. Jelodar, H.; Wang, Y.; Orji, R.; Huang, S. Deep Sentiment Classification and Topic Discovery on Novel Coronavirus or COVID-19 Online Discussions: NLP Using LSTM Recurrent Neural Network Approach. IEEE J. Biomed. Health Inform. 2020, 24, 2733–2742.
18. Kwon, H.-J.; Ban, H.-J.; Jun, J.-K.; Kim, H.-S. Topic Modeling and Sentiment Analysis of Online Review for Airlines. Information 2021, 12, 78.
19. Carvache-Franco, O.; Carvache-Franco, M.; Carvache-Franco, W.; Iturralde, K. Topic and sentiment analysis of crisis communications about the COVID-19 pandemic in Twitter’s tourism hashtags. Tour. Hosp. Res. 2023, 23, 44–59.
20. Abiola, O.; Abayomi-Alli, A.; Tale, O.A.; Misra, S.; Abayomi-Alli, O. Sentiment analysis of COVID-19 tweets from selected hashtags in Nigeria using VADER and Text Blob analyser. J. Electr. Syst. Inf. Technol. 2023, 10, 5.
21. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
22. Qiao, F.; Williams, J. Topic Modelling and Sentiment Analysis of Global Warming Tweets. J. Organ. End User Comput. 2021, 34, 18.
23. Pathak, A.R.; Pandey, M.; Rautaray, S. Topic-level sentiment analysis of social media data using deep learning. Appl. Soft Comput. 2021, 108, 107440.
24. Gui, L.; Leng, J.; Zhou, J.; Xu, R.; He, Y. Multi Task Mutual Learning for Joint Sentiment Classification and Topic Detection. IEEE Trans. Knowl. Data Eng. 2020, 34, 1915–1927.
25. Nur, K.; Najahaty, I.; Hidayati, L.; Murfi, H.; Nurrohmah, S. Combination of singular value decomposition and K-means clustering methods for topic detection on Twitter. In Proceedings of the 2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS), Depok, Indonesia, 10–11 October 2015; Institute of Electrical and Electronics Engineers: New York, NY, USA, 2015; pp. 123–128.
26. Tounsi, A.; Elkefi, S.; Bhar, S.L. Exploring the Reactions of Early Users of ChatGPT to the Tool using Twitter Data: Sentiment and Topic Analyses. In Proceedings of the 2023 IEEE International Conference on Advanced Systems and Emergent Technologies (IC_ASET), Hammamet, Tunisia, 29 April–1 May 2023; Institute of Electrical and Electronics Engineers: New York, NY, USA, 2023; pp. 1–6.
27. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794.
28. Murfi, H. A scalable eigenspace-based fuzzy c-means for topic detection. Data Technol. Appl. 2021, 55, 527–541.
29. Muliawati, T.; Murfi, H. Eigenspace-based fuzzy c-means for sensing trending topics in Twitter. AIP Conf. Proc. 2017, 1862, 030140.
30. Murfi, H. The Accuracy of Fuzzy C-Means in Lower-Dimensional Space for Topic Detection. In Smart Computing and Communication; Qiu, M., Ed.; Springer International Publishing: Cham, Switzerland, 2018; pp. 321–334.
31. Kheiri, K.; Karimi, H. SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning. arXiv 2023, arXiv:2307.10234.
32. Koto, F.; Lau, J.H.; Baldwin, T. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization. arXiv 2021, arXiv:2109.04607.
33. Dumais, S.T. Latent semantic analysis. Annu. Rev. Inf. Sci. Technol. 2004, 38, 188–230.
34. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
35. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2018, arXiv:1802.03426.
36. McQueen, J.B. Some methods of classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, SC, USA, 21 June–18 July 1965; pp. 281–297.
37. Bezdek, J.C.; Ehrlich, R.; Full, W. FCM: The fuzzy c-means clustering algorithm. Comput. Geosci. 1984, 10, 191–203.
38. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise; AAAI Press: Munchen, Germany, 1996.
39. Campello, R.J.G.B.; Moulavi, D.; Sander, J. Density-Based Clustering Based on Hierarchical Density Estimates. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Macau, China, 14–17 April 2019; Springer: Berlin/Heidelberg, Germany, 2019; Volume 7819, pp. 160–172.
40. Joachims, T. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML’97), Nashville, TN, USA, 8–12 July 1997; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1997; pp. 143–151.
41. Zhang, X.; Zhao, J.; Lecun, Y. Text Understanding from Scratch. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation; MIT Press: Cambridge, MA, USA, 2015; pp. 649–657.
42. Winkler, R.; Klawonn, F.; Kruse, R. Fuzzy C-means in high dimensional spaces. Int. J. Fuzzy Syst. Appl. 2011, 1, 1–16.
43. Zhou, K.; Fu, C.; Yang, S. Fuzziness parameter selection in fuzzy c-means: The perspective of cluster validation. Sci. China Inf. Sci. 2014, 57, 1–8.
44. Subakti, A.; Murfi, H.; Hariadi, N. The performance of BERT as data representation of text clustering. J. Big Data 2022, 9, 15.
45. O’Callaghan, D.; Greene, D.; Carthy, J.; Cunningham, P. An analysis of the coherence of descriptors in topic modeling. Expert Syst. Appl. 2015, 42, 5645–5657.
46. Dieng, A.B.; Ruiz, F.J.R.; Blei, D.M. Topic Modeling in Embedding Spaces. arXiv 2019, arXiv:1907.04907.
47. Liu, D.; Deng, Z.; Zhang, W.; Wang, Y.; Kaisar, E.I. Design of sustainable urban electronic grocery distribution network. Alex. Eng. J. 2021, 60, 145–157.
48. Huang, H.; Zavareh, A.A.; Mustafa, M.B. Sentiment Analysis in E-Commerce Platforms: A Review of Current Techniques and Future Directions. IEEE Access 2023, 11, 90367–90382.

Figure 1. Sensitivity analysis of the BERT-based FCM model based on the first-order sensitivity index, total sensitivity index, and confidence interval.

Figure 2. t-SNE visualization of AG News (left) and R2 (right) ground truth label with text data representation from BERT.

Figure 3. Selecting the optimal number of topics based on coherence scores.

Figure 4. Distribution of customer sentiment regarding the Indonesia e-grocery service Segari in 2023.

Figure 5. Topic-level polarization of customer sentiment regarding the Indonesia e-grocery service Segari in 2023.

Table 1. The GPT prompt format used to interpret the topic of a collection of top words.

Prompt
I have a topic described by the following keywords: [TOP WORDS] The following documents are associated with the topic: [DOCUMENTS] Using the information above, identify a suitable short label for the given topic.

Table 2. The GPT prompt format used to determine the sentiment of customer reviews.

Prompt
I am conducting sentiment analysis on customer reviews of [APPLICATION]. Instruction: Answer with only one word: “positive” or “negative”! <text>[DATA]</text>
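As an illustration of how the prompt in Table 2 might be used, the sketch below fills the template and normalizes the one-word answer. The helper names and the `ask_model` callable are hypothetical: `ask_model` stands in for a request to the GPT-3.5-Turbo model with the temperature set to 0, and no real API call is made here.

```python
from typing import Callable

# Template text follows Table 2; the placeholder names are our own.
PROMPT_TEMPLATE = (
    "I am conducting sentiment analysis on customer reviews of {application}. "
    "Instruction: Answer with only one word: “positive” or “negative”! "
    "<text>{review}</text>"
)


def build_prompt(application: str, review: str) -> str:
    """Fill the Table 2 template with the application name and one review."""
    return PROMPT_TEMPLATE.format(application=application, review=review)


def parse_label(raw_answer: str) -> str:
    """Normalize the model's answer to 'positive' or 'negative'."""
    cleaned = raw_answer.strip().strip("\"'“”.! ").lower()
    if cleaned not in {"positive", "negative"}:
        raise ValueError(f"unexpected answer: {raw_answer!r}")
    return cleaned


def classify(review: str, application: str,
             ask_model: Callable[[str], str]) -> str:
    """Classify one review; `ask_model` sends the prompt and returns the reply."""
    return parse_label(ask_model(build_prompt(application, review)))


# Hypothetical stand-in for the language model, used only for demonstration.
def fake_model(prompt: str) -> str:
    return "Positive."


label = classify("Fast delivery and fresh vegetables", "Segari", fake_model)
```

Keeping the model call behind a plain callable makes the parsing logic testable without network access, and answers outside the two allowed labels are rejected rather than silently mapped to a class.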

Table 3. Hyperparameters and hyperparameter space of BERT-based FCM.

Notation	Hyperparameter	Hyperparameter Space
$X_{1}$	Encoder Layer of BERT	{‘last_hidden’, ‘second_to_last’, ‘concat_last_four’}
$X_{2}$	The data dimension (p)	$[2, 10]$
$X_{3}$	The fuzzification constant of FCM (m)	$(1, 3.5]$
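One way to enumerate candidate configurations from the space in Table 3 is sketched below. The dictionary keys, the assumption that the data dimension takes integer values, and the step of 0.5 used to discretize the fuzzification constant are all illustrative choices that are not stated in the table.

```python
from itertools import product

# Values taken from Table 3.
ENCODER_LAYERS = ["last_hidden", "second_to_last", "concat_last_four"]
DIMENSIONS = list(range(2, 11))  # p in [2, 10], assumed to be integers

# Assumption: m in (1, 3.5] is discretized with a step of 0.5.
FUZZIFIERS = [1.5, 2.0, 2.5, 3.0, 3.5]

# Every combination of encoder layer, data dimension, and fuzzification constant.
grid = [
    {"encoder_layer": layer, "p": p, "m": m}
    for layer, p, m in product(ENCODER_LAYERS, DIMENSIONS, FUZZIFIERS)
]
```

Under these assumptions the grid contains 135 configurations (3 encoder layers, 9 dimensions, and 5 fuzzification constants), each of which would be scored with the topic quality measures described earlier.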

Table 4. The coherence, diversity, and quality scores for the topic detection methods. The simulations are conducted on the AG News, R2, and Yahoo! Answers datasets.

Datasets	Metrics	FCM	K-Means	DBSCAN	HDBSCAN
AG News	Topic Coherence	0.1924 ± 0	0.1912 ± 0.0006	0.1762 ± 0.0033	0.1841 ± 0.0022
	Topic Diversity	0.95 ± 0	0.925 ± 0	0.95 ± 0	0.95 ± 0
	Quality Score	0.1828 ± 0	0.1738 ± 0.0015	0.1673 ± 0.0032	0.1749 ± 0.0021
R2	Topic Coherence	0.1796 ± 0	0.1882 ± 0	0.1862 ± 0.0024	0.1852 ± 0.0024
	Topic Diversity	1 ± 0	1 ± 0	1 ± 0	1 ± 0
	Quality Score	0.1796 ± 0	0.1882 ± 0	0.1862 ± 0.0024	0.1852 ± 0.0024
Yahoo! Answers	Topic Coherence	0.2419 ± 0.0046	0.2653 ± 0.0081	0.2450 ± 0.0024	0.2323 ± 0.009
	Topic Diversity	0.694 ± 0.0089	0.878 ± 0.0075	0.844 ± 0.0049	0.946 ± 0.0156
	Quality Score	0.1678 ± 0.0010	0.2329 ± 0.0056	0.2068 ± 0.0021	0.2196 ± 0.0064

Table 5. The topic interpretations.

Topic	Interpretation
Topic 1	Affordability and packaging
Topic 2	Delivery of fresh vegetables and fruits
Topic 3	Experience and impressions of the service
Topic 4	Quality of fresh products
Topic 5	Satisfaction with shopping for fresh products
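Once each review carries a topic label from Table 5 and a sentiment label, the two can be aggregated into topic-level shares of each sentiment class. The sketch below assumes one topic per review (for a soft clustering such as FCM, for example the topic with the highest membership). The function name and the toy records are hypothetical and are not data from the study.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def topic_sentiment_shares(
    records: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """For each topic, return the share of every sentiment class that falls in it.

    `records` holds (topic_label, sentiment) pairs, one per review.
    """
    per_topic = defaultdict(Counter)  # topic -> sentiment -> count
    totals = Counter()                # sentiment -> count over all topics
    for topic, sentiment in records:
        per_topic[topic][sentiment] += 1
        totals[sentiment] += 1
    return {
        topic: {s: counts[s] / totals[s] for s in totals}
        for topic, counts in per_topic.items()
    }


if __name__ == "__main__":
    # Toy records for illustration only; not results from the paper.
    toy = [
        ("Quality of fresh products", "positive"),
        ("Quality of fresh products", "positive"),
        ("Experience and impressions of the service", "negative"),
        ("Experience and impressions of the service", "positive"),
        ("Affordability and packaging", "negative"),
    ]
    for topic, shares in topic_sentiment_shares(toy).items():
        print(topic, shares)
```

Normalizing by the total count of each sentiment class, rather than by topic size, yields statements of the form "a given share of all negative sentiment falls in topic X", which is the form of the topic-level finding reported in the abstract.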

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

