Next Article in Journal
Changes in Pitching Performance After Ulnar Collateral Ligament Reconstruction Differ Among Major League Baseball Starting and Relief Pitchers
Previous Article in Journal
Advances and Applications of Agricultural Spray Deposition Detection Technologies
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

A Unified AI Framework for Turkish E-Commerce Review Analysis: Sentiment Classification, LLM-Based Summarization, and Fuzzy Evaluation

by
Erdal Özbay
1,
Feyza Altunbey Özbay
2,* and
Ahmet Bedri Özer
1
1
Department of Computer Engineering, Firat University, 23119 Elazig, Türkiye
2
Department of Software Engineering, Firat University, 23119 Elazig, Türkiye
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(12), 5849; https://doi.org/10.3390/app16125849 (registering DOI)
Submission received: 20 May 2026 / Revised: 6 June 2026 / Accepted: 9 June 2026 / Published: 10 June 2026

Abstract

The rapid growth of user-generated reviews on e-commerce platforms has created a significant decision-making challenge for both consumers and sellers, particularly in morphologically rich low-resource languages such as Turkish. This study proposes a unified artificial intelligence framework for Turkish e-commerce review intelligence by integrating transformer-based sentiment classification, instruction-tuned large language model summarization, and explainable fuzzy logic-based product evaluation within a single end-to-end architecture. A balanced dataset containing 183,333 Turkish reviews was constructed from Trendyol, Amazon Turkey, and Hepsiburada using LLM-assisted annotation and stratified downsampling. Experimental evaluations demonstrated that the fine-tuned BERTurk 128k model achieved a macro F1-score of 0.9243 on the held-out test set. To overcome the limitations of multilingual news-oriented summarization models on informal review text, the framework employed the Turkish instruction-tuned Kumru-2B model together with structured prompt engineering to generate sentiment-aware abstractive summaries. In addition, a Mamdani-type fuzzy inference system was designed to combine sentiment distribution, seller reliability, star ratings, and review volume into an interpretable product-level score. The complete pipeline was integrated into a FastAPI and React-based web platform capable of processing approximately 850 reviews in under 60 s. The findings demonstrate that domain-specific Turkish language models combined with explainable reasoning mechanisms can provide accurate, scalable, and human-interpretable decision support for large-scale e-commerce environments.

1. Introduction

The rapid growth of e-commerce has fundamentally changed the way consumers make purchasing decisions. Rather than relying on expert recommendations or physical inspection, buyers increasingly depend on the accumulated opinions of previous customers that are publicly available, searchable, and, in principle, freely interpretable. In practice, however, the sheer volume of user-generated reviews on major platforms renders manual reading infeasible. A moderately popular product on a Turkish e-commerce platform such as Trendyol or Hepsiburada may accumulate thousands of reviews within months of its listing. For a prospective buyer, synthesizing this volume of feedback into an actionable purchasing decision is an overwhelming task. For a seller or platform operator seeking to understand product-level satisfaction trends, it is even more so [1].
This review volume problem is not simply a matter of information overload. It reflects a deeper structural asymmetry: the mechanisms by which consumers generate feedback have scaled far faster than the mechanisms by which that feedback can be consumed and acted upon. Existing platform-level solutions aggregate star ratings, helpfulness votes, and sorting algorithms address the surface of this problem without resolving its underlying cause. A product with a 4.2-star average rating across 800 reviews conceals as much as it reveals: the distribution of sentiment across those reviews, the specific dimensions of the product that satisfied or disappointed buyers, and the reliability of the seller through whom the product was purchased are all invisible to a user who sees only the aggregate score [2].
The first dimension of the problem concerns sentiment classification: automatically determining the polarity of individual reviews at scale. Sentiment analysis has emerged as a critical field in Natural Language Processing (NLP) to extract and quantify subjective information from the ever-growing volume of digital text [3]. It offers a principled approach to this problem by enabling the automatic classification of review text into polarity categories, typically positive, neutral, or negative, without requiring human reading of individual comments. Transformer-based models, and BERT in particular, have substantially advanced the state of the art in sentiment classification across a wide range of languages and domains. For Turkish, a morphologically rich and agglutinative language that poses distinct challenges for natural language processing, early efforts focused on constructing lexical resources and rule-based models to handle its complex morphology [4,5]. More recently, the development of BERTurk, a BERT model pre-trained exclusively on Turkish corpora, has provided a strong foundation for domain-specific fine-tuning. However, sentiment classification alone does not solve the review volume problem: knowing that 84% of reviews for a given product are positive tells a user relatively little about why those reviews are positive, which product attributes they praise, or what the remaining 16% of reviewers found objectionable.
The second dimension concerns automatic summarization: condensing large review collections into human-readable outputs that explain why reviewers are satisfied or dissatisfied. Automatic summarization of review collections addresses this complementary need, but remains an open challenge in the Turkish NLP literature. Existing multilingual summarization models, trained predominantly on news corpora, fail to generalize to the informal, abbreviated, and domain-specific language of e-commerce reviews. The mismatch between pre-training distribution and deployment context is particularly acute for Turkish, where the morphological complexity of the language further complicates transfer from resource-rich settings [6].
A third dimension of the problem concerns the aggregation of heterogeneous quality signals into a single interpretable evaluation. Users do not only care about sentiment; they also consider star ratings, seller reliability, and the statistical confidence that comes from a large review sample. Combining these signals through a simple weighted average obscures the non-linear and context-dependent interactions between them. Fuzzy logic provides a framework for modeling this ambiguity explicitly, representing intermediate values through graded membership functions and aggregating signals through human-interpretable IF-THEN rules [7].
This paper addresses all three dimensions of the review volume problem through a unified end-to-end pipeline. The proposed system accepts a Trendyol product URL as input and produces three outputs: a three-class sentiment classification of every review on the product page, a natural language summary for each sentiment category generated by the Kumru-2B Turkish language model, and a fuzzy logic-based product score that integrates sentiment distribution, average star rating, seller reliability, and review count into a single interpretable evaluation. The pipeline is accessible through a React-based web interface and processes 850+ reviews in under 60 s.
The main contributions of this study can be summarized as follows:
(i)
Proposing the first unified Turkish e-commerce intelligence framework integrating sentiment classification, sentiment-aware summarization, and explainable fuzzy evaluation;
(ii)
Constructing and publicly releasing a large-scale, balanced Turkish e-commerce review dataset containing 183,333 manually validated LLM-annotated samples collected from three major platforms;
(iii)
Demonstrating that dataset diversity and class balance exert a greater influence on Turkish sentiment classification performance than tokenizer vocabulary expansion;
(iv)
Showing that instruction-tuned Turkish LLMs substantially outperform news-oriented multilingual summarization models in informal review summarization tasks;
(v)
Introducing an interpretable fuzzy inference mechanism capable of transforming heterogeneous review signals into transparent product-level evaluations.
Although previous studies have investigated Turkish sentiment classification [8,9,10,11], review summarization [12,13,14], or fuzzy evaluation [15,16,17] independently, the literature still lacks an integrated decision-support framework capable of simultaneously performing sentiment prediction, sentiment-aware abstractive summarization, and interpretable multi-criteria product evaluation within a unified Turkish NLP pipeline. Existing studies generally focus on isolated classification benchmarks and rarely address the downstream usability of sentiment outputs for real consumer decision-making scenarios. Furthermore, the interaction between large language model-based summarization and fuzzy reasoning mechanisms remains largely unexplored in Turkish e-commerce analytics. The present study addresses this gap by proposing a fully integrated and deployable framework that combines transformer-based sentiment analysis, instruction-tuned Turkish LLM summarization, and explainable fuzzy inference into a single operational architecture [18,19,20].
The remainder of this paper is organized as follows. Section 2 reviews related work in Turkish sentiment analysis, automatic text summarization, LLM-based annotation, and fuzzy logic for e-commerce evaluation. Section 3 describes the methodology, including data collection, model training, and system integration. Section 4 presents experimental results. Section 5 discusses the findings and their implications. Section 6 concludes with directions for future work.

2. Related Work

2.1. Sentiment Analysis in Turkish E-Commerce

Sentiment analysis of Turkish product reviews has attracted growing research interest in recent years, driven by the rapid expansion of the Turkish e-commerce market and the morphological complexity of the Turkish language. Initial studies in the Turkish domain utilized traditional machine learning algorithms, such as Support Vector Machines (SVM) and Naive Bayes, often applied to limited datasets or specific news domains [4,21]. Demircan et al. applied five machine learning models, including SVM and Random Forests, to Turkish e-commerce review data, finding that SVM and RF classifiers achieved the strongest performance on three-class sentiment classification tasks [8]. Savci and Das conducted a comparative study of sentiment analysis across Arabic, English, and Turkish e-commerce texts using RNN, CNN, and LSTM-based architectures, noting that Turkish presents distinct challenges compared to morphologically simpler languages due to its agglutinative structure [22].
The transition from sparse representations to dense vectors was facilitated by the introduction of word embedding techniques such as Word2Vec [23] and GloVe [24], which captured semantic relationships more effectively. Subsequently, the introduction of transformer-based pre-trained language models substantially advanced the state of the art. Açıkalın et al. [9] were among the first to apply BERT to Turkish sentiment analysis, proposing two strategies: direct fine-tuning of the multilingual BERT model on Turkish data, and sentiment classification following machine translation of Turkish texts into English. Their experiments on movie and hotel review datasets demonstrated that BERT-based approaches significantly outperform traditional methods. Schweter [25] subsequently released BERTurk, a set of BERT models pre-trained exclusively on large Turkish corpora, providing a stronger foundation for Turkish-specific downstream tasks. Fine-tuning BERTurk on Turkish e-commerce data was explored by Teke et al. [10], who trained and evaluated the model on a multi-domain dataset of 73,398 reviews collected from Trendyol across six product categories, reporting strong classification performance across product domains. More recently, Öcal [11] conducted a systematic comparison of BERT-based sentiment classification against star ratings as an alternative labeling signal on Turkish e-commerce data, providing evidence that text-based models capture sentiment nuances that star ratings alone cannot represent. Furthermore, the development of robust Turkish polarity lexicons like SentiTurkNet [5] has complemented these deep learning approaches by providing baseline semantic signals for resource-constrained settings.
Despite these advances, the existing literature on Turkish e-commerce sentiment analysis presents two notable gaps that the current study addresses. First, prior work has overwhelmingly focused on binary or three-class classification in isolation, without integrating the classification output into a downstream summarization or decision support pipeline. Second, the challenge of class imbalance, which is severe in e-commerce review data, where positive reviews typically dominate by a large margin, has received limited attention in the Turkish NLP literature.

2.2. Automatic Text Summarization of Customer Reviews

Automatic summarization of product reviews is a well-established research direction in the English-language NLP literature, but remains underexplored for Turkish. Automatic text summarization traditionally relied on extractive methods that rank sentences based on graph centrality, with TextRank [12] and LexRank [13] being the most prominent examples. Recently, abstractive summarization approaches based on sequence-to-sequence architectures, in particular the T5 [14], BART [26], and PEGASUS [27] families, have been widely adopted for multilingual summarization tasks. However, domain mismatch between pre-training corpora and deployment contexts remains a persistent challenge. Models trained on news corpora, which feature formal prose and well-formed sentences, frequently fail to produce coherent summaries of informal user-generated content characterized by incomplete sentences, colloquialisms, and domain-specific vocabulary.
The use of instruction-tuned large language models for structured summarization tasks has emerged as a promising alternative. Prompt engineering approaches in which the model is provided with explicit formatting constraints alongside the input text have been shown to substantially improve the consistency and relevance of generated outputs without requiring task-specific fine-tuning [28]. This approach is particularly attractive for low-resource languages such as Turkish, where labeled summarization corpora are scarce.

2.3. LLM-Based Data Annotation

The use of large language models as automatic data annotators has gained considerable traction as a cost-effective alternative to human labeling. Pangakis et al. [29] provided an early systematic evaluation of GPT-4 as an annotator across classification tasks from social science research, arguing that any automated annotation pipeline using LLMs must be validated against human-generated labels, as performance varies substantially across tasks and prompt formulations. Gilardi et al. [30] demonstrated that ChatGPT 4.0 outperforms crowd-sourced annotators on several political text classification tasks, while acknowledging that performance degrades on tasks requiring nuanced domain knowledge. More recent work has explored combining multiple LLMs or multiple prompts to improve annotation agreement as a proxy for label reliability [28,31]. In this study, GPT-4o mini and Gemini 2.5 Flash were used for sentiment labeling across the three platform-specific subsets of the dataset. The use of star ratings as a complementary signal for Amazon Turkey reviews, where available, follows recommendations in the literature that structured signals should be incorporated alongside free-form text to constrain the annotation space and reduce ambiguity.

2.4. Fuzzy Logic in E-Commerce Evaluation

Fuzzy logic has been applied in e-commerce contexts primarily for trust evaluation, service quality assessment, and multi-criteria decision support. Aggarwal [15] developed an early fuzzy expert system for evaluating e-commerce trustworthiness by aggregating multiple website-level indicators through Mamdani-type inference rules, demonstrating that fuzzy approaches better capture the inherent ambiguity of trust-related judgments than crisp aggregation methods. More recently, Şimşek and Güvendiren [16] applied soft computing techniques to measure e-commerce website service quality indices, showing that fuzzy aggregation of multiple quality dimensions yields more interpretable and human-aligned scores than weighted averages. In a closely related study, Golondrino et al. [17] combined sentiment analysis with fuzzy logic for usability evaluation, using sentiment scores derived from user feedback as one of several input variables to a Mamdani fuzzy inference system, a design pattern that directly informs the architecture of the Fuzzy Box module proposed in the current study.

3. Methodology

This study proposes an end-to-end pipeline for Turkish e-commerce review analysis consisting of four main components: (1) multi-platform data collection and preprocessing, (2) sentiment classification using a fine-tuned BERTurk model, (3) sentiment-aware summarization using the Kumru-2B language model, and (4) a fuzzy logic-based product evaluation module. Figure 1 illustrates the overall system architecture.

3.1. Data Collection and Preprocessing

To construct a comprehensive and domain-representative dataset, product reviews were collected from three major Turkish e-commerce platforms: Trendyol, Amazon Turkey, and Hepsiburada. A Selenium-based web scraper was developed to extract user reviews along with associated metadata, including star rating, seller information, and seller score. Initially, over 582,000 reviews were collected from Trendyol. Following a length-based filtering step in which reviews exceeding a predefined word count threshold were excluded to ensure consistency with the token length distribution of the other platforms, the Trendyol subset was reduced to 59,791 reviews. In total, 494,000 raw reviews were collected across all three platforms.
For sentiment labeling, two large language models were employed. Reviews from Trendyol were labeled using GPT-4o mini via the OpenRouter API, while Amazon Turkey and Hepsiburada reviews were labeled using Gemini 2.5 Flash. For Amazon Turkey reviews, both the review text and the star rating were provided to the model simultaneously to improve labeling accuracy. For Hepsiburada reviews, where star ratings were unavailable in a subset of the collected data, labeling was performed based solely on review content. Each review was assigned one of three sentiment categories: positive, neutral, or negative.
Following the labeling phase, a standard text preprocessing pipeline was applied. This included whitespace normalization, removal of non-Turkish characters, and spell correction using Gemini 2.5 Flash to address the high frequency of orthographic errors commonly found in informal Turkish user-generated content. Duplicate reviews were removed using exact text matching across platforms.
The labeled dataset exhibited a severe class imbalance, particularly in the Trendyol subset, where the positive-to-negative ratio reached 16.6:1. To address this, stratified downsampling was applied such that the final training corpus contained an equal number of samples per class. The resulting merged dataset comprised 183,333 reviews distributed equally across the three sentiment categories (61,111 per class), drawn from all three platforms. Table 1 and Table 2 summarize the data collection and filtering statistics, and the sentiment label distribution before and after stratified downsampling, respectively. The fully preprocessed and balanced dataset used in this study has been made publicly available to support future research [32].

3.2. Sentiment Classification

Sentiment classification was performed using BERTurk, a BERT-based language model pre-trained exclusively on Turkish corpora. The underlying architecture follows the Transformer framework introduced by Vaswani et al. [33]. Two variants of BERTurk were evaluated: the standard 32,000-token vocabulary model (bert-base-turkish-cased) and a larger 128,000-token vocabulary model (bert-base-turkish-128k-cased). The latter was developed to better handle the agglutinative morphological structure of Turkish, where a single word can encode information that would require several words in English. For instance, a word such as “kullanamıyorum” (“I cannot use it”) is tokenized into three subword units by the 32k model, whereas the 128k model represents it as a single token, preserving semantic integrity more effectively.
Both models were fine-tuned for three-class sequence classification (positive, neutral, negative) using the merged dataset of 183,333 reviews. The dataset was partitioned into training, validation, and test sets following a 70/15/15 stratified split, ensuring that the class distribution was preserved across all partitions. The fine-tuning process leverages the bidirectional representations learned during pre-training, similar to other cross-lingual architectures like XLM-R [34].
Training was conducted on a Google Colab environment equipped with an NVIDIA A100 80 GB GPU. The AdamW optimizer was used with a learning rate of 2 × 10−5 and a weight decay of 0.01. A linear learning rate scheduler with a 10% warm-up ratio was applied over the total number of training steps. Given the perfectly balanced class distribution achieved through stratified downsampling, class weights were nonetheless incorporated into the cross-entropy loss function as an additional safeguard against residual imbalance effects.
Three fine-tuning experiments were conducted in total. The first two experiments used the 69,184-sample dataset comprising Trendyol and Amazon Turkey reviews, and compared the 32k and 128k vocabulary models under identical training conditions. The third experiment extended the dataset to include Hepsiburada reviews, yielding the final 183,333-sample corpus, and used the 128k vocabulary model based on its stronger morphological tokenization properties. Table 3 reports the validation macro F1 scores across all experiments and epochs.
The results reveal a clear pattern: while the 128k vocabulary model showed a marginal advantage in the early epochs of Experiments 1 and 2, the two models converged to nearly identical performance by Epoch 4, with a difference of only 0.001 in macro F1. This suggests that, for the given dataset size, the additional expressiveness of the larger vocabulary did not translate into a significant performance gain on its own. The decisive factor was the inclusion of Hepsiburada data in Experiment 3. With the dataset size increasing from 69,184 to 183,333 samples and the class balance shifting from a skewed distribution to a perfectly uniform 1:1:1 ratio, the model achieved a validation macro F1 of 0.9243 from the very first epoch onward, already surpassing the final performance of the previous experiments.
The final model was evaluated on the held-out test set, yielding the per-class results reported in Table 4. Notably, the neutral class, historically the most challenging category in sentiment analysis tasks due to its ambiguous boundary with both positive and negative classes, achieved an F1 score of 0.90, a substantial improvement over the 0.80 observed in Experiment 1. This improvement is largely attributable to the inclusion of the Hepsiburada subset, which contained a substantially higher proportion of neutral and negative samples compared to the other platforms.

3.3. Sentiment-Aware Summarization

Following sentiment classification, reviews were grouped by their predicted sentiment label and passed through a summarization module designed to produce a concise, human-readable summary for each sentiment category. The objective was to distill the collective opinion of hundreds of reviewers into three distinct paragraphs, one for positive, one for neutral, and one for negative reviews, thereby allowing end users to grasp the overall reception of a product without reading individual comments.
Initially, a transformer-based extractive–abstractive summarization approach inspired by the FusionSum framework was considered for this module. The plan involved fine-tuning a multilingual T5 (mT5) model on Turkish e-commerce review data to perform sentence-level fusion and abstractive generation. Two candidate models were evaluated: mtufan/mt5-small-turkish-summarization, which had been fine-tuned on Turkish news corpora, and csebuetnlp/mT5_multilingual_XLSum, a multilingual variant trained on summarization benchmarks across several languages. Both models were tested on a sample of 30 review groups drawn from the merged dataset, using pseudo-references constructed from the first three sentences of each group. The average ROUGE-L score across both models was 0.0532, well below the target threshold of 0.35. Qualitative inspection of the generated summaries confirmed that both models produced outputs that were semantically unrelated to the input reviews, a consequence of the domain mismatch between Turkish news text, on which these models were trained, and informal Turkish e-commerce language. On this basis, the mT5-based approach was abandoned in accordance with the contingency plan defined in the project proposal.
As an alternative, Kumru-2B was adopted for the summarization module. Kumru-2B is a 2-billion-parameter causal language model developed from scratch for the Turkish language by VNGRS [35]. It was pre-trained on a cleaned and deduplicated corpus of 500 GB of Turkish text comprising approximately 300 billion tokens, and subsequently fine-tuned on one million instruction-following examples. Unlike general-purpose multilingual models, Kumru-2B incorporates a native Turkish tokenizer based on byte-pair encoding with a vocabulary of 50,176 tokens, which processes Turkish morphological structures significantly more efficiently. Benchmark evaluations on the Cetvel suite, a Turkish-language evaluation framework covering 26 distinct NLP tasks, including summarization, grammatical error correction, and question answering, have shown that Kumru-2B outperforms considerably larger multilingual models such as LLaMA-3.3-70B, Gemma-3-27B, and Qwen-2-72B on Turkish-specific tasks, despite its comparatively compact size.
Summarization was implemented through structured prompt engineering. For each sentiment class, a task-specific system prompt was defined, instructing the model to produce a single coherent paragraph in Turkish, without bullet points, numbered lists, or quotation marks. The prompt explicitly identified the sentiment category and directed the model to highlight the dominant themes within that group, such as product quality and delivery speed for positive reviews, or recurring complaints and disappointment for negative ones. Up to 20 reviews per sentiment class were concatenated and passed to the model as context. Generation was performed with a temperature of 0.3, a repetition penalty of 1.2, and a maximum of 150 new tokens, configurations chosen to balance output diversity with factual consistency.
The system was tested on a real Trendyol product page containing 849 reviews. The model successfully produced fluent, domain-appropriate summaries for all three sentiment classes within the same inference session. Representative outputs are provided in Table 5.

3.4. Fuzzy Logic-Based Product Evaluation

While sentiment distribution and sentiment-specific summaries provide actionable information for users who are willing to engage with detailed review content, many consumers and sellers require a single interpretable score that consolidates multiple heterogeneous signals into an at-a-glance evaluation. The Fuzzy Box addresses this complementary need by integrating sentiment distribution, star ratings, seller reliability, and review volume into a unified product-level score, enabling rapid decision support without requiring the user to manually weigh each signal.
The final component of the proposed pipeline is a fuzzy logic-based scoring module, referred to as the Fuzzy Box, which aggregates multiple product-level signals into a single interpretable score ranging from 0 to 10. The evaluation engine is grounded in the foundational principles of fuzzy set theory [36], utilizing a Mamdani-style inference mechanism [37] originally developed for industrial control applications [38]. The motivation behind adopting a fuzzy logic framework as opposed to a simple weighted average lies in the inherently imprecise and overlapping nature of the input variables. For example, a positive review ratio of 0.72 cannot be cleanly categorized as either “high” or “low”; it occupies an intermediate region that a crisp threshold would misrepresent. Fuzzy logic handles such ambiguity by assigning partial degrees of membership to linguistic categories, enabling more nuanced and human-aligned evaluations.

3.4.1. Input Variables and Fuzzification

Four input variables were defined for the Fuzzy Box module. The first is the positive review ratio, computed as the proportion of reviews classified as positive by the BERT model, normalized to the range [0, 10]. The second is the average star rating assigned by users to the product, rescaled from the original [1, 5] scale to [0, 10] using the transformation (r − 1)/4 × 10. The third is the seller score, a platform-provided metric reflecting seller reliability, clipped to the [0, 10] range. The fourth is the review count, which serves as a proxy for evaluation reliability and is normalized by dividing by 100 and clipping at 10.
Although the positive review ratio and the average star rating are related signals, they are not redundant. The neutral sentiment class, in particular, exhibits a relatively high mean star rating, indicating that users frequently assign favorable star ratings while expressing mixed or ambivalent opinions in their review text. This divergence suggests that text-based sentiment classification captures evaluative nuances that aggregate star ratings alone cannot represent. Furthermore, star ratings were used during the LLM-based annotation phase solely as a contextual signal to improve labeling accuracy on Amazon Turkey reviews, not as a direct label assignment mechanism. Their role as an input variable in the Fuzzy Box is therefore independent of the annotation pipeline and does not constitute double-counting of the same underlying evaluation signal.
Each input variable was fuzzified using triangular membership functions, which assign a degree of membership between 0 and 1 to each of three linguistic categories: low, medium, and high. For a given input value x and a triangular function defined by parameters (a, b, c), the membership degree μ(x) equals (xa)/(ba) if a < xb, and (cx)/(cb) if b < x < c.
The parameters for each variable were set empirically based on the observed distributions in the collected dataset. For the positive ratio and average rating, the triangular boundaries were set at (3, 6, 9), reflecting the expectation that scores below 3 indicate poor sentiment, scores around 6 represent moderate satisfaction, and scores above 9 indicate strong positive reception. For the seller score, boundaries were set at (3, 6, 8) to account for the compressed upper range typically observed in platform seller ratings.

3.4.2. Rule Base

A rule base consisting of 15 fuzzy IF-THEN rules was constructed to model the relationships between the input variables and the output score. The rules were designed to capture both primary interactions, such as the combined effect of a high positive ratio and high average rating, and secondary modifying effects, such as seller reliability and review volume. Table 6 presents a representative subset of the rule base.
Each rule fires with a strength equal to the minimum membership degree of its antecedents, following the Mamdani inference approach. Rules involving seller score and review count are treated as weighted modifiers with reduced firing strengths (0.3–0.5) to prevent these secondary variables from dominating the output.

3.4.3. Defuzzification and Output

The output score was computed through weighted average defuzzification, in which each rule’s contribution is the product of its firing strength and its associated output value. The aggregation of multiple conflicting criteria (e.g., positive ratio vs. seller score) follows the logic of ordered weighted averaging operators to ensure a balanced final score [39]. The final score S is given by S = Σ (wi × si)/Σ wi, where wi denotes the firing strength of the ith rule and si denotes the corresponding output value. The resulting score is clipped to the interval [0, 10] and mapped to one of five interpretive categories as shown in Table 7.
In addition to the numerical score, the module generates a set of natural language explanations derived from the rules that fired with non-zero strength. These explanations identify the dominant factors contributing to the final score, for instance, noting that a high seller score reinforced the overall evaluation, or that a low positive ratio constrained it. This explainability component is consistent with the broader objective of producing outputs that are not only accurate but also interpretable by non-expert users.
The Fuzzy Box was evaluated on the same product used for the summarization test. With a positive review ratio of 0.844, an average rating of 4.92, and a seller score of 9.1 across 849 reviews, the module produced a score of 7.7, corresponding to the “Good” category. The explanations generated by the system attributed the score primarily to the high positive review ratio combined with the high star rating, and further reinforced by the seller’s strong reliability score.

3.5. System Integration

The four modules described in the preceding sections, data collection, sentiment classification, summarization, and fuzzy evaluation, were integrated into a unified end-to-end pipeline accessible through a web-based interface. The pipeline accepts a Trendyol product URL as input and returns a structured JSON response containing the sentiment distribution, per-class summaries, and the fuzzy evaluation score.
The backend was implemented using FastAPI, a Python-based asynchronous web framework, and deployed on Google Colab with an NVIDIA A100 GPU. External access to the API was enabled through ngrok, a reverse tunneling service that exposes the locally running server to the public internet via a secure HTTPS endpoint. The frontend was developed as a single-page React application styled with Tailwind CSS, providing an interactive dashboard that visualizes the sentiment distribution as a horizontal bar chart, displays the three Kumru-2 B-generated summaries as categorized cards, and renders the Fuzzy Box score alongside its interpretive label and contributing factors.
The complete pipeline was tested on a real product page from Trendyol containing 849 reviews. End-to-end processing from URL submission to final JSON output was completed in under 60 s, satisfying the system performance criterion of processing 1000 reviews in under 10 min. Figure 2 illustrates the system architecture and the data flow between components.

4. Experiments and Results

4.1. Experimental Setup

All experiments were conducted on Google Colab Pro using an NVIDIA A100 SXM4 80 GB GPU. The sentiment classification model was implemented using the Hugging Face Transformers library (version 5.0.0) with PyTorch (version 2.11.0) as the backend, developed under Python (version 3.12.13). Fine-tuning hyperparameters were kept constant across all three experiments to ensure a fair comparison: learning rate 2 × 10−5, weight decay 0.01, batch size 64, maximum sequence length 128 tokens, and 4 training epochs. The AdamW optimizer was used in conjunction with a linear learning rate scheduler incorporating a 10% warm-up period. All experiments used a fixed random seed of 42 to ensure reproducibility.
For the summarization module, Kumru-2B was loaded in half-precision (torch.float16) with automatic device mapping to maximize GPU utilization. Generation hyperparameters were set as follows: temperature 0.3, repetition penalty 1.2, and a maximum of 150 new tokens per summary. The fuzzy inference engine was implemented from scratch in Python without external fuzzy logic libraries, using triangular membership functions and weighted average defuzzification as described in Section 3.4.

4.2. Sentiment Classification Results

The sentiment classification experiments yielded consistent improvements across the three experimental conditions. As reported in Table 3, the baseline experiment using 69,184 samples and the 32k vocabulary model achieved a best validation macro F1 of 0.8834. Replacing the backbone with the 128k vocabulary variant under identical conditions produced a marginally lower score of 0.8824, indicating that the larger vocabulary did not confer a meaningful advantage at this dataset scale. The performance difference between the two vocabulary configurations remained marginal across all epochs, suggesting that vocabulary expansion alone did not substantially affect classification performance under the current dataset scale.
The most substantial performance gain was observed in Experiment 3, where the addition of 114,149 Hepsiburada reviews to the training corpus, coupled with the resulting shift to a perfectly balanced 1:1:1 class distribution, produced a validation macro F1 of 0.9058 at the end of the first epoch alone, already exceeding the best results of the previous experiments. The model continued to improve through subsequent epochs, reaching a final validation macro F1 of 0.9243 at Epoch 4. This result represents a 4.6 percentage point improvement over Experiment 1 and surpasses the project success criterion of F1 ≥ 0.80 by a margin of 12.4 percentage points. The confusion matrix of the fine-tuned BERTurk model using Experiment 3 is shown in Figure 3.
Test set evaluation of the Experiment 3 model revealed strong performance across all three classes. The positive class achieved an F1 score of 0.95, the negative class 0.92, and the neutral class 0.90. The neutral class, which is typically the most difficult to classify due to its semantic proximity to both positive and negative categories, showed a 10 percentage point improvement compared to Experiment 1 (from 0.80 to 0.90). This improvement is attributed to the Hepsiburada subset, which contributed 44,043 neutral samples, nearly three times the combined neutral count of the Trendyol and Amazon Turkey subsets. Overall test accuracy was 92%, confirming the generalizability of the model to unseen data.
Figure 4 shows the training and validation loss curves and macro F1 trajectories across epochs for Experiment 3. The validation loss reached its minimum at Epoch 2 (0.2083) before gradually increasing in subsequent epochs, while the validation F1 continued to rise through Epoch 4. This divergence between loss and F1 is consistent with a phenomenon known as soft overfitting, in which the model becomes increasingly uncertain about borderline cases reflected in rising loss while still improving its hard classification decisions. As the best model checkpoint was saved based on validation F1 rather than loss, this behavior did not negatively impact final performance.

4.3. Summarization Results

The summarization module was evaluated qualitatively on a representative product from Trendyol, a hair care oil with 849 reviews. Following sentiment classification, the reviews were grouped into 716 positive, 93 neutral, and 40 negative instances. Kumru-2B generated a separate summary for each group within the same inference session.
The generated summaries were assessed along three dimensions: fluency, relevance, and factual consistency. In terms of fluency, all three summaries produced grammatically correct and stylistically natural Turkish prose, with no instances of repetition, hallucination of product names, or code-switching to English. In terms of relevance, the summaries accurately reflected the dominant themes present in each group, for example, the positive summary correctly identified softness, shine, and reduced hair loss as the primary benefits cited by reviewers, while the negative summary captured recurring complaints about dryness and packaging damage. Factual consistency was maintained throughout, with no summary introducing claims that were absent from the input reviews.
For reference, the baseline mT5 models evaluated during the model selection phase produced an average ROUGE-L score of 0.0532 on pseudo-reference summaries constructed from review subsets, confirming the inadequacy of those models for this domain. Although a direct ROUGE-L evaluation of the Kumru-2B outputs was not performed due to the absence of human-written reference summaries, the qualitative results suggest that the Kumru-2B model generates more coherent and domain-appropriate summaries than those obtained with the tested mT5 baselines.

4.4. Fuzzy Box Results

The Fuzzy Box module was evaluated on the same product used for the summarization assessment. The input variables extracted from the pipeline output were as follows: positive review ratio 0.844, average star rating 4.92 (normalized to 9.80 on the [0, 10] scale), seller score 9.1, and review count 849 (normalized to 10.0). These values were fuzzified using the triangular membership functions defined in Section 3.4.1, and the rule base was evaluated to determine the firing strength of each rule.
The majority of the high-confidence rules fired with near-maximum strength. Rule R1 IF positive ratio is HIGH AND average rating is HIGH fired with a strength of 0.94, contributing a score of 9.0 to the weighted average. Rule R7, the seller score modifier, fired with a strength of 0.78, contributing a score of 8.0. The review count reliability bonus was activated at full strength, given that the review count exceeded the 100-review threshold by a factor of eight.
The final defuzzified score was 7.7 out of 10, placing the product in the “Good” category. The system generated three natural language explanations attributing the score to the high positive review ratio, combined with the strong average rating, the high seller reliability score, and the large number of reviews contributing to evaluation confidence. These explanations were displayed alongside the numerical score in the user interface, providing end users with a transparent account of the factors driving the evaluation.
To further validate the module across diverse product categories and sentiment profiles, the Fuzzy Box was evaluated on six additional Trendyol products spanning electronics, fashion, cosmetics, books, and home goods. Table 8 summarizes the results.
The results demonstrate that the module produces scores consistent with the underlying sentiment distributions. The low-rated electronics product, which exhibited a positive review ratio of only 0.302 and an average star rating of 2.88, received a score of 4.52 and was categorized as “Conditional Recommendation”, the only product in the evaluation set to fall below the “Good” threshold. All remaining products received scores between 7.1 and 7.9, reflecting their higher positive ratios and star ratings. Notably, the books category received the highest seller score (9.8) despite having the fewest reviews (140), illustrating the role of the seller reliability modifier in the rule base. The fashion product, which had a lower positive ratio (0.646) than the other high-rated categories, correspondingly received the lowest fuzzy score among the “Good” products (7.14), demonstrating the module’s sensitivity to variation in sentiment distribution even within the same output category.
To assess the robustness of the Fuzzy Box against parameter uncertainty, a sensitivity analysis was conducted by systematically varying the membership function boundaries across five parameter configurations, ranging from narrow to wide boundary settings. The analysis was performed on six products spanning five categories. The mean maximum score deviation across all products and configurations was 0.49 points on a 10-point scale, and the interpretive category label remained unchanged for five out of six products across all parameter configurations. The single label change observed in the book category product occurred at a boundary value between the “Good” and “High Satisfaction” categories, which represent adjacent and similarly favorable evaluations. These results indicate that the Fuzzy Box produces stable and consistent outputs across a range of membership function parameterizations, suggesting that the evaluation framework is reasonably robust to variations in membership function settings and reducing concerns regarding potential designer bias associated with the manually specified rule base.

4.5. Error Analysis

To better understand the failure modes of the classification model, a systematic analysis of misclassified instances was conducted on the test set. Errors were grouped by their (true label, predicted label) pair, and representative examples were selected for qualitative examination. Table 9 presents a categorized summary of the most frequent error patterns, along with illustrative examples drawn directly from the test data.
Three dominant error patterns emerge from this analysis. First, ‘delivery-only reviews’ comments that mention only shipping speed without evaluating the product itself are systematically misclassified as positive, as fast delivery is a strong positive signal in the training data. Second, ‘ironic and implicit complaints’ lack the explicit negative lexical markers on which the model relies, causing them to be assigned neutral labels. Third, ‘concessive structures’ of the form “X is lacking but Y is good” introduce polarity ambiguity that the model resolves toward the less extreme class.
These patterns are consistent with known limitations of token-level sentiment models when applied to mixed-polarity or pragmatically complex utterances, and suggest that discourse-level features such as contrast connectives and sentence-level polarity shifts could improve classification accuracy in future work.

4.6. Statistical Significance

To verify that the performance improvement observed in Experiment 3 relative to Experiment 1 is statistically significant rather than attributable to chance, two complementary analyses were conducted.
First, a McNemar’s test was applied to the paired predictions of Experiment 1 and Experiment 3 on the shared test set. McNemar’s test is appropriate for comparing two classifiers evaluated on the same instances, as it accounts for the dependency structure of paired observations. The test yielded χ2 = 60.09 (p < 0.0001), indicating that the difference in classification accuracy between the two models is statistically significant at the 0.001 level.
Second, a bootstrap confidence interval for the macro F1 score of the final model was estimated using 10,000 resampling iterations with replacement. Table 10 reports the full results of both analyses.
The marginal difference between the validation macro F1 (0.9243) and the bootstrap point estimate on the test set (0.9245) is attributable to the stochastic nature of bootstrap resampling and confirms the stability of the reported performance. The 95% confidence interval of [0.9214, 0.9275] confirms that the reported macro F1 of 0.9245 is a stable and reliable estimate unlikely to vary substantially across different random draws from the test distribution. Taken together, these results confirm that the performance gains observed in Experiment 3, driven by the expansion of the training corpus to 183,333 samples and the achievement of a balanced class distribution, represent a genuine and statistically robust improvement over the baseline configuration.

4.7. Validation Against Human-Annotated Gold Standard

To provide a rigorous assessment of our model’s generalization capabilities, we conducted a zero-shot cross-dataset evaluation on an external human-annotated Turkish e-commerce dataset comprising 11,426 samples [40]. The comprehensive results of this evaluation are presented in Table 11.
As shown in Table 11, our fine-tuned BERTurk model achieved an overall accuracy of 73.33%. The performance metrics demonstrate that the model maintains high discriminative power for polar sentiments, achieving F1-scores of 0.82 for positive and 0.80 for negative classes. While the neutral class exhibits a lower F1-score of 0.47, this is consistent with the inherent subjectivity and the lack of consensus in human annotation guidelines for neutral content across diverse e-commerce datasets. Overall, these validation results support the robustness of the model primarily for polar sentiments, while confirming that neutral reviews remain inherently more difficult to classify. These results, supported by the detailed metrics in Table 11, substantiate the robustness of our LLM-assisted annotation strategy and confirm that the model has captured authentic semantic patterns rather than artifactual noise.

5. Discussion

The experimental results presented in this study demonstrate that the proposed pipeline achieves strong performance across all evaluated components, while also surfacing several findings that warrant broader reflection.

5.1. Impact of Data Volume and Class Balance on Classification Performance

The experimental findings indicate that dataset scale and class balance had a greater impact on classification performance than tokenizer vocabulary size. While the 128k BERTurk variant produced only marginal gains over the 32k model under identical training conditions, the inclusion of the Hepsiburada subset and the resulting balanced class distribution substantially improved the macro F1-score and especially the neutral class performance. These results suggest that carefully balanced and domain-diverse datasets remain more influential than moderate architectural modifications for Turkish e-commerce sentiment classification.
This finding has practical implications for future work in Turkish NLP. Researchers who face resource constraints may achieve better returns by investing in additional data collection and careful class balancing rather than exploring larger or more complex model architectures [41]. At the same time, the 128k vocabulary model was retained for the final experiment on the grounds that its morphological tokenization advantage may become more pronounced at larger dataset scales or on tasks requiring finer-grained semantic distinctions, such as aspect-level sentiment analysis.

5.2. Plan B Transition in the Summarization Module

The preliminary mT5-based summarization experiments highlighted the importance of domain alignment in Turkish NLP applications. Models trained predominantly on news corpora failed to generalize to informal e-commerce language characterized by short, noisy, and morphologically variable expressions. In contrast, the Turkish-native instruction-tuned Kumru-2B model generated substantially more coherent and contextually relevant summaries through structured prompting.
The adoption of Kumru-2B as a replacement resolved this issue by leveraging a model that had been pre-trained and instruction-tuned entirely on Turkish data, including informal and conversational text. The structured prompt engineering approach used in this study, where the model was explicitly instructed to produce a single coherent paragraph without lists or quotations, proved effective in constraining the output format and preventing the model from defaulting to enumeration-style responses, which were observed in preliminary experiments with less directive prompts. These observations underscore the importance of prompt design as a critical factor in the deployment of instruction-tuned language models for structured generation tasks.

5.3. Interpretability and the Role of Fuzzy Logic

The decision to implement a fuzzy logic-based scoring module rather than a simpler aggregation function, such as a weighted sum of the positive ratio and average rating, was motivated by the desire to produce scores that better reflect human judgment under uncertainty. A crisp weighted average would treat a positive ratio of 0.70 and a positive ratio of 0.80 as separated by a fixed gap, whereas the fuzzy approach allows these values to share membership in overlapping linguistic categories and interact non-linearly through the rule base.
In practice, the Fuzzy Box produced scores that aligned well with intuitive expectations. The test product, with a positive ratio of 0.844, an average rating of 4.92, and a seller score of 9.1, received a score of 7.7, categorized as “Good” rather than “High Satisfaction.” This outcome reflects the system’s sensitivity to the negative review content, which included substantive complaints about product performance and packaging quality, and which the fuzzy rules penalized relative to a hypothetical product with a higher positive ratio. The natural language explanations generated alongside the score further enhanced the interpretability of the output, enabling users to understand not just the final verdict but the reasoning behind it.

5.4. Limitations

Several limitations of the current study should be acknowledged. First, the sentiment labels used for training were generated by large language models rather than human annotators, introducing a degree of label noise that is difficult to quantify precisely. While the use of multiple models across platforms and the incorporation of star ratings as an additional signal helped mitigate this risk, a formal inter-annotator agreement study such as Cohen’s Kappa computed against a human-labeled gold standard was not conducted and represents an important direction for future validation.
Second, the summarization module was evaluated primarily through qualitative criteria, including fluency, relevance, and factual consistency, with representative outputs provided in Table 5 for transparency. However, the absence of quantitative metrics such as BERTScore or human evaluation scores represents a limitation of the current study. In particular, it is difficult to determine from qualitative assessment alone whether the generated summaries accurately represent the distribution of reviewer opinions, omit important information, or overrepresent minority complaints. Incorporating a systematic human evaluation study and automatic factual consistency metrics remains an important direction for future work.
Third, the current pipeline is limited to Trendyol product pages, as the scraper was developed specifically for that platform’s HTML structure. Extending the system to Amazon Turkey and Hepsiburada would require additional scraper development and validation, and may introduce platform-specific biases in the collected data.
Additionally, the current Fuzzy Box applies a fixed weighting scheme that may not reflect individual user preferences. A user-configurable weighting mechanism, in which consumers or sellers can adjust the relative importance of sentiment distribution, star ratings, and seller reliability according to their specific decision context, represents a natural and valuable extension of the current system.
Finally, the Fuzzy Box rule base was constructed manually based on domain intuition and the observed data distributions. A data-driven approach to rule induction such as fuzzy c-means clustering or genetic algorithm-based rule optimization could yield a more principled and potentially more accurate scoring function, and is left as a direction for future work.

6. Conclusions

This study presented an end-to-end pipeline for automated Turkish e-commerce review analysis, integrating a fine-tuned BERTurk sentiment classifier, a Kumru-2B-based summarization module, and a fuzzy logic evaluation engine into a unified, deployable system. The sentiment classifier achieved a macro F1 of 0.9243 on a balanced 183,333-sample corpus drawn from three platforms, surpassing the target threshold by 12.4 percentage points, while the summarization module demonstrated that prompt-engineered instruction-tuned models outperform news-trained abstractive systems for informal Turkish e-commerce text. The Fuzzy Box provided interpretable product scores through a 15-rule Mamdani inference system, and end-to-end processing of 849 reviews was completed in under 60 s. Future work will focus on fine-tuning Kumru-2B on domain-specific summarization data, extending the pipeline to Amazon Turkey and Hepsiburada, incorporating human annotation to quantify label noise, replacing the manually constructed fuzzy rule base with a data-driven induction approach, and conducting a formal user study to evaluate the perceived usefulness and trustworthiness of the system’s outputs.

Author Contributions

Conceptualization, E.Ö. and F.A.Ö.; data curation, E.Ö. and F.A.Ö.; investigation, E.Ö.; methodology, E.Ö. and F.A.Ö.; supervision, A.B.Ö. All authors declare equal and joint responsibility for the study. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by The Scientific and Technological Research Council of Türkiye (TÜBİTAK) under the 1002A Program through project no: 125E955.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All relevant data are included within the manuscript without restriction. The dataset supporting the findings of this study, consisting of a balanced collection of 183,333 Turkish e-commerce reviews, is publicly available on GitHub: https://github.com/doukansurel/E-Commerce-Turkish-Dataset, (Version 2.50.1, accessed on 20 May 2026) [32].

Acknowledgments

The authors would like to thank the open-source NLP community and the developers of the Turkish language models for providing the foundational tools used in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yalçınkaya, M.; Çataldaş, İ. Determinants of customer satisfaction and loyalty in the Turkish e-commerce sector. Front. Commun. 2025, 10, 1603554. [Google Scholar] [CrossRef]
  2. Çetin, S.; Zaimoğlu, E.A. Explainable Turkish E-Commerce Review Classification Using a Multi-Transformer Fusion Framework and SHAP Analysis. J. Theor. Appl. Electron. Commer. Res. 2026, 21, 59. [Google Scholar] [CrossRef]
  3. Medhat, W.; Hassan, A.; Korashy, H. Sentiment analysis algorithms and applications: A survey. Ain Shams Eng. J. 2014, 5, 1093–1113. [Google Scholar] [CrossRef]
  4. Kaya, M.; Fidan, G.; Toroslu, I.H. Sentiment analysis of Turkish political news. In Proceedings of the 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Macau, China, 4–7 December 2012; Volume 1, pp. 174–180. [Google Scholar]
  5. Dehkharghani, R.; Saygin, Y.; Yanikoglu, B.; Oflazer, K. SentiTurkNet: A Turkish polarity lexicon for sentiment analysis. Lang. Resour. Eval. 2016, 50, 667–685. [Google Scholar] [CrossRef]
  6. Mridha, M.F.; Lima, A.A.; Nur, K.; Das, S.C.; Hasan, M.; Kabir, M.M. A survey of automatic text summarization: Progress, process and challenges. IEEE Access 2021, 9, 156043–156070. [Google Scholar] [CrossRef]
  7. Al-Natour, S.; Turetken, O. A comparative assessment of sentiment analysis and star ratings for consumer reviews. Int. J. Inf. Manag. 2020, 54, 102132. [Google Scholar] [CrossRef]
  8. Demircan, M.; Seller, A.; Abut, F.; Akay, M.F. Developing Turkish sentiment analysis models using machine learning and e-commerce data. Int. J. Cogn. Comput. Eng. 2021, 2, 202–207. [Google Scholar] [CrossRef]
  9. Açıkalın, U.U.; Bardak, B.; Kutlu, M. Turkish sentiment analysis using bert. In Proceedings of the 2020 28th Signal Processing and Communications Applications Conference (SIU), Gaziantep, Turkey, 5–7 October 2020; pp. 1–4. [Google Scholar]
  10. Teke, B.; Zamir, G.; Budak, A.B.; Aksakallı, I.K. BERTurk-Based Sentiment Analysis on E-Commerce Multi Domain Product Reviews. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilim. Derg. 2025, 25, 497–509. [Google Scholar] [CrossRef]
  11. Öcal, A. BERT-Based Sentiment Analysis of Turkish e-Commerce Reviews: Star Ratings Versus Text. Sak. Univ. J. Comput. Inf. Sci. 2025, 8, 677–687. [Google Scholar]
  12. Mihalcea, R.; Tarau, P. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411. [Google Scholar]
  13. Erkan, G.; Radev, D.R. Lexrank: Graph-based lexical centrality as salience in text summarization. J. Artif. Intell. Res. 2004, 22, 457–479. [Google Scholar] [CrossRef]
  14. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  15. Aggarwal, S. Fuzzy based trust model to evaluate and analyse trust in B2C E-Commerce. In Proceedings of the 2014 IEEE International Advance Computing Conference (IACC), New Delhi, India, 21–22 February 2014; pp. 1300–1306. [Google Scholar]
  16. Şimşek, H.; Güvendiren, İ. Soft computing based e-commerce website service quality index measurement. Electron. Commer. Res. Appl. 2023, 61, 101303. [Google Scholar] [CrossRef]
  17. Golondrino, G.E.C.; Alarcón, M.A.O.; Martínez, L.M.S. Determination of the satisfaction attribute in usability tests using sentiment analysis and fuzzy logic. Int. J. Comput. Commun. Control 2023, 18, 1–14. [Google Scholar] [CrossRef]
  18. Koçak, S.; İç, Y.T.; Sert, M.; Atalay, K.D.; Dengiz, B. Development of a decision support system for selection of reviewers to evaluate research and development projects. Int. J. Inf. Technol. Decis. Mak. 2023, 22, 1991–2020. [Google Scholar] [CrossRef]
  19. Pinar, A. An integrated sentiment analysis and q-rung orthopair fuzzy MCDM model for supplier selection in E-commerce: A comprehensive approach. Electron. Commer. Res. 2025, 25, 1311–1342. [Google Scholar] [CrossRef]
  20. Turan, T.; Küçüksılle, E.U. Summarization, prediction, and analysis of Turkish constitutional court decisions with explainable artificial intelligence and a hybrid natural language processing method. IEEE Access 2025, 13, 59766–59779. [Google Scholar] [CrossRef]
  21. Alawi, A.B.; Bozkurt, F. A hybrid machine learning model for sentiment analysis and satisfaction assessment with Turkish universities using Twitter data. Decis. Anal. J. 2024, 11, 100473. [Google Scholar] [CrossRef]
  22. Savci, P.; Das, B. Prediction of customers’ interests using sentiment analysis in e-commerce data for comparison of Arabic, English, and Turkish. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 227–237. [Google Scholar] [CrossRef]
  23. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar] [CrossRef]
  24. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  25. Schweter, S. BERTurk–BERT Models for Turkish, Version 1.0.0. April 2020. Available online: https://zenodo.org/records/3770924 (accessed on 20 June 2025).
  26. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
  27. Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the International Conference on Machine Learning PMLR, Virtual, 13–18 July 2020; pp. 11328–11339. [Google Scholar]
  28. Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering for large language models. Patterns 2025, 6, 101260. [Google Scholar] [CrossRef]
  29. Pangakis, N.; Wolken, S.; Fasching, N. Automated annotation with generative ai requires validation. arXiv 2023, arXiv:2306.00176. [Google Scholar] [CrossRef]
  30. Gilardi, F.; Alizadeh, M.; Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc. Natl. Acad. Sci. USA 2023, 120, e2305016120. [Google Scholar] [CrossRef]
  31. Harari, N. LLMs outperform outsourced human coders on complex textual analysis. Sci. Rep. 2025, 15, 40122. [Google Scholar] [CrossRef] [PubMed]
  32. Sürel, A.D. Turkish E-Commerce Sentiment Analysis Dataset, GitHub. 2026. Available online: https://github.com/doukansurel/E-Commerce-Turkish-Dataset (accessed on 20 May 2026).
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  34. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8440–8451. [Google Scholar]
  35. Vngrs-ai. Kumru-2B: A Turkish Large Language Model, Hugging Face. 2024. Available online: https://huggingface.co/vngrs-ai/Kumru-2B (accessed on 20 May 2026).
  36. Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353. [Google Scholar] [CrossRef]
  37. Mamdani, E.H.; Assilian, S. An experiment in linguistic synthesis with a fuzzy logic controller. Int. J. Hum.-Comput. Stud. 1999, 51, 135–147. [Google Scholar] [CrossRef]
  38. Scharf, E.M.; Mandic, N.J. Industrial Applications of Fuzzy Control; Elsevier Science Inc.: Amsterdam, The Netherlands, 1985; pp. 41–62. [Google Scholar]
  39. Yager, R.R. On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Trans. Syst. Man Cybern. 1988, 18, 183–190. [Google Scholar] [CrossRef]
  40. Bilen, B. Duygu Analizi İçin Ürün Yorumları, Kaggle. 2020. Available online: https://www.kaggle.com/datasets/burhanbilenn/duygu-analizi-icin-urun-yorumlari (accessed on 20 May 2026).
  41. Öner, İ.; Özbay, E. Cross-Model Deepfake Text Detection with XLM-RoBERTa: A Strongly Generalizable Multi-LLM Training Strategy. Appl. Sci. 2026, 16, 5060. [Google Scholar] [CrossRef]
Figure 1. Overall system architecture of the proposed analysis pipeline.
Figure 1. Overall system architecture of the proposed analysis pipeline.
Applsci 16 05849 g001
Figure 2. Detailed data flow and integration between backend and frontend components.
Figure 2. Detailed data flow and integration between backend and frontend components.
Applsci 16 05849 g002
Figure 3. Confusion matrix of the fine-tuned BERTurk model (Experiment 3) on the held-out test set.
Figure 3. Confusion matrix of the fine-tuned BERTurk model (Experiment 3) on the held-out test set.
Applsci 16 05849 g003
Figure 4. Training and validation loss curves alongside macro F1 trajectories for Experiment 3.
Figure 4. Training and validation loss curves alongside macro F1 trajectories for Experiment 3.
Applsci 16 05849 g004
Table 1. Raw data collected per platform before preprocessing.
Table 1. Raw data collected per platform before preprocessing.
PlatformRaw CollectedAfter Filtering/SamplingFilter Applied
Trendyol582,60159,791Length-based sampling (≥11 words)
Amazon TR232,908134,179Language detection (Turkish only)
Hepsiburada2,657,073300,000Random sampling
Total3,472,582494,000
Table 2. Dataset distribution after LLM-based sentiment labeling and stratified downsampling.
Table 2. Dataset distribution after LLM-based sentiment labeling and stratified downsampling.
PlatformPositiveNeutralNegativeTotal
Trendyol51,9184751312259,791
Amazon TR98,92612,31722,936134,179
Hepsiburada126,68944,043129,268300,000
Merged (balanced)61,11161,11161,111183,333
Table 3. Validation set macro F1 scores across fine-tuning experiments. All values are computed on the validation split (15% of the dataset).
Table 3. Validation set macro F1 scores across fine-tuning experiments. All values are computed on the validation split (15% of the dataset).
ExperimentModelData SizeEpoch 1Epoch 2Epoch 3Epoch 4
Exp. 1bert-base-turkish-cased (32k)69,1840.84980.87470.88150.8834
Exp. 2bert-base-turkish-128k-cased69,1840.85460.87520.87760.8824
Exp. 3 (final)bert-base-turkish-128k-cased183,3330.90580.92050.92300.9243
Table 4. Per-class precision, recall, and F1-score on the held-out test set (Experiment 3, 15% split).
Table 4. Per-class precision, recall, and F1-score on the held-out test set (Experiment 3, 15% split).
ClassPrecisionRecallF1-Score
Negative0.920.920.92
Neutral0.900.900.90
Positive0.950.950.95
Macro avg0.920.920.9243
Table 5. Sample summarization outputs generated by Kumru-2B for a hair care product with 849 reviews.
Table 5. Sample summarization outputs generated by Kumru-2B for a hair care product with 849 reviews.
Sentiment ClassReview CountGenerated
Summary
Positive716“Kullanıcılar ürünün saçlara yumuşaklık ve parlaklık sağladığını, dökülmeyi azalttığını belirtiyor. Düzenli kullanımda saçların daha sağlıklı göründüğü vurgulanıyor.”
Neutral93“Ürün hızlı ulaştı ve ilk denemede olumlu sonuç verdi. Bazı kullanıcılar saçın çabuk sertleştiğini belirtirken düzenli kullanımla düzeleceğini düşünüyor.”
Negative40“Bazı müşteriler ürünün saçlarını kuruttuğunu ve beklentiyi karşılamadığını belirtiyor. Kargo hasarı ve kapak problemi de şikayetler arasında yer alıyor.”
Table 6. Representative subset of fuzzy inference rules.
Table 6. Representative subset of fuzzy inference rules.
RuleConditionOutput Score
R1IF positive ratio is HIGH AND avg rating is HIGH9.0
R2IF positive ratio is HIGH AND avg rating is MEDIUM7.5
R3IF positive ratio is MEDIUM AND avg rating is HIGH7.0
R4IF positive ratio is MEDIUM AND avg rating is MEDIUM5.5
R5IF positive ratio is LOW AND avg rating is LOW2.0
R6IF positive ratio is LOW AND avg rating is MEDIUM3.5
R7IF seller score is HIGH (modifier)+8.0 contribution
R8IF seller score is LOW (modifier)+3.0 contribution
R9IF review count ≥ 100+7.0 reliability bonus
Table 7. Fuzzy Box output categories.
Table 7. Fuzzy Box output categories.
Score RangeCategory
8.0–10.0High Satisfaction
6.0–8.0Good
4.0–6.0Conditional Recommendation
2.0–4.0Low Satisfaction
0.0–2.0Not Recommended
Table 8. Fuzzy Box evaluation results across six product categories.
Table 8. Fuzzy Box evaluation results across six product categories.
CategoryReviewsPos. RatioAvg. RatingSeller ScoreFuzzy ScoreLabel
Electronics
(high-rated)
10170.9364.879.07.93Good
Electronics
(low-rated)
2480.3022.888.84.52Conditional
Fashion8090.6464.268.87.14Good
Cosmetics10170.9064.849.47.94Good
Books1400.7864.619.87.84Good
Home & Living10200.9494.869.37.91Good
Table 9. Representative misclassification examples by error type.
Table 9. Representative misclassification examples by error type.
TruePredictedExample ReviewReason
NeutralPositive“Daha kullanmadım, kargo hızlı teşekkürler”No product opinion expressed; fast delivery framed positively
NeutralPositive“Ufacık bir kitap, bir saatte çok rahat biter”Descriptive statement without explicit sentiment; model infers positive
NegativeNeutral“İnsan bir ayraç hediye eder, kocaman ‘Prime’ yazılısından”Implicit complaint expressed through irony; no explicit negative marker
NegativeNeutral“135 Watt değil 900 Watt yazılmış… gravür uçları kalitesiz”Mixed content—factual correction alongside complaint dilutes negative signal
PositiveNeutral“Ürünü beğendim… kargo Malatya’ya 2 günde geldi, PTT hızlansa iyi olacak”Genuine positive opinion offset by a mild delivery complaint
Table 10. Statistical significance analysis results computed on the held-out test set (Experiment 3).
Table 10. Statistical significance analysis results computed on the held-out test set (Experiment 3).
TestStatisticValueInterpretation
McNemar’s testχ260.09Exp. 1 vs. Exp. 3 comparison
McNemar’s testp-value<0.0001Statistically significant (p < 0.001)
Bootstrap CI (10,000 iter.)Lower bound0.921495% confidence interval
Bootstrap CI (10,000 iter.)Upper bound0.927595% confidence interval
Bootstrap CI (10,000 iter.)Point estimate0.9245Macro F1 (Experiment 3)
Table 11. Zero-shot cross-dataset evaluation results of the fine-tuned BERTurk model on the human-annotated external dataset.
Table 11. Zero-shot cross-dataset evaluation results of the fine-tuned BERTurk model on the human-annotated external dataset.
Sentiment ClassPrecisionRecallF1-ScoreSupport (N)
Negative0.740.870.804237
Neutral0.540.410.472937
Positive0.830.820.824252
Overall Accuracy 73.33% (Total: 11,426)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Özbay, E.; Altunbey Özbay, F.; Özer, A.B. A Unified AI Framework for Turkish E-Commerce Review Analysis: Sentiment Classification, LLM-Based Summarization, and Fuzzy Evaluation. Appl. Sci. 2026, 16, 5849. https://doi.org/10.3390/app16125849

AMA Style

Özbay E, Altunbey Özbay F, Özer AB. A Unified AI Framework for Turkish E-Commerce Review Analysis: Sentiment Classification, LLM-Based Summarization, and Fuzzy Evaluation. Applied Sciences. 2026; 16(12):5849. https://doi.org/10.3390/app16125849

Chicago/Turabian Style

Özbay, Erdal, Feyza Altunbey Özbay, and Ahmet Bedri Özer. 2026. "A Unified AI Framework for Turkish E-Commerce Review Analysis: Sentiment Classification, LLM-Based Summarization, and Fuzzy Evaluation" Applied Sciences 16, no. 12: 5849. https://doi.org/10.3390/app16125849

APA Style

Özbay, E., Altunbey Özbay, F., & Özer, A. B. (2026). A Unified AI Framework for Turkish E-Commerce Review Analysis: Sentiment Classification, LLM-Based Summarization, and Fuzzy Evaluation. Applied Sciences, 16(12), 5849. https://doi.org/10.3390/app16125849

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.
Back to TopTop