1. Introduction
The rapid growth of e-commerce has fundamentally changed the way consumers make purchasing decisions. Rather than relying on expert recommendations or physical inspection, buyers increasingly depend on the accumulated opinions of previous customers that are publicly available, searchable, and, in principle, freely interpretable. In practice, however, the sheer volume of user-generated reviews on major platforms renders manual reading infeasible. A moderately popular product on a Turkish e-commerce platform such as Trendyol or Hepsiburada may accumulate thousands of reviews within months of its listing. For a prospective buyer, synthesizing this volume of feedback into an actionable purchasing decision is an overwhelming task. For a seller or platform operator seeking to understand product-level satisfaction trends, it is even more so [
1].
This review volume problem is not simply a matter of information overload. It reflects a deeper structural asymmetry: the mechanisms by which consumers generate feedback have scaled far faster than the mechanisms by which that feedback can be consumed and acted upon. Existing platform-level solutions aggregate star ratings, helpfulness votes, and sorting algorithms address the surface of this problem without resolving its underlying cause. A product with a 4.2-star average rating across 800 reviews conceals as much as it reveals: the distribution of sentiment across those reviews, the specific dimensions of the product that satisfied or disappointed buyers, and the reliability of the seller through whom the product was purchased are all invisible to a user who sees only the aggregate score [
2].
The first dimension of the problem concerns sentiment classification: automatically determining the polarity of individual reviews at scale. Sentiment analysis has emerged as a critical field in Natural Language Processing (NLP) to extract and quantify subjective information from the ever-growing volume of digital text [
3]. It offers a principled approach to this problem by enabling the automatic classification of review text into polarity categories, typically positive, neutral, or negative, without requiring human reading of individual comments. Transformer-based models, and BERT in particular, have substantially advanced the state of the art in sentiment classification across a wide range of languages and domains. For Turkish, a morphologically rich and agglutinative language that poses distinct challenges for natural language processing, early efforts focused on constructing lexical resources and rule-based models to handle its complex morphology [
4,
5]. More recently, the development of BERTurk, a BERT model pre-trained exclusively on Turkish corpora, has provided a strong foundation for domain-specific fine-tuning. However, sentiment classification alone does not solve the review volume problem: knowing that 84% of reviews for a given product are positive tells a user relatively little about why those reviews are positive, which product attributes they praise, or what the remaining 16% of reviewers found objectionable.
The second dimension concerns automatic summarization: condensing large review collections into human-readable outputs that explain why reviewers are satisfied or dissatisfied. Automatic summarization of review collections addresses this complementary need, but remains an open challenge in the Turkish NLP literature. Existing multilingual summarization models, trained predominantly on news corpora, fail to generalize to the informal, abbreviated, and domain-specific language of e-commerce reviews. The mismatch between pre-training distribution and deployment context is particularly acute for Turkish, where the morphological complexity of the language further complicates transfer from resource-rich settings [
6].
A third dimension of the problem concerns the aggregation of heterogeneous quality signals into a single interpretable evaluation. Users do not only care about sentiment; they also consider star ratings, seller reliability, and the statistical confidence that comes from a large review sample. Combining these signals through a simple weighted average obscures the non-linear and context-dependent interactions between them. Fuzzy logic provides a framework for modeling this ambiguity explicitly, representing intermediate values through graded membership functions and aggregating signals through human-interpretable IF-THEN rules [
7].
This paper addresses all three dimensions of the review volume problem through a unified end-to-end pipeline. The proposed system accepts a Trendyol product URL as input and produces three outputs: a three-class sentiment classification of every review on the product page, a natural language summary for each sentiment category generated by the Kumru-2B Turkish language model, and a fuzzy logic-based product score that integrates sentiment distribution, average star rating, seller reliability, and review count into a single interpretable evaluation. The pipeline is accessible through a React-based web interface and processes 850+ reviews in under 60 s.
The main contributions of this study can be summarized as follows:
- (i)
Proposing the first unified Turkish e-commerce intelligence framework integrating sentiment classification, sentiment-aware summarization, and explainable fuzzy evaluation;
- (ii)
Constructing and publicly releasing a large-scale, balanced Turkish e-commerce review dataset containing 183,333 manually validated LLM-annotated samples collected from three major platforms;
- (iii)
Demonstrating that dataset diversity and class balance exert a greater influence on Turkish sentiment classification performance than tokenizer vocabulary expansion;
- (iv)
Showing that instruction-tuned Turkish LLMs substantially outperform news-oriented multilingual summarization models in informal review summarization tasks;
- (v)
Introducing an interpretable fuzzy inference mechanism capable of transforming heterogeneous review signals into transparent product-level evaluations.
Although previous studies have investigated Turkish sentiment classification [
8,
9,
10,
11], review summarization [
12,
13,
14], or fuzzy evaluation [
15,
16,
17] independently, the literature still lacks an integrated decision-support framework capable of simultaneously performing sentiment prediction, sentiment-aware abstractive summarization, and interpretable multi-criteria product evaluation within a unified Turkish NLP pipeline. Existing studies generally focus on isolated classification benchmarks and rarely address the downstream usability of sentiment outputs for real consumer decision-making scenarios. Furthermore, the interaction between large language model-based summarization and fuzzy reasoning mechanisms remains largely unexplored in Turkish e-commerce analytics. The present study addresses this gap by proposing a fully integrated and deployable framework that combines transformer-based sentiment analysis, instruction-tuned Turkish LLM summarization, and explainable fuzzy inference into a single operational architecture [
18,
19,
20].
The remainder of this paper is organized as follows.
Section 2 reviews related work in Turkish sentiment analysis, automatic text summarization, LLM-based annotation, and fuzzy logic for e-commerce evaluation.
Section 3 describes the methodology, including data collection, model training, and system integration.
Section 4 presents experimental results.
Section 5 discusses the findings and their implications.
Section 6 concludes with directions for future work.
3. Methodology
This study proposes an end-to-end pipeline for Turkish e-commerce review analysis consisting of four main components: (1) multi-platform data collection and preprocessing, (2) sentiment classification using a fine-tuned BERTurk model, (3) sentiment-aware summarization using the Kumru-2B language model, and (4) a fuzzy logic-based product evaluation module.
Figure 1 illustrates the overall system architecture.
3.1. Data Collection and Preprocessing
To construct a comprehensive and domain-representative dataset, product reviews were collected from three major Turkish e-commerce platforms: Trendyol, Amazon Turkey, and Hepsiburada. A Selenium-based web scraper was developed to extract user reviews along with associated metadata, including star rating, seller information, and seller score. Initially, over 582,000 reviews were collected from Trendyol. Following a length-based filtering step in which reviews exceeding a predefined word count threshold were excluded to ensure consistency with the token length distribution of the other platforms, the Trendyol subset was reduced to 59,791 reviews. In total, 494,000 raw reviews were collected across all three platforms.
For sentiment labeling, two large language models were employed. Reviews from Trendyol were labeled using GPT-4o mini via the OpenRouter API, while Amazon Turkey and Hepsiburada reviews were labeled using Gemini 2.5 Flash. For Amazon Turkey reviews, both the review text and the star rating were provided to the model simultaneously to improve labeling accuracy. For Hepsiburada reviews, where star ratings were unavailable in a subset of the collected data, labeling was performed based solely on review content. Each review was assigned one of three sentiment categories: positive, neutral, or negative.
Following the labeling phase, a standard text preprocessing pipeline was applied. This included whitespace normalization, removal of non-Turkish characters, and spell correction using Gemini 2.5 Flash to address the high frequency of orthographic errors commonly found in informal Turkish user-generated content. Duplicate reviews were removed using exact text matching across platforms.
The labeled dataset exhibited a severe class imbalance, particularly in the Trendyol subset, where the positive-to-negative ratio reached 16.6:1. To address this, stratified downsampling was applied such that the final training corpus contained an equal number of samples per class. The resulting merged dataset comprised 183,333 reviews distributed equally across the three sentiment categories (61,111 per class), drawn from all three platforms.
Table 1 and
Table 2 summarize the data collection and filtering statistics, and the sentiment label distribution before and after stratified downsampling, respectively. The fully preprocessed and balanced dataset used in this study has been made publicly available to support future research [
32].
3.2. Sentiment Classification
Sentiment classification was performed using BERTurk, a BERT-based language model pre-trained exclusively on Turkish corpora. The underlying architecture follows the Transformer framework introduced by Vaswani et al. [
33]. Two variants of BERTurk were evaluated: the standard 32,000-token vocabulary model (bert-base-turkish-cased) and a larger 128,000-token vocabulary model (bert-base-turkish-128k-cased). The latter was developed to better handle the agglutinative morphological structure of Turkish, where a single word can encode information that would require several words in English. For instance, a word such as “kullanamıyorum” (“I cannot use it”) is tokenized into three subword units by the 32k model, whereas the 128k model represents it as a single token, preserving semantic integrity more effectively.
Both models were fine-tuned for three-class sequence classification (positive, neutral, negative) using the merged dataset of 183,333 reviews. The dataset was partitioned into training, validation, and test sets following a 70/15/15 stratified split, ensuring that the class distribution was preserved across all partitions. The fine-tuning process leverages the bidirectional representations learned during pre-training, similar to other cross-lingual architectures like XLM-R [
34].
Training was conducted on a Google Colab environment equipped with an NVIDIA A100 80 GB GPU. The AdamW optimizer was used with a learning rate of 2 × 10−5 and a weight decay of 0.01. A linear learning rate scheduler with a 10% warm-up ratio was applied over the total number of training steps. Given the perfectly balanced class distribution achieved through stratified downsampling, class weights were nonetheless incorporated into the cross-entropy loss function as an additional safeguard against residual imbalance effects.
Three fine-tuning experiments were conducted in total. The first two experiments used the 69,184-sample dataset comprising Trendyol and Amazon Turkey reviews, and compared the 32k and 128k vocabulary models under identical training conditions. The third experiment extended the dataset to include Hepsiburada reviews, yielding the final 183,333-sample corpus, and used the 128k vocabulary model based on its stronger morphological tokenization properties.
Table 3 reports the validation macro F1 scores across all experiments and epochs.
The results reveal a clear pattern: while the 128k vocabulary model showed a marginal advantage in the early epochs of Experiments 1 and 2, the two models converged to nearly identical performance by Epoch 4, with a difference of only 0.001 in macro F1. This suggests that, for the given dataset size, the additional expressiveness of the larger vocabulary did not translate into a significant performance gain on its own. The decisive factor was the inclusion of Hepsiburada data in Experiment 3. With the dataset size increasing from 69,184 to 183,333 samples and the class balance shifting from a skewed distribution to a perfectly uniform 1:1:1 ratio, the model achieved a validation macro F1 of 0.9243 from the very first epoch onward, already surpassing the final performance of the previous experiments.
The final model was evaluated on the held-out test set, yielding the per-class results reported in
Table 4. Notably, the neutral class, historically the most challenging category in sentiment analysis tasks due to its ambiguous boundary with both positive and negative classes, achieved an F1 score of 0.90, a substantial improvement over the 0.80 observed in Experiment 1. This improvement is largely attributable to the inclusion of the Hepsiburada subset, which contained a substantially higher proportion of neutral and negative samples compared to the other platforms.
3.3. Sentiment-Aware Summarization
Following sentiment classification, reviews were grouped by their predicted sentiment label and passed through a summarization module designed to produce a concise, human-readable summary for each sentiment category. The objective was to distill the collective opinion of hundreds of reviewers into three distinct paragraphs, one for positive, one for neutral, and one for negative reviews, thereby allowing end users to grasp the overall reception of a product without reading individual comments.
Initially, a transformer-based extractive–abstractive summarization approach inspired by the FusionSum framework was considered for this module. The plan involved fine-tuning a multilingual T5 (mT5) model on Turkish e-commerce review data to perform sentence-level fusion and abstractive generation. Two candidate models were evaluated: mtufan/mt5-small-turkish-summarization, which had been fine-tuned on Turkish news corpora, and csebuetnlp/mT5_multilingual_XLSum, a multilingual variant trained on summarization benchmarks across several languages. Both models were tested on a sample of 30 review groups drawn from the merged dataset, using pseudo-references constructed from the first three sentences of each group. The average ROUGE-L score across both models was 0.0532, well below the target threshold of 0.35. Qualitative inspection of the generated summaries confirmed that both models produced outputs that were semantically unrelated to the input reviews, a consequence of the domain mismatch between Turkish news text, on which these models were trained, and informal Turkish e-commerce language. On this basis, the mT5-based approach was abandoned in accordance with the contingency plan defined in the project proposal.
As an alternative, Kumru-2B was adopted for the summarization module. Kumru-2B is a 2-billion-parameter causal language model developed from scratch for the Turkish language by VNGRS [
35]. It was pre-trained on a cleaned and deduplicated corpus of 500 GB of Turkish text comprising approximately 300 billion tokens, and subsequently fine-tuned on one million instruction-following examples. Unlike general-purpose multilingual models, Kumru-2B incorporates a native Turkish tokenizer based on byte-pair encoding with a vocabulary of 50,176 tokens, which processes Turkish morphological structures significantly more efficiently. Benchmark evaluations on the Cetvel suite, a Turkish-language evaluation framework covering 26 distinct NLP tasks, including summarization, grammatical error correction, and question answering, have shown that Kumru-2B outperforms considerably larger multilingual models such as LLaMA-3.3-70B, Gemma-3-27B, and Qwen-2-72B on Turkish-specific tasks, despite its comparatively compact size.
Summarization was implemented through structured prompt engineering. For each sentiment class, a task-specific system prompt was defined, instructing the model to produce a single coherent paragraph in Turkish, without bullet points, numbered lists, or quotation marks. The prompt explicitly identified the sentiment category and directed the model to highlight the dominant themes within that group, such as product quality and delivery speed for positive reviews, or recurring complaints and disappointment for negative ones. Up to 20 reviews per sentiment class were concatenated and passed to the model as context. Generation was performed with a temperature of 0.3, a repetition penalty of 1.2, and a maximum of 150 new tokens, configurations chosen to balance output diversity with factual consistency.
The system was tested on a real Trendyol product page containing 849 reviews. The model successfully produced fluent, domain-appropriate summaries for all three sentiment classes within the same inference session. Representative outputs are provided in
Table 5.
3.4. Fuzzy Logic-Based Product Evaluation
While sentiment distribution and sentiment-specific summaries provide actionable information for users who are willing to engage with detailed review content, many consumers and sellers require a single interpretable score that consolidates multiple heterogeneous signals into an at-a-glance evaluation. The Fuzzy Box addresses this complementary need by integrating sentiment distribution, star ratings, seller reliability, and review volume into a unified product-level score, enabling rapid decision support without requiring the user to manually weigh each signal.
The final component of the proposed pipeline is a fuzzy logic-based scoring module, referred to as the Fuzzy Box, which aggregates multiple product-level signals into a single interpretable score ranging from 0 to 10. The evaluation engine is grounded in the foundational principles of fuzzy set theory [
36], utilizing a Mamdani-style inference mechanism [
37] originally developed for industrial control applications [
38]. The motivation behind adopting a fuzzy logic framework as opposed to a simple weighted average lies in the inherently imprecise and overlapping nature of the input variables. For example, a positive review ratio of 0.72 cannot be cleanly categorized as either “high” or “low”; it occupies an intermediate region that a crisp threshold would misrepresent. Fuzzy logic handles such ambiguity by assigning partial degrees of membership to linguistic categories, enabling more nuanced and human-aligned evaluations.
3.4.1. Input Variables and Fuzzification
Four input variables were defined for the Fuzzy Box module. The first is the positive review ratio, computed as the proportion of reviews classified as positive by the BERT model, normalized to the range [0, 10]. The second is the average star rating assigned by users to the product, rescaled from the original [1, 5] scale to [0, 10] using the transformation (r − 1)/4 × 10. The third is the seller score, a platform-provided metric reflecting seller reliability, clipped to the [0, 10] range. The fourth is the review count, which serves as a proxy for evaluation reliability and is normalized by dividing by 100 and clipping at 10.
Although the positive review ratio and the average star rating are related signals, they are not redundant. The neutral sentiment class, in particular, exhibits a relatively high mean star rating, indicating that users frequently assign favorable star ratings while expressing mixed or ambivalent opinions in their review text. This divergence suggests that text-based sentiment classification captures evaluative nuances that aggregate star ratings alone cannot represent. Furthermore, star ratings were used during the LLM-based annotation phase solely as a contextual signal to improve labeling accuracy on Amazon Turkey reviews, not as a direct label assignment mechanism. Their role as an input variable in the Fuzzy Box is therefore independent of the annotation pipeline and does not constitute double-counting of the same underlying evaluation signal.
Each input variable was fuzzified using triangular membership functions, which assign a degree of membership between 0 and 1 to each of three linguistic categories: low, medium, and high. For a given input value x and a triangular function defined by parameters (a, b, c), the membership degree μ(x) equals (x − a)/(b − a) if a < x ≤ b, and (c − x)/(c − b) if b < x < c.
The parameters for each variable were set empirically based on the observed distributions in the collected dataset. For the positive ratio and average rating, the triangular boundaries were set at (3, 6, 9), reflecting the expectation that scores below 3 indicate poor sentiment, scores around 6 represent moderate satisfaction, and scores above 9 indicate strong positive reception. For the seller score, boundaries were set at (3, 6, 8) to account for the compressed upper range typically observed in platform seller ratings.
3.4.2. Rule Base
A rule base consisting of 15 fuzzy IF-THEN rules was constructed to model the relationships between the input variables and the output score. The rules were designed to capture both primary interactions, such as the combined effect of a high positive ratio and high average rating, and secondary modifying effects, such as seller reliability and review volume.
Table 6 presents a representative subset of the rule base.
Each rule fires with a strength equal to the minimum membership degree of its antecedents, following the Mamdani inference approach. Rules involving seller score and review count are treated as weighted modifiers with reduced firing strengths (0.3–0.5) to prevent these secondary variables from dominating the output.
3.4.3. Defuzzification and Output
The output score was computed through weighted average defuzzification, in which each rule’s contribution is the product of its firing strength and its associated output value. The aggregation of multiple conflicting criteria (e.g., positive ratio vs. seller score) follows the logic of ordered weighted averaging operators to ensure a balanced final score [
39]. The final score
S is given by
S = Σ (
wi ×
si)/Σ
wi, where
wi denotes the firing strength of the
ith rule and
si denotes the corresponding output value. The resulting score is clipped to the interval [0, 10] and mapped to one of five interpretive categories as shown in
Table 7.
In addition to the numerical score, the module generates a set of natural language explanations derived from the rules that fired with non-zero strength. These explanations identify the dominant factors contributing to the final score, for instance, noting that a high seller score reinforced the overall evaluation, or that a low positive ratio constrained it. This explainability component is consistent with the broader objective of producing outputs that are not only accurate but also interpretable by non-expert users.
The Fuzzy Box was evaluated on the same product used for the summarization test. With a positive review ratio of 0.844, an average rating of 4.92, and a seller score of 9.1 across 849 reviews, the module produced a score of 7.7, corresponding to the “Good” category. The explanations generated by the system attributed the score primarily to the high positive review ratio combined with the high star rating, and further reinforced by the seller’s strong reliability score.
3.5. System Integration
The four modules described in the preceding sections, data collection, sentiment classification, summarization, and fuzzy evaluation, were integrated into a unified end-to-end pipeline accessible through a web-based interface. The pipeline accepts a Trendyol product URL as input and returns a structured JSON response containing the sentiment distribution, per-class summaries, and the fuzzy evaluation score.
The backend was implemented using FastAPI, a Python-based asynchronous web framework, and deployed on Google Colab with an NVIDIA A100 GPU. External access to the API was enabled through ngrok, a reverse tunneling service that exposes the locally running server to the public internet via a secure HTTPS endpoint. The frontend was developed as a single-page React application styled with Tailwind CSS, providing an interactive dashboard that visualizes the sentiment distribution as a horizontal bar chart, displays the three Kumru-2 B-generated summaries as categorized cards, and renders the Fuzzy Box score alongside its interpretive label and contributing factors.
The complete pipeline was tested on a real product page from Trendyol containing 849 reviews. End-to-end processing from URL submission to final JSON output was completed in under 60 s, satisfying the system performance criterion of processing 1000 reviews in under 10 min.
Figure 2 illustrates the system architecture and the data flow between components.
4. Experiments and Results
4.1. Experimental Setup
All experiments were conducted on Google Colab Pro using an NVIDIA A100 SXM4 80 GB GPU. The sentiment classification model was implemented using the Hugging Face Transformers library (version 5.0.0) with PyTorch (version 2.11.0) as the backend, developed under Python (version 3.12.13). Fine-tuning hyperparameters were kept constant across all three experiments to ensure a fair comparison: learning rate 2 × 10−5, weight decay 0.01, batch size 64, maximum sequence length 128 tokens, and 4 training epochs. The AdamW optimizer was used in conjunction with a linear learning rate scheduler incorporating a 10% warm-up period. All experiments used a fixed random seed of 42 to ensure reproducibility.
For the summarization module, Kumru-2B was loaded in half-precision (torch.float16) with automatic device mapping to maximize GPU utilization. Generation hyperparameters were set as follows: temperature 0.3, repetition penalty 1.2, and a maximum of 150 new tokens per summary. The fuzzy inference engine was implemented from scratch in Python without external fuzzy logic libraries, using triangular membership functions and weighted average defuzzification as described in
Section 3.4.
4.2. Sentiment Classification Results
The sentiment classification experiments yielded consistent improvements across the three experimental conditions. As reported in
Table 3, the baseline experiment using 69,184 samples and the 32k vocabulary model achieved a best validation macro F1 of 0.8834. Replacing the backbone with the 128k vocabulary variant under identical conditions produced a marginally lower score of 0.8824, indicating that the larger vocabulary did not confer a meaningful advantage at this dataset scale. The performance difference between the two vocabulary configurations remained marginal across all epochs, suggesting that vocabulary expansion alone did not substantially affect classification performance under the current dataset scale.
The most substantial performance gain was observed in Experiment 3, where the addition of 114,149 Hepsiburada reviews to the training corpus, coupled with the resulting shift to a perfectly balanced 1:1:1 class distribution, produced a validation macro F1 of 0.9058 at the end of the first epoch alone, already exceeding the best results of the previous experiments. The model continued to improve through subsequent epochs, reaching a final validation macro F1 of 0.9243 at Epoch 4. This result represents a 4.6 percentage point improvement over Experiment 1 and surpasses the project success criterion of F1 ≥ 0.80 by a margin of 12.4 percentage points. The confusion matrix of the fine-tuned BERTurk model using Experiment 3 is shown in
Figure 3.
Test set evaluation of the Experiment 3 model revealed strong performance across all three classes. The positive class achieved an F1 score of 0.95, the negative class 0.92, and the neutral class 0.90. The neutral class, which is typically the most difficult to classify due to its semantic proximity to both positive and negative categories, showed a 10 percentage point improvement compared to Experiment 1 (from 0.80 to 0.90). This improvement is attributed to the Hepsiburada subset, which contributed 44,043 neutral samples, nearly three times the combined neutral count of the Trendyol and Amazon Turkey subsets. Overall test accuracy was 92%, confirming the generalizability of the model to unseen data.
Figure 4 shows the training and validation loss curves and macro F1 trajectories across epochs for Experiment 3. The validation loss reached its minimum at Epoch 2 (0.2083) before gradually increasing in subsequent epochs, while the validation F1 continued to rise through Epoch 4. This divergence between loss and F1 is consistent with a phenomenon known as soft overfitting, in which the model becomes increasingly uncertain about borderline cases reflected in rising loss while still improving its hard classification decisions. As the best model checkpoint was saved based on validation F1 rather than loss, this behavior did not negatively impact final performance.
4.3. Summarization Results
The summarization module was evaluated qualitatively on a representative product from Trendyol, a hair care oil with 849 reviews. Following sentiment classification, the reviews were grouped into 716 positive, 93 neutral, and 40 negative instances. Kumru-2B generated a separate summary for each group within the same inference session.
The generated summaries were assessed along three dimensions: fluency, relevance, and factual consistency. In terms of fluency, all three summaries produced grammatically correct and stylistically natural Turkish prose, with no instances of repetition, hallucination of product names, or code-switching to English. In terms of relevance, the summaries accurately reflected the dominant themes present in each group, for example, the positive summary correctly identified softness, shine, and reduced hair loss as the primary benefits cited by reviewers, while the negative summary captured recurring complaints about dryness and packaging damage. Factual consistency was maintained throughout, with no summary introducing claims that were absent from the input reviews.
For reference, the baseline mT5 models evaluated during the model selection phase produced an average ROUGE-L score of 0.0532 on pseudo-reference summaries constructed from review subsets, confirming the inadequacy of those models for this domain. Although a direct ROUGE-L evaluation of the Kumru-2B outputs was not performed due to the absence of human-written reference summaries, the qualitative results suggest that the Kumru-2B model generates more coherent and domain-appropriate summaries than those obtained with the tested mT5 baselines.
4.4. Fuzzy Box Results
The Fuzzy Box module was evaluated on the same product used for the summarization assessment. The input variables extracted from the pipeline output were as follows: positive review ratio 0.844, average star rating 4.92 (normalized to 9.80 on the [0, 10] scale), seller score 9.1, and review count 849 (normalized to 10.0). These values were fuzzified using the triangular membership functions defined in
Section 3.4.1, and the rule base was evaluated to determine the firing strength of each rule.
The majority of the high-confidence rules fired with near-maximum strength. Rule R1 IF positive ratio is HIGH AND average rating is HIGH fired with a strength of 0.94, contributing a score of 9.0 to the weighted average. Rule R7, the seller score modifier, fired with a strength of 0.78, contributing a score of 8.0. The review count reliability bonus was activated at full strength, given that the review count exceeded the 100-review threshold by a factor of eight.
The final defuzzified score was 7.7 out of 10, placing the product in the “Good” category. The system generated three natural language explanations attributing the score to the high positive review ratio, combined with the strong average rating, the high seller reliability score, and the large number of reviews contributing to evaluation confidence. These explanations were displayed alongside the numerical score in the user interface, providing end users with a transparent account of the factors driving the evaluation.
To further validate the module across diverse product categories and sentiment profiles, the Fuzzy Box was evaluated on six additional Trendyol products spanning electronics, fashion, cosmetics, books, and home goods.
Table 8 summarizes the results.
The results demonstrate that the module produces scores consistent with the underlying sentiment distributions. The low-rated electronics product, which exhibited a positive review ratio of only 0.302 and an average star rating of 2.88, received a score of 4.52 and was categorized as “Conditional Recommendation”, the only product in the evaluation set to fall below the “Good” threshold. All remaining products received scores between 7.1 and 7.9, reflecting their higher positive ratios and star ratings. Notably, the books category received the highest seller score (9.8) despite having the fewest reviews (140), illustrating the role of the seller reliability modifier in the rule base. The fashion product, which had a lower positive ratio (0.646) than the other high-rated categories, correspondingly received the lowest fuzzy score among the “Good” products (7.14), demonstrating the module’s sensitivity to variation in sentiment distribution even within the same output category.
To assess the robustness of the Fuzzy Box against parameter uncertainty, a sensitivity analysis was conducted by systematically varying the membership function boundaries across five parameter configurations, ranging from narrow to wide boundary settings. The analysis was performed on six products spanning five categories. The mean maximum score deviation across all products and configurations was 0.49 points on a 10-point scale, and the interpretive category label remained unchanged for five out of six products across all parameter configurations. The single label change observed in the book category product occurred at a boundary value between the “Good” and “High Satisfaction” categories, which represent adjacent and similarly favorable evaluations. These results indicate that the Fuzzy Box produces stable and consistent outputs across a range of membership function parameterizations, suggesting that the evaluation framework is reasonably robust to variations in membership function settings and reducing concerns regarding potential designer bias associated with the manually specified rule base.
4.5. Error Analysis
To better understand the failure modes of the classification model, a systematic analysis of misclassified instances was conducted on the test set. Errors were grouped by their (true label, predicted label) pair, and representative examples were selected for qualitative examination.
Table 9 presents a categorized summary of the most frequent error patterns, along with illustrative examples drawn directly from the test data.
Three dominant error patterns emerge from this analysis. First, ‘delivery-only reviews’ comments that mention only shipping speed without evaluating the product itself are systematically misclassified as positive, as fast delivery is a strong positive signal in the training data. Second, ‘ironic and implicit complaints’ lack the explicit negative lexical markers on which the model relies, causing them to be assigned neutral labels. Third, ‘concessive structures’ of the form “X is lacking but Y is good” introduce polarity ambiguity that the model resolves toward the less extreme class.
These patterns are consistent with known limitations of token-level sentiment models when applied to mixed-polarity or pragmatically complex utterances, and suggest that discourse-level features such as contrast connectives and sentence-level polarity shifts could improve classification accuracy in future work.
4.6. Statistical Significance
To verify that the performance improvement observed in Experiment 3 relative to Experiment 1 is statistically significant rather than attributable to chance, two complementary analyses were conducted.
First, a McNemar’s test was applied to the paired predictions of Experiment 1 and Experiment 3 on the shared test set. McNemar’s test is appropriate for comparing two classifiers evaluated on the same instances, as it accounts for the dependency structure of paired observations. The test yielded χ2 = 60.09 (p < 0.0001), indicating that the difference in classification accuracy between the two models is statistically significant at the 0.001 level.
Second, a bootstrap confidence interval for the macro F1 score of the final model was estimated using 10,000 resampling iterations with replacement.
Table 10 reports the full results of both analyses.
The marginal difference between the validation macro F1 (0.9243) and the bootstrap point estimate on the test set (0.9245) is attributable to the stochastic nature of bootstrap resampling and confirms the stability of the reported performance. The 95% confidence interval of [0.9214, 0.9275] confirms that the reported macro F1 of 0.9245 is a stable and reliable estimate unlikely to vary substantially across different random draws from the test distribution. Taken together, these results confirm that the performance gains observed in Experiment 3, driven by the expansion of the training corpus to 183,333 samples and the achievement of a balanced class distribution, represent a genuine and statistically robust improvement over the baseline configuration.
4.7. Validation Against Human-Annotated Gold Standard
To provide a rigorous assessment of our model’s generalization capabilities, we conducted a zero-shot cross-dataset evaluation on an external human-annotated Turkish e-commerce dataset comprising 11,426 samples [
40]. The comprehensive results of this evaluation are presented in
Table 11.
As shown in
Table 11, our fine-tuned BERTurk model achieved an overall accuracy of 73.33%. The performance metrics demonstrate that the model maintains high discriminative power for polar sentiments, achieving F1-scores of 0.82 for positive and 0.80 for negative classes. While the neutral class exhibits a lower F1-score of 0.47, this is consistent with the inherent subjectivity and the lack of consensus in human annotation guidelines for neutral content across diverse e-commerce datasets. Overall, these validation results support the robustness of the model primarily for polar sentiments, while confirming that neutral reviews remain inherently more difficult to classify. These results, supported by the detailed metrics in
Table 11, substantiate the robustness of our LLM-assisted annotation strategy and confirm that the model has captured authentic semantic patterns rather than artifactual noise.
5. Discussion
The experimental results presented in this study demonstrate that the proposed pipeline achieves strong performance across all evaluated components, while also surfacing several findings that warrant broader reflection.
5.1. Impact of Data Volume and Class Balance on Classification Performance
The experimental findings indicate that dataset scale and class balance had a greater impact on classification performance than tokenizer vocabulary size. While the 128k BERTurk variant produced only marginal gains over the 32k model under identical training conditions, the inclusion of the Hepsiburada subset and the resulting balanced class distribution substantially improved the macro F1-score and especially the neutral class performance. These results suggest that carefully balanced and domain-diverse datasets remain more influential than moderate architectural modifications for Turkish e-commerce sentiment classification.
This finding has practical implications for future work in Turkish NLP. Researchers who face resource constraints may achieve better returns by investing in additional data collection and careful class balancing rather than exploring larger or more complex model architectures [
41]. At the same time, the 128k vocabulary model was retained for the final experiment on the grounds that its morphological tokenization advantage may become more pronounced at larger dataset scales or on tasks requiring finer-grained semantic distinctions, such as aspect-level sentiment analysis.
5.2. Plan B Transition in the Summarization Module
The preliminary mT5-based summarization experiments highlighted the importance of domain alignment in Turkish NLP applications. Models trained predominantly on news corpora failed to generalize to informal e-commerce language characterized by short, noisy, and morphologically variable expressions. In contrast, the Turkish-native instruction-tuned Kumru-2B model generated substantially more coherent and contextually relevant summaries through structured prompting.
The adoption of Kumru-2B as a replacement resolved this issue by leveraging a model that had been pre-trained and instruction-tuned entirely on Turkish data, including informal and conversational text. The structured prompt engineering approach used in this study, where the model was explicitly instructed to produce a single coherent paragraph without lists or quotations, proved effective in constraining the output format and preventing the model from defaulting to enumeration-style responses, which were observed in preliminary experiments with less directive prompts. These observations underscore the importance of prompt design as a critical factor in the deployment of instruction-tuned language models for structured generation tasks.
5.3. Interpretability and the Role of Fuzzy Logic
The decision to implement a fuzzy logic-based scoring module rather than a simpler aggregation function, such as a weighted sum of the positive ratio and average rating, was motivated by the desire to produce scores that better reflect human judgment under uncertainty. A crisp weighted average would treat a positive ratio of 0.70 and a positive ratio of 0.80 as separated by a fixed gap, whereas the fuzzy approach allows these values to share membership in overlapping linguistic categories and interact non-linearly through the rule base.
In practice, the Fuzzy Box produced scores that aligned well with intuitive expectations. The test product, with a positive ratio of 0.844, an average rating of 4.92, and a seller score of 9.1, received a score of 7.7, categorized as “Good” rather than “High Satisfaction.” This outcome reflects the system’s sensitivity to the negative review content, which included substantive complaints about product performance and packaging quality, and which the fuzzy rules penalized relative to a hypothetical product with a higher positive ratio. The natural language explanations generated alongside the score further enhanced the interpretability of the output, enabling users to understand not just the final verdict but the reasoning behind it.
5.4. Limitations
Several limitations of the current study should be acknowledged. First, the sentiment labels used for training were generated by large language models rather than human annotators, introducing a degree of label noise that is difficult to quantify precisely. While the use of multiple models across platforms and the incorporation of star ratings as an additional signal helped mitigate this risk, a formal inter-annotator agreement study such as Cohen’s Kappa computed against a human-labeled gold standard was not conducted and represents an important direction for future validation.
Second, the summarization module was evaluated primarily through qualitative criteria, including fluency, relevance, and factual consistency, with representative outputs provided in
Table 5 for transparency. However, the absence of quantitative metrics such as BERTScore or human evaluation scores represents a limitation of the current study. In particular, it is difficult to determine from qualitative assessment alone whether the generated summaries accurately represent the distribution of reviewer opinions, omit important information, or overrepresent minority complaints. Incorporating a systematic human evaluation study and automatic factual consistency metrics remains an important direction for future work.
Third, the current pipeline is limited to Trendyol product pages, as the scraper was developed specifically for that platform’s HTML structure. Extending the system to Amazon Turkey and Hepsiburada would require additional scraper development and validation, and may introduce platform-specific biases in the collected data.
Additionally, the current Fuzzy Box applies a fixed weighting scheme that may not reflect individual user preferences. A user-configurable weighting mechanism, in which consumers or sellers can adjust the relative importance of sentiment distribution, star ratings, and seller reliability according to their specific decision context, represents a natural and valuable extension of the current system.
Finally, the Fuzzy Box rule base was constructed manually based on domain intuition and the observed data distributions. A data-driven approach to rule induction such as fuzzy c-means clustering or genetic algorithm-based rule optimization could yield a more principled and potentially more accurate scoring function, and is left as a direction for future work.
6. Conclusions
This study presented an end-to-end pipeline for automated Turkish e-commerce review analysis, integrating a fine-tuned BERTurk sentiment classifier, a Kumru-2B-based summarization module, and a fuzzy logic evaluation engine into a unified, deployable system. The sentiment classifier achieved a macro F1 of 0.9243 on a balanced 183,333-sample corpus drawn from three platforms, surpassing the target threshold by 12.4 percentage points, while the summarization module demonstrated that prompt-engineered instruction-tuned models outperform news-trained abstractive systems for informal Turkish e-commerce text. The Fuzzy Box provided interpretable product scores through a 15-rule Mamdani inference system, and end-to-end processing of 849 reviews was completed in under 60 s. Future work will focus on fine-tuning Kumru-2B on domain-specific summarization data, extending the pipeline to Amazon Turkey and Hepsiburada, incorporating human annotation to quantify label noise, replacing the manually constructed fuzzy rule base with a data-driven induction approach, and conducting a formal user study to evaluate the perceived usefulness and trustworthiness of the system’s outputs.