1. Introduction
With the acceleration of digitalization, e-commerce has become one of the dominant forms of modern trade, significantly transforming consumer purchasing behavior. The widespread availability of the internet, increased mobile device usage, and advances in secure online payment systems have fundamentally reshaped traditional shopping habits. Today, consumers rely heavily on user-generated product reviews; however, the abundance of comments makes it difficult to locate informative and reliable content. Superficial expressions, such as “good,” “nice,” or “bad,” fail to address essential questions regarding product quality, usability, and limitations, rendering them useless from the consumer’s perspective. Therefore, identifying and highlighting useful reviews that provide meaningful insights has become a critical requirement for improving user experience. Addressing this need, the present study aims to classify Turkish e-commerce reviews as useful or useless.
Recent advancements in natural language processing (NLP) have enabled machine learning (ML) techniques to be effectively utilized in this field. Traditional algorithms, such as Support Vector Machines (SVM) and Logistic Regression (LR), have been widely applied in text classification tasks [1,2]. SVM, based on statistical learning theory, offers strong generalization capability in high-dimensional feature spaces and mitigates overfitting through margin optimization [1,3]. Methods for multi-class classification using SVM have been proposed in earlier studies [4], while other work has demonstrated high performance in detecting aggressive language in Turkish social media comments [2]. Further research has reported an accuracy of over 93% for cosmetic product sentiment classification using SVM [5], and additional studies have highlighted its strength in multi-label e-commerce review classification [6]. Similarly, LR is frequently preferred for its simplicity, interpretability, and robustness. The success of LR in classifying student arguments has been highlighted in the literature [7], and improved LR efficiency has been achieved through the Logistic Regression Matching Pursuit (LRMP) algorithm [8].
A paradigm shift in NLP has occurred with the introduction of transformer-based models, particularly Google's Bidirectional Encoder Representations from Transformers (BERT), which has outperformed conventional methods across numerous tasks [9]. Strong results have been achieved using BERT–SVM hybrid architectures for short-text semantic matching [10], while multimodal transformers have proven effective across different data modalities [11]. The Robustly Optimized BERT Approach (RoBERTa), proposed to improve pre-training optimization, has shown further performance gains [12]. For morphologically rich and agglutinative languages such as Turkish, language-specific versions of BERT and RoBERTa have demonstrated superior results [13]. Studies have shown that BERT- and RoBERTa-based models outperform multilingual alternatives on Turkish datasets, underscoring the importance of language-specific models such as those distributed through Hugging Face [14]. Research targeting e-commerce review analysis further reinforces these findings: LSTM-based sentiment classification for Turkish reviews has been conducted in previous work [15], while transformer-based architectures have achieved higher accuracy on Turkish news texts [16].
Studies focused specifically on Turkish e-commerce reviews provide valuable insights into the effectiveness of both traditional and transformer-based approaches. Nine BERT variants were tested on 10,000 Turkish reviews, with ConvBERTurk achieving a 97.04% F1-score [17]. Other work obtained 96% accuracy using BERTurk-based sentiment analysis on 73,392 multi-domain reviews, while also showing that SVM remains competitive among traditional classifiers [18]. The effectiveness of fusion architectures has additionally been noted in the literature: a BERT–RoBERTa fusion model achieved 94.3% accuracy on COVID-19 tweets [19], and RoBERTa demonstrated better generalization in depression detection tasks [20]. Meanwhile, additional studies have shown that traditional models such as SVM, Naive Bayes, and Random Forest still perform strongly on Turkish datasets, depending on dataset characteristics [5,21].
The comparison of multilingual and language-specific transformer models has also been extensively explored. Research indicates that XLM-RoBERTa slightly outperformed BERT on the TRSAv1 sentiment dataset [22], whereas other work has shown that BERTurk variants significantly surpassed ELECTRA-based models on 150,000 Turkish examples [23]. Additional findings confirm that language-specific transformer models outperform multilingual ones owing to Turkish's morphological complexity [13]. Moreover, TF-IDF, Word2Vec, GloVe, and BERT embeddings have been compared in multi-label classification tasks, with BERT achieving the highest Micro F1-score [6]. The present study builds upon these findings by revisiting the dataset introduced in earlier research [15], analyzing the same 15,170 e-commerce reviews under a new problem definition: classifying reviews as useful or useless.
A comprehensive analysis of the existing literature reveals two key gaps: (i) most studies focus primarily on English datasets, and research on agglutinative languages remains limited; and (ii) Turkish studies predominantly concentrate on sentiment analysis [18,21,23], while the prediction of review usefulness has not been adequately explored. Addressing this gap, the present study evaluates both traditional ML methods (SVM, LR) and modern transformer-based architectures (BERT, RoBERTa, BERTurk) for usefulness classification in Turkish e-commerce reviews. Additionally, a novel Multi-Transformer Fusion Framework (MTFF) is proposed, incorporating three fusion strategies (concatenation, weighted sum, and attention-based fusion) to identify the most effective architecture.
The major contributions of this study are as follows:
- (i) It extends beyond sentiment analysis and introduces a usefulness classification task for Turkish e-commerce reviews;
- (ii) It provides a comprehensive comparison of traditional ML models and transformer-based architectures using the same dataset;
- (iii) It demonstrates the effectiveness of transformer models for morphologically rich languages;
- (iv) It proposes a multi-transformer fusion framework, in which the concatenation strategy yields the highest performance.
In conclusion, this study fills an important gap in the Turkish NLP literature and offers an original model for enhancing the user experience on e-commerce platforms.
Research Hypotheses
Based on the identified research gaps and the objectives of this study, the following research hypotheses are formulated:
H1: Transformer-based models (BERT, RoBERTa, and BERTurk) achieve higher classification performance than traditional machine learning approaches (SVM, LR) in the task of Turkish e-commerce review usefulness classification.
H2: Language-specific transformer models trained for Turkish outperform multilingual transformer models in terms of F1-score and accuracy, due to the morphological richness of the Turkish language.
H3: The proposed MTFF provides statistically significant performance improvements compared to single transformer-based models, with the concatenation-based fusion strategy yielding the highest performance.
The remainder of this paper is organized as follows: Section 2 details the materials and methods, including preprocessing, manual labeling, and model architectures; Section 3 presents comparative experimental results; and Section 4 discusses implications, limitations, and directions for future research.
3. Results
3.1. Embedding Results
This section presents experimental results in which BERT- and RoBERTa-based embeddings are used in conjunction with ML classifiers (LR and SVM). The performance metrics obtained for both model families are presented together in Table 6. As shown in Table 6, BERT-based embeddings achieved higher classification accuracy than RoBERTa-based embeddings. In the tables, values in bold represent the best results. The best performance was achieved with the BERT + LR combination (F1-Macro = 0.8847, Accuracy = 0.8866); the BERT + SVM model produced a very close result with an F1-Macro of 0.8840.
Among RoBERTa-based embeddings, the LR classifier produced the best result with an F1-Macro of 0.8599. The small performance differences suggest that both embedding types support similarly balanced learning.
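The embedding-plus-classifier pipeline described above can be sketched as follows. This is a minimal illustration, assuming the sentence embeddings (e.g. [CLS] or mean-pooled vectors from a frozen BERT or RoBERTa encoder) have already been extracted; the synthetic cluster data stands in for real review embeddings, and `evaluate_embedding_classifier` is a hypothetical helper, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import f1_score, accuracy_score

def evaluate_embedding_classifier(train_X, train_y, test_X, test_y, use_svm=False):
    """Train a classifier on fixed sentence embeddings and report
    F1-Macro and Accuracy, mirroring the metrics in Table 6."""
    clf = SVC(kernel="linear") if use_svm else LogisticRegression(max_iter=1000)
    clf.fit(train_X, train_y)
    preds = clf.predict(test_X)
    return {
        "f1_macro": f1_score(test_y, preds, average="macro"),
        "accuracy": accuracy_score(test_y, preds),
    }

# Synthetic stand-in for encoder embeddings: two well-separated clusters.
rng = np.random.default_rng(0)
X0 = rng.normal(-1.0, 0.5, size=(100, 16))   # "useless" class (label 0)
X1 = rng.normal(+1.0, 0.5, size=(100, 16))   # "useful" class (label 1)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Even-indexed rows as train, odd-indexed as test.
metrics = evaluate_embedding_classifier(X[::2], y[::2], X[1::2], y[1::2])
print(metrics)
```

In practice the embedding step would precede this: each review is tokenized, passed through the frozen encoder once, and the pooled vector is cached before classifier training.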
The confusion matrix analysis presented in Figure 3 shows that all four models perform similarly balanced classification. In particular, the BERT + LR model's lower error rate in the positive class (useful comments) explains its slight advantage.
3.2. Fine-Tuning Results
Several variants of the BERT and RoBERTa model families were evaluated. The F1-Macro, Accuracy, Precision, and Recall values obtained on the test dataset are presented in Table 7.
As shown in Table 7, the Turkish BERT-222 model achieved the highest performance among all models, with an F1-Macro of 0.8937 and an Accuracy of 0.8950. The Turkish BERT and XLM-RoBERTa Base models showed similar performance, indicating that both can adapt to the Turkish language structure.
The confusion matrix results presented in Figure 4 indicate that the Turkish BERT-222 model performs a more balanced classification between the "useful" and "useless" classes. The XLM-RoBERTa Base model achieves a similar balance thanks to its multilingual training, whereas the DistilRoBERTa model suffers a performance loss due to model compression (distillation). Overall, BERT-based models achieve higher accuracy and F1 scores on the Turkish dataset than RoBERTa-based models, which can be attributed to the availability of BERT variants specifically pre-trained for Turkish.
3.3. MTFF Results
In the fusion phase, the output representations of the BERT and RoBERTa models were combined using three different fusion strategies: Concat Fusion, Attention Fusion, and Weighted Fusion. Each strategy was trained as described in Section 3.2. These strategies aimed to improve overall performance by leveraging the complementary features of the models.
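Two of the fusion strategies above can be illustrated with a short PyTorch sketch. The 768-dimensional pooled outputs, hidden size, and class names are assumptions chosen for illustration; encoder forward passes are omitted, and this is a sketch of the general technique rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ConcatFusionClassifier(nn.Module):
    """Concat fusion: the pooled outputs of two encoders (e.g. BERT and
    XLM-RoBERTa, 768-d each) are concatenated and fed to a classifier head.
    The module consumes pre-pooled sentence vectors."""

    def __init__(self, dim_a=768, dim_b=768, hidden=256, num_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden),  # first linear layer over the concatenated vector
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, emb_a, emb_b):
        fused = torch.cat([emb_a, emb_b], dim=-1)  # (batch, dim_a + dim_b)
        return self.head(fused)

class WeightedFusionClassifier(nn.Module):
    """Weighted-sum alternative: a learnable scalar gate alpha in (0, 1)
    mixes the two representations before classification."""

    def __init__(self, dim=768, hidden=256, num_classes=2):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.0))  # sigmoid(0) = 0.5: start balanced
        self.head = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )

    def forward(self, emb_a, emb_b):
        a = torch.sigmoid(self.alpha)
        fused = a * emb_a + (1 - a) * emb_b
        return self.head(fused)

batch = torch.randn(4, 768), torch.randn(4, 768)
logits_concat = ConcatFusionClassifier()(*batch)
logits_weighted = WeightedFusionClassifier()(*batch)
print(logits_concat.shape, logits_weighted.shape)
```

Note the design trade-off visible here: concat preserves both full 768-d representations for the head to weigh freely, while weighted sum compresses them into a single vector before classification.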
As shown in Table 8, the highest performance was achieved with the Concat Fusion strategy (F1-Macro = 0.9175). The Weighted Fusion model ranked second with an F1 score of 0.9155, while Attention Fusion ranked third with an F1 score of 0.9086. All strategies achieved accuracy above 91%, demonstrating that BERT and RoBERTa representations are complementary.
Figure 5 presents the confusion matrices for the three fusion strategies. The analysis revealed that the Concat Fusion model distinguished between the two classes in a balanced manner and achieved a higher correct classification rate in the "useful" class (1.0). The Weighted Fusion model produced similar accuracy but generated more false positives in the "useless" class (0.0). In the Attention Fusion model, although class balance was maintained, the overall F1 value remained lower.
3.4. Feature Importance Analysis Using SHAP
To improve the interpretability of the best-performing concat-based fusion model, SHAP (SHapley Additive exPlanations) analysis was employed. The SHAP framework enables a quantitative assessment of how individual features contribute to the model’s predictions, providing insights into the linguistic cues that drive usefulness classification.
Figure 6 presents a two-dimensional visualization of the model’s prediction space, illustrating the separation between useful and useless reviews along with their predicted usefulness probabilities. The visualization shows a clear clustering pattern, indicating that the model effectively distinguishes informative reviews from non-informative ones. Reviews predicted as useful are associated with higher usefulness probabilities and form a more compact and coherent cluster, whereas useless reviews exhibit lower probabilities and greater dispersion. This separation suggests that the concatenated representations produce a well-structured feature space with robust decision boundaries.
Figure 7 illustrates the most influential concepts based on the average absolute SHAP values computed over approximately 15,000 samples. The results indicate that words related to product functionality, physical condition, and user experience—such as “charge,” “screen,” “deformed,” “defective,” and “torn”—receive substantially higher importance scores. This finding demonstrates that the model prioritizes experience-based and information-rich expressions rather than surface-level sentiment indicators.
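The global ranking in Figure 7 follows the standard SHAP recipe of averaging absolute attribution values over samples. A minimal numpy sketch of that aggregation step, with hypothetical token names and toy values (not the paper's actual attributions):

```python
import numpy as np

def rank_features_by_shap(shap_values, feature_names, top_k=5):
    """Rank features by mean absolute SHAP value across samples,
    as in a standard SHAP global-importance bar plot."""
    importance = np.abs(shap_values).mean(axis=0)      # (n_features,)
    order = np.argsort(importance)[::-1][:top_k]       # descending
    return [(feature_names[i], float(importance[i])) for i in order]

# Toy attribution matrix: 3 samples x 4 features
vals = np.array([[0.2, -0.8, 0.1, 0.0],
                 [0.3, -0.7, 0.0, 0.1],
                 [0.1, -0.9, 0.2, 0.0]])
# Hypothetical tokens: "charge", "screen", "nice", "good"
names = ["sarj", "ekran", "guzel", "iyi"]
ranked = rank_features_by_shap(vals, names, top_k=2)
print(ranked)
```

Because the ranking uses the absolute value, a token that consistently pushes predictions toward "useless" (negative attributions) still ranks as globally important.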
To further analyze individual predictions, representative useful and useless samples were examined. In useful reviews, experience-oriented tokens (e.g., terms related to charging behavior, screen quality, or functional performance) exhibit strong positive contributions to the model output. In contrast, tokens with limited semantic relevance, punctuation marks, or context-independent fragments contribute negatively or marginally to the prediction. For useless reviews, generic or vague expressions (e.g., brief recommendations or unspecific evaluations) fail to generate strong discriminative signals, reflecting their low informational value.
Additionally, the SHAP-based analysis indicates that features originating from both BERT and XLM-RoBERTa embeddings contribute meaningfully to the final classification decisions, with neither representation consistently dominating the other. This observation confirms that the concatenation strategy captures complementary semantic information from both models rather than introducing redundant features.
Overall, the SHAP analysis demonstrates that the concat-based fusion model relies on semantically meaningful and experience-driven linguistic cues and that its decision-making process aligns well with human intuition regarding review usefulness.
To further investigate the relative contribution of each representation within the fusion architecture, the concat fusion mechanism was analyzed in detail. In this architecture, BERT and XLM-RoBERTa representations are directly concatenated and fed into a multi-layer classifier without any predefined static weighting mechanism. Instead, the relative contribution of each representation is automatically learned by the classifier layers during training. To quantitatively assess this contribution, the weights of the first linear layer processing the concatenated vector were examined. The results indicate that BERT contributes 50.04% and XLM-RoBERTa contributes 49.96% to the final decision. This finding demonstrates that both representations are utilized almost equally in the decision-making process and that the fusion mechanism integrates them in a balanced and complementary manner rather than prioritizing a single model. Therefore, the proposed approach performs a data-driven and adaptive representation fusion rather than relying on static weighting.
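The weight-based contribution analysis above can be approximated in a few lines. One plausible realization (the paper does not state its exact formula) is to compare the absolute weight mass that the first linear layer assigns to each half of the concatenated vector; dimensions and names here are assumptions.

```python
import torch
import torch.nn as nn

def branch_contributions(first_linear, dim_a):
    """Estimate each encoder's share of the decision from the first linear
    layer applied to the concatenated [emb_a | emb_b] vector: the fraction
    of total absolute weight mass falling on each input slice."""
    W = first_linear.weight.detach()              # (hidden, dim_a + dim_b)
    mass_a = W[:, :dim_a].abs().sum().item()      # weights reading encoder A
    mass_b = W[:, dim_a:].abs().sum().item()      # weights reading encoder B
    total = mass_a + mass_b
    return mass_a / total, mass_b / total

# First layer of a hypothetical concat-fusion head (768-d per encoder).
layer = nn.Linear(768 * 2, 256)
share_bert, share_xlmr = branch_contributions(layer, dim_a=768)
print(f"BERT: {share_bert:.2%}, XLM-RoBERTa: {share_xlmr:.2%}")
```

At random initialization the two shares are near 50/50 by symmetry; after training, any drift away from parity would indicate that the classifier has learned to favor one representation.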
3.5. Statistical Significance Analysis
The McNemar test was used to evaluate whether the differences in predictions between fusion strategies were statistically significant. Based on the number of examples predicted differently by two classifiers on the same sample set, the test determines whether the observed difference between models could be due to chance; it is a non-parametric method recommended for comparing two classifiers on the same dataset [27]. In this study, the McNemar test was applied to the models' paired predictions to compare the fusion strategies. A result with p < 0.05 indicates a statistically significant difference between two models, i.e., that one model produces predictions significantly different from the other; when p ≥ 0.05, the difference may be random and the models perform at a similar level. The results are presented in Table 9. No significant difference was observed between the Concat and Weighted strategies (p = 0.4795), indicating that the two models produced similar decisions. In contrast, the Concat–Attention (p = 0.0029) and Attention–Weighted (p = 0.0003) comparisons yielded p-values below 0.05, indicating statistically significant differences. These results show that the concat fusion strategy within the MTFF architecture provides the most stable and generalizable results.
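The McNemar procedure can be sketched with its exact binomial form, which needs only the two discordant counts: samples that exactly one of the two compared models classifies correctly. The counts used below are hypothetical, not the paper's.

```python
from math import comb

def mcnemar_exact(n_only_a_correct, n_only_b_correct):
    """Exact (binomial) two-sided McNemar test on the discordant pairs.
    Under H0 the two discordant counts are symmetric (success prob. 0.5)."""
    b, c = n_only_a_correct, n_only_b_correct
    n = b + c
    k = min(b, c)
    # p = 2 * P(X <= min(b, c)) for X ~ Binomial(n, 0.5), capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical discordant counts for two fusion strategies
p_value = mcnemar_exact(3, 9)
print(f"p = {p_value:.4f}")   # well above 0.05: no significant difference
```

For large discordant counts, the chi-squared approximation with continuity correction, (|b − c| − 1)² / (b + c), is typically used instead; the exact form above is preferable when b + c is small.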
3.6. Error Analysis of Misclassified Samples
When the linguistic differences between correctly and incorrectly classified samples are examined, it is observed that the model tends to make more errors, particularly on short sentences that are highly context-dependent and contain superficial expressions. As illustrated in Figure 8, misclassified samples are predominantly composed of shorter sentences that provide limited contextual information, whereas correctly classified samples generally exhibit longer, more explanatory structures. Sentences involving slang usage, interrogative forms, and numerical expressions often include indirect meanings and are open to interpretation, which makes semantic inference more challenging for the model. In contrast, higher classification performance is observed for sentences that contain explicit linguistic cues, such as negation structures and contrastive conjunctions. Overall, these findings indicate that the amount and clarity of contextual information play a decisive role in the model's decision-making process, and that context-dependent expressions carrying implicit meanings increase the likelihood of classification errors.
4. Discussion
The primary objective of this study is to identify useful customer reviews that support users’ product purchase decisions on e-commerce platforms and to distinguish them from non-informative reviews in order to deliver high-quality content to consumers. The aim of the study is to develop a classification system that highlights qualified reviews and facilitates users’ access to reliable product-related information.
In this context, transformer-based models and fusion strategies were systematically evaluated on 15,170 Turkish e-commerce reviews. The BERT + Logistic Regression (LR) approach demonstrated the best performance among embedding-based methods, achieving an F1-score of 88.47%. Among individual models, the Turkish BERT-222 fine-tuning model achieved superior performance with an F1-score of 89.37%. The highest performance was obtained using the proposed Multi-Transformer Fusion Framework (MTFF) with a concatenation-based fusion strategy, achieving an F1-score of 91.75% and an accuracy of 92.09%.
The superior performance of the concat-based fusion approach compared to both single-model baselines and alternative fusion strategies can be attributed to its ability to preserve the full representational capacity of each pre-trained language model. By directly concatenating the embeddings from BERT and XLM-RoBERTa, the model retains the distinct and complementary linguistic features captured by each encoder without imposing premature compression or information bottlenecks. This allows the downstream classifier to learn, in a data-driven manner, how much emphasis should be placed on each representation. Supporting this interpretation, the SHAP-based analysis indicates that the model primarily relies on semantically meaningful and experience-oriented linguistic cues, such as references to product functionality, condition, and usage, rather than superficial sentiment expressions. Moreover, an examination of the classifier weights reveals that BERT and XLM-RoBERTa representations contribute almost equally to the final predictions, suggesting that the fusion mechanism integrates both sources in a balanced and complementary manner rather than prioritizing a single model. In contrast, fusion mechanisms such as weighted averaging, attention-based fusion, or projection-based integration often involve early-stage aggregation or dimensionality reduction, which may suppress subtle yet discriminative features. This effect becomes particularly critical in short, noisy, and highly variable user-generated texts, where preserving diverse contextual cues is essential. Consequently, the concat strategy enables a richer joint feature space, leading to more robust decision boundaries and improved classification performance.
In the literature, Arzu and Aydoğan reported an accuracy of approximately 83% for sentiment classification of Turkish e-commerce reviews using BERT-based models [23]. In the present study, an F1-score of 89.37% was achieved with the Turkish BERT-222 model, demonstrating that individual BERT-based models can surpass previously reported results. Similarly, Kumar and Sadanandam achieved 94.3% accuracy using a BERT–RoBERTa fusion architecture for English-language texts [19]. For Turkish texts, the present study represents the first application of a BERT + RoBERTa fusion strategy for usefulness classification, on a dataset created by Çabuk et al. [25].
The results indicate that a concatenation-based fusion strategy, which combines the strengths of BERT and RoBERTa representations, is an effective approach for classifying Turkish e-commerce reviews. The proposed architecture, particularly the concatenation-based fusion strategy (91.75% F1-score), is directly applicable to improving user experience on e-commerce platforms. Automatically identifying and highlighting useful customer reviews enables consumers to make more informed purchasing decisions while also providing a valuable tool for platform operators.
Limitations and Future Research
Despite the promising findings, this study has several limitations that should be acknowledged. First, the annotation process was conducted by a single researcher, which may introduce subjectivity and potential labeling bias. Future studies could involve multiple annotators and report inter-annotator agreement metrics such as Cohen’s Kappa.
Second, although the McNemar test was employed to assess the statistical significance of performance differences between fusion strategies, the evaluation was limited to pairwise comparisons on a single train–test split. Future research should conduct more robust statistical analyses across multiple data partitions or repeated cross-validation folds to enhance the generalizability of the findings.
Future studies may investigate the generalization capability of the proposed model across various forms of Turkish text, including social media content, product-related questions, and long-form consumer reviews. Furthermore, evaluating the approach on larger and domain-specific datasets could provide deeper insights into the robustness of the fusion-based architecture.
Additionally, optimizing the model for real-time applications and integrating explainable artificial intelligence (XAI) techniques can offer greater transparency into the model’s decision-making mechanisms. Finally, extending the proposed fusion strategy to other Turkish NLP tasks, such as topic classification, sarcasm detection, and multi-label sentiment analysis, could further expand the applicability of the framework.
5. Conclusions
The experimental findings reveal clear performance differences among the evaluated modeling approaches. Hybrid methods combining transformer-based representations with machine learning classifiers, such as BERT + LR and RoBERTa + LR, achieved higher classification performance compared to traditional machine learning models alone. In addition, Turkish-specific fine-tuned BERT models produced the strongest results among individual transformer architectures.
With respect to model fusion, approaches integrating BERT and RoBERTa representations demonstrated more stable and improved performance than single-model configurations. In particular, the proposed MTFF employing a concatenation-based fusion strategy achieved the highest overall performance across all experimental settings. These results indicate that leveraging complementary information from different transformer architectures provides a substantial advantage in usefulness classification.
Based on these findings, all proposed research hypotheses are supported. The superior performance of transformer-based models over traditional machine learning methods confirms H1, the consistent advantage of Turkish-specific models over multilingual alternatives supports H2, and the performance gains obtained through the BERT–RoBERTa fusion within the MTFF validate H3. Overall, the results demonstrate that fusion-based transformer architectures constitute an effective and practical solution for real-world e-commerce applications.