1. Introduction
Digital marketplaces promise convenience, selection, and price transparency. At the same time, they shift key elements of exchange away from direct interaction between buyers and sellers and into platform-managed processes. Trust becomes a prerequisite because consumers must rely on distant sellers, complex logistics networks, and platform dispute systems. Research on online transactions shows that trust reduces uncertainty and supports purchase intentions, whereas perceived risk discourages participation [1, 2, 3]. These mechanisms remain central in marketplace settings, where consumers evaluate not only sellers but also the platform that defines the rules and enforces them.
In 2024, Türkiye’s e-commerce volume reached TRY 3.162 trillion (≈USD 90 billion), a 61.7% year-on-year increase, with 5.91 billion transactions, and e-commerce accounted for 6.5% of GDP [4]. Consumer adoption is substantial but still below mature European markets: in 2024, 51.7% of internet users in Türkiye reported having purchased goods or services online in the previous 12 months [5], compared with 77% in the EU [6]. Against this backdrop, this study uses longitudinal complaint texts to map how consumer risk and trust grievances are framed, how their salience shifts from 2019 to 2025, and how these patterns differ across major marketplaces.
Consumer complaints provide a direct record of where expectations collapse. A complaint is rarely only a negative sentiment [
7]. It is a narrative that assigns responsibility, describes harm, and demands a remedy [
8]. Service recovery research shows that failures and recoveries shape loyalty, word of mouth, and relationship quality [
9,
10]. Complaint texts often contain richer detail than closed-form survey items because consumers describe specific breakdowns such as delayed delivery, missing parcels, refund delays, counterfeit items, and account compromise. Complaint narratives also reveal perceived justice [
11]. When consumers view procedures as inconsistent or unresponsive, they infer unfairness and reduce trust even if outcomes are eventually corrected [
12].
The platform economy adds another layer. Marketplaces govern access, visibility, and remediation through policies and technical systems. Platform research argues that governance shapes incentives and behavior across buyers, sellers, and intermediaries [
13,
14]. As marketplaces scale, governance becomes more standardized and data-driven. Rule enforcement, seller vetting, and dispute routing may rely on algorithmic classification and queue management. This creates a tension: scale encourages automation, while legitimacy depends on perceived fairness, transparency, and accountability [
15,
16]. Complaints, therefore, capture more than operational issues; they also document governance frictions.
Customer service automation intensifies this tension. Firms deploy chatbots, scripted flows, and self-service portals to reduce response time and cost. Service scholarship treats AI as a structural shift in service design because machines can perform routine tasks and mediate interaction quality [
17,
18,
19]. Automation can improve speed, but it can also introduce new failure modes. Consumers may get trapped in repetitive scripts, fail to reach a human agent, or lose confidence that their case is being evaluated. These experiences align with procedural justice principles, where voice, consistency, and correctability shape acceptance of decisions [
20]. One resulting complaint category is escalation failure. It reflects a perceived barrier to remedy rather than a slow outcome.
Despite extensive work on trust, risk, and service recovery [
21,
22,
23], evidence remains limited on how consumer risk narratives change over time in marketplace environments. Many studies rely on surveys or experiments that hold context constant and measure perceptions at a single point in time [
24,
25,
26,
27,
28]. These designs clarify mechanisms, yet they are less suited to capturing shifts driven by logistics shocks, policy changes, or service automation rollouts. Qualitative studies provide depth but often cover narrow time windows or small samples [
29]. Longitudinal, high-granularity measurement is needed to track changes in complaint composition, not only changes in complaint volume [
30].
Text-as-data methods enable the study of these dynamics at scale. Large review and complaint corpora have been used to identify product attributes, diagnose service pain points, and infer consumer priorities [
31,
32]. Topic modeling is particularly useful because it extracts recurring themes from unstructured text without requiring an exhaustive hand-coded scheme. Classical probabilistic models such as latent Dirichlet allocation are widely used for discovering themes in large corpora [
33]. However, complaint texts are often short, noisy, and multilingual, which can reduce coherence and interpretability in bag-of-words approaches. Embedding-based methods address these limitations by clustering dense semantic representations derived from transformer models [
34,
35]. BERTopic follows this approach by combining transformer embeddings, density-based clustering, and class-based TF–IDF to improve interpretability and stability in applied text corpora [
36].
Building on this line of work, the present study develops and applies a BERTopic pipeline to map consumer risk and trust complaints in Turkish e-commerce between 2019 and 2025. The corpus comprises de-identified complaint texts drawn from two public channels: posts published on Şikayetvar [
37] and user reviews of official marketplace mobile applications on Google Play and the Apple App Store. Using this corpus, the analysis (i) derives an interpretable topic map of complaint content, (ii) estimates topic prevalence over time and detects turning points, and (iii) tests whether topic-prevalence profiles differ across marketplaces after accounting for corpus composition.
This study makes three contributions. First, it offers a longitudinal, marketplace-level measurement of the consumer “complaint agenda” around risk and trust, showing how the relative prominence of grievance categories changes across time. Second, it extends platform trust and perceived risk research by identifying escalation failure as a distinct complaint category that speaks to perceived fairness and trust in marketplace governance, particularly in service and support pathways. Third, it contributes methodologically by demonstrating a multilingual, stability-oriented BERTopic workflow for short and noisy complaint texts, with robustness checks designed for applied settings. The findings also translate into actionable governance implications for refund service levels, authenticity controls, and the design of AI-mediated escalation paths. The main objectives of the study, along with their corresponding research questions, are outlined below.
RQ1: What are the dominant consumer risk and trust complaint topics in e-commerce, and how robust are these topics across modeling choices and languages (Turkish vs. English)?
RQ2: How do complaint topics evolve, and are there identifiable structural shifts or turning points in topic prevalence?
RQ3: Do topic-prevalence profiles differ systematically across marketplaces after accounting for corpus composition (source mix and language share)?
3. Methodology
The study analyzes a longitudinal, publicly available corpus of consumer complaint texts about three major e-commerce marketplaces operating in Turkey. The corpus combines two public channels, posts published on Şikayetvar [
37] and user reviews of the official marketplace mobile applications on Google Play and the Apple App Store, covering January 2019 to December 2025. Complaints were de-identified prior to analysis, and results are reported only in aggregate form. For brevity, the three marketplaces are denoted as Marketplace A, B, and C. All computational analyses were conducted in Python (version 3.11).
Complaint texts capture how consumers frame failures, assign blame, and define what counts as unacceptable, offering a direct window into trust, risk, and justice mechanisms. They also serve as legitimacy tests of platform governance and dispute resolution [
51]. Accordingly, recurring complaint narratives can be treated as latent grievance categories whose prevalence tracks salience and friction points at scale, even if it does not equal incident rates.
To identify these grievance categories and trace their salience over time and across marketplaces, BERTopic is used. Complaint corpora are often short and noisy, which limits the applicability of classical topic models [
33]. BERTopic leverages transformer embeddings to cluster semantically similar complaints despite lexical variation [
34,
35] and produces interpretable topics via class-based TF–IDF [
36]. Topic quality and interpretability were evaluated using coherence and diversity diagnostics, stability checks, and structured human labeling [
57,
58].
3.1. Data Sources and Sampling Frame
The corpus comprises publicly available consumer complaint texts about three major e-commerce marketplaces operating in Turkey: Trendyol [
59], Hepsiburada [
60], and Amazon (Turkey) [
61]. Data were collected from two public channels: (i) posts published on Şikayetvar and (ii) user reviews of the official marketplace mobile applications on Google Play [
62,
63,
64] and the Apple App Store [
65,
66,
67]. The sampling window spans January 2019 to December 2025. The unit of analysis is the complaint text; each record includes the text, platform label, timestamp, and a source indicator. Prior to analysis, complaints were de-identified by removing direct identifiers (e.g., phone numbers, emails, addresses, order or tracking codes), and results are reported only in aggregate form. For brevity in figures and tables, Trendyol is denoted as Marketplace A, Hepsiburada as Marketplace B, and Amazon Turkey as Marketplace C. These labels are used for readability only and do not indicate anonymization.
Data collection followed a targeted sampling frame designed to capture marketplace-related complaint narratives. An e-commerce complaint was defined as a text describing a marketplace transaction or a platform-mediated service process, including order fulfillment, returns, refunds, customer support, seller conduct, and account security. During collection, a keyword filter was applied to reduce irrelevant content. Keywords covered core failure domains, such as delivery, refund, return, counterfeit, fraud, support, account, and privacy, in both Turkish and English. Texts were retained if at least one marketplace brand was referenced or if a marketplace transaction marker such as “order,” “seller,” “marketplace,” or “shipment” was present. Manual screening then removed clearly irrelevant, duplicate, or non-substantive entries. These steps were intended to improve topical relevance rather than to estimate population incidence.
The raw dataset contained 158,420 texts. After preprocessing, language filtering, and deduplication, the final analytical corpus comprised 118,173 texts. Only Turkish and English texts were retained (Turkish: 71%, English: 29%). Deduplication was performed in two stages: (i) hash-based exact-match removal of identical texts after normalization (lowercasing, whitespace standardization, and removal of URLs and identifier-like strings such as order/tracking codes), retaining the first occurrence; and (ii) near-duplicate removal using character n-gram TF–IDF cosine similarity, collapsing records with similarity ≥ 0.95 and keeping the earliest timestamped text within each cluster. In total, 18,735 texts were removed through deduplication, and a further 21,512 texts were excluded due to language filtering and low-information/spam screening.
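The two-stage deduplication described above can be illustrated with a minimal, standard-library sketch. It is not the study's implementation: the identifier pattern, the trigram size, the smoothed IDF weighting, and the greedy "keep the earliest, drop anything too similar" rule are all assumptions made for illustration, and the pairwise comparison would need an approximate index at corpus scale.

```python
import hashlib
import math
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
# Hypothetical "identifier-like" pattern: alphanumeric tokens of 8+ chars containing a digit.
ID_RE = re.compile(r"\b(?=\w*\d)[A-Za-z0-9]{8,}\b")


def normalize(text):
    """Lowercase, strip URLs and identifier-like strings, standardize whitespace."""
    text = URL_RE.sub(" ", text.lower())
    text = ID_RE.sub(" ", text)
    return " ".join(text.split())


def tfidf_vectors(texts, n=3):
    """L2-normalized character n-gram TF-IDF vectors stored as dicts."""
    counts = [Counter(t[i:i + n] for i in range(len(t) - n + 1)) for t in texts]
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(texts)
    vectors = []
    for c in counts:
        vec = {g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1) for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def deduplicate(records, threshold=0.95):
    """records: list of (timestamp, text). Returns kept records, earliest first."""
    seen, stage1 = set(), []
    for ts, text in sorted(records, key=lambda r: r[0]):
        norm = normalize(text)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest not in seen:  # stage 1: exact match after normalization
            seen.add(digest)
            stage1.append((ts, text, norm))
    vectors = tfidf_vectors([r[2] for r in stage1])
    kept, kept_vectors = [], []
    for (ts, text, _), vec in zip(stage1, vectors):
        # stage 2: drop a text if it is too similar to an earlier kept text
        if all(cosine(vec, kv) < threshold for kv in kept_vectors):
            kept.append((ts, text))
            kept_vectors.append(vec)
    return kept
```

Timestamps are assumed to sort chronologically (for example ISO date strings), so that the earliest text in each group of duplicates is the one retained.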
Figure 1 summarizes the data collection and filtering pipeline, and descriptive characteristics of the corpus are reported in Table 1.
Although the texts are publicly available, complaint narratives can contain personal identifiers or transactional details. A de-identification protocol aligned with established guidance for ethical internet research was applied prior to analysis and storage [
68]. No attempt was made to re-identify individuals, profile users, or link records to external datasets.
Redaction was rule-based and conservative. Emails, phone numbers, addresses, order IDs, tracking codes, and bank references containing partial card numbers were removed, along with personal names when they appeared in common name–surname patterns. Only de-identified text and derived features (topic probabilities and topic prevalence) were stored. Results were reported at aggregated levels (topic, month, platform) to reduce disclosure risk.
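A rule-based redaction step of this kind might look like the sketch below. The patterns are hypothetical and cover only emails, Turkish-format mobile numbers, and identifier-like codes; the study's actual rules (including those for addresses, bank references, and names) are not reported here. Replacing matches with placeholders instead of deleting them is also an assumption made for readability.

```python
import re

# Hypothetical patterns for illustration only; order matters (emails before codes).
PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    # Assumed Turkish mobile format, e.g. 0532 123 45 67 or +90 532 123 45 67.
    ("[PHONE]", re.compile(
        r"(?<!\w)(?:\+?90[\s-]?)?0?5\d{2}[\s-]?\d{3}[\s-]?\d{2}[\s-]?\d{2}(?!\w)")),
    # Identifier-like codes: alphanumeric tokens of 8+ chars containing a digit.
    ("[ID]", re.compile(r"\b(?=[A-Za-z0-9]*\d)[A-Za-z0-9]{8,}\b")),
]


def redact(text):
    """Apply each redaction rule in turn and return the de-identified text."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

A conservative rule set like this will over-redact some harmless tokens (any long token containing a digit), which matches the stated preference for reducing disclosure risk over preserving every detail.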
3.2. Text Preprocessing
Complaint texts are short, noisy, and heterogeneous, which can degrade clustering stability when preprocessing is inconsistent [
69]. A standardized preprocessing pipeline was applied. Text was normalized through lowercasing, Unicode normalization, whitespace cleanup, and URL removal. Identifier redaction was implemented via regular expressions consistent with the de-identification protocol. Automatic language identification was performed using the fastText lid.176 language identification model to retain Turkish and English texts only [
70]. Empty entries were removed, and texts shorter than 15 whitespace tokens were excluded to avoid unstable representations driven by sparse content.
Stopwords were removed for topic representation only. Turkish and English stopword lists were used, and domain stopwords such as “order” and “platform” were added to reduce uninformative high-frequency terms in c-TF-IDF representations. Stopwords were not removed before embedding because transformer encoders can benefit from contextual function words in short texts [
35].
Turkish morphology can inflate lexical variation [71]. As a robustness check, the full pipeline was repeated with Turkish lemmatization using an established Turkish morphological analyzer (e.g., Zemberek [72]), and the stability of topic solutions and temporal patterns was evaluated.
3.3. Topic Modeling with BERTopic
BERTopic was used to extract complaint topics from the multilingual complaint corpus. The method clusters dense semantic document representations and constructs interpretable topic descriptors using class-based TF–IDF (c-TF-IDF) [
36]. The embedding step relied on a multilingual sentence-transformer suitable for Turkish and English (paraphrase-multilingual-MiniLM-L12-v2) [
35]. Transformer embeddings capture semantic similarity beyond surface word overlap, which is valuable in complaint corpora where similar issues are described using varied phrasing [
34].
The modeling pipeline proceeded in four stages. First, each complaint was encoded into a dense vector using the pretrained transformer model [
35]. Second, UMAP was applied for dimensionality reduction prior to clustering [
73]. Third, reduced embeddings were clustered with HDBSCAN, which accommodates clusters of varying density and assigns outliers to a noise class [
74]. Fourth, c-TF-IDF was computed within each cluster to generate representative terms [
36], and term lists were refined using a KeyBERT-inspired, diversity-aware representation to reduce redundancy among top words. In addition to the texts that HDBSCAN assigned to the noise class, low-confidence non-noise assignments were excluded using a probability threshold (p < 0.15). Among the non-noise texts, 20,531 (19.0%) had topic-assignment probabilities below the 0.15 threshold and were therefore excluded from prevalence-based summaries. Manual screening was limited to removing residual spam or boilerplate content using pre-defined rules.
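To make the fourth stage concrete, the following sketch computes class-based TF–IDF scores for pre-tokenized documents that have already been grouped by cluster. It follows the weighting usually given for c-TF-IDF (within-cluster term frequency multiplied by log(1 + A / f), where A is the average word count per cluster and f is the term's frequency across all clusters). The function name, the toy input format, and the alphabetical tie-breaking are assumptions; the study relied on the BERTopic library, not on this code.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic, top_n=3):
    """docs_by_topic: dict mapping topic id -> list of tokenized documents.

    Returns a dict mapping topic id -> list of the top_n highest-scoring terms.
    """
    # Treat each cluster as one long "class document".
    class_counts = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in docs_by_topic.items()
    }
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(class_counts)  # A: mean words per class

    top_terms = {}
    for topic, counts in class_counts.items():
        n_words = sum(counts.values())
        scores = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        top_terms[topic] = ranked[:top_n]
    return top_terms
```

Terms that are frequent inside one cluster but rare elsewhere receive the highest scores, which is why the resulting descriptors tend to separate, for example, refund-related clusters from delivery-related ones.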
Model specifications are summarized in Table 2 to support reproducibility. Parameter settings were fixed prior to topic interpretation to limit researcher degrees of freedom. Because UMAP includes stochasticity, stability checks re-estimated the pipeline across 10 UMAP random seeds while holding the remaining parameters constant.
3.4. Topic Quality, Robustness, and Validation
Topic quality and robustness were assessed using complementary quantitative diagnostics, cross-run stability checks, multilingual consistency checks, and structured human validation. Quantitative diagnostics focused on semantic coherence (c_v) and topic diversity, defined as the share of unique words across topic descriptors [
75]. In addition, topic intrusion and word intrusion tests were conducted on a subsample to assess interpretability [
56].
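Topic diversity, as defined above, reduces to a short calculation. The sketch below assumes each topic is represented by an ordered list of descriptor terms; the cut-off of 10 terms per topic is an illustrative default, not a setting reported in the study.

```python
def topic_diversity(topics, top_n=10):
    """Share of unique words among the top_n descriptor terms of all topics."""
    words = [word for terms in topics for word in terms[:top_n]]
    return len(set(words)) / len(words) if words else 0.0
```

A value of 1.0 means that no descriptor term is shared between topics, while lower values indicate overlapping vocabularies.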
Sensitivity to topic granularity: Candidate solutions were estimated for k = 10–60 topics, and c_v coherence and topic diversity were compared across this range (Figure 2). Coherence peaked around k = 35 while topic diversity remained high. Although larger k values further increased diversity, they were associated with a noticeable decline in coherence. Accordingly, k = 35 was selected for the main analysis.
Stability across runs and preprocessing variants: To assess whether the identified topics were artifacts of a single stochastic run, the model was re-estimated across 10 UMAP random seeds while holding the embedding model and clustering settings fixed (Table 2). In addition, sensitivity to preprocessing variants was evaluated, including the Turkish-specific preprocessing described in Section 3.2.
Topic solutions were treated as stable when they showed high overlap in top terms (Jaccard similarity ≥ 0.80) and high similarity in topic representations (cosine similarity ≥ 0.70) across runs/specifications. Stability diagnostics are summarized in Table S1 (Supplementary Material).
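The top-term overlap criterion can be sketched as follows. Each run is assumed to be a list of topics, each topic a list of top terms; matching every topic to its best counterpart in the other run (without enforcing a one-to-one assignment) is a simplifying assumption, and the cosine criterion on topic representations is not shown.

```python
def jaccard(a, b):
    """Jaccard similarity between two term lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def match_topics(run_a, run_b):
    """For each topic in run_a, return (index of best match in run_b, Jaccard score)."""
    matches = []
    for terms_a in run_a:
        scores = [jaccard(terms_a, terms_b) for terms_b in run_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


def stable_share(run_a, run_b, threshold=0.80):
    """Share of run_a topics whose best match in run_b meets the threshold."""
    matches = match_topics(run_a, run_b)
    return sum(score >= threshold for _, score in matches) / len(matches)
```

Applied to every pair of seeds, such a function would yield a distribution of stable shares that could be summarized per topic or per run.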
Multilingual validation (Turkish vs. English): Because the corpus contains both Turkish and English complaints, the consistency of the topic structure across languages was evaluated. The topic model was re-estimated on Turkish-only and English-only subsets, and topics were aligned based on similarity in their representations (e.g., top-term overlap and representation similarity). Alignment was then assessed by comparing (i) the degree of topic correspondence across languages and (ii) whether the dominant macro-themes were preserved. Alignment summary statistics are provided in
Table S2a, and topic-level language-stratified summaries and mappings are provided in
Table S2b (Supplementary Material).
Human validation and topic labeling: Two independent coders labeled topics using (i) the top c-TF-IDF terms and (ii) 10 representative complaints per topic sampled from high-confidence assignments. Coders followed a shared codebook defining label rules and boundary conditions between closely related topics. Inter-coder agreement was assessed using Cohen’s kappa (κ = 0.81) [
76]. Disagreements were resolved through discussion, resulting in the final topic labels used in Section 4. For transparency, the full topic inventory (label, top terms, prevalence, and representative texts) is reported in Table S3a (Supplementary Material), and representative complaints sampled from high-confidence assignments are provided in Table S3b (Supplementary Material). To formalize the macro-theme grouping procedure, Table S3d (Supplementary Material) reports the topic-to-macro-theme assignment rules, boundary conditions for adjacent categories, and the sampling logic used for the 10 representative complaints per topic. After topic labeling, the macro-theme grouping was finalized through coder-guided adjudication using predefined rules. Because this step was consensus-based rather than conducted as a separate, independent coding round, a distinct macro-theme-level agreement statistic was not computed.
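For readers unfamiliar with the agreement statistic used for topic labeling, Cohen's kappa compares observed agreement with the agreement expected from each coder's label frequencies. The sketch below is a generic implementation on hypothetical label lists and does not use the study's coding data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two coders' marginal label shares.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

In this setting, each item would be one topic and each label the category a coder assigned to it.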
3.5. Measures and Analysis Strategy
BERTopic assigns each complaint to a topic and, when available, provides assignment probabilities. Monthly topic prevalence was estimated in two ways. The primary measure was probability-weighted prevalence for topic k in month t, defined as

$$\mathrm{Prev}_{k,t} = \frac{1}{N_t} \sum_{i=1}^{N_t} p_{i,k},$$

where $N_t$ denotes the number of complaints in month t and $p_{i,k}$ is the model-based probability that complaint i belongs to topic k. As a robustness check, prevalence was re-estimated using modal assignments as the share of complaints in month t whose most likely topic was k.
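Both prevalence measures can be computed in a single pass over the assignments. The sketch below assumes each record is a (month, probabilities) pair with one probability per topic; the function and variable names are illustrative.

```python
from collections import defaultdict


def monthly_prevalence(records, n_topics):
    """records: iterable of (month, probs), where probs has one entry per topic.

    Returns two dicts keyed by month:
      weighted: mean assignment probability per topic (probability-weighted prevalence)
      modal:    share of complaints whose most likely topic is each topic
    """
    prob_sums = defaultdict(lambda: [0.0] * n_topics)
    modal_counts = defaultdict(lambda: [0] * n_topics)
    totals = defaultdict(int)
    for month, probs in records:
        totals[month] += 1
        for k, p in enumerate(probs):
            prob_sums[month][k] += p
        modal_counts[month][max(range(n_topics), key=probs.__getitem__)] += 1
    weighted = {m: [s / totals[m] for s in prob_sums[m]] for m in totals}
    modal = {m: [c / totals[m] for c in modal_counts[m]] for m in totals}
    return weighted, modal
```

Comparing the two outputs month by month gives a direct view of how much the conclusions depend on using soft rather than hard assignments.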
Platform topic profiles were computed as topic distributions within each marketplace over the full window and within each year. Cross-platform differences in topic distributions were summarized using Jensen–Shannon divergence and assessed using permutation tests.
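A minimal sketch of this comparison is given below, assuming each marketplace is represented by a list of modal topic ids. The base-2 logarithm (which bounds the divergence between 0 and 1), the label-shuffling scheme, and the number of permutations are assumptions; the study's exact test configuration is not specified in this section.

```python
import math
import random
from collections import Counter


def distribution(labels, n_topics):
    """Empirical topic distribution from a list of topic ids."""
    counts = Counter(labels)
    return [counts.get(k, 0) / len(labels) for k in range(n_topics)]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def permutation_test(labels_a, labels_b, n_topics, n_perm=1000, seed=0):
    """Return (observed divergence, permutation p-value) for two platforms."""
    rng = random.Random(seed)
    observed = js_divergence(distribution(labels_a, n_topics),
                             distribution(labels_b, n_topics))
    pooled = list(labels_a) + list(labels_b)
    n_a = len(labels_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break the link between complaints and platforms
        d = js_divergence(distribution(pooled[:n_a], n_topics),
                          distribution(pooled[n_a:], n_topics))
        hits += d >= observed
    return observed, (hits + 1) / (n_perm + 1)
```

The add-one correction in the p-value keeps the estimate strictly positive, which is a common convention for Monte Carlo permutation tests.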
RQ1 identified dominant grievance categories. Topic inventories were reported with labels, representative terms, and prevalence. Interpretation was anchored in perceived risk facets and justice concerns [
20,
47]. Topic-solution quality, stability, and multilingual validation diagnostics are reported in
Section 3.4.
Because the corpus is bilingual, language-related sensitivity was assessed at two levels. First, semantic robustness was evaluated by re-estimating the topic model separately on Turkish-only and English-only subsets and aligning topics across languages (Table S2a,b, Supplementary Material). Second, differences in macro-theme prevalence across Turkish and English complaints were tested using a Pearson chi-square test (Table S2c, Supplementary Material). To assess whether these differences persisted after adjustment, language effects were also examined within the multinomial logistic framework used for platform comparisons (Table S2d, Supplementary Material).
RQ2 assessed the temporal change in topic salience. Structural shifts in the monthly prevalence series were detected using PELT changepoint estimation [
77]. Pre–post differences were assessed with the Mann–Whitney test and summarized with Cliff’s delta [
78,
79]. Monotonic trends were evaluated with the Mann–Kendall test and Theil–Sen slopes [
80]. False discovery rates were controlled using the Benjamini–Hochberg procedure [
81]. Sensitivity checks assessed whether inferences were robust to temporal dependence in monthly series. Turning-point contrasts were summarized using fixed six-month pre/post windows to provide a common reporting frame across themes. As a sensitivity check, the same comparisons were re-estimated using four-month and eight-month windows, together with a dependence-aware monthly comparison. As shown in
Table S8 (Supplementary Material), the direction of the main shifts remained unchanged, with the strongest breaks proving robust across specifications, while the decline in Remediation Frictions was interpreted more cautiously.
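Two of the reporting steps used for the turning-point contrasts, the Cliff's delta effect size and the Benjamini–Hochberg correction, are simple enough to sketch with the standard library. The Mann–Whitney test, PELT changepoint estimation, and the trend tests would normally come from statistical packages and are not reproduced here; the input series in any usage would be monthly prevalence values from pre and post windows.

```python
def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) minus P(x < y) over all pairs."""
    greater = sum(a > b for a in x for b in y)
    less = sum(a < b for a in x for b in y)
    return (greater - less) / (len(x) * len(y))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of reject flags (original order) controlling the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank  # largest rank whose p-value passes its threshold
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject
```

A delta of 1.0 indicates that every post-window value exceeds every pre-window value, and values near 0 indicate heavily overlapping windows.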
To triangulate the interpretation of the post-2023 shift in escalation-related narratives, a bilingual dictionary of automation markers was constructed (Table S4, Supplementary Material), and each text was flagged if it contained ≥ 1 marker. Monthly marker prevalence was computed as the share of flagged texts. Pre- and post-2023-02 windows were then compared using the same six-month windowing and Mann–Whitney framework used for macro-theme turning points, with Cliff’s δ reported as the effect size. Results are reported for the full corpus and within Escalation Failures.
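The flagging and monthly aggregation can be sketched as below. The marker list is hypothetical and much shorter than a real dictionary would be (the study's markers are given in its Table S4 and are not reproduced here), and whole-word, case-insensitive matching is an assumption that would miss suffixed Turkish forms unless the texts were lemmatized first.

```python
import re
from collections import defaultdict

# Hypothetical bilingual markers for illustration only.
MARKERS = ["chatbot", "bot", "automated reply", "otomatik yanıt", "sanal asistan"]
MARKER_RE = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(m) for m in MARKERS) + r")(?!\w)",
    re.IGNORECASE,
)


def monthly_marker_share(records):
    """records: iterable of (month, text). Returns {month: share of flagged texts}."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for month, text in records:
        total[month] += 1
        flagged[month] += bool(MARKER_RE.search(text))  # flag if >= 1 marker occurs
    return {month: flagged[month] / total[month] for month in total}
```

Running the same function on the subset of texts assigned to a given macro-theme would give the within-theme series.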
RQ3 evaluated platform differences net of composition. Multinomial logistic regression was estimated with modal topic assignment as the outcome and platform as the focal predictor, including month fixed effects and language controls. Results were reported as relative risk ratios with robust standard errors [82], supplemented by Jensen–Shannon divergence and permutation-based p-values.
Finally, substantive conclusions were required to persist under the modal-assignment prevalence estimates. Findings are interpreted as shifts in expressed risk and justice concerns under evolving operational and governance conditions, not as direct estimates of incident rates.
5. Discussion
A first implication for trust and perceived risk theory is that complaint salience tracks the risk facets that are most exposed at a given time. The sharp increase in Fulfillment Disruptions at the 2020-04 breakpoint aligns with heightened time and performance risk, where consumers face uncertainty about delivery execution and limited visibility into last-mile processes [
47]. In marketplace settings, these risks are evaluated through institution-based trust, since platforms act as structural assurance providers through rules, guarantees, and dispute systems [
43,
44]. When fulfillment appears unreliable, the credibility of platform assurances is tested more strongly, even if the marketplace is not the direct carrier. The parallel decline in Product Integrity Risks after the same breakpoint suggests agenda substitution: authenticity and misrepresentation remain important, yet a system-wide disruption can re-prioritize what consumers choose to voice in public complaints. This supports a compositional view of risk narratives: the most salient grievance category is not necessarily the most frequent underlying failure, but the one that feels most consequential or least controllable in that period.
A second and more novel contribution concerns the emergence and growth of Escalation Failures as a distinct macro-theme. Service failure and recovery research emphasizes that post-failure response quality shapes satisfaction and downstream relationship outcomes [
9,
10]. Justice theory clarifies why escalation narratives often sound more severe than the initiating incident: lack of voice, weak correctability, and inconsistency in procedures are conditions under which legitimacy judgments deteriorate [
20,
50]. The post-2023 increase in Escalation Failures suggests a shift in where breakdowns are perceived, from “slow resolution” to “blocked access to resolution.” This pattern is consistent with users increasingly describing front-line support as scripted or automated, which can reduce friction for routine cases while also being associated with failure modes such as looping responses, misclassification, and delayed or blocked handoff to humans [
12,
29,
54,
121]. In platform contexts, escalation is not a peripheral service attribute; it is part of governance because the platform defines routing, evidence requirements, and the conditions under which exceptions can be granted [
122]. The results extend platform trust research by showing that automation and procedural constraints reshape grievances not only through response speed but also through perceived limits on voice and correctability.
A third insight is the salience and growth of Governance Threats, which concentrates risks tied to account integrity, fraud exposure, privacy, consent, and manipulative interface practices. These complaints target the platform as an institutional actor responsible for the integrity of the transaction environment, not just an intermediary connecting buyers and sellers. This is aligned with institution-based trust frameworks, where the platform’s safeguards substitute for direct interpersonal trust under uncertainty [
43,
44]. The rise in Governance Threats after the 2022-06 breakpoint is consistent with a broader shift toward system-level risk narratives, including security and privacy concerns. Research on the economics of information security highlights how incentives and externalities can lead to underinvestment unless governance mechanisms are credible and consistently enforced [
115]. Privacy scholarship similarly emphasizes control, notice, and alignment between stated practices and experienced outcomes as foundations of trust [
116,
117]. Dark-pattern complaints sharpen this framing by positioning interface design as a governance instrument that steers choice in ways users experience as deceptive or difficult to reverse [
120,
123,
124]. The results suggest that governance and security are not “edge” topics in the complaint ecosystem; they represent a stable cluster that gains prominence over time.
Platform comparisons add an additional layer by showing that grievance compositions differ systematically across marketplaces even after accounting for time variation and bilingual corpus composition. Marketplace A’s higher shares of Fulfillment Disruptions and Escalation Failures are consistent with an agenda shaped by operational execution and post-failure procedural access, while Marketplace B’s higher Remediation Frictions indicate a stronger concentration on monetary recovery workflows. Marketplace C’s higher Product Integrity Risks and Governance Threats indicate greater salience of authenticity, information quality, and system integrity narratives in that ecosystem. These differences align with the multi-target nature of trust in marketplaces, where consumers evaluate sellers and platform governance simultaneously [
40,
42]. Complaint narratives make these targets visible because texts assign responsibility, describe what remedy pathways were available, and reveal which assurances were perceived as credible. The platform-level patterning of topics provides an empirical signature of governance and service design differences that is hard to capture with cross-sectional surveys.
Methodologically, the study shows the value of multilingual text-as-data for theory-driven measurement of risk and trust narratives. Topic modeling has been used to extract themes from large corpora without exhaustive hand coding [
33], and recent work demonstrates how embeddings improve semantic coherence in short, noisy texts [
34,
35]. The BERTopic workflow strengthens interpretability through class-based TF–IDF representations and clustering of dense semantic vectors [
36]. The additional stability checks, multilingual consistency checks, and structured human labeling improve confidence that the identified categories correspond to meaningful grievance structures rather than artifacts of a single run or a language-specific vocabulary [
56,
75]. This strengthens the bridge between computational measurement and theory: topic prevalence can be interpreted as changing salience of risk facets and justice concerns, rather than a purely descriptive taxonomy.
Interpretation should remain bounded by what complaint data can support. Complaint corpora capture voiced dissatisfaction rather than the full distribution of negative experiences, and complaint behavior reflects opportunity, motivation, and perceived efficacy of voicing [
55]. The analysis therefore speaks most directly to the composition of expressed grievances in public channels, not to incident rates. Platform differences can reflect differences in user base, channel mix, and reporting norms as well as differences in operational performance. Turning points identify structural changes in prevalence series but do not, on their own, establish causal drivers. These boundaries do not weaken the central contribution, which is to provide a high-granularity, longitudinal view of how risk and trust narratives are framed and reweighted in marketplace complaints under evolving operational conditions and governance regimes.
The main theoretical message is that marketplace trust is increasingly shaped by procedural access and governance integrity, as well as fulfillment and refunds. Complaint agendas reveal where consumers perceive control, fairness, and accountability to break down and how these perceptions shift when platforms scale, automate, and adjust governance. This sets up a clear foundation for the separate implications section by identifying the mechanisms that move complaint narratives from operational incidents to institutional trust judgments.
Although the empirical setting is Turkey, the mechanisms highlighted by the findings are not unique to Turkey. Fulfillment reliability, remedy predictability, escalation access, and governance integrity are core trust levers in many marketplace environments. What is likely to vary across countries is the relative salience of these themes, depending on logistics maturity, platform structure, consumer protection enforcement, and the extent of service automation. For this reason, the complaint-to-action framework should be read as a transferable diagnostic template rather than a Turkey-specific ranking of interventions.
6. Managerial Implications
The findings translate into practical guidance for marketplace operators because the complaint agenda shifts across failure types and across platforms. The point is not to treat complaint prevalence as incident rates, but to use changes in topic salience as a high-signal view of where customers perceive reliability, recovery, escalation access, and governance to be breaking down.
Table 8 provides a compact “complaint-to-action” cheat sheet that links each macro-theme to a priority lever and a minimal KPI set. Because they are inferred from complaint salience patterns, the actions listed in
Table 8 should be read as plausible managerial responses, not as tested causal remedies or validated intervention effects.
Topic prevalence can be operationalized as a monitoring layer. Platforms may move from reactive case handling to proactive control by tracking a small watchlist of sentinel topics (
Table 7) at a weekly or monthly cadence. The dashboard should include thresholds relative to each platform’s own baseline, plus a short “time-to-mitigate” measure to prevent recurring spikes from becoming normalized. This creates a repeatable routine: detect a shift, diagnose the likely process bottleneck, deploy a short-horizon fix, and confirm improvement through the same topic signals.
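The monitoring routine described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the study: the function names, the eight-period baseline window, the two-standard-deviation threshold, and the weekly prevalence shares are all hypothetical choices made for the example.

```python
from statistics import mean, stdev

def flag_spikes(series, baseline_len=8, z=2.0):
    """Flag periods after the baseline window whose prevalence exceeds
    the platform's own baseline mean by more than z baseline std devs."""
    baseline = series[:baseline_len]
    threshold = mean(baseline) + z * stdev(baseline)
    flagged = [i for i, v in enumerate(series)
               if i >= baseline_len and v > threshold]
    return threshold, flagged

def time_to_mitigate(series, threshold, start):
    """Periods elapsed from a flagged spike until prevalence is back at or
    below the threshold; None if that never happens within the series."""
    for i in range(start, len(series)):
        if series[i] <= threshold:
            return i - start
    return None

# Hypothetical weekly prevalence shares for one sentinel topic on one platform.
weekly_share = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.11, 0.09,
                0.10, 0.18, 0.17, 0.13, 0.10, 0.11]

threshold, flagged = flag_spikes(weekly_share)
ttm = time_to_mitigate(weekly_share, threshold, flagged[0]) if flagged else None
print(f"threshold={threshold:.3f}, flagged periods={flagged}, time-to-mitigate={ttm}")
```

Anchoring the threshold to each platform's own baseline, rather than to a shared cutoff, keeps the signal comparable across platforms whose complaint volumes and reporting norms differ. In practice the baseline window would likely need to roll forward so that a sustained shift is not permanently flagged.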
Fulfillment may be better treated as a trust lever than as a logistics metric alone. Fulfillment Disruptions remain prominent and show sharp turning-point behavior, which makes them suitable for early-warning monitoring. The quickest wins often come from reducing “last-mile ambiguity”: clearer tracking states, automatic triggers for stalled scans, proactive messages that specify what the platform will do next, and exception workflows that do not require repeated customer contact. These interventions target performance and time risk as customers experience it [
48] and protect institution-based trust when the platform is the visible accountability holder [
43,
44].
Remediation can be made more predictable by productizing refund and return SLAs. Remediation Frictions represent uncertainty of recovery, which customers often interpret as ongoing financial exposure. The managerial lever is predictability: explicit SLAs, transparent case status, and a clean separation between cases that require investigation and cases that can be auto-approved. Where wallet credits, promotional refunds, or partial refunds drive disputes, platforms may reduce avoidable complaints by clarifying convertibility and eligibility rules before customers initiate a return.
Escalation paths can be designed around procedural access. Escalation Failures are actionable because they reflect not only a service breakdown but also a breakdown in being heard. Procedural justice research highlights why these cases become trust-damaging: lack of voice, weak correctability, and inconsistent handling undermine legitimacy judgments [
20,
50]. If automated support is part of the service model, the core safeguard is a reliable handoff rule: when the system cannot resolve a claim with evidence, customers need a clear route to human review, case ownership, and protection against non-deliberative auto-closure. This reduces repeat contacts and prevents “looping” interactions from becoming a stable source of grievances.
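One way to make the handoff rule concrete is a small routing function. The sketch below is purely illustrative: the `Case` fields, the `next_step` function, the action labels, and the three-turn limit are assumptions introduced here, not features of any platform examined in the study.

```python
from dataclasses import dataclass

@dataclass
class Case:
    resolved_with_evidence: bool  # automated system documented a resolution
    customer_confirmed: bool      # customer accepted that resolution
    bot_turns: int                # automated exchanges so far

def next_step(case: Case, max_bot_turns: int = 3) -> str:
    """Route a support case so it is never closed without customer
    confirmation and never loops indefinitely in automated support."""
    if case.resolved_with_evidence and case.customer_confirmed:
        return "close"
    if case.resolved_with_evidence:
        # Guard against non-deliberative auto-closure.
        return "await_customer_confirmation"
    if case.bot_turns >= max_bot_turns:
        # Unresolved after the turn limit: assign a human owner.
        return "handoff_to_human"
    return "continue_automated"

print(next_step(Case(resolved_with_evidence=False,
                     customer_confirmed=False, bot_turns=3)))
```

The design choice worth noting is that closure requires two independent conditions, so an automated system cannot close a case on its own assessment alone.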
Greater investment in governance integrity becomes important where risks become system-level. Governance Threats point to risks that customers attribute to the platform’s safeguards: account takeover, unauthorized transactions, verification failures, consent disputes, and manipulative interface practices. These are not “edge” issues because they frame the platform as responsible for system integrity. Security economics and privacy research emphasize the need for credible, consistently enforced protections and transparent data practices [
115,
116,
117]. Dark-pattern complaints are especially reputationally costly because they imply intent; routine audits of enrollment, cancellation, defaults, and consent flows can prevent issues that customer support cannot easily “fix” after the fact [
121,
124].
Finally, platform-level differences in macro-theme profiles can guide prioritization. A platform with a complaint agenda weighted toward escalation should not spend its primary improvement capacity on minor UI refinements; it needs escalation architecture and case governance. A platform weighted toward product integrity disputes needs seller enforcement and listing governance. Using the macro-theme profile as a resource-allocation tool helps ensure that investments match the most trust-relevant pain points visible in customer narratives.
7. Robustness, Limitations, and Future Research
The main findings are not tied to a single modeling or measurement choice. Topic prevalence was estimated using probability-weighted assignments and re-estimated using modal assignments; substantive patterns (theme ordering, turning-point timing, and platform reweighting) were required to hold under both constructions. Topic solutions were selected using coherence–diversity diagnostics and checked for stability across UMAP random seeds and preprocessing variants, with topic consistency evaluated through overlap in top terms and similarity in topic representations. Multilingual validity was examined by re-estimating models on Turkish-only and English-only subsets and aligning topics across languages to verify that the dominant macro-themes were preserved. Temporal inferences were further supported by distributional pre–post comparisons around estimated breakpoints, effect-size reporting, and multiple-testing control.
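The two prevalence constructions and the top-term overlap check can be illustrated with a toy example. This is a sketch under stated assumptions rather than the study's pipeline: the document-topic probabilities, the term lists, and all function names are hypothetical, and Jaccard similarity is used here as one plausible overlap measure.

```python
def weighted_prevalence(doc_topic_probs):
    """Probability-weighted prevalence: mean topic probability across documents."""
    n_docs = len(doc_topic_probs)
    n_topics = len(doc_topic_probs[0])
    return [sum(row[k] for row in doc_topic_probs) / n_docs
            for k in range(n_topics)]

def modal_prevalence(doc_topic_probs):
    """Modal prevalence: share of documents whose most probable topic is k."""
    n_docs = len(doc_topic_probs)
    n_topics = len(doc_topic_probs[0])
    counts = [0] * n_topics
    for row in doc_topic_probs:
        counts[max(range(n_topics), key=lambda k: row[k])] += 1
    return [c / n_docs for c in counts]

def rank_order(prevalence):
    """Topic indices sorted from most to least prevalent."""
    return sorted(range(len(prevalence)), key=lambda k: prevalence[k], reverse=True)

def jaccard(terms_a, terms_b):
    """Overlap between two top-term lists (1.0 means identical sets)."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b)

# Hypothetical document-topic probabilities: 4 documents, 3 topics.
probs = [[0.7, 0.2, 0.1],
         [0.6, 0.3, 0.1],
         [0.2, 0.5, 0.3],
         [0.5, 0.1, 0.4]]

weighted = weighted_prevalence(probs)
modal = modal_prevalence(probs)
same_ordering = rank_order(weighted) == rank_order(modal)

# Hypothetical top terms for the same topic under two random seeds.
overlap = jaccard(["delivery", "cargo", "delay", "tracking"],
                  ["delivery", "cargo", "delay", "refund"])
print(same_ordering, round(overlap, 2))
```

The point of the comparison is the ordering rather than the levels: modal assignment discards secondary topic mass, so the two constructions can give different shares while still agreeing on which themes dominate.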
The corpus reflects voiced dissatisfaction in public channels rather than the full distribution of negative experiences. Complaint behavior depends on opportunity, motivation, and perceived efficacy of voicing, so topic prevalence should be interpreted as the salience of expressed grievances rather than incident rates [
55]. Source mix also matters: complaint-portal posts and app-store reviews differ in length, context, and posting incentives, which can shape what gets articulated and how. Platform comparisons can likewise reflect differences in user composition, channel usage, and reporting norms in addition to operational or governance differences. Turning points identify structural changes in prevalence series, but they do not, on their own, establish causal drivers; multiple real-world changes can coincide with a breakpoint. Finally, topic models provide structured summaries, yet boundaries between closely related topics are not always sharp, and some nuances in complaint narratives (especially rare or highly specific issues) may be absorbed into broader categories or the noise class.
Several extensions would deepen inference and improve actionability. First, linking complaint topics to operational and policy data (logistics KPIs, refund processing logs, seller enforcement actions, support queue metrics, and product or interface changes) would help identify which mechanisms plausibly drive observed turning points. Second, stronger causal designs could be pursued around discrete shocks or rollouts, using quasi-experimental strategies (e.g., difference-in-differences around policy changes, staggered adoption of support automation, or regional variation in logistics constraints). Third, richer cross-channel triangulation, adding social media, call-center transcripts, or in-app support chats, could test whether the same themes appear when incentives to complain differ. Fourth, the escalation-failure mechanism merits dedicated study: future work could map where automated support improves resolution versus where it produces “blocked access” experiences, and which design choices reduce perceived unfairness. Finally, cross-country replication would clarify which patterns are specific to the Turkish marketplace context and which generalize to other platform ecosystems with different governance regimes and consumer protection environments.
Cross-country comparative research would be especially valuable for testing how patterns such as chatbot loops, blocked handoff, and ticket auto-closure vary across institutional settings. These complaint patterns may differ in countries with stronger consumer protection rules, clearer appeal rights, or different levels of AI adoption in customer service. Such comparisons would help distinguish grievance structures that generalize broadly from those that are shaped more directly by national regulatory and service environments.
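For the quasi-experimental extension mentioned above, the simplest difference-in-differences estimate compares the pre-to-post change in topic prevalence on a platform exposed to a policy change with the change on an unexposed comparison platform. The sketch below uses the two-group, two-period form with hypothetical monthly shares; the function name and data are assumptions, and the example omits the parallel-trends checks and standard errors that a real analysis would require.

```python
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences on mean prevalence:
    (change in treated group) minus (change in control group)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical monthly shares of an escalation topic before and after
# a support-automation rollout on one platform (treated) but not another.
effect = did_estimate(
    treated_pre=[0.10, 0.12], treated_post=[0.20, 0.22],
    control_pre=[0.08, 0.10], control_post=[0.11, 0.13],
)
print(f"difference-in-differences estimate: {effect:.2f}")
```

Subtracting the comparison platform's change removes shifts common to both platforms, such as seasonal complaint peaks, which is why the design is better suited to discrete rollouts than to gradual changes.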
8. Conclusions
This study examined 118,173 de-identified Turkish and English complaint texts from Türkiye’s e-commerce marketplace environment between 2019 and 2025 to address three research questions. First, it identified a stable complaint architecture composed of 35 micro-topics grouped into five macro-themes: Fulfillment Disruptions, Remediation Frictions, Product Integrity Risks, Escalation Failures, and Governance Threats. Second, it showed that complaint salience changed over time, with a marked shift toward Fulfillment Disruptions in 2020-04, rising Governance Threats after 2022-06, and increasing Escalation Failures after 2023-02. Third, it showed that complaint profiles differ systematically across marketplaces even after accounting for time and language composition.
The main contribution of the study is to show that marketplace trust is shaped not only by delivery, refunds, or product integrity but also by what happens after failure. In particular, the macro-theme of Escalation Failures emerges as a distinct grievance domain, indicating that blocked or automated support can transform operational incidents into broader judgments about fairness, accountability, and platform trust. More broadly, the findings suggest that complaint texts can serve as an early warning signal of changing consumer risk and trust narratives. At the same time, the results should be interpreted within the limits of the data: reliance on voiced dissatisfaction, source composition, targeted text selection, and observational inference. For managers and platform designers, these findings point to plausible priorities for trust protection, including more reliable fulfillment and remediation, more credible escalation paths, stronger authenticity controls, and more visible governance safeguards.