Article

Automated Classification of Public Transport Complaints via Text Mining Using LLMs and Embeddings

by Daniyar Rakhimzhanov 1,*, Saule Belginova 2,* and Didar Yedilkhan 1,*
1 Big Data and Blockchain Technologies Research Innovation Center, Astana IT University, 020000 Astana, Kazakhstan
2 Department of Information Technology, University Turan, 050013 Almaty, Kazakhstan
* Authors to whom correspondence should be addressed.
Information 2025, 16(8), 644; https://doi.org/10.3390/info16080644
Submission received: 14 June 2025 / Revised: 19 July 2025 / Accepted: 24 July 2025 / Published: 29 July 2025
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

Abstract

The proliferation of digital public service platforms and the expansion of e-government initiatives have significantly increased the volume and diversity of citizen-generated feedback. This trend emphasizes the need for classification systems that are not only tailored to specific administrative domains but also robust to the linguistic, contextual, and structural variability inherent in user-submitted content. This study investigates the comparative effectiveness of large language models (LLMs) alongside instruction-tuned embedding models in the task of categorizing public transportation complaints. LLMs were tested using few-shot inference, where classification is guided by a small set of in-context examples. Embedding models were assessed under three paradigms: label-only zero-shot classification, instruction-based classification, and supervised fine-tuning. Results indicate that fine-tuned embeddings can achieve or exceed the accuracy of LLMs, reaching up to 90 percent, while offering significant reductions in inference latency and computational overhead. E5 embeddings showed consistent generalization across unseen categories and input shifts, whereas BGE-M3 demonstrated measurable gains when adapted to task-specific distributions. Instruction-based classification produced lower accuracy for both models, highlighting the limitations of prompt conditioning in isolation. These findings position multilingual embedding models as a viable alternative to LLMs for classification at scale in data-intensive public sector environments.


1. Introduction

Public transportation complaints represent a strategically important but underutilized source of operational data for urban mobility systems. With the expansion of digital communication channels, transit authorities are exposed to increasingly large volumes of free-text feedback generated by passengers [1,2,3]. Historically, most text classification systems in civic contexts have relied on sentiment analysis or topic modeling approaches, which reduce feedback to binary polarity or general thematic clusters [4,5,6]. Recent works on multilingual cyber threat detection underscore the significance of architectures capable of addressing contextual volatility in high-stakes environments, such as social media and civic communication channels [7].
Earlier comparative studies in grammar acceptability classification have shown that transformer-based models substantially outperform POS-tagging and BoW-based methods in linguistic tasks of similar structural complexity [8]. While sufficient for broad trend monitoring, such techniques lack the semantic resolution needed for fine-grained categorization and actionable understanding. In high-throughput, multilingual data environments such as municipal transportation systems, the ability to separate overlapping intents and accurately localize service-specific issues is critical. Novel developments in LLMs and multilingual embedding models have enabled more context-aware and resource-efficient approaches to multilingual classification [9,10,11]. These paradigms are especially relevant in infrastructure-heavy domains, where low-latency inference and scalable deployment are critical constraints [12,13,14,15], as shown in prior applications involving automated review analysis and clinical information extraction [16,17].
Despite the progress in using LLMs and embeddings, there remains a notable gap in the literature: no prior work has thoroughly compared these two paradigms side-by-side on public service complaint classification nor fully explored the trade-offs between accuracy and resource efficiency in this context. Existing studies tend to focus on one class of models in isolation. For example, [11] applied a reasoning-augmented LLM to consumer complaint texts, demonstrating improved detection of complaint categories with chain-of-thought prompts. In a separate approach [14], the authors combined multiple LLM-based classifiers (an ensemble of large models) to boost complaint classification accuracy. These works underscore the growing interest in LLMs for complex classifications. However, they primarily emphasize accuracy gains and do not address the practical drawbacks of relying solely on LLMs, namely the high computational cost and latency, as well as dependence on proprietary APIs or large infrastructure.
On the other hand, research on embedding models in multilingual contexts shows they can be highly effective (e.g., achieving cross-lingual semantic alignment) [18], but their use in fine-grained complaint categorization is under-explored. There is little insight into whether a properly tuned embedding model can match an LLM’s performance on this task or how an instruction-based embedding classifier compares to a zero-shot LLM prompt. This represents a critical gap, given that public agencies need solutions that are both accurate and efficient. Therefore, a strong rationale exists to investigate which approach (LLM vs. embeddings) is more suitable for public transport complaint classification under real-world conditions. The current study is motivated by this gap and aims to provide clarity on the comparative effectiveness of LLMs and embedding models when balancing performance with operational constraints.
Closely related to our study, the works in [19,20] showed that fine-tuned embeddings outperform prompting. However, unlike these studies, which typically evaluate one or two model types in isolation, our work systematically compares a broad spectrum of architectures across both embedding-based and LLM-based paradigms. The evaluation covers multiple multilingual architectures and explores differences in classification accuracy, latency, and deployment feasibility under varied inference conditions.
In this study, we investigate the comparative effectiveness of LLMs and embedding-based architectures across several classification setups. We hypothesize that embedding models, when properly configured, can deliver performance comparable to LLMs while exhibiting superior inference efficiency. Our results provide comprehensive insight into model selection under resource constraints and offer evidence for the viability of embedding-based systems in public-sector language processing pipelines. Emphasis is placed on performance trade-offs across paradigms and the operational feasibility of integrating such models into multilingual, high-volume environments.

2. Materials and Methods

2.1. Dataset

The dataset used in this study is derived from public transport complaints submitted by residents of Astana, Kazakhstan, throughout 2023. These complaints were collected through official municipal platforms, where users described various transit-related issues in free-text format. From a corpus of over 58,000 transport-related entries, a representative subset of 2400 complaints was selected for manual annotation and used in the present study. Each complaint contains a natural-language narrative, varying in length from 47 to over 2000 characters. These texts exhibit significant linguistic diversity, including unstructured phrasing, grammatical inconsistencies, and frequent code-switching between Russian and Kazakh. These characteristics reflect real-world data distributions and provide a rigorous testbed for classification under variability [5].
Each complaint was independently labeled by two native-speaking annotators with domain expertise in urban mobility and transportation systems. Annotation was performed according to a predefined taxonomy consisting of seven functionally distinct categories, the descriptions and examples of which are presented in Table 1. Inter-annotator reliability was assessed using Cohen’s kappa coefficient, which yielded κ = 0.7065, indicating substantial agreement. In cases of disagreement, the final label was determined by a senior annotator, whose judgments served as the reference due to their deeper involvement in the development of the annotation protocol.
While Table 1 provides concise examples, other complaints are linguistically more complex, featuring code-switching as shown in Table 2.
The dataset was randomly sampled to preserve the natural distribution of complaint types across categories, including underrepresented and context-dependent cases (see Table 3). The annotated entries were divided into 75% training and 25% testing partitions, applied consistently across all modeling paradigms, including supervised fine-tuning, few-shot prompting, and embedding-based zero-shot classification. All the few-shot prompts were drawn exclusively from the training set to prevent data leakage.
Several categories in the dataset have low support in the test split due to their natural rarity. For example, expressions of gratitude occur far less frequently than complaints about service disruptions. These categories were retained to preserve the authenticity of user input and to support targeted qualitative analysis. Evaluation reports both class-specific and macro-averaged metrics to account for class imbalance.

2.2. Models and Experimental Configurations

To compare the performance of LLMs and embedding models, we evaluate five models representing these two major classes as shown in Figure 1. For LLMs, classification is performed using few-shot prompting, where the model is given a small number of labeled examples within the prompt to guide prediction. For embedding-based models, we consider three approaches: (1) zero-shot classification using label descriptions, where the model assigns a class based solely on semantic similarity to label names or definitions; (2) instruction-based classification, which uses task prompts to guide inference without retraining; and (3) supervised fine-tuning, where model weights are adjusted using labeled training data.

2.2.1. Few-Shot Prompting Configuration for LLMs

The few-shot prompting configuration avoids any model fine-tuning or parameter adjustment and relies exclusively on in-context learning [11]. Each LLM receives a structured prompt with three labeled examples per class, covering all seven target categories (21 examples total). Samples are drawn from the training set and kept constant across all experiments to ensure consistency. The prompt lists the classification categories and instructs the model to return one label per complaint. Messages follow a role-tagged format compatible with both the OpenAI and Claude APIs to ensure consistency across models. Listing 1 shows the input format, while Figure 2 illustrates the overall workflow. Without task-specific tuning, this setup enables a fair evaluation of generalization in multilingual, low-supervision settings.
Listing 1. Prompt template used in few-shot classification with LLMs.
messages = [
    {
        "role": "system",
        "content": (
            "You are a complaint classification assistant. "
            "Given a complaint, classify it into one of the following "
            "categories: " + ", ".join(categories) + "\n"
            "Here are some examples:\n\n"
        ),
    },
    {
        "role": "user",
        "content": (
            "Now, classify the following complaint:\n"
            f"Complaint: {complaint_text}\nCategory:"
        ),
    },
]
This few-shot prompting setup enables direct comparison between LLMs and embedding-based classifiers under comparable data and task conditions [14,18]. It follows evidence that contextual examples improve LLM generalization in complex classification tasks [19,20], including prompt-based extraction in domains like clinical reports [16]. Henríquez-Jara et al. [21] also demonstrated ChatGPT’s (GPT-3.5-turbo) effectiveness for classifying affective content in multilingual public transport feedback.
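For illustration, the sketch below shows one way the prompt from Listing 1 could be filled with in-context examples and sent through the OpenAI Python SDK (v1 interface); the example complaints are placeholders standing in for the 21 fixed training examples, and the same messages can be passed to the Claude API in its analogous role-tagged format. It is a minimal sketch, not the authors' exact scripts.
from openai import OpenAI

client = OpenAI()

categories = [
    "Transport Problems", "Personnel Problems", "Bus Stop Infrastructure",
    "Equipment Faults", "Praise", "Organizational and Technical Problems",
    "Other Complaints",
]

# Placeholder examples; the study uses three fixed training examples per category (21 total).
few_shot_examples = [
    ("The bus on route 21 skipped the stop again this morning.", "Transport Problems"),
    ("Thank you to the driver of bus 10 for the polite service.", "Praise"),
]

example_block = "\n\n".join(
    f"Complaint: {text}\nCategory: {label}" for text, label in few_shot_examples
)

system_prompt = (
    "You are a complaint classification assistant. "
    "Given a complaint, classify it into one of the following categories: "
    + ", ".join(categories)
    + "\nHere are some examples:\n\n" + example_block
)

def classify(complaint_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # or gpt-3.5-turbo-0125
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Now, classify the following complaint:\nComplaint: {complaint_text}\nCategory:"},
        ],
    )
    return response.choices[0].message.content.strip()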

2.2.2. Zero-Shot Configuration for Embeddings

As a baseline for classification without supervision, we implemented a label-only zero-shot approach using multilingual embedding models. This method excludes the use of labeled examples, natural language instructions, or any form of fine-tuning. Instead, both the complaint texts and the category labels are independently encoded into dense vector representations. Classification is performed by computing the cosine similarity between the embedded complaint and each label embedding, with the highest similarity score determining the predicted class (see Figure 3).
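The snippet below gives a minimal sketch of this label-only procedure. It assumes the checkpoints are loaded through the sentence-transformers library (the BGE-M3 checkpoint is shown) and that the seven category names are the only label information; it mirrors the description above rather than reproducing the authors' exact code.
from sentence_transformers import SentenceTransformer, util

labels = [
    "Transport Problems", "Personnel Problems", "Bus Stop Infrastructure",
    "Equipment Faults", "Praise", "Organizational and Technical Problems",
    "Other Complaints",
]

model = SentenceTransformer("BAAI/bge-m3")
label_embeddings = model.encode(labels, normalize_embeddings=True)

def predict(complaint: str) -> str:
    # Embed the complaint and pick the label with the highest cosine similarity.
    complaint_embedding = model.encode(complaint, normalize_embeddings=True)
    similarities = util.cos_sim(complaint_embedding, label_embeddings)
    return labels[int(similarities.argmax())]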

2.2.3. Instruction-Based Configuration for Embeddings

While the label-only method provides a simple and fully unsupervised baseline by matching complaints to category names via embedding similarity, it does not supply the model with any explicit information about the task or the semantic distinctions between classes. To address this limitation, we extended the zero-shot paradigm with an instruction-based classification approach, which incorporates task descriptions directly into the input through natural language prompts. This setup allows embedding models to perform zero-shot classification by leveraging prompt semantics, without requiring any task-specific fine-tuning or labeled data [9,22].
Each input was created by concatenating a classification instruction with the complaint text, forming a single natural-language input (see Listing 2). Likewise, each category was represented by pairing the same instruction with a brief, class-specific description drawn from a manually curated desc_map. This setup generated two types of embedding inputs: (1) the instruction plus complaint text and (2) the instruction plus a label description. The cosine similarity between these two types of embeddings was used to determine the most semantically aligned category for each complaint. To address the multilingual nature of the dataset, all prompts were constructed in both Russian and English. Each instruction listed all seven target categories, with a concise and distinctive description for each class to improve inter-class differentiation and reduce ambiguity during inference. We do not include the workflow figure, as it closely resembles Figure 3.
Listing 2. Instruction-based prompt template for multilingual embedding models.
categories = [
    "Transport Problems",
    "Personnel Problems",
    "Bus Stop Infrastructure",
    "Equipment Faults",
    "Praise",
    "Organizational and Technical Problems",
    "Other Complaints",
]

desc_map = {
    "Transport Problems": "bus delays, route cancellations, irregular schedules",
    "Personnel Problems": "driver rudeness, negligence, impolite staff behavior",
    "Bus Stop Infrastructure": "missing shelters, poor signage, broken stop markings",
    "Equipment Faults": "broken terminals, air conditioning issues, malfunctioning doors",
    "Praise": "positive feedback, gratitude for service quality",
    "Organizational and Technical Problems": "schedule errors, mobile app failures",
    "Other Complaints": "issues not covered by the main categories",
}

instruction = "Classify the complaint into one of the following categories:\n"
for cat, desc in desc_map.items():
    instruction += f"- {cat}: {desc}\n"
instruction += "\nComplaint:"
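The following sketch shows how the instruction-conditioned inputs defined in Listing 2 might be embedded and scored. Loading the E5 checkpoint through sentence-transformers is an assumption, and the snippet reuses the categories, desc_map, and instruction variables from Listing 2.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

# Label representations: the shared instruction paired with each class description.
label_names = list(desc_map)
label_texts = [instruction + " " + desc for desc in desc_map.values()]
label_embeddings = model.encode(label_texts, normalize_embeddings=True)

def classify(complaint_text: str) -> str:
    # Complaint representation: the same instruction paired with the complaint text.
    query = model.encode(instruction + " " + complaint_text, normalize_embeddings=True)
    scores = util.cos_sim(query, label_embeddings)
    return label_names[int(scores.argmax())]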

2.2.4. Supervised Fine-Tuning Configuration for Embeddings

Fine-tuning was conducted to evaluate model performance under explicit supervision. A contrastive learning objective was used to adjust the embedding space by pulling together representations of complaints from the same category and pushing apart those from different categories, as illustrated in Figure 4. This approach improves the model’s ability to distinguish between semantically similar classes. The training data consisted of labeled complaint pairs from the annotated dataset. No additional preprocessing was applied, and raw user-generated inputs were preserved, including punctuation, inconsistent casing, and multilingual fragments. This ensured that training reflected real-world input conditions and maintained natural linguistic variation.
Fine-tuning affected only the upper layers of the model architecture, including the projection layer and selected encoder components, while pretrained weights remained frozen to retain multilingual generalizability. Optimization was conducted for three epochs using a batch size of 64 and a learning rate of 2 × 10−5. Following training, complaint texts from the test set were embedded into fixed-size vectors and compared to class-level prototype representations. These prototypes were computed as mean embeddings of training examples belonging to each category. Final predictions were derived via cosine similarity between the complaint embedding and the centroids, selecting the nearest semantic match.
This fine-tuning strategy enables embedding models to learn domain-aligned representations that capture the structural and lexical variation specific to transportation complaint data. It offers a resource-efficient method for adapting pretrained models to real-world deployment environments characterized by linguistic variability, high-volume input, and the need for low-latency inference.
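The sketch below illustrates this configuration under stated assumptions: the sentence-transformers training utilities, a label-based triplet loss standing in for the contrastive objective, and hypothetical train_texts / train_label_ids lists holding the annotated training split. The selective freezing of lower layers described above is omitted for brevity.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util
import numpy as np

model = SentenceTransformer("BAAI/bge-m3")

# train_texts: list[str]; train_label_ids: list[int] (assumed to hold the 75% training split).
train_examples = [InputExample(texts=[t], label=c) for t, c in zip(train_texts, train_label_ids)]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=64)
loss = losses.BatchAllTripletLoss(model)  # pulls same-class embeddings together, pushes others apart

model.fit(train_objectives=[(train_loader, loss)], epochs=3, optimizer_params={"lr": 2e-5})

# Class prototypes: mean embedding of the training examples in each category.
train_embeddings = model.encode(train_texts, normalize_embeddings=True)
labels_arr = np.array(train_label_ids)
centroids = {c: train_embeddings[labels_arr == c].mean(axis=0) for c in np.unique(labels_arr)}

def predict(text: str) -> int:
    # Nearest class centroid by cosine similarity.
    emb = model.encode(text, normalize_embeddings=True)
    return max(centroids, key=lambda c: float(util.cos_sim(emb, centroids[c])))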

2.3. Evaluation Strategy

To ensure fair comparison across modeling paradigms, we applied a unified evaluation protocol using a shared test split. Performance was primarily measured by exact match accuracy, reflecting the proportion of cases where the predicted category matched the ground truth. To address class imbalance, we also computed weighted precision, recall, and F1-score. All results were derived via a consistent post-processing pipeline and recorded in tabular format to facilitate reproducibility and category-level diagnostics.
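A minimal sketch of this evaluation step is given below; it assumes scikit-learn and aligned y_true / y_pred lists of gold and predicted category labels for the shared test split.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

accuracy = accuracy_score(y_true, y_pred)  # exact-match accuracy
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0  # weighted to reflect class imbalance
)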

2.3.1. Confusion Matrix Analysis

To complement aggregate metrics, we analyzed confusion matrices to visualize inter-class misclassifications. Heatmaps highlighted correct predictions along the diagonal and confusion patterns off-diagonal, enabling interpretation of common errors. These visualizations revealed frequent misclassifications between lexically overlapping categories, class-specific prediction biases, and imbalances in precision-recall dynamics.

2.3.2. Bootstrap-Based Confidence Estimation

To quantify the statistical reliability of classification accuracy and assess the significance of observed differences between models, we employed a non-parametric bootstrap resampling procedure [23]. This approach enables the estimation of confidence intervals without assuming any underlying distribution of prediction errors, making it especially suitable for evaluation under limited test set sizes or class imbalance.
Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ denote the test set of $n$ samples, where $x_i$ is the complaint text and $y_i$ is the ground-truth category. For each model $M$, predictions $\hat{y}_i$ are obtained, and classification accuracy is computed as follows:
$$\mathrm{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left(\hat{y}_i = y_i\right)$$
To construct a bootstrap distribution of accuracy, we repeatedly sample with replacement from the test set to generate $B = 1000$ bootstrap replicates $D_b$, $b \in \{1, \dots, B\}$, each of size $n$. For each replicate, the accuracy $\mathrm{Acc}_b$ is computed. To ensure reproducibility, all bootstrap procedures were conducted using a fixed random seed of 42. The resulting distribution allows for the construction of a 95% confidence interval by extracting the 2.5th and 97.5th percentiles, as follows:
$$\mathrm{CI}_{95\%} = \left[\mathrm{Percentile}_{2.5}\left(\mathrm{Acc}_{1:B}\right),\; \mathrm{Percentile}_{97.5}\left(\mathrm{Acc}_{1:B}\right)\right]$$
To assess the robustness of the performance estimates, bootstrap resampling was applied to all tested models. The resulting confidence intervals were subsequently used to determine whether the observed differences in classification accuracy could be considered statistically significant.
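A compact sketch of this procedure, using the stated settings (B = 1000 replicates, random seed 42) and assuming aligned arrays of gold and predicted labels, is shown below.
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=42):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    accs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test indices with replacement
        accs[b] = np.mean(y_true[idx] == y_pred[idx])
    lower, upper = np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return accs.mean(), (lower, upper)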

3. Results

The following subsections provide a stratified breakdown of model performance by classification paradigm, emphasizing trade-offs in representational generalization, instruction sensitivity, and inference efficiency under high-volume multilingual input conditions.

3.1. Model-Level Accuracy Comparison

3.1.1. Language-Specific Instruction Adaptation

To assess the influence of instruction language, we evaluated both models using prompts formulated in Russian and English. As shown in Table 4, E5 achieved higher accuracy with English instructions (85.83%) compared to Russian (82.00%), while BGE-M3 showed a slight preference for Russian (81.83% vs. 80.67%).
These results suggest that E5 is more attuned to English, likely due to its instruction-tuning process, whereas BGE-M3 may benefit from closer alignment between prompt and input language. Confidence intervals from 1000 bootstrap iterations confirmed this trend: for E5, 83.8–87.8% in English vs. 80.1–83.9% in Russian, indicating modest statistical significance. Full category-level metrics are reported in Appendix A, Table A1 and Table A2.
This highlights the importance of prompt-language alignment in multilingual settings, particularly when leveraging instruction-conditioned embeddings.

3.1.2. Zero-Shot Performance Across Embedding Models

We evaluated the zero-shot classification performance of two multilingual embedding models (E5 and BGE-M3) using cosine similarity between complaint texts and static label embeddings, without fine-tuning or prompt-based conditioning. Results are presented in Table 5, with class-level metrics in Appendix B, Table A3.
E5 outperformed BGE-M3 by nearly 9 percentage points, indicating stronger generalization in zero-shot settings. This difference likely stems from E5’s instruction-tuned pretraining, which yields more semantically aligned embeddings. In contrast, BGE-M3 exhibited greater variability across classes, particularly in abstract or sparsely represented categories (e.g., “Other Complaints”), suggesting reduced robustness to semantic ambiguity.
Bootstrap-based confidence intervals confirm the significance of the gap, as follows:
  • E5 (zero shot): Mean Precision = 89.69%, 95% CI = 87.17–92.00%
  • BGE-M3 (zero shot): Mean Precision = 80.71%, 95% CI = 77.50–83.83%
The non-overlapping intervals highlight E5’s superior performance in zero-shot classification without adaptation.

3.1.3. Fine-Tuned Performance Across Embedding Models

We fine-tuned both embedding models on 75% of the annotated data and evaluated them on the held-out 25%. As shown in Table 6, BGE-M3 achieved 93.00% accuracy, slightly outperforming E5 (92.68%). Full class-wise metrics are available in Appendix C, Table A4.
Bootstrap resampling with 1000 iterations produced overlapping confidence intervals as follows:
  • E5 (fine-tuned): Mean Accuracy = 92.83%, 95% CI = 90.83–94.83%
  • BGE-M3 (fine-tuned): Mean Accuracy = 93.00%, 95% CI = 91.00–95.00%
The marginal difference (0.17 percentage point) is not statistically significant, indicating comparable performance under supervised conditions.
However, error distribution analysis suggests divergent model behavior. BGE-M3 showed greater stability in structurally simple categories (e.g., “Gratitude” or “Personnel Issues”), while E5 was more accurate on complaints with nested structures or ambiguous phrasing (e.g., “Organizational and Technical Issues”). Additionally, BGE-M3 benefited more from fine-tuning relative to its zero-shot baseline, whereas E5 remained consistent across both modes. This suggests complementary strengths: BGE-M3 adapts well to labeled data, while E5 demonstrates stronger baseline generalization.

3.1.4. Few-Shot Classification with LLMs

Few-shot classification was conducted using in-context prompts that included three labeled examples per category. We applied this setup uniformly across three instruction-tuned LLMs: Claude 3.7 Sonnet, GPT-4o, and GPT-3.5-turbo-0125. Table 7 summarizes the resulting accuracies and cost estimates. Full category-wise results are provided in Appendix D, Table A5, Table A6 and Table A7.
Claude slightly outperformed GPT-4o, while GPT-3.5 lagged. Bootstrap estimates confirmed the differences:
  • Claude: Mean Accuracy = 89.68%, 95% CI = 87.17–92.00%
  • GPT-4o: Mean Accuracy = 89.00%, 95% CI = 86.33–91.50%
  • GPT-3.5: Mean Accuracy = 66.92%, 95% CI = 63.00–70.50%
All inference calls were executed in batches of 100 complaints. This batch size was chosen to reflect realistic deployment scenarios in production pipelines. The per-batch cost reported in Table 7 was recorded directly from API logs and includes all associated token and request overhead. Accordingly, the cost per complaint is approximately USD 0.0441 for Claude 3.7 Sonnet, USD 0.0319 for GPT-4o, and USD 0.0081 for GPT-3.5-turbo-0125. These values ensure reproducibility and clarity in practical deployment scenarios.
While these per-complaint costs are manageable at test scale, they increase significantly in high-volume public sector deployments. For instance, classifying 10,000 complaints daily would incur approximately USD 441/day with Claude 3.7 Sonnet, USD 319/day with GPT-4o, and USD 81/day with GPT-3.5-turbo, translating to monthly expenditures ranging from USD 2430 to over USD 13,000 depending on the model. This cost burden becomes particularly critical in resource-constrained municipal contexts, where budget allocations for digital services are limited.
In addition to financial cost, LLM-based inference introduces non-negligible environmental impact. LLMs are energy-intensive during inference, contributing to increased carbon emissions per query. According to recent benchmarks, models like GPT-4 and Claude 3.7 can consume an order of magnitude more energy per inference than distilled or embedding-based counterparts. These environmental and financial considerations highlight the necessity of balancing predictive accuracy with sustainability and scalability when selecting models for real-world public service applications.

3.2. Error Patterns and Category-Level Insights

To move beyond aggregate metrics, we examined model behavior at the category level using confusion matrices. This allowed us to assess not only classification accuracy, but also error distribution across semantically adjacent and infrequent classes, which were deliberately preserved in the dataset to reflect real-world complaint frequencies. Misclassification patterns revealed tendencies such as systematic overlap in ambiguous categories and uneven treatment of low-resource labels. This approach supports a more nuanced understanding of model performance, particularly under conditions of multilingual input and class imbalance.

3.2.1. Instruction-Based Models: Language-Specific Confusion Pattern

Figure 5 and Figure 6 visualize the confusion matrices for BGE-M3 and E5 under instruction prompts in Russian and English. BGE-M3 produces consistent predictions in high-frequency categories (e.g., transport problems, personnel problems) but shows instability in sparse or semantically diffuse classes such as praise and other complaints. Russian prompts yield a more concentrated diagonal, whereas English prompts improve precision in structured categories like bus stop infrastructure while increasing confusion among adjacent labels, indicating high sensitivity to prompt language and reliance on explicit linguistic alignment. In contrast, E5 maintains coherent class separation across both languages, with strong diagonal focus and limited off-axis dispersion. English prompts further enhance separability in low-frequency categories, suggesting that E5 generalizes more effectively across linguistic contexts, likely due to training exposure skewed toward English.
Overall, Figure 5 and Figure 6 illustrate that instruction-based classification remains language-sensitive, with BGE-M3 benefitting from prompt–input alignment, while E5 demonstrates greater cross-lingual stability.

3.2.2. Label-Only Zero-Shot Classification

In the label-only setup, models classified complaints based solely on cosine similarity between text and static label embeddings, without fine-tuning or prompts. As shown in Figure 7, E5 produced clearer category boundaries, with strong diagonal focus across both frequent and infrequent classes. This suggests a well-structured embedding space, likely shaped by instruction-tuned multilingual pretraining. BGE-M3, while accurate on high-volume categories like transport problems, exhibited more off-diagonal confusion in overlapping classes such as organizational and technical problems, indicating reduced semantic separation.
These results confirm that E5 better generalizes under zero-shot constraints, particularly in multilingual, imbalanced data. When deployed at scale, label-only classification with robust embeddings offers a lightweight yet effective strategy for public service feedback analysis.

3.2.3. Fine-Tuned Embedding Models

Figure 8 shows the confusion matrices for fine-tuned E5 and BGE-M3 models. Both exhibit high accuracy, with most predictions concentrated along the diagonal. E5 shows minimal confusion, with errors mainly in semantically close categories like organizational and technical problems and other complaints. BGE-M3 displays slightly more off-diagonal activity in low-resource classes, possibly due to greater sensitivity to intra-class variation.
These patterns suggest that both models, once fine-tuned, handle multilingual, imbalanced data effectively. Their performance confirms their applicability for large-scale feedback classification, though E5 may offer marginally stronger category separation in ambiguous cases.

3.2.4. LLM-Based Few-Shot Classification Performance

Figure 9 presents confusion matrices for GPT-3.5-turbo-0125, GPT-4o, and Claude 3.7 Sonnet under few-shot prompting with three examples per class. Across all models, categories like transport problems and personnel problems were classified with high consistency, reflecting lexical regularity and class dominance. Claude achieved the most distinct class separation, particularly in low-frequency categories, while GPT-4o showed strong performance in structurally defined labels such as organizational and technical problems. GPT-3.5-turbo-0125 exhibited greater confusion, especially in semantically diffuse or underrepresented classes.
LLM performance in few-shot classification reveals distinct behavioral patterns. Claude minimizes confusion in low-resource and adjacent categories, making it a strong candidate for imbalanced multilingual tasks. GPT-4o provides competitive accuracy with lower inference cost, while GPT-3.5 exhibits reduced precision, particularly on semantically diffuse inputs, indicating limited contextual abstraction.
A deeper qualitative error analysis was conducted to examine divergence between LLMs and embedding-based models. Specifically, we categorized all classification outcomes into four types: (1) both models correct, (2) only the LLM correct, (3) only the embedding model correct, and (4) both incorrect. Table 8 summarizes the distribution of these outcomes. Among 216 test complaints, both models provided correct predictions in 192 cases. The LLM correctly classified four instances that the embedding model misclassified, while the embedding model outperformed the LLM on thirteen complaints. In seven instances, both models failed to identify the correct category. These results indicate a high level of agreement between the two models while highlighting complementary capabilities in handling ambiguous or underspecified complaints.
A close examination of misclassified examples revealed consistent patterns in model behavior. Errors made by LLMs typically stemmed from ambiguity in complaints involving overlapping categories, such as infrastructural issues framed as organizational failures. In contrast, the embedding model often misinterpreted nuanced or implicit references, particularly in messages with code-switching or indirect phrasing. For instance, the LLM correctly classified complaints about inoperative validation terminals as equipment failures, whereas the embedding model defaulted to broader categories such as transport issues. These discrepancies suggest that while LLMs excel at inferring latent intent from free-form text, embedding models may rely more heavily on lexical cues and learned associations. This distinction underlines the LLMs’ strength in pragmatic reasoning and contextual flexibility and the embedding models’ precision in domain-specific regularities, especially when fine-tuned on well-structured datasets.

4. Discussion

4.1. Misclassification Trends and Error Distribution

Across all models, misclassification trends were concentrated in semantically adjacent or underrepresented categories such as organizational and technical problems, praise, and other complaints. These classes often lacked clear operational boundaries, contributing to consistent confusion in both LLMs and embedding-based models. A common issue was the misattribution of behavior-driven complaints, such as driver conduct, to infrastructural or vehicle-related classes, especially when cues were ambiguous or indirect.
For example, the complaint “Bus N, stop at BUS_STOP. The driver did not open the doors” (“Автoбус N, oстанoвка BUS_STOP. Вoдитель не oткрыл двери”) was misclassified as a transport problem by all LLMs, despite reflecting a personnel problem. This illustrates a broader pattern in which models rely on lexical anchors (e.g., “bus,” “stop”) rather than contextual cues that indicate agency or responsibility.
Fine-tuned embeddings exhibited improved precision overall but similarly struggled with intent-driven edge cases. One such example is “Bus N, two identical buses arrived at BUS_STOP. One of them did not stop” (“Автoбус N, BUS_STOP аялдамасына екі бірдей автoбус келді. Біреуі тoқтамады”), which was also misclassified as transport problems despite describing a behavioral issue consistent with Personnel problems.
Errors in distinguishing between infrastructure and operational categories were also observed. The complaint “The stop on BUS_STOP Street for routes N is not equipped with a shelter” (“Улица BUS_STOP, oстанoвка маршрутoв N не oбoрудoвана павильoнoм”) was frequently labeled as a transport problem, although it clearly falls under bus stop infrastructure. Zero-shot models were particularly sensitive to such surface-level features and often failed to infer pragmatic roles in the absence of fine-tuning.
These observations point to structural limitations in handling subtle or overlapping category definitions, especially when complaints encode implicit behavioral or operational context.
Figure 10 illustrates that misclassification rates for code-switched complaints consistently exceed those of monolingual inputs, particularly among embedding-based models. This performance gap highlights persistent challenges in processing mixed-language content, where syntactic and semantic cues are split across linguistic systems. Notably, LLMs display greater robustness, likely due to their multilingual pretraining and subword tokenization. Although the difference in error rates is not statistically significant (p = 0.1146), the observed trend suggests that LLMs may better generalize over multilingual variability. These findings emphasize that code-switching introduces latent structural ambiguity, which is insufficiently addressed by current embedding-based approaches.

Code-Switching and Contextual Ambiguity

Multilingual complaints combining Russian and Kazakh posed a distinct challenge, particularly in differentiating between driver behavior (personnel problems) and service-level issues (transport problems). In such cases, surface-level tokens like route numbers and stop names obscured the intended meaning, especially when action or agency was conveyed through Kazakh verbs.
Representative examples include the following:
  • “Route N, vehicle number NUMBER, at stop BUS_STOP, departed toward the school direction” (“Маршрут N, гoс. нoмер NUMBER, аялдама BUS_STOP, мектеп бағытына кетті”). True category: Personnel problems; predicted as Transport problems due to the focus on route and stop markers.
  • “Route N did not stop at the final stop BUS_STOP” (“Маршрут N не oстанoвился на кoнечнoй аялдаме BUS_STOP”). True category: Personnel problems; classified as Transport problems based on the emphasis on location.
  • “The bus on route N at stop BUS_STOP did not open the door” (“Автoбус пo маршруту N, oстанoвка BUS_STOP, не oткрыл дверь”). True category: Transport problems; interpreted as Personnel problems, likely due to verb-driven phrasing.
These trends are further illustrated through visual analysis. As shown in Figure 11, a confusion heatmap of code-switched misclassifications reveals dominant error clusters between personnel problems and transport problems, as well as between equipment faults and transport problems.
In parallel, Figure 12 presents a Sankey diagram that exposes the model-wise contribution to these systematic errors. Notably, models such as GPT-3.5 and E5 zero-shot show a higher concentration of misclassifications in these ambiguous categories.
Many of these errors appear to be triggered by Kazakh verb forms (e.g., “кетті” [“departed”], “тoқтамады” [“did not stop”]) that encode intent or agentive action but are underrepresented or unanchored in model pretraining data, leading to semantic drift in multilingual inference.
To further examine the nature of these failures, Figure 13 compares the distribution of misclassification causes across model types. It reveals that embedding-based models were disproportionately prone to errors arising from code-switching and syntactic ellipsis, while LLMs more frequently exhibited surface lexical confusion. These findings suggest that embedding models struggle to resolve multilingual structural ambiguity, whereas LLMs are more vulnerable to misinterpreting semantically overloaded terms in the absence of explicit grounding.
In sum, code-switching amplifies semantic ambiguity and highlights limitations in models’ ability to assign roles across languages. Embedding-based methods were particularly sensitive to surface-level cues, while LLMs generalized broadly but frequently overlooked behavioral intent without explicit contextual grounding.
To mitigate these issues, future improvements should include the following:
  • Task-specific fine-tuning with annotated code-switched data,
  • Multilingual pretraining with better coverage of functional Kazakh expressions,
  • Architectures sensitive to discourse roles (actor, action, object), beyond flat semantic categorization.

4.2. Comparative Performance of Modeling Approaches

Figure 14 summarizes the overall accuracy of all tested models across classification paradigms. It highlights key performance differences between fine-tuned, few-shot, instruction-based, and zero-shot setups.

4.3. Discovery of Latent Categories Through Embedding-Space Clustering

One promising avenue for future research involves the discovery of latent complaint categories through unsupervised clustering of embedding representations. Given the semantic granularity of instruction-aligned embeddings, techniques such as HDBSCAN and UMAP-based projection have proven effective in identifying fine-grained groupings within complex datasets [24,25]. Applied to multilingual civic feedback, these methods may reveal emerging complaint types or context-dependent subcategories not captured by existing taxonomies. As prior work has shown, clustering dense semantic spaces can support adaptive classification in evolving public discourse environments [26]. Integrating such techniques with current zero-shot frameworks could reduce reliance on manual annotation and enable more flexible, data-driven taxonomy expansion.
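Purely as an illustration of this direction, the sketch below applies UMAP reduction followed by HDBSCAN clustering to complaint embeddings. The umap-learn and hdbscan packages, the parameter values, and the previously loaded embedding model and complaint_texts list are all assumptions rather than part of the evaluated pipelines.
import umap
import hdbscan

# complaint_texts: list[str] of unlabeled complaints; model: a loaded embedding model (assumed).
embeddings = model.encode(complaint_texts, normalize_embeddings=True)
reduced = umap.UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(embeddings)
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
cluster_ids = clusterer.fit_predict(reduced)  # -1 marks noise; other ids are candidate latent categories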

5. Application Case Study: Web-Based Complaint Classification System

To demonstrate practical applicability, we deployed a lightweight web-based system for real-time classification of transportation complaints. The service leverages a fine-tuned multilingual-E5 model and operates entirely on local infrastructure without reliance on external APIs or GPU resources. It accepts free-text input in Russian and Kazakh and outputs both the predicted category and associated similarity scores for all seven classes.
Classification is performed by embedding each input and computing cosine similarity to pre-defined category prototypes. A visual overview of the pipeline is shown in Figure 15. The system’s transparent output and minimal latency make it suitable for integration into municipal feedback channels.
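The outline below sketches what such a local endpoint could look like; FastAPI, the model path, and the category_centroids mapping (computed as in Section 2.2.4) are assumptions, and the deployed system's actual code and routes are not reproduced here.
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer, util

app = FastAPI()
model = SentenceTransformer("models/e5-finetuned")  # hypothetical path to the fine-tuned E5 model
# category_centroids: dict[str, np.ndarray] of class prototype embeddings (assumed precomputed).

class Complaint(BaseModel):
    text: str

@app.post("/classify")
def classify(complaint: Complaint):
    # Embed the complaint and score it against every category prototype.
    emb = model.encode(complaint.text, normalize_embeddings=True)
    scores = {cat: float(util.cos_sim(emb, centroid)) for cat, centroid in category_centroids.items()}
    best = max(scores, key=scores.get)
    return {"category": best, "scores": scores}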
In real-world usage, the model showed robust performance even on noisy or emotionally charged inputs. As illustrated in Figure 16, the classifier confidently identified a Kazakh-language complaint about wait times as “Transport Problems” and correctly labeled a sanitation-related message as “Equipment Faults,” despite informal phrasing.
This deployment demonstrates the feasibility of using fine-tuned multilingual embedding models in civic feedback systems under real-world constraints. Operating in a fully non-parametric inference regime, the model does not require retraining, parameter updates, or prompt engineering at deployment time. Its modular architecture further facilitates integration into e-government platforms and analytics dashboards for transport authorities. As such, it enables scalable and interpretable real-time classification of public feedback with minimal computational overhead.

6. Conclusions

This study examined the practical viability of embedding-based and instruction-tuned architectures for multilingual complaint classification in civic infrastructure.
Rather than reiterating accuracy benchmarks, the conclusions emphasize system-level implications: embedding models offer a non-parametric alternative that facilitates real-time classification without dependence on prompt engineering or frequent retraining. Moreover, the observed language sensitivity in instruction-based setups highlights the need for context-aware adaptation strategies when dealing with multilingual input streams.
Beyond current deployments, this research opens new directions for semantically grounded civic analytics. Future work may focus on extending embedding-based pipelines to support unsupervised category discovery, multilingual taxonomy refinement, and multimodal feedback integration. Collectively, these approaches point toward a scalable framework for responsive digital governance.

Author Contributions

Conceptualization, D.R. and S.B.; methodology, D.R. and S.B.; software, D.R.; validation, D.R. and S.B.; formal analysis, D.R.; investigation, D.R.; resources, D.Y.; data curation, D.Y.; writing—original draft preparation, D.R.; writing—review and editing, D.R. and S.B.; visualization, D.R.; supervision, S.B.; project administration, D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been funded by the Ministry of Science and Higher Education of the Republic of Kazakhstan, grant number BR24992852 «Intelligent models and methods of Smart City digital ecosystem for sustainable development and the citizens’ quality of life improvement».

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NLP: Natural Language Processing
LLM: Large Language Model
E5: intfloat/multilingual-e5-large-instruct
BGE-M3: BAAI/bge-m3
AI: Artificial Intelligence
UMAP: Uniform Manifold Approximation and Projection
HDBSCAN: Hierarchical Density-Based Spatial Clustering of Applications with Noise
F1-score: Harmonic Mean of Precision and Recall
CI: Confidence Interval

Appendix A. Instruction-Based Classification Reports for Public Transportation Complaints

Table A1. Classification report—BGE-M3 with Russian/English instructions.
Category | Precision (Russian) | Precision (English) | Recall (Russian) | Recall (English) | F1-Score (Russian) | F1-Score (English) | Support
Transport Problems | 0.81 | 0.77 | 0.97 | 0.98 | 0.88 | 0.86 | 304
Personnel problems | 0.76 | 0.81 | 0.78 | 0.73 | 0.77 | 0.77 | 151
Bus stop infrastructure | 1.00 | 1.00 | 0.59 | 0.56 | 0.74 | 0.71 | 27
Equipment faults | 1.00 | 1.00 | 0.14 | 0.11 | 0.25 | 0.19 | 28
Praise | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7
Organizational and technical problems | 0.97 | 0.97 | 0.81 | 0.79 | 0.89 | 0.87 | 43
Other complaints | 0.96 | 1.00 | 0.57 | 0.57 | 0.72 | 0.73 | 40
Accuracy | - | - | - | - | 0.82 | 0.81 | 600
Macro avg | 0.79 | 0.79 | 0.55 | 0.53 | 0.61 | 0.59 | 600
Weighted avg | 0.83 | 0.82 | 0.82 | 0.81 | 0.80 | 0.78 | 600
Table A2. Classification report—E5 with Russian/English instructions.
Category | Precision (Russian) | Precision (English) | Recall (Russian) | Recall (English) | F1-Score (Russian) | F1-Score (English) | Support
Transport Problems | 0.77 | 0.81 | 0.99 | 0.98 | 0.87 | 0.89 | 304
Personnel problems | 0.88 | 0.91 | 0.69 | 0.82 | 0.77 | 0.86 | 151
Bus stop infrastructure | 0.95 | 1.00 | 0.67 | 0.74 | 0.78 | 0.85 | 27
Equipment faults | 1.00 | 1.00 | 0.21 | 0.21 | 0.35 | 0.35 | 28
Praise | 1.00 | 1.00 | 0.43 | 0.86 | 0.60 | 0.92 | 7
Organizational and technical problems | 0.97 | 0.95 | 0.79 | 0.81 | 0.87 | 0.88 | 43
Other complaints | 0.89 | 0.93 | 0.62 | 0.65 | 0.74 | 0.76 | 40
Accuracy | - | - | - | - | 0.82 | 0.86 | 600
Macro avg | 0.92 | 0.94 | 0.63 | 0.73 | 0.71 | 0.79 | 600
Weighted avg | 0.84 | 0.87 | 0.82 | 0.86 | 0.81 | 0.85 | 600

Appendix B. Zero-Shot Classification Reports for Public Transportation Complaints

Table A3. Zero-shot classification performance across categories using BGE-M3 and E5 models.
Category | BGE-M3 Precision | BGE-M3 Recall | BGE-M3 F1-Score | E5 Precision | E5 Recall | E5 F1-Score | Support
Transport Problems | 0.94 | 0.80 | 0.87 | 0.94 | 0.96 | 0.95 | 294
Personnel problems | 0.78 | 0.86 | 0.82 | 0.90 | 0.87 | 0.89 | 157
Bus stop infrastructure | 0.85 | 0.88 | 0.86 | 0.87 | 0.80 | 0.83 | 25
Equipment faults | 0.42 | 0.75 | 0.54 | 0.54 | 0.93 | 0.68 | 28
Praise | 0.64 | 1.00 | 0.78 | 1.00 | 1.00 | 1.00 | 7
Organizational and technical problems | 0.92 | 0.72 | 0.80 | 0.92 | 0.78 | 0.85 | 46
Other complaints | 0.56 | 0.70 | 0.62 | 0.97 | 0.72 | 0.83 | 43
Accuracy | - | - | 0.81 | - | - | 0.90 | 600
Macro avg | 0.73 | 0.82 | 0.76 | 0.88 | 0.87 | 0.86 | 600
Weighted avg | 0.84 | 0.81 | 0.82 | 0.91 | 0.90 | 0.90 | 600

Appendix C. Fine-Tuned Classification Reports for Public Transportation Complaints

Table A4. Fine-tuned classification performance across categories using BGE-M3 and E5 models.
Category | BGE-M3 Precision | BGE-M3 Recall | BGE-M3 F1-Score | E5 Precision | E5 Recall | E5 F1-Score | Support
Transport Problems | 0.96 | 0.98 | 0.97 | 0.98 | 0.97 | 0.97 | 304
Personnel problems | 0.92 | 0.97 | 0.94 | 0.91 | 0.98 | 0.94 | 151
Bus stop infrastructure | 0.93 | 0.96 | 0.94 | 0.92 | 0.92 | 0.94 | 26
Equipment faults | 0.83 | 0.80 | 0.81 | 0.83 | 0.80 | 0.81 | 30
Praise | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 7
Organizational and technical problems | 0.88 | 0.70 | 0.78 | 0.92 | 0.80 | 0.78 | 43
Other complaints | 0.80 | 0.72 | 0.76 | 0.75 | 0.62 | 0.76 | 39
Accuracy | - | - | 0.93 | - | - | 0.93 | 600
Macro avg | 0.90 | 0.87 | 0.89 | 0.89 | 0.88 | 0.89 | 600
Weighted avg | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 600

Appendix D. Few-Shot Classification Reports for Public Transportation Complaints

Table A5. Few-shot classification performance across categories using Claude 3.7 Sonnet models.
Category | Precision | Recall | F1-Score | Support
Transport Problems | 0.92 | 0.97 | 0.95 | 303
Personnel problems | 0.87 | 0.86 | 0.87 | 152
Bus stop infrastructure | 0.82 | 0.88 | 0.85 | 26
Equipment faults | 0.79 | 0.73 | 0.76 | 30
Praise | 1.00 | 1.00 | 1.00 | 7
Organizational and technical problems | 0.83 | 0.79 | 0.81 | 43
Other complaints | 0.85 | 0.56 | 0.68 | 39
Accuracy | - | - | 0.90 | 600
Macro avg | 0.87 | 0.83 | 0.84 | 600
Weighted avg | 0.89 | 0.89 | 0.89 | 600
Table A6. Few-shot classification performance across categories using GPT-4o models.
Category | Precision | Recall | F1-Score | Support
Transport Problems | 0.92 | 0.97 | 0.95 | 303
Personnel problems | 0.87 | 0.86 | 0.87 | 152
Bus stop infrastructure | 0.82 | 0.88 | 0.85 | 26
Equipment faults | 0.79 | 0.73 | 0.76 | 30
Praise | 1.00 | 1.00 | 1.00 | 7
Organizational and technical problems | 0.83 | 0.79 | 0.81 | 43
Other complaints | 0.85 | 0.56 | 0.68 | 39
Accuracy | - | - | 0.89 | 600
Macro avg | 0.87 | 0.83 | 0.84 | 600
Weighted avg | 0.89 | 0.89 | 0.89 | 600
Table A7. Few-shot classification performance across categories using GPT-3.5-turbo-0125 models.
Category | Precision | Recall | F1-Score | Support
Transport Problems | 0.78 | 0.91 | 0.84 | 303
Personnel problems | 0.89 | 0.64 | 0.75 | 152
Bus stop infrastructure | 0.93 | 0.50 | 0.65 | 26
Equipment faults | 1.00 | 0.03 | 0.06 | 30
Praise | 1.00 | 0.86 | 0.92 | 7
Organizational and technical problems | 0.00 | 0.00 | 0.00 | 43
Other complaints | 0.80 | 0.21 | 0.33 | 39
Accuracy | - | - | 0.67 | 600
Macro avg | 0.77 | 0.45 | 0.51 | 600
Weighted avg | 0.77 | 0.67 | 0.68 | 600

References

  1. Yona, M.; Birfir, G.; Kaplan, S. Data science and GIS-based system analysis of transit passenger complaints to improve operations and planning. Transp. Policy 2021, 101, 133–144. [Google Scholar] [CrossRef]
  2. Çapalı, B.; Küçüksille, E.; Alagöz, N.K. A natural language processing framework for analyzing public transportation user satisfaction: A case study. J. Innov. Transp. 2023, 4, 17–24. [Google Scholar] [CrossRef]
  3. Liu, W.K.; Yen, C.C. Optimizing bus passenger complaint service through big data analysis: Systematized analysis for improved public sector management. Sustainability 2016, 8, 1319. [Google Scholar] [CrossRef]
  4. Shaikh, S.; Yayilgan, S.Y.; Abomhara, M.; Ghaleb, Y.; Garcia-Font, V.; Nordgård, D. Multilingual User Perceptions Analysis from Twitter Using Zero-Shot Learning for Border Control Technologies. Soc. Netw. Anal. Min. 2025, 15, 19. [Google Scholar] [CrossRef]
  5. Cai, H.; Dong, T.; Zhou, P.; Li, D.; Li, H. Leveraging Text Mining Techniques for Civil Aviation Service Improvement: Research on Key Topics and Association Rules of Passenger Complaints. Systems 2025, 13, 325. [Google Scholar] [CrossRef]
  6. Fu, X.; Brinkley, C.; Sanchez, T.W.; Li, C. Text Mining Public Feedback on Urban Densification Plan Change in Hamilton, New Zealand. Environ. Plan. B Urban Anal. City Sci. 2025, 52, 646–666. [Google Scholar] [CrossRef]
  7. Murad, S.A.; Dahal, A.; Rahimi, N. Multi-Lingual Cyber Threat Detection in Tweets/X Using ML, DL, and LLM: A Comparative Analysis. arXiv 2025, arXiv:2502.04346. [Google Scholar]
  8. Benlahbib, A.; Boumhidi, A.; Fahfouh, A.; Alami, H. Comparative Analysis of Traditional and Modern NLP Techniques on the CoLA Dataset: From POS Tagging to Large Language Models. IEEE Open J. Comput. Soc. 2025, 6, 248–260. [Google Scholar] [CrossRef]
  9. Wang, Y.; Zheng, X.; Zhang, J.; Ma, X.; Li, M.; Zhou, Y. Instruction Tuning for Large Language Models: A Survey. arXiv 2023, arXiv:2312.03857. [Google Scholar]
  10. Zeng, H.; Zhang, J.; He, Z. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity. arXiv 2024, arXiv:2402.03216. [Google Scholar]
  11. Roumeliotis, K.I.; Tselikas, N.D.; Nasiopoulos, D.K. Think Before You Classify: The Rise of Reasoning Large Language Models for Consumer Complaint Detection and Classification. Electronics 2025, 14, 1070. [Google Scholar] [CrossRef]
  12. Edwards, A.; Camacho-Collados, J. Language Models for Text Classification: Is In-Context Learning Enough? arXiv 2024, arXiv:2403.17661. [Google Scholar]
  13. Liu, F.; Li, Z.; Yin, Q.; Liu, F.; Huang, J.; Luo, J.; Thakur, A.; Branson, K.; Schwab, P.; Wu, X.; et al. A multimodal multidomain multilingual medical foundation model for zero shot clinical diagnosis. NPJ Digit. Med. 2025, 8, 86. [Google Scholar] [CrossRef]
  14. Airani, P.; Pipada, N.; Shah, P. Classification of Complaints Text Data by Ensembling Large Language Models. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025), Rome, Italy, 24–26 February 2025; SciTePress: Setúbal, Portugal, 2025; Volume 3, pp. 679–689, ISBN 978-989-758-737-5. [Google Scholar]
  15. Mei, P.; Zhang, F. Network Public Opinion Environment Shaping for Sustainable Cities by Large Language Model. SSRN Electron. J. 2024, 1–19. Available online: https://ssrn.com/abstract=5117779 (accessed on 22 July 2025).
  16. Pendyala, V.S.; Kamdar, K.; Mulchandani, K. Automated Research Review Support Using Machine Learning, Large Language Models, and Natural Language Processing. Electronics 2025, 14, 256. [Google Scholar] [CrossRef]
  17. Mondal, C.; Pham, D.-S.; Gupta, A.; Tan, T.; Gedeon, T. Leveraging Prompt Engineering with Lightweight Large Language Models to Label and Extract Clinical Information from Radiology Reports. In Proceedings of the ACM on Web Conference 2025 (WWW ’25), Sydney, NSW, Australia, 28 April–2 May 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 1616–1625. [Google Scholar]
  18. Oro, E.; Granata, F.M.; Ruffolo, M. A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian. Big Data Cogn. Comput. 2025, 9, 141. [Google Scholar] [CrossRef]
  19. Kokkodis, M.; Demsyn-Jones, R.; Raghavan, V. Beyond the Hype: Embeddings vs. Prompting for Multiclass Classification Tasks. arXiv 2025, arXiv:2504.04277. [Google Scholar]
  20. Al Faraby, S.; Romadhony, A. Analysis of LLMs for educational question classification and generation. Comput. Educ. Artif. Intell. 2024, 7, 100298. [Google Scholar] [CrossRef]
  21. Henríquez-Jara, B.; Arriagada, J.; Tirachini, A. Evidence of the Impact of Real-Time Information on Passenger Satisfaction across Varying Public Transport Quality in 13 Chilean Cities. SSRN Electron. J. 2024. [Google Scholar] [CrossRef]
  22. Wang, L.; Yang, N.; Huang, X.; Yang, L.; Majumder, R.; Wei, F. Multilingual E5 Text Embeddings: A Technical Report. arXiv 2024, arXiv:2402.05672. [Google Scholar]
  23. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Chapman and Hall/CRC: New York, NY, USA, 1993. [Google Scholar]
  24. Islam, K.M.S.; Karri, R.T.; Vegesna, S.; Wu, J.; Madiraju, P. Contextual Embedding-Based Clustering to Identify Topics for Healthcare Service Improvement. arXiv 2025, arXiv:2504.14068. [Google Scholar]
  25. Petukhova, A.; Matos-Carvalho, J.P.; Fachada, N. Text Clustering with Large Language Model Embeddings. Int. J. Cogn. Comput. Eng. 2025, 6, 100–108. [Google Scholar] [CrossRef]
  26. Vrabie, C. Improving Municipal Responsiveness through AI-Powered Image Analysis in E-Government. CEE e|Dem e|Gov Days 2024. pp. 1–27. Available online: https://www.ceeol.com/search/article-detail?id=1330925 (accessed on 22 July 2025).
Figure 1. Models under evaluation.
Figure 1. Models under evaluation.
Information 16 00644 g001
Figure 2. The workflow for few-shot classification using LLMs and prompt-based inputs.
Figure 2. The workflow for few-shot classification using LLMs and prompt-based inputs.
Information 16 00644 g002
Figure 3. The workflow for embedding-based zero-shot classification with cosine similarity.
Figure 3. The workflow for embedding-based zero-shot classification with cosine similarity.
Information 16 00644 g003
Figure 4. Workflow of fine-tuning pretrained embedding models (BGE-M3/E5) for multi-class complaint classification.
Figure 4. Workflow of fine-tuning pretrained embedding models (BGE-M3/E5) for multi-class complaint classification.
Information 16 00644 g004
Figure 5. Confusion matrices for the instruction-based E5 model: (a) English-language instructions; (b) Russian-language instructions.
Figure 5. Confusion matrices for the instruction-based E5 model: (a) English-language instructions; (b) Russian-language instructions.
Information 16 00644 g005
Figure 6. Confusion matrices for the instruction-based BGE-M3 model: (a) English-language instructions; (b) Russian-language instructions.
Figure 6. Confusion matrices for the instruction-based BGE-M3 model: (a) English-language instructions; (b) Russian-language instructions.
Information 16 00644 g006
Figure 7. Confusion matrices for label-only zero-shot classification: (a) E5 model; (b) BGE-M3 model.
Figure 8. Confusion matrices for complaint classification models: (a) confusion matrix of the fine-tuned E5 model; (b) confusion matrix of the fine-tuned BGE-M3 model.
Figure 9. Confusion matrices for LLM-based few-shot classification models: (a) GPT-3.5-turbo-0125; (b) GPT-4o; (c) Claude 3.7 Sonnet.
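Figures 5–9 report per-class confusion matrices for the instruction-based, zero-shot, fine-tuned, and few-shot configurations. For reference, such a matrix can be produced from the true and predicted category codes with scikit-learn as in the short sketch below; the variable contents are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

CLASS_NAMES = ["Transport", "Personnel", "Bus stop", "Equipment",
               "Praise", "Org./tech.", "Other"]

# y_true / y_pred would hold the category codes (0-6) for the 600 test complaints.
y_true = [0, 1, 3, 0, 4]   # placeholder values for illustration
y_pred = [0, 0, 3, 0, 4]

cm = confusion_matrix(y_true, y_pred, labels=range(7))
ConfusionMatrixDisplay(cm, display_labels=CLASS_NAMES).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```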
Figure 10. Classification error breakdown by model: monolingual vs. code-switching inputs.
Figure 11. Confusion heatmap of misclassified code-switched complaints.
Figure 12. Sankey diagram of model-wise contributions to code-switching misclassifications.
Figure 13. Distribution of misclassification causes in embedding-based models versus LLMs.
Figure 14. Comparative accuracy scores for fine-tuned, zero-shot, instruction-based, and few-shot classification models.
Figure 15. A flowchart summarizing the classification process in the web-based complaint categorization service using the fine-tuned E5 model.
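Figure 15 describes the deployed web service built around the fine-tuned E5 model. A minimal serving sketch is given below: a single endpoint tokenizes the complaint, runs the fine-tuned classifier, and returns the predicted category with a softmax confidence score, mirroring the outputs shown in Figure 16. The framework choice (FastAPI), route name, and model path are assumptions, not a description of the production system.

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CATEGORIES = ["Transport problems", "Personnel problems", "Bus stop infrastructure",
              "Equipment faults", "Praise", "Organizational and technical problems",
              "Other complaints"]

MODEL_DIR = "complaint-classifier"  # directory produced by fine-tuning (hypothetical path)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

app = FastAPI()

class Complaint(BaseModel):
    text: str

@app.post("/classify")
def classify(complaint: Complaint):
    inputs = tokenizer(complaint.text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    idx = int(probs.argmax())
    # Confidence is reported as a percentage, as in the interface screenshots of Figure 16.
    return {"category": CATEGORIES[idx], "confidence": round(float(probs[idx]) * 100, 2)}

# Run locally with: uvicorn service:app --reload
```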
Figure 16. Sample output screens from the web interface of the deployed classification system: (a) complaint in Kazakh classified as “Transport Problems” with 99.91% confidence (English translation: “Bus 41 did not arrive between 09:40 and 10:22. From Saryarka TC stop in the direction of Orda Bazar stop. Long waiting time for the route.”); (b) sanitation complaint classified as “Equipment Faults” with 87.8% confidence (English translation: “Good morning! Regarding bus cleaning. The buses on route 25 are poorly cleaned inside after shifts. The car wash staff are not doing their job properly. Does anyone check the bus interior after cleaning? In the morning, on the first run, only the floors are clean. What about the seat backs?”).
Table 1. Complaint categories with descriptions and representative examples (translated to English).

| Code | Category | Description | Example |
|---|---|---|---|
| 0 | Transport problems | Complaints related to transport delays, overcrowded buses, vehicle breakdowns, and insufficient service on routes. | “It’s 06:45, but route N has still not arrived…” |
| 1 | Personnel problems | Complaints concerning driver misconduct, missed stops, or inappropriate behavior of ticket inspectors. | “The inspector walked through the bus without checking tickets…” |
| 2 | Bus stop infrastructure | Complaints about the condition and equipment of bus stops, including absence or malfunction of heated shelters, damaged structures, and broken information displays. | “We request the installation of an information display at N bus stop…” |
| 3 | Equipment faults | Complaints regarding broken equipment inside buses, such as validators, air conditioning units, or heating systems. | “The onboard announcer is not working on this route…” |
| 4 | Praise | Expressions of gratitude or positive feedback related to the performance of the transport system or personnel. | “Thank you… for your professionalism…” |
| 5 | Organizational and technical problems | Complaints involving transport card top-ups, public information dissemination, route maps, and schedule updates. | “Please extend route No. N to Koktal Park…” |
| 6 | Other complaints | Complaints that either span multiple categories or do not fit into any of the predefined ones. | “The manhole cover is open across from the checkpoint…” |
Table 2. Code-switched complaints.

| Category | Original Complaint | Kazakh Segment | Russian Segment | English Translation |
|---|---|---|---|---|
| Transport problems | XX таңғы 7-де келмеді, хoтя расписание указанo | XX таңғы 7-де келмеді, | хoтя расписание указанo | Bus XX did not arrive at 7 a.m., even though it is shown in the timetable. |
| Personnel problems | Маршрут № XX гoс. нoмер XXX Умбетей жырау аялдамасында, ж/м Караoткель-2 аялдамысна қарай бағытта, автoбус тoқтаған жoқ, өтіп кетті. | Умбетей жырау аялдамасында, ж/м Караoткель-2 аялдамысна қарай бағытта, автoбус тoқтаған жoқ, өтіп кетті. | Маршрут № XX гoс. нoмер XXX | Bus route No. XX, state number XXX, did not stop at the Umbetey Zhyrau stop in the direction of Karaotkel-2. |
| Bus stop infrastructure | На oстанoвке Нажимиденoва в направления Нұрлы жoл не рабoтает инфoрмациoннoе таблo | Нұрлы жoл | На oстанoвке Нажимиденoва в направления … не рабoтает инфoрмациoннoе таблo | The information board at the Nazhimidenov stop in the direction of Nurly Zhol is not working. |
| Organizational and technical problems | 100 тг каспи. Перезалив жасау керек, чек бар Каспида. | 100 тг каспи. … жасау керек, чек бар Каспида. | Перезалив | 100 tenge Kaspi. Need to top up, there is a check in Kaspi. |
Table 3. Distribution of annotated complaints by category with a consistent 75/25 train–test split.

| No | Name of Category | Total | Training Dataset | Test Dataset |
|---|---|---|---|---|
| 1 | Transport Problems | 1214 | 910 | 304 |
| 2 | Personnel problems | 605 | 454 | 151 |
| 3 | Bus stop infrastructure | 104 | 78 | 26 |
| 4 | Equipment faults | 117 | 88 | 29 |
| 5 | Praise | 28 | 21 | 7 |
| 6 | Organizational and technical problems | 175 | 131 | 44 |
| 7 | Other complaints | 157 | 118 | 39 |
| | TOTAL | 2400 | 1800 | 600 |
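A class-proportional 75/25 partition such as the one reported in Table 3 can be obtained with a stratified split, as in the brief sketch below; the complaint lists are placeholders standing in for the 2400 annotated items.

```python
from sklearn.model_selection import train_test_split

# texts and labels would hold the 2400 annotated complaints and their category codes (0-6).
texts = ["Bus 12 has not arrived for 40 minutes.", "Thank you for the new buses!"] * 3
labels = [0, 4] * 3

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels,
    test_size=0.25,    # 75% training / 25% testing, as in Table 3
    stratify=labels,   # keep per-category proportions consistent across the split
    random_state=42,
)
```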
Table 4. Instruction-based classification accuracy across models and instruction languages.

| Model | Russian Instruction (%) | English Instruction (%) |
|---|---|---|
| BGE-M3 | 81.83 | 81.67 |
| E5 | 82.00 | 85.83 |
Table 5. Zero-shot classification accuracy for E5 and BGE-M3 models.

| Model | Accuracy (%) |
|---|---|
| BGE-M3 | 80.67 |
| E5 | 89.67 |
Table 6. Accuracy scores following supervised fine-tuning (75% training, 25% testing).

| Model | Accuracy (%) |
|---|---|
| BGE-M3 | 93.00 |
| E5 | 92.68 |
Table 7. Accuracy and estimated inference cost for LLMs using few-shot prompting.

| Model | Accuracy (%) | Cost (USD) |
|---|---|---|
| Claude 3.7 Sonnet | 90.00 | 4.41 |
| GPT-4o | 89.00 | 3.19 |
| GPT-3.5-turbo-0125 | 67.00 | 0.81 |
Table 8. Qualitative comparison of classification correctness for LLMs and embeddings.

| Outcome Type | Count | Complaint Excerpt | True Category | LLM Prediction | Embedding Prediction |
|---|---|---|---|---|---|
| Both correct | 192 | “Why does route N stop and go unpredictably?” | Transport Problems | Transport Problems | Transport Problems |
| Only LLM correct | 4 | “Route N, vehicle number NUMBER, failed to operate at 19:08” | Equipment faults | Equipment faults | Transport Problems |
| Only embedding correct | 13 | “I can’t register my child’s transport card…” | Other complaints | Organizational and technical problems | Other complaints |
| Both incorrect | 7 | “Bus N didn’t stop at bus stop…” | Personnel problems | Transport Problems | Transport Problems |