1. Introduction
Customer service systems process millions of queries daily, requiring accurate matching between user intentions and available services. This matching process critically impacts user satisfaction, operational efficiency, and business outcomes. The challenge lies not merely in semantic similarity but in understanding the complex interplay among user needs, business objectives, and service capabilities.
Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in understanding and reasoning about natural language. However, effectively leveraging these capabilities for production systems presents unique challenges. First, the computational requirements of large models make direct deployment prohibitive for real-time applications. Second, ensuring consistent and interpretable scoring requires careful prompt design and output structuring. Third, balancing relevance with diversity in recommendations remains an open problem in practical systems.
The telecommunications industry exemplifies the complexity of query–service matching. Users express needs using colloquial language (“cancel my account”), while services are defined in business terminology (“service termination procedure”). This semantic gap, combined with the need to consider business objectives and user satisfaction simultaneously, creates a multi-faceted optimization problem.
Traditional approaches face several limitations. Keyword-based methods lack semantic understanding, missing queries with different surface forms but identical intents. Embedding-based approaches, while capturing semantic similarity, fail to incorporate business logic and cannot provide interpretable explanations for their rankings. Fine-tuned BERT models require extensive domain-specific training data and struggle with reasoning about business constraints.
Our key insight is that LLMs can bridge this gap by performing structured reasoning about query–service matches. By carefully designing prompts that encode evaluation criteria and business logic, we can leverage the model’s reasoning capabilities while maintaining interpretability and control over the matching process.
We present a comprehensive framework for LLM-based query–service matching with three key innovations:
Reasoning-First Scoring Architecture: We restructure the traditional scoring approach by requiring models to generate explanatory reasoning before numerical scores. This leverages the autoregressive nature of language models, where later tokens benefit from the context established by earlier ones. Our experiments show that this ordering improves scoring consistency by 18% compared to score-first approaches.
Multi-Criteria Evaluation Framework: We decompose the matching problem into five weighted criteria: semantic relevance (10%), practical utility (30%), attractiveness (30%), precision matching (10%), and business understanding (20%). Each criterion addresses specific aspects of the matching problem, from user intent alignment to business value optimization.
Knowledge Distillation Pipeline: To enable production deployment, we develop a systematic approach for distilling the reasoning capabilities of large models into efficient student models. Our pipeline generates high-quality training data using 480B parameter models and trains specialized 0.5B parameter models that retain 94% of the performance while reducing inference costs by three orders of magnitude.
This paper makes the following contributions:
1. Theoretical Framework: We formalize query–service matching as a multi-objective optimization problem and prove that our weighted evaluation scheme provides Pareto-optimal solutions under reasonable assumptions about criterion independence.
2. Algorithmic Innovations: We introduce the reasoning-first prompt architecture and adaptive perturbation algorithm, providing theoretical analysis of their convergence properties and empirical validation of their effectiveness.
3. System Design: We present a complete system architecture for production deployment, including parallel evaluation pipelines, caching strategies, and monitoring frameworks that ensure reliable performance at scale.
4. Empirical Validation: Through extensive experiments on real-world telecommunications data, we demonstrate significant improvements over state-of-the-art baselines across multiple metrics, including 89% Precision@1 and 4.3/5.0 user satisfaction scores.
5. Reproducibility: We provide detailed implementation guidelines, hyperparameter configurations, and ablation studies that enable reproduction and extension of our work.
The remainder of this paper is organized as follows. Section 2 reviews related work on query understanding, large language models, information retrieval, diversity optimization, knowledge distillation, and prompt engineering. Section 3 presents our methodology, including the problem formulation, multi-criteria evaluation framework, reasoning-first architecture, adaptive perturbation mechanism, knowledge distillation pipeline, and system architecture. Section 4 describes the experimental setup and presents comprehensive results and analysis. Appendix A provides the theoretical analysis with formal proofs. Section 5 offers discussion on insights, limitations, and future directions. Finally, Section 6 concludes the paper.
4. Experiments and Results
4.1. Dataset
We conduct our evaluation on a comprehensive telecommunications service dataset that reflects the complexity and diversity of real-world customer service scenarios. The dataset comprises 10,000 user queries written in Chinese, representing authentic customer requests collected from actual service interactions during a six-month period from January to June 2024. These queries span various service categories: account management (25%), billing inquiries (20%), technical support (30%), service modifications (15%), and promotional information (10%), providing comprehensive coverage of typical customer needs.
Table 1 provides a structured summary of the dataset characteristics, and Table 2 shows representative query–service pair examples.
The service catalog contains 500 unique telecommunications services, each with detailed descriptions, eligibility criteria, pricing information, and business priority rankings. Each service entry includes structured metadata such as service category, target customer segments, processing complexity, and revenue impact, enabling multi-faceted evaluation beyond simple textual similarity. Human annotators, all domain experts with telecommunications industry experience, provided relevance scores on a 0–5 scale for query–service pairs, with inter-annotator agreement measured at Cohen’s κ = 0.78, indicating substantial agreement.
Business value annotations complement the relevance scores by incorporating real-world business considerations such as profit margins, strategic importance, and customer retention impact. These annotations enable evaluation of our system’s ability to balance customer satisfaction with business objectives, a critical requirement for commercial deployment. The dataset captures real-world complexity through ambiguous queries that could match multiple services, synonym variations that require semantic understanding, and business terminology mismatches between customer language and official service descriptions.
4.2. Baseline Methods
Our experimental comparison includes four carefully selected baseline approaches that represent different paradigms in query–service matching. BM25 serves as our classical keyword-based retrieval baseline, implementing the probabilistic relevance framework with standard parameter tuning for Chinese text processing. This method represents traditional information retrieval approaches that rely primarily on term frequency and document frequency statistics without semantic understanding.
Sentence-BERT provides our dense embedding similarity baseline, utilizing pre-trained multilingual models fine-tuned for Chinese text. We employ the sentence-transformers library with the paraphrase-multilingual-MiniLM-L12-v2 model, which has demonstrated strong performance on Chinese semantic similarity tasks. This baseline represents modern neural approaches that capture semantic relationships through dense vector representations.
Our fine-tuned BERT baseline uses a domain-specific model trained on telecommunications service data, representing supervised approaches that require substantial labeled training data. Specifically, we fine-tune the Chinese-BERT-wwm-ext model (110M parameters) on our domain-specific dataset using a mean squared error (MSE) regression loss between predicted scores and expert annotations, providing a strong neural baseline that incorporates domain knowledge through supervised learning. The training configuration uses a learning rate of 2 × 10⁻⁵, a batch size of 32, and a maximum sequence length of 512, with early stopping based on validation performance. The model converges within 5 epochs, achieving Precision@1 = 0.63 on the held-out test set. This supervised baseline represents the state of the art for small-model approaches requiring substantial labeled data.
GPT-3.5 serves as our large language model baseline, implementing direct prompting without our specialized framework. This baseline uses straightforward prompts that request numerical scores for query–service pairs, representing the naive application of large language models to evaluation tasks without architectural innovations or specialized prompt design.
4.3. Evaluation Metrics
Our evaluation employs a comprehensive suite of metrics designed to capture multiple dimensions of system performance.
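Relevance Metrics: We use standard information retrieval metrics including Precision@k, Recall@k, and normalized Discounted Cumulative Gain (nDCG@k) for $k \in \{1, 3, 5\}$. Specifically, Precision@k measures the proportion of relevant services among the top-k ranked results: $\mathrm{Precision@}k = |R \cap T_k| / k$, where $R$ is the set of relevant services for the query and $T_k$ is the set of top-k ranked services. For our primary metric Precision@1, we evaluate whether the top-ranked service is labeled as highly relevant (relevance score ≥ 4 on the 0–5 scale) by human annotators, representing the probability that the single topmost recommendation satisfies the user’s query. Recall@k measures the proportion of all relevant services retrieved within top-k: $\mathrm{Recall@}k = |R \cap T_k| / |R|$. nDCG@k accounts for ranking position with discounted gains: $\mathrm{nDCG@}k = \mathrm{DCG@}k / \mathrm{IDCG@}k$, where $\mathrm{DCG@}k = \sum_{i=1}^{k} (2^{rel_i} - 1) / \log_2(i + 1)$, $rel_i$ is the annotated relevance of the service at rank $i$, and $\mathrm{IDCG@}k$ is the ideal DCG obtained by perfect ranking.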
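The relevance metrics described above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the paper: the function names, the service identifiers, and the example relevance scores are hypothetical, and the exponential gain used in DCG is one common convention that is assumed here.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked services that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for s in top_k if s in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant services that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    top_k = ranked_ids[:k]
    return sum(1 for s in top_k if s in relevant_ids) / len(relevant_ids)

def dcg_at_k(gains, k):
    """DCG with exponential gain (assumed variant); rank i is 1-based."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k):
    """DCG of the given order divided by DCG of the ideal (sorted) order."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical annotations on the 0-5 scale for one query.
scores = {"svc_a": 5, "svc_b": 2, "svc_c": 4, "svc_d": 0}
ranking = ["svc_a", "svc_b", "svc_c", "svc_d"]  # system output order

# "Highly relevant" means an annotated score of 4 or higher.
relevant = {s for s, r in scores.items() if r >= 4}
gains = [scores[s] for s in ranking]

p1 = precision_at_k(ranking, relevant, 1)
p3 = precision_at_k(ranking, relevant, 3)
r3 = recall_at_k(ranking, relevant, 3)
n3 = ndcg_at_k(gains, 3)
print(p1, p3, r3, n3)
```

In this toy ranking the top result is highly relevant, so Precision@1 is 1.0, while the moderately relevant service at rank 2 lowers both Precision@3 and nDCG@3.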
Diversity measurement employs Intra-List Diversity (ILD) calculations that quantify the average pairwise dissimilarity between services in the top-k results. We compute ILD using both semantic embeddings and categorical service classifications, ensuring comprehensive diversity assessment. Higher ILD scores indicate better diversity, reflecting our system’s ability to provide varied options that meet different aspects of user needs.
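As an illustration of the ILD computation, the sketch below averages pairwise cosine distance over the embeddings of the top-k services. The choice of cosine distance and the toy two-dimensional vectors are assumptions; the text does not specify the dissimilarity function, and the categorical variant is not shown.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def intra_list_diversity(embeddings):
    """Average pairwise dissimilarity over all unordered pairs in the list."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0  # a list with fewer than two items has no pairs
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Hypothetical embeddings: two identical services and one orthogonal one.
top_k_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
ild = intra_list_diversity(top_k_embeddings)
print(round(ild, 4))
```

Two of the three pairs are fully dissimilar and one pair is identical, so the average comes out to two thirds.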
Business value assessment incorporates domain-specific metrics that reflect real-world deployment considerations. Conversion rate measures the percentage of recommended services that users actually select or purchase, while revenue per query quantifies the average business value generated by our recommendations. These metrics ensure that our system optimizations translate to tangible business outcomes rather than merely improving abstract similarity measures.
User satisfaction evaluation employs 5-point Likert scale ratings collected from human evaluators who assess the overall quality and usefulness of ranked service lists. Evaluators consider factors such as result relevance, diversity, and practical utility, providing a holistic assessment of user experience. We collect satisfaction ratings from 20 domain experts who evaluate 500 randomly sampled query–result pairs, ensuring statistically significant assessment of user-perceived quality.
4.4. Implementation Details
Our experimental implementation utilizes carefully optimized configurations designed to balance performance, efficiency, and reproducibility.
Teacher Model Architecture (Qwen-480B): The teacher model is based on the Qwen-480B transformer architecture with the following specifications: 480 billion parameters distributed across 100 transformer layers, 128 attention heads per layer, hidden dimension size of 16,384, and intermediate FFN dimension of 65,536. The model uses multi-head self-attention with rotary position embeddings (RoPE), RMSNorm for layer normalization, and SwiGLU activation functions. Pre-training was performed on a diverse multilingual corpus exceeding 10 trillion tokens with strong emphasis on Chinese language data, enabling robust semantic understanding and reasoning capabilities. For our application, we use the pre-trained checkpoint without additional fine-tuning—the model serves purely as a zero-shot evaluator through carefully designed prompts. Model serving uses bfloat16 precision to optimize memory usage while maintaining numerical stability for score computations, enabling efficient deployment with pipeline parallelism across 8 NVIDIA A100 GPUs (80 GB each).
Student Model Architecture (Qwen-0.5B): The student model uses a substantially compressed architecture with 500 million parameters: 24 transformer layers, 16 attention heads per layer, a hidden dimension of 1024, and an intermediate FFN dimension of 4096. This 960× parameter reduction (from 480B to 0.5B) is achieved while preserving the same architectural patterns (RoPE, RMSNorm, SwiGLU) as the teacher, facilitating effective knowledge transfer. The student model is initialized from the pre-trained Qwen-0.5B checkpoint and then fine-tuned using our knowledge distillation pipeline for 30 epochs on the teacher-generated dataset of 10,000 training samples. Training uses the AdamW optimizer with a learning rate of 5 × 10⁻⁵, linear warmup over 10% of steps, a cosine decay schedule, batch size 32, gradient accumulation steps 4, and a maximum sequence length of 512 tokens. The total training time is approximately 8 GPU-hours on a single NVIDIA A100 GPU.
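The learning-rate schedule described above (linear warmup over 10% of steps followed by cosine decay) can be sketched as a plain function. This is an illustrative sketch, not the training code: the function name is hypothetical, and decaying all the way to zero is an assumption, since the minimum learning rate is not stated. The effective batch size also assumes that the batch size of 32 is the per-step micro-batch on a single GPU.

```python
import math

def learning_rate(step, total_steps, peak_lr=5e-5, warmup_fraction=0.10):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Batch size 32 with 4 gradient-accumulation steps per optimizer update.
effective_batch_size = 32 * 4

total = 1000  # hypothetical number of optimizer steps
for s in (0, 99, 100, 550, 1000):
    print(s, f"{learning_rate(s, total):.2e}")
```

Under these assumptions each optimizer update sees 128 sequences, the rate peaks at the end of warmup, and it falls to half the peak midway through the decay phase.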
Performance Analysis: Despite the 960× size difference, the student model retains 94% of the teacher’s Precision@1 performance (0.84 vs 0.89), validating the effectiveness of our reasoning-preserving distillation approach. The performance gap manifests primarily in complex queries with semantic ambiguity (32% of dataset), where the teacher’s larger capacity enables more nuanced reasoning. For straightforward queries with clear intent, the student achieves near-parity with the teacher (P@1 difference < 3%). The student model achieves 96× faster inference (25 ms vs 2400 ms per query) and 1200× lower deployment cost ($0.001 vs $1.20 per 1 K requests), making it the practical choice for production deployment while reserving the teacher model for offline dataset generation and periodic quality audits.
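The ratios quoted in this paragraph follow directly from the reported figures, as the short check below shows. The dictionary layout and variable names are illustrative only; the numbers are those given in the text.

```python
# Figures as reported in the text for the teacher and student models.
teacher = {"p_at_1": 0.89, "latency_ms": 2400, "cost_per_1k_usd": 1.20, "params_b": 480}
student = {"p_at_1": 0.84, "latency_ms": 25, "cost_per_1k_usd": 0.001, "params_b": 0.5}

retention = student["p_at_1"] / teacher["p_at_1"]
speedup = teacher["latency_ms"] / student["latency_ms"]
cost_ratio = teacher["cost_per_1k_usd"] / student["cost_per_1k_usd"]
size_ratio = teacher["params_b"] / student["params_b"]

print(f"P@1 retention: {retention:.1%}")       # 94.4%
print(f"Inference speedup: {speedup:.0f}x")    # 96x
print(f"Cost reduction: {cost_ratio:.0f}x")    # 1200x
print(f"Parameter reduction: {size_ratio:.0f}x")  # 960x
```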
Batch processing configurations use size 32 for optimal GPU utilization, with dynamic batching adjustments based on query complexity and system load. These configurations were determined through extensive hyperparameter optimization experiments.
The perturbation range is set to 0.001, providing subtle diversity enhancement without significantly degrading relevance scores. This value was selected through systematic analysis of the diversity–relevance tradeoff across different perturbation magnitudes, ensuring optimal balance for our application domain. The criteria weights are set to 0.1, 0.3, 0.3, 0.1, and 0.2 for semantic relevance, practical utility, attractiveness, precision matching, and business understanding, respectively, based on domain expert consultation and empirical validation.
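To make the weighting and perturbation settings concrete, the sketch below combines five per-criterion scores with the stated weights and then adds a small random offset. Several details are assumptions: per-criterion scores are taken to lie in [0, 1], the perturbation is modeled as additive uniform noise in [-0.001, +0.001], and the function names and example values are hypothetical. The adaptive perturbation algorithm itself is not reproduced here.

```python
import random

# Criteria weights as stated in the text; they sum to 1.
WEIGHTS = {
    "semantic_relevance": 0.1,
    "practical_utility": 0.3,
    "attractiveness": 0.3,
    "precision_matching": 0.1,
    "business_understanding": 0.2,
}

def weighted_score(criteria):
    """Weighted sum of per-criterion scores (assumed to lie in [0, 1])."""
    return sum(WEIGHTS[name] * criteria[name] for name in WEIGHTS)

def perturbed_score(score, rng, epsilon=0.001):
    """Assumed form: additive uniform noise in [-epsilon, +epsilon]."""
    return score + rng.uniform(-epsilon, epsilon)

# Hypothetical per-criterion scores for one query-service pair.
criteria = {
    "semantic_relevance": 0.9,
    "practical_utility": 0.8,
    "attractiveness": 0.6,
    "precision_matching": 0.7,
    "business_understanding": 0.5,
}

rng = random.Random(42)  # fixed seed for reproducibility
base = weighted_score(criteria)
noisy = perturbed_score(base, rng)
print(round(base, 4), abs(noisy - base) <= 0.001)
```

With noise this small, only services whose weighted scores are nearly tied can swap positions, which is consistent with the intent of adding diversity without materially changing relevance.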
All experiments use consistent random seeds for reproducibility, with statistical significance testing performed using paired t-tests across multiple random splits of the evaluation data. Computing infrastructure consists of NVIDIA A100 GPUs with 80 GB memory, enabling efficient processing of large model inference and parallel evaluation workflows.
4.5. Reproducibility and Practical Considerations
4.5.1. Teacher Model and Knowledge Distillation Strategy
We acknowledge that the direct training and deployment of a 480B-parameter model (Qwen-480B) is not practically reproducible for most research groups due to computational resource constraints. Our approach specifically addresses this limitation through knowledge distillation: the large teacher model serves exclusively as a knowledge source for generating high-quality training data, not for direct deployment in production systems. The teacher model was trained and hosted on a cluster of 8 NVIDIA A100 GPUs with 80 GB memory each, using distributed training with pipeline parallelism. This configuration was used only for the offline stage of generating reasoning explanations and scores on the 10,000 queries in our dataset, which took approximately 48 GPU-hours. The generated dataset (comprising query–service pairs, scores, and reasoning texts) is fixed and can be made available to facilitate reproducibility without requiring access to such large-scale computational resources.
The core contribution lies in the student model (Qwen-0.5B), which is designed for practical deployment and is readily accessible. This 0.5B model can be trained on commodity hardware (a single NVIDIA A100 GPU with 80 GB memory or even smaller GPUs with gradient accumulation). Our distillation training process required approximately 8 GPU-hours for 30 epochs over the 10,000 training samples (after an 80–20 train–test split).
4.5.2. Dataset Collection and Annotation Details
To strengthen the reproducibility and transparency of our experimental setup, we provide detailed dataset specifications:
Query Collection: The 10,000 user queries were collected from authentic customer service interactions spanning a six-month period (January to June 2024) from a major Chinese telecommunications provider. Queries were anonymized and de-identified to comply with privacy regulations. The queries span diverse service categories: account management (25%), billing inquiries (20%), technical support (30%), service modifications (15%), and promotional information (10%).
Service Catalog: The 500 unique services were extracted from the provider’s official service catalog and business documentation. Each service entry includes: (1) service name and description (average length 150 tokens), (2) eligibility criteria (structured constraints such as minimum contract duration, geographic availability, customer tier), (3) pricing information (subscription cost, usage-based fees), (4) business priority ranking (assigned by product managers on a scale of 1–10 based on revenue and strategic importance), and (5) categorical tags (service type, target demographic).
Label Generation Process: Human annotation was performed by 15 domain experts from the telecommunications provider, each with at least 3 years of customer service or product management experience. For each of the 5000 query–service pairs selected for annotation (a stratified sample ensuring balanced representation across service categories), annotators provided:
Relevance Score: On a 0–5 scale, where 0 = completely irrelevant, 3 = moderately relevant, 5 = perfectly relevant. The average inter-annotator agreement (Cohen’s kappa) was 0.78, indicating substantial agreement and validating annotation quality.
Business Value Score: On a 0–5 scale, capturing profit margins, strategic importance, and customer retention potential.
Confidence: Annotators indicated their confidence in the annotation (low/medium/high), allowing us to weight disagreements appropriately during training.
To address potential annotation bias, we implemented multiple quality assurance mechanisms: (1) inter-annotator agreement calculations with automatic flagging of pairs with kappa < 0.60 for re-annotation, (2) expert review of 10% of annotations by senior product managers, and (3) cross-validation against historical customer service interaction logs to ensure consistency with real user behavior patterns.
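The agreement check in mechanism (1) can be illustrated with a small Cohen's kappa computation for one pair of annotators. This sketch uses unweighted kappa on made-up labels; whether the study used a weighted variant for the ordinal 0–5 scale, and how agreement was averaged across the 15 annotators, is not stated, so those choices are assumptions.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)

# Made-up relevance labels (0-5) from two annotators on eight pairs.
annotator_a = [5, 4, 4, 3, 0, 5, 2, 4]
annotator_b = [5, 4, 3, 3, 0, 5, 2, 5]

kappa = cohens_kappa(annotator_a, annotator_b)
needs_reannotation = kappa < 0.60  # flagging threshold from the text
print(round(kappa, 3), needs_reannotation)
```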
Noise Characteristics and Dataset Complexity: The dataset intentionally captures real-world complexity through three dimensions of challenge:
Ambiguous Queries (32% of dataset): Queries that could legitimately match multiple services, such as “I want to upgrade my service”, which could refer to device upgrades, data plan increases, or service tier enhancements.
Synonym Variations (28% of dataset): Queries expressing the same intent through different terminology, such as several distinct phrasings that all mean account termination but carry subtle semantic differences.
Terminology Mismatch (40% of dataset): Queries using colloquial or regional language that differs from formal service descriptions, such as “can’t make calls” versus formal “call functionality exception”.
4.5.3. Hardware and System Configuration
Our experimental infrastructure utilizes NVIDIA A100 GPUs with 80 GB memory, enabling the efficient processing of large model inference and parallel evaluation workflows. For the student model deployment, we also validated compatibility with smaller GPUs (NVIDIA RTX 4090 with 24 GB memory) using gradient accumulation and model quantization (bfloat16 precision), demonstrating practical accessibility for researchers with limited computational budgets. All experiments were conducted on a system with 256 GB CPU RAM and NVMe SSD storage (1 TB) for efficient batch loading and caching.
4.5.4. Deployment Latency Correction
We clarify a previous communication discrepancy: the student model achieves 25 ms latency per query (not 15 ms) on a single NVIDIA A100 GPU with batch size 32. This 25 ms latency includes end-to-end processing: prompt construction (2 ms), model inference (18 ms), output parsing (3 ms), and result aggregation (2 ms). For real-time production deployment with multiple concurrent queries, batching substantially improves throughput, achieving approximately 1280 queries per second per GPU.
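The latency breakdown and the throughput figure can be reconciled with simple arithmetic. The check below assumes that a full batch of 32 queries completes within the 25 ms end-to-end time, which is one reading of the text; the variable names are illustrative.

```python
# Per-stage latency in milliseconds, as reported in the text.
stage_ms = {
    "prompt_construction": 2,
    "model_inference": 18,
    "output_parsing": 3,
    "result_aggregation": 2,
}

total_ms = sum(stage_ms.values())
batch_size = 32

# Assumption: one batch of 32 queries finishes in total_ms milliseconds.
throughput_qps = batch_size * 1000 / total_ms

print(total_ms)        # 25
print(throughput_qps)  # 1280.0
```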
4.6. Main Results
Table 3 presents the comprehensive evaluation results across all baseline methods and our proposed approach, demonstrating substantial improvements in multiple performance dimensions. Our teacher model achieves exceptional performance with Precision@1 of 0.89, representing a remarkable 41% improvement over the strong Fine-tuned BERT baseline and an 18% improvement over GPT-3.5’s direct prompting approach. The Precision@3 metric shows consistent superiority, with 0.83 for our teacher model compared to 0.71 for GPT-3.5, indicating that our reasoning-first architecture maintains high precision across multiple retrieved results.
Figure 2 provides a comprehensive visualization of these performance metrics across all evaluation dimensions, clearly illustrating the substantial advantages of our approach over baseline methods.
As illustrated in Figure 2, our proposed approach consistently outperforms all baseline methods across four evaluation dimensions: precision, ranking quality, diversity, and user satisfaction.
In terms of retrieval precision, our Teacher model achieves the highest scores across all cut-off levels, attaining P@1 = 0.89, P@3 = 0.83, and P@5 = 0.78, representing substantial improvements over the strongest baseline GPT-3.5 (P@1 = 0.75, P@3 = 0.71, P@5 = 0.68). Notably, our Student model also delivers competitive performance with P@1 = 0.84, P@3 = 0.79, and P@5 = 0.75, surpassing GPT-3.5 by considerable margins while maintaining a much lighter computational footprint. Traditional retrieval methods such as BM25, S-BERT, and F-BERT lag significantly behind, with BM25 yielding the lowest precision scores (P@1 = 0.42, P@3 = 0.38, P@5 = 0.33).
A similar trend is observed for nDCG@3, where our Teacher model reaches 0.85 and the Student model achieves 0.81, both substantially exceeding GPT-3.5 (0.73). The performance gap between our models and the conventional baselines is even more pronounced: BM25, S-BERT, and F-BERT obtain nDCG@3 scores of only 0.39, 0.54, and 0.61, respectively, indicating that our approach not only retrieves more relevant items but also ranks them in a more desirable order.
Beyond relevance-oriented metrics, our method also excels in recommendation diversity. The Teacher model achieves an intra-list diversity score of 0.68, and the Student model reaches 0.69—both meaningfully higher than GPT-3.5 (0.61) and the remaining baselines (BM25: 0.48, S-BERT: 0.52, F-BERT: 0.55). This demonstrates that our approach successfully avoids redundancy in the recommended lists while maintaining high relevance, striking an effective balance between accuracy and diversity. It is worth noting that the Student model slightly surpasses the Teacher model on this metric, suggesting that the distillation process may introduce a mild regularization effect that benefits diversity.
Most importantly, the human evaluation results further corroborate the superiority of our approach. Our Teacher model receives an average user satisfaction rating of 4.3 out of 5, and the Student model achieves 4.2, both comfortably surpassing the “Good” threshold (4.0) and significantly outperforming GPT-3.5 (3.8), which only marginally exceeds the “Good” line. In contrast, BM25 (2.1), S-BERT (2.7), and F-BERT (3.0) all fall at or below the “Neutral” threshold (3.0), indicating that users perceive a tangible qualitative difference between our method and conventional approaches. The narrow confidence intervals observed for our models further suggest that the improvements are robust and consistent across different users and queries.
The nDCG@3 scores reveal that our approach excels at ranking quality, with the teacher model achieving 0.85 compared to traditional methods that struggle to exceed 0.61. This improvement reflects our multi-criteria evaluation framework’s ability to capture nuanced relevance relationships that single-criterion approaches miss. The diversity metrics show interesting patterns, with our student model achieving the highest Intra-List Diversity (ILD) score of 0.69, slightly outperforming the teacher model at 0.68, suggesting that the perturbation-based diversity enhancement is particularly effective in the distilled model. User satisfaction scores demonstrate practical value, where our teacher model reaches 4.3 on a 5-point scale, representing a 43% improvement over Fine-tuned BERT.
4.7. Ablation Study
Figure 3 illustrates the systematic contribution analysis of each architectural component to overall system performance. Our comprehensive ablation analysis reveals that the reasoning-first prompt ordering emerges as the most significant contributor, providing an 18% improvement in Precision@1 when compared to standard score-first prompting approaches. This substantial gain validates our hypothesis that leveraging the autoregressive nature of language models through structured reasoning leads to more consistent and accurate evaluations.
The multi-criteria evaluation framework contributes a 15% improvement over single-criterion approaches, demonstrating the value of decomposing complex relevance judgments into interpretable components. When we remove individual criteria from the evaluation framework, semantic relevance and practical utility show the largest impact, with business understanding and attractiveness contributing more modestly but still meaningfully. Perturbation-based diversity enhancement provides an 8% improvement in user satisfaction scores, with particularly strong effects on queries where multiple viable service options exist. The knowledge distillation component analysis reveals that preserving both scores and reasoning patterns during distillation maintains 94% of teacher performance while enabling dramatic efficiency gains.
Ablation Study Methodology: To rigorously evaluate the contribution of each architectural component, we conduct systematic ablation experiments in which individual components are removed while all other elements are held constant. All ablation studies evaluate the student model (Qwen-0.5B) to ensure fair comparison and practical relevance to production deployment, as the student model represents our deployable solution. Each variant is trained and evaluated using identical procedures: training for 30 epochs with the same hyperparameters (learning rate 5 × 10⁻⁵, batch size 32), evaluation on the same held-out test set of 2000 query–service pairs, and averaging metrics across three random seeds to ensure statistical reliability.
The ablation variants are constructed as follows:
w/o Reasoning: Replaces the reasoning-first prompting architecture with score-first prompting. Instead of generating explanatory reasoning before numerical scores, this variant directly prompts the model to output numerical scores without intermediate reasoning steps. The prompt template is modified to request only the numerical score, without the reasoning field, testing whether the reasoning generation step provides information-theoretic benefits beyond direct scoring.
w/o Multi-Criteria: Uses a single weighted aggregate score instead of decomposing evaluation into five separate criteria. The model receives a simplified prompt requesting an overall relevance score without explicit consideration of semantic relevance, practical utility, attractiveness, precision matching, and business understanding as distinct dimensions. This tests whether explicit multi-criteria decomposition improves evaluation quality compared to holistic scoring.
w/o Perturbation: Disables diversity enhancement by setting the perturbation parameter to 0. Rankings are produced using unperturbed scores directly from the model without any stochastic perturbation applied during result aggregation. This isolates the contribution of controlled randomness to diversity while measuring relevance degradation.
w/o Distillation: Trains the student model (Qwen-0.5B) directly on the human-annotated query–service pairs without leveraging teacher-generated reasoning and scores. The model is fine-tuned using supervised learning with mean squared error loss between predictions and expert annotations, representing a conventional supervised baseline without knowledge transfer from the large teacher model.
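The contrast tested by the w/o Reasoning variant above can be pictured with two output templates: one that asks for a reasoning field before the score, and one that asks for the score alone. The sketch below is a hypothetical illustration; the prompt wording, the JSON field names, and the helper functions are assumptions and are not the templates used in the paper.

```python
import json

# Hypothetical output formats; field names are illustrative assumptions.
REASONING_FIRST_FORMAT = '{"reasoning": "<explanation>", "score": <number>}'
SCORE_ONLY_FORMAT = '{"score": <number>}'

def build_prompt(query, service, reasoning_first=True):
    """Assemble a scoring prompt that fixes the order of output fields."""
    output_format = REASONING_FIRST_FORMAT if reasoning_first else SCORE_ONLY_FORMAT
    return (
        f"User query: {query}\n"
        f"Candidate service: {service}\n"
        f"Respond with JSON in exactly this form: {output_format}"
    )

def parse_response(raw, reasoning_first=True):
    """Parse the model output and check that reasoning precedes the score."""
    data = json.loads(raw)  # dict preserves the key order of the JSON text
    if reasoning_first and list(data.keys())[:2] != ["reasoning", "score"]:
        raise ValueError("expected 'reasoning' before 'score'")
    return float(data["score"]), data.get("reasoning")

full_prompt = build_prompt("cancel my account", "service termination procedure")
ablated_prompt = build_prompt(
    "cancel my account", "service termination procedure", reasoning_first=False
)
score, reasoning = parse_response('{"reasoning": "matches intent", "score": 4.5}')
print(score, reasoning)
```

Because generation is autoregressive, placing the reasoning field first means the score tokens are conditioned on the explanation, which is the effect the ablation isolates.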
Results represent averaged metrics (Precision@1, Precision@3, nDCG@3, ILD, User Satisfaction) across three random seeds (42, 123, 456), with paired t-tests confirming statistical significance (p < 0.05) for all reported performance differences. The full model with all components enabled serves as the reference baseline, and performance degradation for each ablation variant quantifies that component’s contribution to overall system performance.
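A paired t-test over three seeds reduces to a short calculation on the per-seed differences. The sketch below uses made-up Precision@1 values purely to show the mechanics; they are not results from the paper, and the function name is hypothetical. The critical value 4.303 is the two-sided 5% threshold of the t distribution with 2 degrees of freedom.

```python
import math
import statistics

def paired_t_statistic(full, ablated):
    """t statistic and degrees of freedom for paired per-seed metrics."""
    diffs = [f - a for f, a in zip(full, ablated)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1

# Made-up per-seed Precision@1 values (seeds 42, 123, 456), for illustration.
full_model = [0.84, 0.85, 0.83]
ablated_model = [0.71, 0.73, 0.70]

t_stat, df = paired_t_statistic(full_model, ablated_model)
T_CRITICAL_DF2 = 4.303  # two-sided, alpha = 0.05, 2 degrees of freedom
print(round(t_stat, 1), df, t_stat > T_CRITICAL_DF2)
```

With only three seeds the test has 2 degrees of freedom, so a difference must be large and consistent across seeds to reach significance.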
4.8. Multi-Criteria Analysis
Figure 4 presents a detailed analysis of how different evaluation criteria contribute to overall performance. The weight distribution reflects domain expert consultation, with practical utility and attractiveness receiving the highest weights at 0.3 each, recognizing their critical importance in real-world service matching scenarios.
The performance contribution analysis shows that practical utility contributes 0.25 to overall performance, validating its high weight assignment. Semantic relevance, despite its lower weight of 0.1, contributes 0.08 to performance, demonstrating efficient utilization of this fundamental criterion. Business understanding with a weight of 0.2 contributes 0.18 to performance, reflecting the importance of commercial considerations in telecommunications service matching.
4.9. Reasoning Architecture Impact
Figure 5 demonstrates the substantial benefits of our reasoning-first architecture compared to traditional score-first approaches. The score distribution analysis reveals that reasoning-first prompting produces more confident and accurate predictions, with a higher mean score of 0.74 compared to 0.42 for score-first approaches.
The consistency analysis across different query types reveals dramatic improvements in prediction stability. For account cancellation queries, the standard deviation decreases from 0.18 in score-first approaches to 0.08 in reasoning-first, representing a 56% reduction in standard deviation. Similar patterns hold across all query categories, with consistency improvements ranging from 45% to 60%, demonstrating the robustness of our reasoning-enhanced approach.
4.10. Diversity–Relevance Trade-Off
Figure 6 shows the relationship between perturbation strength and performance metrics across our evaluation dataset. At very low perturbation levels (below 0.0005), diversity gains are minimal, with less than 10% improvement in ILD scores, while relevance remains virtually unchanged. At the optimal operating point, diversity increases by 24% while maintaining 97% of original relevance performance.
Beyond this optimal point, perturbation strengths above 0.002 show diminishing returns: diversity gains plateau while relevance degradation accelerates. In this regime, diversity improvement reaches only 28% while relevance drops to 89% of baseline performance, an unfavorable trade-off for practical applications. The optimal perturbation level also varies by query type: simple queries benefit from lower perturbation, around 0.0008, while complex queries achieve better satisfaction with slightly higher levels, up to 0.0012.
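For reference, intra-list diversity (ILD) is commonly computed as the mean pairwise distance between the items in a recommendation list. The sketch below follows that common formulation under stated assumptions: items are represented by embedding vectors and the distance is cosine distance. This section does not specify the exact ILD variant used in the experiments, so both choices, along with the function names, are illustrative.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def intra_list_diversity(item_vectors):
    """Mean pairwise cosine distance over all item pairs in the list."""
    pairs = list(combinations(item_vectors, 2))
    if not pairs:
        return 0.0  # a list with fewer than two items has no diversity
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy two-dimensional embeddings (hypothetical): two identical items and one
# orthogonal item.
toy_list = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_ild = intra_list_diversity(toy_list)
```

A relative diversity gain such as the 24% reported above would then be the ILD of the perturbed list divided by the ILD of the unperturbed list, minus one.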
4.11. Model Efficiency Analysis
Table 4 presents a comprehensive comparison of model efficiency across different scales, demonstrating the superior performance–efficiency trade-off achieved by our approach. The analysis spans models from 0.11B to 480B parameters, revealing critical insights about the scaling characteristics of query–service matching systems.
Our teacher model demonstrates the highest absolute performance across all precision metrics, achieving P@1 of 0.89 and surpassing even GPT-4’s 0.87 despite similar parameter counts. This 2.3% relative improvement validates our reasoning-first architecture’s effectiveness. More remarkably, our student model with only 0.5B parameters achieves 0.84 P@1, outperforming GPT-3.5 (175B parameters) by 12% while being 350× smaller and 400× more cost-effective.
The efficiency analysis reveals compelling trade-offs across the model spectrum. Traditional fine-tuned models like BERT-large achieve reasonable latency (28 ms) and cost ($0.0008 per 1K requests) but suffer from a limited performance ceiling at 0.63 P@1. Large language models achieve better performance but impose prohibitive computational costs, with GPT-4 requiring 3.5 s per query and $1.50 per 1K requests. Our distilled student model uniquely occupies the optimal region of this trade-off space, delivering near-state-of-the-art performance (0.84 P@1) with small-model efficiency (25 ms latency, $0.001 per 1K requests).
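To make the trade-off concrete, the following sketch expresses each model's latency and cost as a multiple of the student model's, alongside the gap in P@1. The figures are those quoted in the paragraph above; the `ModelProfile` class and `relative_to` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    p_at_1: float
    latency_ms: float
    cost_per_1k_usd: float


# Values quoted in the text of this section.
BERT_LARGE = ModelProfile("BERT-large", 0.63, 28.0, 0.0008)
GPT_4 = ModelProfile("GPT-4", 0.87, 3500.0, 1.50)
STUDENT = ModelProfile("Student (0.5B)", 0.84, 25.0, 0.001)


def relative_to(reference, other):
    """Latency and cost of `other` as multiples of `reference`, plus P@1 gap."""
    return {
        "latency_ratio": other.latency_ms / reference.latency_ms,
        "cost_ratio": other.cost_per_1k_usd / reference.cost_per_1k_usd,
        "p_at_1_gap": other.p_at_1 - reference.p_at_1,
    }


gpt4_vs_student = relative_to(STUDENT, GPT_4)
bert_vs_student = relative_to(STUDENT, BERT_LARGE)
```

On these figures, GPT-4 is 140 times slower and 1500 times more expensive than the student for a 0.03 gain in P@1, while BERT-large has comparable latency and cost but trails the student by 0.21 in P@1.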
4.12. Knowledge Distillation Effectiveness
Table 5 and Figure 7 demonstrate the remarkable efficiency gains achieved through our knowledge distillation approach while maintaining competitive performance levels. Our student model achieves 0.84 Precision@1 compared to the teacher’s 0.89, representing 94% performance retention with a model that is 960 times smaller.
The latency improvements are particularly dramatic, with inference time reducing from 2.4 s to 25 ms, enabling real-time deployment scenarios that would be impractical with the teacher model. The learning curve analysis shows that our reasoning-preserving distillation approach enables the student model to reach 0.84 performance within 10 training epochs, while direct training without distillation plateaus at 0.65 even after extended training. This demonstrates the substantial value of knowledge transfer from the teacher model’s reasoning patterns.
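The headline distillation figures follow from simple ratios of the reported values, as the sketch below shows. The teacher parameter count is not stated in this paragraph; the value computed here is only what the 960× size ratio implies for a 0.5B-parameter student, and the variable names are illustrative.

```python
# Reported values from this section.
teacher_p_at_1 = 0.89
student_p_at_1 = 0.84
direct_training_p_at_1 = 0.65
teacher_latency_ms = 2400.0
student_latency_ms = 25.0
student_params_b = 0.5
size_ratio = 960

# Fraction of teacher Precision@1 retained by the student (about 0.944).
retention = student_p_at_1 / teacher_p_at_1

# Inference speed-up from teacher to student (96x).
speedup = teacher_latency_ms / student_latency_ms

# Teacher size implied by the stated size ratio, in billions of parameters.
implied_teacher_params_b = student_params_b * size_ratio

# Absolute Precision@1 gain of distillation over direct training (0.19).
distillation_gain = student_p_at_1 - direct_training_p_at_1
```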
4.13. Case Study Analysis
Figure 8 provides a detailed visualization of our approach’s superior discrimination capabilities through the analysis of an “account cancellation” query. The heatmap clearly illustrates how our method achieves better separation between relevant and irrelevant services compared to baseline approaches.
Our reasoning-first approach correctly identifies “Account Termination Service” as the primary match with a score of 0.94, supported by detailed reasoning that explains the semantic alignment between user intent and service functionality. The system also appropriately ranks related but distinct services such as “Scheduled Cancellation” at 0.85 and “Cancellation Refund” at 0.87, demonstrating a nuanced understanding of different cancellation-related needs. Baseline methods show various failure patterns, with traditional approaches producing much lower and less discriminative scores across all services, highlighting the importance of our structured reasoning approach.
4.14. Error Analysis
Our comprehensive error analysis across 1000 challenging queries reveals three primary failure categories that account for the majority of incorrect predictions. Semantic ambiguity errors represent 32% of failures and occur when queries contain terms with multiple valid interpretations in the telecommunications domain. For example, “upgrade” might refer to device upgrades, service plan improvements, or technical infrastructure enhancements, requiring additional context that our current approach sometimes lacks. Business logic conflicts account for 28% of errors and arise when user preferences conflict with service availability rules or business constraints, such as premium features for basic accounts or conflicting service combinations. Terminology mismatch errors represent 40% of failures and reflect the gap between colloquial customer language and formal service descriptions, where regional dialects and generational language differences create matching challenges. Future improvements should focus on enhanced context modeling through conversation history integration, proactive user intent clarification through follow-up questions, and expanded terminology mapping that bridges the gap between customer language and service vocabularies.
Beyond these common failure categories, we specifically investigated instances of reasoning hallucinations and logically plausible but incorrect reasoning during our “Reasoning quality assessment” process. Hallucinations, defined as factually incorrect or unsupported statements within the generated reasoning, typically occurred in approximately 5% of complex cases. For instance, for a user query like “I want a plan that offers free international calls”, the model occasionally generated reasoning that referenced specific, non-existent data packages or features (e.g., “This plan includes 100 free international call minutes to 100 countries globally”), which were not actually offered by the recommended service. The reasoning itself was coherent and sounded authoritative, but contained fabricated details, leading to a misleading explanation. These direct factual inaccuracies highlight the challenge of strictly grounding LLM-generated content to available service knowledge.
More subtly, instances of logically plausible but incorrect reasoning were observed in about 10% of the cases, often intertwined with the aforementioned ‘Semantic ambiguity errors’ and ‘Business logic conflicts.’ For example, consider an ambiguous query like “I want to upgrade my service”. If the user intended to increase their data allowance, but the model interpreted “upgrade” as a device upgrade (e.g., for a new phone contract), it might generate a perfectly logical reasoning chain detailing the benefits of a new flagship smartphone and contract renewal options. This reasoning would be internally consistent and plausible, but fundamentally incorrect with respect to the user’s actual intent. Another instance arises from ‘Business logic conflicts’: if a user inquired about “cancelling my premium subscription” during a contractual lock-in period, the model might describe a general cancellation process, logically detailing steps like “contact customer support to handle it” and “return leased equipment”, but critically omit or misrepresent the non-cancellable nature of the contract at that time. Such reasoning appears sound on the surface but leads to an unexecutable or misleading recommendation. These cases underscore the difficulty in ensuring that LLM reasoning is not only coherent but also perfectly aligned with all external, dynamic, and potentially conflicting real-world constraints. Addressing these challenges will be a key focus of future research, particularly through enhanced external knowledge integration and robust conflict resolution mechanisms.
4.15. Generalizability and Cross-Lingual Performance
While our primary empirical validation focuses on the telecommunications domain with a Chinese query dataset, the theoretical foundations and architectural innovations of our framework are designed to be language-agnostic and domain-transferable. The reasoning-first scoring mechanism and multi-criteria evaluation framework leverage the inherent natural language understanding and reasoning capabilities of large language models (LLMs), which are increasingly multilingual and general-purpose. Knowledge distillation, as demonstrated in Section 2.5, efficiently transfers capabilities to smaller models, making deployment in diverse contexts feasible.
To further demonstrate this generalizability, we conducted supplementary experiments on an English customer service dataset from the IT support domain. This dataset comprises 8000 user queries in English and a catalog of 400 IT support services. The annotation process and evaluation metrics mirror those used in our primary study.
Table 6 presents the performance comparison on this English dataset, showcasing our approach against representative English-specific baselines.
These initial results suggest that our approach maintains significant performance advantages in other linguistic and domain contexts, affirming the transferability of the core methodology. Specifically, our teacher model achieves a P@1 of 0.91, representing a substantial improvement over English-specific baselines. The student model also demonstrates strong performance with a P@1 of 0.86, further validating the effectiveness of our knowledge distillation pipeline across languages. Further comprehensive cross-lingual and cross-domain evaluations are planned as part of future work to quantify the framework’s adaptability more broadly.
4.16. Robustness to Adversarial and Noisy Queries
The practical deployment of query–service matching systems necessitates robustness against various forms of input perturbation, including noisy or intentionally adversarial queries. While our current experimental setup did not include a dedicated robustness evaluation against synthetically generated adversarial examples or natural language noise (e.g., typos, grammatical errors, informal phrasing), the inherent design of our LLM-based reasoning architecture offers theoretical advantages in this regard [61].
Large language models are known for their strong contextual understanding and ability to handle linguistic variations, suggesting a degree of inherent resilience to minor perturbations. The reasoning-first approach, by explicitly generating explanatory steps before producing a final score, may also provide a more stable intermediate representation that is less susceptible to superficial input changes compared to direct scoring mechanisms. The ‘Semantic ambiguity errors’ identified in Section 4.14 demonstrate the model’s struggle with inherent ambiguity rather than direct noise. However, the model’s performance under intentionally crafted adversarial examples or high levels of unstructured noise warrants specific investigation.
Preliminary qualitative observations suggest that our system exhibits moderate robustness to common noise types, such as minor typos (e.g., “kankel my account” instead of “cancel my account”) or rephrasing. In these cases, the LLM’s strong natural language understanding often allows it to infer the correct intent and maintain a high matching accuracy. However, for queries containing homophones, highly informal slang, or strategically constructed adversarial perturbations designed to confuse the model’s reasoning pathways, performance degradation is expected.
Future work will systematically investigate the framework’s robustness by:
Injecting controlled noise: Evaluating performance under various levels of lexical and syntactic noise in queries (e.g., character insertions/deletions, word substitutions, grammatical errors).
Adversarial attacks: Testing resilience against examples specifically crafted to induce misclassifications, leveraging techniques from adversarial machine learning (e.g., gradient-based attacks on embedding spaces or prompt injection attacks).
Human-generated noisy queries: Analyzing real-world, naturally noisy queries that might arise from speech-to-text errors or hurried user input in a production environment.
Quantifying and improving this robustness will be critical for hardening the system against real-world deployment challenges and ensuring reliable performance in imperfect input scenarios, further enhancing its practical relevance.
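As an illustration of the controlled noise injection listed above, the following sketch perturbs a query at the character level with a configurable rate. It is a hypothetical harness, not part of the evaluated system: the function name `inject_char_noise`, the uniform choice among three edit operations, and the use of lowercase ASCII letters as noise characters (suited to English queries rather than the primary Chinese dataset) are all assumptions.

```python
import random
import string


def inject_char_noise(query, rate, seed=0):
    """Delete, insert after, or substitute non-space characters at `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducible perturbations
    out = []
    for ch in query:
        # Whitespace is always preserved so word boundaries stay intact.
        if ch.isspace() or rng.random() >= rate:
            out.append(ch)
            continue
        op = rng.choice(("delete", "insert", "substitute"))
        if op == "delete":
            continue
        noise = rng.choice(string.ascii_lowercase)
        if op == "insert":
            out.append(ch)
            out.append(noise)
        else:
            # A substitution may occasionally draw the original character.
            out.append(noise)
    return "".join(out)


clean_query = "cancel my account"
noisy_query = inject_char_noise(clean_query, rate=0.2, seed=42)
```

Sweeping `rate` over a small grid and re-running the matching metrics at each level would give a degradation curve, which is one way to quantify the robustness discussed in this section.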