1. Introduction
The field of artificial intelligence has been fundamentally reshaped by the advent of Large Language Models (LLMs), sophisticated deep learning systems that demonstrate a remarkable capacity to understand, generate, and reason with human language [1]. Foundational models such as Meta’s Llama 3 series [2], Google’s Gemma 3 [3], and Alibaba’s Qwen3 [4] now represent the state-of-the-art, powering applications that span from complex code generation to nuanced creative writing and scientific discovery. Their advanced capabilities have unlocked new paradigms for human-computer interaction and process automation across countless industries.
One of the most promising domains for LLM application is e-commerce, a highly competitive landscape where user experience is a critical determinant of success. The ability of a system to accurately interpret and act upon user requests expressed in natural, often informal, language is a significant competitive advantage. This paradigm, known as conversational commerce, aims to create more intuitive and personalized shopping experiences [5]. Tasks such as robust product classification from noisy user descriptions [6] and understanding complex user intent are central to this vision [7]. Major industry players like eBay have already begun developing their own in-house, domain-specific LLMs to capitalize on these opportunities, underscoring the strategic importance of this technology [7].
However, the deployment of the most powerful, state-of-the-art commercial models, such as OpenAI’s GPT-4, presents significant practical and economic barriers. These models are computationally massive, and their use via API services generates continuous operational costs, creates vendor lock-in, and raises concerns regarding data privacy and security. Furthermore, the immense energy, water, and carbon footprint associated with serving billions of queries from these large-scale models is a growing concern for sustainable AI development [8,9,10]. The high cost and resource intensity of these “frontier” models thus create a substantial obstacle to their widespread adoption, particularly for small to medium-sized enterprises.
In response to these challenges, a powerful trend has emerged: the rise of smaller, highly optimized, open-weight language models. A growing body of research demonstrates that these smaller models, when specialized for a specific domain, can achieve performance comparable or even superior to that of much larger, general-purpose models. For instance, fine-tuned small models have been shown to outperform GPT-4 on specific tasks like arithmetic [11], mental health understanding [12], and developing pedagogical tools [13]. A large-scale study fine-tuning over 300 models confirmed that specialized models can consistently rival GPT-4 on narrow tasks, suggesting that a “many small models” approach is a viable and efficient strategy [14]. This paradigm shift from monolithic, “one-size-fits-all” models to a diverse ecosystem of specialized agents motivates our research.
Two key technologies are central to unlocking the potential of these smaller models: parameter-efficient fine-tuning (PEFT) and post-training quantization (PTQ). PEFT techniques allow a pretrained model to be adapted to a new task by training only a small fraction of its parameters. Low-Rank Adaptation (LoRA) has become a standard approach [13], and its successor, Quantized LoRA (QLoRA), further democratized this process by enabling the fine-tuning of large models on consumer-grade hardware through aggressive 4-bit quantization of the base model during training [15].
Concurrently, post-training quantization has become essential for efficient inference. PTQ methods reduce the numerical precision of a model’s weights (and sometimes activations) after training, leading to significant reductions in memory footprint and potential increases in inference speed. Techniques like GPTQ [16] and AWQ [17] have demonstrated the ability to compress models to 4-bit precision with minimal accuracy loss. However, recent research highlights that the benefits of quantization are not universal; they are highly dependent on the model size, task, and underlying hardware architecture, creating a complex web of trade-offs between performance, energy, and output quality [15,17,18].
This paper bridges these research threads by conducting a comprehensive, hardware-aware investigation into the practical viability of a small, optimized open-weight model for a critical e-commerce task. We hypothesize that by combining QLoRA-based fine-tuning with aggressive post-training quantization, it is possible to specialize a 1-billion-parameter model to achieve accuracy close to the much larger, state-of-the-art GPT-4.1 on a structured intent recognition task. Crucially, we extend beyond mere accuracy to analyze the real-world performance trade-offs, measuring inference speed, memory consumption, and energy efficiency on both GPU and CPU environments to provide a holistic view of the model’s operational characteristics.
Our contributions are fourfold. First, we introduce an end-to-end methodology for building a highly efficient, specialized language model for multilingual e-commerce intent recognition, including a novel synthetic data generation process based on “metaprompting”. Second, we supply empirical evidence that a properly fine-tuned 1-billion-parameter Llama 3.2 model reaches 98.8% accuracy on this task, coming close to GPT-4.1 performance on our evaluation set. Third, we present a detailed, hardware-aware performance analysis that surfaces nuanced, sometimes counter-intuitive effects of quantization, showing that 4-bit GPTQ can slow inference on older GPUs while GGUF formats yield significant speedups on CPUs. Fourth, we release our multilingual synthetic dataset for e-commerce intent recognition to support further research and reproducibility.
The remainder of this paper is organized as follows. Section 2 discusses the background and related work in model specialization and optimization. Section 3 details our methodology, including task definition, dataset generation, and the experimental pipeline. Section 4 presents our results on model accuracy and operational performance. Section 5 discusses the implications of our findings, and Section 6 concludes the paper with a summary and directions for future work.
3. Methodology
To systematically evaluate the performance trade-offs of optimizing small language models, we designed a multi-stage experimental pipeline. The process encompasses task definition, synthetic dataset generation, model selection, fine-tuning, post-training quantization, and a dual-pronged evaluation of both accuracy and operational performance.
3.1. Task Definition and Dataset Generation
The core task of our experiment is structured intent extraction from natural language user queries in an e-commerce context. The model’s objective is to parse a user’s free-form text request for managing a shopping cart and output a structured JSON object containing three key fields: ‘action’ (e.g., “add” or “remove”), ‘product’ (the canonical name of the item), and ‘quantity’ (an integer).
Given the lack of publicly available, multilingual datasets for this specific task, we generated a high-quality synthetic dataset. We employed a “metaprompting” strategy, using the GPT-4.1 model as a sophisticated data generator. A Python 3.12 script systematically constructed detailed prompts that guided the generator model to produce diverse and realistic examples.
To increase realism, the prompt templates and noise-injection rules were informed by patterns observed in real-world, user-generated e-commerce queries collected in production settings. Because these real queries contained sensitive private information, we did not include or reproduce any raw user text; instead, we distilled high-level guidelines (e.g., phrasing, common typos, and code-switching patterns) to steer the synthetic generation.
For each example, the script programmatically defined the ground-truth parameters: the language (cycling through English, Croatian, and Spanish), the action (either ‘add’ or ‘remove’), a product randomly selected from a predefined catalog, and a quantity chosen via a weighted random selection to mimic realistic purchasing patterns. To introduce stylistic diversity, a wide range of prompt templates was employed, generating user expressions that varied from formal requests to brief imperative commands.
To further enhance realism and robustness, the script strategically injected noise. This included linguistic noise such as typos or slang like “pls” and “thx”, contextual noise such as greetings, emojis, or unrelated brand names, and instances of code-switching, for example embedding English phrases like “free shipping” into Croatian sentences.
This pipeline produced a dataset of 6000 examples. An example of a data point is shown below:
User input: "Can you delet 12 lip balms for me?"
->
JSON: {"action": "remove", "product": "Lip Balm", "quantity": 12}
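The generation procedure described above can be sketched roughly as follows. This is a minimal illustration and not the authors' script: the catalog, the quantity weights, the noise labels, and the function names `make_specs`, `build_metaprompt`, and `ground_truth` are all hypothetical, since the paper does not list them. Only the language cycle, the two actions, and the weighted quantity draw come from the text.

```python
import itertools
import random

LANGUAGES = ["English", "Croatian", "Spanish"]  # cycled in order, as in the text
ACTIONS = ["add", "remove"]

# Hypothetical values: the paper does not publish its catalog or weights.
CATALOG = ["Lip Balm", "Shampoo", "Hand Cream"]
QUANTITIES = [1, 2, 3, 6, 12]
QUANTITY_WEIGHTS = [50, 25, 15, 7, 3]  # small quantities are more likely
NOISE_RULES = ["typo", "slang", "greeting", "emoji", "code-switching"]


def make_specs(n, seed=0):
    """Define the ground-truth parameters for n synthetic examples."""
    rng = random.Random(seed)
    languages = itertools.cycle(LANGUAGES)
    specs = []
    for _ in range(n):
        specs.append({
            "language": next(languages),
            "action": rng.choice(ACTIONS),
            "product": rng.choice(CATALOG),
            "quantity": rng.choices(QUANTITIES, weights=QUANTITY_WEIGHTS, k=1)[0],
            "noise": rng.choice(NOISE_RULES),
        })
    return specs


def build_metaprompt(spec):
    """Turn one spec into an instruction for the generator model."""
    return (
        f"Write one realistic shopping-cart request in {spec['language']}. "
        f"The user wants to {spec['action']} {spec['quantity']} x '{spec['product']}'. "
        f"Inject this kind of noise: {spec['noise']}. Return only the user message."
    )


def ground_truth(spec):
    """The JSON label is fixed by the spec, not by the generator's output."""
    return {k: spec[k] for k in ("action", "product", "quantity")}
```

Because the label is derived from the spec rather than parsed from the generated text, the generator can vary the wording freely without introducing labeling errors, provided it follows the instruction.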
3.2. Models and Baselines
Our investigation centered on optimizing a small, open-weight model. The primary model for fine-tuning and quantization was Llama 3.2 1B, a recent and efficient model from Meta [2]. To contextualize its performance, we also evaluated a representative set of other open-weight models, including Gemma 3 1B and Gemma 2 2B from Google [3], and Qwen 2.5 1.5B and Qwen 2.5 3B from Alibaba [4].
To establish a state-of-the-art performance benchmark, we used leading commercial models from OpenAI’s GPT-4.1 series. These proprietary models serve as an upper bound for accuracy on the given task and were evaluated using a few-shot prompting approach, where several examples were provided in the prompt to guide the model’s response format.
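A few-shot prompt of the kind described above might be assembled as in the sketch below. The instruction wording, the second demonstration, and the function name `build_few_shot_prompt` are assumptions made for illustration; the paper does not reproduce its actual prompt. The first demonstration reuses the data point shown in Section 3.1.

```python
import json

# Hypothetical instruction text; the paper's real prompt is not given.
INSTRUCTION = (
    "Extract the shopping-cart intent from the user input. "
    "Reply with a JSON object with the keys action, product, and quantity."
)

DEMONSTRATIONS = [
    # Taken from the example data point in Section 3.1.
    ("Can you delet 12 lip balms for me?",
     {"action": "remove", "product": "Lip Balm", "quantity": 12}),
    # Hypothetical second demonstration.
    ("add two shampoos pls",
     {"action": "add", "product": "Shampoo", "quantity": 2}),
]


def build_few_shot_prompt(demonstrations, query):
    """Concatenate instruction, solved examples, and the open query."""
    parts = [INSTRUCTION]
    for text, intent in demonstrations:
        parts.append(f"User input: {text}\nJSON: {json.dumps(intent, ensure_ascii=False)}")
    parts.append(f"User input: {query}\nJSON:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(DEMONSTRATIONS, "remove 3 hand creams")
```

Ending the prompt with an open `JSON:` line encourages the model to continue in the demonstrated format, which matters here because the exact-match metric rejects any output that is not valid JSON.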
For consistency with the accuracy table, we clarify that the reported fine-tuned Gemma 3 1B result was obtained using the same supervised QLoRA fine-tuning protocol and exact-match evaluation procedure used for the Llama 3.2 1B model. However, Gemma 3 1B was included only as an auxiliary fine-tuned baseline; the subsequent post-training quantization and deployment analysis focuses on Llama 3.2 1B, which is the primary model studied in depth.
3.3. Experimental Setup
The main experimental setup is summarized in Table 1. In brief, the study evaluates multilingual e-commerce intent extraction as a structured JSON generation task, using a common held-out test set to compare commercial baselines, open-weight models, fine-tuned variants, and quantized deployment formats.
3.4. Experimental Pipeline
The core of our experiment involved a two-stage process: specializing the base model for our task via fine-tuning, and then optimizing the specialized model for efficient deployment via quantization.
The full procedure is summarized as a step-by-step workflow in Figure 1. This illustration also provides a concise visual overview of the main methodological phases of the study.
3.4.1. Fine-Tuning
We employed QLoRA (Quantized Low-Rank Adaptation) [15], a parameter-efficient fine-tuning technique, to train our model. The process was orchestrated using the ‘transformers’, ‘peft’, and ‘bitsandbytes’ libraries. The base Llama 3.2 1B model was loaded with its weights quantized to 4-bit precision using the NormalFloat4 (NF4) data type. A LoRA adapter was then applied with the following key hyperparameters: rank (r) of 16, alpha (lora_alpha) of 32, and a dropout rate of 0.15. The adapter targeted all linear projections within the self-attention and feed-forward network blocks. The model was trained for 3 epochs using the ‘SFTTrainer’ with a batch size of 8 and a learning rate of 2
. Crucially, the loss was calculated only on the completion part of the sequence (the JSON output), focusing the learning process exclusively on generating the correct structured data.
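Completion-only loss is commonly implemented by masking the prompt positions in the label sequence so that they do not contribute to the cross-entropy. The sketch below shows that convention on toy token ids. It is an illustration of the general idea, not the notebook's code; the helper name `build_labels` is hypothetical, and the value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_labels(prompt_ids, completion_ids):
    """Return (input_ids, labels) where only completion tokens are supervised."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


# Toy ids: three prompt tokens (the user request), two completion tokens (the JSON).
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(labels)  # [-100, -100, -100, 21, 22]
```

The model still attends to the full prompt as context; the mask only removes the prompt tokens from the training signal.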
To improve reproducibility, the fine-tuning implementation details are provided in the accompanying Colab notebook (
https://colab.research.google.com/drive/1D6uC0-0EJtB-WXR5ipEnSCsv9yO2Oy-O?usp=sharing, accessed on 12 April 2026). In the executed training configuration, the dataset was split into training and validation subsets using a 90/10 split with seed 42. The tokenizer padding token was set to the end-of-sequence token when no native padding token was available. The 4-bit loading configuration used NF4 quantization with FP16 compute precision. The LoRA adapter targeted the
q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj modules, with
bias=none and causal-language-model adaptation. The supervised fine-tuning configuration used a per-device batch size of 8, gradient accumulation of 1, 3 epochs, a learning rate of 2
, warmup ratio of 0.05, maximum gradient norm of 1.0, BF16 training enabled, epoch-level evaluation and checkpointing, and best-checkpoint selection based on validation loss. During training, an exact-match callback also evaluated JSON output correctness on the validation split.
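To give a sense of how small the adapter is, the number of trainable LoRA parameters can be estimated from the rank and the shapes of the targeted projections: each adapted linear layer with input size d_in and output size d_out adds r × (d_in + d_out) parameters. The layer dimensions below are assumptions taken from the publicly available Llama 3.2 1B configuration, not from the paper, so the result should be read as an estimate.

```python
# Assumed Llama 3.2 1B dimensions (public model config, not stated in the paper).
HIDDEN = 2048
INTERMEDIATE = 8192
KV_DIM = 512          # 8 key/value heads x head dimension 64
NUM_LAYERS = 16

RANK = 16             # r, from the text
LORA_ALPHA = 32       # lora_alpha, from the text

# (in_features, out_features) of each targeted projection.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """A (rank x in) plus B (out x rank) low-rank factors."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / RANK  # factor applied to the low-rank update

print(per_layer, total, scaling)  # 704512 11272192 2.0
```

Under these assumptions the adapter holds about 11.3 million trainable parameters, a small fraction of the roughly one billion frozen base weights.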
3.4.2. Post-Training Quantization
After fine-tuning, the trained LoRA adapter was merged with the original, full-precision (FP16) Llama 3.2 1B base model. This step created a single, unified, specialized model in FP16 format, which served as the foundation for all subsequent quantization.
GPTQ for GPU: We used the ‘gptqmodel’ library to apply GPTQ (Generative Pre-trained Transformer Quantization) [16]. Using a calibration set of 300 random samples from our training data, we quantized the merged FP16 model to 4-bit precision, creating a version optimized for GPU inference. The calibration set used in the released notebook was language-balanced, consisting of 100 English, 100 Croatian, and 100 Spanish prompt-completion texts, which were concatenated before quantization.
GGUF for CPU: We used the ‘llama.cpp’ v3.1 toolchain to convert the merged FP16 model into the GGUF format, which is highly optimized for CPU inference. We generated three distinct versions to analyze the impact of bit depth: an aggressive 3-bit version (Q3_K_M), a balanced 4-bit version (Q4_K_M), and a high-quality 5-bit version (Q5_K_M).
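For the GPTQ calibration set described above, a language-balanced draw could look like the following sketch. The record fields (`language`, `prompt`, `completion`), the seed, and the function name are hypothetical, and joining each prompt with its completion is one reading of "concatenated"; the released notebook may organize this differently.

```python
import random


def balanced_calibration(examples, per_language=100, seed=0):
    """Sample the same number of training texts from every language."""
    rng = random.Random(seed)
    by_language = {}
    for example in examples:
        by_language.setdefault(example["language"], []).append(example)

    texts = []
    for language in sorted(by_language):
        # random.sample raises ValueError if a language has too few examples.
        for example in rng.sample(by_language[language], per_language):
            texts.append(example["prompt"] + example["completion"])
    return texts
```

Balancing by language keeps the quantizer from tuning its rounding decisions to the activation statistics of a single language, which is a plausible concern for a model that must serve English, Croatian, and Spanish queries alike.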
3.5. Evaluation Framework
3.5.1. Accuracy Metric
Performance was measured using exact match accuracy. For a prediction to be considered correct, the generated output had to be a syntactically valid JSON object, and all three key-value pairs (‘action’, ‘product’, and ‘quantity’) had to be identical to the ground truth labels. This strict metric was chosen because partial correctness in this e-commerce task can lead to significant functional errors in a production system.
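A minimal sketch of this metric is given below. It is an interpretation of the description rather than the evaluation code itself: the function names are hypothetical, and two details are assumptions, namely that extra keys in the output are ignored and that a quantity returned as a string or float does not count as identical to an integer label.

```python
import json

FIELDS = ("action", "product", "quantity")


def exact_match(generated, truth):
    """True only if the output is valid JSON and all three fields match."""
    try:
        predicted = json.loads(generated)
    except json.JSONDecodeError:
        return False
    if not isinstance(predicted, dict):
        return False
    return all(
        field in predicted
        and type(predicted[field]) is type(truth[field])  # "12" or 12.0 != 12
        and predicted[field] == truth[field]
        for field in FIELDS
    )


def accuracy(outputs, truths):
    """Fraction of outputs that are exact matches."""
    return sum(exact_match(o, t) for o, t in zip(outputs, truths)) / len(truths)
```

For the data point shown in Section 3.1, an output that names the right product and action but returns `"quantity": "12"` as a string would be scored as wrong under this sketch, reflecting the strictness the metric is meant to have.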
3.5.2. Performance Metrics and Hardware
Beyond accuracy, we profiled the key operational characteristics of our fine-tuned Llama 3.2 1B model in its FP16 and quantized forms. We measured inference speed (tokens generated per second), latency (mean and P95 end-to-end latency in seconds), memory consumption (GB of VRAM for GPU tests and GB of RAM for CPU tests), and energy consumption, where power draw in Watts was monitored using nvidia-smi during GPU tests to calculate energy efficiency. GPU-based evaluations (FP16 vs. GPTQ) were conducted on an NVIDIA T4 GPU within the Google Colab environment, while CPU-based evaluations (GGUF variants) were performed on a Hetzner CPX32 instance (4 vCPU) with an AMD EPYC Genoa processor and 8 GB RAM.
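The operational metrics listed above can be derived from per-request measurements as in the sketch below. It rests on several assumptions that the paper does not spell out: requests are executed one after another, power is sampled at even intervals over the run, and P95 uses the nearest-rank definition. The function names are hypothetical.

```python
import statistics


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def summarize(latencies_s, tokens_generated, power_samples_w):
    """Aggregate throughput, latency, and energy per token for one run."""
    total_time = sum(latencies_s)            # sequential requests assumed
    total_tokens = sum(tokens_generated)
    mean_power = statistics.fmean(power_samples_w)
    energy_joules = mean_power * total_time  # W x s = J
    return {
        "tokens_per_second": total_tokens / total_time,
        "mean_latency_s": statistics.fmean(latencies_s),
        "p95_latency_s": p95(latencies_s),
        "joules_per_token": energy_joules / total_tokens,
    }
```

Since energy per token equals mean power divided by throughput, a variant that draws similar power but generates tokens more slowly will necessarily consume more energy per token.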
4. Results
This section presents the empirical findings of our study. We first compare the accuracy of all evaluated models to establish a performance baseline and validate our primary hypothesis. We then provide a detailed analysis of the operational characteristics of our optimized model, examining the hardware-dependent trade-offs of quantization on both GPU and CPU platforms.
4.1. Model Accuracy Comparison
The primary goal of our experiment was to determine if a small, specialized open-weight model could achieve accuracy parity with large-scale commercial models. The exact match accuracy scores for all models on the unseen test set are presented in Table 2.
As expected, the commercial GPT-4.1 series models set a high bar, achieving near-perfect scores. The base open-weight models showed varied but generally respectable performance, with the Llama 3.2 1B base model achieving an accuracy of 0.76. The most significant finding is the dramatic performance increase after fine-tuning. The Llama 3.2 1B model, after fine-tuning with QLoRA, saw its accuracy jump to 0.988. This result brings the specialized 1B parameter model within 1.2 percentage points of GPT-4.1 (1.000) and places it in the same performance range as the GPT-4.1-mini and GPT-4.1-nano baselines. This outcome provides strong support for our central hypothesis. The overall performance landscape is visualized in Figure 2.
4.2. The Impact of Quantization on Accuracy
A key aspect of our investigation was to determine how post-training quantization affects the accuracy of the specialized model. The results, shown in Table 2, reveal that modern quantization techniques can preserve performance with high fidelity. Among the evaluated variants, the 4-bit and 5-bit GGUF models (Q4_K_M and Q5_K_M) achieved the highest accuracy among quantized models (0.984), and the 4-bit GPTQ variant likewise exhibited only a marginal reduction relative to the full-precision fine-tuned model. These results confirm that high-quality 4- and 5-bit quantization can retain near-FP16 accuracy for specialized tasks while delivering substantial efficiency gains.
Furthermore, more aggressive quantization did not lead to a dramatic collapse in performance within the tested range. The 0.984 accuracy of the 4-bit and 5-bit GGUF versions represents only a slight decrease from the FP16 model’s 0.988. Even the 3-bit GGUF (Q3_K_M) variant maintained strong performance at 0.972, a drop of 1.6 percentage points relative to FP16. These results suggest that, for this specialized task, the “quantization cliff” [23] is not reached within the tested range, which extends down to 3-bit quantization, and that high-quality quantized models can closely approach the accuracy of their full-precision counterparts.
This robustness is plausible in the context of structured intent extraction because the task has a constrained output space and a relatively low-entropy target format. Unlike open-ended generation, the model is not required to produce long, stylistically varied, or knowledge-intensive responses; it must map each input to a small set of discrete decisions: the action, canonical product name, and quantity. Fine-tuning therefore concentrates the model’s behavior around stable schema-following patterns and task-specific lexical cues. Moderate 4-bit and 5-bit quantization can perturb individual weights and token probabilities, but these perturbations are less likely to change the final exact-match output when the decision boundary is already clear and the generated JSON structure is short and highly regular. This also explains why the finding should be interpreted as task-specific: similarly aggressive quantization may have a larger impact on open-ended reasoning, long-form generation, or tasks requiring subtle semantic distinctions.
4.3. GPU Performance Profile (FP16 vs. GPTQ)
We profiled the FP16 fine-tuned model against its 4-bit GPTQ version on an NVIDIA T4 GPU to assess performance in a typical accelerated environment. The results reveal a nuanced and counterintuitive relationship between quantization, memory, and speed on this specific hardware architecture.
As shown in Figure 3, quantization delivered substantial memory savings. The 4-bit GPTQ model reduced total VRAM consumption from 3.27 GB to 1.93 GB (a 41% reduction) and cut the model’s parameter-only footprint from 2.30 GB to just 0.96 GB.
Despite these memory benefits, the GPTQ model was significantly slower and less energy-efficient during inference, as detailed in Figure 4. While the model load time decreased dramatically by 93.4% (from 16.95 s to 1.12 s), the inference speed dropped from 44.56 tokens/second for the FP16 model to just 7.92 tokens/second for the GPTQ version, an 82.2% slowdown. The slowdown suggests that the T4 GPU did not execute computations directly in 4-bit. Instead, it appears that the quantized weights were converted back to higher precision during inference, adding overhead. Consequently, the energy consumed per token was 489.3% higher for the quantized model, making it less efficient for inference on this architecture.
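The percentages reported in this subsection follow directly from the raw measurements, as the short calculation below shows. The last step is an inference rather than a reported result: assuming energy per token equals mean power divided by throughput, the reported 489.3% increase implies that the GPTQ run also drew slightly more power on average than the FP16 run.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


vram = percent_change(3.27, 1.93)     # total VRAM, GB
load = percent_change(16.95, 1.12)    # model load time, s
speed = percent_change(44.56, 7.92)   # throughput, tokens/s
print(f"{vram:.1f} {load:.1f} {speed:.1f}")  # -41.0 -93.4 -82.2

# Energy per token = mean power / throughput (assumption stated above).
energy_ratio = 1 + 489.3 / 100        # GPTQ vs. FP16 energy per token
throughput_ratio = 44.56 / 7.92       # FP16 vs. GPTQ tokens/s
implied_power_ratio = energy_ratio / throughput_ratio
print(f"{implied_power_ratio:.2f}")   # 1.05
```

In other words, most of the energy penalty is explained by the slowdown itself: the GPU stays busy about 5.6 times longer per generated token.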
4.4. CPU Performance Profile (GGUF)
In contrast to the GPU results, the GGUF-quantized models demonstrated significant performance gains on a CPU (Hetzner CPX32 instance: 4 vCPU, AMD EPYC Genoa, 8 GB RAM), leveraging the highly optimized ‘llama.cpp’ library.
As shown in Figure 5, the FP16 model achieved 4.5 tokens/second. In contrast, all quantized GGUF versions provided a clear speedup. The 4-bit (Q4_K_M) version was the fastest at 19.5 tokens/second, representing a 4.3× improvement compared to the FP16 baseline.
These speed advantages were complemented by substantial reductions in RAM consumption and load times. Figure 5 also shows that the RAM footprint was reduced from 3.30 GB for the FP16 model to 0.93–1.14 GB for the GGUF versions, corresponding to roughly a 65–72% reduction. This makes it feasible to run the model on standard consumer hardware. Load times also improved significantly, as shown in Figure 6.
In addition to throughput, we measured end-to-end latency for the CPU variants. As shown in Figure 7, the Q4_K_M model achieved the lowest mean latency (0.82 s) and P95 latency (1.16 s), compared to 3.55 s (mean) and 5.27 s (P95) for FP16. The Q3_K_M and Q5_K_M variants also reduced latency substantially, with mean latencies of 1.23 s and 1.55 s, and P95 latencies of 1.79 s and 2.27 s, respectively.
4.5. Accuracy vs. Efficiency Trade-Offs on CPU
Table 3 summarizes the primary CPU deployment trade-offs among the FP16 baseline and the GGUF quantization variants. The FP16 model achieved the highest accuracy (0.988) but required the most memory (3.30 GB) and delivered the lowest throughput (4.5 tok/s). In contrast, all GGUF variants substantially reduced memory and improved throughput.
Among quantized models, the 4-bit Q4_K_M variant offers the best overall balance: it preserves near-FP16 accuracy (0.984), achieves the highest throughput (19.5 tok/s), and maintains a low memory footprint (1.04 GB). The 5-bit Q5_K_M variant matches Q4_K_M in accuracy (0.984) but is slower and has a slightly higher memory footprint. The more aggressive 3-bit Q3_K_M variant minimizes memory (0.93 GB) and remains relatively fast (13.0 tok/s), but incurs a larger accuracy drop (0.972).
5. Discussion
The results of our study offer a multi-faceted view of the opportunities and challenges associated with deploying small, optimized language models. Our findings confirm that specialization can elevate a small model to state-of-the-art performance; however, they also reveal that the efficiency gains from optimization techniques such as quantization are deeply intertwined with the underlying hardware and software ecosystem.
5.1. Small Models Can Achieve State-of-the-Art Accuracy
Our central finding, that a 1B parameter Llama 3.2 model can reach accuracy close to GPT-4.1 after specialized fine-tuning, contributes to a growing body of evidence challenging the notion that larger models are always superior. This result underscores the power of domain specialization. While massive models possess a broad, generalist knowledge base, a smaller model trained on a high-quality, task-specific dataset can develop a deep, expert-level competency within its narrow domain. This principle has been demonstrated across diverse fields, from outperforming GPT-4 on arithmetic tasks [11] to achieving comparable performance in pedagogical applications [13] and medical language understanding [21]. The comprehensive study in LoRA Land, which found that over 200 fine-tuned small models surpassed GPT-4 on specific tasks, solidifies this paradigm [14]. For many business applications, particularly in e-commerce where user interactions follow predictable patterns [7], deploying a fleet of small, expert models is a more cost-effective, private, and computationally efficient strategy than relying on a single, oversized generalist model.
5.2. Quantization Is Not a Free Lunch: The Hardware-Software Synergy
Perhaps the most critical insight from our performance analysis is that the benefits of quantization are not intrinsic to the algorithm itself but emerge from the synergy between the model format, the inference engine, and the hardware architecture. The stark contrast between our GPU and CPU results illustrates this point perfectly.
The 82% slowdown of the 4-bit GPTQ model on the NVIDIA T4 GPU, despite a 41% reduction in VRAM, highlights a common pitfall. The slowdown suggests that the T4 GPU did not execute computations directly in 4-bit. Instead, it appears that the quantized weights were converted back to higher precision during inference, adding overhead that negated the benefits of reduced memory bandwidth. This finding aligns with systematic characterizations of LLM quantization, which show that performance and energy gains are highly dependent on the interplay between the quantization scheme and the GPU’s capabilities [9,15]. On modern GPUs designed with native support for low-precision arithmetic, these results would likely be reversed, leading to significant speedups [18].
Conversely, the strong performance of the GGUF formats on the CPU (achieving up to a 4.3× speedup) is a testament to software optimization. The llama.cpp v3.1 library is purpose-built to leverage CPU vector instructions (such as AVX) for highly efficient low-bit integer matrix multiplications. This demonstrates that with the right software, even ubiquitous consumer hardware can become a powerful platform for LLM inference. This principle of hardware-software co-design is a central theme in efficient AI deployment, from large-scale data centers dynamically managing resources [10] to hyper-efficient accelerators for edge devices [15,24]. Our results provide a clear, practical example of this principle in action.
5.3. Practical Recommendations
Our analysis of the accuracy-vs-efficiency trade-offs, particularly for CPU deployment, allows us to formulate concrete recommendations for practitioners. The choice of which model variant to deploy is not a simple one but a strategic decision based on application-specific priorities.
When correctness is paramount, the FP16 model remains the most accurate option (0.988). Among the quantized CPU variants, the 4- and 5-bit GGUF models (Q4_K_M and Q5_K_M) retain high accuracy (0.984) while also reducing memory usage and improving throughput.
When latency is the primary goal and a modest dip in accuracy is acceptable, the 4-bit GGUF (Q4_K_M) model is the best choice on our CPU testbed, achieving the highest throughput.
When RAM is the binding constraint, the 3-bit GGUF (Q3_K_M) variant offers the smallest footprint (0.93 GB), at the cost of a larger accuracy reduction (0.972).
This decision-making process, balancing multiple objectives such as accuracy, latency, and energy, resonates with calls to move beyond single metrics and adopt more holistic evaluation frameworks, such as considering energy-per-token [19,24].
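The recommendations in this subsection amount to a constrained choice over the measured CPU variants, which can be sketched as a small helper. The function `pick_variant` and its parameters are hypothetical. The table contains only the variants for which accuracy, throughput, and RAM are all stated in the running text; Q5_K_M is left out because its throughput is reported only in the figures and table, not in the prose.

```python
# Accuracy, throughput (tok/s), and RAM (GB) as reported in Section 4.5.
VARIANTS = [
    {"name": "FP16", "accuracy": 0.988, "tok_per_s": 4.5, "ram_gb": 3.30},
    {"name": "Q4_K_M", "accuracy": 0.984, "tok_per_s": 19.5, "ram_gb": 1.04},
    {"name": "Q3_K_M", "accuracy": 0.972, "tok_per_s": 13.0, "ram_gb": 0.93},
]


def pick_variant(min_accuracy=0.0, max_ram_gb=float("inf"), priority="tok_per_s"):
    """Filter by hard constraints, then optimize the chosen priority."""
    candidates = [
        v for v in VARIANTS
        if v["accuracy"] >= min_accuracy and v["ram_gb"] <= max_ram_gb
    ]
    if not candidates:
        return None
    if priority == "ram_gb":  # lower is better for memory
        return min(candidates, key=lambda v: v["ram_gb"])["name"]
    return max(candidates, key=lambda v: v[priority])["name"]


print(pick_variant(priority="accuracy"))  # FP16
print(pick_variant())                     # Q4_K_M
print(pick_variant(max_ram_gb=1.0))       # Q3_K_M
```

Treating accuracy and memory as hard constraints and throughput as the objective mirrors how such decisions are often made in deployment, but the thresholds themselves are application-specific.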
5.4. Limitations of the Study
While our findings provide valuable insights, it is important to acknowledge the limitations of this study, which also point to avenues for future research. First, our model was trained and evaluated exclusively on synthetically generated data. While carefully designed with strategic noise injection, synthetic data may not fully capture the complexity and unpredictability of real-world user queries. The challenges of ensuring realism and avoiding distributional mismatch in synthetic data are well-documented [20]. In particular, this reliance on synthetic data may limit the generalizability of our findings, as even high-quality generated examples can underrepresent rare, ambiguous, culturally specific, or rapidly evolving query patterns that appear in production e-commerce settings. As a result, model performance observed on our evaluation set may overestimate robustness when the system is exposed to authentic user behavior, spelling variation, code-switching, and intent formulations not adequately captured by the synthetic data generation process. Future work should therefore include validation and, where possible, further adaptation using real-world annotated queries to better assess external validity. A real-world holdout set or external structured e-commerce intent benchmark was not included in the present revision because the available production-style queries contain sensitive user information and would require additional anonymization, annotation, and governance before use. Second, our hardware scope was limited to a single older-generation GPU (NVIDIA T4) and a single virtualized CPU environment (Hetzner CPX32: 4 vCPU, AMD EPYC Genoa, 8 GB RAM). As discussed, performance results for GPTQ are expected to be significantly different and more favorable on newer GPUs. Third, our investigation focused on a single, well-defined task. The model’s excellent performance in intent extraction does not guarantee similar success on other, more complex e-commerce tasks, such as those benchmarked in ShoppingBench, which require multi-step reasoning and tool use [6]. Finally, our use of a strict exact match accuracy metric, while appropriate for this task, is binary and does not capture nuances of partially correct answers.
6. Conclusions
This paper investigated the feasibility of using small, optimized open-weight language models as a practical alternative to large commercial systems for a specialized e-commerce task. Our findings demonstrate that this approach is not only viable but highly effective, offering a pathway to building state-of-the-art AI solutions that are both powerful and resource-efficient.
We have shown that through parameter-efficient fine-tuning on a high-quality synthetic dataset, a 1-billion-parameter Llama 3.2 model can achieve 98.8% exact match accuracy in a structured intent recognition task. This result places its performance close to the significantly larger, state-of-the-art GPT-4.1 model, supporting our primary hypothesis. However, our analysis of post-training quantization revealed a critical layer of complexity: the operational benefits are profoundly dependent on the deployment environment. On an older GPU architecture, 4-bit GPTQ quantization, while drastically reducing memory, led to a counter-intuitive 82% slowdown in inference, which we attribute to dequantization overhead. In stark contrast, GGUF formats on a CPU achieved up to a 4.3× improvement in inference throughput and up to a 72% reduction in RAM usage compared to the FP16 baseline, making sophisticated LLM inference feasible on resource-constrained deployments.
The primary conclusion of this work is that the future of many applied AI solutions may not lie solely with ever-larger generalist models, but in a diverse ecosystem of smaller, highly specialized, and hardware-aware models. For developers and organizations, our results provide a clear directive: the choice of an optimization strategy cannot be made in isolation from the target hardware. A model that is highly efficient in one context can be impractical in another. This paradigm enables organizations to develop customized solutions that are more cost-effective, private, and computationally efficient.
Building on these findings, several avenues for future research emerge. First, validating the fine-tuned model’s performance on a large corpus of real-world, anonymized user data would provide the definitive confirmation of its practical utility. Second, replicating the GPU performance analysis on modern architectures (e.g., NVIDIA Ampere or Hopper) is essential to quantify the potential inference speedups that GPTQ can offer when paired with native low-precision hardware support. Finally, the scope of this work could be expanded by fine-tuning the model to handle a broader range of e-commerce tasks, such as product recommendation or order status inquiries, and by exploring alternative optimization techniques, such as AWQ quantization or other PEFT methods like DoRA.
Ultimately, this research serves as a practical demonstration that with the right specialization and optimization strategies, small models can indeed stand shoulder-to-shoulder with giants, offering a more accessible and sustainable path for the widespread adoption of advanced AI.