Performance Trade-Offs of Optimizing Small Language Models for E-Commerce

Licardo, Josip Tomo; Tanković, Nikola; Osman, Ivan; Lorencin, Ivan; Baressi Šegota, Sandi

doi:10.3390/bdcc10050155


by Josip Tomo Licardo ¹, Nikola Tanković ¹, Ivan Osman ², Ivan Lorencin ¹ and Sandi Baressi Šegota ¹,*

¹ Faculty of Informatics, Juraj Dobrila University of Pula, Zagrebačka 30, 52100 Pula, Croatia
² Infobip Ltd., London EC4V 6BW, UK
* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2026, 10(5), 155; https://doi.org/10.3390/bdcc10050155

Submission received: 9 March 2026 / Revised: 29 April 2026 / Accepted: 12 May 2026 / Published: 14 May 2026


Abstract

Large Language Models (LLMs) offer state-of-the-art performance in natural language understanding and generation tasks. However, the deployment of leading commercial models for specialized tasks, such as e-commerce, is often hindered by high computational costs, latency, and operational expenses. This paper investigates the viability of smaller, open-weight models as a resource-efficient alternative. We present a methodology for optimizing a one-billion-parameter Llama 3.2 model for multilingual e-commerce intent recognition. The model was fine-tuned using Quantized Low-Rank Adaptation (QLoRA) on a synthetically generated dataset designed to mimic real-world user queries. Subsequently, we applied post-training quantization techniques, creating GPU-optimized (GPTQ) and CPU-optimized (GGUF) versions. Our results demonstrate that the specialized 1B model achieves 98.8% accuracy, approaching the performance of the significantly larger GPT-4.1 model. A detailed performance analysis revealed critical, hardware-dependent trade-offs: while 4-bit GPTQ reduced VRAM usage by 41%, it paradoxically slowed inference by 82% on an older GPU architecture (NVIDIA T4) due to dequantization overhead. Conversely, GGUF formats on a CPU achieved a speedup of up to 4.3× in inference throughput and up to a 72% reduction in RAM consumption compared to the FP16 baseline. We conclude that small, properly optimized open-weight models are not just a viable alternative for domain-specific applications but a more suitable one, offering state-of-the-art accuracy at a fraction of the computational cost.

Keywords:

large language models; e-commerce; fine-tuning; QLoRA; quantization; GPTQ

1. Introduction

The field of artificial intelligence has been fundamentally reshaped by the advent of Large Language Models (LLMs), sophisticated deep learning systems that demonstrate a remarkable capacity to understand, generate, and reason with human language [1]. Foundational models such as Meta’s Llama 3 series [2], Google’s Gemma 3 [3], and Alibaba’s Qwen3 [4] now represent the state-of-the-art, powering applications that span from complex code generation to nuanced creative writing and scientific discovery. Their advanced capabilities have unlocked new paradigms for human-computer interaction and process automation across countless industries.

One of the most promising domains for LLM application is e-commerce, a highly competitive landscape where user experience is a critical determinant of success. The ability of a system to accurately interpret and act upon user requests expressed in natural, often informal, language is a significant competitive advantage. This paradigm, known as conversational commerce, aims to create more intuitive and personalized shopping experiences [5]. Tasks such as robust product classification from noisy user descriptions [6] and understanding complex user intent are central to this vision [7]. Major industry players like eBay have already begun developing their own in-house, domain-specific LLMs to capitalize on these opportunities, underscoring the strategic importance of this technology [7].

However, the deployment of the most powerful, state-of-the-art commercial models, such as OpenAI’s GPT-4, presents significant practical and economic barriers. These models are computationally massive, and their use via API services generates continuous operational costs, creates vendor lock-in, and raises concerns regarding data privacy and security. Furthermore, the immense energy, water, and carbon footprint associated with serving billions of queries from these large-scale models is a growing concern for sustainable AI development [8,9,10]. The high cost and resource intensity of these “frontier” models thus create a substantial obstacle to their widespread adoption, particularly for small to medium-sized enterprises.

In response to these challenges, a powerful trend has emerged: the rise of smaller, highly optimized, open-weight language models. A growing body of research demonstrates that these smaller models, when specialized for a specific domain, can achieve performance comparable or even superior to that of much larger, general-purpose models. For instance, fine-tuned small models have been shown to outperform GPT-4 on specific tasks like arithmetic [11], mental health understanding [12], and developing pedagogical tools [13]. A large-scale study fine-tuning over 300 models confirmed that specialized models can consistently rival GPT-4 on narrow tasks, suggesting that a “many small models” approach is a viable and efficient strategy [14]. This paradigm shift from monolithic, “one-size-fits-all” models to a diverse ecosystem of specialized agents motivates our research.

Two key technologies are central to unlocking the potential of these smaller models: parameter-efficient fine-tuning (PEFT) and post-training quantization (PTQ). PEFT techniques allow a pretrained model to be adapted to a new task by training only a small fraction of its parameters. Low-Rank Adaptation (LoRA) has become a standard approach [13], and its successor, Quantized LoRA (QLoRA), further democratized this process by enabling the fine-tuning of large models on consumer-grade hardware through aggressive 4-bit quantization of the base model during training [15].

Concurrently, post-training quantization has become essential for efficient inference. PTQ methods reduce the numerical precision of a model’s weights (and sometimes activations) after training, leading to significant reductions in memory footprint and potential increases in inference speed. Techniques like GPTQ [16] and AWQ [17] have demonstrated the ability to compress models to 4-bit precision with minimal accuracy loss. However, recent research highlights that the benefits of quantization are not universal; they are highly dependent on the model size, task, and underlying hardware architecture, creating a complex web of trade-offs between performance, energy, and output quality [15,17,18].

This paper bridges these research threads by conducting a comprehensive, hardware-aware investigation into the practical viability of a small, optimized open-weight model for a critical e-commerce task. We hypothesize that by combining QLoRA-based fine-tuning with aggressive post-training quantization, it is possible to specialize a 1-billion-parameter model to achieve accuracy close to the much larger, state-of-the-art GPT-4.1 on a structured intent recognition task. Crucially, we extend beyond mere accuracy to analyze the real-world performance trade-offs, measuring inference speed, memory consumption, and energy efficiency on both GPU and CPU environments to provide a holistic view of the model’s operational characteristics.

Our contributions are fourfold. First, we introduce an end-to-end methodology for building a highly efficient, specialized language model for multilingual e-commerce intent recognition, including a novel synthetic data generation process based on “metaprompting”. Second, we supply empirical evidence that a properly fine-tuned 1-billion-parameter Llama 3.2 model reaches 98.8% accuracy on this task, coming close to GPT-4.1 performance on our evaluation set. Third, we present a detailed, hardware-aware performance analysis that surfaces nuanced, sometimes counter-intuitive effects of quantization, showing that 4-bit GPTQ can slow inference on older GPUs while GGUF formats yield significant speedups on CPUs. Fourth, we release our multilingual synthetic dataset for e-commerce intent recognition to support further research and reproducibility.

The remainder of this paper is organized as follows. Section 2 discusses the background and related work in model specialization and optimization. Section 3 details our methodology, including task definition, dataset generation, and the experimental pipeline. Section 4 presents our results on model accuracy and operational performance. Section 5 discusses the implications of our findings, and Section 6 concludes the paper with a summary and directions for future work.

2. Background and Related Work

The rapid evolution of LLMs is characterized by a dual trend: the scaling of massive, general-purpose models and the proliferation of smaller, highly specialized models. While large models set performance benchmarks, their practical deployment is often limited. This has spurred significant research into methodologies for creating and optimizing smaller models that are both powerful and efficient. Our work is situated at the intersection of three key research areas: domain specialization through fine-tuning, performance optimization via quantization, and the generation of high-quality data for these processes.

2.1. From Generalists to Specialists: The Power of Fine-Tuning

The prevailing paradigm is shifting from relying on a single, monolithic model to deploying smaller models tailored for specific tasks. A growing body of evidence shows that this specialization is highly effective. Large-scale studies, such as the one conducted in LoRA Land, have demonstrated that fine-tuning a diverse set of 7B models with LoRA can result in performance that consistently rivals or exceeds that of GPT-4 on many narrow tasks [14]. This effect holds across various domains; for instance, the Goat model, a fine-tuned LLaMA, surpassed GPT-4 on complex arithmetic tasks [11], and specialized models have shown comparable performance to GPT-4 in sensitive fields such as mental health understanding [12] and for creating pedagogical tools [13]. This trend is also prominent in industry, with companies such as eBay developing their own in-house, e-commerce-focused LLMs to gain a competitive edge [7].

The primary mechanism for achieving this specialization is fine-tuning. However, full fine-tuning, which updates all of a model’s billions of parameters, is resource-prohibitive. This has led to the dominance of Parameter-Efficient Fine-Tuning (PEFT) methods. Low-Rank Adaptation (LoRA) is a foundational PEFT technique that freezes the pretrained model weights and injects small, trainable low-rank matrices, dramatically reducing the number of trainable parameters and memory requirements [13]. Building on this, Quantized LoRA (QLoRA) introduced a breakthrough by quantizing the base model to an aggressive 4-bit precision during fine-tuning. This, combined with innovations such as the NormalFloat4 (NF4) data type and paged optimizers, enables the fine-tuning of very large models (e.g., 65B) on a single consumer GPU while maintaining the performance of 16-bit fine-tuning [15]. The PEFT landscape continues to evolve with more advanced techniques such as DoRA, which dynamically allocates rank during training [13], further enhancing the toolkit for creating specialized models.

2.2. Post-Training Quantization for Efficient Inference

While QLoRA uses quantization during training, Post-Training Quantization (PTQ) is a critical step for optimizing models for efficient deployment. The primary goals of PTQ are to reduce the model’s memory footprint, decrease inference latency, and lower energy consumption [9,19]. Numerous PTQ methods have been developed, with GPTQ and AWQ being among the most prominent. GPTQ is a one-shot weight quantization method that uses approximate second-order information to iteratively quantize weights while compensating for errors, achieving high accuracy even at 3 or 4-bit precision [16]. In contrast, Activation-aware Weight Quantization (AWQ) identifies that a small fraction of weights are disproportionately important for performance. It protects these salient weights by scaling them up before uniform quantization, a simple yet highly effective strategy [17]. Other methods such as SmoothQuant tackle the problem by mathematically migrating quantization difficulty from activations, which often have problematic outliers, to the more easily quantized weights [18].
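To make the idea of weight quantization concrete, the following is a minimal sketch of 4-bit quantization of a single weight group, followed by the dequantization step needed to recover approximate floating-point values. It uses plain round-to-nearest with one scale and one offset per group; it does not implement GPTQ's second-order error compensation or AWQ's activation-aware scaling. The weight values and function names are illustrative assumptions.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one weight group."""
    levels = (1 << bits) - 1                      # 15 steps, codes 0..15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0       # fall back to 1.0 for constant groups
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + zero for c in codes]


# Illustrative weights (not taken from any real model).
weights = [-0.42, -0.10, 0.03, 0.27, 0.51, -0.33, 0.08, 0.19]
codes, scale, zero = quantize_group(weights)
restored = dequantize_group(codes, scale, zero)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(codes)
print(f"scale={scale:.4f}, max abs error={max_error:.4f}")
```

With round-to-nearest, the reconstruction error of each weight is bounded by half a quantization step. Methods such as GPTQ aim to reduce the effect of this error on the layer output rather than on each weight in isolation.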

However, the benefits of quantization are not a “free lunch.” Recent comprehensive benchmarks reveal a complex landscape of trade-offs. The effectiveness of a given PTQ method is highly dependent on the model size, the target task, and the underlying hardware [17,18]. For instance, smaller models often suffer greater accuracy degradation from 4-bit quantization than larger ones [17]. Furthermore, quantization can disproportionately affect certain capabilities, such as mathematical reasoning [16], and may even alter a model’s truthfulness under certain conditions [15]. A systematic characterization of LLM quantization shows that choices regarding tensor parallelism and GPU architecture can dramatically alter the realized gains in latency and energy, underscoring the need for hardware-aware evaluation [15].

2.3. Synthetic Data and Structured Output Generation

The success of fine-tuning is fundamentally dependent on the quality and volume of the training data. When domain-specific data is scarce, synthetic data generation using LLMs has become an indispensable technique [20]. The Self-Instruct methodology first demonstrated that an LLM could bootstrap its own instruction-following dataset from a small seed set [21]. This concept has evolved into sophisticated, agentic pipelines such as AgentInstruct [21] and MetaSynth [20], which use multi-LLM systems to generate, refine, and diversify synthetic data at a massive scale. The quality of the data generator is key, as different LLMs exhibit complementary strengths in this role [6].

A specific challenge within this domain, and one central to our work, is generating structured outputs such as JSON. This capability is critical for integrating LLMs into automated workflows and RAG systems. Recent research has focused on benchmarking this capability [12,22] and developing techniques to enforce strict schema adherence, such as reinforcement learning strategies [21]. Studies comparing fine-tuned small models against prompted large models for generating structured low-code workflows have found that targeted fine-tuning often yields more robust and domain-aware results [12]. However, it is also known that imposing strict format restrictions can sometimes impair a model’s underlying reasoning ability, suggesting a trade-off between structural rigidity and performance [12]. Our work builds on these insights by using an advanced LLM to generate a structured, noisy, multilingual dataset tailored for a real-world e-commerce task, providing the foundation for our specialization experiments.

3. Methodology

To systematically evaluate the performance trade-offs of optimizing small language models, we designed a multi-stage experimental pipeline. The process encompasses task definition, synthetic dataset generation, model selection, fine-tuning, post-training quantization, and a dual-pronged evaluation of both accuracy and operational performance.

3.1. Task Definition and Dataset Generation

The core task of our experiment is structured intent extraction from natural language user queries in an e-commerce context. The model’s objective is to parse a user’s free-form text request for managing a shopping cart and output a structured JSON object containing three key fields: ‘action’ (e.g., “add” or “remove”), ‘product’ (the canonical name of the item), and ‘quantity’ (an integer).

Given the lack of publicly available, multilingual datasets for this specific task, we generated a high-quality synthetic dataset. We employed a “metaprompting” strategy, using the GPT-4.1 model as a sophisticated data generator. A Python 3.12 script systematically constructed detailed prompts that guided the generator model to produce diverse and realistic examples.

To increase realism, the prompt templates and noise-injection rules were informed by patterns observed in real-world, user-generated e-commerce queries collected in production settings. Because these real queries contained sensitive private information, we did not include or reproduce any raw user text; instead, we distilled high-level guidelines (e.g., phrasing, common typos, and code-switching patterns) to steer the synthetic generation.

For each example, the script programmatically defined the ground-truth parameters: the language (cycling through English, Croatian, and Spanish), the action (either ‘add’ or ‘remove’), a product randomly selected from a predefined catalog, and a quantity chosen via a weighted random selection to mimic realistic purchasing patterns. To introduce stylistic diversity, a wide range of prompt templates was employed, generating user expressions that varied from formal requests to brief imperative commands.

To further enhance realism and robustness, the script strategically injected noise. This included linguistic noise such as typos or slang like “pls” and “thx”, contextual noise such as greetings, emojis, or unrelated brand names, and instances of code-switching, for example embedding English phrases like “free shipping” into Croatian sentences.
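The generation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released script: the catalog entries, quantity values and weights, noise labels, and prompt wording are all hypothetical, and the call to the generator model is omitted.

```python
import random

LANGUAGES = ["English", "Croatian", "Spanish"]
ACTIONS = ["add", "remove"]
CATALOG = ["Lip Balm", "Running Shoes", "Phone Case"]     # hypothetical catalog
QUANTITIES = [1, 2, 3, 5, 12]                              # hypothetical values
QUANTITY_WEIGHTS = [50, 25, 12, 8, 5]                      # hypothetical weights
NOISE_TYPES = ["typo", "slang", "greeting", "emoji", "brand", "code_switch"]


def sample_spec(index, rng):
    """Fix the ground-truth label first, then pick the noise to inject."""
    return {
        "language": LANGUAGES[index % len(LANGUAGES)],    # cycle through languages
        "action": rng.choice(ACTIONS),
        "product": rng.choice(CATALOG),
        "quantity": rng.choices(QUANTITIES, weights=QUANTITY_WEIGHTS, k=1)[0],
        "noise": rng.sample(NOISE_TYPES, k=rng.randint(0, 2)),
    }


def build_metaprompt(spec):
    """Turn a specification into an instruction for the generator model."""
    noise = ", ".join(spec["noise"]) or "none"
    return (
        f"Write one {spec['language']} shopping-cart message in which the user "
        f"wants to {spec['action']} {spec['quantity']} x {spec['product']}. "
        f"Noise to include: {noise}. Return only the user message."
    )


rng = random.Random(42)
specs = [sample_spec(i, rng) for i in range(6)]
prompts = [build_metaprompt(s) for s in specs]
print(prompts[0])
```

Because the label is sampled before any text is generated, the ground-truth JSON for each example is known by construction and does not depend on parsing the generator's output.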

This pipeline produced a dataset of 6000 examples. An example of a data point is shown below:

User input: "Can you delet 12 lip balms for me?"
->
JSON: {"action": "remove", "product": "Lip Balm", "quantity": 12}

The dataset, named jtlicardo/ecommerce-intent-6k, was published on the Hugging Face Hub (https://huggingface.co/datasets/jtlicardo/ecommerce-intent-6k, accessed on 12 April 2026) and split into a 90% training set and a 10% validation set.

3.2. Models and Baselines

Our investigation centered on optimizing a small, open-weight model. The primary model for fine-tuning and quantization was Llama 3.2 1B, a recent and efficient model from Meta [2]. To contextualize its performance, we also evaluated a representative set of other open-weight models, including Gemma 3 1B and Gemma 2 2B from Google [3], and Qwen 2.5 1.5B and Qwen 2.5 3B from Alibaba [4].

To establish a state-of-the-art performance benchmark, we used leading commercial models from OpenAI’s GPT-4.1 series. These proprietary models serve as an upper bound for accuracy on the given task and were evaluated using a few-shot prompting approach, where several examples were provided in the prompt to guide the model’s response format.

For consistency with the accuracy table, we clarify that the reported fine-tuned Gemma 3 1B result was obtained using the same supervised QLoRA fine-tuning protocol and exact-match evaluation procedure used for the Llama 3.2 1B model. However, Gemma 3 1B was included only as an auxiliary fine-tuned baseline; the subsequent post-training quantization and deployment analysis focuses on Llama 3.2 1B, which is the primary model studied in depth.

3.3. Experimental Setup

The main experimental setup is summarized in Table 1. In brief, the study evaluates multilingual e-commerce intent extraction as a structured JSON generation task, using a common held-out test set to compare commercial baselines, open-weight models, fine-tuned variants, and quantized deployment formats.

3.4. Experimental Pipeline

The core of our experiment involved a two-stage process: specializing the base model for our task via fine-tuning, and then optimizing the specialized model for efficient deployment via quantization.

The full procedure is summarized as a step-by-step workflow in Figure 1. This illustration also provides a concise visual overview of the main methodological phases of the study.

3.4.1. Fine-Tuning

We employed QLoRA (Quantized Low-Rank Adaptation) [15], a parameter-efficient fine-tuning technique, to train our model. The process was orchestrated using the ‘transformers’, ‘peft’, and ‘bitsandbytes’ libraries. The base Llama 3.2 1B model was loaded with its weights quantized to 4-bit precision using the NormalFloat4 (NF4) data type. A LoRA adapter was then applied with the following key hyperparameters: rank (r) of 16, alpha (lora_alpha) of 32, and a dropout rate of 0.15. The adapter targeted all linear projections within the self-attention and feed-forward network blocks. The model was trained for 3 epochs using the ‘SFTTrainer’ with a batch size of 8 and a learning rate of 2 × 10⁻⁴. Crucially, the loss was calculated only on the completion part of the sequence (the JSON output), focusing the learning process exclusively on generating the correct structured data.
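Completion-only loss is commonly implemented by masking the prompt positions in the label sequence. The sketch below shows the idea in plain Python, assuming the convention that label value -100 is ignored by the loss (the default ignore index of PyTorch's cross-entropy loss). The token ids and the helper name are hypothetical; the training library may perform this masking internally.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, prompt_length):
    """Copy token ids to labels, hiding the prompt so only the completion is scored."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must lie within the sequence")
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Hypothetical token ids: 5 prompt tokens followed by 4 completion (JSON) tokens.
input_ids = [101, 7, 8, 9, 10, 501, 502, 503, 2]
labels = mask_prompt_labels(input_ids, prompt_length=5)
print(labels)
```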

To improve reproducibility, the fine-tuning implementation details are provided in the accompanying Colab notebook (https://colab.research.google.com/drive/1D6uC0-0EJtB-WXR5ipEnSCsv9yO2Oy-O?usp=sharing, accessed on 12 April 2026). In the executed training configuration, the dataset was split into training and validation subsets using a 90/10 split with seed 42. The tokenizer padding token was set to the end-of-sequence token when no native padding token was available. The 4-bit loading configuration used NF4 quantization with FP16 compute precision. The LoRA adapter targeted the q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj modules, with bias=none and causal-language-model adaptation. The supervised fine-tuning configuration used a per-device batch size of 8, gradient accumulation of 1, 3 epochs, a learning rate of 2 × 10⁻⁴, warmup ratio of 0.05, maximum gradient norm of 1.0, BF16 training enabled, epoch-level evaluation and checkpointing, and best-checkpoint selection based on validation loss. During training, an exact-match callback also evaluated JSON output correctness on the validation split.
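To give a sense of scale for the adapter, the number of trainable LoRA parameters can be estimated from the rank and the shapes of the targeted layers: a rank-r adapter on a layer with d_in inputs and d_out outputs adds r × (d_in + d_out) parameters. The sketch below assumes the commonly reported Llama 3.2 1B shapes (hidden size 2048, feed-forward size 8192, 16 layers, key/value projection width 512); these shapes are not stated in this paper and should be checked against the model configuration.

```python
# Assumed Llama 3.2 1B shapes (not stated in the text; verify against the model config).
HIDDEN, MLP, KV, LAYERS = 2048, 8192, 512, 16
RANK, ALPHA = 16, 32  # LoRA rank and alpha as reported above

# (input features, output features) of each targeted linear layer
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """A is rank x in_features and B is out_features x rank."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
trainable = per_layer * LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapter holds roughly 11 million trainable parameters, on the order of 1% of the base model.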

3.4.2. Post-Training Quantization

After fine-tuning, the trained LoRA adapter was merged with the original, full-precision (FP16) Llama 3.2 1B base model. This step created a single, unified, specialized model in FP16 format, which served as the foundation for all subsequent quantization.

GPTQ for GPU: We used the ‘gptqmodel’ library to apply GPTQ (Generative Pre-trained Transformer Quantization) [16]. Using a calibration set of 300 random samples from our training data, we quantized the merged FP16 model to 4-bit precision, creating a version optimized for GPU inference. The calibration set used in the released notebook was language-balanced, consisting of 100 English, 100 Croatian, and 100 Spanish prompt-completion texts, which were concatenated before quantization.
GGUF for CPU: We used the ‘llama.cpp’ v3.1 toolchain to convert the merged FP16 model into the GGUF format, which is highly optimized for CPU inference. We generated three distinct versions to analyze the impact of bit depth: an aggressive 3-bit version (Q3_K_M), a balanced 4-bit version (Q4_K_M), and a high-quality 5-bit version (Q5_K_M).
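A language-balanced calibration set of the kind described for GPTQ can be drawn as in the sketch below. The record field names, the seed, and the toy data are assumptions for illustration, and the sketch interprets "concatenated" as joining each prompt with its completion; the released notebook may differ in these details.

```python
import random
from collections import defaultdict


def balanced_calibration_set(examples, per_language=100, seed=0):
    """Draw the same number of prompt-completion texts from each language."""
    by_language = defaultdict(list)
    for example in examples:
        by_language[example["language"]].append(example)
    rng = random.Random(seed)
    selected = []
    for language in sorted(by_language):
        pool = by_language[language]
        if len(pool) < per_language:
            raise ValueError(f"not enough {language} examples")
        selected.extend(rng.sample(pool, per_language))
    # One calibration text per sample: the prompt followed by its JSON completion.
    return [ex["prompt"] + ex["completion"] for ex in selected]


# Toy stand-in for the training split (field names are hypothetical).
toy = [
    {"language": lang, "prompt": f"{lang} query {i} -> ", "completion": "{}"}
    for lang in ("en", "hr", "es")
    for i in range(150)
]
calibration_texts = balanced_calibration_set(toy)
print(len(calibration_texts))
```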

3.5. Evaluation Framework

To ensure a rigorous and fair comparison, we used a separate, unseen test set of 1000 examples, jtlicardo/ecommerce-intent-eval (https://huggingface.co/datasets/jtlicardo/ecommerce-intent-eval, accessed on 12 April 2026), structured identically to the training data.

3.5.1. Accuracy Metric

Performance was measured using exact match accuracy. For a prediction to be considered correct, the generated output had to be a syntactically valid JSON object, and all three key-value pairs (‘action’, ‘product’, and ‘quantity’) had to be identical to the ground truth labels. This strict metric was chosen because partial correctness in this e-commerce task can lead to significant functional errors in a production system.
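The metric can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the study; it assumes that extra keys in the output are ignored and that values must match in both type and value, so a quantity emitted as the string "12" does not match the integer 12.

```python
import json

FIELDS = ("action", "product", "quantity")


def is_exact_match(generated_text, truth):
    """True only for valid JSON objects whose three fields equal the labels."""
    try:
        predicted = json.loads(generated_text)
    except ValueError:                      # json.JSONDecodeError is a ValueError
        return False
    if not isinstance(predicted, dict):
        return False
    return all(
        field in predicted
        and type(predicted[field]) is type(truth[field])
        and predicted[field] == truth[field]
        for field in FIELDS
    )


def exact_match_accuracy(outputs, truths):
    hits = sum(is_exact_match(o, t) for o, t in zip(outputs, truths))
    return hits / len(truths)


truth = {"action": "remove", "product": "Lip Balm", "quantity": 12}
outputs = [
    '{"action": "remove", "product": "Lip Balm", "quantity": 12}',    # correct
    '{"action": "remove", "product": "lip balm", "quantity": 12}',    # non-canonical name
    'remove 12 lip balms',                                            # not JSON
    '{"action": "remove", "product": "Lip Balm", "quantity": "12"}',  # wrong type
]
accuracy = exact_match_accuracy(outputs, [truth] * len(outputs))
print(accuracy)
```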

3.5.2. Performance Metrics and Hardware

Beyond accuracy, we profiled the key operational characteristics of our fine-tuned Llama 3.2 1B model in its FP16 and quantized forms. We measured inference speed (tokens generated per second), latency (mean and P95 end-to-end latency in seconds), memory consumption (GB of VRAM for GPU tests and GB of RAM for CPU tests), and energy consumption. For the energy measurements, power draw in watts was monitored using nvidia-smi during GPU tests and used to calculate energy efficiency. GPU-based evaluations (FP16 vs. GPTQ) were conducted on an NVIDIA T4 GPU within the Google Colab environment, while CPU-based evaluations (GGUF variants) were performed on a Hetzner CPX32 instance (4 vCPU) with an AMD EPYC Genoa processor and 8 GB RAM.
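The sketch below shows one way such metrics can be derived from per-request measurements. It is an illustration under simplifying assumptions, not the profiling code used in the study: requests are assumed to run sequentially, power is assumed to be sampled at uniform intervals over the run, and P95 is computed with the nearest-rank method. All input values are made up.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest observation with at least pct percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_run(latencies_s, tokens_generated, power_samples_w=None):
    total_time = sum(latencies_s)            # sequential requests assumed
    total_tokens = sum(tokens_generated)
    summary = {
        "tokens_per_second": total_tokens / total_time,
        "mean_latency_s": total_time / len(latencies_s),
        "p95_latency_s": percentile_nearest_rank(latencies_s, 95),
    }
    if power_samples_w:
        mean_power = sum(power_samples_w) / len(power_samples_w)
        # energy (J) = mean power (W) x duration (s), spread over generated tokens
        summary["joules_per_token"] = mean_power * total_time / total_tokens
    return summary


# Made-up measurements for four requests.
summary = summarize_run(
    latencies_s=[1.0, 2.0, 3.0, 4.0],
    tokens_generated=[10, 20, 30, 40],
    power_samples_w=[50.0, 70.0],
)
print(summary)
```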

4. Results

This section presents the empirical findings of our study. We first compare the accuracy of all evaluated models to establish a performance baseline and validate our primary hypothesis. We then provide a detailed analysis of the operational characteristics of our optimized model, examining the hardware-dependent trade-offs of quantization on both GPU and CPU platforms.

4.1. Model Accuracy Comparison

The primary goal of our experiment was to determine if a small, specialized open-weight model could achieve accuracy parity with large-scale commercial models. The exact match accuracy scores for all models on the unseen test set are presented in Table 2.

As expected, the commercial GPT-4.1 series models set a high bar, achieving near-perfect scores. The base open-weight models showed varied but generally respectable performance, with the Llama 3.2 1B base model achieving an accuracy of 0.76. The most significant finding is the dramatic performance increase after fine-tuning. The Llama 3.2 1B model, after fine-tuning with QLoRA, saw its accuracy jump to 0.988. This result brings the specialized 1B parameter model within 1.2 percentage points of GPT-4.1 (1.000) and places it in the same performance range as the GPT-4.1-mini and GPT-4.1-nano baselines. This outcome provides strong support for our central hypothesis. The overall performance landscape is visualized in Figure 2.

4.2. The Impact of Quantization on Accuracy

A key aspect of our investigation was to determine how post-training quantization affects the accuracy of the specialized model. The results, shown in Table 2, reveal that modern quantization techniques can preserve performance with high fidelity. Among the quantized variants, the 5-bit GGUF (Q5_K_M) and 4-bit GGUF (Q4_K_M) models achieved the highest accuracy (0.984), and the 4-bit GPTQ variant likewise exhibited only a marginal reduction relative to the full-precision fine-tuned model. These results confirm that high-quality 4- and 5-bit quantization can retain near-FP16 accuracy for specialized tasks while delivering substantial efficiency gains.

Furthermore, more aggressive quantization did not lead to a dramatic collapse in performance within the tested range. The 4-bit GGUF (Q4_K_M) and 5-bit GGUF (Q5_K_M) versions both achieved an accuracy of 0.984, only a slight decrease from the FP16 model’s 0.988. Even the 3-bit GGUF (Q3_K_M) variant maintained strong performance at 0.972, indicating that even aggressive quantization can be applied with only a modest loss in accuracy. These results suggest that, for this specialized task, the “quantization cliff” [23] is not encountered within the tested range (down to 3-bit) and that high-quality quantized models can closely approach the accuracy of their full-precision counterparts.

This robustness is plausible in the context of structured intent extraction because the task has a constrained output space and a relatively low-entropy target format. Unlike open-ended generation, the model is not required to produce long, stylistically varied, or knowledge-intensive responses; it must map each input to a small set of discrete decisions: the action, canonical product name, and quantity. Fine-tuning therefore concentrates the model’s behavior around stable schema-following patterns and task-specific lexical cues. Moderate 4-bit and 5-bit quantization can perturb individual weights and token probabilities, but these perturbations are less likely to change the final exact-match output when the decision boundary is already clear and the generated JSON structure is short and highly regular. This also explains why the finding should be interpreted as task-specific: similarly aggressive quantization may have a larger impact on open-ended reasoning, long-form generation, or tasks requiring subtle semantic distinctions.

4.3. GPU Performance Profile (FP16 vs. GPTQ)

We profiled the FP16 fine-tuned model against its 4-bit GPTQ version on an NVIDIA T4 GPU to assess performance in a typical accelerated environment. The results reveal a nuanced and counterintuitive relationship between quantization, memory, and speed on this specific hardware architecture.

As shown in Figure 3, quantization delivered substantial memory savings. The 4-bit GPTQ model reduced total VRAM consumption from 3.27 GB to 1.93 GB (a 41% reduction) and cut the model’s parameter-only footprint from 2.30 GB to just 0.96 GB.

Despite these memory benefits, the GPTQ model was significantly slower and less energy-efficient during inference, as detailed in Figure 4. While the model load time decreased dramatically by 93.4% (from 16.95 s to 1.12 s), the inference speed dropped from 44.56 tokens/second for the FP16 model to just 7.92 tokens/second for the GPTQ version, an 82.2% slowdown. The slowdown suggests that the T4 GPU did not execute computations directly in 4-bit. Instead, it appears that the quantized weights were converted back to higher precision during inference, adding overhead. Consequently, the energy consumed per token was 489.3% higher for the quantized model, making it less efficient for inference on this architecture.
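The percentages reported for the GPU profile follow directly from the measured values, as the short calculation below shows. The last step is a derived observation rather than a reported result: if power draw were identical for both models, energy per token would scale with the inverse of throughput, so the reported 489.3% increase implies a slightly higher mean power draw for the GPTQ model.

```python
def percent_change(before, after):
    return (after - before) / before * 100


vram_change = percent_change(3.27, 1.93)      # total VRAM in GB
load_change = percent_change(16.95, 1.12)     # load time in seconds
speed_change = percent_change(44.56, 7.92)    # throughput in tokens/second

energy_ratio_reported = 1 + 489.3 / 100       # "489.3% higher" means about 5.89x
energy_ratio_same_power = 44.56 / 7.92        # expected ratio at equal power draw
implied_power_ratio = energy_ratio_reported / energy_ratio_same_power

print(f"VRAM: {vram_change:.1f}%  load time: {load_change:.1f}%  speed: {speed_change:.1f}%")
print(f"implied mean power ratio (GPTQ / FP16): {implied_power_ratio:.2f}")
```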

4.4. CPU Performance Profile (GGUF)

In contrast to the GPU results, the GGUF-quantized models demonstrated significant performance gains on a CPU (Hetzner CPX32 instance: 4 vCPU, AMD EPYC Genoa, 8 GB RAM), leveraging the highly optimized ‘llama.cpp’ library.

As shown in Figure 5, the FP16 model achieved 4.5 tokens/second. In contrast, all quantized GGUF versions provided a clear speedup. The 4-bit (Q4_K_M) version was the fastest at 19.5 tokens/second, representing a 4.3× improvement compared to the FP16 baseline.

These speed advantages were complemented by substantial reductions in RAM consumption and load times. Figure 5 also shows that the RAM footprint was reduced from 3.30 GB for the FP16 model to 0.93–1.14 GB for the GGUF versions, corresponding to roughly a 65–72% reduction. This makes it feasible to run the model on standard consumer hardware. Consequently, load times also saw a significant improvement, as shown in Figure 6.

In addition to throughput, we measured end-to-end latency for the CPU variants. As shown in Figure 7, the Q4_K_M model achieved the lowest mean latency (0.82 s) and P95 latency (1.16 s), compared to 3.55 s (mean) and 5.27 s (P95) for FP16. The Q3_K_M and Q5_K_M variants also reduced latency substantially, with mean latencies of 1.23 s and 1.55 s, and P95 latencies of 1.79 s and 2.27 s, respectively.
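The relative CPU improvements quoted above can be reproduced from the reported measurements with a few lines of arithmetic. The sketch uses only numbers stated in this section.

```python
FP16_TPS, Q4_TPS = 4.5, 19.5                  # tokens/second
FP16_RAM, RAM_LOW, RAM_HIGH = 3.30, 0.93, 1.14  # GB, GGUF range 0.93-1.14
FP16_MEAN, Q4_MEAN = 3.55, 0.82               # mean latency in seconds
FP16_P95, Q4_P95 = 5.27, 1.16                 # P95 latency in seconds

throughput_speedup = Q4_TPS / FP16_TPS
ram_reduction_max = (1 - RAM_LOW / FP16_RAM) * 100
ram_reduction_min = (1 - RAM_HIGH / FP16_RAM) * 100
mean_latency_ratio = FP16_MEAN / Q4_MEAN
p95_latency_ratio = FP16_P95 / Q4_P95

print(f"throughput speedup: {throughput_speedup:.1f}x")
print(f"RAM reduction: {ram_reduction_min:.0f}% to {ram_reduction_max:.0f}%")
print(f"latency improvement: {mean_latency_ratio:.1f}x mean, {p95_latency_ratio:.1f}x P95")
```

The mean-latency improvement for Q4_K_M is of the same magnitude as its throughput speedup, which is consistent with generation time dominating end-to-end latency on this task.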

4.5. Accuracy vs. Efficiency Trade-Offs on CPU

Table 3 summarizes the primary CPU deployment trade-offs among the FP16 baseline and the GGUF quantization variants. The FP16 model achieved the highest accuracy (0.988) but required the most memory (3.30 GB) and delivered the lowest throughput (4.5 tok/s). In contrast, all GGUF variants substantially reduced memory and improved throughput.

Among quantized models, the 4-bit Q4_K_M variant offers the best overall balance: it preserves near-FP16 accuracy (0.984), achieves the highest throughput (19.5 tok/s), and maintains a low memory footprint (1.04 GB). The 5-bit Q5_K_M variant matches Q4_K_M in accuracy (0.984) but is slower and has a slightly higher memory footprint. The more aggressive 3-bit Q3_K_M variant minimizes memory (0.93 GB) and remains relatively fast (13.0 tok/s), but incurs a larger accuracy drop (0.972).
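One way to formalize this comparison is a Pareto-dominance check over accuracy, mean latency, and RAM. The sketch below uses the values reported in this section, with one assumption: the RAM figure for Q5_K_M is not stated individually, so it is taken as the upper end of the reported 0.93–1.14 GB range. Mean latency is used in place of throughput because it is reported for every variant.

```python
VARIANTS = {
    # name: (accuracy, mean latency in s, RAM in GB)
    "FP16":   (0.988, 3.55, 3.30),
    "Q3_K_M": (0.972, 1.23, 0.93),
    "Q4_K_M": (0.984, 0.82, 1.04),
    "Q5_K_M": (0.984, 1.55, 1.14),  # RAM assumed from the 0.93-1.14 GB range
}


def dominates(a, b):
    """a dominates b if it is no worse on every metric and better on at least one."""
    acc_a, lat_a, ram_a = a
    acc_b, lat_b, ram_b = b
    no_worse = acc_a >= acc_b and lat_a <= lat_b and ram_a <= ram_b
    better = acc_a > acc_b or lat_a < lat_b or ram_a < ram_b
    return no_worse and better


pareto = sorted(
    name
    for name, metrics in VARIANTS.items()
    if not any(
        dominates(other, metrics)
        for other_name, other in VARIANTS.items()
        if other_name != name
    )
)
print(pareto)
```

Under these assumptions, Q5_K_M is dominated by Q4_K_M, while FP16, Q3_K_M, and Q4_K_M each remain the best choice for some priority (accuracy, memory, and latency, respectively).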

5. Discussion

The results of our study offer a multi-faceted view of the opportunities and challenges associated with deploying small, optimized language models. Our findings confirm that specialization can elevate a small model to state-of-the-art performance; however, they also reveal that the efficiency gains from optimization techniques such as quantization are deeply intertwined with the underlying hardware and software ecosystem.

5.1. Small Models Can Achieve State-of-the-Art Accuracy

Our central finding, that a 1B-parameter Llama 3.2 model can reach accuracy close to GPT-4.1 after specialized fine-tuning, contributes to a growing body of evidence challenging the notion that larger models are always superior. This result underscores the power of domain specialization. While massive models possess a broad, generalist knowledge base, a smaller model trained on a high-quality, task-specific dataset can develop a deep, expert-level competency within its narrow domain. This principle has been demonstrated across diverse fields, from outperforming GPT-4 on arithmetic tasks [11] to achieving comparable performance in pedagogical applications [13] and medical language understanding [21]. The comprehensive study in LoRA Land, which found that over 200 fine-tuned small models surpassed GPT-4 on specific tasks, solidifies this paradigm [14]. For many business applications, particularly in e-commerce where user interactions follow predictable patterns [7], deploying a fleet of small, expert models can be a more cost-effective, private, and computationally efficient strategy than relying on a single, oversized generalist model.

5.2. Quantization Is Not a Free Lunch: The Hardware-Software Synergy

Perhaps the most critical insight from our performance analysis is that the benefits of quantization are not intrinsic to the algorithm itself but emerge from the synergy between the model format, the inference engine, and the hardware architecture. The stark contrast between our GPU and CPU results illustrates this point perfectly.

The 82% slowdown of the 4-bit GPTQ model on the NVIDIA T4 GPU, despite a 41% reduction in VRAM, highlights a common pitfall. The slowdown suggests that the T4 GPU did not execute computations directly in 4-bit. Instead, it appears that the quantized weights were converted back to higher precision during inference, adding overhead that negated the benefits of reduced memory bandwidth. This finding aligns with systematic characterizations of LLM quantization, which show that performance and energy gains are highly dependent on the interplay between the quantization scheme and the GPU’s capabilities [9,15]. On modern GPUs designed with native support for low-precision arithmetic, these results would likely be reversed, leading to significant speedups [18].
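
To make the dequantization overhead concrete, the sketch below packs weights into 4-bit codes and then reconstructs floating-point values from them, which is the extra per-weight work a kernel must perform when it cannot multiply 4-bit values directly. It is a deliberately simplified round-to-nearest scheme with a per-group scale and minimum, not the GPTQ algorithm itself (which additionally compensates rounding error) and not the kernel used in the experiments; all names are illustrative.

```python
def quantize_group(weights: list[float]) -> tuple[list[int], float, float]:
    """Round-to-nearest 4-bit quantization of one weight group.

    Returns (codes in 0..15, scale, group minimum).
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
    codes = [min(15, round((w - lo) / scale)) for w in weights]
    return codes, scale, lo


def pack_nibbles(codes: list[int]) -> bytes:
    """Pack two 4-bit codes into each byte (low nibble first)."""
    if len(codes) % 2:
        raise ValueError("expected an even number of codes")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def dequantize(packed: bytes, scale: float, lo: float) -> list[float]:
    """Unpack every nibble and map it back to a floating-point weight."""
    out = []
    for byte in packed:
        out.append(lo + (byte & 0x0F) * scale)
        out.append(lo + (byte >> 4) * scale)
    return out


weights = [-0.80, -0.35, -0.10, 0.00, 0.12, 0.33, 0.51, 0.70]
codes, scale, lo = quantize_group(weights)
packed = pack_nibbles(codes)
restored = dequantize(packed, scale, lo)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(packed), "bytes for", len(weights), "weights; max error", round(max_err, 3))
```

Storage shrinks to half a byte per weight, but every weight passes through `dequantize` before it can be used in a higher-precision multiplication, which is the kind of overhead that can outweigh the bandwidth savings.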

Conversely, the strong performance of the GGUF formats on the CPU (up to a 4.3× speedup) reflects the impact of software optimization. The llama.cpp library is purpose-built to leverage CPU vector instructions (such as AVX) for highly efficient low-bit integer matrix multiplications. This demonstrates that with the right software, even ubiquitous consumer hardware can become a powerful platform for LLM inference. This principle of hardware-software co-design is a central theme in efficient AI deployment, from large-scale data centers dynamically managing resources [10] to hyper-efficient accelerators for edge devices [15,24]. Our results provide a clear, practical example of this principle in action.
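
The general idea behind low-bit integer matrix multiplication can be illustrated with a block-wise dot product: weights and activations are quantized per block, the products are accumulated as integers, and a single floating-point rescale is applied per block. The sketch below is a simplified illustration under assumed settings (symmetric quantization, a block size of 4, 4-bit weights, 8-bit activations); it does not reproduce the actual llama.cpp kernels or the K-quant block layouts.

```python
def quantize_block(values: list[float], bits: int) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization of one block to signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    return [round(v / scale) for v in values], scale


def block_dot(w: list[float], x: list[float], block: int = 4) -> float:
    """Dot product computed as integer multiply-accumulate within each block."""
    total = 0.0
    for start in range(0, len(w), block):
        qw, sw = quantize_block(w[start:start + block], bits=4)
        qx, sx = quantize_block(x[start:start + block], bits=8)
        int_acc = sum(a * b for a, b in zip(qw, qx))  # integer arithmetic only
        total += int_acc * sw * sx                    # one float rescale per block
    return total


w = [0.7, -0.2, 0.1, 0.4, -0.6, 0.3, 0.0, 0.5]
x = [1.0, 2.0, -1.0, 0.5, 0.25, -0.75, 1.5, 2.0]
exact = sum(a * b for a, b in zip(w, x))
approx = block_dot(w, x)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Because the inner loop operates on small integers, it maps naturally onto SIMD integer instructions, which is where CPU-oriented engines can recover speed rather than lose it to dequantization.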

5.3. Practical Recommendations

Our analysis of the accuracy-vs-efficiency trade-offs, particularly for CPU deployment, allows us to formulate concrete recommendations for practitioners. The choice of which model variant to deploy is not a simple one but a strategic decision based on application-specific priorities.

When correctness is paramount, the FP16 model remains the most accurate option (0.988). Among the quantized CPU variants, the 4- and 5-bit GGUF models (Q4_K_M and Q5_K_M) retain high accuracy (0.984) while also reducing memory usage and improving throughput.

When latency is the primary goal and a modest dip in accuracy is acceptable, the 4-bit GGUF (Q4_K_M) model is the best choice on our CPU testbed, achieving the highest throughput.

When RAM is the binding constraint, the 3-bit GGUF (Q3_K_M) variant offers the smallest footprint (0.93 GB), at the cost of a larger accuracy reduction (0.972).

This decision-making process, balancing multiple objectives such as accuracy, latency, and energy, resonates with calls to move beyond single metrics and adopt more holistic evaluation frameworks, such as considering energy-per-token [19,24].
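
The recommendations above can be expressed as a small constraint-based selector over the Table 3 measurements. This is a sketch for our CPU testbed only; the function name, parameters, and tie-breaking rules are illustrative assumptions, and other hardware would require its own measurements.

```python
from typing import Optional

# (accuracy, tokens/second, RAM in GB) as reported in Table 3.
TABLE_3 = {
    "FP16": (0.988, 4.5, 3.30),
    "Q3_K_M": (0.972, 13.0, 0.93),
    "Q4_K_M": (0.984, 19.5, 1.04),
    "Q5_K_M": (0.984, 10.3, 1.14),
}


def choose_variant(min_accuracy: float = 0.0,
                   max_ram_gb: float = float("inf"),
                   priority: str = "speed") -> Optional[str]:
    """Pick the variant that meets both constraints and is best on `priority`.

    `priority` is one of "accuracy", "speed", or "memory".
    Returns None when no variant satisfies the constraints.
    """
    feasible = {name: m for name, m in TABLE_3.items()
                if m[0] >= min_accuracy and m[2] <= max_ram_gb}
    if not feasible:
        return None
    keys = {
        "accuracy": lambda n: (feasible[n][0], feasible[n][1]),  # ties broken by speed
        "speed": lambda n: (feasible[n][1], feasible[n][0]),     # ties broken by accuracy
        "memory": lambda n: (-feasible[n][2], feasible[n][0]),   # smallest RAM first
    }
    return max(feasible, key=keys[priority])


print(choose_variant(priority="accuracy"))
print(choose_variant(priority="speed"))
print(choose_variant(priority="memory"))
print(choose_variant(min_accuracy=0.98, max_ram_gb=2.0, priority="accuracy"))
```

The last call models a deployment that needs at least 0.98 accuracy within 2 GB of RAM; Q4_K_M and Q5_K_M tie on accuracy, and the speed tie-break selects Q4_K_M.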

5.4. Limitations of the Study

While our findings provide valuable insights, it is important to acknowledge the limitations of this study, which also point to avenues for future research. First, our model was trained and evaluated exclusively on synthetically generated data. While carefully designed with strategic noise injection, synthetic data may not fully capture the complexity and unpredictability of real-world user queries. The challenges of ensuring realism and avoiding distributional mismatch in synthetic data are well-documented [20]. In particular, this reliance on synthetic data may limit the generalizability of our findings, as even high-quality generated examples can underrepresent rare, ambiguous, culturally specific, or rapidly evolving query patterns that appear in production e-commerce settings. As a result, model performance observed on our evaluation set may overestimate robustness when the system is exposed to authentic user behavior, spelling variation, code-switching, and intent formulations not adequately captured by the synthetic data generation process. Future work should therefore include validation and, where possible, further adaptation using real-world annotated queries to better assess external validity. A real-world holdout set or external structured e-commerce intent benchmark was not included in the present revision because the available production-style queries contain sensitive user information and would require additional anonymization, annotation, and governance before use.

Second, our hardware scope was limited to a single older-generation GPU (NVIDIA T4) and a single virtualized CPU environment (Hetzner CPX32: 4 vCPU, AMD EPYC Genoa, 8 GB RAM). As discussed, performance results for GPTQ are expected to be significantly different and more favorable on newer GPUs.

Third, our investigation focused on a single, well-defined task. The model’s excellent performance in intent extraction does not guarantee similar success on other, more complex e-commerce tasks, such as those benchmarked in ShoppingBench, which require multi-step reasoning and tool use [6].

Finally, our use of a strict exact match accuracy metric, while appropriate for this task, is binary and does not capture nuances of partially correct answers.
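
The exact-match limitation noted above can be illustrated by contrasting the binary metric with a per-field score. The sketch assumes the model emits a JSON object with the fields action, product, and quantity (as described in Table 1); the parsing rules, the partial-credit metric, and the example values are hypothetical and are not part of the study's evaluation.

```python
import json

FIELDS = ("action", "product", "quantity")


def parse(output: str) -> dict:
    """Parse a model's JSON output; malformed output counts as an empty prediction."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {}
    return data if isinstance(data, dict) else {}


def exact_match(pred: dict, gold: dict) -> int:
    """1 only if every field matches (the strict, binary metric)."""
    return int(all(pred.get(f) == gold[f] for f in FIELDS))


def field_accuracy(pred: dict, gold: dict) -> float:
    """Fraction of fields that match (an illustrative partial-credit alternative)."""
    return sum(pred.get(f) == gold[f] for f in FIELDS) / len(FIELDS)


# Hypothetical example: the quantity is wrong, the other two fields are right.
gold = {"action": "add", "product": "milk", "quantity": 2}
pred = parse('{"action": "add", "product": "milk", "quantity": 3}')
print(exact_match(pred, gold), round(field_accuracy(pred, gold), 2))
```

Under exact match this prediction scores zero, even though two of the three fields are correct, which is the nuance a field-level metric would surface.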

6. Conclusions

This paper investigated the feasibility of using small, optimized open-weight language models as a practical alternative to large commercial systems for a specialized e-commerce task. Our findings demonstrate that this approach is not only viable but highly effective, offering a pathway to building state-of-the-art AI solutions that are both powerful and resource-efficient.

We have successfully shown that through parameter-efficient fine-tuning on a high-quality synthetic dataset, a 1-billion-parameter Llama 3.2 model can achieve 98.8% exact match accuracy in a structured intent recognition task. This result places its performance close to the significantly larger, state-of-the-art GPT-4.1 model, confirming our primary hypothesis. However, our analysis of post-training quantization revealed a critical layer of complexity: the operational benefits are profoundly dependent on the deployment environment. On an older GPU architecture, 4-bit GPTQ quantization, while drastically reducing memory, led to a counter-intuitive 82% slowdown in inference due to dequantization overhead. In stark contrast, GGUF formats on a CPU achieved up to a 4.3× improvement in inference throughput and up to a 72% reduction in RAM usage compared to the FP16 baseline, making sophisticated LLM inference feasible on resource-constrained deployments.

The primary conclusion of this work is that the future of many applied AI solutions may not lie solely with ever-larger generalist models, but in a diverse ecosystem of smaller, highly specialized, and hardware-aware models. For developers and organizations, our results provide a clear directive: the choice of an optimization strategy cannot be made in isolation from the target hardware. A model that is highly efficient in one context can be impractical in another. This paradigm enables organizations to develop customized solutions that are more cost-effective, private, and computationally efficient.

Building on these findings, several avenues for future research emerge. First, validating the fine-tuned model’s performance on a large corpus of real-world, anonymized user data would provide the definitive confirmation of its practical utility. Second, replicating the GPU performance analysis on modern architectures (e.g., NVIDIA Ampere or Hopper) is essential to quantify the potential inference speedups that GPTQ can offer when paired with native low-precision hardware support. Finally, the scope of this work could be expanded by fine-tuning the model to handle a broader range of e-commerce tasks, such as product recommendation or order status inquiries, and by exploring alternative optimization techniques, such as AWQ quantization or other PEFT methods like DoRA.

Ultimately, this research serves as a practical demonstration that with the right specialization and optimization strategies, small models can indeed stand shoulder-to-shoulder with giants, offering a more accessible and sustainable path for the widespread adoption of advanced AI.

Author Contributions

Conceptualization, J.T.L., I.O. and N.T.; methodology, J.T.L., N.T. and I.L.; software, J.T.L. and I.O.; validation, J.T.L., N.T. and I.L.; formal analysis, J.T.L. and N.T.; investigation, J.T.L., N.T. and I.O.; resources, I.L. and S.B.Š.; data curation, J.T.L. and I.O.; writing—original draft preparation, J.T.L. and N.T.; writing—review and editing, J.T.L., N.T., I.O., I.L. and S.B.Š.; visualization, J.T.L. and I.O.; supervision, I.L. and I.O.; project administration, I.L.; funding acquisition, I.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was (partially) supported by the EC Digital Europe Programme EDIH Adria 2.0 (101256325); SPIN projects IP.1.1.03.0120, IP.1.1.03.0158 and IP.1.1.03.0039; and NextGenerationEU University grants: IIP_010144, IIP_010136, UNIN-TEH-25-1-8; and Infobip Global Communication Platform (PK.1.1.07.0001), part of the Important Project of Common European Interest on Next Generation Cloud Infrastructure and Services (IPCEI-CIS) consortium.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available from the corresponding author upon reasonable request. The Colab notebook documenting the Llama 3.2 1B fine-tuning and GPTQ quantization workflow is available at https://colab.research.google.com/drive/1D6uC0-0EJtB-WXR5ipEnSCsv9yO2Oy-O?usp=sharing (accessed on 11 May 2026).

Acknowledgments

The authors acknowledge the use of generative artificial intelligence (GenAI) tools, such as ChatGPT 5, for text writing, language editing, and drafting support. All GenAI-generated output was critically reviewed, checked, and verified by the authors. All scientific content, data analysis, interpretations, and final manuscript decisions were performed and verified by the authors.

Conflicts of Interest

Author Ivan Osman was employed by Infobip Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

LLM	Large Language Model
QLoRA	Quantized Low-Rank Adaptation
GPTQ	GPT Quantization
GGUF	GGML Unified Format
GPU	Graphics Processing Unit
CPU	Central Processing Unit
VRAM	Video Random Access Memory
RAM	Random Access Memory

References

1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
2. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
3. Team, G.; Kamath, A.; Ferret, J.; Pathak, S.; Vieillard, N.; Merhej, R.; Perrin, S.; Matejovicova, T.; Ramé, A.; Rivière, M.; et al. Gemma 3 Technical Report. arXiv 2025, arXiv:2503.19786.
4. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 Technical Report. arXiv 2025, arXiv:2505.09388.
5. Kaiser, M.; Schulze, C. ChatGPT Referrals to E-Commerce Websites: Do LLMs Outperform Traditional Channels? SSRN 2025, 5585812.
6. Kim, S.; Suk, J.; Yue, X.; Viswanathan, V.; Lee, S.; Wang, Y.; Gashteovski, K.; Lawrence, C.; Welleck, S.; Neubig, G. Evaluating language models as synthetic data generators. arXiv 2025, arXiv:2412.03679.
7. Zhou, W.; Li, T.; Vougiouklis, P.; Steedman, M.; Pan, J.Z. A usage-centric take on intent understanding in e-commerce. arXiv 2024, arXiv:2402.14901.
8. Jiang, P.; Sonne, C.; Li, W.; You, F.; You, S. Preventing the immense increase in the life-cycle energy and carbon footprints of LLM-powered intelligent chatbots. Engineering 2024, 40, 202–210.
9. Fernandez, J.; Na, C.; Tiwari, V.; Bisk, Y.; Luccioni, S.; Strubell, E. Energy considerations of large language model inference and efficiency optimizations. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 32556–32569.
10. Strubell, E.; Ganesh, A.; McCallum, A. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019.
11. Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training compute-optimal large language models. arXiv 2022, arXiv:2203.15556.
12. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165.
13. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685.
14. Zhao, J.; Wang, T.; Abid, W.; Angus, G.; Garg, A.; Kinnison, J.; Sherstinsky, A.; Molino, P.; Addair, T.; Rishi, D. LoRA Land: 310 Fine-tuned LLMs That Rival GPT-4, A Technical Report. arXiv 2024, arXiv:2405.00732.
15. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Adv. Neural Inf. Process. Syst. 2023, 36, 10088–10115.
16. Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv 2022, arXiv:2210.17323.
17. Lin, J.; Tang, J.; Tang, H.; Yang, S.; Chen, W.M.; Wang, W.C.; Xiao, G.; Dang, X.; Gan, C.; Han, S. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. GetMobile Mob. Comput. Commun. 2024, 28, 12–17.
18. Xiao, G.; Lin, J.; Seznec, M.; Wu, H.; Demouth, J.; Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. arXiv 2023, arXiv:2211.10438.
19. Khan, T.; Motie, S.; Kocak, S.A.; Raza, S. Optimizing Large Language Models: Metrics, Energy Efficiency, and Case Study Insights. In Proceedings of the 2025 IEEE Conference on Artificial Intelligence (CAI); IEEE: New York, NY, USA, 2025; pp. 370–375.
20. Nadas, M.; Diosan, L.; Tomescu, A. Synthetic Data Generation Using Large Language Models: Advances in Text and Code. IEEE Access 2025, 13, 134615–134633.
21. Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N.A.; Khashabi, D.; Hajishirzi, H. Self-instruct: Aligning language models with self-generated instructions. arXiv 2023, arXiv:2212.10560.
22. Li, X.; Lan, Y.; Yang, C. TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning. Proc. AAAI Conf. Artif. Intell. 2025, 39, 24485–24493.
23. Ahmadian, A.; Dash, S.; Chen, H.; Venkitesh, B.; Gou, Z.S.; Blunsom, P.; Üstün, A.; Hooker, S. Intriguing Properties of Quantization at Scale. Adv. Neural Inf. Process. Syst. 2023, 36, 34278–34294.
24. Wilhelm, P.; Wittkopp, T.; Kao, O. Beyond test-time compute strategies: Advocating energy-per-token in LLM inference. arXiv 2025, arXiv:2603.20224.

Figure 1. Step-by-step methodological workflow for model specialization, quantization, and evaluation.

Figure 2. Overall Model Performance Comparison. The fine-tuned Llama 3.2 1B model and its high-quality quantized variants achieve accuracy scores comparable to the top commercial baselines.

Figure 3. GPU Memory Comparison. GPTQ 4-bit quantization significantly reduces total VRAM, peak VRAM, and the memory required for model parameters alone.

Figure 4. GPU performance profile on NVIDIA T4. GPTQ loads much faster but runs with lower throughput and higher energy per token.

Figure 5. CPU performance overview. GGUF variants improve throughput and reduce RAM usage compared to the FP16 baseline.

Figure 6. CPU Load Time Comparison. Quantized GGUF models load significantly faster than the full-precision version.

Figure 7. CPU latency comparison (mean and P95). The Q4_K_M variant achieves the lowest latency among the evaluated CPU deployments.

Table 1. Summary of the experimental setup and data configuration.

Component	Configuration
Task and output	Multilingual e-commerce intent extraction; each query is mapped to action, product, and quantity.
Data	6000 synthetic multilingual examples with a 90/10 train-validation split, plus a separate 1000-example held-out test set.
Models	GPT-4.1 baselines, Qwen and Gemma open-weight baselines, and Llama 3.2 1B base, fine-tuned, and quantized variants.
Fine-tuning	QLoRA SFT with LoRA rank 16, alpha 32, dropout 0.15, 3 epochs, per-device batch size 8, gradient accumulation 1, learning rate $2 \times 10^{-4}$, and completion-only JSON loss.
Quantization	GPTQ 4-bit for GPU-oriented inference; GGUF Q3_K_M, Q4_K_M, and Q5_K_M for CPU-oriented inference.
Hardware	NVIDIA T4 GPU in Google Colab for FP16/GPTQ profiling; Hetzner CPX32 CPU instance for GGUF profiling.
Metrics	Exact-match accuracy with 95% Wilson confidence intervals, latency, throughput, memory, load time, and energy consumption.
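
The configuration in Table 1 implies a few derived quantities that are useful when reproducing the setup. The sketch below computes them under two stated assumptions that the table does not confirm: training on a single device, and the standard LoRA scaling factor alpha / rank. The 2048 × 2048 projection size is chosen purely for illustration.

```python
import math

# Configuration values as listed in Table 1.
TOTAL_EXAMPLES = 6000
TRAIN_PERCENT = 90
EPOCHS = 3
PER_DEVICE_BATCH = 8
GRAD_ACCUM = 1
LORA_RANK = 16
LORA_ALPHA = 32

NUM_DEVICES = 1  # assumption: single GPU; the device count is not stated in the table

train_examples = TOTAL_EXAMPLES * TRAIN_PERCENT // 100
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * EPOCHS

# Standard LoRA scales the low-rank update by alpha / rank (assumed here).
lora_scaling = LORA_ALPHA / LORA_RANK


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank


print(train_examples, steps_per_epoch, total_steps, lora_scaling)
print(lora_params(2048, 2048, LORA_RANK), "adapter parameters vs", 2048 * 2048, "frozen")
```

Under these assumptions the run comprises 675 optimizer steps per epoch and 2025 in total, and each adapted square projection of the illustrative size trains under 2% as many parameters as the frozen matrix holds.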

Table 2. Comparison of Exact Match Accuracy Across All Evaluated Models with 95% Wilson Confidence Intervals.

Model Family	Model Variant	Accuracy	95% Wilson CI
Commercial Baselines (Few-shot)
GPT (OpenAI)	GPT 4.1	1.000	[0.9962, 1.0000]
	GPT 4.1-mini	0.995	[0.9883, 0.9979]
	GPT 4.1-nano	0.987	[0.9779, 0.9924]
Open-Weight Models (Base and Fine-tuned)
Qwen (Alibaba)	Qwen 2.5 1.5B	0.810	[0.7845, 0.8331]
	Qwen 2.5 1.5B Instruct	0.857	[0.8339, 0.8773]
	Qwen 2.5 3B	0.962	[0.9483, 0.9722]
	Qwen 2.5 3B Instruct	0.964	[0.9506, 0.9739]
Gemma (Google)	Gemma 3 1B	0.773	[0.7460, 0.7979]
	Gemma 3 1B (finetune)	0.905	[0.8852, 0.9217]
	Gemma 2 2B	0.788	[0.7616, 0.8122]
Llama (Meta)	Llama 3.2 1B	0.764	[0.7367, 0.7893]
	Llama 3.2 1B (finetune)	0.988	[0.9791, 0.9931]
	Llama 3.2 1B (finetune, GPTQ 4-bit)	0.979	[0.9681, 0.9862]
	Llama 3.2 1B (finetune, GGUF 3-bit)	0.972	[0.9598, 0.9806]
	Llama 3.2 1B (finetune, GGUF 4-bit)	0.984	[0.9742, 0.9901]
	Llama 3.2 1B (finetune, GGUF 5-bit)	0.984	[0.9742, 0.9901]
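
The confidence intervals in Table 2 can be recomputed with the standard Wilson score formula. The following sketch assumes each accuracy is a proportion of correct answers over the 1000-example held-out test set described in Table 1; the function name and signature are illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.959964 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# 988 correct out of 1000 corresponds to the fine-tuned Llama 3.2 1B accuracy of 0.988.
low, high = wilson_interval(988, 1000)
print(f"[{low:.4f}, {high:.4f}]")
```

For 988 correct answers out of 1000 this yields approximately [0.9791, 0.9931], in agreement with the interval listed for the fine-tuned Llama 3.2 1B model. Unlike the simpler normal approximation, the Wilson interval stays within [0, 1] and remains informative at an observed accuracy of 1.000.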

Table 3. CPU Performance Comparison (single-threaded inference). Bold values indicate the best result per individual metric.

Model	Accuracy	Speed (tok/s)	Memory (GB)
FP16 (full precision)	0.988	4.5	3.30
Q3_K_M	0.972	13.0	0.93
Q4_K_M	0.984	19.5	1.04
Q5_K_M	0.984	10.3	1.14

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

