A Pipeline for Domain-Specialized Small Language Models from Unstructured Data: A Test Case Using Malaysian Clinical Practice Guidelines

Yusuf, Campanale Haakim bin; Ong, Lee-Yeng

doi:10.3390/app16136630


by Campanale Haakim bin Yusuf and Lee-Yeng Ong *

Centre for Advanced Analytics, CoE for Artificial Intelligence, Faculty of Information Science and Technology, Multimedia University, Melaka 75450, Malaysia

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(13), 6630; https://doi.org/10.3390/app16136630

Submission received: 24 May 2026 / Revised: 20 June 2026 / Accepted: 29 June 2026 / Published: 2 July 2026

(This article belongs to the Special Issue Generative Artificial Intelligence for Clinical Decision Support System and Healthcare)


Abstract

The adoption of Large Language Models (LLMs) in highly regulated, domain-specific sectors is constrained by high computational costs, cloud dependency, and strict data privacy regulations. Furthermore, domain-specific knowledge is usually locked in static, unstructured document formats, preventing automated reasoning. To address these challenges, this study proposes a generalizable end-to-end pipeline for developing domain-specialized Small Language Models (SLMs) optimized for resource-constrained environments, starting from unstructured data. To validate the proposed pipeline, Malaysian Clinical Practice Guidelines (CPGs) in PDF format were used as a test case. The methodology systematically digitizes these unstructured data into a NoSQL database and employs an isomorphic teacher model to generate a strictly grounded synthetic instruction-tuning dataset. Through Quantized Low-Rank Adaptation (QLoRA) and 4-bit Post-Training Quantization (PTQ), a general-purpose model is transformed into a highly compressed, domain-specialized SLM, named SpecioSLM. Systematic workstation benchmarking across four candidate architectures identified the Microsoft Phi-3-Mini (3.8B) variant as the optimal model. The model achieved a throughput of 91.59 tokens per second (TPS), a Time to First Token (TTFT) of 0.17 s, and a semantic fidelity BERTScore of 90.27. A preliminary ARM64-based simulation targeting a specific edge device was further conducted to validate architectural and memory-footprint viability.

Keywords:

small language models; unstructured data extraction; knowledge distillation; QLoRA fine-tuning; privacy-preserving AI

1. Introduction

The ongoing Artificial Intelligence (AI) revolution, driven by the advent of Large Language Models (LLMs), is rapidly transforming industries and shifting processes from manual to digital [1,2]. In high-stakes sectors such as healthcare, LLMs have emerged as a major advancement, demonstrating strong capabilities in complex text and image understanding [3] and in generating detailed domain-specific outputs such as explanations, documentation, and summaries [4].

Despite these advancements, the practical deployment of LLMs in highly regulated and domain-specific settings remains challenging. Their deep transformer-based architectures demand substantial computational resources for both training and inference, typically requiring high-performance GPU or TPU infrastructure [5,6]. To overcome this limitation, organizations often offload computation to third-party cloud servers. However, cloud-based deployment introduces significant concerns regarding data privacy and regulatory compliance, as sensitive domain information is transmitted to external servers [7].

In recent years, Small Language Models (SLMs) have emerged as an efficient and privacy-preserving alternative for domain-specific tasks, allowing models to execute entirely offline without cloud dependency [7,8]. Nevertheless, the development of SLMs is still impractical for small organizations due to limited computational resources and data unavailability. From a computational perspective, existing approaches rely on massive GPU clusters throughout the training phase. From a data perspective, training an SLM from scratch requires extensive data collection, which is both costly and time-consuming. To bypass data collection, recent studies have developed SLMs from pre-trained LLMs via knowledge distillation and fine-tuning techniques. However, this approach often depends on highly structured and pre-curated datasets.

Motivated by the need for a methodology to develop an SLM from unstructured data under limited computational resources, this study proposes a custom end-to-end pipeline for SLM development starting from an unstructured knowledge base. The study uses Malaysian Clinical Practice Guidelines (CPGs) as the domain corpus. They were chosen because they represent authoritative, evidence-based knowledge tailored to a specific domain (the Malaysian healthcare system), while also being available only in unstructured PDF format. The novel contributions of this work can be summarized as follows:

The design of a custom end-to-end pipeline implementable on consumer-grade GPUs that successfully transforms unstructured knowledge into SLMs.
A systematic evaluation of multiple SLMs processed through this pipeline to define the optimal trade-off architecture between execution throughput, memory footprint, and semantic fidelity.
The implementation of an ARM-based simulation to validate the practical viability and execution constraints of the optimal SLM on a targeted edge architecture.

The organization of this study is as follows. Section 2 provides a comprehensive literature review, discussing limitations of LLMs in high-stakes domains, the shift toward Small Language Models (SLMs) as a practical alternative, methodologies and challenges in SLM development, established benchmarking metrics and current research gaps. Section 3 details the proposed end-to-end pipeline for SLM development from unstructured data under resource constraints. Section 4 describes the experimental setup, detailing the hardware constraints, datasets used and specific evaluation metrics chosen to assess inference performance and semantic fidelity. Section 5 details the implementation and benchmark results, identifying the optimal architecture among SLMs developed, comparing this optimal model against larger baseline LLMs, and discussing findings from a preliminary ARM64-based simulation. Finally, Section 6 concludes the study by summarizing key findings, acknowledging existing limitations and suggesting directions for future research.

2. Literature Review

2.1. Limitations of Large Language Models

Powered by deep Transformer architectures and massive embedding sizes, LLMs have brought impressive advancements by enabling multi-step reasoning, text generation, and contextual understanding. Despite their technical achievements, these models face critical limitations that hinder their feasibility in resource-constrained and data-sensitive settings such as healthcare [5]. From a development perspective, training LLMs is extremely costly and time-consuming, typically requiring distributed clusters of high-performance GPUs/TPUs for weeks or months [6]. At deployment, the need for high-performance GPUs or TPUs and distributed computing infrastructure sharply increases operational costs, making LLMs affordable only for well-funded organizations or large tech companies [9]. Environmental sustainability is also a major concern, in terms of both energy consumption and carbon footprint: a single query to OpenAI’s ChatGPT (GPT-3) is reported to use approximately 29 kilowatts and about fifty centiliters of water [5,6]. This demand conflicts with global goals like the Paris Agreement, potentially hindering adoption in developing regions [8,10]. Because local devices cannot host LLMs, smaller organizations can mitigate these development and deployment issues by offloading computation to the cloud [10]. However, this increases dependency on remote servers and requires the transmission of sensitive data, raising privacy risks and potential non-compliance with regulations such as Malaysia’s Personal Data Protection Act (PDPA) 2010 [11].

2.2. The Shift to Small Language Models

In contrast, Small Language Models (SLMs), thanks to their optimized and lightweight Transformer variants, are increasingly adopted in domain-specific and edge-based applications such as healthcare, finance, and legal services where efficiency, privacy, and adaptability are crucial [7,8]. Their compact size, obtained through fine-tuning and compression techniques, allows for faster development and deployment on edge devices. They can be trained within days on relatively small datasets and can run on consumer-grade GPUs with 8–16 GB VRAM [7,8]. This efficiency also leads to low energy consumption, with studies reporting usage of only 0.0017–0.0041 Wh per query on NVIDIA Jetson Nano [6,8]. Moreover, SLMs can represent a solution to LLMs’ cloud dependency since they can function offline on edge devices without constant cloud access and provide real-time applications [12]. By performing inference directly on-device, enhanced data privacy is ensured since data are locally processed and not shared with third-party cloud servers [13]. This localized processing is a critical necessity to ensure strict compliance with regional privacy frameworks, such as the PDPA 2010. In addition, this local inference feature enables ultra-low inference time, making SLMs suitable for responsive and real-time applications [9].

2.3. Small Language Models Development Processes

Small Language Models (SLMs) can be developed using two distinct approaches: training the model from scratch or deriving it from an existing LLM. The first approach, developing an SLM from scratch, is a highly resource-intensive and time-consuming process that involves a multi-stage pipeline that must be preceded by massive dataset collection and careful data preprocessing. This step is crucial for performance, since SLMs are less tolerant of noisy inputs and perform best with clean, simple, and comprehensible data [14]. The pipeline then proceeds to the development phase with pre-training, where the model learns general language knowledge from large corpora using either Masked Language Modeling (MLM) for BERT-style encoders or Next Token Prediction (NTP) for GPT-style decoders. After pre-training, the model is fine-tuned on domain-specific datasets to improve performance, often using specialized loss functions and Parameter-Efficient Fine-Tuning (PEFT) methods such as Low-Rank Adaptation (LoRA). Finally, decoding strategies are applied to guide output generation by iteratively selecting the most appropriate next token during inference [15].

In contrast, the second approach, deriving an SLM from an existing LLM, is often favored in resource-constrained environments because it significantly reduces the training time and data required [15]. Instead of training a model from scratch, this approach leverages the knowledge already available in a pre-trained LLM and adapts it into a smaller and more efficient model, thereby eliminating the need for large-scale data collection and pre-processing. The pipeline for this approach follows three main stages:

Teacher and Student Selection: Two pre-existing, well-trained LLMs are selected based on their performance and compressibility. The first model, known as the teacher model, is a large and highly performing model with typically 70 billion parameters or more. Leveraging its rich linguistic understanding, broad world knowledge, and strong reasoning capabilities acquired during large-scale pre-training, this model is used as a gold-standard source of knowledge from which the student model will learn during the knowledge distillation process. The second model, known as the student model, is a smaller and more efficient general-purpose LLM, typically containing fewer than 8 billion parameters. During the distillation process, the student model learns and distills rich linguistic understanding and general reasoning abilities [16]. This selection is a critical design choice that should balance performance, compressibility and license constraints because it directly affects SLM’s performance, accuracy and size [17].
Fine-Tuning: The model is trained on a smaller, task-specific dataset generated by the teacher model to adapt the student model’s general knowledge to the specific domain desired. This phase is essential to improve the accuracy and linguistic fluency of the specialized model [18]. Fine-Tuning can be conducted using Full Fine-Tuning or Parameter-Efficient Fine-Tuning (PEFT). In Full Fine-Tuning, all model parameters are updated, resulting in stronger task adaptation. However, this entails high computational requirements. On the other hand, PEFT reduces memory and computing requirements by keeping the majority of the model’s weights frozen and training only small, added adapter modules [18]. PEFT techniques include Low-Rank Adaptation (LoRA), which trains a small set of trainable parameters through lightweight adapter matrices; Infused Adapter by Inhibiting and Amplifying Inner Activations (IA³), which introduces a small number of trainable scaling vectors to modify transformer activations and Quantized LoRA (QLoRA), which combines LoRA with 4-bit quantization, allowing fine-tuning to be performed with much lower memory usage [18].
Compression Techniques: To minimize memory footprint while preserving performance, the model undergoes one or more compression techniques among pruning, knowledge distillation, and quantization. Pruning removes less important parameters, either individually (unstructured) or by deleting entire structural components like neurons or layers (structured). Knowledge Distillation trains a smaller student model to mimic a larger teacher using soft labels, via white-box (internal access) or black-box (outputs only) methods. Quantization lowers memory and computation by converting weights to lower precision (e.g., 8-bit or 4-bit), using Post-Training Quantization (PTQ) after training or Quantization-Aware Training (QAT) during training [15]. Studies demonstrate that these lower-precision models are able to achieve significant memory and computational savings while preserving more than 98% of accuracy [19].
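To make the PEFT and quantization stages above more concrete, the following minimal Python sketch (not the authors' implementation) counts the trainable parameters that a rank-r LoRA adapter adds to a single weight matrix and round-trips a few weights through symmetric 4-bit absmax quantization. The matrix size, the rank, and the absmax scheme are illustrative assumptions; QLoRA itself relies on a different 4-bit data type (NF4).

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def quantize_absmax_4bit(weights):
    """Symmetric 4-bit quantization: map floats to integer codes in [-7, 7]."""
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    # Hypothetical 3072 x 3072 projection matrix with a rank-16 adapter.
    full = 3072 * 3072
    lora = lora_trainable_params(3072, 3072, rank=16)
    print(f"full fine-tuning: {full:,} params; LoRA (r=16): {lora:,} params")

    weights = [0.7, -0.3, 0.1, 0.0]
    codes, scale = quantize_absmax_4bit(weights)
    print("codes:", codes)
    print("restored:", dequantize(codes, scale))
```

The frozen base matrix keeps all of its parameters, but only the two small factors receive gradients, which is where the memory saving during fine-tuning comes from.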

Figure 1 shows the general flowchart for these two distinct SLM development approaches.

2.4. Malaysian Clinical Practice Guidelines as a Domain-Specific Knowledge Source

Recognizing the transformative potential of AI in healthcare, the Ministry of Health (MoH) Malaysia has prioritized AI adoption through the Health White Paper 2023 [20] and the 12th and 13th Malaysian Economic Planning Unit [21,22], with the goal of leveraging AI technologies for equitable and sustainable health services by 2030. However, the practical deployment of LLM systems within healthcare remains constrained by strict requirements for patient data privacy and regulatory compliance. The cloud-dependent nature of LLMs exposes sensitive healthcare data to transmission to external servers, posing direct conflicts with Malaysia’s PDPA [11]. Moreover, AI systems should be able to generate recommendations that are consistent with nationally endorsed Clinical Practice Guidelines (CPGs). These recommendations are statements systematically developed to guide healthcare practitioners in making informed decisions. Although medical evidence is global, in each state, professional bodies or health authorities develop their own CPGs with the aim of providing evidence-based recommendations according to the healthcare system, regulatory policies, and cultural considerations of the State. By tailoring recommendations to specific needs and constraints of their population, state-level CPGs promote healthcare standardization across hospitals and clinics, whether public or private [23,24]. In Malaysia, CPGs are stored as static, unstructured documents in Portable Document Format (PDF). While this format effectively preserves visual fidelity of text, tables, and flowcharts for human readers, it lacks semantic metadata and internal structures required for direct machine interpretation. This “unstructured problem” prevents Language Models from directly accessing and reasoning over the guidelines’ complex logic [25].

2.5. Small Language Models: Current Situation

Although several countries have initiated the development of SLMs, their development remains constrained by challenges related to data availability and computational requirements. Training an SLM from scratch is particularly resource-intensive. For instance, the healthcare-domain BioMedLM was developed using a pre-training dataset of 300 billion tokens, followed by a training phase on a corpus of 34.6 billion tokens. The entire process required several days of computation on a high-performance cluster consisting of 128 NVIDIA A100 GPUs [26]. Similarly, the financial-domain SLM FinGPT was pre-trained on a corpus collected from web crawls, news archives, and social media sources of 38 billion tokens, iterated over 8 epochs. The total training run of 300 billion tokens is executed on the LUMI supercomputer using a high-performance cluster of 192 nodes [27]. Even utilizing the second approach of deriving a specialized SLM from an existing LLM remains computationally expensive. Meerkat-7B-v1.0 [28], a Korean healthcare-oriented SLM built on Mistral-7B, was trained on a massive and complex dataset of 441,034 samples derived from existing datasets and medical textbooks, requiring a cluster of 8 NVIDIA A100 GPUs [29]. To mitigate computational and data collection challenges, some studies have attempted to use synthetic datasets instead of large existing structured corpora. MEDFIT-LLM-3B [30], a United States healthcare-oriented SLM derived from Llama-3.2-3B-Instruct, was trained using a small synthetic dataset of 10,000 question-and-answer pairs [31]. However, these samples were generated entirely through prompting the Phi-4 model without grounding outputs in authoritative reference documents. While this approach significantly reduces data acquisition costs, it introduces risks of propagating hallucinations and general noise into the student model during training. 
Table 1 summarizes the development approaches, dataset sizes, data sources, and computational requirements of representative SLMs discussed in this section.

In Malaysia, SLM development is still at early stages, with no established models for domain-specific tasks. The only publicly available Malaysian model is MaLLaM, a general-purpose language model that fully supports the Malaysian language. The model, developed by Mesolitica, is released on Hugging Face in 1.1B, 3B, and 5B parameter variants [32,33,34]. Moreover, two emerging initiatives, ILMU (Intelek Luhur Malaysia Untukmu) and Sihat AI, were announced in 2025, but both remain in early stages with no publicly released technical specifications or performance benchmarks.

2.6. Benchmarking Language Models

As SLM research remains in its early stages, evaluation practices are not yet standardized. Existing studies commonly employ application-specific benchmarks that assess task performance within a particular domain [27,29], making direct comparison across different SLMs challenging. While such benchmarks are well suited for evaluating domain-specific capabilities, they provide limited insight into the general inference characteristics of language models. To assess inference efficiency and response quality, language models are commonly evaluated using domain-agnostic metrics such as Time to First Token (TTFT), Tokens Per Second (TPS) [35] and BERTScore [36]. These metrics enable model evaluation independent of the target application domain.

Time To First Token (TTFT): TTFT is a key metric for evaluating LLM inference performance, measuring the latency between the user’s query (input prompt) and the start of the model’s response (first output token) [37]. Lower TTFT values indicate better responsiveness of the model. Based on Nielsen’s classic usability guidelines [38], Human–Computer Interaction (HCI) research identifies TTFT as the primary metric for conversational capability with a recommended threshold of less than 1 s [39,40]. Indeed, a TTFT exceeding 1 s creates a noticeable delay that degrades the user experience in conversational systems.

$$\mathrm{TTFT} = t_{\text{first\_token}} - t_{\text{prompt\_submitted}} \tag{1}$$

Tokens Per Second (TPS): Alongside TTFT, TPS measures generation speed (throughput), that is, the average number of tokens generated per second after the first token has appeared [41]. Psycholinguistic evidence suggests that humans typically read complex technical text at an average speed of ~250 words/minute, which corresponds to approximately 5–6 tokens/second [42]. Therefore, a minimum of 30 TPS, several times faster than human reading speed, is considered fast enough to create the perception of instant text availability.

$$\mathrm{TPS} = \frac{N_{\text{tokens}}}{t_{\text{end}} - t_{\text{first\_token}}} \tag{2}$$
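As an illustration of Equations (1) and (2), the sketch below times a streamed response and derives TTFT and TPS from three timestamps. It is a minimal example under stated assumptions: `measure_stream`, the injectable `clock` argument, and the simulated token stream are hypothetical names introduced here, not part of the benchmarking code used in the study.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second, n_tokens) for one response."""
    t_submitted = clock()          # prompt handed to the model
    t_first = None
    n_tokens = 0
    for _ in tokens:
        now = clock()
        if t_first is None:
            t_first = now          # arrival of the first output token
        n_tokens += 1
    t_end = clock()
    if t_first is None:
        raise ValueError("the stream produced no tokens")
    ttft = t_first - t_submitted                 # Equation (1)
    decode_time = t_end - t_first
    tps = n_tokens / decode_time if decode_time > 0 else float("inf")  # Equation (2)
    return ttft, tps, n_tokens


def simulated_stream(n: int, first_delay: float, step: float):
    """Stand-in for a model's streaming generator (hypothetical)."""
    time.sleep(first_delay)
    for i in range(n):
        yield f"tok{i}"
        time.sleep(step)


if __name__ == "__main__":
    ttft, tps, n = measure_stream(simulated_stream(20, 0.05, 0.01))
    print(f"TTFT = {ttft:.3f} s, TPS = {tps:.1f}, tokens = {n}")
```

Passing the clock as an argument keeps the measurement logic deterministic under test while defaulting to a monotonic timer in normal use.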

Fidelity (BERTScore): This metric evaluates how faithful a generated response is to a reference, not only by matching exact words but also by matching the intended meaning. Similar to accuracy, fidelity is used to evaluate correctness, but it is more suitable for domain-specific settings because it accounts for semantic equivalence between different terms that carry the same meaning (e.g., “renal failure” vs. “kidney failure”) [36]. BERTScore leverages a pre-trained transformer architecture to generate contextual token embeddings for both the generated response and the reference text. These embeddings are then compared using cosine similarity to compute Precision, Recall, and F1 score, which together quantify the semantic overlap. Scores range from 0 to 1 and are often reported as percentages, as in this study; in several NLP applications, values between 80 and 90 on the percentage scale are commonly regarded as indicating high-quality system performance [43].
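The matching step described above can be sketched in a few lines of Python. The example assumes token embeddings are already available as plain lists of floats (the toy vectors are invented for illustration); a real BERTScore computation obtains them from a pre-trained transformer and may add refinements such as IDF weighting, which are omitted here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bertscore_from_embeddings(candidate, reference):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    sim = [[cosine(c, r) for r in reference] for c in candidate]
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(row) for row in sim) / len(candidate)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(
        max(sim[i][j] for i in range(len(candidate)))
        for j in range(len(reference))
    ) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy 2-D "embeddings": the candidate covers only one of two reference tokens.
    candidate = [[1.0, 0.0]]
    reference = [[1.0, 0.0], [0.0, 1.0]]
    p, r, f1 = bertscore_from_embeddings(candidate, reference)
    print(f"P = {p:.2f}, R = {r:.2f}, F1 = {f1:.2f}")
```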

However, since inference efficiency and response quality are inherently subject to trade-offs, these metrics should be evaluated collectively to identify the optimal balance or “sweet spot” [44].

2.7. Research Gap

The current literature reveals several important gaps. The first one is the lack of data specificity and grounding approaches in SLM training. High-performance SLMs are usually developed using massive pre-structured datasets for pre-training and fine-tuning, which contain general knowledge that may not be fully aligned with specific domain requirements. While synthetic data generation has been proposed as a way to reduce data collection burden, existing methods are not grounded in domain-specific knowledge. This may lead to hallucinations, factual inconsistencies, and generalized noise within the generated tuning dataset. The second gap identified is related to computational resource requirements. Current pre-training and fine-tuning pipelines require substantial computational resources and prolonged training durations, making them heavily dependent on high-performance multi-GPU clusters or large cloud computing infrastructure. Consequently, such approaches are less accessible to researchers and organizations with limited resources.

3. Proposed Pipeline

Following the second approach, deriving a Small Language Model (SLM) from an existing LLM, this study proposes an end-to-end pipeline optimized for resource-constrained environments and capable of transforming unstructured domain-specific knowledge into a fully operational SLM. As illustrated in Figure 2, the proposed architecture is divided into a Development Phase and an Inference Phase. During the development phase, a careful selection of teacher and student models is performed. In contrast to conventional approaches, the selection adopts an isomorphic teacher–student configuration, where both models are selected from the same parameter scale. The process starts by addressing the unstructured, domain-specific knowledge through a hybrid extraction pipeline that stores content in a structured NoSQL database. This database is first used to guide the teacher model in generating a synthetic dataset for student model fine-tuning. Specifically, the teacher model is provided with the structured knowledge base along with a grounding prompt that enforces faithful generation of instruction-tuning pairs. The synthesized dataset is used to adapt the general knowledge of the student through QLoRA fine-tuning by targeting specific self-attention and feed-forward projection modules. The trained adapter weights are subsequently merged with the base student model, followed by 4-bit Post-Training Quantization (PTQ). The structured NoSQL database is also used to construct a vector database that the domain-specific SLM accesses during inference through the Retrieval-Augmented Generation (RAG) framework. During inference, users’ queries are augmented with retrieved contextual chunks and fed into the offline SLM to generate accurate responses grounded in the domain knowledge base.
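The inference-phase flow described above (retrieve relevant chunks, then augment the query) can be sketched as follows. This is a simplified illustration, not the study's implementation: a bag-of-words vector stands in for a learned embedding model, an in-memory list stands in for the vector database, and the function names, prompt wording, and sample chunks are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: term-frequency vector over lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks, k: int = 2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks, k: int = 2) -> str:
    """Augment the user query with retrieved context before generation."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    # Placeholder chunks, not taken from any guideline.
    chunks = [
        "Section A describes blood pressure measurement steps.",
        "Section B lists asthma inhaler types.",
        "Section C covers diabetes foot care.",
    ]
    print(build_prompt("How is blood pressure measured?", chunks, k=1))
```

The resulting prompt string is what would be passed to the offline SLM; swapping the toy `embed` for a real embedding model leaves the rest of the flow unchanged.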

3.1. Teacher and Student Selection

Driven by the objective of ensuring the pipeline remains computationally feasible under memory-constrained environments, teacher and student model selection is performed by identifying models that provide the optimal balance between performance, memory requirements, compressibility, and licensing constraints. While closed-source LLMs like GPT-4 or Claude Opus represent the best choice in terms of instruction generation quality due to their strong reasoning capabilities, their use for distillation is constrained by restrictive licenses, excessively large parameter counts, and black-box access. Conversely, open-source LLMs provide a white-box environment and fewer license restrictions, enabling greater compressibility and advanced distillation methods while still achieving high performance [17]. The Phi-3 family, developed by Microsoft, is a transformer-based model released under the permissive MIT license and designed for efficient deployment on resource-constrained devices. Trained on synthetic “textbook-quality” data, Phi-3 demonstrates strong reasoning performance comparable to larger models while maintaining a compact size [45]. It is available in multiple variants such as Mini (3.8B), Small (7B), and Medium (14B) and supports context windows ranging from 4K to 128K tokens. The Qwen-2.5 family, developed by Alibaba Cloud and released under the Apache 2.0 license, includes models ranging from 0.5B to 72B parameters. Thanks to its training on large multilingual datasets, Qwen-2.5 demonstrates strong multilingual capabilities and robust performance across reasoning, coding, and long-context tasks, with context windows ranging from 32K to 1M tokens depending on model size [46]. The Llama-3 family, developed by Meta, represents state-of-the-art performance among open-weight models and consists of three variants: 8B, 70B, and 405B.
Employing a dense Transformer architecture with grouped query attention and a 128K-token context window, Llama models are able to achieve performance comparable to proprietary models [47].

Among these open-source alternatives, Qwen-2.5 (0.5B, 1.5B, and 3B) and Microsoft Phi-3 Mini (3.8B) are selected as student models because of their memory footprint and reported performance. Qwen-2.5 is chosen for its multilingual training, with strong support for Southeast Asian languages such as Bahasa Melayu, and because the availability of multiple parameter sizes enables systematic evaluation across different SLM scales. Phi-3 Mini (3.8B) is selected because it offers an exceptional reasoning-to-parameter ratio and because studies have demonstrated its ability to run on consumer devices such as smartphones after quantization to 4-bit precision, occupying approximately 1.8 GB of memory [48]. Moreover, among models within the same parameter range, Qwen-2.5 (3B) is also selected as the teacher model because it offers the optimal balance between reasoning quality, multilingual alignment, and computational efficiency. This design choice is motivated by the need to maintain strong model capability while ensuring feasibility under resource-constrained settings. Table 2 summarizes the licensing, parameter counts, and original memory footprints of the four open-source models selected for this study.
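As a rough back-of-the-envelope check on the memory figures discussed above, the weight storage of a model can be estimated as parameters × bits per weight ÷ 8. The sketch below applies this to the nominal parameter counts of the four selected models; it is an approximation that counts weights only and ignores quantization metadata, activations, and the KV cache, so the values are not the footprints reported in Table 2.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts taken from the model names.
MODELS = {
    "Qwen-2.5 (0.5B)": 0.5e9,
    "Qwen-2.5 (1.5B)": 1.5e9,
    "Qwen-2.5 (3B)": 3.0e9,
    "Phi-3 Mini (3.8B)": 3.8e9,
}

if __name__ == "__main__":
    for name, params in MODELS.items():
        fp16 = weight_memory_gb(params, 16)
        int4 = weight_memory_gb(params, 4)
        print(f"{name:18s} FP16 ~ {fp16:.2f} GB   4-bit ~ {int4:.2f} GB")
```

For Phi-3 Mini the 4-bit estimate is about 1.9 GB, in line with the approximately 1.8 GB cited above.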

3.2. Data Extraction

To handle heterogeneous data formats and convert documents from a local directory into a machine-readable schema, the data extraction pipeline employs a hybrid extraction approach that integrates multiple extraction tools within a unified workflow. Text and tables are processed using a PDF parsing engine, while images are handled by an image extraction module integrated with Optical Character Recognition (OCR) for scanned content. The extracted elements, along with associated metadata such as file names and page numbers, are then stored in a structured NoSQL database. This storage is used to obtain both the synthetic tuning dataset for QLoRA training and the external knowledge for RAG inference.

For this study, the 101 PDF Clinical Practice Guidelines (CPGs) endorsed by the Ministry of Health [20] are converted into a machine-readable format using pdfplumber for extracting text and tables, and PyMuPDF combined with pytesseract OCR for processing images and scanned content. For data storage, MongoDB was selected due to its flexibility in handling heterogeneous and semi-structured documents such as texts, images and tables, enabling efficient storage and retrieval of diverse content formats required for downstream indexing and inference. Figure 3 illustrates the overall workflow of the proposed data extraction pipeline, while the following section details its implementation. After initializing configurations, the system connects to MongoDB and iterates through each PDF in the target folder. Each file is opened twice: once with pdfplumber to target text and tables, and once with PyMuPDF to handle images. For each page, the pipeline initially attempts to extract text and tables using pdfplumber, followed by a conditional check to determine if the text extraction was successful. If the text output is empty, suggesting a scanned page or image, the pipeline switches to using PyMuPDF and applies Optical Character Recognition (OCR) via pytesseract to extract it. All extracted contents are stored with metadata (PDF name, page number, etc.) as a JSON document into a MongoDB collection. This loop continues page-by-page and file-by-file until the entire corpus is processed.
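The page-level logic described above (try pdfplumber first, fall back to OCR when no text is found, then store the result with metadata) might look roughly like the following sketch. It is an assumption-laden illustration rather than the authors' script: the record field names, the 300 dpi rendering resolution, and the function names are hypothetical, and `collection` is expected to be a pymongo collection (for example `MongoClient(uri)[db_name][collection_name]`, with names of your choosing).

```python
import io
from typing import Callable, Dict, List, Optional


def build_page_record(pdf_name: str, page_number: int, text: Optional[str],
                      tables: List[list], ocr: Callable[[], str]) -> Dict:
    """Apply the text-first, OCR-fallback rule and attach metadata."""
    source = "pdfplumber"
    if not (text or "").strip():      # empty text suggests a scanned page or image
        text, source = ocr(), "ocr"
    return {"pdf_name": pdf_name, "page_number": page_number,
            "text": text, "tables": tables, "source": source}


def extract_pdf(path: str, collection) -> int:
    """Process one PDF page by page and insert one document per page."""
    # Third-party imports are kept local so the pure logic above has no dependencies.
    import fitz  # PyMuPDF
    import pdfplumber
    import pytesseract
    from PIL import Image

    inserted = 0
    # The file is opened twice: pdfplumber for text and tables, PyMuPDF for rendering.
    with pdfplumber.open(path) as pdf, fitz.open(path) as doc:
        for index, page in enumerate(pdf.pages):

            def ocr(index: int = index) -> str:
                # Render the page to an image and run Tesseract on it.
                pix = doc[index].get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                return pytesseract.image_to_string(image)

            record = build_page_record(path, index + 1, page.extract_text(),
                                       page.extract_tables(), ocr)
            collection.insert_one(record)
            inserted += 1
    return inserted
```

Because the OCR step is passed in as a callable, it only runs for pages where pdfplumber returns no text, which keeps the comparatively slow rendering and recognition off the common path.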

3.3. Synthetic Tuning Dataset Generation

The NoSQL database is used to generate a specialized training dataset for QLoRA fine-tuning through a teacher–student knowledge distillation pipeline following contemporary state-of-the-art practices. However, unlike conventional distillation pipelines that typically rely on frontier large language models as a gold standard, distributed computing infrastructure, or cloud-based inference services, this study introduces a resource-constrained adaptation specifically designed for execution on a single consumer-grade GPU. The first design choice is the adoption of an isomorphic teacher–student distillation strategy instead of a classical large-to-small approach. In practice, rather than using a substantially larger teacher model (e.g., 70B parameters) to transfer knowledge to a smaller student, this study adopts teacher and student models of equivalent sizes. This choice reduces computational and memory requirements, enabling execution without cloud dependency or specialized computing infrastructure. The second design choice is the application of authoritative grounding during synthetic data generation. The teacher model is strictly forced by a grounding prompt to synthesize instruction-tuning pairs according to the provided domain-specific document. This reduces the risks of general-knowledge noise and hallucinations inside the final synthetic dataset. The third design choice is related to the output format during synthetic data generation. To reduce computational overhead and improve generation speed, constrained-decoding libraries that enforce JSON-compliant output during generation by modifying token probabilities are replaced by a post-hoc native RegEx parsing approach, where the teacher model generates responses through standard batched inference and subsequently Python regular expressions extract required fields. 
Finally, unlike conventional approaches that rely on multi-GPU clusters or distributed computing infrastructure, the proposed implementation uses GPU-accelerated parallel processing, batch-size calibration, and reduced tensor precision to improve computational efficiency and enable execution on a single GPU.
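The post-hoc RegEx parsing step can be sketched as follows. The labelled layout assumed here (`Instruction:`, `Input:`, `Output:`) and the function name `parse_teacher_response` are illustrative assumptions, since the exact prompt and response format are not given in the text.

```python
import json
import re
from typing import Dict, Optional

# Assumed layout of one teacher response: three labelled fields, in this order.
FIELD_PATTERN = re.compile(
    r"Instruction:\s*(?P<instruction>.*?)\s*"
    r"Input:\s*(?P<input>.*?)\s*"
    r"Output:\s*(?P<output>.*)",
    re.DOTALL | re.IGNORECASE,
)


def parse_teacher_response(raw: str) -> Optional[Dict[str, str]]:
    """Extract the three fields from free-form text; return None if malformed."""
    match = FIELD_PATTERN.search(raw)
    if match is None:
        return None
    fields = {name: value.strip() for name, value in match.groupdict().items()}
    # A pair without an instruction or an output is useless for tuning.
    if not fields["instruction"] or not fields["output"]:
        return None
    return fields


sample = (
    "Sure, here is the pair.\n"
    "Instruction: What is the first-line treatment?\n"
    "Input: Adult patients\n"
    "Output: Refer to the guideline section on treatment."
)
parsed = parse_teacher_response(sample)
print(json.dumps(parsed))
```

Responses that do not match the pattern are simply discarded, which trades a small loss of samples for generation without constrained decoding.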

Building upon these design choices, the following section details the optimizations used in this study. To achieve authoritative grounding, the pipeline retrieves documents directly from the MongoDB database and injects them into the teacher model’s prompt context, forcing the model to generate instruction input–output pairs strictly anchored to the retrieved content. GPU parallelization is achieved by offloading computational workloads from the CPU to the GPU through CUDA kernels. These kernels leverage the Single Instruction, Multiple Threads (SIMT) execution model to execute the same instruction sequence concurrently across multiple data elements. In practice, GPU execution is enabled at configuration time by setting device_map = "cuda" together with BATCH_SIZE = 8. The first setting places the model parameters, and therefore the inference operations, on the CUDA device, while the batch size makes the teacher process eight chunks in parallel. The batch size is capped at eight because of the available GPU VRAM capacity of 10 GB. Memory and hardware utilization are further optimized by using torch_dtype = torch.float16 to hold the weights in 16-bit rather than 32-bit precision. This halves the VRAM footprint of the weights and enables the use of GPU Tensor Cores for faster computation without materially compromising output quality. To maintain system stability during large-scale data processing, resilient database pagination is implemented using DOC_BATCH_SIZE = 50, which prevents memory overflows and MongoDB cursor timeouts by retrieving and processing documents in controlled batches. The script also incorporates a continuous checkpointing mechanism: it opens the output file in append mode ('a'), tracks the start_offset of existing rows, and calls f.flush() frequently so that progress is saved in near real time.
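The checkpointing and batching behaviour can be sketched with the standard library alone. This is a simplified illustration rather than the authors' script: the helper names are hypothetical, the teacher model is replaced by a stand-in function, and it is assumed that exactly one row is written per chunk, so that the number of existing rows equals the number of chunks already processed.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Iterator, List

BATCH_SIZE = 8  # chunks processed per teacher call (value taken from the text)


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_existing_rows(path: str) -> int:
    """Number of rows already saved, used as the resume offset."""
    if not os.path.exists(path):
        return 0
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def generate_with_checkpoint(
    chunks: List[str],
    generate: Callable[[List[str]], List[Dict[str, str]]],
    path: str,
) -> int:
    """Append one JSONL row per chunk, skipping chunks saved in earlier runs."""
    start_offset = count_existing_rows(path)
    written = 0
    with open(path, "a", encoding="utf-8") as f:  # append mode keeps old rows
        for batch in batched(chunks[start_offset:], BATCH_SIZE):
            for row in generate(batch):
                f.write(json.dumps(row) + "\n")
                written += 1
            f.flush()  # persist progress after every batch
    return written


def fake_teacher(batch: List[str]) -> List[Dict[str, str]]:
    """Stand-in for batched teacher inference (assumption for illustration)."""
    return [{"instruction": c, "input": "", "output": c.upper()} for c in batch]


out_path = os.path.join(tempfile.mkdtemp(), "dataset.jsonl")
chunks = [f"chunk {i}" for i in range(20)]
first = generate_with_checkpoint(chunks[:10], fake_teacher, out_path)  # interrupted run
second = generate_with_checkpoint(chunks, fake_teacher, out_path)      # resumed run
total_rows = count_existing_rows(out_path)
print(first, second, total_rows)
```

The resumed run writes only the chunks that were not saved before the interruption, so no sample is duplicated.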

The pipeline starts by accessing and extracting from MongoDB JSON object documents of Malaysian CPGs. Each document is segmented into 1000-character chunks with a 100-character overlap to preserve contextual continuity across sections. These chunks are subsequently processed by a Qwen2.5-3B-Instruct teacher model, which transforms each segment into structured instruction–input–output pairs through prompt-based generation. The generated outputs are validated and incrementally stored in a JSONL file, forming the final synthetic tuning dataset. For the Malaysian CPGs domain, the resulting dataset comprises 15,903 samples. Figure 4 shows the flowchart of the optimized teacher-student knowledge distillation pipeline.
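The segmentation step can be illustrated with a short sketch of fixed-size chunking using the 1000-character size and 100-character overlap stated above. The function name `chunk_text` and the handling of the final, shorter chunk are assumptions.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # the last chunk reached the end
            break
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

With this scheme, the last 100 characters of one chunk reappear at the start of the next, which is what preserves context across chunk boundaries.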

3.4. QLoRA Fine-Tuning

The tuning dataset is used to adapt a general-purpose, efficient LLM, referred to as the student or base model, to a domain-specific task through Domain-Specific Supervised Fine-Tuning (SFT) implemented using Parameter-Efficient Fine-Tuning (PEFT). Instead of conventionally retraining billions of parameters, PEFT freezes the base model and trains only a tiny fraction of newly added parameters. In this way, gradients are computed only for a small portion of parameters called LoRA adapters, reducing the number of trainable parameters involved in backpropagation [49]. In this study, the QLoRA approach is preferred over the LoRA technique because it quantizes the base model in 4-bit precision format before applying LoRA training, reducing memory requirements while maintaining the same training effectiveness.
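To make the parameter saving concrete, the sketch below counts the trainable parameters of a single LoRA adapter, which consists of two low-rank factors attached to one frozen weight matrix. The layer dimensions and the rank used here are illustrative assumptions and are not the values of Table 3.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen full-rank weight matrix."""
    return d_in * d_out


# Hypothetical 3072 x 3072 projection with rank 16 (illustrative values only).
d, r = 3072, 16
adapter = lora_trainable_params(d, d, r)
dense = dense_params(d, d)
fraction = adapter / dense
print(adapter, dense, f"{fraction:.4%}")
```

Under these assumed values, the adapter holds roughly 1% of the parameters of the layer it adapts, which is why gradients and optimizer states remain small.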

The pipeline is divided into three main phases: pre-training, training and merging. In the first phase, the generated tuning dataset containing instruction input–output pairs is loaded, cleaned and split into training and validation subsets. Then, the student model and tokenizer are loaded, followed by instruction formatting. Subsequently, memory optimization techniques such as gradient checkpointing are applied to reduce VRAM usage. QLoRA hyperparameters such as LORA_R, LORA_ALPHA and LORA_DROPOUT are configured as per Table 3 to control the adaptation capacity and training stability. After the base model weights are quantized to 4-bit and frozen, LoRA adapters configured with these hyperparameters are injected into the model architecture. In contrast to traditional approaches, the trainable adapter matrices are configured to target specific projection modules. Specifically, q_proj, k_proj, v_proj, and o_proj are targeted to adapt the self-attention mechanism by refining how tokens relate to each other. By targeting these layers, the student model can assign more importance to domain-specific terms, thereby improving its ability to understand questions. The gate_proj, up_proj, and down_proj layers are targeted to adapt the feed-forward network, which is responsible for transforming and refining internal representations. This supports the student model in semantic understanding and content retrieval. The pipeline then enters the training phase, in which the SFTTrainer executes the training loop across epochs and batches, performing forward passes in bf16 precision and optimizing only the adapter weights against the instruction dataset. To optimize memory usage and prevent out-of-memory errors, the training loop uses a paged optimizer (paged_adamw_32bit), which dynamically transfers optimizer states to system RAM during VRAM spikes. 
The training output consists of a compact set of learned adaptation parameters that encode domain-specific reasoning and stylistic characteristics learned during the QLoRA training. These lightweight adapters are then merged during the merging phase with the student model into a unified domain-specialized LLM in 16-bit. Importantly, the student model is loaded in high-precision 16-bit (float16) format to ensure mathematical accuracy during the combination process. Figure 5 shows the flowchart of this custom pipeline.
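The merging phase can be understood through the standard LoRA update, in which the adapter product is scaled and added to the frozen weight: W' = W + (alpha / r) · B · A. The toy sketch below applies this formula with plain Python lists; the matrices and the values of alpha and r are made up for illustration and do not come from the study.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w: Matrix, a: Matrix, b: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A, the weight after merging the adapter."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy 2 x 2 weight with a rank-1 adapter (illustrative numbers only).
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]    # rank x d_in  = 1 x 2
b = [[0.5], [0.0]]  # d_out x rank = 2 x 1
merged = merge_lora(w, a, b, alpha=2.0, rank=1)
print(merged)
```

After merging, the adapter no longer exists as a separate component, so inference uses a single weight matrix per layer.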

3.5. Compression

The domain-specialized LLM in FP16 format is converted into 4-bit format through the compression phase to finally obtain the domain-specialized SLM, referred to as SpecioSLM. The compression phase is necessary because high-precision formats such as FP16 or FP32 often suffer from memory bandwidth bottlenecks during inference, where the system struggles to transfer massive model weights to the processor’s compute units quickly enough [19]. Compression allows the model to be deployed efficiently on resource-constrained hardware by reducing memory requirements and computational overhead while preserving most of its predictive capability.
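A back-of-the-envelope estimate shows why the bit width dominates the memory footprint. The sketch below computes the idealized size of the weights alone for a 3.8B-parameter model; it deliberately ignores quantization scales, tensors kept at higher precision, and file metadata, which presumably explains why real quantized files are larger than this lower bound.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Idealized size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 3.8e9  # nominal parameter count of a 3.8B model

fp16_gb = weight_footprint_gb(N_PARAMS, 16)
int4_gb = weight_footprint_gb(N_PARAMS, 4)
reduction = 1 - int4_gb / fp16_gb
print(f"FP16: {fp16_gb:.2f} GB, 4-bit: {int4_gb:.2f} GB, reduction: {reduction:.0%}")
```

Moving from 16 to 4 bits per weight can remove at most three quarters of the weight storage in this idealized view, which is the headroom that makes an 8 GB device a plausible target.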

To achieve this, the system sets up a runtime environment with llama.cpp binaries and Python dependencies, converts the model into the GGUF format and quantizes to 4-bit. The specific design choice of using 4-bit Post-Training Quantization (PTQ) through llama.cpp is primarily motivated by the strict unified memory constraints of edge devices, which typically provide only 8 GB of unified memory. First, compared with pruning and knowledge distillation, quantization provides a more substantial reduction in memory usage while requiring lower implementation complexity. Furthermore, studies have shown that quantized models can retain semantic understanding and reasoning performance that is nearly comparable to their 16-bit counterparts, with only minimal accuracy degradation [49]. Second, PTQ is preferred over Quantization-Aware Training (QAT) because it can be applied after model training is completed. In contrast, QAT requires additional fine-tuning and prolonged training cycles, increasing both computational cost and implementation complexity [50]. Third, 4-bit quantization represents the optimal “sweet spot” for practical deployment, providing a good balance between memory efficiency and model quality [19]. Recent studies have demonstrated that this precision level can reduce the memory footprint by approximately 76%, allowing both the model parameters and the operational context window to fit comfortably within the limited memory resources available on edge devices [18]. Moreover, by transferring more weight data to the compute units in the same timeframe, the memory bandwidth bottleneck is alleviated. This results in faster inference speed compared to 8-bit or 16-bit formats, while still retaining over 98% of the baseline model’s capability [19]. Finally, by leveraging the llama.cpp framework, inference efficiency is further enhanced through its GGUF file format. 
This format is optimized for fast model loading and direct memory access with minimal overhead through memory mapping. Practically, rather than fully dequantizing entire weight tensors into higher-precision formats before computation, llama.cpp employs highly optimized C++ kernels that perform dequantization on-the-fly during execution [19]. This dequantize-and-compute approach ensures that only a small block of weights needs to be temporarily stored in high-precision format in the processor’s registers. In this way, more data can be transferred in the same amount of time, directly increasing the token generation rate.
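The dequantize-and-compute idea can be illustrated with a deliberately simplified sketch. It uses symmetric block-wise 4-bit quantization with one scale per block of 32 weights and rescales each block during the dot product. This is an assumption-laden toy and not the actual llama.cpp quantization formats or kernels.

```python
from typing import List, Tuple

BLOCK = 32  # weights per block (assumed for illustration)

QuantBlock = Tuple[float, List[int]]


def quantize_block(weights: List[float]) -> QuantBlock:
    """Symmetric 4-bit quantization: one float scale plus integer codes in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, codes


def quantize(weights: List[float]) -> List[QuantBlock]:
    """Quantize a weight vector block by block."""
    return [quantize_block(weights[i:i + BLOCK]) for i in range(0, len(weights), BLOCK)]


def dot_on_the_fly(blocks: List[QuantBlock], x: List[float]) -> float:
    """Dot product that rescales one block at a time instead of
    materializing the whole weight vector in floating point."""
    total = 0.0
    offset = 0
    for scale, codes in blocks:
        block_sum = sum(q * xi for q, xi in zip(codes, x[offset:offset + len(codes)]))
        total += scale * block_sum  # the scale is applied once per block
        offset += len(codes)
    return total


weights = [((i % 13) - 6) / 10 for i in range(64)]
x = [1.0] * 64
exact = sum(w * xi for w, xi in zip(weights, x))
approx = dot_on_the_fly(quantize(weights), x)
print(exact, approx)
```

Only the 4-bit codes and one scale per block need to be read from memory, which is the reason less data has to cross the memory bus for each generated token.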

3.6. RAG Framework Setup and Runtime Inference

For testing and benchmarking purposes, a question-answering system leveraging SpecioSLM with RAG is developed. To support it, the NoSQL database is also converted into a searchable vector index that is used exclusively for RAG during the inference stage. This vector index serves as an external knowledge source to retrieve relevant domain-specific content at query time and ground generated responses with related references to improve accuracy and provide traceability. Following a standard preprocessing pipeline, each database record is transformed into a unified document representation and partitioned into smaller segments using a text chunking algorithm. This study employs a fixed-size chunking algorithm, which divides the text into chunks of 1000 characters with a 100-character overlap. This size is selected because research demonstrates that a large chunk size significantly improves retrieval performance for long documents that require broader contextual understanding [51], while the overlap helps preserve semantic continuity where a chunk boundary falls in the middle of a sentence. Next, each chunk is transformed into a high-dimensional vector representation through an embedding model. The sentence-transformers/all-MiniLM-L6-v2 embedding model is widely adopted because of its strong balance between semantic accuracy and computational efficiency [52]. However, this study employs the NeuML/pubmedbert-base-embeddings model to align with the domain of the selected test case. Trained on biomedical literature, this embedding model is better suited to capture the semantic meaning of medical terminology.

Finally, the generated dense vectors are indexed using a similarity search library and stored locally, creating an external knowledge source ready for fast retrieval. During the interactive inference loop, the system receives user queries as input, converts them into vector embeddings, and performs a similarity search over the local vector database generated to retrieve the top-k most relevant text chunks. These chunks are concatenated and provided to the SLM along with the user’s query and a prompt to enforce strict adherence to the retrieved context. SpecioSLM then processes this augmented input to generate a context-grounded response along with corresponding source metadata from the retrieved documents to support traceability. The complete architectural workflow, encompassing both the offline indexing pipeline and the runtime retrieval loop, is illustrated in Figure 6.
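The runtime retrieval loop can be sketched in a few lines. The example below uses tiny hand-written vectors in place of real embeddings, a linear scan in place of a similarity search library, and a hypothetical prompt template. All names, sources and texts are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], index: List[Dict], k: int = 3) -> List[Dict]:
    """Return the k index entries most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, hits: List[Dict]) -> str:
    """Concatenate retrieved chunks with their sources and the user question."""
    context = "\n\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Toy index: three chunks with made-up 3-dimensional embeddings.
index = [
    {"text": "Hypertension management overview.", "source": "cpg_a.pdf p.12", "vector": [1.0, 0.0, 0.0]},
    {"text": "Diabetes screening criteria.", "source": "cpg_b.pdf p.4", "vector": [0.0, 1.0, 0.0]},
    {"text": "Blood pressure targets.", "source": "cpg_a.pdf p.15", "vector": [0.9, 0.1, 0.0]},
]
hits = top_k([1.0, 0.0, 0.0], index, k=2)
prompt = build_prompt("What is the blood pressure target?", hits)
print(prompt)
```

Keeping the source label next to each chunk is what allows the final answer to be reported together with its supporting references.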

3.7. Edge Validation

For edge validation, a survey was first conducted to identify the most suitable edge device for SpecioSLM deployment. The NVIDIA Jetson Orin NX 8 GB was selected as the target edge device due to its high computational throughput and memory bandwidth, which enable efficient execution of quantized SLMs at the edge, as demonstrated by Lu et al. [13]. Specifically, the device delivers up to 157 TOPS, supporting CUDA-accelerated parallel computation and concurrent execution of auxiliary pipeline components such as RAG thanks to the NVIDIA JetPack SDK and the Ampere GPU with 32 Tensor Cores. Compared to CPU-centric devices such as the Raspberry Pi 5 and older Volta-based platforms like Jetson Xavier NX, the Orin NX offers AI computing performance suitable for on-device language model inference [48]. Considering the target device specifications, a preliminary ARM-based simulation was conducted to verify hardware and software compatibility. Specifically, a Docker environment integrated with QEMU was configured to mimic the target device specifications, as shown in Table 4.

4. Experiment Setup

All experiments are conducted across two local machines, a primary desktop workstation and a secondary laptop, to ensure reproducibility and to evaluate the pipeline’s feasibility under different hardware environments. The primary desktop workstation is equipped with an AMD Ryzen 9 3900X (12-core) CPU, an NVIDIA RTX 3080 GPU with 10 GB VRAM, 32 GB DDR4 RAM, and 512 GB SSD storage. It has been used throughout the entire SLM development phase, where it was configured to leverage NVIDIA CUDA technologies to enable highly efficient parallel programming and GPU acceleration, especially in resource-intensive tasks, such as synthetic fine-tuning dataset generation and QLoRA-based fine-tuning. The secondary machine, an Acer Predator Helios Neo 16, features a 13th Gen Intel® Core™ i7-13700HX (2.10 GHz) CPU, an NVIDIA GeForce RTX 4050 Laptop GPU (6 GB GDDR6 VRAM), 32 GB DDR5 RAM, and 1 TB SSD storage. The laptop was utilized alongside the workstation to benchmark the fully trained SLMs, rigorously assessing inference performance and resource utilization across different consumer-grade hardware platforms. Development and benchmarking are performed on Windows 11. Finally, for the ARM-based simulation, a containerized Ubuntu 22.04 LTS environment is implemented using Docker with QEMU. Table 5 summarizes the detailed hardware specifications, software configurations, and designated roles for each of these experimental environments.

4.1. Datasets

For this study, three different datasets were constructed starting from domain-specific knowledge. The knowledge used consists of clinical practice guideline documents published by the Malaysian Ministry of Health through the Malaysian Health Technology Assessment Section (MAHTAS) [53]. Each dataset is created using a different methodology and serves a distinct purpose:

Synthetic Fine-Tuning Dataset: A collection of 15,903 instruction input–output samples in JSON Alpaca format, generated by the 3B teacher model, with one sample produced for each document object retrieved from the NoSQL database. This dataset is used to transfer domain-specific knowledge and linguistic style to the student models through QLoRA fine-tuning. During training, the dataset is split into 90% training and 10% validation subsets.
Test Dataset: A collection of 141 question–answer pairs manually created by human annotators. The questions are generated by randomly selecting three sections from each CPG PDF document, reviewing their contents, and formulating questions based on the identified context. This dataset is used to evaluate the performance of SpecioSLM.
Vector Database: A collection of dense vectors generated from CPG documents using an embedding model. Vectors are indexed and stored in local storage to enable similarity search retrieval during the inference stage of SpecioSLM.
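For the synthetic fine-tuning dataset, the Alpaca-style record layout and the 90%/10% split can be sketched as follows. The shuffling, the fixed seed, and the rounding rule are assumptions made for illustration, as the text does not state how the split is drawn.

```python
import random
from typing import Dict, List, Tuple

Sample = Dict[str, str]


def split_dataset(
    samples: List[Sample], val_fraction: float = 0.10, seed: int = 42
) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle a copy of the samples and hold out `val_fraction` for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Alpaca-style records: instruction, optional input, and output.
samples = [
    {"instruction": f"Question {i}", "input": "", "output": f"Answer {i}"}
    for i in range(100)
]
train, val = split_dataset(samples)
print(len(train), len(val))
```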

4.2. Evaluation Metrics

Based on the benchmarking metrics discussed in Section 2.6, this study employs TPS, TTFT, and BERTScore to evaluate the four SpecioSLMs developed. Specifically, Tokens Per Second (TPS) is used to measure the throughput and generation speed of the model, providing the number of tokens the model can generate per second. A performance threshold of TPS > 30 is defined to ensure near-instantaneous generation. Time to First Token (TTFT) is used to assess model responsiveness by measuring the latency required for the model to produce its first output token, with a target threshold of TTFT < 1 s to support responsive conversational interaction. BERTScore is used to evaluate response quality by measuring semantic similarity between the generated output and the reference text, rather than simple word matching. A BERTScore greater than 90 (BERTScore > 90) is set to indicate high semantic fidelity between generated and reference outputs. However, it is crucial to acknowledge that, although BERTScore is an effective, scalable, and automated measure of semantic fidelity, it does not assess the factual correctness of the generated text. The score is computed exclusively against the reference answers, so any error in the test data propagates directly into the score. Therefore, a high BERTScore does not guarantee clinical correctness or patient safety.
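The two latency metrics can be computed from timestamps around a streamed generation, as in the sketch below. It assumes that TPS is defined as the total number of generated tokens divided by the total wall-clock time, including the wait for the first token, which the text does not specify. The function names are hypothetical, and a fake clock is used so that the example is deterministic.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(
    token_stream: Iterable[str], clock: Callable[[], float] = time.perf_counter
) -> Dict[str, float]:
    """Measure TTFT and TPS while consuming a stream of generated tokens."""
    start = clock()
    first_token_at = None
    n_tokens = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = clock()  # the first token has just arrived
        n_tokens += 1
    end = clock()
    if first_token_at is None:
        raise ValueError("the stream produced no tokens")
    elapsed = end - start
    return {
        "ttft_s": first_token_at - start,
        "tps": n_tokens / elapsed if elapsed > 0 else float("inf"),
        "tokens": float(n_tokens),
    }


def meets_targets(m: Dict[str, float], min_tps: float = 30.0, max_ttft_s: float = 1.0) -> bool:
    """Check the thresholds used in this study: TPS > 30 and TTFT < 1 s."""
    return m["tps"] > min_tps and m["ttft_s"] < max_ttft_s


# Deterministic demo: a fake clock returning start, first-token and end times.
ticks = iter([0.0, 0.2, 2.0])
metrics = measure_stream((f"tok{i}" for i in range(100)), clock=lambda: next(ticks))
print(metrics, meets_targets(metrics))
```

In a real benchmark the default `time.perf_counter` clock would be used, and the token stream would come from the inference runtime.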

4.3. Benchmarking Strategy

Quantitative benchmarks were conducted on all four SpecioSLM variants under the same evaluation criteria and environmental conditions to ensure consistency and reproducibility. Specifically, the same test dataset, inference settings, and hyperparameters were applied across all evaluations. Each query was executed 50 times, and the reported results represent mean values across runs. This benchmarking protocol follows recommended practices for reliable latency and throughput measurement, ensuring the system remains responsive even under repeated usage [54].

Benchmarks were executed across two distinct hardware setups: the primary desktop workstation and the secondary laptop. Performance metrics were collected on both systems, and the final results were computed as the mean across the two platforms. This extra evaluation was conducted to reduce hardware-specific bias and ensure model performance remains stable across different hardware configurations.
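The aggregation described above, a mean over repeated runs followed by a mean across the two platforms, can be written compactly. The numbers in the example are placeholders and are not measurements from this study.

```python
from statistics import mean
from typing import Dict, List


def aggregate(runs_by_platform: Dict[str, List[float]]) -> float:
    """Average repeated runs per platform, then average across platforms."""
    per_platform = {name: mean(values) for name, values in runs_by_platform.items()}
    return mean(list(per_platform.values()))


# Placeholder TPS values; the study uses 50 runs per query on each machine.
tps_runs = {
    "workstation": [100.0, 110.0],
    "laptop": [80.0, 90.0],
}
final_tps = aggregate(tps_runs)
print(final_tps)
```

Averaging per platform first gives each machine equal weight, even if the number of recorded runs differs between them.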

5. Results and Discussion

The four student models, Qwen2.5-0.5B, Qwen2.5-1.5B, Qwen2.5-3B, and Phi-3-Mini-3.8B, were trained using the same end-to-end pipeline illustrated in Section 3 to ensure a fair and consistent comparison between the resulting SpecioSLM models. To maintain consistency, the same teacher model, Qwen-2.5 (3B), was employed for synthetic tuning dataset generation. Training configurations, hyperparameters, and optimization strategy were also kept identical. All training procedures were conducted on the same desktop workstation to eliminate any hardware-induced variation, and an identical setup procedure was applied to each model. For each experiment, a separate development environment was standardized using a Python 3.10 virtual environment with the same libraries fixed to specific versions. Then, the entire pipeline is executed consistently for each student model with identical hyperparameter settings and system configurations.

5.1. Experiments

The first experiment was conducted using the ultra-lightweight Qwen-2.5 (0.5B) model as the student. It is used as a starting point because, in theory, the smallest model should achieve the best inference performance, establishing an upper bound on Tokens Per Second (TPS) and a lower bound on Time to First Token (TTFT) under extreme parameter compression. With a compact size of 373.71 MB, the resulting SLM achieved a generation speed of 193.81 TPS and a very low TTFT of 0.1265 s, confirming the hypothesis that highly compact models favor rapid generation and low latency. However, this efficiency comes at the expense of semantic fidelity, which is the lowest among the four SLMs at 82.84, well below the target of 90. This reflects the limited reasoning capability and world knowledge of such a small model, which increases the risk of hallucinations to a level that is not tolerable in clinical settings.

In the second experiment, the 1.5B variant of Qwen-2.5 was evaluated to study the relationship between model size, inference throughput and semantic fidelity. The resulting SLM, with a model size of 934.69 MB, demonstrated measurable improvements in semantic performance, achieving a BERTScore of 83.74. This improvement was accompanied by a moderate reduction in processing speed, with throughput decreasing to 152.46 tokens per second (TPS) and Time-to-First-Token (TTFT) increasing slightly to 0.1476 s, which still remain well within the targets of 30 TPS and 1 s TTFT. These findings suggest that increasing parameter capacity enhances semantic accuracy, highlighting the need to identify an optimal balance between computational performance and clinical accuracy.

For the third experiment, Qwen-2.5 (3B), the same model used as the teacher, is employed as the student model to investigate whether using the same architecture and scale would influence overall performance. Consistent with the observed scaling pattern, inference performance showed a gradual reduction in processing speed, with throughput decreasing to 113.12 TPS and TTFT increasing to 0.1841 s, while model size increased to 1834.82 MB. In contrast, semantic fidelity continued its upward trend, with the BERTScore improving to 84.12. As a result, throughput and latency remain well within the predefined operational thresholds, while the fidelity score still falls short of the targeted 90-point benchmark, making this model the practical upper bound for the Qwen family within this study.

The final experiment evaluated Microsoft Phi-3-Mini (3.8B). Throughput decreased further to 91.59 TPS, while TTFT, at 0.1705 s, was slightly lower than that of Qwen-2.5 (3B). This modest throughput cost was accompanied by a substantial gain of about 6 points in BERTScore, surpassing the fidelity threshold of 90 for clinical settings. Model size increased modestly to 2281.66 MB, showing the highest size reduction percentage of 68.7% compared to the original model size. These results are consistent with the exceptional reasoning-to-parameter ratio reported by Abdin et al. [32] for Phi-3-Mini (3.8B), where the model, even after compression to approximately 2.28 GB, is able to retain strong reasoning capabilities. This performance may be attributed to its dense Transformer-decoder architecture and training on high-quality, “textbook-style” synthetic data. Despite the absence of explicit training on Southeast Asian languages, the model demonstrates the ability to effectively interpret Malaysian Clinical Practice Guidelines (CPGs), which are predominantly written in English with limited use of Bahasa Melayu. Following the benchmarking strategy outlined in Section 4.3, Table 6 reports the results of the four SpecioSLM variants developed across experiments. It provides a summary of each variant in terms of performance metrics measured in TPS and TTFT, model size in MB, and semantic fidelity using the BERTScore.

5.2. Discussion

5.2.1. Inference Performance: Throughput and Latency

Considering the two inference performance metrics, Tokens Per Second (TPS) and Time to First Token (TTFT), all tested models performed well, meeting the established operational targets of TPS > 30 and TTFT < 1 s. As expected, the smallest model, Qwen 2.5 (0.5B), proved to be the most computationally efficient. With a specialized SLM size of only 373.71 MB, it achieved the highest throughput at 193.81 TPS and the lowest latency with a TTFT of 0.1265 s. As the parameter count scales up, an inverse trend in performance appears, as shown in Figure 7, with Qwen 2.5 (1.5B) dropping to 152.46 TPS with a TTFT of 0.1476 s and Qwen 2.5 (3B) dropping to 113.12 TPS with a TTFT of 0.1841 s. Phi-3-Mini (3.8B), the largest model with a size of 2281.66 MB, records the lowest throughput at 91.59 TPS, although its TTFT of 0.1705 s is slightly lower than that of Qwen 2.5 (3B). Overall, all SpecioSLM variants surpass the target threshold of 30 TPS, marked as a dashed red line. A clear inverse relationship between model size and throughput is observed, and latency generally increases with model size. This suggests that lightweight models are better suited for applications in which speed and low latency are the primary requirements and a minor trade-off in accuracy is acceptable.

5.2.2. Semantic Fidelity and Reasoning Quality

While high throughput is critical for responsiveness, it is not sufficient on its own to make a model suitable for deployment. In practical applications, high semantic fidelity is equally crucial to ensure accurate and reliable contextual understanding. In this study, BERTScore is used as the primary benchmark for semantic fidelity, with a target threshold set above 90. The smallest model, Qwen-2.5 (0.5B), records the lowest BERTScore (82.84), suggesting that ultra-small models lack the parameter capacity required to capture complex semantic nuances. Scaling up the Qwen architecture yields only a modest improvement, with the largest model in the family, Qwen 2.5 (3B), reaching a BERTScore of 84.12 with a specialized size of 1834.82 MB. This pattern suggests that larger models offer richer semantic understanding and are more suitable for tasks requiring deeper contextual understanding. Interestingly, replacing the architecture with Phi-3-Mini (3.8B) produces a large improvement in semantic fidelity, achieving a BERTScore of 90.27, despite a relatively small increase in model size to approximately 2.28 GB (2281.66 MB). This finding suggests that increasing model size within the same family provides limited gain in semantic reasoning performance, whereas the choice of the student model architecture has a significant impact. Figure 8 visually summarizes these findings, highlighting Phi-3-Mini as the only evaluated model to surpass the clinical target threshold of 90 BERTScore, marked as a dashed red line.

5.2.3. The “Sweet Spot” Model: Phi-3-Mini (3.8B)

The SpecioSLM variant built on Phi-3-Mini (3.8B) emerges as the “sweet spot”, offering the most balanced and effective performance profile among all evaluated models. Although it has the largest specialized footprint at 2281.66 MB, its size remains within acceptable limits for deployment on the target edge device. In terms of inference efficiency, the model maintains competitive performance, achieving 91.59 TPS and 0.1705 s TTFT. Both results comfortably satisfy the defined operational thresholds (TPS > 30 and TTFT < 1 s), ensuring a generation speed high enough to give users the perception of near-instant text delivery. More importantly, SpecioSLM-Phi-3-Mini (3.8B) is the only variant to surpass the semantic fidelity target, reaching a BERTScore of 90.27 and outperforming the similarly sized Qwen 3B (84.12) by about 6 points, as shown in Figure 9. This suggests that generated responses closely align with the reference answers in terms of meaning. Overall, these results indicate that Phi-3-Mini (3.8B) represents the optimal architecture for the proposed SpecioSLM pipeline among those evaluated. This configuration delivers strong semantic understanding while maintaining acceptable responsiveness and efficiency.

5.3. Retrieval Strategy Impact

An additional component-wise analysis was conducted to evaluate the impact of different RAG embedding strategies on response quality. As shown in Table 7, replacing the current domain-specific model with the general-purpose sentence-transformers/all-MiniLM-L6-v2 results in a decrease in the model’s fidelity from 90.27 to 89.44 BERTScore, indicating that the biomedical embedding is more effective in medical settings. Moreover, a Hybrid Search, which combines the NeuML/pubmedbert-base-embeddings model with BM25 re-ranking, is tested to further increase response quality. Although this approach achieved the highest BERTScore (90.35), the improvement over the standalone PubMedBERT configuration was marginal (+0.08), while generation speed decreased from 91.59 to 83.40 TPS and latency increased from 0.1705 s to 0.1737 s TTFT. This performance degradation is likely caused by the additional computational burden of the secondary re-ranking stage. Overall, these results demonstrate that the standalone PubMedBERT embedding provides the most favorable balance between performance efficiency and response quality for the test case of this study.
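The re-ranking half of such a hybrid search can be sketched with a small BM25 scorer applied to the candidates returned by dense retrieval. The sketch uses whitespace tokenization, the commonly used defaults k1 = 1.5 and b = 0.75, and made-up candidate texts. All of these are assumptions and are not details taken from the study.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with the BM25 ranking function."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rerank(query: str, candidates: List[str]) -> List[str]:
    """Reorder dense-retrieval candidates by their BM25 score, best first."""
    scores = bm25_scores(query, candidates)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]


# Hypothetical candidates, as if returned by the dense retriever.
candidates = [
    "insulin dosing in adults",
    "metformin is first line therapy",
    "metformin dose adjustment in renal impairment metformin",
]
reranked = rerank("metformin renal", candidates)
print(reranked)
```

Because this extra pass runs for every query before generation starts, it adds work to the retrieval stage, which is in line with the small latency increase reported above.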

5.4. Baseline Comparisons

The optimal model, SpecioSLM-Phi-3-Mini (3.8B), is compared against two baseline models from different architectural families, Llama-3.1 (8B) and Qwen 2.5 (7B), to evaluate the effectiveness of the proposed pipeline against larger general-purpose models [47,48]. Results are presented in Table 8. Following the inverse relationship identified earlier, throughput decreases as model size increases. Accordingly, SpecioSLM achieves slightly better, though broadly similar, inference performance than the baseline models while requiring only approximately half the memory footprint. However, the most unexpected finding emerges in terms of semantic accuracy. SpecioSLM-Phi-3-Mini (3.8B) outperforms both baseline models in terms of BERTScore despite having fewer parameters. This suggests that the domain-specific adaptation and targeted fine-tuning within the proposed pipeline can enable specialized SLMs to surpass larger general-purpose models while maintaining lower computational and memory requirements. Specifically, adjusting self-attention and feed-forward projections on the grounded samples during QLoRA fine-tuning can effectively internalize domain knowledge, improving semantic alignment with task-specific queries. Conversely, general-purpose language models that rely solely on broad pre-training data may lack domain specialization, generating less contextually aligned outputs in specialized tasks.

5.5. Edge Simulation Findings

The preliminary ARM64-based simulation was conducted on the “sweet spot” model, SpecioSLM-Phi-3-Mini (3.8B), to verify hardware and software compatibility. As described in Section 3.7, the Docker environment integrated with QEMU was configured to mimic the targeted edge device, the NVIDIA Jetson Orin NX 8 GB. The successful execution of the model confirmed its operational compatibility with ARM64-based architectures and hardware specifications. In addition, benchmark evaluations were conducted to assess both performance and semantic fidelity. Results are presented in Table 9. The model achieved a BERTScore of 93.03, indicating strong semantic alignment with the reference texts and demonstrating that model generation quality remains reliable. However, performance metrics are significantly influenced by the limitations of the simulated environment. During benchmarking within the CPU-bound QEMU environment, the system recorded a Time to First Token (TTFT) of 226.81 s and a generation throughput of 0.70 tokens per second (TPS). These degraded results are a direct consequence of both the ARM64 instruction translation bottleneck and the reliance on sequential host-CPU processing. Existing studies indicate that this type of host–guest instruction translation in QEMU can reduce performance by up to 35× [55]. Applying this 35× factor as a best-case correction, the projected performance for native ARM CPU execution is approximately 6.48 s for TTFT and 24.5 tokens per second. Moreover, this projection does not account for the presence of the NVIDIA Jetson Orin NX’s dedicated Ampere GPU and Tensor Core architecture, which enables highly parallelized computation. Such hardware capabilities cannot be replicated within a standard CPU-bound emulation environment. According to NVIDIA’s validated benchmarks [56], the Phi-3.5 (3.8B) model achieves a throughput of 35.90 to 40.90 TPS when executed with 4-bit (INT4) quantization. 
In conclusion, although this preliminary simulation effectively validates model compatibility with ARM64 architecture, hardware specifications and semantic fidelity, this strategy is not suitable to assess model performance. Evaluation of these metrics requires real deployment on the physical Orin NX device.
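The projection reported in this section follows from simple arithmetic on the measured values, reproduced here as a check. The 35× factor is the upper bound cited from the literature, so the projected figures are a best case rather than a measurement.

```python
QEMU_TTFT_S = 226.81        # measured TTFT in the emulated environment (s)
QEMU_TPS = 0.70             # measured throughput in the emulated environment
TRANSLATION_PENALTY = 35.0  # upper-bound slowdown attributed to QEMU translation

projected_ttft_s = QEMU_TTFT_S / TRANSLATION_PENALTY
projected_tps = QEMU_TPS * TRANSLATION_PENALTY

print(f"Projected TTFT: {projected_ttft_s:.2f} s")
print(f"Projected TPS: {projected_tps:.1f}")
```

Even under this best-case correction, the projected CPU-only throughput stays below the 30 TPS target, which is why measurements on the physical device with GPU acceleration are needed.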

6. Conclusions

This study presents a resource-efficient end-to-end pipeline for SLM development starting from an unstructured knowledge base. The proposed approach addresses key limitations of existing SLM development methods, which rely on large pre-structured datasets and substantial computational resources. Using Malaysian CPGs as a test case, the pipeline demonstrates a practical methodology for SLM development in scenarios constrained by unstructured data formats and limited computational capacity. First, it bridges the “unstructured knowledge gap” by systematically converting 101 static PDF documents into a structured NoSQL database. Second, it leverages an isomorphic teacher model to distill an instruction-tuning dataset strictly grounded in the Malaysian CPGs, reducing both computational cost and the risk of hallucination. Third, QLoRA fine-tuning is applied to specific self-attention and feed-forward projection layers, followed by 4-bit PTQ through llama.cpp to derive the SpecioSLM model. In addition, a vector database is constructed from the CPG corpus to support a RAG framework during inference, which grounds responses in the retrieved guideline text. The entire pipeline is executed on a single consumer-grade GPU, demonstrating its accessibility under limited computational resources. Systematic benchmarking across four SpecioSLM variants and two baseline LLMs, together with an evaluation of different embedding strategies, is conducted to identify the optimal configuration for the proposed pipeline. The Microsoft Phi-3-Mini (3.8B) variant is identified as the optimal “sweet spot”, achieving a balanced trade-off between computational efficiency and semantic fidelity. It records a throughput of 91.59 tokens per second (TPS), a Time to First Token (TTFT) of 0.1705 s, and a BERTScore of 90.27, while maintaining a compressed model size of approximately 2.28 GB.
Finally, an ARM64-based simulation targeting the NVIDIA Jetson Orin NX is conducted using Docker and QEMU to validate architectural feasibility. However, CPU-bound sequential processing and host–guest instruction translation make the simulation an unreliable performance benchmark.

The primary limitation of this study is the current unavailability of the physical NVIDIA Jetson Orin NX hardware, which restricts edge deployment validation to a simulated environment. While the Docker and QEMU emulation strategy validated software deployability, it cannot capture true GPU-accelerated throughput and latency. Implementing the pipeline on physical edge hardware and benchmarking it natively is therefore a critical objective for future work. A second limitation is that the evaluation is restricted to a custom dataset derived from 101 Malaysian CPGs. As a result, the extent to which the proposed pipeline generalizes to other unstructured domain-specific knowledge sources remains unstudied. Future work should apply the pipeline to a wider range of such sources to better evaluate its robustness. A third limitation relates to the evaluation metrics. The model’s fidelity is assessed primarily with BERTScore, which measures semantic similarity but cannot guarantee clinical correctness or safety; for example, it may not penalize the omission of a medical caveat. To quantify the clinical safety of the generated responses, future validation phases should therefore incorporate evaluation by medical experts. Furthermore, subsequent research may include systematic component-wise ablation studies to isolate and quantify the impact of key design choices, including chunking strategies, embedding models, and quantization levels, on both inference performance and response quality. In particular, the current fixed-size chunking approach may be replaced with more advanced methods such as adaptive chunking, which dynamically adjusts chunk lengths to preserve semantic and topic boundaries and has been reported to improve retrieval precision and context preservation [57].
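To make the chunking limitation concrete, the following is a minimal sketch of fixed-size chunking with overlap. It is not the authors' implementation: the unit (characters rather than tokens), the default `chunk_size` and `overlap` values, and the function name are all assumptions chosen for illustration.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters; consecutive windows share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window's overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Hypothetical guideline sentence, used only to show where the boundaries fall.
sample = "Start metformin unless contraindicated. Review renal function first."
for chunk in fixed_size_chunks(sample, chunk_size=30, overlap=5):
    print(repr(chunk))
```

Because the boundaries depend only on character offsets, a window can end in the middle of a sentence or recommendation, which is the behaviour that boundary-aware (adaptive) chunking is intended to avoid.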

Author Contributions

Funding acquisition, L.-Y.O.; Investigation, C.H.b.Y.; Project administration, L.-Y.O.; Supervision, L.-Y.O.; Visualization, C.H.b.Y.; Writing—original draft, C.H.b.Y.; Writing—review and editing, L.-Y.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The synthetic dataset and FAISS index folder generated during this study are openly available on GitHub (version v1.0) at https://github.com/Haakim22/Malaysian-CPG-SLM_Synthetic-Dataset-and-FAISS-Folder (accessed on 26 June 2026). These data were derived from the following resources available in the public domain: Clinical Practice Guidelines (CPG) from the Malaysian Health Technology Assessment Section (MAHTAS), Ministry of Health Malaysia, available at https://mymahtas.moh.gov.my/index.php/docman-list/publications/cpg-list (accessed on 28 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Sharif, A.; Gurbuz, E.; Ay, S. The impact of AI on employment and jobs: A comprehensive analysis. Lond. J. Interdiscip. Sci. 2023, 1, 50–55.
2. Gao, R.X.; Krüger, J.; Merklein, M.; Möhring, H.-C.; Váncza, J. Artificial Intelligence in manufacturing: State of the art, perspectives, and future directions. CIRP Ann. 2024, 73, 723–749.
3. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.; Lee, J.; Chung, H.W.; Scales, N.; Tanwani, A.K.; Cole-Lewis, H.; Pfohl, S.; et al. Publisher Correction: Large language models encode clinical knowledge. Nature 2023, 620, E19.
4. Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.-B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Pontes Reis, E.; Seehofnerova, A.; et al. Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts. Res. Sq. 2023, preprint.
5. Örpek, Z.; Tural, B.; Destan, Z. The language model revolution: LLM and SLM analysis. In Proceedings of the 2024 8th International Artificial Intelligence and Data Processing Symposium (IDAP), Online, 21–22 September 2024; pp. 1–4.
6. Jang, S.; Morabito, R. Edge-first language model inference: Models, metrics, and tradeoffs. arXiv 2025, arXiv:2505.16508.
7. Shams, D.; Salama, I.; Callixtus, I. Exploring the landscape of large and small language models: Advancements, trade-offs, and future directions. Preprints 2025, preprint.
8. Ammanath, B. Small language models (SLMs). IEEE Softw. 2025, 42, 112–115. Available online: https://ieeexplore.ieee.org/document/11024079 (accessed on 7 January 2026).
9. Garg, M.; Raza, S.; Rayana, S.; Liu, X.; Sohn, S. The rise of small language models in healthcare: A comprehensive survey. arXiv 2025, arXiv:2504.17119.
10. Yuan, L.; Han, D.-J.; Wang, S.; Brinton, C.G. Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings. arXiv 2025, arXiv:2502.11007.
11. Perlindungan Data Peribadi (PDP). Personal Data Protection Act 2010 [Act 709]. 2025. Available online: https://www.pdp.gov.my/ppdpv1/en/akta/pdp-act-2010-en/ (accessed on 12 April 2026).
12. Ramachandran, A. Empowering Edge AI with Small Language Models: Architectures, Challenges, and Transformative Enterprise Applications. ResearchGate. 2024. Available online: https://www.researchgate.net/publication/385783062_Empowering_Edge_AI_with_Small_Language_Models_Architectures_Challenges_and_Transformative_Enterprise_Applications (accessed on 15 February 2026).
13. Lu, Z.; Li, X.; Cai, D.; Yi, R.; Liu, F.; Liu, W.; Luan, J.; Zhang, X.; Lane, N.D.; Xu, M. Demystifying Small Language Models for Edge Deployment. In Proceedings of the Association for Computational Linguistics (ACL), Vienna, Austria, 27 July–1 August 2025; pp. 14747–14764.
14. Corradini, F.; Leonesi, M.; Piangerelli, M. State of the Art and Future Directions of Small Language Models: A Systematic Review. Big Data Cogn. Comput. 2025, 9, 189.
15. Wang, F.; Zhang, Z.; Zhang, X.; Wu, Z.; Mo, T.; Lu, Q.; Wang, W.; Li, R.; Xu, J.; Tang, X.; et al. A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness. arXiv 2024, arXiv:2411.03350.
16. Gu, Y.; Dong, L.; Wei, F.; Huang, M. Knowledge Distillation of Large Language Models. arXiv 2023, arXiv:2306.08543.
17. Xu, X.; Li, M.; Tao, C.; Shen, T.; Cheng, R.; Li, J.; Xu, C.; Tao, D.; Zhou, T. A Survey on Knowledge Distillation of Large Language Models. arXiv 2024, arXiv:2402.13116.
18. Xu, L.; Xie, H.; Qin, S.-Z.J.; Tao, X.; Wang, F.L. Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment. arXiv 2023, arXiv:2312.12148.
19. Sparrenberg, L.; Deußer, T.; Berger, A.; Sifa, R. Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp. In 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA); IEEE: Birmingham, UK, 2025; pp. 1–10.
20. Ministry of Health Malaysia. Health White Paper for Malaysia: Strengthening People’s Health, Future-Proofing the Nation’s Health System. 2023. Available online: https://www.moh.gov.my/images/04-penerbitan/kertas-putih/Kertas_Putih_Kesihatan_ENG_compressed.pdf (accessed on 12 April 2026).
21. Economic Planning Unit. Twelfth Malaysia Plan 2021–2025 (RMK-12); Prime Minister’s Department: Putrajaya, Malaysia, 2021. Available online: https://rmke12.ekonomi.gov.my/en/documents/twelfth-plan (accessed on 12 April 2026).
22. Economic Planning Unit. Thirteenth Malaysia Plan (RMK-13) 2026–2030: Melakar Semula Pembangunan/Restructuring Development; Prime Minister’s Department: Putrajaya, Malaysia, 2025. Available online: https://rmk13.ekonomi.gov.my/wp-content/uploads/2025/09/Executive_Summary_Thirteenth_Malaysia_Plan.pdf (accessed on 12 April 2026).
23. Shiffman, R.N. Clinical Practice Guidelines: Supporting Decisions, Optimizing Care. In Pediatric Informatics; Lehmann, C.U., Kim, G.R., Johnson, K.B., Eds.; Springer: New York, NY, USA, 2009; pp. 185–197.
24. Harrison, M.B.; Legare, F.; Graham, I.D.; Fervers, B. Adapting clinical practice guidelines to local context and assessing barriers to their use. Can. Med. Assoc. J. 2009, 182, E78–E84.
25. Fortmann, J.; Lutz, M.; Spreckelsen, C. System for Context-Specific Visualization of Clinical Practice Guidelines (GuLiNav): Concept and Software Implementation. JMIR Form. Res. 2022, 6, e28013.
26. Bolton, E.; Venigalla, A.; Yasunaga, M.; Hall, D.; Xiong, B.; Lee, T.; Daneshjou, R.; Frankle, J.; Liang, P.; Carbin, M.; et al. BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text. arXiv 2024.
27. Luukkonen, R.; Komulainen, L.; Luoma, J.; Eskelinen, A.; Kanerva, J.; Kupari, H.-M.; Ginter, F.; Laippala, V.; Muennighoff, N.; Piktus, A.; et al. FinGPT: Large Generative Models for a Small Language. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Singapore, 2023.
28. Hugging Face. dmis-lab/meerkat-7b-v1.0. 2024. Available online: https://huggingface.co/dmis-lab/meerkat-7b-v1.0 (accessed on 12 April 2026).
29. Kim, H.; Hwang, H.; Lee, J.; Park, S.; Kim, D.; Lee, T.; Yoon, C.; Sohn, J.; Choi, D.; Kang, J. Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks. arXiv 2024.
30. Hugging Face. adityak74/medfit-llm-3B. 2025. Available online: https://huggingface.co/adityak74/medfit-llm-3B (accessed on 12 April 2026).
31. Rao, A.K.G.; Jaggi, A.; Naidu, S. MEDFIT-LLM: Medical Enhancements Through Domain-Focused Fine Tuning of Small Language Models. In Proceedings of the 2025 2nd International Conference on Research Methodologies in Knowledge Management, Artificial Intelligence and Telecommunication Engineering (RMKMATE), Chennai, India, 7–8 May 2025; pp. 1–5.
32. Hugging Face. mesolitica/mallam-1.1B-4096. 2025. Available online: https://huggingface.co/mesolitica/mallam-1.1B-4096 (accessed on 12 April 2026).
33. Hugging Face. mesolitica/mallam-3B-4096. 2025. Available online: https://huggingface.co/mesolitica/mallam-3B-4096 (accessed on 12 April 2026).
34. Hugging Face. mesolitica/mallam-5B-4096. 2025. Available online: https://huggingface.co/mesolitica/mallam-5B-4096 (accessed on 12 April 2026).
35. Agrawal, A.; Agarwal, A.; Kedia, N.; Mohan, J.; Kundu, S.; Kwatra, N.; Ramjee, R.; Tumanov, A. Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems. arXiv 2024.
36. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2020, arXiv:1904.09675.
37. Liu, J.; Chen, B.; Zhang, C. Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation. In Proceedings of the 42nd International Conference on Machine Learning, PMLR, Vancouver, BC, Canada, 13–19 July 2025; pp. 38188–38209. Available online: https://proceedings.mlr.press/v267/liu25g.html (accessed on 3 April 2026).
38. Nielsen, J. Usability Engineering; Nielsen Norman Group: Fremont, CA, USA, 1993. Available online: https://www.nngroup.com/books/usability-engineering/ (accessed on 12 April 2026).
39. Conde, J.; González, M.; Reviriego, P.; Gao, Z.; Liu, S.; Lombardi, F. Speed and Conversational Large Language Models: Not All Is About Tokens per Second. Computer 2024, 57, 74–80.
40. Fu, Y.; Xue, L.; Huang, Y.; Brabete, A.-O.; Ustiugov, D.; Patel, Y.; Mai, L. ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models. arXiv 2024, arXiv:2401.14351.
41. Patwari, R.; Sirasao, A.; Das, D. Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling. arXiv 2025.
42. Brysbaert, M. How many words do we read per minute? A review and meta-analysis of reading rate. J. Mem. Lang. 2019, 109, 104047.
43. Jahan, I.; Rahman, T.; Peng, C.; Huang, J.X. A comprehensive evaluation of large Language models on benchmark biomedical text processing tasks. Comput. Biol. Med. 2024, 171, 108189.
44. Nguyen, V.A.; Ha, T.B.N.; Tran, M.N.; Pham, N.T.M.; Nguyen, T.L.; Vuong, T.Q.T. Quantifying the speed-accuracy trade-off of large language models on oral and maxillofacial surgery multiple-choice questions. Sci. Rep. 2025, 15, 40657.
45. Abdin, M.; Jacobs, S.A.; Awan, A.A.; Aneja, J.; Awadallah, A.; Awadalla, H.; Bach, N.; Bahree, A.; Bakhtiari, A.; Behl, H.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219.
46. Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115.
47. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
48. Archet, A.; Gac, N.; Orieux, F.; Ventroux, N. Embedded AI performances of Nvidia’s Jetson Orin SoC series. In Proceedings of the 17ème Colloque National du GDR SOC2, Lyon, France, 12–14 June 2023.
49. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv 2023.
50. Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv 2023.
51. Bhat, S.R.; Rudat, M.; Spiekermann, J.; Flores-Herr, N. Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis. arXiv 2025.
52. Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. arXiv 2020, arXiv:2002.10957.
53. Ministry of Health Malaysia; Malaysian Health Technology Assessment Section (MAHTAS). Clinical Practice Guidelines (CPG). Available online: https://mymahtas.moh.gov.my/index.php/docman-list/publications/cpg-list (accessed on 28 December 2025).
54. Dean, J.; Barroso, L.A. The tail at scale. Commun. ACM 2013, 56, 74–80.
55. Parker, A.I. Boosting Cross-Architectural Emulation Performance by Foregoing the Intermediate Representation Model. arXiv 2025, arXiv:2501.03427.
56. Dipert, B. NVIDIA JetPack 6.2 Brings Super Mode to NVIDIA Jetson Orin Nano and Jetson Orin NX Modules. Edge AI and Vision Alliance. 2025. Available online: https://www.edge-ai-vision.com/2025/01/nvidia-jetpack-6-2-brings-super-mode-to-nvidia-jetson-orin-nano-and-jetson-orin-nx-modules/ (accessed on 20 April 2026).
57. Gomez-Cabello, C.A.; Prabha, S.; Haider, S.A.; Genovese, A.; Collaco, B.G.; Wood, N.G.; Bagaria, S.; Forte, A.J. Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support. Bioengineering 2025, 12, 1194.

Figure 1. Comparative flowchart of Small Language Model (SLM) development methods.

Figure 2. General overview of the proposed end-to-end pipeline.

Figure 3. General flowchart for hybrid PDF data extraction.

Figure 4. General flowchart for synthetic fine-tuning dataset generation.

Figure 5. General flowchart for the QLoRA Parameter-Efficient Fine-Tuning (PEFT).

Figure 6. General flowchart for RAG framework configuration and inference loop.

Figure 7. Inference performance trade-offs by model size.

Figure 8. Semantic fidelity (BERTScore) results comparison.

Figure 9. SpecioSLM-Phi-3-Mini (3.8B) as “Sweet Spot”.

Table 1. Comparison of existing Small Language Models datasets and computational requirements.

Small Language Model	Development Approach	Dataset Size and Source	Computational Resources
BioMedLM	Approach 1	34.6B tokens (Existing Datasets)	Cluster (128× A100 GPUs)
FinGPT	Approach 1	38B tokens (Web Scraping)	Cluster (LUMI supercomputer with 192 nodes)
Meerkat	Approach 2	441,034 samples (Existing Datasets + Medical Textbooks)	Cluster (8× A100 GPUs)
MEDFIT-LLM	Approach 2	10,000 synthetic samples (Ungrounded)	Apple Silicon (MLX) ~16–32 GB

Table 2. Specifications and original memory footprints of the selected open-source models.

Model	Role	License	Parameters Count	Original Size (Approx. FP16/BF16)
Qwen 2.5 (0.5B)	Student	Apache 2.0	0.49 billion	~0.94 GB
Qwen 2.5 (1.5B)	Student	Apache 2.0	1.54 billion	~3.08 GB
Qwen 2.5 (3B)	Student and Teacher	Apache 2.0	3.09 billion	~6.18 GB
Phi-3-Mini (3.8B)	Student	MIT permissive	3.82 billion	~7.64 GB
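The "Original Size" column follows closely from the parameter counts, since FP16 and BF16 both store each weight in two bytes. The sketch below reproduces that rule of thumb for three of the listed models; it counts weights only, uses decimal gigabytes, and the function name is illustrative rather than taken from the paper.

```python
BYTES_PER_PARAM_16BIT = 2  # FP16 and BF16 each use 16 bits per weight

# Parameter counts in billions, copied from Table 2.
PARAMS_BILLION = {
    "Qwen 2.5 (1.5B)": 1.54,
    "Qwen 2.5 (3B)": 3.09,
    "Phi-3-Mini (3.8B)": 3.82,
}


def fp16_size_gb(params_billion: float) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM_16BIT / 1e9


for name, params in PARAMS_BILLION.items():
    print(f"{name}: ~{fp16_size_gb(params):.2f} GB")
```

The same rule gives roughly 0.98 GB for the 0.49-billion-parameter Qwen 2.5 model, slightly above the ~0.94 GB listed in the table, so the tabulated values are best read as approximate.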

Table 3. QLoRA hyperparameter configurations for model adaptation and stability.

Hyperparameter	Value	Reason
LORA_R	16	Provide enough dimensions to capture domain features while keeping the model efficient
LORA_ALPHA	16	Maintain a 1× ratio with LORA_R so that weight updates follow the natural scale of the adapters
LORA_DROPOUT	0.05	Apply light regularization to reduce overfitting while preserving stable learning
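As a rough illustration of what these hyperparameters imply, the sketch below computes the standard LoRA scaling factor (alpha divided by r) and the number of trainable weights an adapter adds to a single projection matrix. The three values come from Table 3; the class and function names are illustrative, and the projection width of 3072 is a hypothetical size chosen only for the example, not a figure reported in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    r: int = 16            # LORA_R from Table 3
    alpha: int = 16        # LORA_ALPHA from Table 3
    dropout: float = 0.05  # LORA_DROPOUT from Table 3

    @property
    def scaling(self) -> float:
        """Factor applied to the low-rank update B @ A in standard LoRA."""
        return self.alpha / self.r


def adapter_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable weights added to one d_out x d_in projection: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


cfg = LoraSettings()
width = 3072  # hypothetical square projection, for illustration only
added = adapter_params(width, width, cfg.r)
print(f"scaling = {cfg.scaling}")
print(f"adapter weights = {added} ({added / (width * width):.2%} of the frozen matrix)")
```

With alpha equal to r the scaling is 1.0, which is the "1× ratio" referred to in the table, and the adapter trains only about one percent as many weights as the frozen projection it augments.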

Table 4. Hardware specifications: emulated environment vs. native NVIDIA Jetson Orin NX.

Feature	Emulated Environment (Docker + QEMU)	Native NVIDIA Jetson Orin NX
CPU Architecture	ARM64 (via QEMU Translation)	ARM64 (Native)
RAM Limit	8 GB (Artificially Restricted)	8 GB (Native Unified Memory)
Compute Cores	8 Threads (CPU Host)	8-Core ARM Cortex-A78AE
GPU Acceleration	None (CPU Only)	NVIDIA Ampere (1024 CUDA Cores, 32 Tensor Cores)
NPU/NVDLA	None	2× NVDLA Engines
Inference Hardware	Sequential Processing (High Translation Overhead)	Highly Parallelized (CUDA/Mixed-Precision)

Table 5. Hardware specifications and software environments used for experiments.

Device	Processor (CPU)	Graphics (GPU)	Memory	Storage	Software Environment	Role
Desktop Workstation	AMD Ryzen 9 3900X (12-core)	NVIDIA GeForce RTX 3080 (10 GB VRAM)	32 GB DDR4	512 GB	Windows 11 w/ CUDA Parallel Computing	Dataset Generation, SLM Development and Benchmarking
Laptop (Acer Predator Helios Neo 16)	13th Gen Intel® Core™ i7-13700HX (2.10 GHz)	NVIDIA GeForce RTX 4050 (6 GB VRAM)	32 GB DDR5	1 TB	Windows 11	Inference and Benchmarking
Docker Desktop w/QEMU	8 Threads ARM64 (via QEMU emulation)	N/A (CPU-Based)	8 GB (Artificially Restricted)	256 GB (Host Allocated)	Ubuntu 22.04 LTS (Docker Container)	ARM-Architecture Validation

Table 6. Benchmark results of the four SpecioSLM variants.

SpecioSLM Variant	Original LLM Size (FP16)	Specialized SLM Size (4-bit)	TPS (Tokens per Second) [Target > 30]	TTFT (Time to First Token, s) [Target < 1 s]	BERTScore (Fidelity) [Target > 90]
SpecioSLM_Qwen 2.5 (0.5B)	~1 GB	373.71 MB	193.81	0.1265	82.84
SpecioSLM_Qwen 2.5 (1.5B)	~3 GB	934.69 MB	152.46	0.1476	83.74
SpecioSLM_Qwen 2.5 (3B)	~6 GB	1834.82 MB	113.12	0.1841	84.12
SpecioSLM_Phi-3-Mini (3.8B)	~8 GB	2281.66 MB	91.59	0.1705	90.27
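The selection of the "sweet spot" can be expressed as a simple filter over these rows. The sketch below copies the Table 6 values and applies the three targets from the header (TPS above 30, TTFT below 1, taken here to be seconds as in the abstract, and BERTScore above 90); the type and function names are illustrative and not part of the authors' code.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    tps: float        # tokens per second
    ttft_s: float     # time to first token, assumed to be in seconds
    bertscore: float  # semantic fidelity


# Rows copied from Table 6.
RESULTS = [
    Result("Qwen 2.5 (0.5B)", 193.81, 0.1265, 82.84),
    Result("Qwen 2.5 (1.5B)", 152.46, 0.1476, 83.74),
    Result("Qwen 2.5 (3B)", 113.12, 0.1841, 84.12),
    Result("Phi-3-Mini (3.8B)", 91.59, 0.1705, 90.27),
]


def meets_targets(r: Result) -> bool:
    """Apply the header targets: TPS > 30, TTFT < 1, BERTScore > 90."""
    return r.tps > 30 and r.ttft_s < 1 and r.bertscore > 90


passing = [r.model for r in RESULTS if meets_targets(r)]
print(passing)
```

All four variants clear the speed and latency targets by a wide margin, so under these thresholds the fidelity target is the only criterion that separates them.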

Table 7. Preliminary retrieval configuration results (Phi-3-Mini 3.8B).

Embedding Strategy	TPS (Tokens per Second) [Target > 30]	TTFT (Time to First Token, s) [Target < 1 s]	BERTScore (Fidelity) [Target > 90]
Baseline (all-MiniLM-L6-v2)	95.48	0.1685	89.44
Domain-Specific (PubMedBERT)	91.59	0.1705	90.27
Hybrid Search (PubMedBERT + BM25)	83.40	0.1737	90.35

Table 8. Benchmark results of SpecioSLM-Phi-3-Mini (3.8B) compared with the two baseline LLMs.

Model	Size	TPS (Tokens per Second) [Target > 30]	TTFT (Time to First Token, s) [Target < 1 s]	BERTScore (Fidelity) [Target > 90]
SpecioSLM_Phi-3-Mini (3.8B)	2281.66 MB	91.59	0.1705	90.27
Qwen 2.5 (7B)	~14 GB	81.80	0.1767	84.71
Llama-3.1 (8B)	~16 GB	75.63	0.1791	84.56

Table 9. ARM64-based simulation and projected results for SpecioSLM-Phi-3-Mini (3.8B).

Environment	TPS (Tokens per Second) [Target > 30]	TTFT (Time to First Token, s) [Target < 1 s]	BERTScore (Fidelity) [Target > 90]
Docker w/QEMU (CPU-Bound Simulation)	0.70	226.81	93.03
Jetson Orin NX (Projected Results)	35.90–40.90	<1.0	93.03

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

