A Pipeline for Domain-Specialized Small Language Models from Unstructured Data: A Test Case Using Malaysian Clinical Practice Guidelines
Abstract
1. Introduction
- The design of a custom end-to-end pipeline implementable on consumer-grade GPUs that successfully transforms unstructured knowledge into SLMs.
- A systematic evaluation of multiple SLMs processed through this pipeline to define the optimal trade-off architecture between execution throughput, memory footprint, and semantic fidelity.
- The implementation of an ARM-based simulation to validate the practical viability and execution constraints of the optimal SLM on a targeted edge architecture.
2. Literature Review
2.1. Limitations of Large Language Models
2.2. The Shift to Small Language Models
2.3. Small Language Models Development Processes
- Teacher and Student Selection: Two pre-trained LLMs are selected based on their performance and compressibility. The first, the teacher model, is a large, high-performing model, typically with 70 billion parameters or more. Its linguistic understanding, broad world knowledge, and reasoning capabilities, acquired during large-scale pre-training, make it the gold-standard source of knowledge from which the student model learns during knowledge distillation. The second, the student model, is a smaller and more efficient general-purpose LLM, typically containing fewer than 8 billion parameters. During distillation, the student model acquires the teacher’s linguistic understanding and general reasoning abilities [16]. This selection is a critical design choice that must balance performance, compressibility, and license constraints, because it directly affects the resulting SLM’s performance, accuracy, and size [17].
- Fine-Tuning: The student model is trained on a smaller, task-specific dataset generated by the teacher model, adapting its general knowledge to the target domain. This phase is essential to improve the accuracy and linguistic fluency of the specialized model [18]. Fine-tuning can be conducted using Full Fine-Tuning or Parameter-Efficient Fine-Tuning (PEFT). In Full Fine-Tuning, all model parameters are updated, which yields stronger task adaptation but entails high computational requirements. PEFT, by contrast, reduces memory and compute requirements by keeping the majority of the model’s weights frozen and training only small, added adapter modules [18]. PEFT techniques include Low-Rank Adaptation (LoRA), which trains a small number of parameters in lightweight low-rank adapter matrices; Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3), which introduces a small number of trainable scaling vectors to modify transformer activations; and Quantized LoRA (QLoRA), which combines LoRA with 4-bit quantization, allowing fine-tuning to be performed with much lower memory usage [18].
- Compression Techniques: To minimize memory footprint while preserving performance, the model undergoes one or more compression techniques: pruning, knowledge distillation, and quantization. Pruning removes less important parameters, either individually (unstructured) or by deleting entire structural components such as neurons or layers (structured). Knowledge distillation trains a smaller student model to mimic a larger teacher using soft labels, via white-box (access to the teacher’s internals) or black-box (outputs only) methods. Quantization lowers memory and computation by converting weights to lower precision (e.g., 8-bit or 4-bit), using Post-Training Quantization (PTQ) after training or Quantization-Aware Training (QAT) during training [15]. Studies demonstrate that these lower-precision models can achieve significant memory and computational savings while preserving more than 98% of the original model’s accuracy [19].
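
As a rough illustration of the post-training quantization idea described above, the sketch below applies symmetric absmax quantization to blocks of weights and maps them to signed 4-bit codes. The block size of 32, the function names, and the random weights are assumptions made for illustration; this is not the NF4 scheme used by QLoRA, nor the format of any particular inference runtime.

```python
import numpy as np


def quantize_int4(weights: np.ndarray, block_size: int = 32):
    """Symmetric absmax quantization of a 1-D array to 4-bit codes in [-7, 7].

    The array length is assumed to be a multiple of block_size.
    """
    blocks = weights.reshape(-1, block_size)
    # One scale per block: the largest magnitude in the block maps to +/-7.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return codes, scales


def dequantize_int4(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights from the codes and per-block scales."""
    return (codes * scales).reshape(-1)


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # stand-in weights
codes, scales = quantize_int4(w)
w_hat = dequantize_int4(codes, scales)

# Rounding to the nearest code bounds the error by half a quantization step.
max_err = float(np.abs(w - w_hat).max())
print(f"max abs error: {max_err:.6f}, largest half-step: {float(scales.max()) / 2:.6f}")
```

Each weight is stored as a 4-bit code plus a share of its block's scale, which is why the effective storage cost per weight ends up slightly above 4 bits.
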
2.4. Malaysian Clinical Practice Guidelines as a Domain-Specific Knowledge Source
2.5. Small Language Models: Current Situation
2.6. Benchmarking Language Models
- Time To First Token (TTFT): TTFT is a key metric for evaluating LLM inference performance, measuring the latency between the user’s query (input prompt) and the start of the model’s response (first output token) [37]. Lower TTFT values indicate better responsiveness of the model. Based on Nielsen’s classic usability guidelines [38], Human–Computer Interaction (HCI) research identifies TTFT as the primary metric for conversational capability with a recommended threshold of less than 1 s [39,40]. Indeed, a TTFT exceeding 1 s creates a noticeable delay that degrades the user experience in conversational systems.
- Tokens Per Second (TPS): Alongside TTFT, TPS measures generation speed (throughput), defined as the average number of tokens generated per second after the first token has appeared [41]. Psycholinguistic evidence suggests that humans typically read complex technical text at about 250 words per minute, which corresponds to approximately 5–6 tokens per second [42]. A minimum of 30 TPS, roughly five to six times this reading rate, is therefore considered fast enough to create the perception of instant text availability.
- Fidelity (BERTScore): This metric evaluates how faithful a generated response is, not only by matching exact words but by matching the intended meaning. Like accuracy, fidelity is used to evaluate correctness, but it is better suited to domain-specific settings because it also accounts for semantic equivalence between different terms that carry the same meaning (e.g., “renal failure” vs. “kidney failure”) [36]. BERTScore leverages a pre-trained transformer architecture to generate contextual token embeddings for both the generated response and the reference text. These embeddings are then compared using cosine similarity to compute Precision, Recall, and F1 score, which together quantify the semantic overlap. The score nominally ranges from 0 to 1; in several NLP applications, values between 0.80 and 0.90 (80–90 when expressed as a percentage, as in this paper’s tables) are commonly regarded as indicating high-quality system performance [43].
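
The three metrics above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: token timestamps are supplied by the caller, the ratio of 1.3 tokens per word is an assumed conversion factor, and random arrays stand in for the contextual embeddings that a real BERTScore implementation would obtain from a pre-trained transformer. The IDF weighting and baseline rescaling options of BERTScore are omitted, and the function names are hypothetical.

```python
import numpy as np


def ttft_and_tps(request_time: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT is the delay until the first token; TPS counts tokens after the first."""
    ttft = token_times[0] - request_time
    decode_time = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else float("inf")
    return ttft, tps


def bertscore_from_embeddings(cand: np.ndarray, ref: np.ndarray) -> tuple[float, float, float]:
    """Greedy-matching precision, recall, and F1 from token embeddings (rows = tokens)."""
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = cand @ ref.T  # cosine similarity, shape (candidate tokens, reference tokens)
    precision = float(sim.max(axis=1).mean())  # best reference match per candidate token
    recall = float(sim.max(axis=0).mean())     # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Reading rate: 250 words/minute at an assumed 1.3 tokens per word.
reading_tps = 250 / 60 * 1.3
print(f"reading rate: about {reading_tps:.1f} tokens/s")

# Hypothetical timestamps (seconds) for a five-token response.
ttft, tps = ttft_and_tps(0.0, [0.2, 0.3, 0.4, 0.5, 0.6])
print(f"TTFT = {ttft:.2f} s, TPS = {tps:.1f}")

# Random stand-in embeddings: 6 candidate tokens, 5 reference tokens, dimension 8.
rng = np.random.default_rng(0)
p, r, f1 = bertscore_from_embeddings(rng.normal(size=(6, 8)), rng.normal(size=(5, 8)))
print(f"P = {p:.3f}, R = {r:.3f}, F1 = {f1:.3f}")
```
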
2.7. Research Gap
3. Proposed Pipeline
3.1. Teacher and Student Selection
3.2. Data Extraction
3.3. Synthetic Tuning Dataset Generation
3.4. QLoRA Fine-Tuning
3.5. Compression
3.6. RAG Framework Setup and Runtime Inference
3.7. Edge Validation
4. Experiment Setup
4.1. Datasets
- Synthetic Fine-Tuning Dataset: A collection of 15,903 instruction–input–output samples in JSON Alpaca format, generated by the 3B teacher model, with one sample produced for each document object retrieved from the NoSQL database. This dataset is used to transfer domain-specific knowledge and linguistic style to the student models through QLoRA fine-tuning. During training, the dataset is split into 90% training and 10% validation subsets.
- Test Dataset: A collection of 141 question–answer pairs manually created by human annotators. The questions are generated by randomly selecting three sections from each CPG PDF document, reviewing their contents, and formulating questions based on the identified context. This dataset is used to evaluate the performance of SpecioSLM.
- Vector Database: A collection of dense vectors generated from CPG documents using an embedding model. Vectors are indexed and stored in local storage to enable similarity search retrieval during the inference stage of SpecioSLM.
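
To make the fine-tuning dataset description concrete, the sketch below shows one record with the three fields of the Alpaca format and a 90/10 split of 15,903 records. The record contents, the shuffling seed, and the rounding of the validation size are assumptions for illustration; the source does not state how the split was drawn.

```python
import json
import random

# Hypothetical record in the Alpaca instruction format (placeholder contents).
sample = {
    "instruction": "Answer the question using the clinical practice guideline excerpt.",
    "input": "<text of one extracted CPG section>",
    "output": "<answer generated by the teacher model>",
}


def split_dataset(records: list, val_fraction: float = 0.10, seed: int = 42):
    """Shuffle a copy of the records and return (train, validation) lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Stand-in dataset with the same number of samples as reported in the text.
records = [dict(sample, input=f"section {i}") for i in range(15903)]
train, val = split_dataset(records)

print(json.dumps(sample, indent=2))
print(len(train), len(val))  # 14313 1590 with this rounding
```
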
4.2. Evaluation Metrics
4.3. Benchmarking Strategy
5. Results and Discussion
5.1. Experiments
5.2. Discussion
5.2.1. Inference Performance: Throughput and Latency
5.2.2. Semantic Fidelity and Reasoning Quality
5.2.3. The “Sweet Spot” Model: Phi-3-Mini (3.8B)
5.3. Retrieval Strategy Impact
5.4. Baseline Comparisons
5.5. Edge Simulation Findings
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Sharif, A.; Gurbuz, E.; Ay, S. The impact of AI on employment and jobs: A comprehensive analysis. Lond. J. Interdiscip. Sci. 2023, 1, 50–55.
- Gao, R.X.; Krüger, J.; Merklein, M.; Möhring, H.-C.; Váncza, J. Artificial Intelligence in manufacturing: State of the art, perspectives, and future directions. CIRP Ann. 2024, 73, 723–749.
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.; Lee, J.; Chung, H.W.; Scales, N.; Tanwani, A.K.; Cole-Lewis, H.; Pfohl, S.; et al. Publisher Correction: Large language models encode clinical knowledge. Nature 2023, 620, E19.
- Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.-B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Pontes Reis, E.; Seehofnerova, A.; et al. Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts. Res. Sq. 2023, preprint.
- Örpek, Z.; Tural, B.; Destan, Z. The language model revolution: LLM and SLM analysis. In Proceedings of the 2024 8th International Artificial Intelligence and Data Processing Symposium (IDAP), Online, 21–22 September 2024; pp. 1–4.
- Jang, S.; Morabito, R. Edge-first language model inference: Models, metrics, and tradeoffs. arXiv 2025, arXiv:2505.16508.
- Shams, D.; Salama, I.; Callixtus, I. Exploring the landscape of large and small language models: Advancements, trade-offs, and future directions. Preprints 2025, preprint.
- Ammanath, B. Small language models (SLMs). IEEE Softw. 2025, 42, 112–115. Available online: https://ieeexplore.ieee.org/document/11024079 (accessed on 7 January 2026).
- Garg, M.; Raza, S.; Rayana, S.; Liu, X.; Sohn, S. The rise of small language models in healthcare: A comprehensive survey. arXiv 2025, arXiv:2504.17119.
- Yuan, L.; Han, D.-J.; Wang, S.; Brinton, C.G. Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings. arXiv 2025, arXiv:2502.11007.
- Perlindungan Data Peribadi (PDP). Personal Data Protection Act 2010 [Act 709]. 2025. Available online: https://www.pdp.gov.my/ppdpv1/en/akta/pdp-act-2010-en/ (accessed on 12 April 2026).
- Ramachandran, A. Empowering Edge AI with Small Language Models: Architectures, Challenges, and Transformative Enterprise Applications. ResearchGate. 2024. Available online: https://www.researchgate.net/publication/385783062_Empowering_Edge_AI_with_Small_Language_Models_Architectures_Challenges_and_Transformative_Enterprise_Applications (accessed on 15 February 2026).
- Lu, Z.; Li, X.; Cai, D.; Yi, R.; Liu, F.; Liu, W.; Luan, J.; Zhang, X.; Lane, N.D.; Xu, M. Demystifying Small Language Models for Edge Deployment. In Proceedings of the Association for Computational Linguistics (ACL), Vienna, Austria, 27 July–1 August 2025; pp. 14747–14764.
- Corradini, F.; Leonesi, M.; Piangerelli, M. State of the Art and Future Directions of Small Language Models: A Systematic Review. Big Data Cogn. Comput. 2025, 9, 189.
- Wang, F.; Zhang, Z.; Zhang, X.; Wu, Z.; Mo, T.; Lu, Q.; Wang, W.; Li, R.; Xu, J.; Tang, X.; et al. A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness. arXiv 2024, arXiv:2411.03350.
- Gu, Y.; Dong, L.; Wei, F.; Huang, M. Knowledge Distillation of Large Language Models. arXiv 2023, arXiv:2306.08543.
- Xu, X.; Li, M.; Tao, C.; Shen, T.; Cheng, R.; Li, J.; Xu, C.; Tao, D.; Zhou, T. A Survey on Knowledge Distillation of Large Language Models. arXiv 2024, arXiv:2402.13116.
- Xu, L.; Xie, H.; Qin, S.-Z.J.; Tao, X.; Wang, F.L. Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment. arXiv 2023, arXiv:2312.12148.
- Sparrenberg, L.; Deußer, T.; Berger, A.; Sifa, R. Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp. In 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA); IEEE: Birmingham, UK, 2025; pp. 1–10.
- Ministry of Health Malaysia. Health White Paper for Malaysia: Strengthening People’s Health, Future-Proofing the Nation’s Health System. 2023. Available online: https://www.moh.gov.my/images/04-penerbitan/kertas-putih/Kertas_Putih_Kesihatan_ENG_compressed.pdf (accessed on 12 April 2026).
- Economic Planning Unit. Twelfth Malaysia Plan 2021–2025 (RMK-12); Prime Minister’s Department: Putrajaya, Malaysia, 2021. Available online: https://rmke12.ekonomi.gov.my/en/documents/twelfth-plan (accessed on 12 April 2026).
- Economic Planning Unit. Thirteenth Malaysia Plan (RMK-13) 2026–2030: Melakar Semula Pembangunan/Restructuring Development; Prime Minister’s Department: Putrajaya, Malaysia, 2025. Available online: https://rmk13.ekonomi.gov.my/wp-content/uploads/2025/09/Executive_Summary_Thirteenth_Malaysia_Plan.pdf (accessed on 12 April 2026).
- Shiffman, R.N. Clinical Practice Guidelines: Supporting Decisions, Optimizing Care. In Pediatric Informatics; Lehmann, C.U., Kim, G.R., Johnson, K.B., Eds.; Springer: New York, NY, USA, 2009; pp. 185–197.
- Harrison, M.B.; Legare, F.; Graham, I.D.; Fervers, B. Adapting clinical practice guidelines to local context and assessing barriers to their use. Can. Med. Assoc. J. 2009, 182, E78–E84.
- Fortmann, J.; Lutz, M.; Spreckelsen, C. System for Context-Specific Visualization of Clinical Practice Guidelines (GuLiNav): Concept and Software Implementation. JMIR Form. Res. 2022, 6, e28013.
- Bolton, E.; Venigalla, A.; Yasunaga, M.; Hall, D.; Xiong, B.; Lee, T.; Daneshjou, R.; Frankle, J.; Liang, P.; Carbin, M.; et al. BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text. arXiv 2024.
- Luukkonen, R.; Komulainen, L.; Luoma, J.; Eskelinen, A.; Kanerva, J.; Kupari, H.-M.; Ginter, F.; Laippala, V.; Muennighoff, N.; Piktus, A.; et al. FinGPT: Large Generative Models for a Small Language. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Singapore, 2023.
- Hugging Face. dmis-lab/meerkat-7b-v1.0. 2024. Available online: https://huggingface.co/dmis-lab/meerkat-7b-v1.0 (accessed on 12 April 2026).
- Kim, H.; Hwang, H.; Lee, J.; Park, S.; Kim, D.; Lee, T.; Yoon, C.; Sohn, J.; Choi, D.; Kang, J. Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks. arXiv 2024.
- Hugging Face. adityak74/medfit-llm-3B. 2025. Available online: https://huggingface.co/adityak74/medfit-llm-3B (accessed on 12 April 2026).
- Rao, A.K.G.; Jaggi, A.; Naidu, S. MEDFIT-LLM: Medical Enhancements Through Domain-Focused Fine Tuning of Small Language Models. In Proceedings of the 2025 2nd International Conference on Research Methodologies in Knowledge Management, Artificial Intelligence and Telecommunication Engineering (RMKMATE), Chennai, India, 7–8 May 2025; pp. 1–5.
- Hugging Face. mesolitica/mallam-1.1B-4096. 2025. Available online: https://huggingface.co/mesolitica/mallam-1.1B-4096 (accessed on 12 April 2026).
- Hugging Face. mesolitica/mallam-3B-4096. 2025. Available online: https://huggingface.co/mesolitica/mallam-3B-4096 (accessed on 12 April 2026).
- Hugging Face. mesolitica/mallam-5B-4096. 2025. Available online: https://huggingface.co/mesolitica/mallam-5B-4096 (accessed on 12 April 2026).
- Agrawal, A.; Agarwal, A.; Kedia, N.; Mohan, J.; Kundu, S.; Kwatra, N.; Ramjee, R.; Tumanov, A. Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems. arXiv 2024.
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2020, arXiv:1904.09675.
- Liu, J.; Chen, B.; Zhang, C. Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation. In Proceedings of the 42nd International Conference on Machine Learning, PMLR, Vancouver, BC, Canada, 13–19 July 2025; pp. 38188–38209. Available online: https://proceedings.mlr.press/v267/liu25g.html (accessed on 3 April 2026).
- Nielsen, J. Usability Engineering; Nielsen Norman Group: Fremont, CA, USA, 1993; Available online: https://www.nngroup.com/books/usability-engineering/ (accessed on 12 April 2026).
- Conde, J.; González, M.; Reviriego, P.; Gao, Z.; Liu, S.; Lombardi, F. Speed and Conversational Large Language Models: Not All Is About Tokens per Second. Computer 2024, 57, 74–80.
- Fu, Y.; Xue, L.; Huang, Y.; Brabete, A.-O.; Ustiugov, D.; Patel, Y.; Mai, L. ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models. arXiv 2024, arXiv:2401.14351.
- Patwari, R.; Sirasao, A.; Das, D. Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling. arXiv 2025.
- Brysbaert, M. How many words do we read per minute? A review and meta-analysis of reading rate. J. Mem. Lang. 2019, 109, 104047.
- Jahan, I.; Rahman, T.; Peng, C.; Huang, J.X. A comprehensive evaluation of large Language models on benchmark biomedical text processing tasks. Comput. Biol. Med. 2024, 171, 108189.
- Nguyen, V.A.; Ha, T.B.N.; Tran, M.N.; Pham, N.T.M.; Nguyen, T.L.; Vuong, T.Q.T. Quantifying the speed-accuracy trade-off of large language models on oral and maxillofacial surgery multiple-choice questions. Sci. Rep. 2025, 15, 40657.
- Abdin, M.; Jacobs, S.A.; Awan, A.A.; Aneja, J.; Awadallah, A.; Awadalla, H.; Bach, N.; Bahree, A.; Bakhtiari, A.; Behl, H.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219.
- Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115.
- Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
- Archet, A.; Gac, N.; Orieux, F.; Ventroux, N. Embedded AI performances of Nvidia’s Jetson Orin SoC series. In Proceedings of the 17ème Colloque National du GDR SOC2, Lyon, France, 12–14 June 2023.
- Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv 2023.
- Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv 2023.
- Bhat, S.R.; Rudat, M.; Spiekermann, J.; Flores-Herr, N. Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis. arXiv 2025.
- Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. arXiv 2020, arXiv:2002.10957.
- Ministry of Health Malaysia; Malaysian Health Technology Assessment Section (MAHTAS). Clinical Practice Guidelines (CPG). Available online: https://mymahtas.moh.gov.my/index.php/docman-list/publications/cpg-list (accessed on 28 December 2025).
- Dean, J.; Barroso, L.A. The tail at scale. Commun. ACM 2013, 56, 74–80.
- Parker, A.I. Boosting Cross-Architectural Emulation Performance by Foregoing the Intermediate Representation Model. arXiv 2025, arXiv:2501.03427.
- Dipert, B. NVIDIA JetPack 6.2 Brings Super Mode to NVIDIA Jetson Orin Nano and Jetson Orin NX Modules—Edge AI and Vision Alliance. Edge AI and Vision Alliance. 2025. Available online: https://www.edge-ai-vision.com/2025/01/nvidia-jetpack-6-2-brings-super-mode-to-nvidia-jetson-orin-nano-and-jetson-orin-nx-modules/ (accessed on 20 April 2026).
- Gomez-Cabello, C.A.; Prabha, S.; Haider, S.A.; Genovese, A.; Collaco, B.G.; Wood, N.G.; Bagaria, S.; Forte, A.J. Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support. Bioengineering 2025, 12, 1194.









| Small Language Model | Development Approach | Dataset Size and Source | Computational Resources |
|---|---|---|---|
| BioMedLM | Approach 1 | 34.6B tokens (Existing Datasets) | Cluster (128× A100 GPUs) |
| FinGPT | Approach 1 | 38B tokens (Web Scraping) | Cluster (LUMI supercomputer with 192 nodes) |
| Meerkat | Approach 2 | 441,034 samples (Existing Datasets + Medical Textbooks) | Cluster (8× A100 GPUs) |
| MEDFIT-LLM | Approach 2 | 10,000 synthetic samples (Ungrounded) | Apple Silicon (MLX) ~16–32 GB |

| Model | Role | License | Parameters Count | Original Size (Approx. FP16/BF16) |
|---|---|---|---|---|
| Qwen 2.5 (0.5B) | Student | Apache 2.0 | 0.49 billion | ~0.94 GB |
| Qwen 2.5 (1.5B) | Student | Apache 2.0 | 1.54 billion | ~3.08 GB |
| Qwen 2.5 (3B) | Student and Teacher | Apache 2.0 | 3.09 billion | ~6.18 GB |
| Phi-3-Mini (3.8B) | Student | MIT permissive | 3.82 billion | ~7.64 GB |

| Hyperparameter | Value | Reason |
|---|---|---|
| LORA_R | 16 | Provide enough dimensions to capture domain features while keeping the model efficient |
| LORA_ALPHA | 16 | Maintain a 1x ratio to LORA_R so that weight updates follow the natural scale of the adapters |
| LORA_DROPOUT | 0.05 | Apply light regularization to reduce overfitting while preserving stable learning |
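
The effect of the `LORA_R` and `LORA_ALPHA` values in the table can be sketched with a single linear layer. In the example below, the layer width of 1024 is a hypothetical size chosen for illustration, dropout is omitted, and the names `lora_forward`, `W`, `A`, and `B` are assumptions rather than anything taken from the paper's implementation.

```python
import numpy as np

LORA_R, LORA_ALPHA = 16, 16
scaling = LORA_ALPHA / LORA_R  # 1.0 for the values in the table


def lora_forward(x, W, A, B):
    """Frozen base projection plus the scaled low-rank update B @ A."""
    return x @ W.T + scaling * (x @ A.T) @ B.T


d_in, d_out = 1024, 1024  # hypothetical projection size, for illustration only
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in)).astype(np.float32)             # frozen weights
A = rng.normal(scale=0.01, size=(LORA_R, d_in)).astype(np.float32)  # trainable
B = np.zeros((d_out, LORA_R), dtype=np.float32)                   # trainable, zero init
x = rng.normal(size=(2, d_in)).astype(np.float32)

y = lora_forward(x, W, A, B)

full_params = W.size
adapter_params = A.size + B.size  # r * (d_in + d_out)
print(f"scaling = {scaling}")
print(f"adapter trains {adapter_params} of {full_params} parameters "
      f"({100 * adapter_params / full_params:.3f}%)")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and with a scaling factor of 1.0 the learned update is added without further amplification or damping.
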

| Feature | Emulated Environment (Docker + QEMU) | Native NVIDIA Jetson Orin NX |
|---|---|---|
| CPU Architecture | ARM64 (via QEMU Translation) | ARM64 (Native) |
| RAM Limit | 8 GB (Artificially Restricted) | 8 GB (Native Unified Memory) |
| Compute Cores | 8 Threads (CPU Host) | 8-Core ARM Cortex-A78AE |
| GPU Acceleration | None (CPU Only) | NVIDIA Ampere (1024 CUDA Cores, 32 Tensor Cores) |
| NPU/NVDLA | None | 2× NVDLA Engines |
| Inference Hardware | Sequential Processing (High Translation Overhead) | Highly Parallelized (CUDA/Mixed-Precision) |

| Device | Processor (CPU) | Graphics (GPU) | Memory | Storage | Software Environment | Role |
|---|---|---|---|---|---|---|
| Desktop Workstation | AMD Ryzen 9 3900X (12-core) | NVIDIA GeForce RTX 3080 (10 GB VRAM) | 32 GB DDR4 | 512 GB | Windows 11 w/ CUDA Parallel Computing | Dataset Generation, SLM Development and Benchmarking |
| Laptop (Acer Predator Helios Neo 16) | 13th Gen Intel® Core™ i7-13700HX (2.10 GHz) | NVIDIA GeForce RTX 4050 (6 GB VRAM) | 32 GB DDR5 | 1 TB | Windows 11 | Inference and Benchmarking |
| Docker Desktop w/QEMU | 8 Threads ARM64 (via QEMU emulation) | N/A (CPU-Based) | 8 GB (Artificially Restricted) | 256 GB (Host Allocated) | Ubuntu 22.04 LTS (Docker Container) | ARM-Architecture Validation |

| SpecioSLM Variants | LLM Original Size (FP16) | Specialized SLM Size (4-bit) | TPS (Tokens per Second) [Target > 30] | TTFT (Time to First Token, s) [Target < 1 s] | BERTScore (Fidelity, 0–100) [Target > 90] |
|---|---|---|---|---|---|
| SpecioSLM_ Qwen 2.5 (0.5B) | ~1 GB | 373.71 MB | 193.81 | 0.1265 | 82.84 |
| SpecioSLM_ Qwen 2.5 (1.5B) | ~3 GB | 934.69 MB | 152.46 | 0.1476 | 83.74 |
| SpecioSLM_ Qwen 2.5 (3B) | ~6 GB | 1834.82 MB | 113.12 | 0.1841 | 84.12 |
| SpecioSLM_ Phi-3-Mini (3.8B) | ~8 GB | 2281.66 MB | 91.59 | 0.1705 | 90.27 |
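
One way to read the size column above is as an effective number of bits stored per weight. The sketch below combines the 4-bit sizes from this table with the parameter counts from the model-selection table. It assumes 1 MB = 10^6 bytes, which the source does not specify, and the function name is hypothetical.

```python
# (parameters in billions, quantized size in MB) taken from the tables in this article
models = {
    "Qwen 2.5 (0.5B)": (0.49, 373.71),
    "Qwen 2.5 (1.5B)": (1.54, 934.69),
    "Qwen 2.5 (3B)": (3.09, 1834.82),
    "Phi-3-Mini (3.8B)": (3.82, 2281.66),
}


def bits_per_weight(params_billion: float, size_mb: float) -> float:
    """Effective bits per parameter, assuming 1 MB = 10**6 bytes."""
    return size_mb * 1e6 * 8 / (params_billion * 1e9)


effective_bits = {name: bits_per_weight(p, s) for name, (p, s) in models.items()}
for name, bits in effective_bits.items():
    print(f"{name}: {bits:.2f} bits per weight")
```

Under this assumption every model comes out above 4 bits per weight. That would be consistent with per-block scales and some tensors being kept at higher precision, although the source does not state the exact quantization format.
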

| Embedding Strategy | TPS (Tokens per Second) [Target > 30] | TTFT (Time to First Token, s) [Target < 1 s] | BERTScore (Fidelity, 0–100) [Target > 90] |
|---|---|---|---|
| Baseline (all-MiniLM-L6-v2) | 95.48 | 0.1685 | 89.44 |
| Domain-Specific (PubMedBERT) | 91.59 | 0.1705 | 90.27 |
| Hybrid Search (PubMedBERT + BM25) | 83.40 | 0.1737 | 90.35 |
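
The hybrid row combines a dense retriever with BM25, but the table does not say how the two rankings are merged. Reciprocal rank fusion is one common choice and is used below purely as an assumption; the chunk identifiers, the constant `k = 60`, and the function name are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists by summing 1 / (k + rank) for each document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["c3", "c1", "c7"]   # hypothetical chunk ids ranked by embedding similarity
sparse = ["c1", "c9", "c3"]  # hypothetical chunk ids ranked by BM25
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both lists rise to the top, which is the usual motivation for fusing a semantic ranking with a keyword ranking. The extra ranking and merging step adds retrieval work before generation can start.
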

| Model | Size | TPS (Tokens per Second) [Target > 30] | TTFT (Time to First Token, s) [Target < 1 s] | BERTScore (Fidelity, 0–100) [Target > 90] |
|---|---|---|---|---|
| SpecioSLM_ Phi-3-Mini (3.8B) | 2281.66 MB | 91.59 | 0.1705 | 90.27 |
| Qwen 2.5 (7B) | ~14 GB | 81.80 | 0.1767 | 84.71 |
| Llama-3.1 (8B) | ~16 GB | 75.63 | 0.1791 | 84.56 |

| Environment | TPS (Tokens per Second) [Target > 30] | TTFT (Time to First Token, s) [Target < 1 s] | BERTScore (Fidelity, 0–100) [Target > 90] |
|---|---|---|---|
| Docker w/QEMU (CPU-Bound Simulation) | 0.70 | 226.81 | 93.03 |
| Jetson Orin NX (Projected Results) | 35.90–40.90 | <1.0 | 93.03 |
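
The gap between the emulated and projected rows can be made explicit with a little arithmetic. The sketch below checks the measured QEMU figures against the three targets used throughout the tables and computes the speed-up implied by the projected Jetson range. The helper name and dictionary keys are hypothetical, and TTFT is assumed to be in seconds.

```python
TARGETS = {"tps_min": 30.0, "ttft_max_s": 1.0, "bertscore_min": 90.0}


def meets_targets(tps: float, ttft_s: float, bertscore: float) -> dict[str, bool]:
    """Return a pass/fail flag for each of the three evaluation targets."""
    return {
        "tps": tps > TARGETS["tps_min"],
        "ttft": ttft_s < TARGETS["ttft_max_s"],
        "bertscore": bertscore > TARGETS["bertscore_min"],
    }


# Measured values from the Docker w/QEMU row.
qemu = meets_targets(tps=0.70, ttft_s=226.81, bertscore=93.03)
print(qemu)

# Speed-up implied by the projected Jetson Orin NX throughput range.
projected_tps = (35.90, 40.90)
speedup = tuple(p / 0.70 for p in projected_tps)
print(f"implied speed-up: {speedup[0]:.1f}x to {speedup[1]:.1f}x")
```

The emulated run passes only the fidelity target, so the viability of the edge deployment rests on the projected throughput, which implies a speed-up of roughly 51 to 58 times over the CPU-bound emulation.
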

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yusuf, C.H.b.; Ong, L.-Y. A Pipeline for Domain-Specialized Small Language Models from Unstructured Data: A Test Case Using Malaysian Clinical Practice Guidelines. Appl. Sci. 2026, 16, 6630. https://doi.org/10.3390/app16136630

