Article

A Framework for Domain-Specific Dataset Creation and Adaptation of Large Language Models

by George Balaskas 1,2,*, Homer Papadopoulos 1,3, Dimitra Pappa 1, Quentin Loisel 4 and Sebastien Chastin 4,5

1 Institute of Informatics and Telecommunications, NCSR Demokritos, Ag. Paraskevi, 153 41 Athens, Greece
2 Department of Digital Systems, University of Piraeus, Karaoli ke Dimitriou, 185 34 Pireas, Greece
3 Syndesis Ltd., Ag. Paraskevi, 153 41 Athens, Greece
4 School of Health and Life Sciences, Glasgow Caledonian University, Cowcaddens Rd., Glasgow G4 0BA, UK
5 Department of Movement and Sports Science, Ghent University, BE-9000 Ghent, Belgium
* Author to whom correspondence should be addressed.
Computers 2025, 14(5), 172; https://doi.org/10.3390/computers14050172
Submission received: 6 March 2025 / Revised: 17 April 2025 / Accepted: 25 April 2025 / Published: 2 May 2025
(This article belongs to the Special Issue Using New Technologies in Cyber Security Solutions (2nd Edition))

Abstract

This paper introduces a novel framework for addressing domain adaptation challenges in large language models (LLMs), emphasising privacy-preserving synthetic data generation and efficient fine-tuning. The framework employs a multi-stage approach comprising document ingestion, relevance assessment, and automated dataset creation, which reduces the need for extensive technical expertise while safeguarding data privacy. We evaluate the framework's performance on domain-specific tasks in fields such as biobanking and public health, demonstrating that models fine-tuned using our method achieve results comparable to larger proprietary models. Crucially, experiments with 7B and 8B parameter LLMs show that these models retain their general instruction-following capabilities even when adapted to specialised domains. Key components of the framework include continuous pre-training, supervised fine-tuning (SFT), and preference optimisation methods such as direct preference optimisation (DPO), which together provide a flexible and configurable solution for deploying LLMs. The framework supports both local models and API-based solutions, making it scalable and accessible. By enabling privacy-preserving, domain-specific adaptation, the framework lowers the barrier to deploying LLMs in specialised applications, particularly for small- and medium-sized enterprises (SMEs) that lack extensive resources or technical expertise.
Keywords: dataset creation; model adaptation; model fine-tuning; deep learning; large language models
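
The three stages named in the abstract (document ingestion, relevance assessment, automated dataset creation) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Record` schema, the function names, the keyword-overlap relevance score, and the `template_generator` placeholder are all hypothetical. In particular, the relevance check here is a simple stand-in, and the generator stands in for whatever local or API-based model would produce the synthetic instruction and response pairs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Record:
    """One fine-tuning example (hypothetical schema, not from the paper)."""
    instruction: str
    response: str
    source_chunk: str


def ingest(documents: Iterable[str], chunk_words: int = 50) -> list[str]:
    """Stage 1: split raw documents into chunks of at most chunk_words words."""
    chunks = []
    for doc in documents:
        words = doc.split()
        for start in range(0, len(words), chunk_words):
            chunks.append(" ".join(words[start:start + chunk_words]))
    return chunks


def relevance(chunk: str, domain_terms: set[str]) -> float:
    """Stage 2: fraction of (lowercase) domain terms that occur in the chunk.

    Keyword overlap is an assumed stand-in for the relevance assessment.
    """
    if not domain_terms:
        return 0.0
    tokens = {w.strip(".,;:()").lower() for w in chunk.split()}
    return len(tokens & domain_terms) / len(domain_terms)


def build_dataset(
    documents: Iterable[str],
    domain_terms: set[str],
    generate: Callable[[str], tuple[str, str]],
    threshold: float = 0.5,
    chunk_words: int = 50,
) -> list[Record]:
    """Stage 3: keep relevant chunks and turn each into a training record."""
    records = []
    for chunk in ingest(documents, chunk_words):
        if relevance(chunk, domain_terms) >= threshold:
            instruction, response = generate(chunk)
            records.append(Record(instruction, response, chunk))
    return records


def template_generator(chunk: str) -> tuple[str, str]:
    """Placeholder generator; a real pipeline would call a model here."""
    return ("Summarise the following passage.", chunk)


if __name__ == "__main__":
    docs = [
        "Biobank samples require informed consent from donors.",
        "The weather was sunny today.",
    ]
    for rec in build_dataset(docs, {"biobank", "consent"}, template_generator):
        print(rec.instruction, "|", rec.source_chunk)
```

Passing the generator in as a callable mirrors the abstract's point that the framework supports both local models and API-based solutions: the filtering logic stays the same whichever backend produces the text.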
