Article

Automated Taxonomy Construction Using Large Language Models: A Comparative Study of Fine-Tuning and Prompt Engineering

1 Applied Data Science and Analytics, SRH University Heidelberg, 69123 Heidelberg, Germany
2 Faculty of Mathematics and Computer Science, University of Hagen, 58097 Hagen, Germany
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Eng 2025, 6(11), 283; https://doi.org/10.3390/eng6110283
Submission received: 29 April 2025 / Revised: 24 September 2025 / Accepted: 16 October 2025 / Published: 22 October 2025

Abstract

Taxonomies provide essential hierarchical structures for classifying information, enabling effective retrieval and knowledge organization in diverse domains such as e-commerce, academic research, and web search. Traditional taxonomy construction, heavily reliant on manual curation by domain experts, faces significant challenges in scalability, cost, and consistency when dealing with the exponential growth of digital data. Recent advancements in Large Language Models (LLMs) and Natural Language Processing (NLP) present powerful opportunities for automating this complex process. This paper explores the potential of LLMs for automated taxonomy generation, focusing on methodologies incorporating semantic embedding generation, keyword extraction, and machine learning clustering algorithms. We specifically investigate and conduct a comparative analysis of two primary LLM-based approaches using a dataset of eBay product descriptions. The first approach involves fine-tuning a pre-trained LLM using structured hierarchical data derived from chain-of-layer clustering outputs. The second employs prompt-engineering techniques to guide LLMs in generating context-aware hierarchical taxonomies based on clustered keywords without explicit model retraining. Both methodologies are evaluated for their efficacy in constructing organized multi-level hierarchical taxonomies. Evaluation using semantic similarity metrics (BERTScore and Cosine Similarity) against a ground truth reveals that the fine-tuning approach yields higher overall accuracy and consistency (BERTScore F1: 70.91%; Cosine Similarity: 66.40%) compared to the prompt-engineering approach (BERTScore F1: 61.66%; Cosine Similarity: 60.34%). We delve into the inherent trade-offs between these methods concerning semantic fidelity, computational resource requirements, result stability, and scalability. Finally, we outline potential directions for future research aimed at refining LLM-based taxonomy construction systems to handle large dynamic datasets with enhanced accuracy, robustness, and granularity.

1. Introduction

The systematic organization of information is paramount in navigating the ever-expanding digital landscape. Taxonomies, defined as hierarchical classification systems, serve as fundamental tools for structuring knowledge, facilitating efficient information retrieval, and enabling better understanding within various domains, ranging from biological sciences to modern e-commerce platforms and digital libraries [1]. Traditionally, the creation of taxonomies has been a meticulous labor-intensive process undertaken by domain specialists and knowledge engineers [2]. This manual approach, while potentially yielding high-quality results, suffers from significant drawbacks. The emergence of sophisticated Artificial Intelligence (AI) techniques, particularly in the fields of NLP and machine learning (ML), offers compelling alternatives for automating taxonomy construction. Recent breakthroughs, driven significantly by the development of LLMs, have provided unprecedented capabilities for machines to understand, interpret, and generate human language with remarkable nuance [3]. LLMs, such as those based on the Transformer architecture [4], are pre-trained on vast text corpora, enabling them to capture complex linguistic patterns and semantic relationships. This inherent understanding can be leveraged to automatically identify key concepts, group related terms, and infer hierarchical structures directly from raw data.

1.1. Motivation

Our motivation for this research stems from direct experience with the challenges of manual knowledge organization. In previous work, we developed the Content and Knowledge Management Ecosystem (KM-EP) to provide a platform for managing scientific as well as educational content and knowledge resources. Furthermore, the KM-EP was designed to act as a framework for researchers to deploy their work without spending time reimplementing basic functionalities, such as user management and task scheduling. A key component of the KM-EP is the Taxonomy Manager (TM). This system was built to support the construction, collaboration, management, and evolution of taxonomies, enabling users to develop and manage their own classification schemes. With the support of its version control system, users can manage changes to their taxonomies, track a complete history of modifications, and even create multiple branches of a taxonomy. Multimedia objects, publications, and other assets within the KM-EP can be classified with the TM’s support, allowing users to search and browse content quickly and easily. A persistent identifier for each term enables taxonomy evolution without affecting existing classifications, and a crowd-voting rating system helps to evaluate taxonomy quality [5]. Nevertheless, a critical limitation remains: taxonomies are still being built manually by users. This manual bottleneck highlights a broader challenge as manual methods are not only time-consuming and expensive but also inherently subjective and struggle to keep pace with the rapid growth and dynamic nature of information, particularly unstructured text data [6]. Moreover, the ongoing tasks of managing, developing, and evolving taxonomies as information landscapes change are associated with their own significant challenges, necessitating better support systems [7]. Therefore, our motivation is to support the automatic construction of taxonomies.

1.2. Problem Statement

Building upon this motivation, we now state the specific problem that guides this research: the efficient and effective automation of domain-specific taxonomy building, a task traditionally hampered by the scalability, cost, and consistency issues of manual curation [8]. This multifaceted process inherently involves three fundamental challenges: first, extracting the essential vocabularies and concepts within a given domain; second, inferring the types of relationships that exist between these concepts; and third, organizing these vocabularies into a coherent and logical hierarchical knowledge representation structure [9]. Although various computational methods for taxonomy construction exist, this paper focuses specifically on the capabilities of LLMs. While LLMs show promise for many NLP problems, it is not yet clear how effectively they can support both vocabulary extraction and hierarchical relationship building. This uncertainty motivates a comparative analysis of the two dominant LLM application paradigms: resource-intensive fine-tuning of LLMs for specialized accuracy versus the more agile but potentially less consistent method of prompt engineering.

1.3. Research Questions

To address these fundamental challenges, our research is guided by three specific questions, each corresponding to a core stage of taxonomy construction:
  • How effectively can LLMs identify and extract the vocabularies present in a domain-specific corpus?
  • To what degree can the two competing LLM methodologies, fine-tuning and prompt engineering, implicitly determine the correct hierarchical knowledge representation structure between these vocabularies?
  • Which approach yields a more accurate, stable, and logically sound multi-level taxonomy, and what are the inherent trade-offs between them?
Answering these questions is crucial for navigating the challenges of designing efficient and scalable LLM-based taxonomy systems [10,11].

1.4. Approaches

To answer the research questions, our study employs a structured methodology grounded in established knowledge discovery and machine learning [12]. The process begins with foundational data preparation stages, including data collection, cleaning, and concept identification via semantic keyword extraction and clustering. With this common semantic base established, our work then diverges to systematically evaluate two distinct LLM approaches across the subsequent stages of taxonomy construction.
The fine-tuning approach tackles this by using the clustered vocabularies to construct a hierarchically structured training dataset. A powerful base model, Meta-Llama-3 8B Instruct [13,14], is then fine-tuned on this data using the QLoRA technique. This process is designed to explicitly teach the model the domain’s hierarchical relationships, aiming for high fidelity and consistency. In contrast, the prompt-engineering approach addresses the same questions by guiding a pre-trained instruction-following LLM, Microsoft Phi-3.5 Instruct [15,16], to generate the hierarchy iteratively. The clustered concepts are provided as context within carefully engineered prompts, instructing the model to infer relationships and organize the structure level-by-level. This method leverages the existing knowledge of the model without parameter modification, prioritizing speed and adaptability. By systematically comparing these two state-of-the-art LLM approaches on a real-world dataset, this study aims to provide valuable insights into their respective strengths, weaknesses, and practical implications for automated knowledge organization. The findings contribute to understanding the trade-offs involved and inform the selection of appropriate methodologies for future automated taxonomy-building applications. The selection of LLM fine-tuning and prompt engineering as the two comparative approaches in this study is deliberate as they represent distinct and widely adopted strategies for harnessing the power of LLMs for complex NLP tasks like taxonomy building.
While employing the same base model for both methodologies would offer a more controlled comparison, our approach aims to evaluate two distinct strategic setups using models well-suited to the typical goals of each paradigm. Llama-3 8B has approximately 8 billion parameters, while Phi-3.5 Instruct is significantly smaller, around 3.8 billion parameters for its “mini” variant. Llama-3 8B was chosen for its strong baseline performance and proven responsiveness to fine-tuning, making it suitable for tasks where maximizing accuracy through specialization is a primary goal. This path is often pursued when maximal accuracy and consistency are paramount, albeit with higher upfront computational and data preparation costs [11]. Phi-3.5 is specifically optimized for strong instruction-following and efficient inference. Its selection reflects scenarios where the priorities are rapid development, lower resource demands, and adaptability by directly harnessing a model’s pre-trained intelligence. By comparing these two contemporary and powerful methodologies, this study aims to provide a clear understanding of their respective strengths, weaknesses, and the critical trade-offs concerning semantic accuracy, hierarchical coherence, computational expense, and result stability in the context of automated taxonomy generation from real-world e-commerce data. This direct comparison is essential for guiding practitioners and researchers in selecting the most appropriate LLM strategy for their specific needs and constraints. Therefore, the remainder of this paper is structured as follows: Section 2 reviews the relevant literature. Section 3 describes the acquisition and preprocessing of the dataset used in the study. Section 4 details the proposed methodologies and evaluation metrics. Section 5 presents and discusses the experimental results. Section 6 concludes the paper, summarizing the findings and outlining directions for future work.

2. Related Work

Automated taxonomy construction, a field closely related to ontology learning, integrates techniques from NLP, ML, and information systems. The research trajectory has evolved from foundational pattern-based and statistical methods towards sophisticated approaches involving deep learning and LLMs.
Early attempts sought to automate parts of the manual process [2]. A significant line of research involved leveraging lexico-syntactic patterns, famously pioneered by Hearst [17], to extract hypernymy (‘is-a’) relationships directly from text corpora. These pattern-based methods, while intuitive, often exhibit low recall and necessitate careful domain-specific pattern design. Concurrently, statistical approaches gained traction, analyzing term co-occurrence frequencies within documents or large corpora [18]. Distributional similarity methods, based on the hypothesis that words appearing in similar contexts have similar meanings, were also employed. Terms were represented by vectors capturing their co-occurring words, and similarity measures (e.g., Cosine Similarity) were used to infer relationships. Keyword extraction techniques like Term Frequency–Inverse Document Frequency (TF–IDF) [19] were frequently used in conjunction with these methods to identify salient terms before attempting relationship extraction or clustering [20].
ML, especially unsupervised clustering, has been pivotal. Algorithms like Hierarchical Agglomerative Clustering (HAC) are naturally suited due to their ability to generate tree-like dendrograms, mirroring taxonomic structures, as illustrated conceptually in Figure 1 [10,21]. These methods typically group terms or documents based on semantic similarity derived from embeddings or other distributional features [4]. Probabilistic models, such as Bayesian Rose Trees proposed by Song et al. [21], offered a more principled framework for hierarchical clustering for keyword data. Other clustering algorithms like K-Means were also adapted, often requiring methods to determine the optimal number of clusters and interpret the resulting flat partitions hierarchically [22]. While clustering could identify thematic groups, the main challenges involved mapping these clusters to meaningful ontological concepts, establishing explicit hierarchical relationships (beyond simple containment), and labeling the nodes appropriately. Graph-based methods represent another important paradigm, modeling terms as nodes and relationships as edges [23]. Algorithms operating on these graphs, such as minimum spanning tree computations on similarity graphs or community detection algorithms, can reveal latent hierarchical organizations. Velardi et al. [23], for instance, demonstrated a graph-based algorithm for taxonomy induction. Graph methods can capture more complex relationships than simple clustering but depend heavily on the quality of the initial graph construction and can face scalability issues with very large vocabularies.
The deep learning era brought significant advancements, particularly through improved representations of word meaning. Static word embeddings like Word2Vec and GloVe [4] provided dense lower-dimensional vectors that captured richer semantics than sparse methods, allowing words with similar meanings to have closer vector representations (conceptualized in Figure 2). However, the advent of contextual embeddings from pre-trained Transformer models like Bidirectional Encoder Representations from Transformers (BERT) [25], RoBERTa, and their successors marked a paradigm shift [4]. These models, often based on the Transformer architecture shown in Figure 3, generate embeddings that vary based on the word’s context, effectively handling polysemy and capturing nuanced meaning [26]. This improved semantic understanding facilitated more accurate relation extraction and hierarchical classification, often achieved by fine-tuning these models for specific tasks [25,27]. For example, fine-tuned BERT models were successfully applied to hierarchical product classification [25]. While fine-tuning yields strong performance, it typically requires substantial labeled data for the targeted hierarchical structure.
Most recently, LLMs have shown remarkable potential for knowledge structure induction, including taxonomy construction, often operating in zero-shot or few-shot settings [3]. Several strategies are being explored to leverage LLMs effectively for this purpose. One major strategy involves prompt-based methods, which rely on the in-context learning capabilities of LLMs [11,28]. These methods offer considerable advantages in terms of rapid development, significantly lower computational barriers as no model retraining is needed, and high flexibility, allowing practitioners to adapt models to new tasks or nuances by simply refining the input prompt [29]. They are particularly potent for zero-shot or few-shot learning scenarios, where labeled training data is scarce [30,31], and for interactive applications like chatbots and dynamic content generation systems, e.g., summarization, creative writing, and preliminary data exploration, where quick adjustments are beneficial [32]. Carefully crafted prompts, sometimes augmented with examples (few-shot prompting), guide the LLM to generate taxonomic relationships or classify entities without altering the model’s weights. Advanced prompting techniques, such as chain-of-layer (CoL) prompting for taxonomy induction [33] or more general chain-of-thought reasoning [28], aim to improve coherence and the model’s ability to tackle complex tasks by breaking them down. However, prompt engineering is not without its challenges. Performance can be highly sensitive to the precise wording and structure of the prompt, often termed “prompt brittleness” [34]. Ensuring consistent output quality and factual accuracy can be difficult. Finally, tackling highly specialized or deeply nuanced tasks may require impractically complex prompts or may simply be beyond the reach of prompting alone without supplemental mechanisms [11,33,35].
Figure 2. Conceptual example of word embeddings, representing words as vectors in a multi-dimensional space where semantically similar words are positioned closer together [36].
Figure 3. The Transformer model architecture, highlighting the multi-head self-attention mechanisms and feed-forward networks that enable sophisticated contextual understanding in models like BERT and GPT [37]. The models utilized in this study are specific variations of this full architecture.
An alternative strategy is fine-tuning, adapting a pre-trained LLM’s parameters to better suit the specific task and domain [11,25,38]. This approach generally aims for achieving state-of-the-art or significantly higher accuracy and consistency within the target domain as the model’s weights are directly modified to learn task-specific patterns, nuances, and knowledge from the training data [39]. It is often the preferred method for applications demanding high precision and robustness, such as specialized text classification (e.g., sentiment analysis in niche markets and legal document review), sequence labeling tasks (e.g., domain-specific NER), or tailored generative tasks (e.g., medical report generation), provided that sufficient curated training data is available [40]. While fine-tuning can lead to superior performance on the specialized task, it traditionally requires a substantial corpus of high-quality labeled training data, which can be a significant bottleneck and expense [41], and considerable computational resources for the training process, making it less agile than prompting for rapid iteration or adaptation to frequently changing requirements [42]. Furthermore, care must be taken to avoid issues like “catastrophic forgetting” of general capabilities if the fine-tuning dataset is too narrow or the process is not managed well [43]. Parameter-Efficient Fine-Tuning (PEFT) techniques, notably Low-Rank Adaptation (LoRA) [44] and its variants like QLoRA [45], have made fine-tuning large models more feasible by significantly reducing the number of trainable parameters and, consequently, the computational burden and memory footprint, thereby democratizing access to fine-tuning capabilities [46,47]. Hybrid approaches are being investigated, combining LLMs with other techniques. This might involve using LLMs to generate candidate relationships, which are subsequently validated using symbolic methods or external knowledge bases, or integrating LLM-derived insights with graph-based or clustering methods [48]. Direct comparisons, such as the study by Chen et al. [11], highlight the fundamental trade-offs between the flexibility and lower cost of prompting versus the potentially higher accuracy and stability of fine-tuning. Despite rapid progress, challenges related to controlling the quality, consistency, factual grounding, and structural validity of LLM-generated hierarchies persist.
Evaluating automatically generated taxonomies also remains a critical challenge. While comparison against gold-standard taxonomies is common, methodologies have evolved from purely structural comparisons to incorporating semantic understanding [49]. Modern evaluation relies heavily on semantic similarity metrics applied to node labels or descriptions. Contextual embedding-based metrics like BERTScore [50] and semantic Cosine Similarity using models such as Sentence-BERT [51] provide more nuanced assessments of meaning alignment than older lexical methods [52]. Our work contributes by applying these advanced semantic metrics to directly compare state-of-the-art PEFT fine-tuning (Llama-3) and prompt-engineering (Phi-3.5) techniques on a substantial real-world e-commerce dataset.

3. Data Acquisition and Preprocessing

This section details the pipeline for acquiring and preparing the dataset used in this study, which forms the foundation for both the fine-tuning and prompt-engineering approaches. The process involved several distinct stages, from initial data gathering to refined keyword extraction.

3.1. Data Source and Cleaning

The foundation for both methodologies was a dataset derived from the eBay online e-commerce platform, chosen for its vast and diverse catalog representing a real-world scenario for product classification. The data underwent a rigorous multi-stage cleaning and preprocessing pipeline:
  • Data Acquisition: Product information was systematically acquired using eBay’s Marketplace RESTful API [53]. Endpoints for retrieving item summaries across various categories were queried to gather an initial set of 5000 product listings, ensuring sufficient size and diversity for analysis and model training.
  • Feature Selection: From the raw JSON data retrieved via the API, only the product title field was retained for its descriptive value. This field was designated as the ‘Product Description’. All other metadata (item IDs, pricing, seller details, condition, etc.) were discarded to focus the analysis on the textual content used for classification.
  • Deduplication and Null Handling: To ensure data quality and uniqueness, entries with identical ‘Product Description’ texts were removed. Additionally, records with missing or null descriptions were filtered out. This refinement process reduced the dataset to approximately 4000 unique product descriptions, forming the core corpus for the study.
  • Text Normalization: Standard text normalization procedures were applied to the ‘Product Description’ field. This included converting all text to lowercase, using regular expressions to remove special characters, punctuation, and extraneous symbols, and normalizing whitespace by collapsing multiple spaces into single spaces. This standardization minimizes vocabulary variations unrelated to semantic meaning. A short code sketch of these normalization steps follows this list.
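The following is a minimal Python sketch of these normalization steps; the helper function name is illustrative and not part of the study's released code.

```python
import re

def normalize_description(text: str) -> str:
    """Lowercase, drop special characters, and collapse whitespace."""
    text = text.lower()
    # Remove anything that is not a letter, digit, or space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(normalize_description("NEW!! Apple iPhone 13 -- 128GB (Unlocked)"))
# -> "new apple iphone 13 128gb unlocked"
```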

3.2. Keyword Extraction and Refinement

To distill the core concepts from the descriptions for clustering and LLM processing, a keyword-based approach was implemented.
  • Keyword Extraction: The KeyBERT library [54] was employed to extract the most representative keywords. This technique utilizes the ‘all-MiniLM-L6-v2’ Sentence-BERT model [51] to generate contextual embeddings for each description. Based on these embeddings, KeyBERT identifies and extracts the short keywords or keyphrases that best capture the semantic essence of the product. For this study, the model was configured to extract the single most relevant 1–2 word phrase from each description. This method provides a semantically grounded representation superior to simple frequency-based techniques.
  • Keyword Cleaning: The keywords extracted by KeyBERT underwent a final semi-automatic cleaning step. This involved a review of the extracted terms to filter out overly generic words (e.g., “item”, “new”, and “sale”) or any remaining artifacts from the extraction process that lack discriminative power for categorization. This step, while involving some manual oversight, was crucial to ensure that the keywords forming the basis for subsequent clustering were both semantically meaningful and relevant for distinguishing between different product types. Figure 4 illustrates examples of original product descriptions and the corresponding cleaned keywords extracted by this process. A short code sketch of the extraction and filtering steps follows this list.
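The sketch below illustrates keyword extraction with KeyBERT followed by a simple generic-word filter, assuming the publicly documented KeyBERT API; the exact configuration and filter list used in the study may differ.

```python
from keybert import KeyBERT

# Same sentence-embedding backbone as named in the text
kw_model = KeyBERT(model="all-MiniLM-L6-v2")

description = "wireless bluetooth noise cancelling over ear headphones"

# Extract the single most relevant 1-2 word phrase for the description
keywords = kw_model.extract_keywords(
    description,
    keyphrase_ngram_range=(1, 2),
    stop_words="english",
    top_n=1,
)

# Semi-automatic cleanup: drop overly generic terms before clustering
GENERIC_TERMS = {"item", "new", "sale"}
keywords = [(kw, score) for kw, score in keywords if kw not in GENERIC_TERMS]
print(keywords)  # e.g., [("bluetooth headphones", 0.71)] -- score is illustrative
```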

4. Methodology

This study implements and evaluates two distinct approaches for automating the construction of a three-level hierarchical taxonomy (Categories I, II, and III) from the processed eBay product data described in Section 3. Both approaches utilize the same foundational data involving keyword extraction and clustering but differ fundamentally in how they leverage Large Language Models to generate the final taxonomy. The first approach focuses on specializing an LLM through fine-tuning, while the second relies on guiding a pre-trained LLM using carefully engineered prompts. The following subsections detail the specific steps involved in each of these comparative approaches.

4.1. Approach 1: Chain-of-Layer Clustering and LLM Fine-Tuning

This methodology focuses on creating a high-quality structured hierarchical dataset through iterative clustering and semantic labeling, which is then used to fine-tune a powerful LLM for the final taxonomy generation task. The goal is to imbue the LLM with domain-specific hierarchical knowledge for consistent and accurate classification. The conceptual workflow is depicted in Figure 5.
Initial Clustering (Base Level): The process began with clustering the cleaned extracted keywords derived from product descriptions. HAC was chosen for this step due to its ability to build clusters bottom-up without pre-specifying the number of clusters (although a cut-off or target number is typically used) and its natural generation of a dendrogram structure suitable for taxonomic exploration. Using Cosine Similarity on keyword embeddings (derived implicitly via KeyBERT’s underlying SentenceTransformer) as the distance metric and average linkage for cluster merging, approximately 350 initial clusters were formed. This number was determined through experimentation, balancing granularity with cluster coherence, guided by silhouette score analysis. This step effectively grouped products with highly similar core keywords, forming the foundational Category I level and assigning each product description to one of these base clusters.
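A condensed sketch of this base-level clustering step is given below, assuming sentence-transformers embeddings and scikit-learn's AgglomerativeClustering; the toy keyword list and cluster count are illustrative only, and the `metric` argument is named `affinity` in older scikit-learn releases.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

keywords = ["bluetooth headphones", "wireless earbuds", "running shoes", "trail sneakers"]

# Embed the cleaned keywords with the same model KeyBERT uses internally
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(keywords)

# Bottom-up clustering with cosine distance and average linkage;
# the study targets ~350 clusters, guided by silhouette score analysis
hac = AgglomerativeClustering(
    n_clusters=2,        # 2 only because this toy example has four keywords
    metric="cosine",     # use affinity="cosine" on scikit-learn < 1.2
    linkage="average",
)
category_i_labels = hac.fit_predict(embeddings)
print(category_i_labels)  # e.g., [0, 0, 1, 1]
```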
Hierarchical Layer Generation (Iterative Clustering): Subsequently, deeper hierarchical layers were generated iteratively. To generate Category II, the textual content representing each Category I cluster (either the primary keywords or initial LLM-generated names) was aggregated. These aggregated textual representations were then vectorized using TF–IDF, which effectively weights terms based on their importance within and across the Category I groupings. K-Means clustering, known for its efficiency on large datasets, was then applied to these TF–IDF vectors to partition the Category I clusters into approximately 90–120 broader Category II groups. This process was repeated to create Category III: the textual representations of the newly formed Category II groups were vectorized using TF–IDF, and K-Means clustering was applied again to consolidate these into approximately 40 top-level categories. This iterative application of TF–IDF and K-Means aimed to efficiently group increasingly abstract semantic concepts derived from the lower levels.
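The following sketch illustrates one iteration of this layer-generation step with scikit-learn's TfidfVectorizer and KMeans; the cluster representations and cluster counts are illustrative, not the study's actual values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# One text blob per Category I cluster (its keywords or initial LLM-generated name)
category_i_texts = [
    "bluetooth headphones wireless earbuds",
    "portable speakers soundbars",
    "running shoes trail sneakers",
    "hiking boots outdoor footwear",
]

# Weight terms by importance within and across the Category I groupings
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(category_i_texts)

# Partition the Category I clusters into broader Category II groups
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
category_ii_labels = kmeans.fit_predict(X)
print(category_ii_labels)  # e.g., [0, 0, 1, 1]

# Repeating TF-IDF + K-Means on the Category II representations yields Category III
```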
LLM-Based Category Naming: Assigning meaningful human-readable labels to the machine-generated clusters at each level (Categories I, II, and III) was crucial. For this, a capable instruction-following LLM, Microsoft Phi-3.5 Instruct [15], was employed. For each cluster at each hierarchical level, a prompt was constructed. This prompt included the most frequent or representative terms (either the original keywords for Category I or the constituent lower-level category names for Categories II and III) associated with that cluster. The LLM was specifically instructed to generate a concise, professional, and relevant category name, ideally consisting of 2–3 words, reflecting the semantic core of the provided terms. The prompts used for this category naming task can be found in Appendix A.
Name Refinement: Raw LLM outputs often include conversational filler, formatting inconsistencies, or extraneous text. Therefore, a dedicated name refinement step was implemented. Regular expressions and string processing techniques were systematically applied to the generated names to remove common artifacts, such as introductory phrases (“The category name is…”), trailing punctuation, markdown formatting, numerical list prefixes, and other non-essential text. The goal was to produce clean, consistent, and directly usable category labels for the final taxonomy structure.
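A minimal sketch of this kind of rule-based cleanup is given below; the specific regular expressions are illustrative examples of the artifact classes described above, not the exact patterns used in the study.

```python
import re

def refine_category_name(raw: str) -> str:
    """Strip common LLM artifacts from a generated category name."""
    name = raw.strip()
    # Drop introductory phrases such as "The category name is ..."
    name = re.sub(r"^the\s+category\s+name\s+is\s*:?\s*", "", name, flags=re.IGNORECASE)
    # Remove markdown bullets, numbering, and list prefixes
    name = re.sub(r"^[\s\-\*\d\.\)]+", "", name)
    # Trim trailing punctuation, then surrounding quotes and emphasis markers
    name = re.sub(r"[\s\.\,\;\:]+$", "", name)
    return name.strip(" \"'*`")

print(refine_category_name('The category name is: "Wireless Audio Devices".'))
print(refine_category_name("**1. Wireless Audio Devices**"))
# Both calls return "Wireless Audio Devices"
```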
Fine-Tuning Data Preparation: With the three-level hierarchy (Categories I, II, and III) names generated and refined for all initial product descriptions, a dataset suitable for instruction fine-tuning was constructed. Each data instance typically mapped an input (comprising the original ‘Product Description’ potentially augmented with a task instruction like “Categorize this product:”) to a structured output representing the full hierarchical path (e.g., a JSON-like string: “‘Category I’: ‘Specific Name’, ‘Category II’: ‘Broader Name’, ‘Category III’: ‘Top Level Name’”). This formatting creates clear examples for the LLM to learn the mapping from description to hierarchical classification.
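The sketch below shows how one such training instance might be assembled; the field names and instruction wording are illustrative assumptions rather than the study's exact format.

```python
import json

def build_training_example(description, cat1, cat2, cat3):
    """Pair an instruction-style input with the full hierarchical path as the target output."""
    prompt = f"Categorize this product: {description}"
    completion = json.dumps(
        {"Category I": cat1, "Category II": cat2, "Category III": cat3}
    )
    # A single "text" field is a common input format for supervised fine-tuning libraries
    return {"text": prompt + "\n" + completion}

example = build_training_example(
    "wireless bluetooth noise cancelling over ear headphones",
    "Wireless Headphones", "Audio Equipment", "Consumer Electronics",
)
print(example["text"])
```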
LLM Fine-Tuning (QLoRA): Finally, the Meta-Llama-3 8B Instruct model [13,14] was adapted using the prepared dataset. This model was selected for the fine-tuning approach due to its status as a powerful open-weight foundational model with 8 billion parameters, offering a strong capacity for learning domain-specific nuances through fine-tuning. Its ‘Instruct’ variant provides an excellent starting point for specialization, and its scale is conducive to achieving significant performance gains with PEFT techniques like QLoRA within a research context. PEFT [46], specifically the QLoRA technique [45], was used for efficiency. This involved quantizing the base model to 4-bits using the ‘bitsandbytes’ library [55] and adding trainable LoRA adapters [44] to specific layers (self-attention query, key, value, and output projections, and MLP gate, up, and down projections). Key LoRA hyperparameters included rank (r = 32), alpha (16), and dropout (0.05). The model was trained using the prepared instruction dataset with the ‘trl’ library’s ‘SFTTrainer’ [56]. Training spanned 2 epochs with a maximum sequence length of 512 tokens, employing the paged AdamW 8-bit optimizer, a learning rate of 1 × 10⁻⁴, and gradient accumulation to manage memory usage [57]. The resulting fine-tuned adapter weights, representing the learned task-specific knowledge, were saved for later use during inference. Key parameters for this fine-tuning phase are summarized in Table 1.
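For concreteness, the following condensed sketch shows how the described QLoRA setup can be expressed with the Hugging Face transformers, peft, datasets, and trl libraries. The hyperparameters named in the text (r = 32, alpha = 16, dropout = 0.05, 2 epochs, 512-token sequences, paged AdamW 8-bit, learning rate 1 × 10⁻⁴) are reproduced; the batch size, gradient accumulation steps, and toy dataset are illustrative assumptions, and some argument names vary between trl versions.

```python
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# 4-bit quantization of the frozen base model via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Trainable LoRA adapters on the attention and MLP projections named in the text
peft_config = LoraConfig(
    r=32, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Toy single-example dataset; the real dataset holds one "text" string per product
train_dataset = Dataset.from_list([{
    "text": 'Categorize this product: wireless bluetooth headphones\n'
            '{"Category I": "Wireless Headphones", "Category II": "Audio Equipment", '
            '"Category III": "Consumer Electronics"}'
}])

training_args = TrainingArguments(
    output_dir="llama3-taxonomy-qlora",
    num_train_epochs=2,
    learning_rate=1e-4,
    per_device_train_batch_size=2,      # illustrative; not reported in the paper
    gradient_accumulation_steps=8,      # illustrative; not reported in the paper
    optim="paged_adamw_8bit",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,                # renamed processing_class in newer trl releases
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",          # moved into SFTConfig in newer trl releases
    max_seq_length=512,                 # moved into SFTConfig in newer trl releases
)
trainer.train()
trainer.model.save_pretrained("llama3-taxonomy-qlora-adapter")
```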

4.2. Approach 2: Prompt Engineering for Context-Aware Taxonomy

This methodology leverages the inherent capabilities of a pre-trained LLM, guided solely by prompts, to construct the hierarchical taxonomy without undergoing any parameter updates via fine-tuning. It prioritizes computational efficiency and adaptability [32]. The computational efficiency of prompt engineering arises primarily because it uses LLMs in their pre-trained state, thereby circumventing the substantial computational costs associated with training or fine-tuning. The latter processes involve intensive computations for gradient calculations and weight updates across potentially large datasets, often necessitating specialized hardware (e.g., GPUs) and significant processing time. In contrast, prompt engineering relies solely on inference, which, while still requiring computation, is considerably less resource-intensive per task instance than a full fine-tuning cycle. Adaptability is a key advantage as behavior can be modified by altering the input prompts, allowing for much faster iteration and adjustment to new requirements compared to retraining a model. For this approach, the Microsoft Phi-3.5 Instruct model [15,16] was selected. The rationale for choosing Phi-3.5 Instruct was multifaceted: at the time of model selection, it represented a compelling balance of strong instruction-following capabilities, crucial for a prompt-based approach, and notable efficiency due to its relatively smaller architecture compared to larger foundational models. This efficiency is particularly advantageous for iterative prompting tasks across a dataset, as performed in this study for generating hierarchical levels. Its design emphasizes high performance on complex reasoning and instruction tasks with fewer parameters, making it an excellent candidate to showcase the potential of prompt engineering with modern optimized LLMs.
While other LLMs could also be suitable for prompt-based taxonomy generation, their inclusion was constrained by the study’s scope, which aimed to compare two distinct strategic approaches (fine-tuning a moderately sized model vs. prompting an efficient capable model) rather than exhaustively benchmarking all available LLMs. For instance, larger models, such as OpenAI’s GPT-4 or more extensive Llama variants, while possessing powerful instruction-following abilities, often come with higher computational overhead or API costs that might not align with the efficiency aspect explored in this prompt-engineering approach. Conversely, other similarly sized efficient instruction-tuned models were available; however, Phi-3.5 Instruct was chosen as a strong representative of this class, allowing for a focused comparison of the fine-tuning versus prompt-engineering paradigms. The conceptual workflow is illustrated in Figure 6.
Initial Keyword Processing and Clustering: This foundational step closely mirrored Approach 1. Keywords relevant to product descriptions were extracted using KeyBERT. These keywords were then grouped using unsupervised clustering techniques (e.g., Agglomerative Clustering or K-Means on TF–IDF vectors derived from the keyword sets associated with each product). This initial clustering aimed to identify semantically coherent groups of products based solely on their most descriptive terms, forming the basis upon which the LLM would build the hierarchy. A dataset linking the original product descriptions to these initial keyword clusters was prepared for subsequent processing.
Hierarchical Generation via Iterative Prompting: The core of this approach relied on guiding the Microsoft Phi-3.5 Instruct model [15,16] through a sequence of carefully engineered prompts to generate the three taxonomic levels iteratively. The prompts used for this step can be found in Appendix A. The process was designed to build the hierarchy from specific to general:
  • Level 1 (Category I) Generation: For each initial keyword cluster identified in the previous step, a specific prompt was constructed. This prompt provided the LLM with a set of representative keywords (e.g., the top 15 most frequent keywords) from that cluster as the primary context. The instruction within the prompt explicitly asked the LLM to generate a concise (ideally 2-word) professional-sounding category name suitable for the eBay e-commerce context. Emphasis was placed on ensuring the name was relevant to product classification and avoided ambiguity. A system message reinforcing the LLM’s role as a “product categorization expert” was potentially included to frame the task. To promote consistency and reduce randomness in the naming, a low generation temperature (e.g., 0.3) was used.
  • Level 2 (Category II) Generation: Once Category I names were generated for all initial clusters, the products were regrouped based on these assigned Category I labels. For each resulting group (containing multiple similar Category I names), a new prompt was formulated. This prompt presented the list of constituent Category I names as context and instructed the LLM to abstract a broader common theme, generating a suitable 2–3 word Category II name encompassing the characteristics of the input Category I names. This abstraction process is illustrated conceptually in Figure 7.
  • Level 3 (Category III) Generation: The iterative process was repeated one final time. Products were grouped according to their assigned Category II names. For each of these broader groups, a prompt containing the relevant Category II names was sent to the LLM, requesting the generation of the most general top-level Category III name appropriate for that collection of categories.
Post-Processing: A crucial component, applied after each level of LLM generation (I, II, and III), was rigorous post-processing of the raw outputs. LLMs often generate text that includes conversational filler, justifications, inconsistent formatting (like bullet points or numbering), or superfluous punctuation. Regular expressions and string manipulation functions were employed systematically to strip these artifacts, removing introductory phrases, ensuring consistent casing, eliminating unwanted characters, and extracting only the core category name. This cleaning ensured that the labels were standardized and suitable both for use in the final taxonomy and as clean input context for generating the next hierarchical level in the iterative prompting process. The final output was the structured three-level classification for each product description.
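The sketch below illustrates a single Category I naming call in this style, assuming a recent transformers release whose text-generation pipeline accepts chat-style messages; the prompt wording is a paraphrase of the description above (the actual prompts are listed in Appendix A), and only a minimal cleanup step is shown in place of the fuller regex post-processing described here.

```python
from transformers import pipeline

# Instruction-following model used in the prompt-engineering approach
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3.5-mini-instruct",
    device_map="auto",
)

cluster_keywords = ["bluetooth headphones", "wireless earbuds", "noise cancelling headset"]

messages = [
    {"role": "system",
     "content": "You are a product categorization expert for an e-commerce catalog."},
    {"role": "user",
     "content": ("Given these product keywords: " + ", ".join(cluster_keywords) +
                 ". Generate a concise, professional two-word category name. "
                 "Return only the name.")},
]

# Low temperature to reduce naming variability across runs
output = generator(messages, max_new_tokens=16, do_sample=True, temperature=0.3)

# With chat-style input, the pipeline returns the conversation including the assistant reply
raw_name = output[0]["generated_text"][-1]["content"]
category_i_name = raw_name.strip().strip('"')  # the fuller regex cleanup would be applied here
print(category_i_name)
```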

4.3. Evaluation Metrics

The taxonomies generated by both approaches were evaluated against a manually curated or existing eBay taxonomy structure serving as the ground truth. A separate test set of 1000 product descriptions was used for inference. The alignment was measured using two complementary semantic similarity metrics.
First, Cosine Similarity [51] was calculated by generating Sentence-BERT embeddings ($v_{pred}$ and $v_{ref}$) for the predicted and ground-truth category labels at each of the three levels and computing the cosine of the angle between them. It measures the orientation similarity independent of magnitude. Scores closer to 1 indicate higher semantic similarity. The formula is as follows:
$$\mathrm{Cosine\ Similarity}(v_{pred}, v_{ref}) = \frac{v_{pred} \cdot v_{ref}}{\lVert v_{pred} \rVert \, \lVert v_{ref} \rVert} = \frac{\sum_{i=1}^{n} v_{pred,i} \, v_{ref,i}}{\sqrt{\sum_{i=1}^{n} v_{pred,i}^{2}} \, \sqrt{\sum_{i=1}^{n} v_{ref,i}^{2}}}$$
Average scores per level and overall were calculated.
Second, BERTScore [50] provided a finer-grained comparison by matching tokens ($x_i$ in the predicted sentence $x$, $y_j$ in the reference sentence $y$) based on contextual embeddings (e.g., from DeBERTa). It computes precision ($P_{BERT}$), recall ($R_{BERT}$), and F1-score ($F_{BERT}$) based on maximal Cosine Similarity matching between tokens, potentially weighted by token importance (e.g., IDF weights $w(y_j)$). The core formulas are [50] as follows:
$$R_{BERT} = \frac{1}{|y|} \sum_{y_j \in y} w(y_j) \max_{x_i \in x} v(y_j)^{\top} v(x_i)$$
$$P_{BERT} = \frac{1}{|x|} \sum_{x_i \in x} w(x_i) \max_{y_j \in y} v(x_i)^{\top} v(y_j)$$
$$F_{BERT} = \frac{2 \, P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}}$$
where $v(\cdot)$ represents the contextual embedding of a token. The F1-score served as the primary measure of token-level semantic overlap and contextual relevance [52]. Scores per level and overall were calculated using BERTScore.
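Both metrics can be computed with off-the-shelf libraries; the sketch below uses sentence-transformers for Cosine Similarity and the bert-score package for BERTScore. The embedding and scoring models shown are illustrative defaults and may differ from the exact configuration used in the study.

```python
from sentence_transformers import SentenceTransformer, util
from bert_score import score

predicted = ["Wireless Headphones", "Audio Equipment"]
reference = ["Headphones", "Portable Audio"]

# Cosine Similarity between Sentence-BERT embeddings of predicted and reference labels
embedder = SentenceTransformer("all-MiniLM-L6-v2")
pred_emb = embedder.encode(predicted, convert_to_tensor=True)
ref_emb = embedder.encode(reference, convert_to_tensor=True)
cosine_scores = util.cos_sim(pred_emb, ref_emb).diagonal()
print("Average Cosine Similarity:", cosine_scores.mean().item())

# BERTScore precision/recall/F1 from token-level contextual-embedding matching
P, R, F1 = score(predicted, reference, lang="en")
print("Average BERTScore F1:", F1.mean().item())
```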

5. Results and Discussion

The performance of the two automated taxonomy construction methodologies, fine-tuning (LLaMA-3 8B) and prompt engineering (Phi-3.5 Instruct), was quantitatively evaluated using BERTScore and Cosine Similarity against the eBay ground-truth taxonomy. Inference was performed on a held-out test set of 1000 eBay product descriptions. Table 2 presents a high-level summary of the core findings, directly comparing the two approaches across key dimensions: overall accuracy, computational resource usage, and processing speed. This table highlights the central trade-offs observed in the study. The following subsections provide a more detailed breakdown and discussion of the results for each methodology individually.

5.1. Fine-Tuning Approach

The fine-tuned LLaMA-3 8B model demonstrated a strong capacity to generate taxonomies closely aligned with the reference structure, showcasing the benefits of specialized training. As presented in Table 3, the model achieved a robust overall average BERTScore F1 of 70.91%. This score indicates a high degree of semantic overlap at the token level between the generated category names and the ground-truth labels. Notably, the performance was relatively consistent across the hierarchy, achieving F1-scores of 68.14% for Category I, 69.60% for Category II, and peaking at 75.00% for Category III. This consistent performance, especially the strength at the broader Category III level, suggests that the fine-tuning process successfully enabled the model to internalize not only specific category semantics but also the abstract relationships forming the higher levels of the hierarchy. The overall average Cosine Similarity score of 66.40% further reinforces this finding, indicating good global semantic alignment of the generated labels.
To provide a clearer qualitative understanding of the fine-tuned model’s performance, Table 4 presents a selection of successful and unsuccessful category generations at different hierarchical levels, along with a brief analysis. This qualitative analysis reveals that the model frequently succeeds by generating either identical or synonymous category names. Many partial successes are, in fact, reasonable semantic alternatives, where the model might use more modern phrasing, e.g., “Health, Wellness, and Personal Care” vs. “Health and Beauty”, or describe a function over a form, e.g., “Health Monitoring and Tracking Devices” vs. “Wearable Health Devices”, demonstrating a deep contextual understanding. Failures typically occur when the model becomes overly specific, latching onto technical keywords from product descriptions, e.g., “Grid Suspension Systems” instead of “Suspended Ceiling Systems”, or when it generates overly broad and less functional categories.
Qualitatively, the output from the fine-tuned model, exemplified in Table 5, appeared to be well-structured, with consistent and logical category names. This stability and reliability make the fine-tuning approach particularly suitable for applications where a dependable and accurate taxonomy is paramount. However, this improved performance comes at a significant cost, which can be quantified in terms of computational resources and time. Our experiments were conducted on a cloud instance with an NVIDIA A100 GPU, using open-weight models without incurring direct API fees. The fine-tuning process itself, even utilizing efficient methods like QLoRA, required considerable resources (as detailed in Table 2) and careful preparation of the training dataset derived from the 4000 product descriptions. This data preparation phase, including keyword extraction, layered clustering, and initial LLM-based naming, represented a substantial upfront investment of approximately 1.5 h. The model training itself took 31 min and 17 s. In contrast, the prompt-engineering approach required no training time. Furthermore, adapting the fine-tuned model to new data or evolving category structures would likely necessitate retraining, adding to the maintenance overhead.

5.2. Prompt-Engineering Approach

The prompt-engineering approach, leveraging the zero-shot capabilities of Phi-3.5 Instruct, presented a contrasting profile characterized by speed and efficiency but lower overall fidelity and consistency. The evaluation scores are detailed in Table 6. A summary comparing the overall performance and computational trade-offs of both methodologies is presented in Table 2.
The overall average BERTScore F1 was 61.66%, and the overall average Cosine Similarity was 60.34%, both metrics falling noticeably short of the fine-tuned model’s performance. A striking aspect of the results was the performance disparity across the hierarchical levels. This approach achieved its best results at the most granular level (Category I), with a BERTScore F1 of 72.05% and a Cosine Similarity of 79.79%. This suggests that, when provided with direct specific context (i.e., the keywords from the initial clustering), the LLM could effectively generate relevant low-level category names via prompting. However, the performance significantly deteriorated at Category II (57.32% F1, 52.20% Cosine) and Category III (55.61% F1, 49.02% Cosine). These levels were generated iteratively, using the model’s own output from the previous level as the input for the next prompt. This iterative process appears to be prone to error propagation and semantic drift, making it difficult for the model to maintain coherence and accurately abstract broader categories purely through prompting. While the method successfully generated structurally complete taxonomies (example in Figure 8) with minimal computational resources and time, the lower quality and, critically, observed variability in category naming (even with deterministic temperature settings) pose significant challenges. This lack of consistency makes the prompt-engineering approach less suitable for applications requiring stable, reproducible, and easily maintainable taxonomies. A qualitative error analysis reveals that the performance degradation at higher levels is primarily due to semantic drift. While the Category I generation is grounded in specific product keywords, the generation of Category II relies on the model’s own potentially imperfect Category I outputs. This iterative process continues for Category III, progressively distancing the model from the original data context. This leads to errors such as (1) over-generalization, where the model focuses on common words like “gear” or “essentials” to create generic labels, e.g., Wilderness Essentials instead of Sporting Goods; (2) stylistic naming, where it produces marketing-oriented names, e.g., Luxury Collection instead of Jewelry and Watches; and (3) literal interpretation, leading to overly simple names, e.g., Kit Collection instead of Toys and Hobbies.

5.3. Task-Based Evaluation

To address the practical utility of the generated taxonomy beyond semantic similarity metrics, we conducted a task-based evaluation. We applied our best-performing model (the fine-tuned LLaMA-3) to the public Kaggle e-commerce text classification dataset [58]. This dataset contains thousands of product descriptions, each with a ground-truth label from one of four broad categories: “Electronics”, “Household”, “Books”, or “Clothing and Accessories”. Our goal was to assess whether the taxonomy generated by our model would structure these products in a way that is coherent and aligned with this human-defined classification scheme. The hierarchical structure generated by the model for this dataset is illustrated in Table 7, demonstrating the model’s ability to create a coherent multi-level organization from unstructured product descriptions.

5.3.1. Qualitative Analysis

Our model generated a three-level taxonomy for each product. The alignment between our generated top-level category (Category III) and the ground-truth label was highly consistent and demonstrated a deep semantic understanding. Table 8 presents illustrative examples.
The analysis reveals that the generated taxonomy is not only semantically sound but often provides more specific and useful categorizations than the broad ground-truth labels. The model’s ability to categorize based on content (e.g., the “Born to Run” example) highlights its potential for enhancing product discovery in a real-world e-commerce search application.

5.3.2. Quantitative Alignment

To quantify this alignment, we mapped our model’s generated Category III labels to the four ground-truth categories from the Kaggle dataset. The results, summarized in Table 9, show a very high degree of correlation, confirming the structural coherence of our generated taxonomy.
This task-based evaluation demonstrates that the taxonomy generated by our fine-tuned model is not just an abstract semantic structure but a practically useful tool that can organize products in a coherent, granular, and semantically rich manner, consistent with human-defined categories.

5.4. Discussion

The comparative analysis underscores a critical trade-off inherent in current LLM-based taxonomy automation strategies. Fine-tuning (Approach 1) excels in producing high-fidelity, consistent, and hierarchically coherent taxonomies. By explicitly training the model on structured examples from the target domain, it effectively internalizes the required semantic relationships and structural patterns, leading to superior overall performance, especially in capturing broader conceptual levels. To formally validate this, we performed a paired t-test on the per-instance scores from all three hierarchical levels combined. This analysis confirmed that the fine-tuning approach was statistically superior to the prompt-engineering approach across both overall BERTScore F1 (p < 0.001) and overall Cosine Similarity (p < 0.001). This provides strong evidence that the fine-tuning methodology yielded genuinely superior performance in this study, aligning with related work suggesting that fine-tuning enhances domain specialization [11,33]. However, this accuracy and reliability come at the cost of substantial computational resources, the need for potentially complex training data preparation (involving prior clustering and naming), and reduced flexibility for rapid adaptation to new domains or structural changes.
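The significance test itself is straightforward to reproduce given per-instance score arrays for both approaches; a minimal sketch with SciPy's paired t-test is shown below (the numbers are toy values for illustration only).

```python
from scipy.stats import ttest_rel

# Per-instance overall scores for the same test items under both approaches
# (toy values for illustration only)
fine_tuned_f1 = [0.72, 0.69, 0.75, 0.71, 0.68]
prompted_f1 = [0.63, 0.60, 0.66, 0.62, 0.59]

t_stat, p_value = ttest_rel(fine_tuned_f1, prompted_f1)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```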
Prompt engineering (Approach 2) represents the opposite end of the spectrum, prioritizing speed, flexibility, and computational efficiency. Its strong performance at the initial most granular level demonstrates the power of LLMs for direct classification or naming tasks when provided with clear, specific context via prompts. However, the results indicate that relying solely on iterative prompting for multi-level abstraction introduces significant challenges in maintaining semantic consistency and avoiding error propagation. The observed drop in performance at higher levels suggests that the current prompting strategy struggled to guide the LLM effectively through the necessary conceptual aggregation. The inherent variability in LLM generation, even at low temperatures, further complicates the use of prompting for tasks requiring deterministic and stable outputs integrated into larger systems.
Situating these findings within the rapidly evolving literature provides further context. While direct numerical comparison to other studies is challenging due to disparate datasets and evaluation metrics, we can compare methodological trends. Our finding that fine-tuning yields more consistent and accurate hierarchical structures than simple iterative prompting aligns with the motivations behind more advanced prompting frameworks developed recently. For instance, the chain-of-layer method proposed by Zeng et al. [33] aims to improve taxonomy induction by explicitly managing the generation process layer by layer, an approach designed to overcome the very issues of semantic drift and error propagation we observed in our prompt-engineering approach. This indicates broader recognition of the challenges involved in maintaining hierarchical coherence with simple prompting. These advanced methods, while promising, often introduce additional layers of complexity. Therefore, our study’s comparison of a standard PEFT fine-tuning approach against an efficient iterative prompting strategy serves as a crucial baseline. It provides a pragmatic benchmark that quantifies the performance trade-offs of two accessible and widely used paradigms, complementing other research that focuses on pushing the performance ceiling of a single more complex methodology.
It is crucial to re-emphasize the foundational role of the initial keyword extraction and clustering steps for both methodologies. The quality of the concepts and groupings identified in these early phases fundamentally dictates the potential quality of the final taxonomy. Errors or ambiguities introduced here will inevitably hinder the LLM’s ability to generate an accurate hierarchy, whether through fine-tuning or prompting. A pertinent consideration is the scalability of these approaches to significantly larger and potentially noisier datasets than the approximately 4000 product titles used in this study. For the fine-tuning approach, while more data could enhance model performance and generalization, the upstream pipeline for creating structured fine-tuning examples (keyword extraction, layered clustering, and LLM-based naming) would need to scale efficiently. Noisier input data would pose a considerable challenge, potentially degrading the quality of this preparatory data and requiring more robust preprocessing or noise-tolerant fine-tuning strategies. The QLoRA technique helps to manage the resource demands of fine-tuning itself.
For the prompt-engineering approach, inference costs would scale with dataset size. While this is generally more manageable per instance than fine-tuning, processing millions of items would still be substantial. This approach might also be sensitive to noisier data as the quality of prompts (often derived from keywords or product details) directly impacts output quality, and errors could propagate through iterative hierarchical generation. Robust keyword extraction and prompts designed to handle ambiguity would become even more critical. Therefore, while this study demonstrates comparative efficacy on a moderately sized dataset, future work focusing on industrial-scale deployment would need to rigorously evaluate and optimize the entire pipeline, from data ingestion and cleaning to hierarchical inference, for both throughput and robustness to real-world data imperfections. The choice between fine-tuning and prompt engineering ultimately depends on a pragmatic assessment of project requirements: prioritizing accuracy, stability, and long-term maintainability favors fine-tuning, while prioritizing speed, adaptability, and resource conservation favors prompt engineering, potentially accepting lower fidelity or requiring more sophisticated prompt design and validation mechanisms.
Furthermore, the current study focused on assigning a single hierarchical path to each product. Real-world applications, particularly in e-commerce, often require products to be classified under multiple relevant categories. This is a limitation of the current work’s scope. Both the fine-tuning and prompt-engineering approaches could potentially be extended to support multi-category assignments. For instance, the fine-tuning approach could be trained on multi-label data if available, learning to predict multiple category paths. The prompt-engineering approach could involve explicitly instructing the LLM to list all the applicable categories or iteratively querying for additional relevant classifications. Implementing and evaluating such multi-label capabilities would require appropriately structured ground-truth data and multi-label evaluation metrics, representing an important direction for future development to enhance practical utility.
A notable limitation of this study is the absence of a direct comparison with traditional machine learning classifiers, e.g., SVMs, or classic deep learning classifiers, e.g., CNNs and RNNs. This omission is primarily due to a fundamental difference in task formulation. The LLM-based approaches evaluated here perform taxonomy construction, a generative task in which category labels and structure are created dynamically from the data. In contrast, traditional classifiers are discriminative: they excel at assigning items to a pre-defined, fixed set of categories but cannot create the taxonomy itself. A direct comparison would require reframing the problem as pure classification into a fixed label set, which would not evaluate the core generative capabilities of the LLMs. Nonetheless, future work could explore hybrid benchmarks (a minimal sketch follows this paragraph). For example, one could use the taxonomy generated by our fine-tuned LLM as the ground truth for training an SVM or a BERT-based classifier. Comparing the performance of this classifier against the LLM’s original assignments would provide valuable insights into the coherence and utility of the generated structure. Such a study, while beyond our current scope, would help to quantify the advantages of dynamic taxonomy construction versus classification into a static hierarchy.
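As a rough illustration of such a hybrid benchmark, the sketch below trains a lightweight discriminative classifier on labels produced by the generative pipeline and checks how reliably it reproduces them. The file name and column names are hypothetical placeholders, and a TF–IDF/SVM pipeline is only one of several classifiers that could serve this purpose.

```python
# Hypothetical hybrid benchmark: train a discriminative classifier on the
# LLM-generated taxonomy and measure how reliably it reproduces the LLM's
# own assignments. File name and column names are illustrative placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("generated_taxonomy_assignments.csv")  # columns: title, category_iii
X_train, X_test, y_train, y_test = train_test_split(
    df["title"], df["category_iii"], test_size=0.2, random_state=42
)

# TF-IDF features over product titles, linear SVM over the generated labels.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), LinearSVC())
clf.fit(X_train, y_train)

# High held-out agreement would indicate that the generated hierarchy is
# internally coherent and learnable; systematic confusions would flag
# ambiguous or overlapping categories.
print(classification_report(y_test, clf.predict(X_test)))
```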

6. Conclusions and Future Work

This paper presented a comparative study of two distinct LLM-based approaches for automating the construction of hierarchical taxonomies, using e-commerce product descriptions as a case study. We investigated a fine-tuning methodology leveraging Meta-Llama-3 8B adapted with QLoRA on data structured by layered clustering, and a prompt-engineering methodology using Microsoft Phi-3.5 Instruct guided by clustered keywords. Our quantitative evaluation, employing BERTScore and Cosine Similarity against a ground-truth eBay taxonomy, provides clear insights into the capabilities and limitations of each approach.
The primary finding is that fine-tuning the LLM yields a taxonomy with significantly higher overall semantic accuracy and consistency (70.91% BERTScore F1; 66.40% Cosine Similarity) compared to the prompt-engineering approach (61.66% BERTScore F1; 60.34% Cosine Similarity). The fine-tuned model demonstrated better coherence, particularly at broader category levels, indicating its ability to learn and effectively internalize domain-specific hierarchical structures. This stability makes fine-tuning more suitable for applications demanding high reliability and integration into persistent knowledge systems. However, this performance comes at the cost of substantial computational resources and time investment for training.
Conversely, prompt engineering offers a computationally lightweight and significantly faster alternative for taxonomy generation. It excelled at creating specific granular categories (Level I) where prompts could directly leverage focused keyword sets. Its performance, however, diminished at higher levels of abstraction, and the generated outputs exhibited variability, posing challenges for reproducibility and long-term scalability. Prompt engineering remains valuable for rapid prototyping, dynamic environments, or situations with constrained computational budgets. This research underscores the critical dependence of both LLM approaches on the quality of upstream data processing, specifically keyword extraction and clustering. The accuracy and relevance of the initial concepts fed to the LLM fundamentally constrain the quality of the final taxonomy.
Future work should pursue several avenues to advance LLM-based taxonomy automation. Enhanced preprocessing, utilizing more sophisticated NLP techniques for keyword/keyphrase extraction and exploring advanced clustering algorithms, is crucial for building a more robust foundation. A critical area for future work is reducing manual intervention. The current pipeline involves some manual or semi-automatic steps for keyword cleaning and category name refinement. To enhance overall scalability and fully automate the system, these steps need further development. For instance, future iterations could employ advanced filtering algorithms for keywords. Another approach is to train dedicated models specifically for cleaning tasks. Alternatively, LLM prompts could be further refined to ensure that they produce cleaner and more structured outputs directly. These improvements would significantly reduce the need for human oversight. Investigating methods for reliably generating deeper, more granular hierarchies, possibly incorporating structural constraints or external ontological knowledge, as well as robustly handling multi-category assignments for products, is another key direction. Exploring hybrid LLM strategies, such as few-shot fine-tuning or prompt-tuning, might offer a balance between stability and flexibility. Refining prompt-engineering techniques, perhaps using chain-of-thought or self-correction mechanisms, could improve consistency. Developing more comprehensive evaluation frameworks, including structural and task-specific metrics, is necessary. Finally, addressing the challenge of dynamically updating taxonomies in response to evolving data without complete regeneration, and thoroughly evaluating the scalability and robustness of these methods on web-scale noisy datasets, are crucial for real-world applicability. Addressing these areas will contribute to building more powerful, scalable, and reliable automated taxonomy construction systems.

Author Contributions

Conceptualization, R.G.N. and B.V.; methodology, R.G.N.; investigation, R.G.N.; resources, S.M.; data curation, R.G.N.; writing—original draft, B.V.; writing—review and editing, B.K.N. and M.H.; validation, M.H.; supervision, S.M.; formal analysis, S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available via eBay’s Marketplace RESTful API at https://developer.ebay.com/api-docs/static/ebay-rest-landing.html (accessed on 20 December 2024), reference number [53].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
API: Application Programming Interface
BERT: Bidirectional Encoder Representations from Transformers
CoL: Chain-of-Layer
GPU: Graphics Processing Unit
GPT: Generative Pre-Trained Transformer
HAC: Hierarchical Agglomerative Clustering
IDF: Inverse Document Frequency
JSON: JavaScript Object Notation
LLM: Large Language Model
LoRA: Low-Rank Adaptation
ML: Machine Learning
NLP: Natural Language Processing
PEFT: Parameter-Efficient Fine-Tuning
QLoRA: Quantized Low-Rank Adaptation
TF–IDF: Term Frequency–Inverse Document Frequency

Appendix A. Prompts

This appendix provides the prompts used for LLM-based category naming during the fine-tuning data preparation (Approach 1) and for the iterative hierarchical category generation (Approach 2).

Appendix A.1. Prompt for LLM-Based Category Naming

System: You are a helpful assistant specializing in e-commerce product categorization.
User: I have a cluster of products represented by the following keywords: keywords.
Based on these keywords, please generate a concise, professional, and relevant category name for an e-commerce platform like eBay. The category name should ideally consist of 2–3 words and accurately reflect the core theme of these keywords. Output only the category name.

Appendix A.2. Prompt for Level 1 (Category I) Generation

System: You are an expert in e-commerce product classification for eBay.
User: Consider the following set of representative keywords extracted from a group of similar product descriptions: keywords.
Your task is to generate a concise (ideally 2 words, maximum 3 words) and professional-sounding Category I name that best describes products characterized by these keywords. This name will be used as the most specific level in a 3-level product taxonomy. Ensure the name is relevant for product classification and avoids ambiguity. Output only the Category I name.

Appendix A.3. Prompt for Level 2 (Category II) Generation

System: You are an expert in e-commerce product classification for eBay, focused on creating hierarchical taxonomies.
User: I have a group of products that fall under the following Category I names: names.
Your task is to abstract a broader, common theme from these Category I names and generate a suitable 2–3 word Category II name. This Category II name should encompass the characteristics of the input Category I names and represent the next higher level in our product taxonomy. Output only the Category II name.

Appendix A.4. Prompt for Level 3 (Category III) Generation

System: You are an expert in e-commerce product classification for eBay, focused on creating hierarchical taxonomies.
User: I have a collection of product categories represented by the following Category II names: names.
Your task is to generate the most general, top-level Category III name that is appropriate for this collection of Category II names. This Category III name should be concise (2–3 words) and represent the highest level in our 3-level product taxonomy. Output only the Category III name.
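To make the iterative use of these prompts concrete, the sketch below chains the Category I and Category II prompts with the Hugging Face text-generation pipeline. It is a simplified illustration: the example keyword cluster, the decoding settings, and the handling of the pipeline's return format (which varies slightly across transformers versions) are assumptions rather than the exact settings used in the experiments.

```python
# Simplified sketch of Approach 2: chaining the Appendix A prompts to build a
# hierarchy bottom-up with microsoft/Phi-3.5-mini-instruct.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3.5-mini-instruct",
    device_map="auto",
)

def ask(system_prompt: str, user_prompt: str) -> str:
    """Send one system/user exchange and return the model's short answer."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    out = generator(messages, max_new_tokens=16, do_sample=False)
    # Recent transformers versions return the full chat; the last message is the reply.
    return out[0]["generated_text"][-1]["content"].strip()

# Level 1 (Appendix A.2): name one keyword cluster. Example keywords only.
cluster_keywords = "yoga pants, leggings, fitness wear, athletic tights"
category_i = ask(
    "You are an expert in e-commerce product classification for eBay.",
    f"Consider the following set of representative keywords extracted from a group of "
    f"similar product descriptions: {cluster_keywords}. Generate a concise (2-3 word) "
    "Category I name. Output only the Category I name.",
)

# Level 2 (Appendix A.3): abstract a broader theme from several Category I names.
category_ii = ask(
    "You are an expert in e-commerce product classification for eBay, focused on "
    "creating hierarchical taxonomies.",
    f"I have a group of products that fall under the following Category I names: "
    f"{category_i}, Running Shoes, Sports Bras. Generate a suitable 2-3 word "
    "Category II name. Output only the Category II name.",
)
print(category_i, "->", category_ii)
```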

References

  1. Tahseen, Q. Taxonomy-The Crucial yet Misunderstood and Disregarded Tool for Studying Biodiversity. J. Biodivers. Bioprospecting Dev. 2014, 1, 3. [Google Scholar] [CrossRef]
  2. Ross, N.J. “What’s That Called?” Folk Taxonomy and Connecting Students to the Human-Nature Interface. In Innovative Strategies for Teaching in the Plant Sciences; Quave, C.L., Ed.; Springer: New York, NY, USA, 2014; pp. 121–134. [Google Scholar] [CrossRef]
  3. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Barnes, N.; Mian, A.S. A Comprehensive Overview of Large Language Models. arXiv 2023, arXiv:2307.06435. [Google Scholar] [CrossRef]
  4. Mars, M. From Word Embeddings to Pre-Trained Language Models: A State-of-the-Art Walkthrough. Appl. Sci. 2022, 12, 8805. [Google Scholar] [CrossRef]
  5. Vu, B.; deVelasco, M.; Mc Kevitt, P.; Bond, R.; Turkington, R.; Booth, F.; Mulvenna, M.; Fuchs, M.; Hemmje, M. A Content and Knowledge Management System Supporting Emotion Detection from Speech. In Conversational Dialogue Systems for the Next Decade; Springer: Singapore, 2021; pp. 369–378. [Google Scholar]
  6. Wang, X. The application of NLP in information retrieval. Appl. Comput. Eng. 2024, 42, 290–297. [Google Scholar] [CrossRef]
  7. Vu, B.; Mertens, J.; Gaisbachgrabner, K.; Fuchs, M.; Hemmje, M. Supporting taxonomy management and evolution in a web-based knowledge management system. In Proceedings of the 32nd International BCS Human Computer Interaction Conference, Belfast, UK, 4–6 July 2018. BCS Learning & Development. [Google Scholar]
  8. Sujatha, R.; Rao, B.R.K. Taxonomy construction techniques-issues and challenges. Int. J. Comput. Sci. Inf. Technol. 2016, 7, 706–711. [Google Scholar]
  9. Le, T.T.; Cao, T.; Xuan, X.; Pham, T.D.; Luu, T. An Automatic Method for Building a Taxonomy of Areas of Expertise. In Proceedings of the 15th International Conference on Agents and Artificial Intelligence, ICAART 2023, Lisbon, Portugal, 22–24 February 2023; Rocha, A.P., Steels, L., van den Herik, H.J., Eds.; SCITEPRESS: Setúbal, Portugal, 2023; Volume 3, pp. 169–176. [Google Scholar] [CrossRef]
  10. Punera, K.; Rajan, S.; Ghosh, J. Automatic Construction of N-ary Tree Based Taxonomies. In Proceedings of the Sixth IEEE International Conference on Data Mining-Workshops (ICDMW’06), Hong Kong, China, 18–22 December 2006; pp. 75–79. [Google Scholar]
  11. Chen, B.; Yi, F.; Varró, D. Prompting or Fine-Tuning? A Comparative Study of Large Language Models for Taxonomy Construction. In Proceedings of the 2023 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C), Västerås, Sweden, 1–6 October 2023; pp. 588–596. [Google Scholar]
  12. Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P. From data mining to knowledge discovery in databases. AI Mag. 1996, 17, 37–54. [Google Scholar]
  13. Meta. Meta Llama 3 8B Instruct. 2024. Available online: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct (accessed on 5 January 2025).
  14. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  15. Microsoft. Phi-3.5 Mini Instruct. 2024. Available online: https://huggingface.co/microsoft/Phi-3.5-mini-instruct (accessed on 5 January 2025).
  16. Abdin, M.; Aneja, J.; Awadalla, H.; Awadallah, A.H.; Awan, A.A.; Bach, N.; Bahree, A.; Bakhtiari, A.; Bao, J.; Behl, H.; et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv 2024, arXiv:2404.14219. [Google Scholar] [CrossRef]
  17. Hearst, M.A. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, Nantes, France, 23–28 August 1992; Volume 2, pp. 539–545. [Google Scholar]
  18. Murthy, K.; Faruquie, T.A.; Subramaniam, L.V.; Prasad, K.H.; Mohania, M. Automatically generating term-frequency-induced taxonomies. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010; ACL 2010 Conference Short Papers. pp. 126–131. [Google Scholar]
  19. Qaiser, S.; Ali, R. Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents. Int. J. Comput. Appl. 2018, 181, 25–29. [Google Scholar] [CrossRef]
  20. Rose, S.; Engel, D.; Cramer, N.; Cowley, W. Automatic Keyword Extraction from Individual Documents. In Text Mining: Applications and Theory; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2010; pp. 1–20. [Google Scholar]
  21. Song, Y.; Liu, S.; Liu, X.; Wang, H. Automatic Taxonomy Construction from Keywords via Scalable Bayesian Rose Trees. IEEE Trans. Knowl. Data Eng. 2015, 27, 1861–1874. [Google Scholar] [CrossRef]
  22. Yin, H.; Aryani, A.; Petrie, S.; Nambissan, A.; Astudillo, A.; Cao, S. A Rapid Review of Clustering Algorithms. arXiv 2024, arXiv:2401.07389. [Google Scholar] [CrossRef]
  23. Velardi, P.; Faralli, S.; Navigli, R. OntoLearn Reloaded: A graph-based algorithm for taxonomy induction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, 4–9 August 2013; pp. 623–633. [Google Scholar]
  24. Devdai1y. A Rapid Review of Clustering Algorithms. 2025. Available online: https://velog.io/@devdai1y/A-Rapid-Review-of-Clustering-Algorithms (accessed on 24 September 2025).
  25. Zahera, H.M.; Sherif, M.A. ProBERT: Product Data Classification with Fine-Tuning BERT Model. In Proceedings of the Mining the Web of HTML-Embedded Product Data Workshop, Athens, Greece, 2–6 November 2020. [Google Scholar]
  26. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv 2019, arXiv:1910.03771. [Google Scholar]
  27. Brinkmann, A.; Bizer, C. Improving Hierarchical Product Classification Using Domain-Specific Language Modelling. IEEE Data Eng. Bull. 2021, 44, 14–25. [Google Scholar]
  28. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates, Inc.: New York, NY, USA, 2022; pp. 24824–24837. [Google Scholar]
  29. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. A Survey on Prompt Engineering for Large Language Models: Progress, Methods, and Challenges. arXiv 2023, arXiv:2302.11382. [Google Scholar]
  30. Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A.A.M.; Abid, A.; Fisch, A.; Brown, A.R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv 2022, arXiv:2206.04615. [Google Scholar] [CrossRef]
  31. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020; Curran Associates, Inc.: New York, NY, USA, 2020; pp. 1877–1901. [Google Scholar]
  32. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 2023, 55, 195. [Google Scholar] [CrossRef]
  33. Zeng, Q.; Bai, Y.; Tan, Z.; Feng, S.; Liang, Z.; Zhang, Z.; Jiang, M. Chain-of-Layer: Iteratively Prompting Large Language Models for Taxonomy Induction from Limited Examples. arXiv 2024, arXiv:2402.07386. [Google Scholar]
  34. Zhao, T.Z.; Wallace, E.; Feng, S.; Klein, D.; Singh, S. Calibrate Before Use: Improving Few-Shot Performance of Language Models. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual, 18–24 July 2021; Proceedings of Machine Learning Research; Meila, M., Zhang, T., Eds.; PMLR: New York, NY, USA, 2021; Volume 139, pp. 12697–12706. [Google Scholar]
  35. Wiegreffe, S.; Hessel, J.; Swayamdipta, S.; Riedl, M.; Choi, Y. Reframing Human-AI Collaboration for Generating Free-Text Explanations. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA + Online, 10–15 July 2022. [Google Scholar]
  36. Touretzky, D.S. Word Embedding Demo: Tutorial. 2022. Available online: https://www.cs.cmu.edu/~dst/WordEmbeddingDemo/tutorial.html (accessed on 24 September 2025).
  37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; pp. 5998–6008. [Google Scholar]
  38. Howard, J.; Ruder, S. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 328–339. [Google Scholar]
  39. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
  40. Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 8342–8360. [Google Scholar]
  41. Le Scao, T.; Rush, A.M. How Many Data Points is a Prompt Worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Virtual, 6–11 June 2021; pp. 2607–2614. [Google Scholar]
  42. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
  43. McCloskey, M.; Cohen, N.J. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In The Psychology of Learning and Motivation; Psychology of Learning and Motivation; Bower, G.H., Ed.; Academic Press: Cambridge, MA, USA, 1989; Volume 24, pp. 109–165. [Google Scholar]
  44. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  45. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. In Proceedings of the Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023; Beygelzimer, A., Hsu, D., Locatello, F., Schölkopf, B., Eds.; Curran Associates, Inc.: New York, NY, USA, 2023; pp. 36175–36204. [Google Scholar]
  46. Xu, L.; Xie, H.; Qin, S.Z.; Tao, X.; Wang, F.L. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv 2023, arXiv:2312.12148. [Google Scholar]
  47. Lialin, V.; Deshpande, V.; Rumshisky, A. Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. J. Mach. Learn. Res. 2023, 24, 1–51. [Google Scholar]
  48. Zhang, C.; Tao, F.; Chen, X.; Shen, J.; Jiang, M.; Sadler, B.; Vanni, M.; Han, J. Taxogen: Unsupervised topic taxonomy construction by adaptive term embedding and clustering. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 2701–2709. [Google Scholar]
  49. Bordea, G.; Lefever, E.; Buitelaar, P. SemEval-2016 task 13: Taxonomy extraction evaluation (texeval-2). In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, CA, USA, 16–17 June 2016; pp. 1081–1091. [Google Scholar]
  50. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  51. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar] [CrossRef]
  52. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 39. [Google Scholar] [CrossRef]
  53. eBay Inc. eBay RESTful APIs. 2024. Available online: https://developer.ebay.com/api-docs/static/ebay-rest-landing.html (accessed on 20 December 2024).
  54. Grootendorst, M. KeyBERT: Minimal Keyword Extraction with BERT. 2020. Available online: https://doi.org/10.5281/zenodo.4461265 (accessed on 10 May 2025).
  55. Hugging Face. Bitsandbytes Documentation. 2024. Available online: https://huggingface.co/docs/bitsandbytes/ (accessed on 25 February 2025).
  56. Hu, S.; Shen, L.; Zhang, Y.; Chen, Y.; Tao, D. On Transforming Reinforcement Learning with Transformers: The Development Trajectory. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8580–8599. [Google Scholar] [CrossRef] [PubMed]
  57. Jeong, C. Domain-specialized LLM: Financial fine-tuning and utilization method using Mistral 7B. J. Intell. Inf. Syst. 2024, 30, 93–120. [Google Scholar] [CrossRef]
  58. Shahane, S. E-Commerce Text Classification. 2022. Available online: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification (accessed on 5 June 2025).
Figure 1. Schematic diagram illustrating Hierarchical Agglomerative Clustering, showing the dendrogram (left) and corresponding data point grouping (right). Such methods are inherently suited for taxonomy creation [24].
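For readers who want to reproduce the behaviour sketched in Figure 1, the snippet below runs Hierarchical Agglomerative Clustering over sentence embeddings of a few example keywords with SciPy. The embedding model and the distance threshold are illustrative choices, not the exact configuration used in the study.

```python
# Illustrative HAC over keyword embeddings, as in Figure 1. Model name and
# threshold are assumptions for demonstration purposes only.
from scipy.cluster.hierarchy import fcluster, linkage
from sentence_transformers import SentenceTransformer

keywords = ["yoga pants", "running shoes", "laptop cooler", "gaming mouse", "antique map"]
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(keywords)

# Ward linkage merges the closest clusters bottom-up, producing the dendrogram.
Z = linkage(embeddings, method="ward")

# Cutting the dendrogram at a distance threshold yields flat cluster labels;
# different thresholds correspond to coarser or finer taxonomy layers.
labels = fcluster(Z, t=1.0, criterion="distance")
print(dict(zip(keywords, labels)))
```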
Figure 4. Sample data illustrating the outcome of the keyword extraction process, showing original product descriptions alongside the concise, cleaned keywords (‘root node’) extracted using KeyBERT.
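A minimal sketch of the keyword-extraction step illustrated in Figure 4 is given below, using the KeyBERT library [54]. The n-gram range and the number of keyphrases are illustrative parameters, not necessarily those used to produce the figure.

```python
# Keyword extraction with KeyBERT, as illustrated in Figure 4.
from keybert import KeyBERT

description = (
    "TopMate C5 12-15.6 inch Gaming Laptop Cooler, five quiet fans, "
    "adjustable stand height, blue LED lights, USB powered"
)

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
    description,
    keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
    stop_words="english",
    top_n=3,
)
print(keywords)  # list of (keyphrase, relevance score) tuples
```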
Figure 5. Conceptual workflow for the fine-tuning approach, involving data preparation, layered clustering, LLM naming, fine-tuning data creation, and model training.
Figure 6. Conceptual workflow for the prompt-engineering approach, involving data preparation, initial clustering, iterative LLM prompting for hierarchy generation, and post-processing.
Figure 7. Conceptual diagram illustrating the generation of a Category II name. A list of similar, more specific Category I names is provided as context to the LLM, which is prompted to abstract a common theme and generate a single broader category label.
Figure 8. Example snippet of the taxonomy table generated by the prompt-engineering approach, showing category levels.
Table 1. Key parameters specific to the fine-tuning phase.
| Parameter Category | Parameter | Value |
| Base Model | LLM | Meta-Llama-3 8B Instruct |
| Fine-Tuning | Technique | QLoRA (PEFT) |
| Fine-Tuning | Quantization | 4-bit (NF4 via bitsandbytes) |
| Fine-Tuning | LoRA Rank (r) | 32 |
| Fine-Tuning | LoRA Alpha | 16 |
| Fine-Tuning | LoRA Dropout | 0.05 |
| Fine-Tuning | LoRA Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Fine-Tuning | Library | TRL (SFTTrainer) |
| Training | Optimizer | Paged AdamW 8-bit |
| Training | Learning Rate | 1 × 10⁻⁴ |
| Training | Epochs | 2 |
| Training | Max Sequence Length | 512 tokens |
| Training | Batch Size (Train/Eval) | 2/4 |
| Training | Gradient Accumulation Steps | 4 |
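To make the configuration in Table 1 concrete, the sketch below assembles a matching QLoRA setup with transformers, peft, and TRL. The training file name is a placeholder, and some argument names (for example, where max_seq_length is passed) differ between TRL releases, so the snippet should be read as an outline of the configuration rather than the exact training script used in the study.

```python
# QLoRA fine-tuning outline matching Table 1 (Meta-Llama-3 8B Instruct, 4-bit NF4,
# LoRA r=32 / alpha=16 / dropout=0.05, Paged AdamW 8-bit, lr 1e-4, 2 epochs).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit quantization
    bnb_4bit_quant_type="nf4",               # NF4 via bitsandbytes
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Structured fine-tuning examples derived from the layered clustering step
# (placeholder file name).
train_dataset = load_dataset("json", data_files="taxonomy_sft_examples.jsonl", split="train")

args = TrainingArguments(
    output_dir="llama3-taxonomy-qlora",
    num_train_epochs=2,
    learning_rate=1e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=train_dataset,
    peft_config=lora_config,
    max_seq_length=512,
)
trainer.train()
```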
Table 2. Comparative summary of performance, resource usage, and speed. Experiments were conducted on Google Colab Pro with an NVIDIA A100 GPU.
| Category | Metric | Fine-Tuning | Prompt Engineering |
| Accuracy | Overall BERTScore F1 (%) | 70.91 | 61.66 |
| Accuracy | Overall Cosine Sim. (%) | 66.40 | 60.34 |
| Resource Usage | Peak System RAM | 9.2 GB | 5.7 GB |
| Resource Usage | Peak GPU RAM | 13.5 GB | 7.7 GB |
| Time and Speed | Model Training Time | 31 min 17 s | Not Applicable |
| Time and Speed | Avg. Inference Speed (Tokens/s) | 47.37 | 43.48 |
Table 3. Evaluation scores for fine-tuned LLaMA-3 8B model.
| Metric | Category I | Category II | Category III | Overall Average |
| BERTScore Precision (%) | 66.25 | 71.72 | 76.10 | 71.36 |
| BERTScore Recall (%) | 70.27 | 67.68 | 74.29 | 70.74 |
| BERTScore F1 (%) | 68.14 | 69.60 | 75.00 | 70.91 |
| Cosine Sim. (%) | 65.93 | 65.37 | 67.90 | 66.40 |
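The snippet below sketches how the similarity scores reported in Tables 2, 3 and 6 can be computed with the bert-score and sentence-transformers libraries [50,51]. The example labels and the sentence-embedding model are illustrative; the study's exact model choices may differ.

```python
# Semantic-similarity evaluation sketch: BERTScore between generated and
# ground-truth category names, plus cosine similarity of sentence embeddings.
from bert_score import score
from sentence_transformers import SentenceTransformer, util

generated = ["Clothing, Accessories, and Fashion", "Sports and Outdoors", "Antique Maps"]
reference = ["Clothing, Shoes, and Accessories", "Sporting Goods", "Antique Maps"]

# BERTScore precision/recall/F1 (token-level soft matching with BERT embeddings).
P, R, F1 = score(generated, reference, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")

# Cosine similarity between whole-label sentence embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
gen_emb = encoder.encode(generated, convert_to_tensor=True)
ref_emb = encoder.encode(reference, convert_to_tensor=True)
cos = util.cos_sim(gen_emb, ref_emb).diagonal()
print(f"Mean cosine similarity: {cos.mean().item():.4f}")
```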
Table 4. Qualitative examples of fine-tuned model (LLaMA-3) outputs vs. ground truth.
| Result Type and Level | Category Comparison | Analysis |
Successes: Strong Semantic and Lexical Alignment
| Success (Level III) | Gen: Clothing, Accessories, and Fashion / Truth: Clothing, Shoes, and Accessories | Excellent match. The model correctly identified the top-level concept. |
| Success (Level II) | Gen: Sports and Outdoors / Truth: Sporting Goods | Strong semantic and lexical alignment for this mid-level category. |
| Success (Level I) | Gen: Antique Maps / Truth: Antique Maps | Perfect lexical match, indicating precise learning from specific data. |
Partial Successes: Semantically Correct, Lexically Different
| Partial (Level III) | Gen: Health, Wellness, and Personal Care / Truth: Health and Beauty | Semantically correct. Generated a valid, arguably more modern, term. |
| Partial (Level II) | Gen: Material Equipment and Building Supplies / Truth: Business and Industrial | Semantically related, but scope is different. The model created a more specific sub-category within the broader “Business and Industrial” domain. |
| Partial (Level I) | Gen: Health Monitoring and Tracking Devices / Truth: Wearable Health Devices | Correct concept but focuses on function (“tracking”) versus form (“wearable”). |
Mismatches and Failures
| Mismatch (Level III) | Gen: Energy Solutions and Eco-Friendly Products / Truth: Eco-Home | Generated category is too broad and conflates two distinct concepts. The ground truth is more focused. |
| Mismatch (Level II) | Gen: Decorative Hardware / Truth: Automotive | A clear mismatch. The model likely miscategorized a sub-group of products, leading to an incorrect mid-level category. |
| Mismatch (Level I) | Gen: AI-Powered Code Assistants / Truth: Developer Tools | Too specific. The model focused on a niche product type instead of the general category. |
Table 5. Example snippet of the taxonomy visualization generated by the fine-tuned LLaMA-3 model, illustrating hierarchical structure. Category III represents the highest-level grouping. Category II is an intermediate sub-category. Category I is the most granular classification used as a variable.
| Category III | Category II | Category I |
| Health, Wellness, and Personal Care | Health and Personal Care | Airbrushing and Compressors |
| | | Pain Relief Devices |
| | | First Aid and Training Tools |
| | | Skin Care Products |
| | | Personal Care |
| | | First Aid and Safety Equipment |
| | | Hair and Grooming |
| | | Health and Wellness |
| | | Airbrush Tanning and Accessories |
| | Fitness and Wellness | Fitness Trackers |
| | | Weight Management and Nutrition |
| | Safety Equipment | Emergency Safety Kits |
| | | Firefighting and Ignition Equipment |
| | | Health Monitoring and Diagnostics |
| | Medical Equipment | Detectors and Sensors |
| | | Health Monitoring and Tracking Devices |
| | | Therapeutic Devices |
| | | Pain Relief and Therapy Devices |
| | | Health Monitoring Sensors |
| | | Diagnostic Medical Kits |
| Technology Gadgets and Consumer Electronics | Consumer Electronics | Cameras and Accessories |
| | | Computers and Accessories |
| | | Mobile Phones |
| | | Safety Tech Accessories |
| | | Entertainment Electronics |
| | | Tech Gadgets and Accessories |
| | | Latest Tech Gadgets |
| | Entertainment | Virtual Reality Accessories |
| | Technology and Software | Software Development and Tech Tools |
| | | Information Technology and Software |
| | | Media Editing and Production |
| | Luxury Items | High-End Smartwatches |
Table 6. Evaluation scores for prompt-engineering (Phi-3.5 Instruct) model.
| Metric | Category I | Category II | Category III | Overall Average |
| BERTScore Precision (%) | 72.35 | 57.97 | 55.20 | 61.84 |
| BERTScore Recall (%) | 72.23 | 57.25 | 56.25 | 61.91 |
| BERTScore F1 (%) | 72.05 | 57.32 | 55.61 | 61.66 |
| Cosine Sim. (%) | 79.79 | 52.20 | 49.02 | 60.34 |
Table 7. Taxonomy structure generated for the Kaggle e-commerce dataset.
| Category III | Category II | Category I |
| Clothing, Accessories, and Fashion | Apparel | Yoga and Fitness Wear |
| | Clothing and Accessories | Rain and Waterproof Gear |
| | | Base Layers and Thermals |
| | | Women’s Clothing |
| | Beauty and Makeup | Makeup Products |
| | Luxury and Designer Items | Designer Sunglasses |
| Cultural, Educational, and Artistic Collections | Books and Literature | Literary Works |
| | | Inspirational Leaders |
| | | Antique Books |
| | Fine Art | Art Prints and Frames |
| | Instruments | Drums and Percussion |
| Health, Wellness, and Personal Care | Health and Personal Care | Hair and Grooming |
| | Safety Equipment | Emergency Safety Tools |
| | | Cleaning Supplies |
| Home Essentials, Furniture, and Decor | Home Improvement | Measurement Tools |
| | | Fans and Ventilation |
| | Kitchenware | Kitchen Tools and Gadgets |
| | | Tea Kettles and Makers |
| Industrial Equipment and Building Supplies | Construction Supplies | Roofing and Waterproofing Supplies |
| | Industrial Supplies | Process Control Equipment |
| Miscellaneous | Food and Beverages | Packaged Food and Beverages |
| | | Ethnic and Regional Foods |
| | Seasonal Accessories | Holiday Decorations |
| | Home Essentials | Home and Kitchen |
| | Miscellaneous | Miscellaneous |
| | | Labels and Labeling |
| | Equestrian Apparel and Accessories | Saddlery |
| | Furniture and Decor | Stools |
| Sports, Outdoor Activities, and Leisure | Sports and Outdoors | Outdoor Adventure Packs |
| | | Running and Athletic Gear |
| Technology Gadgets and Consumer Electronics | Consumer Electronics | Latest Electronics Gadgets |
| | | Gaming Accessories |
| | | Mobile Phones |
| Transportation, Travel Gear, and Accessories | HVAC and Refrigeration Equipment | Air and Refrigeration Systems and Accessories |
Table 8. Qualitative examples from task-based evaluation.
| Product Description (Summary) | Ground Truth | Generated Category III | Analysis |
| AREO Yoga Pant | Clothing and Accessories | Clothing, Accessories, and Fashion | Direct Match: The model correctly identifies the product’s primary domain. |
| Born to Run: A Hidden Tribe, Superathletes… | Books | Sports, Outdoor Activities, and Leisure | Semantic Match: Categorizes by the book’s content (running), not its media type (book), showing deep contextual understanding. |
| Durastrip SBS Bitumen Self Adhesive Bitumen Flashing Tape… | Household | Industrial Equipment and Building Supplies | Granular and Accurate: Provides a much more specific and useful category than the broad “Household” label. |
| TopMate C5 12-15.6 inch Gaming Laptop Cooler… | Electronics | Technology Gadgets and Consumer Electronics | Direct Match: Correctly identifies the electronics domain with a more descriptive label. |
Table 9. Quantitative alignment of generated vs. ground-truth categories.
| Ground-Truth Category (Kaggle) | Primary Generated Category III (Our Model) |
| Clothing and Accessories | Clothing, Accessories, and Fashion |
| Electronics | Technology Gadgets and Consumer Electronics |
| Household | Home Essentials, Furniture, and Decor |
| Books | Cultural, Educational, and Artistic Collections * |
* Or other thematic categories based on book content (e.g., sports).