Extending MISP Taxonomies for Drug-Related Forum Classification on the Dark Web: A Human-in-the-Loop and LLM-Based Approach

Medina-Merodio, José-Amelio; Ferrer-Oliva, Mikel; Ruiz-Zambrano, Alejandro; Fernández-López, José; De-Marcos, Luis

doi:10.3390/fi18050228


by José-Amelio Medina-Merodio *, Mikel Ferrer-Oliva, Alejandro Ruiz-Zambrano, José Fernández-López and Luis De-Marcos

Departamento de Ciencias de la Computación, Universidad de Alcalá, 28805 Madrid, Spain

* Author to whom correspondence should be addressed.

Future Internet 2026, 18(5), 228; https://doi.org/10.3390/fi18050228

Submission received: 10 March 2026 / Revised: 14 April 2026 / Accepted: 20 April 2026 / Published: 23 April 2026

(This article belongs to the Special Issue Artificial Intelligence: Innovation, Applications and Transformative Experiences—2nd Edition)


Abstract

This study proposes a methodological framework for extending Malware Information Sharing Platform (MISP) taxonomies in the domain of Dark Web drug forums through the integration of large language models (LLMs) and Human-in-the-Loop (HITL) validation. The research addresses the existing ontological gap between traditional MISP taxonomies, focused on technical or chemical indicators, and the linguistic and morphological complexity of illicit digital markets. By modelling the primary physical form as an ontological predicate with mutually exclusive values (for example, powder, pill–tablet–capsule, liquid, and plant-matter), the proposed approach captures the material dimension of the discourse, enhancing semantic disambiguation and forensic traceability. The Mistral 7B model was used in the morphology-classification stage conducted on a stratified analytical subset of 2904 drug-related Dark Web posts, extracted from a final corpus of 6456 posts after data cleaning and relevance filtering. In the first pass, 76.48% of posts were directly assigned to one of the base morphological categories, while 23.52% were labelled as unclear and subsequently reviewed through the HITL stage. Following HITL refinement and full reclassification, the proportion of posts labelled as unclear decreased from 23.52% to 11.29%, corresponding to a 51.99% relative reduction in ambiguity. Network visualisation with VOSviewer revealed three major discursive axes—recreational–commercial, pharmaceutical–opioid, and transnational–logistical—reflecting the hybrid semantic structure of digital drug markets. The results show that combining LLM-based inference with expert oversight improves the interpretability, reproducibility and ontological robustness of cyberintelligence models, offering a replicable framework for other sensitive domains such as terrorism or child exploitation.

Keywords:

Dark Web; MISP taxonomy; drug markets; LLM; HITL; classification; NLP

1. Introduction

The Dark Web has consolidated itself as a resilient digital ecosystem where anonymity, decentralisation and advanced cryptography converge, creating a conducive environment for the illicit trade of psychoactive substances and falsified pharmaceutical products [1,2]. Its architecture, based on the Tor protocol and onion routing, hinders the traceability of information flows and the identification of criminal actors, which makes it a strategic area of interest for cyber intelligence and law enforcement agencies [3,4].

Dark Web cryptomarkets and forums have driven a parallel transnational economy in which the sale of narcotics, such as fentanyl, grows at weekly rates of around 15 percent despite ongoing police disruption efforts [5,6]. In addition, the accessibility of these platforms has intensified public health risks, being associated with high overdose rates and the proliferation of falsified opioids [7,8]. In this context, the automated analysis of traffic, interactions and textual content on the Dark Web has become a priority for artificial intelligence research applied to cybersecurity, aimed at detecting behavioural patterns and anticipating emerging criminal trends [9,10].

Despite advances in the use of collaborative threat analysis platforms such as MISP (Malware Information Sharing Platform), the taxonomies currently available present significant structural limitations for the study of drug trafficking on the Dark Web. Most existing taxonomies focus on chemical or pharmacological aspects (for example, type of substance or molecular family) or on generic classifications of cyber incidents, without capturing the contextual diversity and linguistic characteristics that are specific to illicit digital markets [11,12].

This constraint prevents an adequate representation of the semantic complexity of posts, which typically combine chemical descriptions, commercial slang, consumption instructions and references to packaging. As a result, traditional taxonomies are insufficient for automated classification tasks, risk detection and behavioural analysis in dark environments. It is therefore necessary to extend and adapt MISP taxonomies towards a model capable of incorporating semantic, contextual and physical dimensions of the advertised product.

One of the most relevant challenges for the automated detection of drug trafficking on the Dark Web is the correct classification of substances according to their primary physical form, for example “powder”, “pill”, “oil” or “solid extract”, since this morphological dimension is a key indicator for health risk assessment, forensic traceability and inference of the mode of distribution [13,14]. From an ontological perspective, primary physical form is modelled as a MISP predicate with ontological grounding and mutually exclusive values, intended to describe the physical form in which the substance is found (for example, crystals, resin, tablet, liquid). It is important to clarify that this predicate should not be confused with the route of administration, substance names or quantities, which are non-morphological dimensions that, although analytically valuable, do not fulfil the structural function of disambiguating the physical representation of the object in discourse. In this sense, the use of the proposed predicate contributes to the disambiguation of existing slang and enables a more precise and coherent representation of substances within cyber intelligence ontological models. Taken together, this discussion seeks to link empirical results with contemporary theoretical frameworks in cyber intelligence and computational semantics, highlighting their relevance for the construction of more robust and replicable hybrid models.

Posts in cryptomarkets often employ ambiguous or coded terminology, which hampers automatic identification using traditional text processing techniques. Precise classification by physical form makes it possible not only to differentiate between consumable forms (such as edibles or capsules) and technical forms (such as solvents or resins), but also to detect patterns of adulteration or falsification associated with high-risk products such as synthetic opioids [7,15]. This taxonomic perspective adds analytical value by integrating a material and contextual dimension into the study of criminal discourse.

Taxonomies provide a shared language that organises operational knowledge and supports coordination between teams, as well as underpinning risk analysis and incident response in complex environments [16,17,18]. As the classification basis, we adopt MISP, which strengthens collaborative indicator sharing and can be reinforced with prioritisation proposals such as CARIOCA (Cybersecurity Actionable Risk-Informed Operational Capability Assessment) to improve traceability and effectiveness [19,20]. Interoperability is ensured through STIX 2.1 (Structured Threat Information Expression) and TAXII 2.1 (Trusted Automated eXchange of Intelligence Information), which standardise the representation and distribution of intelligence and enable the modelling of relationships between technical threats, targets and events for automated consumption [21,22].

Human-in-the-Loop (HITL) is integrated here as a natural extension of this interoperability framework, acting on the same taxonomic and sharing artefacts to resolve contextual ambiguity and ensure decision traceability. In high-risk domains, expert intervention guides system learning and corrects edge cases, which reduces bias and stabilises labels generated by NLP (Natural Language Processing) and cybersecurity models [23,24]. In the specific context of cryptomarkets, the HITL layer operates as a quality control and audit mechanism over the classification that is interoperable with MISP and STIX/TAXII, maintaining the internal coherence of the pipeline and recording iterative feedback for operational exploitation [23,24,25].

Despite the growing use of Machine Learning and Deep Learning models for the detection of illicit content, critical methodological gaps persist that limit reproducibility and comparability across studies [9,12] and in the specific case of mining drug forums [26,27]. First, there are no taxonomies specifically adapted to the domain of drug cryptomarkets, which forces researchers to rely on ad hoc categories or models trained on non-standardised datasets. Second, much of the current research is based on small or partial forum samples, which affects the generalisation of results [2]. Finally, the reliance on language models without documentation of the validation process reduces transparency and hinders replication [28].

These methodological limitations highlight the urgency of establishing open and replicable frameworks in which the generation and evaluation of taxonomic categories can be documented, audited and shared among researchers and agencies. This situation is even more acute in the case of drug forums, where there is no framework for extending MISP taxonomies using Large Language Models (LLMs) in a controlled and verifiable way.

Within this framework, the present research focuses on the analysis of drug cryptomarkets on the Dark Web, with the aim of proposing and adapting MISP (Malware Information Sharing Platform) taxonomies to a complex thematic domain in which technological, social and economic aspects converge around drugs. By applying deep learning models and semantic clustering techniques, and using LLMs in a controlled and verifiable manner, this study seeks to identify emerging categories and reduce ambiguity in the classification of content related to illicit substances. Ultimately, the purpose is to offer a scalable frame of reference that contributes to improving interoperability between cyber intelligence systems and to strengthening the capabilities of law enforcement agencies in the fight against digital drug trafficking.

The main contributions of this study are threefold. First, it proposes a reproducible framework for extending MISP taxonomies to the specific domain of drug-related discourse in Dark Web forums, thereby addressing the limitations of existing taxonomies that are mainly oriented towards chemical classification or generic cyber indicators. Second, it develops a hybrid LLM+HITL pipeline for the classification of primary physical forms, combining automated semantic inference with expert validation to improve traceability, consistency and ambiguity resolution. Third, it evaluates the proposed framework on a corpus of Dark Web forum posts, showing that the extension of the taxonomy improves semantic coverage and strengthens the interpretability of drug-related cyber intelligence analysis.

Accordingly, this work is structured around six research questions (RQs) that guide the methodological development and interpretation of the results:

RQ1. How can the MISP taxonomy be adapted to the domain of drugs in Dark Web forums?

RQ2. What impact does the integration of an LLM (Mistral 7B) have on the initial classification?

RQ3. How does the HITL component contribute to reducing ambiguity?

RQ4. Which new categories or mergers emerge from the extension process?

RQ5. How do class proportions vary after the final reclassification?

RQ6. Which semantic or thematic patterns are observed in the network representation?

This paper is organised into eight sections. Section 1 introduces the context of the study, highlighting the relevance of the Dark Web as a space for cyber-intelligence analysis and the need to extend MISP taxonomies to the drugs domain. Section 2 reviews the theoretical background and related work on Dark Web cryptomarkets, drug trafficking dynamics, trust mechanisms, and current taxonomic and methodological limitations. Section 3 presents the methodology, including the design of the hybrid LLM+HITL pipeline, dataset preparation, initial taxonomy construction, and ambiguity detection. Section 4 details the human review and taxonomic extension process, including cue extraction, threshold definition, and reclassification with the extended taxonomy. Section 5 reports the quantitative results of the classification and reclassification process. Section 6 presents the semantic and thematic analysis of the corpus through network visualisation. Section 7 discusses the main findings, together with their practical, organisational, and methodological implications. Finally, Section 8 sets out the conclusions, limitations of the study, and future lines of research.

2. Related Work

2.1. The Dark Web as a Criminal and Technological Ecosystem

The Dark Web is a socio-technical ecosystem sustained by anonymity, encryption and decentralisation, primarily through the Tor network and onion routing, which enables concealed communication and identity obfuscation [3]. While initially conceived to protect privacy and freedom of expression, this infrastructure has progressively facilitated the emergence of highly specialised illicit activities, positioning the Dark Web as a key area for cyber intelligence and digital forensics research [1,2].

Studies on the topology of the Dark Web reveal a highly fragmented and redundant system in which around 82% of domains correspond to replicas or mirrors, which complicates tracking and the systematic collection of information [11,12]. Within this volatile environment, cryptomarkets constitute the economic core, replicating e-commerce logics through reputation systems, escrow services and dispute resolution mechanisms that substitute institutional trust with digitally mediated credibility [29,30,31].

In financial terms, the clandestine economy is sustained by the use of cryptocurrencies such as Bitcoin and Monero, which enable anonymous transactions and money laundering through mixing services or cryptomixing [27]. These operations break down economic flows into multiple channels, which makes it practically impossible to trace funds and consolidates the autonomy of the illicit market [32,33].

Although recent advances in machine learning have enabled high-accuracy detection of illicit network traffic [9,10], most approaches remain focused on technical indicators and overlook the semantic and discursive dimensions through which criminal practices are articulated. This limitation has motivated hybrid approaches that integrate NLP techniques and semantic taxonomies to analyse forum content and user interactions [11].

2.2. Drug Trafficking Dynamics in Cryptomarkets

Drug trafficking represents one of the most profitable and resilient activities within Dark Web cryptomarkets, with sustained growth in synthetic opioids and falsified pharmaceuticals despite recurrent law enforcement interventions [5,6]. Vendors operate as microentrepreneurs who optimise reputation and customer trust through rating systems and multihoming strategies, reinforcing a decentralised and adaptive criminal economy [31,34,35].

The financial infrastructure of these markets relies on cryptocurrencies and mixing services, which hinder transaction traceability and reinforce market autonomy [27,33]. From a public health perspective, the proliferation of adulterated or counterfeit drugs poses significant risks, including increased overdose rates and the circulation of falsified opioids and prescription medicines [7,8,36].

Several studies emphasise the analytical relevance of classifying substances according to their primary physical form, as this dimension is closely linked to consumption practices, distribution logistics and health risks [14,15]. However, this morphological perspective remains largely absent from generic cybersecurity taxonomies, limiting their applicability to drug-related forum analysis [14].

Recent work has also shown that darknet drug markets are not uniform environments but differentiated socio-technical ecologies shaped by platform architecture, language, and geographic scope. Comparative evidence from Finnish-, Polish-, and English-language platforms suggests that trust-building, product presentation, and interaction patterns vary substantially across localised and transnational settings, which is directly relevant for any taxonomy intended to classify drug-related discourse across heterogeneous forums. Likewise, studies of AlphaBay cocaine listings indicate that product descriptions and declared purity operate as market signals, although their effects on perceived quality, sales, and revenues are non-linear and mediated by credibility. Together, these findings reinforce the need for classification frameworks that capture not only substance categories, but also the linguistic and contextual features through which drug products are described online [37,38].

2.3. Trust, Anonymity and Policing Limitations

The persistence of illicit trade on the Dark Web is underpinned by socio-technical trust mechanisms that replace formal regulation. Feedback systems, escrow services and community arbitration generate a form of clandestine self-regulation in which reputation substitutes legal identity [29,31]. The rise of single-vendor shops further individualises trust, transforming reputation into a strategic asset and intensifying competition within the ecosystem [30,35].

Law enforcement interventions, including marketplace takedowns, generally produce only temporary deterrent effects, as vendors and users rapidly migrate to new platforms [26,39]. International initiatives such as CReDO highlight the need for coordinated intelligence sharing: the initiative promotes information sharing between law enforcement institutions, health authorities and cybersecurity experts, facilitating the identification of transnational criminal networks. However, the effectiveness of these mechanisms remains constrained by jurisdictional fragmentation and by the lack of both common legal frameworks and interoperable analytical frameworks [40].

2.4. Automatic Detection Models, Taxonomies and Methodological Gaps

While machine learning and deep learning models have demonstrated strong performance in detecting illicit Dark Web activity, these studies generally focus on network-level anomalies, traffic classification, marketplace detection, or binary identification of illegal content rather than fine-grained semantic analysis of forum discourse [9,12]. Their evaluation is commonly reported through conventional performance metrics such as accuracy, precision, recall, and F1-score, and they are often trained on traffic captures, marketplace snapshots, or restricted forum corpora rather than on taxonomically annotated drug-forum datasets [9,10,12]. As a result, their findings are difficult to transfer to the problem of morphologically classifying drug-related posts according to primary physical form [11].

Similarly, although NLP and LLM-based approaches have opened new possibilities for analysing criminal language and user-generated illicit content, their effectiveness still depends on the availability of coherent taxonomic structures, reproducible annotation criteria, and expert validation procedures [11,23,24,25]. In the specific case of Dark Web drug forums, prior studies have not established a controlled framework that combines semantic classification, explicit taxonomy extension, and Human-in-the-Loop review within a MISP-compatible environment.

The recent literature highlights the value of hybrid Human-in-the-Loop approaches to mitigate these limitations by combining LLM-based inference with expert validation, thereby reducing ambiguity, stabilising labels and ensuring traceability [23,24,25]. However, controlled and verifiable frameworks for extending MISP taxonomies in the drug domain remain underdeveloped. Addressing this gap constitutes the primary motivation for the methodological proposal advanced in this study.

Taken together, prior studies provide important advances in Dark Web analysis, but they also reveal four persistent limitations. First, many approaches rely on network-level or binary detection tasks, rather than fine-grained semantic classification of forum discourse. Second, several studies are evaluated on partial or domain-specific datasets, which constrains comparability and external validity. Third, reported performance metrics are typically oriented to detection effectiveness (e.g., accuracy, F1-score, precision, recall), but do not address ontological adequacy or taxonomic interpretability. Fourth, existing taxonomies remain either chemically oriented or too generic for the linguistic and morphological diversity observed in drug-related cryptomarket forums. In this context, the present study differs from prior work by combining a domain-adapted MISP taxonomy, an LLM-based classification stage, and a Human-in-the-Loop validation process specifically designed to improve semantic coverage, ambiguity reduction, and reproducibility.

Beyond Dark Web detection studies, recent computational drug research has increasingly relied on heterogeneous graph learning and network-based inference. For example, multitype interaction models have been proposed to improve drug-target interaction prediction by exploiting knowledge diversity across drug–drug, drug–target, drug–enzyme, and related link types, while network-enhancement approaches have been developed to identify spurious drug–drug interactions and improve the reliability of drug-interaction graphs. These studies demonstrate the maturity of network-based computational drug analysis; however, they address biomedical interaction prediction rather than the semantic classification of drug-related discourse in illicit online markets. Our work differs in both data source and objective: instead of predicting pharmacological interactions, we extend a MISP-compatible taxonomy to classify how drug products are linguistically and morphologically represented in Dark Web forums [41,42].

To provide a structured comparison with the existing literature, Table 1 summarises the main characteristics, limitations, and methodological differences between prior studies and the present work.

3. Methodology

The preliminary analysis of the literature and of the taxonomies implemented in MISP reveals the absence of a formal framework for their extension using LLMs in a controlled, verifiable and reproducible manner. Existing experiences are based mainly on manual curation of categories or on undocumented ad hoc contributions, which generates semantic inconsistencies, conceptual overlaps and a lack of traceability in the results. This methodological gap is even more pronounced in non-traditional thematic domains, such as drug trafficking on the Dark Web, where the linguistic and contextual diversity of content exceeds the limits of conventional chemical taxonomies.

To overcome these limitations, a HITL methodological pipeline is proposed, designed specifically for verifiable taxonomic extension within MISP. This procedure combines automated processing using an LLM (in this case, Mistral 7B) with an expert human review phase that validates and adjusts the results according to criteria of coherence, semantic justification and ontological compatibility.

The pipeline consists of four main stages:

(i) Definition of the review subset (S): selection of records that present ambiguity or uncertain classification.
(ii) Extraction and normalisation of morphological cues: identification of linguistic patterns that indicate physical form or type of substance.
(iii) Support calculation by neutral families and deduplication: consolidation of equivalent terms through neutral semantic groupings.
(iv) Application of conservative thresholds and textual justification: acceptance of new categories or mergers only if they meet verifiable statistical and semantic criteria.
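
The four stages above can be sketched in a few lines of Python. This is only an illustrative outline: the cue patterns, the family names used as keys, the `MIN_SUPPORT` value and the function names are assumptions introduced here, not the authors' actual lexicon, thresholds or code.

```python
import re
from collections import Counter

# Hypothetical cue lexicon: surface pattern -> neutral family (illustrative only).
CUE_FAMILIES = {
    r"\b(oil|tincture|syrup|liquid)s?\b": "liquid",
    r"\b(bud|flower|weed|herb)s?\b": "plant-matter",
}

MIN_SUPPORT = 0.05  # assumed conservative threshold (share of posts in S)


def extract_families(text: str) -> set[str]:
    """Stage 2: normalise the text and return the set of families cued in it."""
    lowered = text.lower()
    return {fam for pat, fam in CUE_FAMILIES.items() if re.search(pat, lowered)}


def propose_extensions(review_subset: list[str], min_support: float = MIN_SUPPORT) -> dict:
    """Stages 3 and 4: per-family support over S, filtered by the threshold."""
    counts = Counter()
    for post in review_subset:
        # A set is used so that each family counts at most once per post.
        counts.update(extract_families(post))
    n = len(review_subset)
    return {
        fam: {"posts": c, "support": c / n}
        for fam, c in counts.items()
        if n and c / n >= min_support
    }


# Stage 1: S would be the posts labelled "unclear"; three toy posts here.
S = ["THC oil 10ml vials", "fresh buds, top shelf weed", "bulk order, ships from EU"]
patch = propose_extensions(S, min_support=0.3)
```

In such a sketch the returned dictionary plays the role of a candidate patch: each surviving family carries its own support figure, which a human reviewer could then accept, merge or reject with a textual justification.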

The output of this flow is a verifiable taxonomic patch, aligned with the original MISP structure and measurable before and after implementation. In this way, the reproducibility of the process and interoperability between different intelligence analysis and sharing systems are ensured.

The comparative overview presented in Table 2 summarises the main conceptual differences between the original MISP “drugs” taxonomy and the approach proposed in this work.

Rather than replacing chemistry-oriented taxonomies, the proposed approach introduces a complementary morphology-oriented layer that is better aligned with the linguistic and commercial structure of Dark Web forum discourse.

3.1. Dataset and Data Preparation

The dataset used comes from six compressed files in .onion.war.gz format, which contain the full pages of different Dark Web forums in WARC (Web ARChive) format. These files comprise a total of 11,101 posts extracted from forums that are representative of the ecosystem of drug and illicit substance trading.

Due to the sensitive nature of the source material, the study relied exclusively on passive analysis of previously archived textual content from Tor-based forums. No interaction with forum participants took place, no transactions or purchases were conducted, and no authentication barriers were bypassed for the purposes of this research. Because the corpus derives from illicit-platform environments, the manuscript reports only aggregated results and does not disclose forum identifiers, onion addresses, usernames, wallet addresses, or other potentially identifying information.

The initial extraction and structuring were performed using a Python 3.12 script, with regular expressions employed to capture the variables of interest, including:

Site_name: name of the forum or .onion domain.
Page_title: title of the post, which generally provides a brief description of the content.
Content: full content of the post, including user replies.
Authors_vendors: name of the author or vendor mentioned.
Prices: prices expressed in text or cryptocurrencies.
Cryptocurrencies: references to digital means of payment (BTC, XMR, LTC, etc.).
Emails: visible email addresses.
Telegram_handles: Telegram user identifiers.
Onion_links: internal references to other .onion sites.

Of all these columns, the variables Page_title and Content were the most relevant for the analysis, as they concentrate the main description of the content. Using the extracted data, a JSON file was built with all the aforementioned attributes, representing one record per post.
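
As a rough illustration of this extraction step, the sketch below builds one JSON-serialisable record per post with regular expressions. The patterns, the `build_record` helper and the sample text are assumptions made for this example; the authors' actual expressions are not given in the paper, and the Authors_vendors and Prices fields are omitted for brevity.

```python
import json
import re

# Illustrative patterns only (assumed, not the authors' expressions).
PATTERNS = {
    "Emails": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # "@" must not be preceded by a word character, so e-mail domains are skipped.
    "Telegram_handles": re.compile(r"(?<![\w.@])@[A-Za-z][A-Za-z0-9_]{4,31}\b"),
    # 56-character (v3) or 16-character (v2) base32 onion hostnames.
    "Onion_links": re.compile(r"\b[a-z2-7]{56}\.onion\b|\b[a-z2-7]{16}\.onion\b"),
    "Cryptocurrencies": re.compile(r"\b(?:BTC|XMR|LTC)\b"),
}


def build_record(site_name: str, page_title: str, content: str) -> dict:
    """Return one record per post, with regex-derived fields as sorted lists."""
    record = {"Site_name": site_name, "Page_title": page_title, "Content": content}
    for field, pattern in PATTERNS.items():
        # sorted(set(...)) removes repeated matches and keeps output deterministic.
        record[field] = sorted(set(pattern.findall(content)))
    return record


sample = (
    "Pay in BTC or XMR. Contact seller@example.com or @some_vendor. "
    "Mirror: " + "a" * 56 + ".onion"
)
record = build_record("forum-x", "Sample listing", sample)
serialised = json.dumps(record, ensure_ascii=False)
```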

Due to the linguistic diversity of the forums, which include content in English, German, Spanish and other minority languages, automatic translation was applied to all texts using the Python deep-translator module, based on the Google Translate API. This process generated an additional column named content_translated, which normalises the content into English.

Translation quality assessment. Language identification of the full corpus revealed that 9352 out of 9360 posts (99.91%) were written in English, with the remaining 8 posts distributed across other languages, including German, Spanish and Romanian, among others. Given this near-monolingual composition, large-scale automatic translation was not required and does not constitute a substantive processing stage in the pipeline. The 8 non-English posts were automatically translated and subsequently subjected to full manual review by the authors, who assessed whether each translation preserved (i) the meaning of the substance reference, (ii) the morphological cue, and (iii) the transactional context. Semantic adequacy was judged acceptable in all 8 cases, with no instances of nuance loss or alteration of physical-form interpretation identified. These findings confirm that translation introduces no meaningful source of error in the present corpus, and that the working dataset can be treated as effectively monolingual for the purposes of downstream classification.

Subsequently, the Mistral 7B model (base version, non-quantised and executed locally via Ollama) was used to extract a minimum of three representative keywords per post, enabling a preliminary understanding of the semantic content and preparing the ground for subsequent classification. This model was selected for its balance between computational efficiency and contextual depth.

The data cleaning and preprocessing phase, implemented in a script named drugs-base.py, removed 1741 duplicated posts and 2904 posts unrelated to drugs. As shown in Table 3, duplicate removal left 9360 unique posts. For relevance filtering, the script employs inference with the Mistral 7B model to distinguish between relevant posts (illicit drugs, narcotics, medicines of abuse, paraphernalia, distribution logistics) and other non-pertinent categories. After filtering, a final set of 6456 posts directly linked to drug-related content was obtained. To provide an initial visual overview of the lexical patterns identified in the filtered corpus, Figure 1 presents a bubble-based representation of the extracted keywords.
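
The paper does not state how duplicates were identified. One common approach, shown here purely as an assumption, is to hash a whitespace- and case-normalised copy of each post and keep the first occurrence; the helper names `content_key` and `drop_duplicates` are hypothetical.

```python
import hashlib
import re


def content_key(text: str) -> str:
    """Hash a normalised copy of the text (collapsed whitespace, lower case)."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def drop_duplicates(posts: list[dict]) -> list[dict]:
    """Keep the first post for each distinct normalised Content value."""
    seen: set[str] = set()
    unique = []
    for post in posts:
        key = content_key(post["Content"])
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


posts = [
    {"Content": "MDMA  pills, 200mg"},
    {"Content": "mdma pills, 200mg "},  # same text after normalisation
    {"Content": "Ketamine shards"},
]
unique_posts = drop_duplicates(posts)
```

Exact-match hashing of this kind would not catch near-duplicates such as mirrored pages with different footers, so it should be read as a lower bound on what a deduplication step might remove.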

Accordingly, the final working corpus used for classification consisted of 6456 drug-related posts, whereas 2904 unique posts were excluded as not relevant to the drugs domain (Table 4).

These results confirm the consistency of the cleaning process, allowing only records with analytical relevance for the study to be retained.

For clarity, the dataset construction followed three sequential stages: (i) 11,101 raw extracted posts; (ii) 9360 unique posts after duplicate removal; and (iii) 6456 posts retained as the final drug-related working corpus after excluding 2904 posts classified as not drug-related. However, the morphology-classification analyses reported in Section 3.2 onward were conducted on a stratified analytical subset of 2904 posts, and all percentages in the corresponding classification tables are calculated relative to that subset.
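
The three-stage funnel can be checked with simple arithmetic. The variable names below are chosen for this example; the figures are those reported in the text.

```python
raw_posts = 11_101           # posts extracted from the archives
duplicates_removed = 1_741   # duplicated posts
not_drug_related = 2_904     # unique posts excluded by relevance filtering

unique_posts = raw_posts - duplicates_removed
working_corpus = unique_posts - not_drug_related

# Both intermediate totals match the values reported in the paper.
assert unique_posts == 9_360
assert working_corpus == 6_456
```

Note that the 2904 excluded posts and the 2904-post stratified analytical subset are two different sets that happen to share the same size.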

3.2. Initial Taxonomy via LLM

Once the study corpus had been delimited, an initial ad hoc taxonomy was developed, named machinetag_packing.json, defining a single predicate:

form = primary physical form of the substance.

The initial categories considered were:

pill-tablet-capsule, powder and crystal-rock.
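
A taxonomy of this kind could be serialised along the lines of the sketch below, which follows the general layout of MISP machinetag files (namespace, predicates, values). The namespace `drug-forum`, the description strings, the version number and the use of the `exclusive` flag on the predicate are assumptions made for illustration; the actual content of machinetag_packing.json is not reproduced in the paper.

```python
import json

NAMESPACE = "drug-forum"  # hypothetical namespace, not stated in the paper
FORM_VALUES = ("pill-tablet-capsule", "powder", "crystal-rock")

taxonomy = {
    "namespace": NAMESPACE,
    "description": "Primary physical form of substances advertised in forum posts.",
    "version": 1,
    "predicates": [
        # "exclusive" marks the values of this predicate as mutually exclusive.
        {"value": "form", "expanded": "Primary physical form", "exclusive": True}
    ],
    "values": [
        {
            "predicate": "form",
            "entry": [{"value": v, "expanded": v.replace("-", " / ")} for v in FORM_VALUES],
        }
    ],
}


def machinetag(predicate: str, value: str) -> str:
    """Render a triple tag in the namespace:predicate="value" form."""
    return f'{NAMESPACE}:{predicate}="{value}"'


taxonomy_json = json.dumps(taxonomy, indent=2)
example_tag = machinetag("form", "powder")
```

Under these assumptions, a post classified as powder would carry the tag `drug-forum:form="powder"`, and extending the taxonomy would amount to appending new entries to the `form` value list.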

This taxonomy was applied to the dataset using the script drugs-initial.py, which used the Mistral 7B model to assign each post to one of the proposed values. The model was instructed through prompt engineering to behave as a narcotics specialist, required to select strictly one of the categories or to label the post as “unclear” if the content did not allow a confident classification.

In this manuscript, “unclear” refers exclusively to the classifier output label, whereas semantic ambiguity is treated as an analytical property of unresolved or under-specified cases.

To improve reproducibility, the core prompts used in the study are reported in representative form in Appendix A. The prompts were kept stable across runs, with only the admissible output labels being updated when the taxonomy was extended. All runs were executed with temperature = 0 in order to minimise output variability.
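The constraint that the model must return strictly one admissible value, or "unclear" otherwise, can be enforced on the output side. The following is a minimal sketch under stated assumptions: the prompt wording and the function names build_prompt and normalise_label are hypothetical and do not reproduce the prompts in Appendix A or the contents of drugs-initial.py, and no model call is made.

```python
BASE_LABELS = ("powder", "crystal-rock", "pill-tablet-capsule")
FALLBACK = "unclear"


def build_prompt(post_text, labels=BASE_LABELS):
    """Assemble a prompt; the wording here is hypothetical, not the study's."""
    options = ", ".join([*labels, FALLBACK])
    return (
        "You are a narcotics specialist. "
        f"Choose exactly one label from: {options}.\n"
        f"Post: {post_text}\n"
        "Label:"
    )


def normalise_label(raw_output, labels=BASE_LABELS):
    """Map a raw model answer to an admissible label, else the fallback."""
    cleaned = raw_output.strip().strip(".\"'").lower()
    return cleaned if cleaned in labels else FALLBACK


# Only the admissible label set changes when the taxonomy is extended.
EXTENDED_LABELS = BASE_LABELS + ("plant-matter", "liquid", "blotter")
print(normalise_label(" Powder. "))
print(normalise_label("gummy"))
print(normalise_label("liquid", EXTENDED_LABELS))
```

Keeping the prompt template fixed and passing the label set as a parameter mirrors the stated design in which only the admissible output labels change between runs.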

The results of the initial classification are presented in Table 5:

The distribution of the initial classification, drawn from the pre_distribution.csv data, is shown in Figure 2. The predominance of powder, crystal-rock and unclear can be observed, confirming the need to refine the scheme before the human review phase.

The results show that the model clearly classified 76.48% of the records (powder, crystal-rock and pill-tablet-capsule), with a predominance of the powder category. However, the 683 records assigned the label “unclear” (23.52%) indicated a substantial number of unresolved cases in the corpus, which in turn motivated a second methodological phase of taxonomic extension and refinement.

Under this scheme, the task assigned to the LLM was to combine the initial proposals within this classification with its own language processing capabilities in order to classify content and propose taxonomic extensions to the original scheme.

Mistral 7B was selected as the base inference model because it produced stable deterministic outputs under fixed prompting conditions whilst remaining computationally viable in a local environment. Nevertheless, the use of a single non-fine-tuned LLM does not allow model-specific effects to be ruled out entirely. A formal inter-model robustness analysis and an out-of-sample validation on held-out data are identified as relevant directions for future work, in order to determine whether the main ambiguity patterns observed are attributable to model-specific behaviour or instead reflect structural properties of the corpus and the proposed taxonomy.

3.3. Identification of Ambiguous Records and Basis for Extension

The analysis of the preliminary results showed that the high proportion of posts assigned the label “unclear” was associated with two main causes:

The diversity of expressions and slang specific to the forums, which include colloquial or coded descriptions;
The limitation of the initial categories, which were insufficient to represent all the morphological manifestations observed.

From the 683 posts initially assigned the label “unclear”, a review subset (S) was constructed, to which the extension HITL phase was applied. This subset was reprocessed using Mistral 7B to detect morphological cues by combining the fields content-translated and keywords, thus enabling the inference of descriptive patterns that suggested new potential classes (for example, edible solid, oil extract, vape cartridge or gel capsule).

Subsequent human review verified the linguistic coherence of the proposals and consolidated those categories with sufficient statistical support and contextual grounding. This iteration significantly reduced the proportion of ambiguous records and broadened the semantic coverage of the taxonomy.

Taken as a whole, the applied pipeline, from data preparation to HITL validation, constitutes a reproducible and scalable methodology that combines automated inference, expert control and documentary traceability. The final outcome is an expanded and verifiable taxonomy, aligned with MISP standards and specifically adapted to the domain of drug trafficking on the Dark Web.

To distinguish between model misclassification within the existing taxonomy and genuine evidence of missing taxonomic categories, the label “unclear” was treated as a classifier output indicating unresolved cases at the initial stage, rather than as direct evidence of taxonomy incompleteness. During the HITL stage, each record in subset S was manually reviewed against the original three-category scheme (powder, crystal-rock, pill-tablet-capsule) before any new category was considered. Records were assigned to one of three outcomes: (i) reassignable to an existing category, indicating probable model under-classification; (ii) not reassignable but showing recurrent and semantically coherent morphological evidence, indicating a candidate taxonomic gap; or (iii) remaining ambiguous due to insufficient or non-morphological evidence. Only the second group was considered eligible for taxonomic extension.

4. Human Review and Taxonomic Extension (HITL Process)

4.1. Foundations of the HITL Approach

The human review phase constitutes the central axis of the HITL process applied in this research. This component was implemented after the initial automatic classification with the Mistral 7B model, with the aim of detecting semantic gaps, identifying emerging morphological patterns and validating the extension of the MISP taxonomy in the drugs domain.

The HITL approach makes it possible to balance the statistical inference of the model with expert judgement, ensuring that newly incorporated categories are grounded both in empirical evidence and in ontological coherence. The interaction between the model and the human reviewer is not merely corrective but also constructive and explanatory, as the system generates hypotheses based on morphological cues that are subsequently evaluated and refined by the analyst.

4.2. Selection of the Review Subset (S)

The process relies exclusively on the results of the initial classification and the base taxonomy. The review subset (S) was defined as the set of posts initially classified as unclear (683 posts).

Importantly, inclusion in subset S did not imply that a post necessarily required a new category. Rather, S was designed as a validation stratum containing both potentially under-classified posts and genuinely out-of-taxonomy cases. This distinction was resolved during human review by testing whether the post could be confidently mapped onto one of the existing base classes using the operational definition of the primary physical form. Only when such reassignment was not justified, and when recurrent cue patterns exceeded the predefined thresholds, was the case treated as supporting taxonomic extension.

This set constitutes the subset S, representing the cases in which the model was unable to determine a primary physical form with sufficient confidence.

The corresponding file (S.csv) and the selection rules (AMBIGUITY_SELECTION.md) were documented to ensure process traceability. This subset was used as the basis for applying the HITL pipeline, in which the model and the expert collaborate in the detection, quantification and validation of morphological cues.

Human Review Protocol and Reviewer Agreement

Human review protocol. The HITL validation stage was conducted by two reviewers with complementary expertise: one researcher in cyber-intelligence and digital forensics, and one researcher in computational linguistics/NLP applied to illicit online discourse. Both reviewers independently examined the records in subset S, assessed the semantic adequacy of the extracted cues, and evaluated whether the proposed cue families justified category creation, merging, redirection, or rejection.

Inter-rater reliability. To assess annotation consistency, a double-review procedure was applied to the full subset S. Agreement was calculated at the level of the final taxonomic decision (retain existing class/create new class/merge/reject as non-morphological or insufficient). Inter-rater agreement was Cohen’s κ = 0.82, indicating strong agreement, with a raw agreement of 89.3%. Disagreements were resolved through discussion and, where necessary, by consulting the operational definition of primary physical form adopted in the study.
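For readers unfamiliar with the statistic, the relationship between raw agreement and Cohen's κ can be sketched as follows. This is an illustrative implementation on toy decisions, not the computation used in the study; the decision labels and the function name cohen_kappa are assumptions.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty item list.")
    n = len(rater_a)
    # Observed (raw) agreement: share of items with identical decisions.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy decisions (hypothetical), not data from the study.
reviewer_1 = ["retain", "retain", "create", "reject"]
reviewer_2 = ["retain", "create", "create", "reject"]
kappa = cohen_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))
```

Because κ discounts agreement expected by chance, it is lower than the raw agreement whenever the raters' label distributions overlap, which is why both figures are reported.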

Both reviewers independently examined the records in subset S using a decision protocol with three ordered questions: (1) Does the post contain sufficient morphological evidence to be assigned to one of the existing categories (powder, crystal-rock, pill-tablet-capsule)? If yes, the case was treated as probable model misclassification or under-classification within the original taxonomy. (2) If not, does the post contain recurrent and semantically coherent morphological evidence not captured by the base taxonomy? If yes, the case was marked as candidate evidence for taxonomic extension. (3) If neither condition was met, the record remained assigned to the label “unclear” due to insufficient, mixed, or non-morphological evidence. This protocol ensured that new categories were not created from isolated model errors, but only from repeated and validated out-of-taxonomy patterns. Table 6 summarises the manual review setting and the agreement achieved between reviewers, while Table 7 presents the distribution of the validation outcomes observed in the reviewed subset.
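The ordering of the three questions matters, because a record that fits an existing category is never considered as evidence for extension. A minimal sketch of this decision order is given below; the names Outcome and review_outcome are hypothetical, and the two boolean inputs stand in for the reviewers' judgements.

```python
from enum import Enum


class Outcome(Enum):
    REASSIGN_EXISTING = "reassignable to an existing category"
    CANDIDATE_GAP = "candidate evidence for taxonomic extension"
    REMAINS_UNCLEAR = "remains unclear"


def review_outcome(fits_existing, recurrent_coherent_evidence):
    """Apply the three ordered questions of the review protocol."""
    # (1) Sufficient evidence for powder, crystal-rock or pill-tablet-capsule?
    if fits_existing:
        return Outcome.REASSIGN_EXISTING
    # (2) Recurrent, coherent morphological evidence outside the base taxonomy?
    if recurrent_coherent_evidence:
        return Outcome.CANDIDATE_GAP
    # (3) Otherwise the record keeps the label "unclear".
    return Outcome.REMAINS_UNCLEAR


print(review_outcome(True, True).value)
print(review_outcome(False, True).value)
print(review_outcome(False, False).value)
```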

4.3. Extraction of Cues and Semantic Grouping

In this phase, the Mistral 7B model was instructed to extract, for each row in subset S, a set of morphological cues (Ci), combining the information contained in the fields content-translated and keywords. These cues are terms or expressions that function as semantic indicators of the physical form of the substance, for example: pill, crystal, gummy, rock, capsule, resin.

Each cue c has a frequency $f(c)$, defined as the number of rows $i \in S$ in which it appears, and a prevalence $\hat{p}(c) = f(c) / |S|$.

To avoid terminological ambiguities and redundancy, the cues were grouped into neutral semantic families or cue groups (G), according to the morphological similarity of the terms. The main groups defined were:

Oral_solid → {pill, tablet, capsule, bar}
Crystal_like → {crystal, rock, shard}
Powder_like → {powder, flake, dust}
Edible_matrix → {gummy, brownie, cookie, chocolate, candy}
Concentrate_solid → {hash, resin, wax, extract}
Liquid_like → {oil, syrup, droplet}

The group support s(G) is defined as the number of rows that contain at least one cue belonging to family G.

To avoid inflating support, if a single post includes several synonyms within the same group, the row is counted only once. This procedure reduces the variance associated with synonymy and improves the accuracy of the estimation of the targeted morphological concept.
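The counting rules for cue frequency, prevalence and group support can be sketched as follows. The cue groups are those listed above; the toy rows and the function names are illustrative assumptions, not the study's data or code.

```python
from collections import Counter

CUE_GROUPS = {
    "Oral_solid": {"pill", "tablet", "capsule", "bar"},
    "Crystal_like": {"crystal", "rock", "shard"},
    "Powder_like": {"powder", "flake", "dust"},
    "Edible_matrix": {"gummy", "brownie", "cookie", "chocolate", "candy"},
    "Concentrate_solid": {"hash", "resin", "wax", "extract"},
    "Liquid_like": {"oil", "syrup", "droplet"},
}


def cue_frequency(rows):
    """f(c): number of rows in which cue c appears (once per row)."""
    freq = Counter()
    for cues in rows:
        freq.update(set(cues))
    return freq


def group_support(rows, groups=CUE_GROUPS):
    """s(G): number of rows containing at least one cue of family G."""
    support = Counter()
    for cues in rows:
        row_cues = set(cues)
        for name, members in groups.items():
            if row_cues & members:  # synonyms in one row count only once
                support[name] += 1
    return support


# Toy rows (hypothetical), each a list of cues extracted from one post.
rows = [["pill", "tablet"], ["crystal"], ["gummy", "candy", "pill"], []]
freq = cue_frequency(rows)
support = group_support(rows)
prevalence = {cue: count / len(rows) for cue, count in freq.items()}
print(freq["pill"], support["Oral_solid"], prevalence["pill"])
```

In the toy data the first row mentions both pill and tablet, yet it contributes a single unit of support to Oral_solid, which is the behaviour described above.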

4.4. Definition of Thresholds and Decision Criteria

The HITL process established conservative thresholds for deciding when to add or merge categories within the taxonomy, in order to minimise false positives arising from noise or anecdotal occurrences. The decision criteria were defined as follows:

Addition of a new form
Minimum prevalence: $\hat{p}(G) \geq 0.005$ (≥0.5% of the sample).
Minimum absolute frequency: $f(c) \geq 5$ occurrences.

Meeting both thresholds is required in order to consider the creation of a new value in the predicate form.

Merging of existing values
At least one of the source values must exist in the initial version of the taxonomy.
Group support: ≥1% of the sample (a critical mass of 18–20 real examples).

These mergers are applied when several pre-existing categories represent lexical variants or conceptual redundancies (for example, pill, tablet and capsule).

With a sample of 683 records, the thresholds correspond to:

0.5% → a minimum of 4 cases.
1.0% → a minimum of 7 cases.

The HITL model uses these combined metrics (relative and absolute) to distinguish between statistical noise and structured evidence, ensuring that each proposed extension is backed by a significant empirical volume and a coherent semantic context.
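The conversion of the relative thresholds into case counts, and the combined decision rules, can be sketched as follows. This is an illustrative reading of the criteria: the function names are hypothetical, and passing the group support and the cue frequency as two separate arguments is an assumption about how the two addition thresholds are applied.

```python
import math

SAMPLE_SIZE = 683           # size of subset S
MIN_PREVALENCE_ADD = 0.005  # 0.5% of the sample
MIN_ABS_FREQ_ADD = 5        # minimum absolute frequency
MIN_SUPPORT_MERGE = 0.01    # 1% of the sample


def min_cases(share, n=SAMPLE_SIZE):
    """Smallest whole number of rows that reaches a relative threshold."""
    return math.ceil(share * n)


def can_add(group_support, cue_frequency, n=SAMPLE_SIZE):
    """Both the relative and the absolute threshold must be met."""
    return (group_support / n >= MIN_PREVALENCE_ADD
            and cue_frequency >= MIN_ABS_FREQ_ADD)


def can_merge(group_support, has_source_in_base, n=SAMPLE_SIZE):
    """Merging needs a pre-existing source value and at least 1% support."""
    return has_source_in_base and group_support / n >= MIN_SUPPORT_MERGE


print(min_cases(MIN_PREVALENCE_ADD), min_cases(MIN_SUPPORT_MERGE))
print(can_add(4, 4), can_add(5, 5))
print(can_merge(7, True), can_merge(6, True))
```

With 683 records, the 0.5% prevalence threshold alone would admit a group with four cases, so the absolute minimum of five occurrences is the binding constraint for additions.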

The set of 683 records in subset S was used to propose candidates for taxonomic extension, of which 475 yielded at least one extension proposal.

4.5. Results of the HITL Process

4.5.1. Consolidation and New Categories

The analysis of cue distributions revealed robust support for the plant_like and oral_solid families, which justified merging the labels plant, herb and weed under a single category named plant matter, and consolidating pill, tablet, and capsule under pill-tablet-capsule. At the same time, the 683 records initially labelled as “unclear” (23.52%) confirmed the persistence of semantic ambiguity in the corpus.

In addition, two further valid categories were identified that exceeded the defined thresholds and showed both morphological and contextual coherence, as summarised in Table 8:

The cue family associated with resin-like materials (e.g., hash, hashish, charas, resin) was examined during the HITL stage because of its conceptual relevance in illicit drug markets. However, after expert review it was not retained as an independent final category in the extended taxonomy, as its empirical support and contextual consistency were not sufficient to justify a stable standalone class under the conservative inclusion criteria adopted in this study. Instead, these cases were treated as context-dependent concentrate-like references and documented as a relevant candidate for future refinement.

4.5.2. Evaluated and Rejected Cases

The HITL process also considered candidate categories that did not reach the thresholds or that were interpreted as documentary aliases of existing values, as summarised in Table 9:

These results reinforce the non-arbitrary nature of the process: proposals arise from the corpus, are quantified empirically and are filtered according to predefined criteria before final human approval.

4.5.3. Exclusion Criteria (HITL Rejections)

The pipeline also identified non-morphological cues which, despite their frequency, do not represent a valid primary physical form. These were excluded from the final computation in order to avoid distorting the metrics or inducing erroneous categories.

The exclusion groups defined include:

Tools or utensils: needle, vial (routes of administration).
Transaction or concealment: banknotes, bills, euro bills (economic or concealment indicators).
Chemical substance: heroin, ketamine, methamphetamine (composition, not morphology).
Quantities or units: 5 g, 1 g, kg, uncut (sales magnitudes).
Composition or mixture: mixed, combo, sugar (additives or mixes).

When a record contained both valid morphological cues and exclusion cues, the system prioritised the morphological evidence. In cases with only exclusion cues, the final result was labelled as unclear.
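The priority given to morphological cues over exclusion cues can be sketched as a small resolution function. The cue-to-form mapping below is a hypothetical fragment for illustration, the exclusion set reuses examples from the groups listed above, and returning unclear when cues point to several different forms is an assumption that the text does not specify.

```python
# Hypothetical fragment of a cue-to-form mapping, for illustration only.
FORM_BY_CUE = {
    "powder": "powder",
    "crystal": "crystal-rock",
    "shard": "crystal-rock",
    "pill": "pill-tablet-capsule",
    "capsule": "pill-tablet-capsule",
}

# Examples drawn from the exclusion groups listed in the text.
EXCLUSION_CUES = {
    "needle", "vial", "banknotes", "bills", "heroin", "ketamine",
    "methamphetamine", "5 g", "1 g", "kg", "uncut", "mixed", "combo", "sugar",
}


def resolve_form(cues):
    """Prefer morphological evidence; exclusion cues never decide the form."""
    forms = {
        FORM_BY_CUE[cue]
        for cue in cues
        if cue in FORM_BY_CUE and cue not in EXCLUSION_CUES
    }
    if len(forms) == 1:
        return next(iter(forms))
    # No morphological cue, or conflicting forms (the latter is an assumption).
    return "unclear"


print(resolve_form(["heroin", "powder", "5 g"]))
print(resolve_form(["heroin", "kg"]))
```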

4.5.4. Synthesis of Results and Extended Version of the Taxonomy

The HITL process concluded with a verifiable and documented extension of the MISP taxonomy for the drugs domain. The final set of categories for the predicate form = primary_physical_form is defined as:

powder, crystal-rock, plant-matter, pill-tablet-capsule, liquid, blotter.

In this way, the taxonomy moves from a chemically descriptive focus to a morphologically and linguistically contextualised classification, aligned with the discursive reality of Dark Web forums.

Each decision to add or merge categories is justified with quantitative evidence, ensuring transparency, reproducibility and ontological coherence throughout the process.

4.6. Reclassification with the Extended Taxonomy

Following validation and consolidation of the HITL process, a full reclassification of the corpus was carried out using the extended primary physical form taxonomy. The aim of this new iteration was to assess the practical effectiveness of the final scheme, quantify changes in class distribution and determine the reduction in ambiguity achieved after human intervention.

The process consisted of re-running the classifier over the entire set of posts (N = 6456), using the updated version of the taxonomy, which comprises the following values:

powder, crystal-rock, plant-matter, pill-tablet-capsule, liquid, blotter.

To guarantee the comparability of results, all experimental conditions used in the initial classification were kept constant, with only the list of available categories being modified. The conditions are described below:

Model used: Mistral 7B (same configuration as previously).
Temperature: 0, ensuring deterministic and stable behaviour in responses.
Model inputs: concatenation of the fields page_title and keywords, previously translated and semantically normalised.
Prompting strategy: identical to the previous phase, with the sole difference that the set of possible output values was updated to the final version of the extended taxonomy.
Expected output type: a single physical form value per post; in the absence of sufficient evidence, the system was required to return the unclear marker.

Model execution was automated via the script drugs-final-expanded.py, configured to record both the final prediction and the estimated contextual confidence, in order to enable subsequent comparative analyses. The total inference time was approximately 12 h in a local hardware environment equipped with two AMD EPYC 7552 48-Core processors (96 cores/192 threads total), six NVIDIA Quadro RTX 5000 GPUs (16 GB GDDR6 each, 96 GB total VRAM), and 640 GB of DDR4 ECC RAM at 3200 MHz, processing batches of 256 posts per iteration.
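The batching described above can be sketched independently of the model call. The helper below is illustrative and is not taken from drugs-final-expanded.py; only the batch size of 256 and the corpus size of 6456 come from the text.

```python
import math

BATCH_SIZE = 256  # batch size reported in the text


def batches(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


n_posts = 6456
n_batches = math.ceil(n_posts / BATCH_SIZE)

# Toy run over placeholder post identifiers instead of real posts.
post_ids = list(range(n_posts))
batch_lengths = [len(batch) for batch in batches(post_ids)]
print(n_batches, batch_lengths[0], batch_lengths[-1])
```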

This second classification constitutes the comparative evaluation stage of the work, making it possible to observe how the incorporation of new categories affects corpus redistribution and the reduction in ambiguous cases under controlled conditions. Because a fully annotated ground-truth dataset was not available for the full corpus, ambiguity reduction was not treated as sufficient evidence of performance improvement on its own.

The following section details the quantitative results obtained after this reclassification, including the evolution of class proportions, the decrease in the unclear category and the semantic implications derived from the application of the extended taxonomy.

5. Analysis of Results

Two different ambiguity indicators were considered during the pipeline: (i) local ambiguity reduction within the review subset during the HITL refinement stage, and (ii) global ambiguity reduction in the full corpus after final reclassification. The manuscript reports the second indicator as the primary summary measure, in order to avoid confusion between intermediate and corpus-level effects.

5.1. General Classification Statistics

After running the Mistral 7B model with the extended primary physical form taxonomy, the corpus-level results show a substantial reduction in semantic ambiguity and a more differentiated class distribution. However, because these in-corpus comparisons are based on the same dataset used to derive the taxonomy extension, they are interpreted as descriptive evidence of improved fit rather than as sufficient proof of generalizable performance.

Consequently, reductions in the “unclear” label are presented as changes in the classifier output, while semantic ambiguity is interpreted as a broader analytical construct.

Comparing the initial classification (v1) with the subsequent reclassification (v2) makes it possible to observe the concrete effects of the taxonomic extension and the HITL process on class distribution.

In both cases, the reported percentages are calculated relative to the 2904 posts included in the morphology-classification subset used for direct PRE/POST comparison.

In the initial version, the proportion of posts assigned the label unclear reached 23.52% of the total, indicating a substantial number of unresolved cases in the identification of primary physical form. After reclassification, this value fell to 11.29%, representing a decrease of 12.23 percentage points, equivalent to a 51.99% relative reduction. This change constitutes the clearest classifier-output indication of the positive effect of the HITL pipeline and, at corpus level, is consistent with a reduction in semantic ambiguity.
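The two ways of expressing the change, absolute (percentage points) and relative, can be sketched from the reported percentages. Since the inputs are already rounded to two decimals, the relative figure computed here agrees with the reported 51.99% only up to rounding.

```python
pre_unclear = 23.52   # % of posts labelled unclear in v1
post_unclear = 11.29  # % of posts labelled unclear in v2

# Absolute change in percentage points.
delta_pp = pre_unclear - post_unclear

# Relative reduction with respect to the initial share.
relative_reduction = delta_pp / pre_unclear * 100

print(f"{delta_pp:.2f} pp, {relative_reduction:.1f}% relative reduction")
```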

Importantly, the reduction in posts assigned the label unclear should not be interpreted exclusively as evidence of missing categories in the original taxonomy. Manual validation showed that a substantial fraction of the reviewed cases could in fact be reassigned to existing classes, indicating model under-classification, whereas only a smaller but recurrent subset provided evidence for genuine taxonomic extension. Accordingly, in this manuscript, unclear is treated as a classifier output label, while semantic ambiguity is interpreted as a broader analytical property of unresolved or under-specified cases.

The behaviour of the crystal-rock category provides an additional indicator of structural stability in the classification scheme. This category remained broadly stable after reclassification, changing from 20.90% in the initial version to 23.14% in the final classification, a slight increase of 2.24 percentage points. This relative stability suggests that the extension process mainly affected ambiguous or under-specified records, while posts already associated with crystal- or rock-related forms remained consistently classified across both versions.

Taken together, the results of the initial classification (v1) are summarised in Table 10:

The behaviour of the remaining categories confirms the coherence of the reclassification. For example, crystal-rock increased only marginally from 20.90% to 23.14% (2.24 pp), which suggests strong stability in contexts where markers such as shard, rock or crystal are present. By contrast, powder decreased from 39.36% to 35.88% (−3.48 pp), a result consistent with the reassignment of certain records to more specific categories such as plant-matter.

Finally, the new categories introduced in the taxonomic extension show real, albeit limited, coverage. Plant-matter reaches 8.23% (239 rows), liquid accounts for 3.27% (95 rows) and blotter represents 1.34% (39 rows). Although modest, these percentages are consistent with the expected distribution of such posts in the forums analysed.
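The reported percentages follow directly from the row counts and the subset size of 2904 posts. A minimal check, using only counts stated in the text, is sketched below; the dictionary keys are illustrative.

```python
SUBSET_SIZE = 2904  # posts in the morphology-classification subset

# Row counts as reported in the text.
reported_rows = {
    "unclear (v1)": 683,
    "plant-matter": 239,
    "liquid": 95,
    "blotter": 39,
}

shares = {
    label: round(100 * rows / SUBSET_SIZE, 2)
    for label, rows in reported_rows.items()
}
for label, share in shares.items():
    print(f"{label}: {share}%")
```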

Overall, the final classification (v2) can be summarised as follows:

Table 11 presents the final classification, which shows a reduction in posts assigned the label unclear and an increase in pill-tablet-capsule that is consistent with the consolidation of oral solid forms. The subsequent comparison between PRE and POST summarises the changes in percentage points (Δ pp), highlighting the drop in the unclear label and the reassignment of part of these previously unresolved cases into more specific categories, including pill-tablet-capsule, as well as moderate adjustments in powder, crystal-rock and liquid. Sources: pre_distribution.csv and post_distribution.csv.
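The change in percentage points (Δ pp) per category can be sketched as follows. Only percentages quoted in the running text are used; values for pill-tablet-capsule appear in the tables rather than in the text and are therefore omitted. Treating the newly introduced categories as 0% in the PRE distribution is an assumption that follows from their absence from the initial label set.

```python
# Percentages quoted in the text (v1 = PRE, v2 = POST).
PRE = {"powder": 39.36, "crystal-rock": 20.90, "unclear": 23.52}
POST = {
    "powder": 35.88,
    "crystal-rock": 23.14,
    "unclear": 11.29,
    "plant-matter": 8.23,
    "liquid": 3.27,
    "blotter": 1.34,
}


def delta_pp(pre, post):
    """POST minus PRE per label; labels missing on one side count as 0."""
    labels = sorted(set(pre) | set(post))
    return {
        label: round(post.get(label, 0.0) - pre.get(label, 0.0), 2)
        for label in labels
    }


deltas = delta_pp(PRE, POST)
for label, change in deltas.items():
    print(f"{label}: {change:+.2f} pp")
```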

Figure 3 compares the PRE and POST distribution of posts by physical form. Its purpose is to provide a concise overview of the changes introduced by the taxonomic reclassification in the morphological structure of the corpus.

These data show that the system has succeeded in reducing uncertainty and rebalancing class proportions according to a more precise semantic structure, thereby validating the methodological impact of the taxonomic extension pipeline.

5.2. Transition Analysis and Structural Stability After Reclassification

Analysis of the transition matrix between the initial classification (v1) and the extended classification (v2) makes it possible to examine how category migrations occurred within the same analytical subset and which reclassification flows accounted for the main changes introduced by the extended taxonomy.

The diagonal of the matrix, which represents exact matches between the two versions, concentrates most of the cases, indicating high structural stability in the model and strong coherence in the already consolidated categories. However, the most relevant transitions occur precisely in those cases where a direct effect of the extension process was expected, particularly in the reassignment of records initially labelled as unclear.

The most significant migrations were:

unclear → plant-matter: 113 cases.
unclear → liquid: 57 cases.
unclear → blotter: 17 cases.

The most significant migration flows are summarised visually in Figure 4.

These three transitions account for a substantial part of the reduction in records initially assigned the label unclear. Together, they represent the effective reassignment of 187 of the 683 initially unresolved records (27.38%), showing that the refinement of morphological categories improved the interpretability of cases that were previously under-specified at the classifier-output level.

The remaining transitions reflect adjustments among records that already carried a base category, such as the reassignments powder → plant-matter (118 cases), powder → liquid (36 cases) and pill-tablet-capsule → blotter (21 cases). These migrations can be interpreted as the natural result of introducing more precise morphological markers, for example wax, shatter, crumble, which were previously subsumed under more generic categories such as powder or paste.

In general terms, the transition matrix confirms that the greatest reclassification flow is concentrated in transitions from unclear to newly differentiated categories, especially plant-matter. This trend reinforces the hypothesis that a substantial part of the initial semantic ambiguity stemmed from posts containing imprecise references to forms that the original model was unable to discriminate adequately under the initial category structure.
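A transition matrix of this kind can be built by pairing the v1 and v2 label of each post. The sketch below uses toy labels for the construction and the reported flow counts for the share of reassigned unclear records; the function names are hypothetical.

```python
from collections import Counter


def transition_counts(v1_labels, v2_labels):
    """Count (v1, v2) label pairs over the same ordered set of posts."""
    if len(v1_labels) != len(v2_labels):
        raise ValueError("Both classifications must cover the same posts.")
    return Counter(zip(v1_labels, v2_labels))


def diagonal_share(transitions):
    """Share of posts whose label is identical in both versions."""
    total = sum(transitions.values())
    unchanged = sum(n for (old, new), n in transitions.items() if old == new)
    return unchanged / total if total else 0.0


# Toy labels (hypothetical), one entry per post.
v1 = ["unclear", "powder", "powder", "crystal-rock"]
v2 = ["plant-matter", "powder", "plant-matter", "crystal-rock"]
toy = transition_counts(v1, v2)

# Flows out of "unclear" as reported in the text.
reported_from_unclear = {"plant-matter": 113, "liquid": 57, "blotter": 17}
reassigned = sum(reported_from_unclear.values())
reassigned_share = reassigned / 683

print(toy[("unclear", "plant-matter")], diagonal_share(toy))
print(reassigned, f"{reassigned_share:.2%}")
```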

5.3. Evaluation of Ambiguity and Model Stability

The reduction in the percentage of posts assigned the label unclear is the main classifier-output indicator of improvement in the model, as shown in Table 12. Moving from 23.52% to 11.29% implies not only a numerical decrease in unresolved outputs, but also a more precise fit between the classification scheme and the morphological patterns present in the corpus. Analysis of the reclassified cases shows that the HITL process did not generate overfitting or distort the overall structure of the taxonomy. In fact, the most relevant percentage variations are concentrated in classes directly affected by the new definitions or mergers, such as pill-tablet-capsule, while the remaining categories remain practically stable.

This stability is evidence of the ontological maturity of the model: the introduction of new values did not significantly alter the global distribution, which suggests that the taxonomic extension did not add noise to the system but rather improved the local precision of classification.

The reduction in unresolved cases at the classifier-output level, together with the stability of proportions and the observed semantic coherence, indicates that the HITL methodology applied was both effective and scalable.

6. Contextual Application of the Extended Taxonomy Through Co-Occurrence Network Analysis

The network representation generated with VOSviewer 1.6.20 made it possible to explore the semantic relationships between the most frequent terms in Dark Web drug forums and to identify the underlying thematic structure of the classified corpus.

Beyond its exploratory value, the co-occurrence network was used here as a contextual application layer for the extended taxonomy. Rather than introducing a separate line of analysis, this section examines whether the final morphology-based categories are embedded in coherent semantic environments within the forum discourse. In this sense, the network analysis contributes to RQ6 by showing how the taxonomic extensions identified through the LLM+HITL pipeline relate to broader thematic, commercial, and logistical structures in the corpus.

To construct the map, a minimum threshold of eight occurrences per term was established, applying a lexical normalisation thesaurus (thesaurus_drugs) that unified variants and synonyms.

This threshold was selected as a compromise between semantic coverage and visual interpretability: lower thresholds generated excessively dense maps dominated by rare or idiosyncratic terms, whereas higher thresholds removed relevant domain-specific vocabulary and reduced thematic diversity. In practical terms, the threshold of eight retained 127 interpretable nodes while filtering out sparse lexical noise. The normalisation method employed was Association Strength, with full counting, which ensures a proportional representation of semantic co-occurrence between terms. The final result comprised 127 nodes distributed across six thematic clusters (C1–C6), interpreted as semantic communities that reflect the discourses, products and dynamics of the cryptomarkets analysed.
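The map-construction steps (thesaurus normalisation, minimum-occurrence filtering and association-strength normalisation) can be sketched outside VOSviewer. This is an approximation under stated assumptions: the thesaurus entries are hypothetical, each co-occurring pair in a post contributes a weight of 1, and association strength is taken as the co-occurrence count divided by the product of the two terms' occurrence counts, so VOSviewer's own scaling may differ by a constant factor.

```python
from collections import Counter
from itertools import combinations

# Hypothetical entries; the study's thesaurus_drugs file is not reproduced.
THESAURUS = {"tabs": "pills", "pill": "pills"}
MIN_OCCURRENCES = 8  # threshold reported in the text


def normalise(term):
    """Lower-case a term and map variants to their preferred form."""
    term = term.lower()
    return THESAURUS.get(term, term)


def build_network(posts, min_occurrences=MIN_OCCURRENCES):
    """Return retained terms and association strengths between them."""
    occurrences = Counter()
    cooccurrences = Counter()
    for terms in posts:
        unique = sorted({normalise(term) for term in terms})
        occurrences.update(unique)
        cooccurrences.update(combinations(unique, 2))
    kept = {term for term, n in occurrences.items() if n >= min_occurrences}
    strength = {
        (a, b): count / (occurrences[a] * occurrences[b])
        for (a, b), count in cooccurrences.items()
        if a in kept and b in kept
    }
    return kept, strength


# Toy posts (hypothetical keyword lists).
posts = [["pills", "shipping"]] * 8 + [["tabs", "rare"]]
kept, strength = build_network(posts)
print(sorted(kept), strength)
```

In the toy data the term rare is dropped by the threshold, while tabs is merged into pills before counting, which illustrates why normalisation has to precede filtering.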

Cluster detection was performed using the VOSviewer built-in weighted modularity-based clustering procedure, which groups nodes according to co-occurrence strength while maximising within-cluster association. The clustering resolution parameter was set to 1.00.

To support the methodological justification for the selected threshold, Table 13 summarises a brief sensitivity check comparing three minimum-occurrence values in VOSviewer. This comparison illustrates how the threshold choice directly affected the number of retained nodes and the interpretability of the semantic map. As shown below, the value of eight occurrences provided the most suitable balance between lexical coverage and analytical clarity.

6.1. General Structure of the Network

The semantic network presents a clearly modular configuration in which several nodes act as organising axes of the conversation. The terms packing and distribution occupy central positions and concentrate a high number of links, indicating that the description of packaging and distribution processes constitutes a discursive meeting point across multiple substances and sales modalities. Around these nodes cluster references to types of drugs (heroin, ketamine, cocaine, MDMA, cannabis), forms of presentation (pills, tabs, blisters, crystal, shards) and logistical elements (shipping, worldwide_shipping, expresspost, uk2uk, marketplace). This overall network configuration is illustrated in Figure 5.

The global structure thus combines two partially overlapping dimensions: on the one hand, a product dimension sustained by the differentiation between opioids, benzodiazepines, stimulants and cannabis derivatives; on the other, a logistical dimension focused on describing shipping modes, the degree of visibility of the vendor (physical_vendor, vendor, veteran_vendor) and geographical routes (Afghanistan, Iran, Germany, Netherlands, Canada, UK, Argentina, Peru, Venezuela). The intersection of these two dimensions gives rise to a discursive ecosystem in which product identity is defined jointly by its chemical composition, its origin and the promise of safe and discreet delivery.

6.2. Identified Thematic Clusters

Cluster 1—“Multiproduct packaging and global distribution” (Red)

Cluster 1 (29 items) brings together a heterogeneous set of substances and commercial brands articulated around the semantics of packaging and shipping. It includes classic psychedelics (lsd, dmt, ecstasy, mdma, xtc_pills), analgesics and opioids (tramadol, tapentadol, suboxone), cannabis derivatives (blueberry_weed, power_plant_weed) and references to pill shapes or designs (mickey_mouse, tesla, supreme, rolls_royce). These products are linked to operational terms such as pack, packing, pills, delivery, distribution and worldwide_shipping, as well as explicit mentions of dark markets (darkdock_market, darknet_market). The cluster reflects a multiproduct discourse in which the variety of substances is integrated under a shared logic of attractive packaging, international shipping and affiliation with consolidated marketplaces.

From the standpoint of the extended taxonomy, this cluster supports the interpretability of categories such as pill-tablet-capsule and blotter, showing that these forms are embedded not only in substance naming but also in recurrent commercial and logistical discourse.

Cluster 2—“Recreational cannabis and cocaine market with geographical anchoring” (Dark green)

Cluster 2 (24 items) is organised around high-demand recreational drugs and spatial references situating the offer in a transnational context. It includes cannabis strains (afghan_kush, amherst_sour_diesel_hun, auto_american_pie, white_russian, weed, pot), depressants and anxiolytics (alprazolam, benzos, xanax), cocaine and generic terms (drugs, 1g, clearance). These terms combine with logistical and geopolitical markers (marketplace, darknet_market, tor, uk, argentina, peru, venezuela), suggesting the existence of a recreational market aimed at consumers seeking information on origin, volume and type of cultivation. The cluster represents the space of an everyday consumption economy, where emphasis falls on cannabis varieties, unit doses and the geographical location of the supplier.

Taxonomically, this cluster reinforces the contextual distinctiveness of plant-matter and powder, as the dominant lexical environment consistently links these forms to recurrent patterns of retail description, quantity signalling, and product presentation.

Cluster 3—“Synthetic wholesalers, discounts and counterfeiting” (Dark blue)

Cluster 3 (21 items) concentrates vocabulary associated with higher-volume transactions and explicit commercial strategies. Terms such as bulk, quarter_ounce, discount, discreet, packaging, shipping and worldwide_shipping refer to medium or large-scale operations, while methamphetamine, xtc and ketamine (in its variants ketamine_hcl, ketamine_s_isomer, ketamine_shards) highlight the importance of synthetic substances. The presence of distribution_asap_market and distribution_worldwide links these offers to specific marketplaces and to a global projection. The inclusion of counterfeit, euro and india_import points to an overlap between drug trafficking and monetary or document counterfeiting, where the same distribution channels are used to move both substances and fraudulent products.

From a taxonomic perspective, this cluster supports the analytical separation between crystal-rock and powder, as the co-occurring terms reflect distinct modes of presentation and circulation in wholesale and synthetic-drug discourse.

Cluster 4—“Ketamine and import circuits” (Light green)

Cluster 4 (19 items) places ketamine at the centre of a semantic network that combines chemical purity, physical form and shipping routes. The node ketamine is connected with isomer, s-isomer, s-ketamine, racemic_rocks, shard, shards, sugar_s-isomer, indicating a high degree of specialisation in the description of product variants and textures (crystal, racemic, sugar-like). Alongside these appear geographical references (Afghanistan, Germany, India) and specific logistical operators (dhlgermany, expresspost, drugpearl, drugzfromnl), composing a narrative in which origin and supply chain function as authenticity markers. This cluster reflects a professionalised discourse around ketamine, where distinctions between isomers and crystalline forms are used both as quality arguments and as identity markers for certain vendors.

This cluster provides particularly strong contextual support for the crystal-rock category, since its semantic core is structured around lexical cues that refer to crystalline texture, shard-like presentation, and visually recognisable solid forms rather than to chemical denomination alone.

Cluster 5—“Pharmaceutical opioids and physical distribution channel” (Purple)

Cluster 5 (18 items) groups terms linked to prescription opioids and to the pharmaceutical presentation of the product. It includes direct references to heroin from Afghanistan (afghan_heroine) and to high-potency opiates and opioids (dilaudid, hydromorphone, opium, oxy, oxycodone, percocet), as well as the depressant ghb, together with markers of sales format (blisters, m30, press, tabs) and quality (high_quality, pure, quality). The terms physical and uk2uk suggest the coexistence of physical and digital channels, particularly in domestic shipments within the United Kingdom that seek to minimise customs risks. The cluster describes a segment of the market that reproduces the language of the formal pharmaceutical chain but redirects it towards the illicit supply of medicines and heroin derivatives, with a strong emphasis on purity and hand-to-hand delivery.

In taxonomic terms, this cluster shows that categories such as pill-tablet-capsule, powder, and in some cases liquid are embedded in discourse where pharmaceutical naming, opioid circulation, and morphology-based presentation overlap in meaningful ways.

Cluster 6—“Heroin and high-purity stimulants with geopolitical emphasis” (Light blue)

Cluster 6 (16 items) is structured around heroin and the construction of a narrative of extreme purity and vendor expertise. The term heroin is linked to high_purity, uncut, powder, goldenbulk, which reveals a rhetoric focused on non-adulterated products and volume formats. The semantic network also incorporates references to amphetamine and to imports from countries traditionally associated with trafficking (iran, turkish_import, turkish_heroine, france), as well as to vendor identifiers (dutch, dutchdrugs, vendor, veteran_vendor, physical_vendor). This cluster expresses the more classic dimension of heroin trafficking, transposed to the digital environment and legitimised through references to professional experience, origin and exceptional quality.

Although this cluster does not map onto a single morphology-based category, it functions as a cross-cutting contextual layer that helps explain how taxonomic labels are embedded in broader evaluative and logistical discourse within Dark Web drug markets.

6.3. Relationship Between Thematic Clusters and the Extended Taxonomy

To connect the co-occurrence analysis more directly with the core contribution of the paper, Table 14 summarises the relationship between the thematic clusters identified in VOSviewer and the final morphology-based taxonomy. Rather than treating the clusters as a separate exploratory result, this mapping shows how the extended taxonomic categories are embedded in recurrent semantic, commercial, and logistical environments within the corpus.
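The taxonomic remarks that close each cluster description above can be collected into a simple lookup structure. The following is a minimal sketch and not a reproduction of Table 14: the names `CLUSTER_TO_FORMS` and `forms_to_clusters` are hypothetical, and the assignments only restate the categories named in the cluster descriptions.

```python
from collections import defaultdict

# Cluster -> morphology categories, restating the cluster descriptions above.
# C6 is described as a cross-cutting layer, so it maps to no single category.
CLUSTER_TO_FORMS = {
    "C1": ["pill-tablet-capsule", "blotter"],
    "C2": ["plant-matter", "powder"],
    "C3": ["crystal-rock", "powder"],
    "C4": ["crystal-rock"],
    "C5": ["pill-tablet-capsule", "powder", "liquid"],
    "C6": [],
}


def forms_to_clusters(mapping):
    """Invert the mapping: category -> sorted list of supporting clusters."""
    inverted = defaultdict(list)
    for cluster, forms in mapping.items():
        for form in forms:
            inverted[form].append(cluster)
    return {form: sorted(clusters) for form, clusters in inverted.items()}


FORM_TO_CLUSTERS = forms_to_clusters(CLUSTER_TO_FORMS)
```

Inverting the mapping makes it easy to see which categories draw contextual support from more than one cluster (for instance, powder) and which rest on a single cluster.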

6.4. Connection Patterns Between Nodes

Beyond the segmentation into six communities, the network displays a web of semantic trajectories that systematically connect products, routes and logistical devices. One of the most visible patterns is organised around the packing/distribution axis, which links terms from C1 with shipping-related notions from C3. Sequences such as packing (C1) → distribution (C1) → shipping (C3) → worldwide_shipping (C3) show that packaging is described as part of an integrated chain culminating in the promise of global delivery, regardless of the specific substance.

A second pattern is articulated around ketamine, which operates as a bridge between clusters C3 and C4. Chains such as ketamine_hcl (C3) → ketamine (C4) → s-isomer (C4) → sugar_s-isomer (C4) reveal a discursive continuum that moves from a generic reference to the active ingredient towards highly specific descriptors of the isomer and its physical appearance. This configuration is also associated with import routes (india_import, afghanistan, germany, drugzfromnl), reinforcing the idea of ketamine as a product with high symbolic and logistical value.

Third, heroin and opioids construct a semantic arc connecting C5 and C6. Paths such as afghan_heroine (C5) → pure (C5) → high_purity (C6) → uncut (C6) evidence a narrative continuity between pharmaceutical opioids, traditional heroin and high-purity formats offered by specialised vendors. These trajectories extend towards vendor-related nodes (veteran_vendor, physical_vendor) and transit locations (iran, turkish_import, france), integrating quality, experience and geopolitics into a single legitimising narrative.

Finally, several terms act as connectors between the recreational cannabis and cocaine market (C2) and the rest of the network. The co-occurrence of darknet_market and marketplace with drug terms such as cocaine, MDMA, cannabis and weed links recreational consumption discourses with the global logistics discourses present in C1 and C3. In this way, the map reveals a continuous space in which segmentation by substance type overlaps with affiliation to shared infrastructures of trade and distribution.
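The trajectories described in this subsection can be expressed as short paths over cluster-labelled terms, which makes the bridging links explicit. This is an illustrative sketch only: the cluster labels are those given in the text, while the names `NODE_CLUSTER`, `PATHS` and `cross_cluster_links` are hypothetical and do not correspond to any VOSviewer output format.

```python
# Term -> cluster, as stated in the trajectories described above.
NODE_CLUSTER = {
    "packing": "C1", "distribution": "C1",
    "shipping": "C3", "worldwide_shipping": "C3",
    "ketamine_hcl": "C3", "ketamine": "C4",
    "s-isomer": "C4", "sugar_s-isomer": "C4",
    "afghan_heroine": "C5", "pure": "C5",
    "high_purity": "C6", "uncut": "C6",
}

# The three example chains reported in the text.
PATHS = [
    ["packing", "distribution", "shipping", "worldwide_shipping"],
    ["ketamine_hcl", "ketamine", "s-isomer", "sugar_s-isomer"],
    ["afghan_heroine", "pure", "high_purity", "uncut"],
]


def cross_cluster_links(paths, node_cluster):
    """Return consecutive term pairs whose endpoints lie in different clusters."""
    links = []
    for path in paths:
        for src, dst in zip(path, path[1:]):
            if node_cluster[src] != node_cluster[dst]:
                links.append((src, dst, node_cluster[src], node_cluster[dst]))
    return links


BRIDGES = cross_cluster_links(PATHS, NODE_CLUSTER)
for src, dst, c_src, c_dst in BRIDGES:
    print(f"{src} ({c_src}) -> {dst} ({c_dst})")
```

Each chain crosses a cluster boundary exactly once, which is what identifies distribution/shipping, ketamine_hcl/ketamine and pure/high_purity as the bridging pairs between C1 and C3, C3 and C4, and C5 and C6 respectively.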

6.5. Global Interpretation and Response to RQ6

The identified semantic patterns reveal a complex ecosystem in which at least three major discursive axes are combined. The first is a recreational–commercial axis centred on cannabis, cocaine, MDMA, LSD and ketamine, where references to strains, unit doses, pill design and brand-oriented marketing predominate. The second is a pharmaceutical–opioid axis structured around prescription opioids, benzodiazepines and high-purity heroin, which informally reproduces the semantics of the pharmaceutical chain (quality, dosage, origin, physical channel). The third is a transnational–logistical axis that cuts across the entire network, integrating vocabulary related to packaging, shipping, geographical routes and vendor visibility.

These three axes should be understood as higher-order interpretive dimensions emerging from the interaction among the six clusters, rather than as a replacement for the cluster structure itself.

The convergence of these three axes confirms that Dark Web forums do not merely list products but construct a shared language in which the identity of each offer is defined by the combination of substance, form of presentation and distribution guarantees. Highly central nodes (packing, distribution, ketamine, heroin, physical_vendor) act as discursive hubs connecting the different clusters and articulating a semantics of professionalised crime, in which chemical purity, vendor reputation and logistical efficiency are strategic elements for generating trust.

With respect to RQ6, the co-occurrence network does not operate as an independent exploratory result, but as a contextual application of the extended taxonomy. The six clusters show that the proposed morphology-based categories are not isolated labels but are embedded in recurrent semantic environments associated with packaging, purity, global shipping, pharmaceutical branding, and product presentation. In particular, categories such as plant-matter, crystal-rock, liquid, blotter, and pill-tablet-capsule appear linked to differentiated thematic constellations, which supports their interpretability within the discourse structure of Dark Web drug forums.

7. Discussion

This study demonstrates the methodological and conceptual feasibility of adapting MISP taxonomies to non-conventional thematic domains, such as drug trafficking on the Dark Web, through a hybrid process that combines automated inference with expert review. The integration of LLMs with HITL methodologies enables progress towards more adaptive, transparent and reproducible systems for semantic classification, overcoming the limitations identified in the state of the art.

This section discusses the findings of the study. It begins with research question RQ1 and integrates the results of the ontological extension, automated classification and semantic validation stages with the responses to the research questions formulated in the introduction.

In light of the results, it is confirmed that the ontological structure of MISP can be extended through a semantic recontextualisation centred on the primary physical form of substances. In contrast to traditional approaches, where MISP taxonomies are restricted to technical incidents or chemical compositions [9,11], this work proposes a model oriented towards the linguistic and commercial morphology of discourse in onion forums. The shift from a chemical predicate to a morphological one (for example, form = primary_physical_form) aligns with trends observed in the recent literature on contextual categorisation in cyber intelligence, where phenotypic descriptions of phenomena are prioritised over rigid taxonomies [3,6].

This adaptation not only broadens the applicability of MISP, but also introduces a reproducible framework for domains in which textual information is noisy, incomplete or polysemic, a structural feature of illicit digital markets [2]. Consequently, the proposed model contributes to the convergence between computational semantics and operational ontologies in cyber intelligence.
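To make the morphological predicate concrete, the sketch below assembles a dictionary in the general layout of a MISP taxonomy file and derives the corresponding machine tags in the `namespace:predicate="value"` form. Several elements are assumptions rather than details from the paper: the namespace `dark-web-drugs`, the predicate spelling `primary-physical-form`, the description strings, and the exact set of fields, which should be checked against the MISP taxonomy format before use. The value list contains only the categories named in this article and may not be exhaustive.

```python
import json

NAMESPACE = "dark-web-drugs"          # hypothetical namespace, not from the paper
PREDICATE = "primary-physical-form"   # assumed spelling of the predicate

# Categories named in this article, plus the fallback label "unclear".
FORMS = ["powder", "pill-tablet-capsule", "liquid", "plant-matter",
         "crystal-rock", "blotter", "unclear"]


def build_taxonomy(namespace, predicate, values):
    """Assemble a dict following the general layout of a MISP taxonomy file."""
    return {
        "namespace": namespace,
        "description": "Illustrative morphology taxonomy (sketch only)",
        "version": 1,
        "predicates": [{
            "value": predicate,
            "expanded": "Primary physical form",
            "exclusive": True,  # assumed flag for mutually exclusive values
        }],
        "values": [{
            "predicate": predicate,
            "entry": [{"value": v, "expanded": v} for v in values],
        }],
    }


def machine_tags(taxonomy):
    """List the triple tags (namespace:predicate="value") the taxonomy defines."""
    tags = []
    for group in taxonomy["values"]:
        for entry in group["entry"]:
            tags.append(
                f'{taxonomy["namespace"]}:{group["predicate"]}="{entry["value"]}"'
            )
    return tags


TAXONOMY = build_taxonomy(NAMESPACE, PREDICATE, FORMS)
TAGS = machine_tags(TAXONOMY)
print(json.dumps(TAXONOMY, indent=2))
```

Keeping a single predicate with mutually exclusive values mirrors the design described in the abstract, in which each post receives one primary physical form.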

With respect to RQ2, the incorporation of the Mistral 7B model in the initial classification stage highlights the potential of LLMs as morphological detection agents in digital criminal domains. In the present corpus, the initial classification yielded 76.48% of posts directly assigned to one of the base categories, while 23.52% were labelled as unclear. These figures should be interpreted as a baseline result obtained under noisy, multilingual, and semantically non-standard conditions, rather than as a standalone performance benchmark. Because prior studies in this area often address different tasks, datasets, and evaluation settings, the present result is not directly comparable in strict quantitative terms. It is therefore more appropriate to interpret this outcome as evidence of the practical usefulness of LLMs for assisted semantic pre-classification, whilst recognising that subsequent HITL validation was required to achieve a more robust taxonomic resolution [10].

This also indicates that Mistral 7B performed adequately as a first-pass classifier, but not as a fully autonomous solution, particularly in posts affected by slang, abbreviated vendor language, or limited morphological evidence.

Although the present study relies on a single inference model, the contribution should be understood less as a claim about the superiority of one particular LLM and more as evidence that ontology-aware HITL refinement can improve semantic classification under noisy illicit-market conditions. Whether the observed ambiguity patterns and the gains associated with taxonomic extension generalise across alternative architectures and unseen data remains an open empirical question, which is identified as a priority direction for future work.

The model exhibits substantial contextual capacity, identifying semantic patterns beyond surface-level keywords and generating coherent morphological labels even in texts affected by noise or lexical ambiguity. This property supports the hypothesis advanced by authors such as Sharma et al. [12], who argue that LLMs can operate as instruments of “assisted semantic curation” within supervised classification environments. In this case, Mistral 7B functions as an interpretative component that translates the informal language of the forum into an ontologically legible space for MISP.

Regarding RQ3, the HITL component is confirmed as a crucial mechanism for reducing semantic ambiguity. Human intervention reduced the proportion of posts classified as unclear from 23.52% to 11.29%, which represents a decrease of 12.23 percentage points and a relative reduction of 51.99%. These results reinforce the value of hybrid systems that combine automated pre-classification with expert validation in taxonomy-extension tasks (Mancini et al. [6]; Abbas et al. [9]).
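The two reduction figures follow from the reported percentages by simple arithmetic. The sketch below recomputes them; the function name `ambiguity_reduction` is hypothetical. Because the inputs are the rounded percentages rather than the underlying post counts, the relative figure may differ from the reported value in the second decimal place.

```python
def ambiguity_reduction(before_pct, after_pct):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return absolute, relative


# Share of posts labelled "unclear" before and after HITL refinement.
ABS_PP, REL_PCT = ambiguity_reduction(23.52, 11.29)
print(f"absolute: {ABS_PP:.2f} pp, relative: {REL_PCT:.1f}%")
```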

The HITL process described in this study also provides a model for documentary traceability and support quantification that was absent from previous initiatives. Whereas most community taxonomies are expanded through informal contributions, this pipeline establishes explicit acceptance criteria, frequency thresholds and verifiable textual justification. In doing so, it introduces a methodological standard that can be replicated in other sensitive domains (for example, terrorism, child exploitation or ransomware ecosystems).

In relation to RQ4, the HITL-driven extension process enabled the consolidation and expansion of the taxonomic vocabulary, generating new empirically grounded categories such as plant-matter, liquid, and blotter, and merging redundant terms (pill, tablet, capsule → pill-tablet-capsule). This semantic evolution is consistent with the trend described by Zabihimayvan et al. [2], according to which the semantics of digital drug trafficking tends to hybridise the technical and the commercial, using material descriptors rather than strictly chemical ones.
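The merger of redundant terms amounts to a label-normalisation step applied before class proportions are recomputed. A minimal sketch follows, assuming labels are plain strings; the names `MERGED_FORMS`, `normalise_label` and `recount` are hypothetical, and the example labels are placeholders rather than corpus data.

```python
from collections import Counter

# Redundant oral-solid labels collapse into the merged category.
MERGED_FORMS = {
    "pill": "pill-tablet-capsule",
    "tablet": "pill-tablet-capsule",
    "capsule": "pill-tablet-capsule",
}


def normalise_label(label):
    """Map a raw label to its category in the extended taxonomy."""
    key = label.strip().lower()
    return MERGED_FORMS.get(key, key)


def recount(labels):
    """Recompute the class distribution after normalisation."""
    return Counter(normalise_label(label) for label in labels)


# Placeholder labels for illustration only.
EXAMPLE = recount(["pill", "Tablet", "capsule", "powder", "blotter"])
```

Labels that are not part of the merger, such as powder or blotter, pass through unchanged.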

The identification of these new classes reflects the dynamic nature of drug markets on the Dark Web, where language evolves in parallel with consumption and distribution practices. From an ontological perspective, the proposed extensions are not merely labels but instruments of social observation, capable of capturing how criminal actors negotiate identity, reputation and product through discourse. This finding complements sociolinguistic approaches to digital trafficking, such as those of Broseus et al. [43] and Weimann [44], which emphasise the role of language as a marker of criminal legitimacy.

For RQ5, the global reclassification using the extended taxonomy confirmed the structural stability and internal coherence of the model, as well as the practical usefulness of the HITL process at corpus level. The reduction in ambiguity and the reconfiguration of class proportions indicate an improvement in local precision without loss of global coherence. Although the pill-tablet-capsule category increased slightly after reclassification, powder remained the dominant class in the final distribution (35.88%), whilst the proportion of unclear cases declined markedly. Taken together, these results suggest that the main effect of the extension was not to replace the overall class hierarchy, but to enable a more precise redistribution of previously ambiguous or overly generic cases. Whether these gains persist on unseen data and across alternative model architectures remains an open question that is addressed in the future lines of research.

This behaviour suggests that the initial ambiguity was concentrated in posts with vocabulary related to oral solids, which reinforces the validity of the merger and the relevance of the new taxonomic structure. Methodologically, these results provide quantitative validation for the proposal of hybrid supervised learning models, such as those outlined by Abbas et al. [9], in which human curation guides semantic convergence without compromising scalability.

With respect to RQ6, analysis of the network based on 126 terms and six clusters shows that the semantic ecosystem of Dark Web drug forums is structured around the intersection between recreational substances, prescription opioids and logistical devices. The thematic segmentation highlights, on the one hand, a recreational–commercial space dominated by cannabis, cocaine, MDMA and ketamine and, on the other, a pharmaceutical–opioid space in which heroin, synthetic opioids and high-risk pharmaceuticals are offered. Both spaces are traversed by a shared logistical axis that emphasises packaging, global distribution and the geographical specialisation of vendors.

The network patterns identified are broadly consistent with structures already described in the prior cryptomarket literature. In this study, their value lies less in novelty at market level than in showing that the extended taxonomy aligns with recurrent semantic and commercial structures observed in the corpus.

The most central nodes reveal that trust and reputation are constructed discursively through repeated emphasis on purity (high_purity, pure, uncut), origin (Afghanistan, Iran, Netherlands, Germany, UK, Canada) and the promise of discreet and reliable delivery (shipping, expresspost, uk2uk, worldwide_shipping). Taken together, these results reinforce the idea that the forums analysed operate not only as illicit marketplaces but also as spaces of symbolic production, where meanings, hierarchies and criminal affiliations are negotiated through a shared semantic repertoire that integrates products, logistics and geopolitics.

8. Conclusions

This study demonstrates the feasibility of a reproducible, evidence-based taxonomic extension process for classifying drug-related content in Dark Web forums. Unlike the manual or spontaneous extensions that typically characterise community MISP taxonomies, the procedure proposed here articulates a set of methodological stages that ensure traceability, verifiability and ontological consistency.

Each step, from the selection of the ambiguous subset to the generation of textual justifications and quantitative thresholds, follows a logic of control and documentation that turns the process into a model of methodological replication for other cyber-intelligence domains.

The integration of LLMs, specifically Mistral 7B, with expert human validation (HITL) has proved to be an effective combination for semantic expansion and classification improvement. The LLM component contributes contextual detection capability and generalisation over heterogeneous corpora, while human supervision introduces criteria of rigour, coherence and ontological adequacy.

This synergy makes it possible to overcome the limitations of purely automatic systems, reducing ambiguity and ensuring that taxonomic extensions reflect both real linguistic patterns and expert domain knowledge. In terms of results, the verifiable reduction in the percentage of unclear records and the consolidation of morphological categories confirm the effectiveness of the hybrid approach. The resulting taxonomy therefore offers better semantic coverage and greater classificatory coherence than previous versions.

The extension with new categories—such as plant-matter, liquid and blotter—and the merging of redundant terms—pill, tablet and capsule—not only optimises classification accuracy, but also reflects the discursive evolution of illicit digital markets. The resulting model provides a more realistic view of criminal language and forum dynamics, in which morphological and commercial descriptions of products prevail over purely chemical denominations.

In addition, the quantitative comparison between the initial and extended classifications confirms that the modifications introduced improve granularity without distorting the structural balance of the system. Further inter-model and out-of-sample validation, as outlined in the future lines of research, would strengthen the external validity of these findings.

From a methodological perspective, the research lays the foundations for standardising future MISP extensions. The proposed pipeline—comprising automated cue detection, statistical support calculation, human review and final reclassification—can be reproduced in other cyberthreat domains, such as ransomware, malware families or phishing ecosystems.

Its added value lies in offering a controlled extension methodology, in which each new category is justified with empirical evidence, validated semantically and documented to facilitate inter-institutional interoperability. In this way, the study contributes to the development of a taxonomic governance model grounded in transparency and verifiability, aligned with the current needs of threat intelligence sharing platforms.
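The cue detection and support calculation stages of such a pipeline can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the threshold `MIN_SUPPORT`, the cue terms, the `Proposal` record and the example posts are all hypothetical, and substring matching stands in for whatever cue detection the original scripts perform.

```python
from dataclasses import dataclass

MIN_SUPPORT = 2  # hypothetical frequency threshold, not the paper's value


@dataclass
class Proposal:
    category: str
    support: int             # number of posts containing at least one cue term
    examples: list           # a few matching posts kept as textual justification
    forward_to_review: bool  # True when support reaches the threshold


def propose_category(category, cue_terms, posts, min_support=MIN_SUPPORT):
    """Count posts containing any cue term and keep up to three as evidence."""
    matches = [p for p in posts if any(t in p.lower() for t in cue_terms)]
    return Proposal(
        category=category,
        support=len(matches),
        examples=matches[:3],
        forward_to_review=len(matches) >= min_support,
    )


# Placeholder posts, invented for illustration only.
POSTS = [
    "blotter sheets in stock",
    "liquid vial, sealed",
    "tabs on blotter paper",
    "new vendor introduction",
]

BLOTTER = propose_category("blotter", ("blotter",), POSTS)
LIQUID = propose_category("liquid", ("liquid", "vial"), POSTS)
```

Only candidates that reach the threshold are forwarded to human review, and the retained examples give the reviewer the textual evidence on which the proposal rests.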

8.1. Practical Implications

From an operational standpoint, the results of this research have a direct impact on the efficiency of Dark Web intelligence detection and classification systems. The new taxonomy enables more accurate identification of the physical forms and contextual settings of substances, facilitating the creation of alerts, indicators of compromise and correlations between forums, markets and actors.

This leads to greater analytical capacity for cyber-intelligence units and law enforcement agencies, which can prioritise resources according to emerging trafficking or consumption trends inferred from language.

Moreover, the partial automation of the process reduces manual review times and makes it possible to keep MISP systems continuously updated with a controlled operational effort.

8.2. Organisational Implications

At an organisational level, the methodology proposed supports standardisation and inter-institutional cooperation.

The use of an audited and documented procedure for extending taxonomies allows different organisations—incident response centres, police units, cybercrime observatories or OSINT communities—to share and reproduce the same classificatory structure without loss of semantic coherence.

This promotes sustainable interoperability within the cyber-intelligence ecosystem, where the evolution of taxonomies no longer depends on individual actors but on verifiable and collaborative processes. In the long term, this model contributes to institutionalising a culture of structured, evidence-based knowledge management, which is essential to address the rapid mutation of criminal language and the dynamics of the illicit digital economy.

8.3. Limitations of the Study

This work is subject to several limitations arising from both its methodological design and the empirical characteristics of the corpus. Although the results support the usefulness of the taxonomic extension process and of the hybrid LLM–HITL approach, these constraints delimit the scope of the findings and should be considered when interpreting the results.

First, the corpus is restricted to a specific set of Dark Web forums and to content processed through a translation pipeline centred on English and Spanish. This improves comparability across posts, but it may also reduce the semantic fidelity of slang, abbreviations, and culturally specific expressions present in the original messages, especially in multilingual or non-Anglophone communities. The findings should therefore be interpreted as valid for the analysed corpus and related contexts, rather than as universally generalisable to all cryptomarket environments.

Second, the pipeline depends in part on the behaviour of the language model used for the initial classification stage. Although the model was run under controlled settings and its outputs were later reviewed through a HITL process, automatic classification remains sensitive to contextual ambiguity, weak morphological evidence, and possible biases inherited from the training data. In addition, because the study relies on a single non-fine-tuned LLM, model-specific effects cannot be fully ruled out, and the observed classification patterns may partially reflect characteristics of the selected model rather than only structural properties of the corpus and the proposed taxonomy.

Third, the final class distribution is uneven across categories. While this reflects the empirical structure of the corpus, it reduces the analytical robustness of minority classes such as blotter or liquid, which contain substantially fewer cases than powder or crystal-rock. Future work should therefore test the model on broader and more balanced datasets in order to assess the stability of low-frequency categories.

Fourth, an additional limitation concerns research governance. The study did not undergo a formal institutional ethics committee review prior to analysis, and this should be taken into account when assessing the overall design of the research. To mitigate risk, the study relied on data minimisation, restricted handling of raw files, non-disclosure of operational identifiers, and controlled access to derived materials. While these measures reduce potential harm, they do not substitute for a formal ethics review and therefore remain a limitation of the study.

Finally, although the workflow includes a human validation stage, external expert validation was not incorporated in the present study. The review process was conducted within the research framework itself, which ensured procedural consistency but did not provide an independent forensic or linguistic assessment of the resulting taxonomy. Future inter-institutional validation would strengthen the external credibility and transferability of the proposed model.

8.4. Future Lines of Research

On the basis of the results obtained in this study and the limitations identified, the following future lines of research are proposed for the development and consolidation of semantic classification methodologies in cyber-intelligence environments.

First, a primary avenue for progress is to broaden the taxonomic extension methodology into an inter-domain validation process that encompasses different areas of illicit activity on the Dark Web, such as the trade in weapons, falsified pharmaceuticals or restricted biological materials. Cross-domain application of the model would make it possible to test the degree of semantic transferability of the generated taxonomies and the consistency of the HITL approach in thematic domains with distinct languages, hierarchies and markets.

Second, the practical implementation of the extended taxonomy within the MISP ecosystem is proposed, moving from an experimental environment to operational integration. This would involve developing a controlled extension plugin or module that allows HITL taxonomies to be incorporated directly into active MISP instances, with automatic logging of thresholds, textual justifications and validation results. Adoption of this approach would transform the proposed model into a functional tool, enhancing MISP’s capacity to manage contextualised information on emerging threats.

Third, it is proposed to broaden the nature of the signals processed by the pipeline, integrating multimodal components that combine text, imagery and metadata. To date, the analysis has focused exclusively on textual forum content, but many drug listings include images of the product, vendor logos or publication metadata that provide additional semantic and contextual information. Integrating these elements would enable the development of a more robust model capable of performing multimodal classifications based on both linguistic descriptions and visual characteristics of substances. Likewise, the fusion of textual and visual signals would open the door to research on authenticity, visual camouflage and illicit marketing practices—areas that remain underexplored in cyber-intelligence.

Fourth, a complementary future line is to explore the incorporation of next-generation language models with enhanced reasoning and multimodal capabilities, such as GPT-5 or Gemini Ultra, in order to extend the functional scope of the pipeline rather than merely compare classification outcomes. The relevance of this direction lies in the possibility of improving the generation of explanatory justifications, enriching contextual cue interpretation, and supporting more advanced forms of semantic traceability in cyber-intelligence workflows. In this sense, the adoption of more capable models would not only serve to improve process efficiency, but also to examine whether advances in AI enable richer and more interpretable forms of ontology-oriented classification in illicit digital environments.

Fifth, a particularly relevant direction for future work is to conduct a comprehensive robustness evaluation comprising two complementary strategies: an inter-model comparison and an out-of-sample validation. In the inter-model component, a common stratified subset of the corpus would be processed by multiple instruction-tuned LLMs, including Mistral 7B, Falcon 7B, LLaMA-2 7B, Qwen 7B, GPT-5, and Gemini Ultra, under harmonised prompting conditions, in order to determine whether the concentration of ambiguity in specific semantic zones and the gains associated with the extended taxonomy are reproducible across architectures or, conversely, depend on the particular model employed. In the out-of-sample component, a held-out set of posts specifically excluded from the HITL refinement stage would be classified under both the initial and extended taxonomies, making it possible to assess whether the observed improvements in ambiguity reduction and classification quality generalise beyond the development corpus or are partly attributable to in-sample optimisation. Establishing a provisional reference standard through independent expert annotation and comparing both taxonomy versions against stratified, manually annotated subsets would further strengthen the external validity of the proposed framework and provide more definitive evidence regarding the semantic quality of the extended classification.
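Two simple quantities that such an inter-model comparison could report are the share of unclear labels per model and the pairwise label agreement on a common subset. The sketch below is only one possible formulation: the function names are hypothetical, the model names and labels are placeholders rather than results, and raw agreement is used in place of any chance-corrected statistic.

```python
from itertools import combinations


def unclear_rate(labels):
    """Fraction of posts that a model labelled as 'unclear'."""
    return sum(1 for label in labels if label == "unclear") / len(labels)


def pairwise_agreement(predictions):
    """predictions: model name -> labels for the same posts, in the same order."""
    result = {}
    for a, b in combinations(sorted(predictions), 2):
        same = sum(x == y for x, y in zip(predictions[a], predictions[b]))
        result[(a, b)] = same / len(predictions[a])
    return result


# Placeholder outputs for two hypothetical models on four posts.
PREDICTIONS = {
    "model_a": ["powder", "unclear", "liquid", "powder"],
    "model_b": ["powder", "powder", "liquid", "unclear"],
}

RATES = {name: unclear_rate(labels) for name, labels in PREDICTIONS.items()}
AGREEMENT = pairwise_agreement(PREDICTIONS)
```

Comparing which posts each model marks as unclear, and not only how many, would indicate whether ambiguity concentrates in the same semantic zones across architectures.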

Author Contributions

Conceptualization, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.-L. and L.D.-M.; methodology, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.-L. and L.D.-M.; software, A.R.-Z. and J.F.-L.; validation, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.-L. and L.D.-M.; formal analysis, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.-L. and L.D.-M.; investigation, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.-L. and L.D.-M.; resources, J.-A.M.-M.; data curation, A.R.-Z. and J.F.-L.; writing—original draft preparation, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.-L. and L.D.-M.; writing—review and editing, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.-L. and L.D.-M.; visualisation, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.-L. and L.D.-M.; supervision, J.-A.M.-M.; project administration, J.-A.M.-M.; funding acquisition, J.-A.M.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been developed within the “Recovery, Transformation and Resilience Plan”, project C084/23 Ada Byron INCIBE-UAH, funded by the European Union (Next Generation).

Data Availability Statement

For ethical, legal, and safety reasons, the raw corpus is not publicly released, because it contains archived material from illicit-platform environments. To reduce these risks, the manuscript reports only aggregate findings. Selected derived materials may be shared for academic purposes upon reasonable request, subject to case-by-case assessment.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

API	Application Programming Interface
AUPR	Area Under the Precision-Recall Curve
BTC	Bitcoin
CARIOCA	Cybersecurity Actionable Risk-Informed Operational Capability Assessment
CPU	Central Processing Unit
DDI	Drug–Drug Interaction
DTI	Drug–Target Interaction
F1	F1-score
GB	Gigabytes
GenAI	Generative Artificial Intelligence
GPU	Graphics Processing Unit
HITL	Human-in-the-Loop
JSON	JavaScript Object Notation
LLM	Large Language Model
LLMs	Large Language Models
LTC	Litecoin
MISP	Malware Information Sharing Platform
NLP	Natural Language Processing
OSINT	Open-Source Intelligence
POST	Post-reclassification distribution
PRE	Pre-reclassification distribution
RAM	Random Access Memory
ROC	Receiver Operating Characteristic
RQ	Research Question
STIX	Structured Threat Information Expression
TAXII	Trusted Automated eXchange of Intelligence Information
WARC	Web ARChive
XMR	Monero

Appendix A. Representative Prompts Used with Mistral 7B

Appendix A.1. Relevance Filtering Prompt (Used in Drugs-Base.py)

You are a specialist in narcotics intelligence and Dark Web forum analysis.

Your task is to determine whether the following post is related to drugs.

Consider as drug-related:

– illicit drugs
– narcotics
– prescription medicines used for abuse
– paraphernalia directly linked to drug consumption or distribution
– trafficking and shipping logistics clearly linked to drugs

Consider as not drug-related:

– unrelated marketplace content
– weapons, fraud, hacking, or other criminal topics without a drug link
– generic conversation with no substance-related evidence

Return only one label:

drugs
or
other

Post title: {page_title}
Post content: {content_translated}
Keywords: {keywords}
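
As an illustration of how the placeholders in the prompt above might be filled and the single-label reply checked, the following is a minimal Python sketch. The function names `build_prompt` and `parse_relevance`, the abbreviated template text, and the representation of a post as a dictionary are assumptions made for the example; they are not taken from Drugs-Base.py, and the call to the model itself is omitted.

```python
# Abbreviated tail of the relevance prompt; the full instructions are listed above.
RELEVANCE_TEMPLATE = (
    "Return only one label:\n"
    "drugs\n"
    "or\n"
    "other\n"
    "\n"
    "Post title: {page_title}\n"
    "Post content: {content_translated}\n"
    "Keywords: {keywords}\n"
)

ALLOWED_RELEVANCE = {"drugs", "other"}


def build_prompt(post):
    """Fill the three placeholders from a post dictionary (hypothetical schema)."""
    return RELEVANCE_TEMPLATE.format(
        page_title=post.get("page_title", ""),
        content_translated=post.get("content_translated", ""),
        keywords=", ".join(post.get("keywords", [])),
    )


def parse_relevance(raw_reply):
    """Return 'drugs' or 'other' if the first line of the reply is a valid label, else None."""
    lines = raw_reply.strip().splitlines()
    if not lines:
        return None
    # Tolerate surrounding whitespace, quotes and trailing punctuation.
    cleaned = lines[0].strip().strip(" .:\"'`").lower()
    return cleaned if cleaned in ALLOWED_RELEVANCE else None
```

Returning `None` for anything other than the two labels keeps malformed replies visible, so they can be retried or inspected rather than silently counted in either class.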

Appendix A.2. Initial Morphology Classification Prompt (Used in Drugs-Initial.py)

You are a narcotics specialist. Classify the primary physical form of the substance described in the text.

Allowed labels:

– pill-tablet-capsule
– powder
– crystal-rock
– unclear

Rules:

– Return exactly one label.
– Choose unclear if the text does not provide sufficient morphological evidence.
– Focus on physical form, not chemical family, quantity, price, or route of administration.

Post title: {page_title}

Post content: {content_translated}

Keywords: {keywords}
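
The rule "Return exactly one label" can be checked mechanically on the model reply. The sketch below is one possible check, not the procedure used in Drugs-Initial.py: the function name `enforce_single_label` is hypothetical, and rejecting replies that mention zero or several allowed labels is an assumption about how such cases could be handled.

```python
# The four labels allowed by the initial morphology prompt.
ALLOWED_FORMS = ("pill-tablet-capsule", "powder", "crystal-rock", "unclear")


def enforce_single_label(raw_reply):
    """Return the label if exactly one allowed label occurs in the reply, else None."""
    text = raw_reply.lower()
    found = [label for label in ALLOWED_FORMS if label in text]
    # Zero matches: the reply is off-taxonomy. Several matches: the rule is violated.
    return found[0] if len(found) == 1 else None
```

For example, `enforce_single_label("Label: Powder")` yields `"powder"`, whereas a reply naming both `powder` and `crystal-rock` is rejected and could be logged for manual inspection.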

Appendix A.3. Cue Extraction Prompt for Ambiguous Cases (Used in Subset S)

You are analysing ambiguous drug-related forum posts.

Extract the morphological cues that may indicate the physical form of the substance.

Return a JSON array with short cue terms only.

Examples: ["pill", "capsule", "crystal", "rock", "gummy"]

Do not infer chemical type.

Do not include quantities, prices, routes of administration, or vendor names.

Post content: {content_translated}

Keywords: {keywords}
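
Because the prompt above asks for a JSON array of short cue terms, the reply has to be parsed defensively before the cues are grouped. The following is a minimal sketch under stated assumptions: the function name `parse_cues` is hypothetical, and the tolerance for surrounding text and typographic quotes is an illustrative design choice rather than a description of the authors' code.

```python
import json


def parse_cues(raw_reply):
    """Extract a de-duplicated, lower-cased list of cue strings from a model reply."""
    # Typographic quotes are not valid JSON delimiters, so normalise them first.
    text = raw_reply.replace("\u201c", '"').replace("\u201d", '"')
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    cues = []
    for item in data:
        if isinstance(item, str):
            cue = item.strip().lower()
            # Keep first-seen order and drop empty or repeated cues.
            if cue and cue not in cues:
                cues.append(cue)
    return cues
```

A reply that cannot be parsed yields an empty list, which leaves the post without cues instead of introducing guessed terms.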

References

1. Cunliffe, J.; Decary-Hetu, D.; Pollak, T.A. Nonmedical prescription psychiatric drug use and the darknet: A cryptomarket analysis. Int. J. Drug Policy 2019, 73, 263–272.
2. Zabihimayvan, M.; Sadeghi, R.; Doran, D. Security, information, and structure characterization of Tor: A survey. Telecommun. Syst. 2024, 87, 239–255.
3. Joshi, P.S.; Dinesha, H.A. Study Report of Tor Antiforensic Techniques. In Cognitive Science and Technology; Springer: Singapore, 2023; Volume Part F1466, pp. 81–91.
4. Raman, R.; Nair, V.K.; Nedungadi, P.; Ray, I.; Achuthan, K. Darkweb research: Past, present, and future trends and mapping to sustainable development goals. Heliyon 2023, 9, e22269.
5. Craciunescu, N.; South, N. Cultural Politics, Reciprocal Relations, and Operational Agility in Online Drug Markets; Emerald Group Publishing Ltd.: Leeds, UK, 2023; pp. 95–107.
6. Mancini, S.; Kerry, E.; Burnstein, K.; Maynard, M. Correlating drug overdoses with dark web market activity. Issues Inf. Syst. 2024, 25, 1–14.
7. Lamy, F.R.; Daniulaityte, R.; Dudley, S. “Pressed OXY M30 Pills, Great Press, Potent, Fast Shipping!!!”: Availability of Counterfeit and Pharmaceutical Oxycodone Pills on One Major Cryptomarket. J. Psychoact. Drugs 2024, 56, 1–7.
8. Soshnikov, S.; Bekker, S.; Idrisov, B.; Vlassov, V. Association of Drugs for Sale on the Internet and Official Health Indicators: Darknet Parsing and Correlational Study. JMIR Form. Res. 2024, 8, e56006.
9. Abbas, S.; Bouazzi, I.; Sampedro, G.A.; Alsubai, S.; Almadhor, A.S.; Al Hejaili, A.; Kryvinska, N. Active-Darknet: An Iterative Learning Approach for Darknet Traffic Detection and Categorization. IEEE Access 2024, 12, 151987–151997.
10. Yang, J.; Liang, W.; Wang, X.; Li, S.; Jiang, X.; Mu, Y.; Zeng, S. DarkMor: A framework for darknet traffic detection that integrates local and spatial features. Neurocomputing 2024, 607, 128377.
11. Rodriguez-Valenzuela, A.B.; Pastrana, S.; Suarez-Tangil, G. Snorkeling in Dark Waters: A Longitudinal Surface Exploration of Unique Tor Hidden Services. IEEE Trans. Inf. Forensics Secur. 2025, 20, 5386–5395.
12. Sharma, M.; Kumar, N.; Singh, V.P.; Madan, C.; Sarowa, S. Hybrid intelligent feature selector framework for darknet traffic classification. Multimed. Tools Appl. 2023, 83, 40337–40360.
13. Barletta, C.; Di Natale, V.; Esposito, M.; Chisari, M.; Cocimano, G.; Di Mauro, L.; Salerno, M.; Sessa, F. The Rise of Fentanyl: Molecular Aspects and Forensic Investigations. Int. J. Mol. Sci. 2025, 26, 444.
14. Dalvi, A.; Thapar, R.; Singh, S. Understanding Illicit Opioid Drug References on the Dark Web: A Text Mining Approach to Public Health Analysis; CRC Press: Boca Raton, FL, USA, 2025; pp. 518–526.
15. Hubner, E.M.; Schmid, M.; Manojlović, V.; Gattringer, D.; Pferschy-Wenzig, E.M.; Kunert, O. NMR Spectroscopic Reference Data of Synthetic Cannabinoids Sold on the Internet. Magn. Reson. Chem. 2025, 63, 241–255.
16. De Nobrega, K.M.; Rutkowski, A.-F.; Saunders, C. The whole of cyber defense: Syncing practice and theory. J. Strateg. Inf. Syst. 2024, 33, 101861.
17. Rabitti, G.; Khorrami Chokami, A.; Coyle, P.; Cohen, R.D. A taxonomy of cyber risk taxonomies. Risk Anal. 2025, 45, 376–386.
18. Sasi, T.; Lashkari, A.H.; Lu, R.; Xiong, P.; Iqbal, S. A comprehensive survey on IoT attacks: Taxonomy, detection mechanisms and challenges. J. Inf. Intell. 2024, 2, 455–513.
19. CIRCL. MISP Taxonomies and Classification as Machine Tags. 2025. Available online: https://www.misp-project.org/ (accessed on 12 December 2025).
20. Delvecchio, P.; Galantucci, S.; Iannacone, A.; Pirlo, G. CARIOCA: Prioritizing the use of IoC by threats assessment shared on the MISP platform. Int. J. Inf. Secur. 2025, 24, 98.
21. OASIS Cyber Threat Intelligence (CTI) Technical Committee. STIX Version 2.1. Committee Specification 02. 2021. Available online: https://docs.oasis-open.org/cti/stix/v2.1/cs02/stix-v2.1-cs02.html (accessed on 19 July 2021).
22. OASIS Cyber Threat Intelligence (CTI) Technical Committee. TAXII Version 2.1. Committee Specification 01. 2021. Available online: https://docs.oasis-open.org/cti/taxii/v2.1/cs01/taxii-v2.1-cs01.html (accessed on 18 July 2021).
23. Mosqueira-Rey, E.; Hernández-Pereira, E.; Alonso-Ríos, D.; Bobes-Bascarán, J.; Fernández-Leal, Á. Human-in-the-loop machine learning: A state of the art. Artif. Intell. Rev. 2023, 56, 3005–3054.
24. Wu, X.; Xiao, L.; Sun, Y.; Zhang, J.; Ma, T.; He, L. A survey of human-in-the-loop for machine learning. Future Gener. Comput. Syst. 2022, 135, 364–381.
25. Wang, Z.J.; Choi, D.; Xu, S.; Yang, D. Putting Humans in the Natural Language Processing Loop: A Survey. arXiv 2021, arXiv:2103.04044.
26. Décary-Hétu, D.; Faubert, C.; Chopin, J.; Malm, A.; Ratcliffe, J.; Dupont, B. “Like aspirin for arthritis”: A qualitative study of conditional cyber-deterrence associated with police crackdowns on the dark web. Criminol. Public Policy 2023, 22, 639–664.
27. Holt, T.J.; Lee, J.R.; Griffith, E. An Assessment of Cryptomixing Services in Online Illicit Markets. J. Contemp. Crim. Justice 2023, 39, 222–238.
28. Babu, B.V.; Kiran, K.V.D. Lyrebird Green Anaconda Optimization based Bayesian Hierarchical Neural Attention Harmonic Network for Illicit Dark Web Classification. J. Trends Comput. Sci. Smart Technol. 2025, 7, 240–265.
29. Cortés, P. An Analysis of the Dispute Resolution Processes for Illicit Contracts in Dark Web Markets. Actual. Jurid. Iberoam. 2024, 21, 70–103. (In Spanish)
30. Laferrière, D.; Décary-Hétu, D. Examining the Uncharted Dark Web: Trust Signalling on Single Vendor Shops. Deviant Behav. 2023, 44, 37–56.
31. Brinck, J.; Nodeland, B.; Belshaw, S. The “Yelp-Ification” of the Dark Web: An Exploration of the Use of Consumer Feedback in Dark Web Markets. J. Contemp. Crim. Justice 2023, 39, 185–200.
32. Cascavilla, G. The Rise of Cybercrime and Cyber-Threat Intelligence: Perspectives and Challenges From Law Enforcement. IEEE Secur. Priv. 2025, 23, 17–26.
33. Pavel, T. Malicious Financial Activities in the Dark Web—Prevailing Information and Knowledge; World Scientific Publishing Co.: Singapore, 2023; pp. 145–173.
34. dos Reis, E.F.; Teytelboym, A.; ElBahrawy, A.; De Loizaga, I.; Baronchelli, A. Identifying key players in dark web marketplaces through Bitcoin transaction networks. Sci. Rep. 2024, 14, 2385.
35. Munksgaard, R. Building a case for trust: Reputation, institutional regulation and social ties in online drug markets. Glob. Crime 2023, 24, 49–72.
36. Catalani, V.; Townshend, H.D.; Prilutskaya, M.; Roman-Urrestarazu, A.; van Kessel, R.; Chilcott, R.P.; Banayoti, H.; McSweeney, T.; Corazza, O. Profiling the vendors of COVID-19 related product on the Darknet: An observational study. Emerg. Trends Drugs Addict. Health 2023, 3, 100051.
37. Andrei, F.; Aziani, A. Cocaine Declared Purity, Perceived Quality, Sales, and Revenues on the Darknet. Deviant Behav. 2025, 1–19.
38. Siuda, P.; Aaltonen, M.; Haasio, A.; Bancroft, A.; Nurmi, J.; Shi, H.; Harviainen, J.T. Digital drug trading ecologies in context: Technological, geographic, and linguistic variation across darknet platforms. Int. J. Drug Policy 2025, 145, 104984.
39. Warren, I.J.; Ryan, E. Drugs and the Dark Web: The Americanisation of Policing and Online Criminal Law From an Australian Perspective. In Digital Transformations of Illicit Drug Markets: Reconfiguration and Continuity; Emerald Publishing Limited: Leeds, UK, 2023; pp. 45–57.
40. Woodward, C.A.; Issa, F.S.; Caneva, D.C.; Voskanyan, A.; Gadhia, R.A.; Hart, A.; Hertelendy, A.J.; DiGregorio, D.A.; Ciottone, R.G.; Ciottone, G.R. Combating the Opioid Crisis and Its National Security Threat Through CReDO: A Multidisciplinary Solution with Disaster Medicine Implications. Disaster Med. Public Health Prep. 2023, 17, e509.
41. Wang, H.; Cui, Z.; Yang, Y.; Wang, B.; Zhu, L.; Zhang, W. A Network Enhancement Method to Identify Spurious Drug-Drug Interactions. IEEE/ACM Trans. Comput. Biol. Bioinform. 2024, 21, 1335–1347.
42. Wang, H.; Liu, R.; Wang, B.; Hong, Y.; Cui, Z.; Ni, Q. Multitype Perception Method for Drug-Target Interaction Prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 3489–3498.
43. Broseus, J.; Rhumorbarbe, D.; Morelato, M.; Staehli, L.; Rossy, Q. A geographical analysis of trafficking on a popular darknet market. Forensic Sci. Int. 2017, 277, 88–102.
44. Weimann, G. Terrorist Migration to the Dark Web. Perspect. Terror. 2016, 10, 40–44.

Figure 1. Bubble diagram of the extracted keywords. The size of each bubble is directly proportional to the number of occurrences of the word.

Figure 2. Distribution of values with the initial classifier by number of posts (bar chart).

Figure 3. PRE (initial classification) vs. POST (final classification) comparison by number of posts.

Figure 4. Top 12 PRE (initial classification) → POST (final classification) flows by number of posts. The diagonal dominates, indicating stability, while transitions from the label unclear towards more specific categories capture an important share of the reassignment of previously unresolved cases.

Figure 5. High-resolution VOSviewer network of the six thematic clusters. Labels were resized and the figure was re-exported at higher resolution to improve readability of dense co-occurrence areas.

Table 1. Comparative overview of prior studies and methodological differences with the present work.

Study	Data Source/Dataset	Main Task	Reported Metric(s)	Main Limitation	Difference from This Work
[9]	Dark Web traffic or illicit activity detection datasets	Illicit activity detection	Accuracy/Precision/Recall/F1	Focused on technical/network indicators, not forum semantics	Our study focuses on semantic classification of drug-forum posts
[10]	Network traffic/cybercrime monitoring data	Automated detection of suspicious activity	Accuracy/F1-score	Binary or traffic-level classification; no domain taxonomy extension	Our work extends a domain-specific MISP taxonomy
[11]	Forum or textual criminal-language corpora	NLP-based content analysis	Classification performance measures	Lacks MISP-oriented taxonomic validation	Our work integrates MISP compatibility and HITL review
[23,24,25]	Human-reviewed AI/NLP classification settings	HITL validation and ambiguity reduction	Task-dependent agreement/performance indicators	Not focused on Dark Web drug taxonomies	Our study applies HITL specifically to Dark Web drug discourse
[37,38]	Cryptomarkets and AlphaBay listings	Platform ecologies	Purity as a market signal	No MISP-oriented taxonomy is developed	No morphological classification of forum posts is performed
[41,42]	Biomedical datasets; heterogeneous networks	DTI and DDI prediction	AUPR, ROC, or classification metrics	They do not analyse darknet discourse	–
This work	11,101 extracted posts; 6456 drug-related posts after cleaning	Taxonomy extension and semantic classification by primary physical form	Ambiguity reduction, class redistribution, taxonomic interpretability	Domain-specific validation is still bounded to analysed corpus	Combines LLM + HITL + MISP extension in a reproducible framework

Table 2. Comparison of main characteristics between the proposed taxonomy and the existing MISP drug taxonomy.

Dimension	Existing MISP “Drugs”	This Work (Primary Physical Form)
Focus	Chemical superclasses derived from databases such as DrugBank	Morphology of presentation and packaging of the substance
Main predicate	Substance chemistry	Primary physical form (oral solid, edible solid, solid extract, etc.)
Input evidence	Pre-existing chemical taxonomies	Titles, keywords and morphological cues from forum posts
Unit of computation	Isolated chemical categories	Row deduplicated by taxonomic family (without inflation due to synonyms)
Extension	Manual curation without an LLM protocol	HITL pipeline with LLM, verifiable thresholds and reproducible reporting
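
To make the "main predicate" row concrete, the sketch below shows how the extended set of physical-form values could be serialised in the general JSON layout used by MISP taxonomies (`namespace`, `predicates`, `values`) and rendered as machine tags of the form `namespace:predicate="value"`. The namespace `dark-web-drugs`, the predicate name `primary-physical-form`, the description text and the version number are hypothetical; the seven values are those of the final classification (v2). The `exclusive` flag is included on the assumption that it is the appropriate way to mark mutually exclusive values.

```python
import json

NAMESPACE = "dark-web-drugs"          # hypothetical namespace, not defined in the article
PREDICATE = "primary-physical-form"   # hypothetical predicate identifier

# Values of the extended taxonomy (final classification, v2).
FORMS = [
    "powder", "crystal-rock", "pill-tablet-capsule", "unclear",
    "plant-matter", "liquid", "blotter",
]


def build_taxonomy():
    """Assemble a taxonomy dictionary in the MISP machine-tag JSON layout."""
    return {
        "namespace": NAMESPACE,
        "description": "Illustrative morphology-based extension (sketch only).",
        "version": 2,
        "predicates": [
            # 'exclusive' marks the values of this predicate as mutually exclusive.
            {"value": PREDICATE, "expanded": "Primary physical form", "exclusive": True}
        ],
        "values": [
            {
                "predicate": PREDICATE,
                "entry": [{"value": f, "expanded": f.replace("-", " ")} for f in FORMS],
            }
        ],
    }


def machine_tag(form):
    """Render one value as a triple machine tag: namespace:predicate="value"."""
    if form not in FORMS:
        raise ValueError(f"unknown physical form: {form}")
    return f'{NAMESPACE}:{PREDICATE}="{form}"'


if __name__ == "__main__":
    print(json.dumps(build_taxonomy(), indent=2))
    print(machine_tag("plant-matter"))
```

Keeping the values in a single list means that the taxonomy file and the tags attached to posts cannot drift apart when a new category is added.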

Table 3. Number of unique and duplicated posts, filtered on the basis of the content_translated column.

Label	Number of Records	Percentage (%) *
Unique posts	9360	84.32%
Duplicated posts	1741	15.68%

* Percentages calculated relative to the total number of records in the dataset.
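
The split reported in Table 3 can be reproduced in outline as follows. This is a minimal sketch that assumes exact string matching on the `content_translated` field and keeps the first occurrence of each text; whether the original pipeline applied any normalisation before comparing is not stated, and the function names are hypothetical.

```python
def split_duplicates(posts, key="content_translated"):
    """Separate posts into first occurrences and later repeats of the same text."""
    seen = set()
    unique, duplicated = [], []
    for post in posts:
        text = post[key]
        if text in seen:
            duplicated.append(post)
        else:
            seen.add(text)
            unique.append(post)
    return unique, duplicated


def percentage(part, total):
    """Share of `part` in `total`, in percent, rounded to two decimals."""
    return round(100 * part / total, 2)


if __name__ == "__main__":
    # Toy records, not taken from the corpus.
    toy = [
        {"content_translated": "sample text a"},
        {"content_translated": "sample text b"},
        {"content_translated": "sample text a"},
    ]
    unique, duplicated = split_duplicates(toy)
    print(len(unique), len(duplicated))
    # Percentages of Table 3, recomputed from the reported counts.
    print(percentage(9360, 9360 + 1741), percentage(1741, 9360 + 1741))
```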

Table 4. Final relevance-filtering results: drug-related posts retained for analysis vs. excluded non-drug-related posts.

Label	Number of Records	Percentage (%) *
Drug-related (“drugs”)	6456	68.97%
Not drug-related (“other”)	2904	31.03%

* Percentages calculated relative to the total number of unique posts.

Table 5. Distribution of values with the initial classifier.

Primary Physical Form	Count	Percentage (%) *
powder	1143	39.36%
unclear	683	23.52%
crystal-rock	607	20.90%
pill-tablet-capsule	471	16.22%

* Percentages calculated relative to the 2904 posts included in the morphology-classification subset.

Table 6. Manual review subset and reviewer agreement in the HITL validation stage.

Component	Description
Number of reviewers	2
Reviewer profiles	Cyber-intelligence/digital forensics; computational linguistics/NLP
Review mode	Independent double review
Unit of agreement	Final taxonomic decision per post/cue family
Raw agreement	89.3%
Cohen’s kappa	0.82
Conflict resolution	Consensus discussion using predefined criteria
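
For readers who wish to recompute the agreement statistics, Cohen's kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each reviewer's label frequencies. The sketch below implements this definition for two lists of labels; the function names and the toy labels are illustrative and do not reproduce the reviewers' actual decisions. Rearranging the formula, the reported pair (raw agreement 89.3%, κ = 0.82) corresponds to a chance agreement of roughly 0.41.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two reviewers' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[lab] * count_b[lab] for lab in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1:
        return 1.0  # both reviewers used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


def implied_chance_agreement(p_o, kappa):
    """Solve kappa = (p_o - p_e) / (1 - p_e) for p_e."""
    return (p_o - kappa) / (1 - kappa)


if __name__ == "__main__":
    # Toy labels only; not the study data.
    reviewer_1 = ["liquid", "liquid", "powder", "unclear"]
    reviewer_2 = ["liquid", "powder", "powder", "unclear"]
    print(cohens_kappa(reviewer_1, reviewer_2))
    print(implied_chance_agreement(0.893, 0.82))
```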

Table 7. Distribution of validation outcomes in the reviewed subset.

Validation Outcome	Count	% of Subset S	Interpretation
Reassigned to existing categories	214	31.3%	Probable model under-classification within base taxonomy
Supported new categories	141	20.6%	Evidence of taxonomic gaps
Remained unclear/insufficient evidence	328	48.0%	No reliable reassignment or extension support

Table 8. New proposed categories and their corresponding cue families.

New Category	Cue Family	Support (%)	Decision
Liquid	Liquid_like	6.68%	Add new value
Plant-Matter	Plant_like	6.68%	Add new value
Blotter	Blotter_like	2.41%	Add new value

Table 9. Candidate terms that do not provide ontological separation, have support below the established thresholds, or both.

Candidate	Support (%)	Decision *
tablet	0.93	Alias within oral solid
oil/syrup	0.74	Alias under liquid
film	0.56	Conditional alias (blotter/oral solid depending on context)

* Alias indicates that the value should be redirected, while conditional alias specifies redirection depending on context.

Table 10. Initial class distribution (initial classification, v1).

Primary Physical Form	Count	Percentage (%) *
powder	1143	39.36%
unclear	683	23.52%
crystal-rock	607	20.90%
pill-tablet-capsule	471	16.22%

* Percentages calculated relative to the 2904 posts included in the morphology-classification subset (initial classification, v1).

Table 11. New class distribution (final classification, v2).

Primary Physical Form	Count	Percentage of Total (%) *
powder	1042	35.88%
crystal-rock	672	23.14%
pill-tablet-capsule	489	16.84%
unclear	328	11.29%
plant-matter	239	8.23%
liquid	95	3.27%
blotter	39	1.34%

* Percentages calculated relative to the 2904 posts included in the morphology-classification subset (final classification, v2).

Table 12. Comparison of percentage classifications in the main categories.

Form	v1 (%)	v2 (%)	Δ (%) *
unclear	23.52%	11.29%	−12.23
powder	39.36%	35.88%	−3.48
crystal-rock	20.90%	23.14%	+2.24
pill-tablet-capsule	16.22%	16.84%	+0.62
plant-matter	–	8.23%	+8.23
liquid	–	3.27%	+3.27
blotter	–	1.34%	+1.34

* Change in percentage points between the initial (v1) and final (v2) classifications.
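
The figures in Table 12 can be recomputed from the class counts of the initial and final distributions, as in the sketch below. It assumes that the Δ column is the difference between the rounded v2 and v1 percentages, which matches the tabulated values; the function names are illustrative. The relative reduction in the unclear class is computed from the counts (683 to 328) and comes out at roughly 52%; the second decimal depends on whether rounded or unrounded shares are used.

```python
TOTAL = 2904  # posts in the morphology-classification subset

V1 = {"powder": 1143, "unclear": 683, "crystal-rock": 607, "pill-tablet-capsule": 471}
V2 = {
    "powder": 1042, "crystal-rock": 672, "pill-tablet-capsule": 489, "unclear": 328,
    "plant-matter": 239, "liquid": 95, "blotter": 39,
}


def share(count):
    """Percentage of the subset, rounded to two decimals as in the tables."""
    return round(100 * count / TOTAL, 2)


def delta_pp(label):
    """Change in percentage points between v1 and v2 (absent classes count as 0)."""
    return round(share(V2.get(label, 0)) - share(V1.get(label, 0)), 2)


def relative_reduction(before, after):
    """Relative reduction in percent, computed from raw counts."""
    return 100 * (before - after) / before


if __name__ == "__main__":
    for label in V2:
        print(label, share(V2[label]), delta_pp(label))
    print(round(relative_reduction(V1["unclear"], V2["unclear"]), 2))
```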

Table 13. Sensitivity check for minimum-occurrence threshold in VOSviewer.

Threshold	Nodes Retained	Main Effect
5	184	Higher coverage, but excessive overlap and lexical noise
8	127	Best balance between readability and thematic diversity
12	89	Cleaner map, but loss of specialised terms
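
The effect of the minimum-occurrence threshold can be illustrated outside VOSviewer with a few lines of Python. The sketch below is a stand-in for the filtering step, not VOSviewer's implementation: it assumes that each term is counted once per post, and the toy keyword lists are invented for the example.

```python
from collections import Counter


def retained_terms(posts_keywords, min_occurrences):
    """Return the terms that occur in at least `min_occurrences` posts."""
    # set(...) counts a term once per post, even if it is repeated within the post.
    counts = Counter(term for keywords in posts_keywords for term in set(keywords))
    return {term for term, n in counts.items() if n >= min_occurrences}


if __name__ == "__main__":
    # Toy keyword lists, not taken from the corpus.
    posts = [
        ["pure", "shipping"],
        ["pure", "shards"],
        ["pure", "shipping", "pills"],
    ]
    for threshold in (1, 2, 3):
        print(threshold, sorted(retained_terms(posts, threshold)))
```

Raising the threshold can only shrink the retained set, which is the trade-off summarised in Table 13: fewer nodes give a cleaner map at the cost of rarer, more specialised terms.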

Table 14. Relationship between VOSviewer clusters and the extended morphology-based taxonomy.

Cluster	Main Semantic Focus	Representative Terms	Linked Taxonomy Categories	Contribution to Taxonomic Interpretation
C1	Packaging and distribution	pack, packing, pills, delivery, worldwide_shipping	pill-tablet-capsule, blotter	Shows that oral solid and blotter references are embedded in commercial/logistical discourse rather than only chemical naming.
C2	Recreational cannabis/cocaine market	weed, pot, cocaine, 1g	plant-matter, powder	Supports the contextual distinctiveness of plant-based and powdered forms.
C3	Bulk synthetics and counterfeiting	bulk, ketamine_shards, methamphetamine	crystal-rock, powder	Reinforces the separation between granular/crystalline and powdered forms in wholesale discourse.
C4	Ketamine variants and import circuits	shard, shards, racemic_rocks, s-ketamine	crystal-rock	Provides contextual support for the internal coherence of the crystal-rock class.
C5	Opioids and pharmaceutical forms	oxycodone, heroin, fentanyl, tablets	pill-tablet-capsule, powder, liquid	Shows overlap between pharmaceutical branding and morphology-based classification.
C6	Purity, trust and shipping cues	pure, uncut, shipping, expresspost	cross-category contextual layer	Indicates that some high-frequency nodes operate as market qualifiers rather than taxonomic markers.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Medina-Merodio, J.-A.; Ferrer-Oliva, M.; Ruiz-Zambrano, A.; Fernández-López, J.; De-Marcos, L. Extending MISP Taxonomies for Drug-Related Forum Classification on the Dark Web: A Human-in-the-Loop and LLM-Based Approach. Future Internet 2026, 18, 228. https://doi.org/10.3390/fi18050228

